
📊 Test Results Dashboard

Interactive dashboard for visualizing OpenCode agent test results.

⚡ Quick Reference

# Run tests
cd evals/framework && npm run eval:sdk -- --agent=opencoder

# View dashboard (auto-opens browser, auto-shuts down)
cd evals/results && ./serve.sh

That's it! 🎉

Quick Start

Run Tests:

cd evals/framework
npm run eval:sdk -- --agent=opencoder
npm run eval:sdk -- --agent=openagent

View Dashboard:

Option A: One-Command Solution (Easiest) ⭐

   cd evals/results
   ./serve.sh

Auto-opens browser
Loads dashboard
Auto-shuts down after 15 seconds
Dashboard stays cached in browser!

Custom timeout:

   ./serve.sh 8000 30  # Port 8000, 30 second timeout

Option B: Keep Server Running

   cd evals/results
   python3 -m http.server 8000

Press Ctrl+C to stop manually

Option C: Direct File Access

   open evals/results/index.html  # macOS; on Linux use xdg-open

⚠️ Note: Some browsers block pages opened via file:// from loading JSON. If the dashboard shows an error or no data, use Option A or B.

Features

📈 Overview Stats

Total Tests - Count across all agents
Pass Rate - Percentage of passing tests
Failed Tests - Number of failures
Avg Duration - Average test execution time
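
As a rough illustration of how these four numbers relate to the "tests" array of a result file, here is a minimal Python sketch. The function name overview_stats and the second sample record are hypothetical, and the dashboard itself computes its stats in JavaScript inside index.html, possibly in a different way.

```python
from statistics import mean


def overview_stats(tests):
    """Summarize a list of test records shaped like the entries in "tests"."""
    total = len(tests)
    passed = sum(1 for t in tests if t["passed"])
    return {
        "total": total,
        "failed": total - passed,
        # Guard against division by zero when no tests were loaded.
        "pass_rate": passed / total if total else 0.0,
        "avg_duration_ms": mean(t["duration_ms"] for t in tests) if total else 0.0,
    }


# Sample data: the second record is hypothetical, for illustration only.
sample = [
    {"id": "task-simple-001", "passed": True, "duration_ms": 4200},
    {"id": "task-simple-002", "passed": False, "duration_ms": 3800},
]
stats = overview_stats(sample)
print(stats)
```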

📊 Trend Chart

Visual representation of pass rate over time
Shows last 30 days of test runs
Helps identify regressions
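
One way to picture the data behind such a chart is a pass rate per calendar day over a 30-day window. The sketch below assumes each run has the "meta" and "summary" shape shown under JSON Format; the helper name daily_pass_rates and the aggregation by day are assumptions, not a description of what index.html actually plots.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def daily_pass_rates(runs, now, days=30):
    """Return {ISO date: pass rate} for runs within the last `days` days."""
    cutoff = now - timedelta(days=days)
    passed, total = defaultdict(int), defaultdict(int)
    for run in runs:
        # Timestamps end in "Z"; rewrite it so fromisoformat() accepts it.
        ts = datetime.fromisoformat(run["meta"]["timestamp"].replace("Z", "+00:00"))
        if not (cutoff <= ts <= now):
            continue
        day = ts.date().isoformat()
        passed[day] += run["summary"]["passed"]
        total[day] += run["summary"]["total"]
    return {day: passed[day] / total[day] for day in sorted(total) if total[day]}


# The second and third runs are hypothetical sample data.
runs = [
    {"meta": {"timestamp": "2025-11-26T11:59:36.365Z"}, "summary": {"passed": 6, "total": 8}},
    {"meta": {"timestamp": "2025-11-26T11:57:59.000Z"}, "summary": {"passed": 8, "total": 8}},
    {"meta": {"timestamp": "2025-09-01T09:00:00.000Z"}, "summary": {"passed": 1, "total": 8}},
]
now = datetime(2025, 11, 27, tzinfo=timezone.utc)
rates = daily_pass_rates(runs, now)
print(rates)
```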

🔍 Filters

Agent - Filter by openagent, opencoder, etc.
Category - Developer, business, creative, edge-case
Status - All, passed only, or failed only
Time Range - Latest, today, last 7 days, last 30 days

🔎 Search

Real-time search across test IDs
Case-insensitive matching
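
The filters and the search box can be thought of as a single predicate applied to every test row. The following Python sketch shows that idea; the function name matches is hypothetical, and it assumes each test record has been given an "agent" field (in the result files the agent lives under "meta"). The time range filter is left out for brevity.

```python
def matches(test, agent=None, category=None, status="all", query=""):
    """Return True if a test record passes every active filter."""
    if agent is not None and test.get("agent") != agent:
        return False
    if category is not None and test["category"] != category:
        return False
    if status == "passed" and not test["passed"]:
        return False
    if status == "failed" and test["passed"]:
        return False
    # Case-insensitive substring match on the test ID; "" matches everything.
    return query.lower() in test["id"].lower()


# The second record is hypothetical sample data.
tests = [
    {"id": "task-simple-001", "agent": "openagent", "category": "developer", "passed": True},
    {"id": "Edge-Timeout-002", "agent": "opencoder", "category": "edge-case", "passed": False},
]
failed = [t["id"] for t in tests if matches(t, status="failed")]
found = [t["id"] for t in tests if matches(t, query="SIMPLE")]
print(failed, found)
```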

📋 Test Table

Sortable Columns - Click any header to sort
Expandable Rows - Click a row to see details
Violation Details - See error messages and severity

🌙 Dark Mode

Toggle with moon/sun icon in header
Preference saved to localStorage
Easy on the eyes for long sessions

📥 Export

Export filtered results to CSV
Includes all test metadata
Perfect for external analysis
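
For readers who want a similar export outside the browser, the sketch below turns test records into CSV text with Python's csv module. The column list, the violations_total column, and the function name to_csv are assumptions; the columns the dashboard actually exports may differ.

```python
import csv
import io

# Assumed column set, taken from the per-test fields shown under JSON Format.
FIELDS = ["id", "category", "passed", "duration_ms", "events", "approvals"]


def to_csv(tests):
    """Serialize test records to CSV text, flattening the violations total."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS + ["violations_total"], lineterminator="\n"
    )
    writer.writeheader()
    for t in tests:
        row = {name: t.get(name, "") for name in FIELDS}
        row["violations_total"] = t.get("violations", {}).get("total", 0)
        writer.writerow(row)
    return buf.getvalue()


tests = [
    {
        "id": "task-simple-001",
        "category": "developer",
        "passed": True,
        "duration_ms": 4200,
        "events": 23,
        "approvals": 2,
        "violations": {"total": 0, "errors": 0, "warnings": 0},
    }
]
csv_text = to_csv(tests)
print(csv_text)
```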

File Structure

results/
├── index.html              # Dashboard (open this)
├── serve.sh                # Helper script to start HTTP server
├── latest.json             # Most recent test run
├── history/
│   └── 2025-11/
│       ├── 26-115759-opencoder.json
│       └── 26-115850-openagent.json
├── .gitignore              # Retention policy
└── README.md               # This file
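
Judging only from the two example names above, history files appear to follow a YYYY-MM/DD-HHMMSS-agent.json pattern. That pattern is an inference, not a documented contract. Under that assumption, a path can be split into a timestamp and an agent name as in this Python sketch (parse_history_path is a hypothetical helper):

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Assumed file name layout: DD-HHMMSS-agent.json inside a YYYY-MM directory.
_NAME = re.compile(r"^(\d{2})-(\d{2})(\d{2})(\d{2})-(.+)\.json$")


def parse_history_path(path):
    """Return (run datetime, agent name) for a history file path."""
    p = PurePosixPath(path)
    m = _NAME.match(p.name)
    if m is None:
        raise ValueError(f"unrecognized history file name: {p.name}")
    day, hour, minute, second, agent = m.groups()
    year, month = p.parent.name.split("-")
    when = datetime(int(year), int(month), int(day), int(hour), int(minute), int(second))
    return when, agent


when, agent = parse_history_path("history/2025-11/26-115759-opencoder.json")
print(when.isoformat(), agent)
```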

JSON Format

Each result file contains:

{
  "meta": {
    "timestamp": "2025-11-26T11:59:36.365Z",
    "agent": "openagent",
    "model": "opencode/grok-code-fast",
    "framework_version": "0.1.0",
    "git_commit": "f872007"
  },
  "summary": {
    "total": 8,
    "passed": 6,
    "failed": 2,
    "duration_ms": 32450,
    "pass_rate": 0.75
  },
  "by_category": {
    "developer": { "passed": 5, "total": 6 },
    "business": { "passed": 1, "total": 1 },
    "edge-case": { "passed": 0, "total": 1 }
  },
  "tests": [
    {
      "id": "task-simple-001",
      "category": "developer",
      "passed": true,
      "duration_ms": 4200,
      "events": 23,
      "approvals": 2,
      "violations": {
        "total": 0,
        "errors": 0,
        "warnings": 0
      }
    }
  ]
}
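
The summary fields are redundant with each other, so a result file can be sanity-checked before it is trusted. The Python sketch below checks the relationships visible in the example above (passed + failed = total, pass_rate = passed / total, category counts summing to the summary). The function name check_result is hypothetical, and the framework may not enforce these invariants itself.

```python
def check_result(result):
    """Return a list of inconsistencies found in a result dict (empty if none)."""
    problems = []
    s = result["summary"]
    if s["passed"] + s["failed"] != s["total"]:
        problems.append("passed + failed != total")
    if s["total"] and abs(s["pass_rate"] - s["passed"] / s["total"]) > 1e-9:
        problems.append("pass_rate does not match passed / total")
    categories = list(result.get("by_category", {}).values())
    if categories:
        if sum(c["total"] for c in categories) != s["total"]:
            problems.append("by_category totals do not sum to summary total")
        if sum(c["passed"] for c in categories) != s["passed"]:
            problems.append("by_category passed counts do not sum to summary passed")
    return problems


# Values copied from the example result file above.
example = {
    "summary": {"total": 8, "passed": 6, "failed": 2, "duration_ms": 32450, "pass_rate": 0.75},
    "by_category": {
        "developer": {"passed": 5, "total": 6},
        "business": {"passed": 1, "total": 1},
        "edge-case": {"passed": 0, "total": 1},
    },
}
problems = check_result(example)
print(problems)
```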

Retention Policy

Results are automatically managed:

✅ Latest Run - Always kept (latest.json)
✅ Current Month - All results committed to git
✅ Previous Month - All results committed to git
❌ Older than 60 days - Kept locally, not committed

This keeps the repo size manageable while preserving recent history.
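
Read literally, the policy commits results from the current and the previous calendar month. The Python sketch below encodes that one reading; the function name is_committed is hypothetical, and the real rules live in .gitignore and may be expressed differently (for example, as a 60-day cutoff).

```python
from datetime import date


def is_committed(result_date, today):
    """True if result_date falls in today's calendar month or the one before."""
    months_apart = (today.year - result_date.year) * 12 + (today.month - result_date.month)
    return 0 <= months_apart <= 1


print(is_committed(date(2025, 11, 26), date(2025, 12, 3)))
print(is_committed(date(2025, 10, 1), date(2025, 12, 3)))
```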

Tips

Quick View Workflow

The fastest way to view results:

cd evals/results && ./serve.sh

✅ Opens browser automatically
✅ Loads all data
✅ Shuts down after 15 seconds
✅ Dashboard stays functional (data cached)
✅ No manual cleanup needed

Want to keep exploring? Press Ctrl+C during countdown to keep server running.

Comparing Agents

Set Time Range to "Latest Run"
Set Agent to "All Agents"
Compare pass rates and durations
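
The same comparison can be done offline from the result files. This Python sketch assumes one result dict per agent in the format shown under JSON Format; compare_agents is a hypothetical helper, and the opencoder figures are made-up sample values.

```python
def compare_agents(runs):
    """Return (agent, pass_rate, duration_ms) rows, best pass rate first."""
    rows = [
        (r["meta"]["agent"], r["summary"]["pass_rate"], r["summary"]["duration_ms"])
        for r in runs
    ]
    # Higher pass rate wins; ties are broken by the shorter duration.
    return sorted(rows, key=lambda row: (-row[1], row[2]))


runs = [
    {"meta": {"agent": "openagent"}, "summary": {"pass_rate": 0.75, "duration_ms": 32450}},
    # Hypothetical numbers for illustration only.
    {"meta": {"agent": "opencoder"}, "summary": {"pass_rate": 0.875, "duration_ms": 28000}},
]
ranking = compare_agents(runs)
for agent, rate, duration in ranking:
    print(f"{agent}: {rate:.1%} in {duration} ms")
```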

Finding Flaky Tests

Set Time Range to "Last 30 Days"
Look for tests that alternate between pass/fail
Check violation details for patterns
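
"Alternating between pass and fail" can be made concrete by counting how often a test's outcome flips between consecutive runs. The Python sketch below does that; find_flaky, the threshold of two flips, and the sample IDs are all assumptions, and in practice runs would be grouped per agent first.

```python
def find_flaky(runs, min_flips=2):
    """Return {test id: flip count} for tests whose outcome flips at least min_flips times.

    `runs` must be ordered oldest to newest.
    """
    outcomes = {}
    for run in runs:
        for t in run["tests"]:
            outcomes.setdefault(t["id"], []).append(t["passed"])
    flaky = {}
    for test_id, history in outcomes.items():
        # A flip is any pair of consecutive runs with different outcomes.
        flips = sum(1 for a, b in zip(history, history[1:]) if a != b)
        if flips >= min_flips:
            flaky[test_id] = flips
    return flaky


# Hypothetical history: "a" alternates, "b" is stable, "c" regresses once.
runs = [
    {"tests": [{"id": "a", "passed": True}, {"id": "b", "passed": True}, {"id": "c", "passed": True}]},
    {"tests": [{"id": "a", "passed": False}, {"id": "b", "passed": True}, {"id": "c", "passed": False}]},
    {"tests": [{"id": "a", "passed": True}, {"id": "b", "passed": True}, {"id": "c", "passed": False}]},
]
flaky = find_flaky(runs)
print(flaky)
```

A single flip (test "c") looks like a regression rather than flakiness, which is why the default threshold is two.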

Tracking Improvements

Run tests regularly (daily/weekly)
Watch the trend chart for improvements
Export CSV for deeper analysis

Debugging Failures

Filter Status to "Failed Only"
Click on a failed test row
Review violation details
Check error messages and severity
Re-run the affected agent to confirm whether the failure reproduces

Browser Compatibility

✅ Chrome/Edge (recommended)
✅ Firefox
✅ Safari
❌ IE11 (not supported)

Performance

Dashboard Size: ~31KB (no dependencies except Chart.js CDN)
Load Time: < 1 second for 100 tests
Memory: Minimal (pure JavaScript, no frameworks)

How It Works

Auto-Shutdown Feature

The serve.sh script:

Starts HTTP server on port 8000
Opens dashboard in your browser
Waits 15 seconds for data to load
Shuts down server automatically
Dashboard continues working (data cached in browser)

Why does it still work after shutdown?

The open page keeps the already-loaded JSON data in memory
All filtering/sorting happens in JavaScript
No server needed after initial load
Refresh the page to load new data (server will need to restart)
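
The start, open, wait, shut down sequence described above can be sketched in Python with the standard library. This is not the contents of serve.sh (a shell script that is not reproduced here); serve_briefly and its parameters are hypothetical, and only the default port 8000 and the 15 second timeout come from this README.

```python
import functools
import threading
import time
import webbrowser
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_briefly(directory=".", port=8000, timeout=15, open_browser=True):
    """Serve `directory` over HTTP, then shut down after `timeout` seconds.

    Returns the port that was actually bound (useful when port=0).
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    bound_port = server.server_address[1]
    # Run the server in a background thread so this thread can do the countdown.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        if open_browser:
            webbrowser.open(f"http://127.0.0.1:{bound_port}/index.html")
        time.sleep(timeout)  # give the page time to fetch its JSON
    finally:
        server.shutdown()      # stops serve_forever()
        server.server_close()  # releases the listening socket
        thread.join()
    return bound_port


# Example call (commented out so that importing this sketch has no side effects):
# serve_briefly("evals/results", port=8000, timeout=15)
```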

Stopping Manually

If you start the server manually:

# Find the process
lsof -ti:8000

# Kill it
kill $(lsof -ti:8000)

Or just press Ctrl+C in the terminal.

Troubleshooting

Dashboard shows "No results found"

Run tests first: cd evals/framework && npm run eval:sdk -- --agent=opencoder
Check that latest.json exists
Refresh the page
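
The first two checks can be automated. The Python sketch below reports whether latest.json is missing, unparseable, or empty; diagnose_latest and its messages are hypothetical, and the demo runs against a temporary directory rather than the real evals/results.

```python
import json
import tempfile
from pathlib import Path


def diagnose_latest(results_dir):
    """Return "ok" or a short description of what is wrong with latest.json."""
    path = Path(results_dir) / "latest.json"
    if not path.exists():
        return "latest.json is missing: run the tests first"
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return f"latest.json is not valid JSON: {exc}"
    if not isinstance(data, dict) or not data.get("tests"):
        return "latest.json contains no tests"
    return "ok"


# Demo in a throwaway directory so no real results are touched.
with tempfile.TemporaryDirectory() as tmp:
    missing = diagnose_latest(tmp)
    (Path(tmp) / "latest.json").write_text(
        '{"tests": [{"id": "task-simple-001"}]}', encoding="utf-8"
    )
    present = diagnose_latest(tmp)
print(missing)
print(present)
```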

Chart not displaying

Check browser console for errors
Ensure Chart.js CDN is accessible
Try refreshing the page

Dark mode not persisting

Check browser localStorage is enabled
Clear cache and try again

Future Enhancements

Potential improvements:

Historical comparison (compare two runs)
Test duration trends per test
Violation type breakdown chart
Agent performance comparison chart
Auto-refresh option
Shareable URLs with filters
CI/CD badge generation

Contributing

To improve the dashboard:

Edit index.html (all code is in one file)
Test locally in a browser (use ./serve.sh if opening the file directly blocks JSON loading)
Submit PR with description of changes

License

MIT - Same as OpenCode Agents project