# OpenCode Agent Evaluation Framework - Complete Guide

**Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated validation.**

Last Updated: November 27, 2025

---

## 📋 Table of Contents

1. [Quick Start](#quick-start)
2. [What This Framework Does](#what-this-framework-does)
3. [How Tests Work](#how-tests-work)
4. [Writing Tests](#writing-tests)
5. [Validation Features](#validation-features)
6. [Running Tests](#running-tests)
7. [Understanding Results](#understanding-results)
8. [Troubleshooting](#troubleshooting)
9. [Key Learnings](#key-learnings)

---

## 🚀 Quick Start

```bash
# Install and build
cd evals/framework
npm install
npm run build

# Run all tests (uses free model by default)
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent
npm run eval:sdk -- --agent=opencoder

# Debug mode (verbose output, keeps sessions)
npm run eval:sdk -- --debug

# View results dashboard
cd ../results && ./serve.sh
```

---

## 🎯 What This Framework Does

### Purpose

Validates that OpenCode agents follow their defined rules and behaviors through **real execution** with actual sessions, not mocks.
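
To make "validating rules against real execution" more concrete, here is a minimal sketch of what a rule check over recorded session events might look like, using the rule that an agent must ask for approval before risky operations. The `RecordedEvent` and `Violation` shapes, the `RISKY_TOOLS` set, and the `checkApprovalGate` function are all hypothetical illustrations, not the framework's actual types or evaluator API.

```typescript
// Hypothetical event and violation shapes, for illustration only.
// The framework's real types may look quite different.
type RecordedEvent =
  | { kind: "approval-request"; at: number }
  | { kind: "tool-call"; tool: string; at: number };

interface Violation {
  rule: string;
  severity: "error" | "warning" | "info";
  message: string;
}

// Tools treated as risky in this sketch (an assumption, not the framework's list).
const RISKY_TOOLS = new Set(["write", "edit", "bash"]);

// Flags every risky tool call that occurs before the first approval request.
function checkApprovalGate(events: RecordedEvent[]): Violation[] {
  const ordered = [...events].sort((a, b) => a.at - b.at);
  const violations: Violation[] = [];
  let approvalSeen = false;

  for (const event of ordered) {
    if (event.kind === "approval-request") {
      approvalSeen = true;
    } else if (RISKY_TOOLS.has(event.tool) && !approvalSeen) {
      violations.push({
        rule: "approval-gate",
        severity: "error",
        message: `Tool '${event.tool}' ran before any approval request`,
      });
    }
  }
  return violations;
}

// Usage: the write happens before the agent asks for approval.
const sampleViolations = checkApprovalGate([
  { kind: "tool-call", tool: "read", at: 1 },
  { kind: "tool-call", tool: "write", at: 2 },
  { kind: "approval-request", at: 3 },
]);
console.log(sampleViolations.length); // 1
```

The point of the sketch is that the check runs on events captured from a real session, so it reflects what the agent actually did rather than what a mock was scripted to do.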
### Key Capabilities

- ✅ **Real Execution** - Creates actual OpenCode sessions, sends prompts, captures responses
- ✅ **Event Streaming** - Monitors all events (tool calls, messages, permissions) in real-time
- ✅ **Automated Validation** - Runs evaluators to check compliance with agent rules
- ✅ **Content Validation** - Verifies file contents, not just that tools were called
- ✅ **Subagent Verification** - Validates delegation and subagent behavior
- ✅ **Enhanced Logging** - Captures full tool inputs/outputs with timing
- ✅ **Multi-turn Support** - Handles approval workflows and complex conversations

### What Gets Tested

| Validation Type | What It Checks |
|----------------|----------------|
| **Approval Gate** | Agent asks for approval before executing risky operations |
| **Context Loading** | Agent loads required context files before execution |
| **Delegation** | Agent delegates complex tasks (4+ files) to task-manager |
| **Tool Usage** | Agent uses correct tools for the task |
| **Behavior** | Agent follows expected behavior patterns |
| **Subagent** | Subagents execute correctly when delegated |
| **Content** | Files contain expected content and patterns |

---

## 🔧 How Tests Work

### Test Execution Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                           TEST RUNNER                           │
├─────────────────────────────────────────────────────────────────┤
│  1. Clean test_tmp/ directory                                   │
│  2. Start opencode server (from git root)                       │
│  3. For each test:                                              │
│     a. Create session with specified agent                      │
│     b. Send prompt(s) (single or multi-turn)                    │
│     c. Capture events via event stream                          │
│     d. Extract tool inputs/outputs (enhanced logging)           │
│     e. Run evaluators on session data                           │
│     f. Validate behavior expectations                           │
│     g. Check content expectations                               │
│     h. Verify subagent behavior (if delegated)                  │
│     i. Delete session (unless --debug)                          │
│  4. Clean test_tmp/ directory                                   │
│  5. Generate results (JSON + dashboard)                         │
└─────────────────────────────────────────────────────────────────┘
```

### Where Data Lives

**During Test Execution:**

```
~/.local/share/opencode/storage/
├── session/          # Session metadata (by project hash)
├── message/          # Messages per session (ses_xxx/)
├── part/             # Tool calls, text parts, etc.
└── session_diff/     # Session changes
```

**Test Results:**

```
evals/results/
├── latest.json           # Most recent run
├── history/2025-11/      # Historical runs
└── index.html            # Interactive dashboard
```

### Event Stream Monitoring

The framework listens to the OpenCode event stream and captures:

```typescript
// Events captured in real-time
- session.created/updated
- message.created/updated
- part.created/updated (includes tool calls)
- permission.request/response

// Enhanced with tool details (NEW)
- Tool name, input, output
- Start time, end time, duration
- Success/error status
```

---

## ✍️ Writing Tests

### Basic Test Structure

```yaml
id: my-test-001
name: My Test Name
description: What this test validates
category: developer          # developer, business, creative, edge-case
agent: openagent             # openagent, opencoder
model: opencode/big-pickle   # Optional, defaults to free tier

# Single prompt (simple tests)
prompt: |
  Create a function called add in math.ts

# OR Multi-turn prompts (for approval workflows)
prompts:
  - text: |
      Create a function called add in math.ts
    expectContext: true
    contextFile: ".opencode/context/core/standards/code.md"
  - text: "Yes, proceed with the plan."
    delayMs: 2000

# Expected behavior
behavior:
  mustUseTools: [read, write]
  requiresApproval: true
  requiresContext: true
  minToolCalls: 2

# Expected violations (should NOT violate these)
expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error

# Approval strategy
approvalStrategy:
  type: auto-approve

timeout: 120000
```

### Multi-Turn Tests (Critical for OpenAgent)

**Why Multi-Turn?** OpenAgent requires approval before execution.
Single-turn tests will fail because the agent asks for approval but never receives it.

```yaml
# ❌ WRONG - Single turn (agent asks approval, never gets it)
prompt: "Create a file at test.txt"

# ✅ CORRECT - Multi-turn (agent asks, user approves)
prompts:
  - text: "Create a file at test.txt"
  - text: "Yes, proceed."
    delayMs: 2000
```

---

## 🎨 Validation Features

### 1. Content Validation (NEW)

Validates the **actual content** of files written/edited:

```yaml
behavior:
  mustUseTools: [write]
  contentExpectations:
    - filePath: "src/math.ts"
      mustContain:
        - "export function add"
        - ": number"
      mustNotContain:
        - "console.log"
        - "any"
      minLength: 100
      maxLength: 500
```

**Validation Types:**

- `mustContain` - Required patterns (40% weight)
- `mustNotContain` - Forbidden patterns (30% weight)
- `mustMatch` - Regex pattern (20% weight)
- `minLength` - Minimum content length (5% weight)
- `maxLength` - Maximum content length (5% weight)

### 2. Subagent Verification (NEW)

Validates delegation and subagent behavior:

```yaml
behavior:
  mustUseTools: [task]
  shouldDelegate: true
  delegationExpectations:
    subagentType: "CoderAgent"
    subagentMustUseTools: [write, read]
    subagentMinToolCalls: 2
    subagentMustComplete: true
```

**Checks:**

- Correct subagent type invoked (30% weight)
- Subagent used required tools (40% weight)
- Minimum tool calls met (20% weight)
- Subagent completed successfully (10% weight)

### 3. Enhanced Approval Detection (NEW)

More sophisticated approval validation:

```yaml
behavior:
  requiresApproval: true
  approvalExpectations:
    minConfidence: high        # high, medium, low
    approvalMustMention:
      - "file"
      - "create"
    requireExplicitApproval: true
```

### 4. Debug Options (NEW)

Enhanced debugging capabilities:

```yaml
behavior:
  debug:
    logToolDetails: true         # Log all tool I/O
    saveReplayOnFailure: true    # Save session for replay
    exportMarkdown: true         # Export to markdown
```

### 5. Tool Usage Validation

```yaml
behavior:
  # Must use these tools
  mustUseTools: [read, write]

  # Must use at least one of these sets
  mustUseAnyOf: [[bash], [list]]

  # May use these (optional)
  mayUseTools: [glob, grep]

  # Must NOT use these
  mustNotUseTools: [edit]

  # Tool call count
  minToolCalls: 2
  maxToolCalls: 10
```

---

## 🏃 Running Tests

### Basic Commands

```bash
# Run all tests
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent

# Run with debug output
npm run eval:sdk -- --debug

# Filter tests by pattern
npm run eval:sdk -- --agent=openagent --filter="context-loading"
```

### Batch Execution (Avoid Rate Limits)

```bash
# Run in batches of 3 with 10s delays
cd evals/framework/scripts/utils
./run-tests-batch.sh openagent 3 10
```

### Debug Mode Features

When running with `--debug`:

- ✅ Full event logging with tool I/O
- ✅ Sessions kept for inspection
- ✅ Detailed timeline output
- ✅ Tool duration tracking

**Example Debug Output:**

```
────────────────────────────────────────────────────────────
🔧 TOOL: write (completed)
────────────────────────────────────────────────────────────
📥 INPUT:
{
  "filePath": "test.ts",
  "content": "export function add..."
}

📤 OUTPUT:
{
  "success": true,
  "bytesWritten": 67
}

⏱️  Duration: 12ms
────────────────────────────────────────────────────────────
```

---

## 📊 Understanding Results

### Test Output

```
============================================================
Running test: ctx-code-001 - Code Task with Context Loading
============================================================
Approval strategy: Auto-approve all permission requests
Creating session...
Session created: ses_abc123
Agent: openagent
Model: anthropic/claude-sonnet-4-5
Sending 2 prompts (multi-turn)...
Prompt 1/2: Create a function...
  Completed
Prompt 2/2: Yes, proceed...
  Completed
Running evaluators...
  ✅ approval-gate: PASSED
  ✅ context-loading: PASSED
  ✅ tool-usage: PASSED
  ✅ behavior: PASSED
  ✅ content: PASSED

Test PASSED
Duration: 35142ms
Events captured: 116
```

### Results Dashboard

```bash
cd evals/results
./serve.sh
# Open http://localhost:8000
```

**Dashboard Features:**

- Filter by agent, category, status
- View violation details
- See test trends over time
- Export results

### Understanding Violations

```
Violations Detected:
  1. [error] missing-required-tool: Required tool 'write' was not used
  2. [error] missing-required-patterns: File missing: export function
  3. [warning] over-delegation: Delegated for < 4 files (acceptable)
```

**Severity Levels:**

- `error` - Test fails
- `warning` - Test passes but flagged
- `info` - Informational only

---

## 🔍 Troubleshooting

### Common Issues

#### 1. Tests Failing with "No tool calls"

**Problem:** Agent responds but doesn't execute tools.

**Cause:** Single-turn test when multi-turn needed (OpenAgent requires approval).

**Solution:**

```yaml
# Change from:
prompt: "Create a file"

# To:
prompts:
  - text: "Create a file"
  - text: "Yes, proceed."
    delayMs: 2000
```

#### 2. Duplicate Test IDs

**Problem:** Same test ID appears in multiple files.

**Cause:** Old and new test structures both present.

**Solution:** Ensure unique test IDs across all test files.

```bash
# Check for duplicates
find evals/agents/*/tests -name "*.yaml" -exec grep "^id:" {} \; | sort | uniq -d
```

#### 3. Context Not Loading

**Problem:** Context loading evaluator fails.

**Cause:** Context file read before first prompt sent.

**Solution:** Use `expectContext: true` on the prompt that needs context:

```yaml
prompts:
  - text: "Create a function"
    expectContext: true
    contextFile: ".opencode/context/core/standards/code.md"
```

#### 4. Content Validation Fails

**Problem:** Content expectations not met.

**Cause:** File content doesn't match expectations.
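
One way to narrow down which expectation failed is to re-run the checks by hand against the file the agent wrote. The sketch below is a rough stand-in, not the framework's content evaluator: the `ContentExpectation` interface only mirrors the YAML field names used in this guide, `explainContentFailures` is a hypothetical helper, and `mustContain` / `mustNotContain` entries are assumed to be plain substrings (the real evaluator may interpret them differently).

```typescript
// Hypothetical shape mirroring the YAML fields in this guide.
interface ContentExpectation {
  mustContain?: string[];
  mustNotContain?: string[];
  mustMatch?: string; // regular expression source
  minLength?: number;
  maxLength?: number;
}

// Returns one human-readable line per failed expectation.
// Assumption: patterns are matched as plain substrings.
function explainContentFailures(
  content: string,
  expectation: ContentExpectation,
): string[] {
  const failures: string[] = [];

  for (const pattern of expectation.mustContain ?? []) {
    if (!content.includes(pattern)) {
      failures.push(`missing required pattern: ${pattern}`);
    }
  }
  for (const pattern of expectation.mustNotContain ?? []) {
    if (content.includes(pattern)) {
      failures.push(`contains forbidden pattern: ${pattern}`);
    }
  }
  if (
    expectation.mustMatch !== undefined &&
    !new RegExp(expectation.mustMatch).test(content)
  ) {
    failures.push(`does not match: ${expectation.mustMatch}`);
  }
  if (
    expectation.minLength !== undefined &&
    content.length < expectation.minLength
  ) {
    failures.push(
      `length ${content.length} is below minLength ${expectation.minLength}`,
    );
  }
  if (
    expectation.maxLength !== undefined &&
    content.length > expectation.maxLength
  ) {
    failures.push(
      `length ${content.length} is above maxLength ${expectation.maxLength}`,
    );
  }
  return failures;
}

// Usage: a short file that has the right patterns but is under minLength.
const written = "export function add(a: number, b: number) { return a + b; }";
console.log(
  explainContentFailures(written, {
    mustContain: ["export function add", ": number"],
    mustNotContain: ["console.log"],
    minLength: 100,
  }),
);
```

Listing each failed expectation separately makes it easier to tell a missing pattern from a length bound that is simply too strict for the generated file.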

**Debug:**

```bash
# Run with debug to see actual content
npm run eval:sdk -- --debug --filter="your-test"

# Check the file that was written
cat evals/test_tmp/your-file.ts
```

---

## 🎓 Key Learnings

### 1. Duplicate Test IDs Are Dangerous

**Problem:** When multiple test files have the same `id`, the test runner loads both but only one executes (unpredictably).

**Solution:** Always ensure unique test IDs. Use a naming convention:

```
{category}-{feature}-{number}

ctx-code-001
ctx-docs-002
```

### 2. Multi-Turn is Essential for OpenAgent

**Problem:** OpenAgent asks for approval before execution. Single-turn tests fail because the agent never receives approval.

**Solution:** Always use multi-turn prompts for OpenAgent:

```yaml
prompts:
  - text: "Do the task"
  - text: "Yes, proceed."
    delayMs: 2000
```

### 3. Content Validation > Tool Usage

**Problem:** Checking IF a tool was called doesn't verify WHAT it did.

**Solution:** Use content expectations to validate actual output:

```yaml
behavior:
  mustUseTools: [write]        # Checks IF write was called
  contentExpectations:         # Checks WHAT was written
    - filePath: "test.ts"
      mustContain: ["export", "function"]
```

### 4. Enhanced Logging is Foundational

**Problem:** Without tool I/O logging, debugging failures is difficult.

**Solution:** Enhanced event logging captures everything:

- Tool inputs and outputs
- Duration per tool
- Error details
- Enables content validation and subagent verification

### 5. Backward Compatibility Matters

**Problem:** Adding new features can break existing tests.

**Solution:** Make all new fields optional:

```typescript
contentExpectations?: ContentExpectation[];      // Optional
delegationExpectations?: DelegationExpectation;  // Optional
```

---

## 📁 Directory Structure

```
evals/
├── framework/                        # Test framework code
│   ├── src/
│   │   ├── evaluators/               # Validation logic
│   │   │   ├── approval-gate-evaluator.ts
│   │   │   ├── context-loading-evaluator.ts
│   │   │   ├── delegation-evaluator.ts
│   │   │   ├── tool-usage-evaluator.ts
│   │   │   ├── behavior-evaluator.ts
│   │   │   ├── subagent-evaluator.ts       # NEW
│   │   │   └── content-evaluator.ts        # NEW
│   │   ├── sdk/                      # Test execution
│   │   │   ├── test-runner.ts
│   │   │   ├── test-executor.ts
│   │   │   ├── event-stream-handler.ts     # Enhanced
│   │   │   ├── event-logger.ts             # Enhanced
│   │   │   └── test-case-schema.ts         # Updated
│   │   └── types/                    # TypeScript types
│   └── package.json
│
├── agents/                           # Agent-specific tests
│   ├── openagent/
│   │   └── tests/
│   │       ├── 01-critical-rules/
│   │       ├── 02-workflow-stages/
│   │       ├── 03-delegation/
│   │       ├── 04-execution-paths/
│   │       ├── 05-edge-cases/
│   │       └── 06-integration/
│   └── opencoder/
│       └── tests/
│
├── results/                          # Test results
│   ├── latest.json
│   ├── history/
│   └── index.html                    # Dashboard
│
└── test_tmp/                         # Temporary test files
```

---

## 🚀 Next Steps

### For Test Writers

1. **Start Simple** - Write basic tests first, add complexity later
2. **Use Multi-Turn** - Always for OpenAgent approval workflows
3. **Validate Content** - Don't just check tools, check outputs
4. **Test Incrementally** - Run tests frequently during development

### For Framework Developers

**Remaining Enhancements:**

1. **Task 03: Enhanced Approval Detection** (~1 hour)
   - High/medium/low confidence levels
   - Capture actual approval text
   - Reduce false positives/negatives
2. **Task 04: Session Replay Utility** (~1.5 hours)
   - Replay failed sessions for debugging
   - Console/markdown/HTML output
   - CLI: `npm run replay `
3. **Task 07: Integration Testing** (~1 hour)
   - End-to-end integration tests
   - Verify all features work together
   - Performance benchmarks

### For Production Use

1. **Run Full Test Suite** - Verify all tests pass
2. **Update Agent Docs** - Document new validation features
3. **Create Migration Guide** - Help users update existing tests
4. **Monitor Pass Rates** - Track test health over time

---

## 📚 Additional Resources

- **Test Examples**: `evals/agents/openagent/tests/06-integration/medium/03-full-validation-example.yaml`
- **Framework Code**: `evals/framework/src/`
- **Results Dashboard**: `evals/results/index.html`
- **Session Storage**: `~/.local/share/opencode/storage/`

---

## 🤝 Contributing

When adding new tests:

1. ✅ Use unique test IDs
2. ✅ Use multi-turn for approval workflows
3. ✅ Add content expectations when validating outputs
4. ✅ Include clear descriptions
5. ✅ Test locally before committing
6. ✅ Update this guide if adding new features

---

**Last Updated:** November 27, 2025
**Framework Version:** 0.1.0
**Status:** Production Ready ✅