Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated validation.
Last Updated: November 27, 2025
```bash
# Install and build
cd evals/framework
npm install
npm run build

# Run all tests (uses free model by default)
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent
npm run eval:sdk -- --agent=opencoder

# Debug mode (verbose output, keeps sessions)
npm run eval:sdk -- --debug

# View results dashboard
cd ../results && ./serve.sh
```
Validates that OpenCode agents follow their defined rules and behaviors through real execution with actual sessions, not mocks.
- ✅ Real Execution - Creates actual OpenCode sessions, sends prompts, captures responses
- ✅ Event Streaming - Monitors all events (tool calls, messages, permissions) in real-time
- ✅ Automated Validation - Runs evaluators to check compliance with agent rules
- ✅ Content Validation - Verifies file contents, not just that tools were called
- ✅ Subagent Verification - Validates delegation and subagent behavior
- ✅ Enhanced Logging - Captures full tool inputs/outputs with timing
- ✅ Multi-turn Support - Handles approval workflows and complex conversations
| Validation Type | What It Checks |
|---|---|
| Approval Gate | Agent asks for approval before executing risky operations |
| Context Loading | Agent loads required context files before execution |
| Delegation | Agent delegates complex tasks (4+ files) to task-manager |
| Tool Usage | Agent uses correct tools for the task |
| Behavior | Agent follows expected behavior patterns |
| Subagent | Subagents execute correctly when delegated |
| Content | Files contain expected content and patterns |
```
┌─────────────────────────────────────────────────────────────────┐
│                           TEST RUNNER                           │
├─────────────────────────────────────────────────────────────────┤
│ 1. Clean test_tmp/ directory                                    │
│ 2. Start opencode server (from git root)                        │
│ 3. For each test:                                               │
│    a. Create session with specified agent                       │
│    b. Send prompt(s) (single or multi-turn)                     │
│    c. Capture events via event stream                           │
│    d. Extract tool inputs/outputs (enhanced logging)            │
│    e. Run evaluators on session data                            │
│    f. Validate behavior expectations                            │
│    g. Check content expectations                                │
│    h. Verify subagent behavior (if delegated)                   │
│    i. Delete session (unless --debug)                           │
│ 4. Clean test_tmp/ directory                                    │
│ 5. Generate results (JSON + dashboard)                          │
└─────────────────────────────────────────────────────────────────┘
```
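The per-test loop above could be sketched roughly as follows. This is an illustrative sketch only: `SessionClient`, `runTest`, and the evaluator callback are assumed names, not the framework's actual API.

```typescript
// Hypothetical sketch of steps 3a-3i; all names here are illustrative.
type TestCase = { id: string; agent: string; prompts: string[] };
type EvalResult = { rule: string; passed: boolean };

interface SessionClient {
  createSession(agent: string): Promise<string>;
  sendPrompt(sessionId: string, text: string): Promise<void>;
  deleteSession(sessionId: string): Promise<void>;
}

async function runTest(
  client: SessionClient,
  test: TestCase,
  evaluate: (sessionId: string) => Promise<EvalResult[]>,
  debug = false,
): Promise<boolean> {
  const sessionId = await client.createSession(test.agent); // step 3a
  try {
    for (const text of test.prompts) {
      await client.sendPrompt(sessionId, text);             // step 3b
    }
    const results = await evaluate(sessionId);              // steps 3e-3h
    return results.every((r) => r.passed);
  } finally {
    if (!debug) await client.deleteSession(sessionId);      // step 3i
  }
}
```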
During Test Execution:

```
~/.local/share/opencode/storage/
├── session/        # Session metadata (by project hash)
├── message/        # Messages per session (ses_xxx/)
├── part/           # Tool calls, text parts, etc.
└── session_diff/   # Session changes
```

Test Results:

```
evals/results/
├── latest.json       # Most recent run
├── history/2025-11/  # Historical runs
└── index.html        # Interactive dashboard
```
The framework listens to the OpenCode event stream and captures:

```
// Events captured in real-time
- session.created/updated
- message.created/updated
- part.created/updated (includes tool calls)
- permission.request/response

// Enhanced with tool details (NEW)
- Tool name, input, output
- Start time, end time, duration
- Success/error status
```
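The tool-timing capture could work roughly like the sketch below. The `ToolPartEvent` shape and the `ToolLogger` class are assumptions for illustration, not the SDK's real event schema.

```typescript
// Illustrative sketch: pair a "running" part event with its terminal
// event to derive input/output, duration, and success status.
interface ToolPartEvent {
  tool: string;
  status: "running" | "completed" | "error";
  input?: unknown;
  output?: unknown;
  timeMs: number; // event timestamp in milliseconds (assumed field)
}

interface ToolLog {
  tool: string;
  input?: unknown;
  output?: unknown;
  startMs: number;
  endMs?: number;
  durationMs?: number;
  ok?: boolean;
}

class ToolLogger {
  private open = new Map<string, ToolLog>(); // in-flight tool calls by part id
  readonly logs: ToolLog[] = [];

  onEvent(partId: string, e: ToolPartEvent): void {
    if (e.status === "running") {
      this.open.set(partId, { tool: e.tool, input: e.input, startMs: e.timeMs });
      return;
    }
    const log = this.open.get(partId);
    if (!log) return;
    log.output = e.output;
    log.endMs = e.timeMs;
    log.durationMs = e.timeMs - log.startMs;
    log.ok = e.status === "completed";
    this.open.delete(partId);
    this.logs.push(log);
  }
}
```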
```yaml
id: my-test-001
name: My Test Name
description: What this test validates
category: developer        # developer, business, creative, edge-case
agent: openagent           # openagent, opencoder
model: opencode/big-pickle # Optional, defaults to free tier

# Single prompt (simple tests)
prompt: |
  Create a function called add in math.ts

# OR multi-turn prompts (for approval workflows)
prompts:
  - text: |
      Create a function called add in math.ts
    expectContext: true
    contextFile: ".opencode/context/core/standards/code.md"
  - text: "Yes, proceed with the plan."
    delayMs: 2000

# Expected behavior
behavior:
  mustUseTools: [read, write]
  requiresApproval: true
  requiresContext: true
  minToolCalls: 2

# Expected violations (should NOT violate these)
expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error

# Approval strategy
approvalStrategy:
  type: auto-approve
  timeout: 120000
```
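As a rough TypeScript mirror of the YAML fields above (the real definitions live in test-case-schema.ts; the names below follow the YAML, but the exact optionality is an assumption):

```typescript
// Illustrative types only; see test-case-schema.ts for the real schema.
interface PromptTurn {
  text: string;
  expectContext?: boolean;
  contextFile?: string;
  delayMs?: number;
}

interface BehaviorExpectations {
  mustUseTools?: string[];
  requiresApproval?: boolean;
  requiresContext?: boolean;
  minToolCalls?: number;
}

interface ExpectedViolation {
  rule: string;
  shouldViolate: boolean;
  severity: "error" | "warning" | "info";
}

interface ApprovalStrategy {
  type: "auto-approve" | "manual";
  timeout?: number;
}

interface EvalTestCase {
  id: string;
  name: string;
  description: string;
  category: "developer" | "business" | "creative" | "edge-case";
  agent: "openagent" | "opencoder";
  model?: string;               // defaults to free tier
  prompt?: string;              // single-turn, OR:
  prompts?: PromptTurn[];       // multi-turn
  behavior?: BehaviorExpectations;
  expectedViolations?: ExpectedViolation[];
  approvalStrategy?: ApprovalStrategy;
}
```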
Why Multi-Turn? OpenAgent requires approval before execution. Single-turn tests will fail because the agent asks for approval but never receives it.
```yaml
# ❌ WRONG - Single turn (agent asks approval, never gets it)
prompt: "Create a file at test.txt"

# ✅ CORRECT - Multi-turn (agent asks, user approves)
prompts:
  - text: "Create a file at test.txt"
  - text: "Yes, proceed."
    delayMs: 2000
```
Validates the actual content of files written/edited:
```yaml
behavior:
  mustUseTools: [write]
  contentExpectations:
    - filePath: "src/math.ts"
      mustContain:
        - "export function add"
        - ": number"
      mustNotContain:
        - "console.log"
        - "any"
      minLength: 100
      maxLength: 500
```
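The weighted scoring (40/30/20/5/5 across the five check types) might be combined along these lines. This is a sketch, not the content evaluator's actual code; `scoreContent` and the 0-100 scale are illustrative.

```typescript
// Hypothetical weighted content check; weights mirror the Validation
// Types breakdown (40/30/20/5/5). Returns a 0-100 score.
interface ContentExpectation {
  mustContain?: string[];
  mustNotContain?: string[];
  mustMatch?: string; // regex source
  minLength?: number;
  maxLength?: number;
}

function scoreContent(content: string, exp: ContentExpectation): number {
  let score = 0;
  // Required patterns: 40%
  if ((exp.mustContain ?? []).every((p) => content.includes(p))) score += 40;
  // Forbidden patterns: 30%
  if ((exp.mustNotContain ?? []).every((p) => !content.includes(p))) score += 30;
  // Regex match: 20%
  if (!exp.mustMatch || new RegExp(exp.mustMatch).test(content)) score += 20;
  // Length bounds: 5% each
  if (exp.minLength === undefined || content.length >= exp.minLength) score += 5;
  if (exp.maxLength === undefined || content.length <= exp.maxLength) score += 5;
  return score;
}
```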
Validation Types:
- mustContain - Required patterns (40% weight)
- mustNotContain - Forbidden patterns (30% weight)
- mustMatch - Regex pattern (20% weight)
- minLength - Minimum content length (5% weight)
- maxLength - Maximum content length (5% weight)

Validates delegation and subagent behavior:
```yaml
behavior:
  mustUseTools: [task]
  shouldDelegate: true
  delegationExpectations:
    subagentType: "CoderAgent"
    subagentMustUseTools: [write, read]
    subagentMinToolCalls: 2
    subagentMustComplete: true
```
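A delegation check mirroring these expectations could look like the sketch below; the `SubagentTrace` shape and `checkDelegation` helper are assumed names for illustration, not the subagent evaluator's real API.

```typescript
// Illustrative delegation check: compare a subagent's observed trace
// against the delegationExpectations fields.
interface SubagentTrace {
  subagentType: string;
  toolsUsed: string[]; // one entry per tool call, in order
  completed: boolean;
}

interface DelegationExpectations {
  subagentType: string;
  subagentMustUseTools: string[];
  subagentMinToolCalls: number;
  subagentMustComplete: boolean;
}

function checkDelegation(trace: SubagentTrace, exp: DelegationExpectations): string[] {
  const violations: string[] = [];
  if (trace.subagentType !== exp.subagentType)
    violations.push(`wrong subagent: ${trace.subagentType}`);
  for (const tool of exp.subagentMustUseTools)
    if (!trace.toolsUsed.includes(tool)) violations.push(`subagent missed tool: ${tool}`);
  if (trace.toolsUsed.length < exp.subagentMinToolCalls)
    violations.push("too few subagent tool calls");
  if (exp.subagentMustComplete && !trace.completed)
    violations.push("subagent did not complete");
  return violations;
}
```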
Checks: subagent type, required subagent tools, minimum tool-call count, and completion status.

More sophisticated approval validation:
```yaml
behavior:
  requiresApproval: true
  approvalExpectations:
    minConfidence: high # high, medium, low
    approvalMustMention:
      - "file"
      - "create"
    requireExplicitApproval: true
```
Enhanced debugging capabilities:
```yaml
behavior:
  debug:
    logToolDetails: true      # Log all tool I/O
    saveReplayOnFailure: true # Save session for replay
    exportMarkdown: true      # Export to markdown
```
```yaml
behavior:
  # Must use these tools
  mustUseTools: [read, write]

  # Must use at least one of these sets
  mustUseAnyOf: [[bash], [list]]

  # May use these (optional)
  mayUseTools: [glob, grep]

  # Must NOT use these
  mustNotUseTools: [edit]

  # Tool call count
  minToolCalls: 2
  maxToolCalls: 10
```
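Enforcement of these rules might look like the sketch below; `checkToolRules` and the returned violation strings are illustrative, not the tool-usage evaluator's real API. `mayUseTools` needs no check here, since optional tools can never produce a violation.

```typescript
// Illustrative tool-usage check against the observed list of tool calls.
interface ToolRules {
  mustUseTools?: string[];
  mustUseAnyOf?: string[][]; // at least one set must be fully used
  mustNotUseTools?: string[];
  minToolCalls?: number;
  maxToolCalls?: number;
}

function checkToolRules(called: string[], rules: ToolRules): string[] {
  const violations: string[] = [];
  for (const tool of rules.mustUseTools ?? [])
    if (!called.includes(tool)) violations.push(`missing required tool: ${tool}`);
  if (rules.mustUseAnyOf &&
      !rules.mustUseAnyOf.some((set) => set.every((tool) => called.includes(tool))))
    violations.push("none of the required tool sets were used");
  for (const tool of rules.mustNotUseTools ?? [])
    if (called.includes(tool)) violations.push(`forbidden tool used: ${tool}`);
  if (rules.minToolCalls !== undefined && called.length < rules.minToolCalls)
    violations.push("too few tool calls");
  if (rules.maxToolCalls !== undefined && called.length > rules.maxToolCalls)
    violations.push("too many tool calls");
  return violations;
}
```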
```bash
# Run all tests
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent

# Run with debug output
npm run eval:sdk -- --debug

# Filter tests by pattern
npm run eval:sdk -- --agent=openagent --filter="context-loading"

# Run in batches of 3 with 10s delays
cd evals/framework/scripts/utils
./run-tests-batch.sh openagent 3 10
```
When running with --debug, the framework logs full tool inputs and outputs and keeps sessions for inspection. Example debug output:
```
────────────────────────────────────────────────────────────
🔧 TOOL: write (completed)
────────────────────────────────────────────────────────────
📥 INPUT:
{
  "filePath": "test.ts",
  "content": "export function add..."
}
📤 OUTPUT:
{
  "success": true,
  "bytesWritten": 67
}
⏱️ Duration: 12ms
────────────────────────────────────────────────────────────
```
```
============================================================
Running test: ctx-code-001 - Code Task with Context Loading
============================================================
Approval strategy: Auto-approve all permission requests
Creating session...
Session created: ses_abc123
  Agent: openagent
  Model: anthropic/claude-sonnet-4-5
Sending 2 prompts (multi-turn)...
  Prompt 1/2: Create a function...
  Completed
  Prompt 2/2: Yes, proceed...
  Completed
Running evaluators...
  ✅ approval-gate: PASSED
  ✅ context-loading: PASSED
  ✅ tool-usage: PASSED
  ✅ behavior: PASSED
  ✅ content: PASSED
Test PASSED
Duration: 35142ms
Events captured: 116
```
```bash
cd evals/results
./serve.sh
# Open http://localhost:8000
```
Dashboard Features:
Violations Detected:
```
1. [error] missing-required-tool: Required tool 'write' was not used
2. [error] missing-required-patterns: File missing: export function
3. [warning] over-delegation: Delegated for < 4 files (acceptable)
```
Severity Levels:
- error - Test fails
- warning - Test passes but flagged
- info - Informational only

Problem: Agent responds but doesn't execute tools.
Cause: Single-turn test when multi-turn needed (OpenAgent requires approval).
Solution:
```yaml
# Change from:
prompt: "Create a file"

# To:
prompts:
  - text: "Create a file"
  - text: "Yes, proceed."
    delayMs: 2000
```
Problem: Same test ID appears in multiple files.
Cause: Old and new test structures both present.
Solution: Ensure unique test IDs across all test files.
```bash
# Check for duplicates
find evals/agents/*/tests -name "*.yaml" -exec grep "^id:" {} \; | sort | uniq -d
```
Problem: Context loading evaluator fails.
Cause: Context file read before first prompt sent.
Solution: Use expectContext: true on the prompt that needs context:
```yaml
prompts:
  - text: "Create a function"
    expectContext: true
    contextFile: ".opencode/context/core/standards/code.md"
```
Problem: Content expectations not met.
Cause: File content doesn't match expectations.
Debug:
```bash
# Run with debug to see actual content
npm run eval:sdk -- --debug --filter="your-test"

# Check the file that was written
cat evals/test_tmp/your-file.ts
```
Problem: When multiple test files have the same id, the test runner loads both but only one executes (unpredictably).
Solution: Always ensure unique test IDs. Use a naming convention:
```
{category}-{feature}-{number}

ctx-code-001
ctx-docs-002
```
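An ID hygiene check matching this convention could be sketched as below; `ID_PATTERN` and `findDuplicateIds` are illustrative helpers, not part of the framework.

```typescript
// Illustrative: validate the {category}-{feature}-{number} convention
// and detect duplicate test IDs across all loaded test files.
const ID_PATTERN = /^[a-z]+-[a-z]+-\d{3}$/;

function findDuplicateIds(ids: string[]): string[] {
  const seen = new Set<string>();
  const dups = new Set<string>();
  for (const id of ids) (seen.has(id) ? dups : seen).add(id);
  return [...dups];
}
```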
Problem: OpenAgent asks for approval before execution. Single-turn tests fail because the agent never receives approval.
Solution: Always use multi-turn prompts for OpenAgent:
```yaml
prompts:
  - text: "Do the task"
  - text: "Yes, proceed."
    delayMs: 2000
```
Problem: Checking IF a tool was called doesn't verify WHAT it did.
Solution: Use content expectations to validate actual output:
```yaml
behavior:
  mustUseTools: [write]   # Checks IF write was called
  contentExpectations:    # Checks WHAT was written
    - filePath: "test.ts"
      mustContain: ["export", "function"]
```
Problem: Without tool I/O logging, debugging failures is difficult.
Solution: Enhanced event logging captures full tool inputs, outputs, and timing.
Problem: Adding new features can break existing tests.
Solution: Make all new fields optional:
```typescript
contentExpectations?: ContentExpectation[];     // Optional
delegationExpectations?: DelegationExpectation; // Optional
```
```
evals/
├── framework/                  # Test framework code
│   ├── src/
│   │   ├── evaluators/         # Validation logic
│   │   │   ├── approval-gate-evaluator.ts
│   │   │   ├── context-loading-evaluator.ts
│   │   │   ├── delegation-evaluator.ts
│   │   │   ├── tool-usage-evaluator.ts
│   │   │   ├── behavior-evaluator.ts
│   │   │   ├── subagent-evaluator.ts   # NEW
│   │   │   └── content-evaluator.ts    # NEW
│   │   ├── sdk/                # Test execution
│   │   │   ├── test-runner.ts
│   │   │   ├── test-executor.ts
│   │   │   ├── event-stream-handler.ts # Enhanced
│   │   │   ├── event-logger.ts         # Enhanced
│   │   │   └── test-case-schema.ts     # Updated
│   │   └── types/              # TypeScript types
│   └── package.json
│
├── agents/                     # Agent-specific tests
│   ├── openagent/
│   │   └── tests/
│   │       ├── 01-critical-rules/
│   │       ├── 02-workflow-stages/
│   │       ├── 03-delegation/
│   │       ├── 04-execution-paths/
│   │       ├── 05-edge-cases/
│   │       └── 06-integration/
│   └── opencoder/
│       └── tests/
│
├── results/                    # Test results
│   ├── latest.json
│   ├── history/
│   └── index.html              # Dashboard
│
└── test_tmp/                   # Temporary test files
```
Remaining Enhancements:
- Task 03: Enhanced Approval Detection (~1 hour)
- Task 04: Session Replay Utility (~1.5 hours): npm run replay <session-id>
- Task 07: Integration Testing (~1 hour)

Key files:

- evals/agents/openagent/tests/06-integration/medium/03-full-validation-example.yaml
- evals/framework/src/
- evals/results/index.html
- ~/.local/share/opencode/storage/

When adding new tests:
Framework Version: 0.1.0
Status: Production Ready ✅