Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated validation.
Last Updated: November 27, 2025
```bash
# Install and build
cd evals/framework
npm install
npm run build

# Run all tests (uses free model by default)
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent
npm run eval:sdk -- --agent=opencoder

# Debug mode (verbose output, keeps sessions)
npm run eval:sdk -- --debug

# View results dashboard
cd ../results && ./serve.sh
```
Validates that OpenCode agents follow their defined rules and behaviors through real execution with actual sessions, not mocks.
- ✅ Real Execution - Creates actual OpenCode sessions, sends prompts, captures responses
- ✅ Event Streaming - Monitors all events (tool calls, messages, permissions) in real time
- ✅ Automated Validation - Runs evaluators to check compliance with agent rules
- ✅ Content Validation - Verifies file contents, not just that tools were called
- ✅ Subagent Verification - Validates delegation and subagent behavior
- ✅ Enhanced Logging - Captures full tool inputs/outputs with timing
- ✅ Multi-turn Support - Handles approval workflows and complex conversations
| Validation Type | What It Checks |
|---|---|
| Approval Gate | Agent asks for approval before executing risky operations |
| Context Loading | Agent loads required context files before execution |
| Delegation | Agent delegates complex tasks (4+ files) to task-manager |
| Tool Usage | Agent uses correct tools for the task |
| Behavior | Agent follows expected behavior patterns |
| Subagent | Subagents execute correctly when delegated |
| Content | Files contain expected content and patterns |
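Each validation type is implemented as an evaluator that reports violations against the session. As a rough sketch of that shape (the type and field names here are illustrative, not the framework's actual API; the real types live under `evals/framework/src/types/`):

```typescript
// Hypothetical shapes for illustration only; the framework's real
// evaluator types may differ.
interface Violation {
  rule: string;
  severity: "error" | "warning" | "info";
  message: string;
}

interface EvaluatorResult {
  evaluator: string;
  passed: boolean;
  violations: Violation[];
}

// A minimal tool-usage check: every required tool must appear among
// the tools the session actually called.
function evaluateToolUsage(required: string[], called: string[]): EvaluatorResult {
  const violations: Violation[] = required
    .filter((tool) => !called.includes(tool))
    .map((tool) => ({
      rule: "missing-required-tool",
      severity: "error" as const,
      message: `Required tool '${tool}' was not used`,
    }));
  return { evaluator: "tool-usage", passed: violations.length === 0, violations };
}
```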
```
┌────────────────────────────────────────────────────────┐
│                      TEST RUNNER                       │
├────────────────────────────────────────────────────────┤
│ 1. Clean test_tmp/ directory                           │
│ 2. Start opencode server (from git root)               │
│ 3. For each test:                                      │
│    a. Create session with specified agent              │
│    b. Send prompt(s) (single or multi-turn)            │
│    c. Capture events via event stream                  │
│    d. Extract tool inputs/outputs (enhanced logging)   │
│    e. Run evaluators on session data                   │
│    f. Validate behavior expectations                   │
│    g. Check content expectations                       │
│    h. Verify subagent behavior (if delegated)          │
│    i. Delete session (unless --debug)                  │
│ 4. Clean test_tmp/ directory                           │
│ 5. Generate results (JSON + dashboard)                 │
└────────────────────────────────────────────────────────┘
```
During Test Execution:
```
~/.local/share/opencode/storage/
├── session/        # Session metadata (by project hash)
├── message/        # Messages per session (ses_xxx/)
├── part/           # Tool calls, text parts, etc.
└── session_diff/   # Session changes
```
Test Results:
```
evals/results/
├── latest.json        # Most recent run
├── history/2025-11/   # Historical runs
└── index.html         # Interactive dashboard
```
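A results consumer like the dashboard needs only a small summary over `latest.json`. A sketch of that aggregation, where the per-test result shape is an assumption rather than the file's documented schema:

```typescript
// Assumed per-test result shape; the actual latest.json schema may differ.
interface TestResult {
  id: string;
  passed: boolean;
  durationMs: number;
}

// Aggregate a run into the totals a dashboard would display.
function summarize(results: TestResult[]) {
  const passed = results.filter((r) => r.passed).length;
  return {
    total: results.length,
    passed,
    failed: results.length - passed,
    passRate: results.length === 0 ? 0 : passed / results.length,
  };
}
```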
The framework listens to the OpenCode event stream and captures:
```
// Events captured in real-time
- session.created/updated
- message.created/updated
- part.created/updated (includes tool calls)
- permission.request/response

// Enhanced with tool details (NEW)
- Tool name, input, output
- Start time, end time, duration
- Success/error status
```
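Deriving per-call timing from the part events could look roughly like the following; the event field names (`toolCallId`, `state`, `timestamp`) are assumptions for illustration, not the SDK's actual schema:

```typescript
// Illustrative event shape; the real SDK event schema may differ.
interface PartEvent {
  toolCallId: string;
  tool: string;
  state: "running" | "completed" | "error";
  timestamp: number; // ms since epoch
}

interface ToolCallRecord {
  tool: string;
  status: "completed" | "error";
  durationMs: number;
}

// Pair each terminal event with its "running" event to compute duration.
function collectToolCalls(events: PartEvent[]): ToolCallRecord[] {
  const starts = new Map<string, PartEvent>();
  const records: ToolCallRecord[] = [];
  for (const ev of events) {
    if (ev.state === "running") {
      starts.set(ev.toolCallId, ev);
    } else {
      const start = starts.get(ev.toolCallId);
      if (start) {
        records.push({
          tool: ev.tool,
          status: ev.state,
          durationMs: ev.timestamp - start.timestamp,
        });
      }
    }
  }
  return records;
}
```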
```yaml
id: my-test-001
name: My Test Name
description: What this test validates
category: developer          # developer, business, creative, edge-case
agent: openagent             # openagent, opencoder
model: opencode/big-pickle   # Optional, defaults to free tier

# Single prompt (simple tests)
prompt: |
  Create a function called add in math.ts

# OR multi-turn prompts (for approval workflows)
prompts:
  - text: |
      Create a function called add in math.ts
    expectContext: true
    contextFile: ".opencode/context/core/standards/code.md"
  - text: "Yes, proceed with the plan."
    delayMs: 2000

# Expected behavior
behavior:
  mustUseTools: [read, write]
  requiresApproval: true
  requiresContext: true
  minToolCalls: 2

# Expected violations (should NOT violate these)
expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error

# Approval strategy
approvalStrategy:
  type: auto-approve
  timeout: 120000
```
Why Multi-Turn? OpenAgent requires approval before execution. Single-turn tests will fail because the agent asks for approval but never receives it.
```yaml
# ❌ WRONG - Single turn (agent asks approval, never gets it)
prompt: "Create a file at test.txt"

# ✅ CORRECT - Multi-turn (agent asks, user approves)
prompts:
  - text: "Create a file at test.txt"
  - text: "Yes, proceed."
    delayMs: 2000
```
Validates the actual content of files written/edited:
```yaml
behavior:
  mustUseTools: [write]
  contentExpectations:
    - filePath: "src/math.ts"
      mustContain:
        - "export function add"
        - ": number"
      mustNotContain:
        - "console.log"
        - "any"
      minLength: 100
      maxLength: 500
```
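A sketch of how such an expectation could be combined into a single score, assuming the framework's documented weights (mustContain 40%, mustNotContain 30%, mustMatch 20%, length bounds 5% each); the function name and all-or-nothing weighting per criterion are illustrative assumptions:

```typescript
// Mirrors the YAML contentExpectations fields; shapes are illustrative.
interface ContentExpectation {
  mustContain?: string[];
  mustNotContain?: string[];
  mustMatch?: string; // regex source
  minLength?: number;
  maxLength?: number;
}

// Score a file's content from 0 to 100; absent criteria count as satisfied.
function scoreContent(content: string, exp: ContentExpectation): number {
  let score = 0;
  if ((exp.mustContain ?? []).every((p) => content.includes(p))) score += 40;
  if ((exp.mustNotContain ?? []).every((p) => !content.includes(p))) score += 30;
  if (!exp.mustMatch || new RegExp(exp.mustMatch).test(content)) score += 20;
  if (exp.minLength === undefined || content.length >= exp.minLength) score += 5;
  if (exp.maxLength === undefined || content.length <= exp.maxLength) score += 5;
  return score;
}
```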
Validation Types:
- mustContain - Required patterns (40% weight)
- mustNotContain - Forbidden patterns (30% weight)
- mustMatch - Regex pattern (20% weight)
- minLength - Minimum content length (5% weight)
- maxLength - Maximum content length (5% weight)

Validates delegation and subagent behavior:
```yaml
behavior:
  mustUseTools: [task]
  shouldDelegate: true
  delegationExpectations:
    subagentType: "CoderAgent"
    subagentMustUseTools: [write, read]
    subagentMinToolCalls: 2
    subagentMustComplete: true
```
Checks:

- The spawned subagent matches subagentType
- The subagent used every tool in subagentMustUseTools
- The subagent made at least subagentMinToolCalls tool calls
- The subagent ran to completion when subagentMustComplete is set
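Those delegation checks can be sketched as follows; the `SubagentRun` shape is an assumed capture of the delegated session, not the framework's actual type:

```typescript
// Mirrors the YAML delegationExpectations fields; shapes are illustrative.
interface DelegationExpectations {
  subagentType: string;
  subagentMustUseTools: string[];
  subagentMinToolCalls: number;
  subagentMustComplete: boolean;
}

// Assumed summary of the delegated subagent's session.
interface SubagentRun {
  type: string;
  toolCalls: string[];
  completed: boolean;
}

// Return a human-readable error per failed expectation.
function checkDelegation(exp: DelegationExpectations, run: SubagentRun): string[] {
  const errors: string[] = [];
  if (run.type !== exp.subagentType) errors.push(`wrong subagent type: ${run.type}`);
  for (const t of exp.subagentMustUseTools) {
    if (!run.toolCalls.includes(t)) errors.push(`subagent missing tool: ${t}`);
  }
  if (run.toolCalls.length < exp.subagentMinToolCalls) {
    errors.push(`too few subagent tool calls: ${run.toolCalls.length}`);
  }
  if (exp.subagentMustComplete && !run.completed) errors.push("subagent did not complete");
  return errors;
}
```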
More sophisticated approval validation:
```yaml
behavior:
  requiresApproval: true
  approvalExpectations:
    minConfidence: high        # high, medium, low
    approvalMustMention:
      - "file"
      - "create"
    requireExplicitApproval: true
```
Enhanced debugging capabilities:
```yaml
behavior:
  debug:
    logToolDetails: true        # Log all tool I/O
    saveReplayOnFailure: true   # Save session for replay
    exportMarkdown: true        # Export to markdown
```
```yaml
behavior:
  # Must use these tools
  mustUseTools: [read, write]

  # Must use at least one of these sets
  mustUseAnyOf: [[bash], [list]]

  # May use these (optional)
  mayUseTools: [glob, grep]

  # Must NOT use these
  mustNotUseTools: [edit]

  # Tool call count
  minToolCalls: 2
  maxToolCalls: 10
```
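Enforcing these rules against the list of tools a session actually called can be sketched like this; the field names mirror the YAML above, while the function itself is an illustrative stand-in for the real evaluator:

```typescript
// Mirrors the YAML tool rules; mayUseTools needs no check (it is permissive).
interface ToolRules {
  mustUseTools?: string[];
  mustUseAnyOf?: string[][];
  mustNotUseTools?: string[];
  minToolCalls?: number;
  maxToolCalls?: number;
}

// Return one error message per violated rule; empty array means pass.
function checkToolRules(rules: ToolRules, calls: string[]): string[] {
  const errors: string[] = [];
  for (const t of rules.mustUseTools ?? []) {
    if (!calls.includes(t)) errors.push(`missing required tool: ${t}`);
  }
  const anyOf = rules.mustUseAnyOf;
  if (anyOf && !anyOf.some((set) => set.every((t) => calls.includes(t)))) {
    errors.push("none of the mustUseAnyOf sets were satisfied");
  }
  for (const t of rules.mustNotUseTools ?? []) {
    if (calls.includes(t)) errors.push(`forbidden tool used: ${t}`);
  }
  if (rules.minToolCalls !== undefined && calls.length < rules.minToolCalls) {
    errors.push(`too few tool calls: ${calls.length}`);
  }
  if (rules.maxToolCalls !== undefined && calls.length > rules.maxToolCalls) {
    errors.push(`too many tool calls: ${calls.length}`);
  }
  return errors;
}
```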
```bash
# Run all tests
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent

# Run with debug output
npm run eval:sdk -- --debug

# Filter tests by pattern
npm run eval:sdk -- --agent=openagent --filter="context-loading"

# Run in batches of 3 with 10s delays
cd evals/framework/scripts/utils
./run-tests-batch.sh openagent 3 10
```
When running with --debug:
Example Debug Output:
```
────────────────────────────────────────────────────────────
🔧 TOOL: write (completed)
────────────────────────────────────────────────────────────
📥 INPUT:
{
  "filePath": "test.ts",
  "content": "export function add..."
}
📤 OUTPUT:
{
  "success": true,
  "bytesWritten": 67
}
⏱️ Duration: 12ms
────────────────────────────────────────────────────────────
```
```
============================================================
Running test: ctx-code-001 - Code Task with Context Loading
============================================================
Approval strategy: Auto-approve all permission requests
Creating session...
Session created: ses_abc123
Agent: openagent
Model: anthropic/claude-sonnet-4-5
Sending 2 prompts (multi-turn)...
Prompt 1/2: Create a function...
Completed
Prompt 2/2: Yes, proceed...
Completed
Running evaluators...
✅ approval-gate: PASSED
✅ context-loading: PASSED
✅ tool-usage: PASSED
✅ behavior: PASSED
✅ content: PASSED
Test PASSED
Duration: 35142ms
Events captured: 116
```
```bash
cd evals/results
./serve.sh
# Open http://localhost:8000
```
Dashboard Features:
Violations Detected:
1. [error] missing-required-tool: Required tool 'write' was not used
2. [error] missing-required-patterns: File missing: export function
3. [warning] over-delegation: Delegated for < 4 files (acceptable)
Severity Levels:
- error - Test fails
- warning - Test passes but flagged
- info - Informational only

Problem: Agent responds but doesn't execute tools.
Cause: Single-turn test when multi-turn needed (OpenAgent requires approval).
Solution:
```yaml
# Change from:
prompt: "Create a file"

# To:
prompts:
  - text: "Create a file"
  - text: "Yes, proceed."
    delayMs: 2000
```
Problem: Same test ID appears in multiple files.
Cause: Old and new test structures both present.
Solution: Ensure unique test IDs across all test files.
```bash
# Check for duplicates
find evals/agents/*/tests -name "*.yaml" -exec grep "^id:" {} \; | sort | uniq -d
```
Problem: Context loading evaluator fails.
Cause: Context file read before first prompt sent.
Solution: Use expectContext: true on the prompt that needs context:
```yaml
prompts:
  - text: "Create a function"
    expectContext: true
    contextFile: ".opencode/context/core/standards/code.md"
```
Problem: Content expectations not met.
Cause: File content doesn't match expectations.
Debug:
```bash
# Run with debug to see actual content
npm run eval:sdk -- --debug --filter="your-test"

# Check the file that was written
cat evals/test_tmp/your-file.ts
```
Problem: When multiple test files have the same id, the test runner loads both but only one executes (unpredictably).
Solution: Always ensure unique test IDs. Use a naming convention:
```
{category}-{feature}-{number}

# Examples
ctx-code-001
ctx-docs-002
```
Problem: OpenAgent asks for approval before execution. Single-turn tests fail because the agent never receives approval.
Solution: Always use multi-turn prompts for OpenAgent:
```yaml
prompts:
  - text: "Do the task"
  - text: "Yes, proceed."
    delayMs: 2000
```
Problem: Checking IF a tool was called doesn't verify WHAT it did.
Solution: Use content expectations to validate actual output:
```yaml
behavior:
  mustUseTools: [write]        # Checks IF write was called
  contentExpectations:         # Checks WHAT was written
    - filePath: "test.ts"
      mustContain: ["export", "function"]
```
Problem: Without tool I/O logging, debugging failures is difficult.
Solution: Enhanced event logging captures full tool inputs, outputs, and timing for every call.
Problem: Adding new features can break existing tests.
Solution: Make all new fields optional:
```typescript
contentExpectations?: ContentExpectation[];     // Optional
delegationExpectations?: DelegationExpectation; // Optional
```
```
evals/
├── framework/                      # Test framework code
│   ├── src/
│   │   ├── evaluators/             # Validation logic
│   │   │   ├── approval-gate-evaluator.ts
│   │   │   ├── context-loading-evaluator.ts
│   │   │   ├── delegation-evaluator.ts
│   │   │   ├── tool-usage-evaluator.ts
│   │   │   ├── behavior-evaluator.ts
│   │   │   ├── subagent-evaluator.ts       # NEW
│   │   │   └── content-evaluator.ts        # NEW
│   │   ├── sdk/                    # Test execution
│   │   │   ├── test-runner.ts
│   │   │   ├── test-executor.ts
│   │   │   ├── event-stream-handler.ts     # Enhanced
│   │   │   ├── event-logger.ts             # Enhanced
│   │   │   └── test-case-schema.ts         # Updated
│   │   └── types/                  # TypeScript types
│   └── package.json
│
├── agents/                         # Agent-specific tests
│   ├── openagent/
│   │   └── tests/
│   │       ├── 01-critical-rules/
│   │       ├── 02-workflow-stages/
│   │       ├── 03-delegation/
│   │       ├── 04-execution-paths/
│   │       ├── 05-edge-cases/
│   │       └── 06-integration/
│   └── opencoder/
│       └── tests/
│
├── results/                        # Test results
│   ├── latest.json
│   ├── history/
│   └── index.html                  # Dashboard
│
└── test_tmp/                       # Temporary test files
```
Remaining Enhancements:

- Task 03: Enhanced Approval Detection (~1 hour)
- Task 04: Session Replay Utility (~1.5 hours) - `npm run replay <session-id>`
- Task 07: Integration Testing (~1 hour)

Key paths:

- evals/agents/openagent/tests/06-integration/medium/03-full-validation-example.yaml
- evals/framework/src/
- evals/results/index.html
- ~/.local/share/opencode/storage/

When adding new tests:
Framework Version: 0.1.0
Status: Production Ready ✅