This document explains exactly how the evaluation tests work, what they verify, and how to be confident they're testing what we think they're testing.
┌─────────────────────────────────────────────────────────────────┐
│ TEST RUNNER │
├─────────────────────────────────────────────────────────────────┤
│ 1. Clean test_tmp/ directory │
│ 2. Start opencode server (from git root) │
│ 3. For each test: │
│ a. Create session │
│ b. Send prompt(s) with agent selection │
│ c. Capture events via event stream │
│ d. Run evaluators on session data │
│ e. Check behavior expectations │
│ f. Delete session (unless --debug) │
│ 4. Clean test_tmp/ directory │
│ 5. Print results │
└─────────────────────────────────────────────────────────────────┘
When a test specifies agent: opencoder, we verify:
// In test-runner.ts line 340-362
const sessionInfo = await this.client.getSession(sessionId);
const firstMessage = messages[0].info;
const actualAgent = firstMessage.agent;
if (actualAgent !== testCase.agent) {
errors.push(`Agent mismatch: expected '${testCase.agent}', got '${actualAgent}'`);
}
Output you'll see:
Agent: opencoder
Validating agent: opencoder...
✅ Agent verified: opencoder
The BehaviorEvaluator checks which tools were actually called:
// In behavior-evaluator.ts
const toolCalls = this.getToolCalls(timeline);
const toolsUsed = toolCalls.map(tc => tc.data?.tool);
// Check mustUseTools
for (const requiredTool of this.behavior.mustUseTools) {
if (!toolsUsed.includes(requiredTool)) {
violations.push({
type: 'missing-required-tool',
message: `Required tool '${requiredTool}' was not used`
});
}
}
Output you'll see:
============================================================
BEHAVIOR VALIDATION
============================================================
Timeline Events: 10
Tool Calls: 2
Tools Used: glob, read
Tool Call Details:
1. glob: {"pattern":"**/*.ts","path":"/Users/.../src"}
2. read: {"filePath":"/Users/.../src/utils/math.ts"}
We capture real events from the opencode server:
// In event-stream-handler.ts
for await (const event of response.stream) {
const serverEvent = {
type: event.type, // 'tool.call', 'message.created', etc.
properties: event.properties,
timestamp: Date.now(),
};
// Trigger handlers
}
Event types captured:
session.created - Session startedmessage.created / message.updated - Agent messagespart.created / part.updated - Tool calls, text outputpermission.request / permission.response - Approval flowFor agents that require approval (like openagent):
// In test-runner.ts
this.eventHandler.onPermission(async (event) => {
const approved = await approvalStrategy.shouldApprove(event);
approvalsGiven++;
this.log(`Permission ${approved ? 'APPROVED' : 'DENIED'}: ${event.properties.tool}`);
return approved;
});
# Example test file
id: bash-execution-001
name: Direct Tool Execution
agent: opencoder # Which agent to use
model: anthropic/claude-sonnet-4-5 # Which model
prompt: |
List the files in the current directory using ls.
behavior:
mustUseAnyOf: [[bash], [list]] # Either tool is acceptable
minToolCalls: 1 # At least 1 tool call
mustNotContain: # Text that should NOT appear
- "Approval needed"
expectedViolations:
- rule: approval-gate
shouldViolate: true # Opencoder WILL trigger this (expected)
severity: error
approvalStrategy:
type: auto-approve # Auto-approve tool permissions
timeout: 30000
agent: opencoder
prompt: "List files in current directory"
behavior:
mustUseAnyOf: [[bash], [list]]
expectedViolations:
- rule: approval-gate
shouldViolate: true # Expected - no text approval
agent: openagent
prompts:
- text: "List files in current directory"
- text: "Yes, proceed with the plan"
delayMs: 2000
behavior:
mustUseTools: [bash]
expectedViolations:
- rule: approval-gate
shouldViolate: false # Should ask for approval
Tests that create files use evals/test_tmp/:
prompt: |
Create a file at evals/test_tmp/test.txt with content "Hello"
The test runner cleans this directory:
// In run-sdk-tests.ts
function cleanupTestTmp(testTmpDir: string): void {
const preserveFiles = ['README.md', '.gitignore'];
// Remove everything else
}
npm run eval:sdk -- --agent=opencoder --debug
This shows:
Look for the BEHAVIOR VALIDATION section:
Tool Call Details:
1. glob: {"pattern":"**/*.ts","path":"..."}
2. read: {"filePath":"..."}
Look for:
Agent: opencoder
Validating agent: opencoder...
✅ Agent verified: opencoder
Events captured: 23
If this is 0 or very low, something is wrong.
# Sessions are kept in debug mode
ls ~/.local/share/opencode/storage/session/
The SDK might not return the agent field. This is a warning, not an error.
Event stream connection failed. Check server is running.
Agent used a different tool. Consider using mustUseAnyOf for flexibility.
Update test prompts to use evals/test_tmp/ path.
cd evals/framework
# All tests for specific agent
npx tsx src/sdk/run-sdk-tests.ts --agent=opencoder
# Specific test pattern
npx tsx src/sdk/run-sdk-tests.ts --agent=opencoder --pattern="developer/*.yaml"
# Debug mode (keeps sessions, verbose output)
npx tsx src/sdk/run-sdk-tests.ts --agent=opencoder --debug
# Custom model
npx tsx src/sdk/run-sdk-tests.ts --agent=opencoder --model=anthropic/claude-sonnet-4-5
======================================================================
TEST RESULTS
======================================================================
1. ✅ file-read-001 - File Read Operation
Duration: 18397ms # How long the test took
Events: 23 # Events captured from server
Approvals: 0 # Permission requests handled
Context Loading: ⊘ ... # Context file status
Violations: 0 (0 errors) # Rule violations found
======================================================================
SUMMARY: 4/4 tests passed (0 failed)
======================================================================
Before trusting test results, verify: