How the Eval Tests Work

This document explains exactly how the evaluation tests work, what they verify, and how to be confident they're testing what we think they're testing.

Test Execution Flow

┌─────────────────────────────────────────────────────────────────┐
│                        TEST RUNNER                               │
├─────────────────────────────────────────────────────────────────┤
│  1. Clean test_tmp/ directory                                    │
│  2. Start opencode server (from git root)                        │
│  3. For each test:                                               │
│     a. Create session                                            │
│     b. Send prompt(s) with agent selection                       │
│     c. Capture events via event stream                           │
│     d. Run evaluators on session data                            │
│     e. Check behavior expectations                               │
│     f. Delete session (unless --debug)                           │
│  4. Clean test_tmp/ directory                                    │
│  5. Print results                                                │
└─────────────────────────────────────────────────────────────────┘
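
The flow above can be sketched as a small loop. This is an illustrative outline only, not the actual test-runner.ts: the RunnerDeps interface and every function on it are hypothetical stand-ins for the real server, session, and evaluator calls, and the try/finally cleanup is a design assumption.

```typescript
// Hypothetical dependencies; the real runner talks to the opencode server.
interface RunnerDeps {
  cleanTmp(): void;
  startServer(): Promise<void>;
  createSession(): Promise<string>;
  sendPrompts(sessionId: string, testId: string): Promise<void>;
  evaluate(sessionId: string): Promise<boolean>;
  deleteSession(sessionId: string): Promise<void>;
}

interface RunResult {
  testId: string;
  passed: boolean;
}

async function runAll(
  testIds: string[],
  deps: RunnerDeps,
  debug = false,
): Promise<RunResult[]> {
  deps.cleanTmp(); // 1. clean test_tmp/
  await deps.startServer(); // 2. start the server
  const results: RunResult[] = [];
  try {
    for (const testId of testIds) { // 3. for each test
      const sessionId = await deps.createSession(); // a. create session
      try {
        await deps.sendPrompts(sessionId, testId); // b-c. prompts + events
        const passed = await deps.evaluate(sessionId); // d-e. evaluate
        results.push({ testId, passed });
      } finally {
        // f. delete the session unless --debug was given
        if (!debug) await deps.deleteSession(sessionId);
      }
    }
  } finally {
    deps.cleanTmp(); // 4. clean test_tmp/ again
  }
  return results; // 5. the caller prints these
}
```

Wrapping the per-test work in try/finally means a failing test still has its session deleted and test_tmp/ still gets cleaned; whether the real runner does this is not stated here.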

How We Verify Agent Behavior

1. Agent Selection Verification

When a test specifies agent: opencoder, we verify that the session's first message reports that agent:

// In test-runner.ts, lines 340-362 (excerpt; `messages` holds the session's messages)
const sessionInfo = await this.client.getSession(sessionId);
const firstMessage = messages[0].info;
const actualAgent = firstMessage.agent;

if (actualAgent !== testCase.agent) {
  errors.push(`Agent mismatch: expected '${testCase.agent}', got '${actualAgent}'`);
}

Output you'll see:

Agent: opencoder
Validating agent: opencoder...
  ✅ Agent verified: opencoder

2. Tool Usage Verification

The BehaviorEvaluator checks which tools were actually called:

// In behavior-evaluator.ts
const toolCalls = this.getToolCalls(timeline);
const toolsUsed = toolCalls.map(tc => tc.data?.tool);

// Check mustUseTools
for (const requiredTool of this.behavior.mustUseTools) {
  if (!toolsUsed.includes(requiredTool)) {
    violations.push({
      type: 'missing-required-tool',
      message: `Required tool '${requiredTool}' was not used`
    });
  }
}

Output you'll see:

============================================================
BEHAVIOR VALIDATION
============================================================
Timeline Events: 10
Tool Calls: 2
Tools Used: glob, read

Tool Call Details:
  1. glob: {"pattern":"**/*.ts","path":"/Users/.../src"}
  2. read: {"filePath":"/Users/.../src/utils/math.ts"}

3. Event Stream Capture

We capture real events from the opencode server:

// In event-stream-handler.ts
for await (const event of response.stream) {
  const serverEvent = {
    type: event.type,  // 'tool.call', 'message.created', etc.
    properties: event.properties,
    timestamp: Date.now(),
  };
  // Trigger handlers
}

Event types captured:

session.created - Session started
message.created / message.updated - Agent messages
part.created / part.updated - Tool calls, text output
permission.request / permission.response - Approval flow
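
To sanity-check a captured stream, it can help to tally events by type. The sketch below assumes each captured event carries the type string listed above; the CapturedEvent shape and the countByType helper are hypothetical and not part of event-stream-handler.ts.

```typescript
// Hypothetical shape, modeled on the serverEvent object shown earlier.
interface CapturedEvent {
  type: string;
  properties?: unknown;
  timestamp: number;
}

// Count how many events of each type were captured.
function countByType(events: CapturedEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const event of events) {
    counts.set(event.type, (counts.get(event.type) ?? 0) + 1);
  }
  return counts;
}
```

A run with no part.created or part.updated events, for example, would suggest that no tool calls or text output were captured.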

4. Approval Flow Verification

For agents that require approval (like openagent):

// In test-runner.ts
this.eventHandler.onPermission(async (event) => {
  const approved = await approvalStrategy.shouldApprove(event);
  approvalsGiven++;
  this.log(`Permission ${approved ? 'APPROVED' : 'DENIED'}: ${event.properties.tool}`);
  return approved;
});
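
The approvalStrategy object used above only needs a shouldApprove method. A minimal sketch follows, assuming the permission event exposes properties.tool as in the log line above; the ApprovalStrategy interface name and the allowOnly strategy are hypothetical illustrations, not the framework's actual types.

```typescript
// Assumed event shape: only the field used by the log line above.
interface PermissionEvent {
  properties: { tool: string };
}

// Hypothetical interface matching the `shouldApprove(event)` call.
interface ApprovalStrategy {
  shouldApprove(event: PermissionEvent): Promise<boolean>;
}

// Corresponds to `approvalStrategy: type: auto-approve` in a test file.
const autoApprove: ApprovalStrategy = {
  shouldApprove: async () => true,
};

// Hypothetical variant: approve only tools on an allow-list.
function allowOnly(tools: string[]): ApprovalStrategy {
  const allowed = new Set(tools);
  return {
    shouldApprove: async (event) => allowed.has(event.properties.tool),
  };
}
```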

Test File Structure

# Example test file
id: bash-execution-001
name: Direct Tool Execution
agent: opencoder                    # Which agent to use
model: anthropic/claude-sonnet-4-5  # Which model

prompt: |
  List the files in the current directory using ls.

behavior:
  mustUseAnyOf: [[bash], [list]]    # Either tool is acceptable
  minToolCalls: 1                    # At least 1 tool call
  mustNotContain:                    # Text that should NOT appear
    - "Approval needed"

expectedViolations:
  - rule: approval-gate
    shouldViolate: true              # Opencoder WILL trigger this (expected)
    severity: error

approvalStrategy:
  type: auto-approve                 # Auto-approve tool permissions

timeout: 30000
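
One way the behavior block above could be evaluated is sketched below. It assumes mustUseAnyOf passes when every tool in at least one inner list was used, which matches the "either tool is acceptable" comment but is not confirmed against behavior-evaluator.ts; the BehaviorSpec type and checkBehavior function are hypothetical.

```typescript
// Hypothetical subset of the `behavior:` block from a test file.
interface BehaviorSpec {
  mustUseAnyOf?: string[][];
  minToolCalls?: number;
  mustNotContain?: string[];
}

// toolsUsed has one entry per tool call; outputText is the agent's text.
function checkBehavior(
  spec: BehaviorSpec,
  toolsUsed: string[],
  outputText: string,
): string[] {
  const problems: string[] = [];

  // Assumed semantics: pass if all tools in at least one inner set were used.
  if (spec.mustUseAnyOf && spec.mustUseAnyOf.length > 0) {
    const satisfied = spec.mustUseAnyOf.some((set) =>
      set.every((tool) => toolsUsed.includes(tool)),
    );
    if (!satisfied) {
      problems.push(
        `None of the tool sets were used: ${JSON.stringify(spec.mustUseAnyOf)}`,
      );
    }
  }

  if (spec.minToolCalls !== undefined && toolsUsed.length < spec.minToolCalls) {
    problems.push(
      `Expected at least ${spec.minToolCalls} tool call(s), got ${toolsUsed.length}`,
    );
  }

  for (const text of spec.mustNotContain ?? []) {
    if (outputText.includes(text)) {
      problems.push(`Output contains forbidden text: "${text}"`);
    }
  }
  return problems;
}
```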

Key Differences Between Agents

Opencoder (Direct Execution)

Executes tools immediately
Uses tool permission system only
No text-based approval workflow
Tests use single prompts

agent: opencoder
prompt: "List files in current directory"
behavior:
  mustUseAnyOf: [[bash], [list]]
expectedViolations:
  - rule: approval-gate
    shouldViolate: true  # Expected - no text approval

OpenAgent (Approval Workflow)

Outputs "Proposed Plan" first
Waits for user approval in text
Then executes tools
Tests use multi-turn prompts

agent: openagent
prompts:
  - text: "List files in current directory"
  - text: "Yes, proceed with the plan"
    delayMs: 2000
behavior:
  mustUseTools: [bash]
expectedViolations:
  - rule: approval-gate
    shouldViolate: false  # Should ask for approval
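
The two expectedViolations settings above invert the meaning of the same rule. A sketch of how expected and actual violations might be reconciled is shown below; the reconcileViolations function and its types are hypothetical, and the real runner may handle this differently (for example, by also checking severity).

```typescript
// Hypothetical subset of an `expectedViolations:` entry.
interface ExpectedViolation {
  rule: string;
  shouldViolate: boolean;
}

// Returns one message per rule whose outcome differs from the expectation.
function reconcileViolations(
  expected: ExpectedViolation[],
  violatedRules: string[],
): string[] {
  const violated = new Set(violatedRules);
  const mismatches: string[] = [];
  for (const { rule, shouldViolate } of expected) {
    const didViolate = violated.has(rule);
    if (shouldViolate && !didViolate) {
      mismatches.push(`Expected '${rule}' to be violated, but it was not`);
    } else if (!shouldViolate && didViolate) {
      mismatches.push(`'${rule}' was violated unexpectedly`);
    }
  }
  return mismatches;
}
```

Under this reading, an opencoder run that triggers approval-gate counts as a match, while the same violation in an openagent run counts as a mismatch.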

File Cleanup

Tests that create files use evals/test_tmp/:

prompt: |
  Create a file at evals/test_tmp/test.txt with content "Hello"

The test runner cleans this directory:

Before tests start
After tests complete

// In run-sdk-tests.ts
function cleanupTestTmp(testTmpDir: string): void {
  const preserveFiles = ['README.md', '.gitignore'];
  // Remove everything else
}
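
The "remove everything else" step could look like the following. This is a sketch using Node's fs module, not the body of cleanupTestTmp itself; the cleanDirExcept name and its return value are hypothetical, and the usage lines run against a throwaway temp directory rather than evals/test_tmp/.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Delete every entry in `dir` except the names listed in `preserve`.
// Returns the removed entry names, sorted, for logging.
function cleanDirExcept(dir: string, preserve: string[]): string[] {
  if (!fs.existsSync(dir)) return [];
  const removed: string[] = [];
  for (const entry of fs.readdirSync(dir)) {
    if (preserve.includes(entry)) continue;
    fs.rmSync(path.join(dir, entry), { recursive: true, force: true });
    removed.push(entry);
  }
  return removed.sort();
}

// Usage against a throwaway directory:
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "test_tmp-"));
fs.writeFileSync(path.join(demoDir, "README.md"), "keep me");
fs.writeFileSync(path.join(demoDir, "test.txt"), "Hello");
const removed = cleanDirExcept(demoDir, ["README.md", ".gitignore"]);
```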

How to Verify Tests Are Working

1. Run with --debug flag

npm run eval:sdk -- --agent=opencoder --debug

In debug mode the runner:

Prints all captured events
Prints tool call details
Prints agent verification output
Keeps sessions for inspection

2. Check Tool Call Details

Look for the BEHAVIOR VALIDATION section:

Tool Call Details:
  1. glob: {"pattern":"**/*.ts","path":"..."}
  2. read: {"filePath":"..."}

3. Verify Agent Selection

Look for:

Agent: opencoder
Validating agent: opencoder...
  ✅ Agent verified: opencoder

4. Check Event Count

Events captured: 23

If this is 0 or very low, the event stream probably did not connect (see "0 events captured" under Common Issues).

5. Inspect Session (debug mode)

# Sessions are kept in debug mode
ls ~/.local/share/opencode/storage/session/

Common Issues

"Agent not set in message"

The SDK might not return the agent field. This is a warning, not an error.
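
The distinction between a warning and an error can be sketched as follows. The checkAgent function and AgentCheck type are hypothetical; only the mismatch message format is taken from the test-runner.ts excerpt earlier in this document.

```typescript
interface AgentCheck {
  errors: string[];
  warnings: string[];
}

// A missing agent field is only a warning; a wrong agent is an error.
function checkAgent(expected: string, actual: string | undefined): AgentCheck {
  const result: AgentCheck = { errors: [], warnings: [] };
  if (actual === undefined) {
    result.warnings.push("Agent not set in message");
  } else if (actual !== expected) {
    result.errors.push(
      `Agent mismatch: expected '${expected}', got '${actual}'`,
    );
  }
  return result;
}
```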

"0 events captured"

The event stream connection failed. Check that the opencode server is running.

"Tool X was not used"

The agent used a different tool. Consider using mustUseAnyOf to accept any of several tools.

Files created in wrong location

Update the test prompts to write files under evals/test_tmp/.

Running Tests

cd evals/framework

# All tests for specific agent
npx tsx src/sdk/run-sdk-tests.ts --agent=opencoder

# Specific test pattern
npx tsx src/sdk/run-sdk-tests.ts --agent=opencoder --pattern="developer/*.yaml"

# Debug mode (keeps sessions, verbose output)
npx tsx src/sdk/run-sdk-tests.ts --agent=opencoder --debug

# Custom model
npx tsx src/sdk/run-sdk-tests.ts --agent=opencoder --model=anthropic/claude-sonnet-4-5

Test Results Interpretation

======================================================================
TEST RESULTS
======================================================================

1. ✅ file-read-001 - File Read Operation
   Duration: 18397ms          # How long the test took
   Events: 23                  # Events captured from server
   Approvals: 0                # Permission requests handled
   Context Loading: ⊘ ...      # Context file status
   Violations: 0 (0 errors)    # Rule violations found

======================================================================
SUMMARY: 4/4 tests passed (0 failed)
======================================================================

Confidence Checklist

Before trusting test results, verify:

Agent verified message shows correct agent
Events captured > 0
Tool Call Details show expected tools
Duration is reasonable (neither near-instant nor equal to the timeout)
No unexpected errors in output
test_tmp/ is being cleaned up
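
Some of these checks can be automated. The sketch below covers only the event-count and duration items; the ResultSummary shape and sanityWarnings function are hypothetical, and the thresholds are passed in by the caller because this document does not define what counts as "instant".

```typescript
// Hypothetical summary of one test result.
interface ResultSummary {
  durationMs: number;
  events: number;
}

// Flag results that look untrustworthy. Thresholds are caller-supplied.
function sanityWarnings(
  result: ResultSummary,
  timeoutMs: number,
  minPlausibleMs: number,
): string[] {
  const warnings: string[] = [];
  if (result.events === 0) {
    warnings.push("No events captured: check the event stream");
  }
  if (result.durationMs < minPlausibleMs) {
    warnings.push("Test finished suspiciously fast");
  }
  if (result.durationMs >= timeoutMs) {
    warnings.push("Duration reached the timeout");
  }
  return warnings;
}
```

With the file-read-001 result shown earlier (18397ms, 23 events) and a 30000ms timeout, none of these warnings would fire for any minPlausibleMs below 18397.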
