
OpenCode Agent Evaluation Framework - Complete Guide

Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated validation.

Last Updated: November 27, 2025


📋 Table of Contents

  1. Quick Start
  2. What This Framework Does
  3. How Tests Work
  4. Writing Tests
  5. Validation Features
  6. Running Tests
  7. Understanding Results
  8. Troubleshooting
  9. Key Learnings

🚀 Quick Start

# Install and build
cd evals/framework
npm install
npm run build

# Run all tests (uses free model by default)
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent
npm run eval:sdk -- --agent=opencoder

# Debug mode (verbose output, keeps sessions)
npm run eval:sdk -- --debug

# View results dashboard
cd ../results && ./serve.sh

🎯 What This Framework Does

Purpose

Validates that OpenCode agents follow their defined rules and behaviors through real execution with actual sessions, not mocks.

Key Capabilities

✅ Real Execution - Creates actual OpenCode sessions, sends prompts, captures responses
✅ Event Streaming - Monitors all events (tool calls, messages, permissions) in real-time
✅ Automated Validation - Runs evaluators to check compliance with agent rules
✅ Content Validation - Verifies file contents, not just that tools were called
✅ Subagent Verification - Validates delegation and subagent behavior
✅ Enhanced Logging - Captures full tool inputs/outputs with timing
✅ Multi-turn Support - Handles approval workflows and complex conversations

What Gets Tested

Validation Type    What It Checks
Approval Gate      Agent asks for approval before executing risky operations
Context Loading    Agent loads required context files before execution
Delegation         Agent delegates complex tasks (4+ files) to task-manager
Tool Usage         Agent uses correct tools for the task
Behavior           Agent follows expected behavior patterns
Subagent           Subagents execute correctly when delegated
Content            Files contain expected content and patterns

🔧 How Tests Work

Test Execution Flow

┌──────────────────────────────────────────────────────────────────┐
│                        TEST RUNNER                               │
├──────────────────────────────────────────────────────────────────┤
│  1. Clean test_tmp/ directory                                    │
│  2. Start opencode server (from git root)                        │
│  3. For each test:                                               │
│     a. Create session with specified agent                       │
│     b. Send prompt(s) (single or multi-turn)                     │
│     c. Capture events via event stream                           │
│     d. Extract tool inputs/outputs (enhanced logging)            │
│     e. Run evaluators on session data                            │
│     f. Validate behavior expectations                            │
│     g. Check content expectations                                │
│     h. Verify subagent behavior (if delegated)                   │
│     i. Delete session (unless --debug)                           │
│  4. Clean test_tmp/ directory                                    │
│  5. Generate results (JSON + dashboard)                          │
└──────────────────────────────────────────────────────────────────┘

Where Data Lives

During Test Execution:

~/.local/share/opencode/storage/
├── session/          # Session metadata (by project hash)
├── message/          # Messages per session (ses_xxx/)
├── part/             # Tool calls, text parts, etc.
└── session_diff/     # Session changes

Test Results:

evals/results/
├── latest.json           # Most recent run
├── history/2025-11/      # Historical runs
└── index.html            # Interactive dashboard

Event Stream Monitoring

The framework listens to the OpenCode event stream and captures:

// Events captured in real-time
- session.created/updated
- message.created/updated
- part.created/updated (includes tool calls)
- permission.request/response

// Enhanced with tool details (NEW)
- Tool name, input, output
- Start time, end time, duration
- Success/error status
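As a rough illustration, per-tool timing can be reconstructed by pairing start and end events by call ID. The event and record shapes below are hypothetical, not the actual OpenCode SDK types:

```typescript
// Hypothetical event shape: one event at tool start, one at tool end.
type ToolEvent = {
  callId: string;
  tool: string;
  phase: "start" | "end";
  timeMs: number;
  input?: unknown;
  output?: unknown;
  error?: string;
};

type ToolRecord = {
  tool: string;
  input?: unknown;
  output?: unknown;
  startMs: number;
  endMs?: number;
  durationMs?: number;
  ok: boolean;
};

class ToolLogger {
  private records = new Map<string, ToolRecord>();

  handle(ev: ToolEvent): void {
    if (ev.phase === "start") {
      // Open a record keyed by call ID so concurrent tools don't collide.
      this.records.set(ev.callId, {
        tool: ev.tool,
        input: ev.input,
        startMs: ev.timeMs,
        ok: false,
      });
    } else {
      const rec = this.records.get(ev.callId);
      if (!rec) return; // end event without a matching start: ignore
      rec.output = ev.output;
      rec.endMs = ev.timeMs;
      rec.durationMs = ev.timeMs - rec.startMs;
      rec.ok = !ev.error;
    }
  }

  all(): ToolRecord[] {
    return Array.from(this.records.values());
  }
}
```

Pairing by call ID (rather than assuming strict start/end ordering) is what makes the duration and success fields reliable when several tools run in one turn.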

✍️ Writing Tests

Basic Test Structure

id: my-test-001
name: My Test Name
description: What this test validates

category: developer  # developer, business, creative, edge-case
agent: openagent     # openagent, opencoder
model: opencode/big-pickle  # Optional, defaults to free tier

# Single prompt (simple tests)
prompt: |
  Create a function called add in math.ts

# OR Multi-turn prompts (for approval workflows)
prompts:
  - text: |
      Create a function called add in math.ts
    expectContext: true
    contextFile: ".opencode/context/core/standards/code.md"
  
  - text: "Yes, proceed with the plan."
    delayMs: 2000

# Expected behavior
behavior:
  mustUseTools: [read, write]
  requiresApproval: true
  requiresContext: true
  minToolCalls: 2

# Expected violations (should NOT violate these)
expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error

# Approval strategy
approvalStrategy:
  type: auto-approve

timeout: 120000
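For orientation, the fields above map onto a schema along these lines. This is an illustrative sketch (the canonical definition lives in test-case-schema.ts); the key constraint is that a test supplies exactly one of prompt or prompts:

```typescript
// Sketch of the test-case shape, using the field names from the YAML above.
interface PromptStep {
  text: string;
  delayMs?: number;
  expectContext?: boolean;
  contextFile?: string;
}

interface TestCase {
  id: string;
  name: string;
  category: "developer" | "business" | "creative" | "edge-case";
  agent: "openagent" | "opencoder";
  model?: string;      // optional, defaults to free tier
  prompt?: string;     // single-turn tests
  prompts?: PromptStep[]; // multi-turn tests
  timeout?: number;
}

// Returns a list of schema problems; empty means the test case is valid.
function validateTestCase(tc: TestCase): string[] {
  const errors: string[] = [];
  if (!tc.id) errors.push("missing id");
  // Both set, or both missing, is an error.
  if (!!tc.prompt === !!tc.prompts) {
    errors.push("exactly one of prompt/prompts is required");
  }
  return errors;
}
```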

Multi-Turn Tests (Critical for OpenAgent)

Why Multi-Turn? OpenAgent requires approval before execution. Single-turn tests will fail because the agent asks for approval but never receives it.

# โŒ WRONG - Single turn (agent asks approval, never gets it)
prompt: "Create a file at test.txt"

# ✅ CORRECT - Multi-turn (agent asks, user approves)
prompts:
  - text: "Create a file at test.txt"
  - text: "Yes, proceed."
    delayMs: 2000

🎨 Validation Features

1. Content Validation (NEW)

Validates the actual content of files written/edited:

behavior:
  mustUseTools: [write]
  
  contentExpectations:
    - filePath: "src/math.ts"
      mustContain:
        - "export function add"
        - ": number"
      mustNotContain:
        - "console.log"
        - "any"
      minLength: 100
      maxLength: 500

Validation Types:

  • mustContain - Required patterns (40% weight)
  • mustNotContain - Forbidden patterns (30% weight)
  • mustMatch - Regex pattern (20% weight)
  • minLength - Minimum content length (5% weight)
  • maxLength - Maximum content length (5% weight)
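The weighting can be pictured as a simple additive score. This is a sketch of one plausible formula using integer percentages; the real content-evaluator.ts may award partial credit differently:

```typescript
type ContentExpectation = {
  mustContain?: string[];
  mustNotContain?: string[];
  mustMatch?: string; // regex source
  minLength?: number;
  maxLength?: number;
};

// Each satisfied check contributes its full weight (40/30/20/5/5 = 100).
// Unspecified checks pass trivially.
function contentScore(content: string, exp: ContentExpectation): number {
  let score = 0;
  if ((exp.mustContain ?? []).every((p) => content.includes(p))) score += 40;
  if ((exp.mustNotContain ?? []).every((p) => !content.includes(p))) score += 30;
  if (exp.mustMatch === undefined || new RegExp(exp.mustMatch).test(content)) score += 20;
  if (exp.minLength === undefined || content.length >= exp.minLength) score += 5;
  if (exp.maxLength === undefined || content.length <= exp.maxLength) score += 5;
  return score;
}
```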

2. Subagent Verification (NEW)

Validates delegation and subagent behavior:

behavior:
  mustUseTools: [task]
  shouldDelegate: true
  
  delegationExpectations:
    subagentType: "CoderAgent"
    subagentMustUseTools: [write, read]
    subagentMinToolCalls: 2
    subagentMustComplete: true

Checks:

  • Correct subagent type invoked (30% weight)
  • Subagent used required tools (40% weight)
  • Minimum tool calls met (20% weight)
  • Subagent completed successfully (10% weight)
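Applying those weights, the delegation check can be sketched as follows. The SubagentRun record shape is an assumption for illustration, not the framework's actual type:

```typescript
type SubagentRun = {
  type: string;         // which subagent actually ran
  toolsUsed: string[];
  toolCallCount: number;
  completed: boolean;
};

type DelegationExpectation = {
  subagentType: string;
  subagentMustUseTools: string[];
  subagentMinToolCalls: number;
  subagentMustComplete: boolean;
};

// Weights mirror the list above: type 30, tools 40, call count 20, completion 10.
function delegationScore(run: SubagentRun, exp: DelegationExpectation): number {
  let score = 0;
  if (run.type === exp.subagentType) score += 30;
  if (exp.subagentMustUseTools.every((t) => run.toolsUsed.includes(t))) score += 40;
  if (run.toolCallCount >= exp.subagentMinToolCalls) score += 20;
  if (!exp.subagentMustComplete || run.completed) score += 10;
  return score;
}
```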

3. Enhanced Approval Detection (NEW)

More sophisticated approval validation:

behavior:
  requiresApproval: true
  
  approvalExpectations:
    minConfidence: high  # high, medium, low
    approvalMustMention:
      - "file"
      - "create"
    requireExplicitApproval: true

4. Debug Options (NEW)

Enhanced debugging capabilities:

behavior:
  debug:
    logToolDetails: true        # Log all tool I/O
    saveReplayOnFailure: true   # Save session for replay
    exportMarkdown: true        # Export to markdown

5. Tool Usage Validation

behavior:
  # Must use these tools
  mustUseTools: [read, write]
  
  # Must use at least one of these sets
  mustUseAnyOf: [[bash], [list]]
  
  # May use these (optional)
  mayUseTools: [glob, grep]
  
  # Must NOT use these
  mustNotUseTools: [edit]
  
  # Tool call count
  minToolCalls: 2
  maxToolCalls: 10
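Collecting those rules, a minimal checker could look like the sketch below (the real logic lives in tool-usage-evaluator.ts; the violation strings are illustrative):

```typescript
type ToolRules = {
  mustUseTools?: string[];
  mustUseAnyOf?: string[][];
  mustNotUseTools?: string[];
  minToolCalls?: number;
  maxToolCalls?: number;
};

// Returns violation descriptions; an empty array means all rules passed.
function toolViolations(calls: string[], rules: ToolRules): string[] {
  const used = new Set(calls);
  const out: string[] = [];
  for (const t of rules.mustUseTools ?? []) {
    if (!used.has(t)) out.push(`missing-required-tool: ${t}`);
  }
  // At least one complete set from mustUseAnyOf must be covered.
  if (rules.mustUseAnyOf && !rules.mustUseAnyOf.some((set) => set.every((t) => used.has(t)))) {
    out.push("missing-any-of-tool-set");
  }
  for (const t of rules.mustNotUseTools ?? []) {
    if (used.has(t)) out.push(`forbidden-tool-used: ${t}`);
  }
  if (rules.minToolCalls !== undefined && calls.length < rules.minToolCalls) {
    out.push("too-few-tool-calls");
  }
  if (rules.maxToolCalls !== undefined && calls.length > rules.maxToolCalls) {
    out.push("too-many-tool-calls");
  }
  return out;
}
```

Note that mayUseTools needs no check: optional tools can neither pass nor fail a run.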

๐Ÿƒ Running Tests

Basic Commands

# Run all tests
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent

# Run with debug output
npm run eval:sdk -- --debug

# Filter tests by pattern
npm run eval:sdk -- --agent=openagent --filter="context-loading"

Batch Execution (Avoid Rate Limits)

# Run in batches of 3 with 10s delays
cd evals/framework/scripts/utils
./run-tests-batch.sh openagent 3 10

Debug Mode Features

When running with --debug:

  • ✅ Full event logging with tool I/O
  • ✅ Sessions kept for inspection
  • ✅ Detailed timeline output
  • ✅ Tool duration tracking

Example Debug Output:

──────────────────────────────────────────────────────────────
🔧 TOOL: write (completed)
──────────────────────────────────────────────────────────────

📥 INPUT:
{
  "filePath": "test.ts",
  "content": "export function add..."
}

📤 OUTPUT:
{
  "success": true,
  "bytesWritten": 67
}

⏱️  Duration: 12ms
──────────────────────────────────────────────────────────────

📊 Understanding Results

Test Output

============================================================
Running test: ctx-code-001 - Code Task with Context Loading
============================================================
Approval strategy: Auto-approve all permission requests
Creating session...
Session created: ses_abc123
Agent: openagent
Model: anthropic/claude-sonnet-4-5

Sending 2 prompts (multi-turn)...
Prompt 1/2: Create a function...
  Completed
Prompt 2/2: Yes, proceed...
  Completed

Running evaluators...
  ✅ approval-gate: PASSED
  ✅ context-loading: PASSED
  ✅ tool-usage: PASSED
  ✅ behavior: PASSED
  ✅ content: PASSED

Test PASSED
Duration: 35142ms
Events captured: 116

Results Dashboard

cd evals/results
./serve.sh
# Open http://localhost:8000

Dashboard Features:

  • Filter by agent, category, status
  • View violation details
  • See test trends over time
  • Export results

Understanding Violations

Violations Detected:
  1. [error] missing-required-tool: Required tool 'write' was not used
  2. [error] missing-required-patterns: File missing: export function
  3. [warning] over-delegation: Delegated for < 4 files (acceptable)

Severity Levels:

  • error - Test fails
  • warning - Test passes but flagged
  • info - Informational only
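In other words, only error-severity violations change the outcome; warnings and info are reported but don't fail the run. A minimal sketch of that mapping:

```typescript
type Violation = {
  rule: string;
  severity: "error" | "warning" | "info";
};

// A test passes as long as no violation has error severity.
function testPassed(violations: Violation[]): boolean {
  return violations.every((v) => v.severity !== "error");
}
```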

๐Ÿ” Troubleshooting

Common Issues

1. Tests Failing with "No tool calls"

Problem: Agent responds but doesn't execute tools.

Cause: Single-turn test when multi-turn needed (OpenAgent requires approval).

Solution:

# Change from:
prompt: "Create a file"

# To:
prompts:
  - text: "Create a file"
  - text: "Yes, proceed."
    delayMs: 2000

2. Duplicate Test IDs

Problem: Same test ID appears in multiple files.

Cause: Old and new test structures both present.

Solution: Ensure unique test IDs across all test files.

# Check for duplicates
find evals/agents/*/tests -name "*.yaml" -exec grep "^id:" {} \; | sort | uniq -d

3. Context Not Loading

Problem: Context loading evaluator fails.

Cause: Context file read before first prompt sent.

Solution: Use expectContext: true on the prompt that needs context:

prompts:
  - text: "Create a function"
    expectContext: true
    contextFile: ".opencode/context/core/standards/code.md"

4. Content Validation Fails

Problem: Content expectations not met.

Cause: File content doesn't match expectations.

Debug:

# Run with debug to see actual content
npm run eval:sdk -- --debug --filter="your-test"

# Check the file that was written
cat evals/test_tmp/your-file.ts

🎓 Key Learnings

1. Duplicate Test IDs Are Dangerous

Problem: When multiple test files have the same id, the test runner loads both but only one executes (unpredictably).

Solution: Always ensure unique test IDs. Use a naming convention:

{category}-{feature}-{number}
ctx-code-001
ctx-docs-002
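A simple regex can enforce that convention in CI. This sketch assumes single-word, lowercase segments and a three-digit number; multi-word categories like edge-case would need a looser pattern:

```typescript
// {category}-{feature}-{number}, e.g. ctx-code-001
const TEST_ID = /^[a-z]+-[a-z]+-\d{3}$/;

function isValidTestId(id: string): boolean {
  return TEST_ID.test(id);
}
```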

2. Multi-Turn is Essential for OpenAgent

Problem: OpenAgent asks for approval before execution. Single-turn tests fail because the agent never receives approval.

Solution: Always use multi-turn prompts for OpenAgent:

prompts:
  - text: "Do the task"
  - text: "Yes, proceed."
    delayMs: 2000

3. Content Validation > Tool Usage

Problem: Checking IF a tool was called doesn't verify WHAT it did.

Solution: Use content expectations to validate actual output:

behavior:
  mustUseTools: [write]  # Checks IF write was called
  
  contentExpectations:   # Checks WHAT was written
    - filePath: "test.ts"
      mustContain: ["export", "function"]

4. Enhanced Logging is Foundational

Problem: Without tool I/O logging, debugging failures is difficult.

Solution: Enhanced event logging captures everything:

  • Tool inputs and outputs
  • Duration per tool
  • Error details
  • Enables content validation and subagent verification

5. Backward Compatibility Matters

Problem: Adding new features can break existing tests.

Solution: Make all new fields optional:

contentExpectations?: ContentExpectation[];  // Optional
delegationExpectations?: DelegationExpectation;  // Optional

๐Ÿ“ Directory Structure

evals/
├── framework/                    # Test framework code
│   ├── src/
│   │   ├── evaluators/          # Validation logic
│   │   │   ├── approval-gate-evaluator.ts
│   │   │   ├── context-loading-evaluator.ts
│   │   │   ├── delegation-evaluator.ts
│   │   │   ├── tool-usage-evaluator.ts
│   │   │   ├── behavior-evaluator.ts
│   │   │   ├── subagent-evaluator.ts      # NEW
│   │   │   └── content-evaluator.ts       # NEW
│   │   ├── sdk/                 # Test execution
│   │   │   ├── test-runner.ts
│   │   │   ├── test-executor.ts
│   │   │   ├── event-stream-handler.ts    # Enhanced
│   │   │   ├── event-logger.ts            # Enhanced
│   │   │   └── test-case-schema.ts        # Updated
│   │   └── types/               # TypeScript types
│   └── package.json
│
├── agents/                       # Agent-specific tests
│   ├── openagent/
│   │   └── tests/
│   │       ├── 01-critical-rules/
│   │       ├── 02-workflow-stages/
│   │       ├── 03-delegation/
│   │       ├── 04-execution-paths/
│   │       ├── 05-edge-cases/
│   │       └── 06-integration/
│   └── opencoder/
│       └── tests/
│
├── results/                      # Test results
│   ├── latest.json
│   ├── history/
│   └── index.html               # Dashboard
│
└── test_tmp/                     # Temporary test files

🚀 Next Steps

For Test Writers

  1. Start Simple - Write basic tests first, add complexity later
  2. Use Multi-Turn - Always for OpenAgent approval workflows
  3. Validate Content - Don't just check tools, check outputs
  4. Test Incrementally - Run tests frequently during development

For Framework Developers

Remaining Enhancements:

  1. Task 03: Enhanced Approval Detection (~1 hour)

    • High/medium/low confidence levels
    • Capture actual approval text
    • Reduce false positives/negatives
  2. Task 04: Session Replay Utility (~1.5 hours)

    • Replay failed sessions for debugging
    • Console/markdown/HTML output
    • CLI: npm run replay <session-id>
  3. Task 07: Integration Testing (~1 hour)

    • End-to-end integration tests
    • Verify all features work together
    • Performance benchmarks

For Production Use

  1. Run Full Test Suite - Verify all tests pass
  2. Update Agent Docs - Document new validation features
  3. Create Migration Guide - Help users update existing tests
  4. Monitor Pass Rates - Track test health over time

📚 Additional Resources

  • Test Examples: evals/agents/openagent/tests/06-integration/medium/03-full-validation-example.yaml
  • Framework Code: evals/framework/src/
  • Results Dashboard: evals/results/index.html
  • Session Storage: ~/.local/share/opencode/storage/

๐Ÿค Contributing

When adding new tests:

  1. ✅ Use unique test IDs
  2. ✅ Use multi-turn for approval workflows
  3. ✅ Add content expectations when validating outputs
  4. ✅ Include clear descriptions
  5. ✅ Test locally before committing
  6. ✅ Update this guide if adding new features

Framework Version: 0.1.0
Status: Production Ready ✅