Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated validation.
Last Updated: November 27, 2025
```bash
# Install and build
cd evals/framework
npm install
npm run build

# Run all tests (uses free model by default)
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent
npm run eval:sdk -- --agent=opencoder

# Debug mode (verbose output, keeps sessions)
npm run eval:sdk -- --debug

# View results dashboard
cd ../results && ./serve.sh
```
Validates that OpenCode agents follow their defined rules and behaviors through real execution with actual sessions, not mocks.
- ✅ Real Execution - Creates actual OpenCode sessions, sends prompts, captures responses
- ✅ Event Streaming - Monitors all events (tool calls, messages, permissions) in real-time
- ✅ Automated Validation - Runs evaluators to check compliance with agent rules
- ✅ Content Validation - Verifies file contents, not just that tools were called
- ✅ Subagent Verification - Validates delegation and subagent behavior
- ✅ Enhanced Logging - Captures full tool inputs/outputs with timing
- ✅ Multi-turn Support - Handles approval workflows and complex conversations
| Validation Type | What It Checks |
|---|---|
| Approval Gate | Agent asks for approval before executing risky operations |
| Context Loading | Agent loads required context files before execution |
| Delegation | Agent delegates complex tasks (4+ files) to task-manager |
| Tool Usage | Agent uses correct tools for the task |
| Behavior | Agent follows expected behavior patterns |
| Subagent | Subagents execute correctly when delegated |
| Content | Files contain expected content and patterns |
```
┌─────────────────────────────────────────────────────────────────┐
│                           TEST RUNNER                           │
├─────────────────────────────────────────────────────────────────┤
│ 1. Clean test_tmp/ directory                                    │
│ 2. Start opencode server (from git root)                        │
│ 3. For each test:                                               │
│    a. Create session with specified agent                       │
│    b. Send prompt(s) (single or multi-turn)                     │
│    c. Capture events via event stream                           │
│    d. Extract tool inputs/outputs (enhanced logging)            │
│    e. Run evaluators on session data                            │
│    f. Validate behavior expectations                            │
│    g. Check content expectations                                │
│    h. Verify subagent behavior (if delegated)                   │
│    i. Delete session (unless --debug)                           │
│ 4. Clean test_tmp/ directory                                    │
│ 5. Generate results (JSON + dashboard)                          │
└─────────────────────────────────────────────────────────────────┘
```
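The per-test loop above could be sketched roughly as follows. This is an illustrative sketch only: `SessionClient`, `runTest`, and the evaluator callback are assumed names, not the framework's actual API.

```typescript
// Hypothetical sketch of steps 3a-3i; all names here are illustrative.
type TestCase = { id: string; agent: string; prompts: string[] };
type EvalResult = { rule: string; passed: boolean };

interface SessionClient {
  createSession(agent: string): Promise<string>;
  sendPrompt(sessionId: string, text: string): Promise<void>;
  deleteSession(sessionId: string): Promise<void>;
}

async function runTest(
  client: SessionClient,
  test: TestCase,
  evaluate: (sessionId: string) => Promise<EvalResult[]>,
  debug = false,
): Promise<boolean> {
  const sessionId = await client.createSession(test.agent); // step 3a
  try {
    for (const text of test.prompts) {
      await client.sendPrompt(sessionId, text);             // step 3b
    }
    const results = await evaluate(sessionId);              // steps 3e-3h
    return results.every((r) => r.passed);
  } finally {
    if (!debug) await client.deleteSession(sessionId);      // step 3i
  }
}
```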
During Test Execution:

```
~/.local/share/opencode/storage/
├── session/        # Session metadata (by project hash)
├── message/        # Messages per session (ses_xxx/)
├── part/           # Tool calls, text parts, etc.
└── session_diff/   # Session changes
```

Test Results:

```
evals/results/
├── latest.json       # Most recent run
├── history/2025-11/  # Historical runs
└── index.html        # Interactive dashboard
```
The framework listens to the OpenCode event stream and captures:

```
// Events captured in real-time
- session.created/updated
- message.created/updated
- part.created/updated (includes tool calls)
- permission.request/response

// Enhanced with tool details (NEW)
- Tool name, input, output
- Start time, end time, duration
- Success/error status
```
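The tool-timing capture could work roughly like the sketch below. The `ToolPartEvent` shape and the `ToolLogger` class are assumptions for illustration, not the SDK's real event schema.

```typescript
// Illustrative sketch: pair a "running" part event with its terminal
// event to derive input/output, duration, and success status.
interface ToolPartEvent {
  tool: string;
  status: "running" | "completed" | "error";
  input?: unknown;
  output?: unknown;
  timeMs: number; // event timestamp in milliseconds (assumed field)
}

interface ToolLog {
  tool: string;
  input?: unknown;
  output?: unknown;
  startMs: number;
  endMs?: number;
  durationMs?: number;
  ok?: boolean;
}

class ToolLogger {
  private open = new Map<string, ToolLog>(); // in-flight tool calls by part id
  readonly logs: ToolLog[] = [];

  onEvent(partId: string, e: ToolPartEvent): void {
    if (e.status === "running") {
      this.open.set(partId, { tool: e.tool, input: e.input, startMs: e.timeMs });
      return;
    }
    const log = this.open.get(partId);
    if (!log) return;
    log.output = e.output;
    log.endMs = e.timeMs;
    log.durationMs = e.timeMs - log.startMs;
    log.ok = e.status === "completed";
    this.open.delete(partId);
    this.logs.push(log);
  }
}
```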
```yaml
id: my-test-001
name: My Test Name
description: What this test validates
category: developer        # developer, business, creative, edge-case
agent: openagent           # openagent, opencoder
model: opencode/big-pickle # Optional, defaults to free tier

# Single prompt (simple tests)
prompt: |
  Create a function called add in math.ts

# OR multi-turn prompts (for approval workflows)
prompts:
  - text: |
      Create a function called add in math.ts
    expectContext: true
    contextFile: ".opencode/context/core/standards/code.md"
  - text: "Yes, proceed with the plan."
    delayMs: 2000

# Expected behavior
behavior:
  mustUseTools: [read, write]
  requiresApproval: true
  requiresContext: true
  minToolCalls: 2

# Expected violations (should NOT violate these)
expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error

# Approval strategy
approvalStrategy:
  type: auto-approve
  timeout: 120000
```
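As a rough TypeScript mirror of the YAML fields above (the real definitions live in test-case-schema.ts; the names below follow the YAML, but the exact optionality is an assumption):

```typescript
// Illustrative types only; see test-case-schema.ts for the real schema.
interface PromptTurn {
  text: string;
  expectContext?: boolean;
  contextFile?: string;
  delayMs?: number;
}

interface BehaviorExpectations {
  mustUseTools?: string[];
  requiresApproval?: boolean;
  requiresContext?: boolean;
  minToolCalls?: number;
}

interface ExpectedViolation {
  rule: string;
  shouldViolate: boolean;
  severity: "error" | "warning" | "info";
}

interface ApprovalStrategy {
  type: "auto-approve" | "manual";
  timeout?: number;
}

interface EvalTestCase {
  id: string;
  name: string;
  description: string;
  category: "developer" | "business" | "creative" | "edge-case";
  agent: "openagent" | "opencoder";
  model?: string;               // defaults to free tier
  prompt?: string;              // single-turn, OR:
  prompts?: PromptTurn[];       // multi-turn
  behavior?: BehaviorExpectations;
  expectedViolations?: ExpectedViolation[];
  approvalStrategy?: ApprovalStrategy;
}
```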
Why Multi-Turn? OpenAgent requires approval before execution. Single-turn tests will fail because the agent asks for approval but never receives it.
```yaml
# ❌ WRONG - Single turn (agent asks approval, never gets it)
prompt: "Create a file at test.txt"

# ✅ CORRECT - Multi-turn (agent asks, user approves)
prompts:
  - text: "Create a file at test.txt"
  - text: "Yes, proceed."
    delayMs: 2000
```
Validates the actual content of files written/edited:
```yaml
behavior:
  mustUseTools: [write]
  contentExpectations:
    - filePath: "src/math.ts"
      mustContain:
        - "export function add"
        - ": number"
      mustNotContain:
        - "console.log"
        - "any"
      minLength: 100
      maxLength: 500
```
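The weighted scoring (40/30/20/5/5 across the five check types) might be combined along these lines. This is a sketch, not the content evaluator's actual code; `scoreContent` and the 0-100 scale are illustrative.

```typescript
// Hypothetical weighted content check; weights mirror the Validation
// Types breakdown (40/30/20/5/5). Returns a 0-100 score.
interface ContentExpectation {
  mustContain?: string[];
  mustNotContain?: string[];
  mustMatch?: string; // regex source
  minLength?: number;
  maxLength?: number;
}

function scoreContent(content: string, exp: ContentExpectation): number {
  let score = 0;
  // Required patterns: 40%
  if ((exp.mustContain ?? []).every((p) => content.includes(p))) score += 40;
  // Forbidden patterns: 30%
  if ((exp.mustNotContain ?? []).every((p) => !content.includes(p))) score += 30;
  // Regex match: 20%
  if (!exp.mustMatch || new RegExp(exp.mustMatch).test(content)) score += 20;
  // Length bounds: 5% each
  if (exp.minLength === undefined || content.length >= exp.minLength) score += 5;
  if (exp.maxLength === undefined || content.length <= exp.maxLength) score += 5;
  return score;
}
```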
Validation Types:
- mustContain - Required patterns (40% weight)
- mustNotContain - Forbidden patterns (30% weight)
- mustMatch - Regex pattern (20% weight)
- minLength - Minimum content length (5% weight)
- maxLength - Maximum content length (5% weight)

Validates delegation and subagent behavior:
```yaml
behavior:
  mustUseTools: [task]
  shouldDelegate: true
  delegationExpectations:
    subagentType: "CoderAgent"
    subagentMustUseTools: [write, read]
    subagentMinToolCalls: 2
    subagentMustComplete: true
```
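A delegation check mirroring these expectations could look like the sketch below; the `SubagentTrace` shape and `checkDelegation` helper are assumed names for illustration, not the subagent evaluator's real API.

```typescript
// Illustrative delegation check: compare a subagent's observed trace
// against the delegationExpectations fields.
interface SubagentTrace {
  subagentType: string;
  toolsUsed: string[]; // one entry per tool call, in order
  completed: boolean;
}

interface DelegationExpectations {
  subagentType: string;
  subagentMustUseTools: string[];
  subagentMinToolCalls: number;
  subagentMustComplete: boolean;
}

function checkDelegation(trace: SubagentTrace, exp: DelegationExpectations): string[] {
  const violations: string[] = [];
  if (trace.subagentType !== exp.subagentType)
    violations.push(`wrong subagent: ${trace.subagentType}`);
  for (const tool of exp.subagentMustUseTools)
    if (!trace.toolsUsed.includes(tool)) violations.push(`subagent missed tool: ${tool}`);
  if (trace.toolsUsed.length < exp.subagentMinToolCalls)
    violations.push("too few subagent tool calls");
  if (exp.subagentMustComplete && !trace.completed)
    violations.push("subagent did not complete");
  return violations;
}
```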
Checks: subagent type, required subagent tools, minimum tool-call count, and completion status.

More sophisticated approval validation:
```yaml
behavior:
  requiresApproval: true
  approvalExpectations:
    minConfidence: high # high, medium, low
    approvalMustMention:
      - "file"
      - "create"
    requireExplicitApproval: true
```
Enhanced debugging capabilities:
```yaml
behavior:
  debug:
    logToolDetails: true      # Log all tool I/O
    saveReplayOnFailure: true # Save session for replay
    exportMarkdown: true      # Export to markdown
```
```yaml
behavior:
  # Must use these tools
  mustUseTools: [read, write]

  # Must use at least one of these sets
  mustUseAnyOf: [[bash], [list]]

  # May use these (optional)
  mayUseTools: [glob, grep]

  # Must NOT use these
  mustNotUseTools: [edit]

  # Tool call count
  minToolCalls: 2
  maxToolCalls: 10
```
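Enforcement of these rules might look like the sketch below; `checkToolRules` and the returned violation strings are illustrative, not the tool-usage evaluator's real API. `mayUseTools` needs no check here, since optional tools can never produce a violation.

```typescript
// Illustrative tool-usage check against the observed list of tool calls.
interface ToolRules {
  mustUseTools?: string[];
  mustUseAnyOf?: string[][]; // at least one set must be fully used
  mustNotUseTools?: string[];
  minToolCalls?: number;
  maxToolCalls?: number;
}

function checkToolRules(called: string[], rules: ToolRules): string[] {
  const violations: string[] = [];
  for (const tool of rules.mustUseTools ?? [])
    if (!called.includes(tool)) violations.push(`missing required tool: ${tool}`);
  if (rules.mustUseAnyOf &&
      !rules.mustUseAnyOf.some((set) => set.every((tool) => called.includes(tool))))
    violations.push("none of the required tool sets were used");
  for (const tool of rules.mustNotUseTools ?? [])
    if (called.includes(tool)) violations.push(`forbidden tool used: ${tool}`);
  if (rules.minToolCalls !== undefined && called.length < rules.minToolCalls)
    violations.push("too few tool calls");
  if (rules.maxToolCalls !== undefined && called.length > rules.maxToolCalls)
    violations.push("too many tool calls");
  return violations;
}
```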
```bash
# Run all tests
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent

# Run with debug output
npm run eval:sdk -- --debug

# Filter tests by pattern
npm run eval:sdk -- --agent=openagent --filter="context-loading"

# Run in batches of 3 with 10s delays
cd evals/framework/scripts/utils
./run-tests-batch.sh openagent 3 10
```
When running with --debug, the framework logs full tool inputs and outputs and keeps sessions for inspection. Example debug output:
```
────────────────────────────────────────────────────────────
🔧 TOOL: write (completed)
────────────────────────────────────────────────────────────
📥 INPUT:
{
  "filePath": "test.ts",
  "content": "export function add..."
}
📤 OUTPUT:
{
  "success": true,
  "bytesWritten": 67
}
⏱️ Duration: 12ms
────────────────────────────────────────────────────────────
```
```
============================================================
Running test: ctx-code-001 - Code Task with Context Loading
============================================================
Approval strategy: Auto-approve all permission requests
Creating session...
Session created: ses_abc123
  Agent: openagent
  Model: anthropic/claude-sonnet-4-5
Sending 2 prompts (multi-turn)...
  Prompt 1/2: Create a function...
  Completed
  Prompt 2/2: Yes, proceed...
  Completed
Running evaluators...
  ✅ approval-gate: PASSED
  ✅ context-loading: PASSED
  ✅ tool-usage: PASSED
  ✅ behavior: PASSED
  ✅ content: PASSED
Test PASSED
Duration: 35142ms
Events captured: 116
```
```bash
cd evals/results
./serve.sh
# Open http://localhost:8000
```
Dashboard Features:
Violations Detected:
```
1. [error] missing-required-tool: Required tool 'write' was not used
2. [error] missing-required-patterns: File missing: export function
3. [warning] over-delegation: Delegated for < 4 files (acceptable)
```
Severity Levels:
- error - Test fails
- warning - Test passes but flagged
- info - Informational only

Problem: Agent responds but doesn't execute tools.
Cause: Single-turn test when multi-turn needed (OpenAgent requires approval).
Solution:
```yaml
# Change from:
prompt: "Create a file"

# To:
prompts:
  - text: "Create a file"
  - text: "Yes, proceed."
    delayMs: 2000
```
Problem: Same test ID appears in multiple files.
Cause: Old and new test structures both present.
Solution: Ensure unique test IDs across all test files.
```bash
# Check for duplicates
find evals/agents/*/tests -name "*.yaml" -exec grep "^id:" {} \; | sort | uniq -d
```
Problem: Context loading evaluator fails.
Cause: Context file read before first prompt sent.
Solution: Use expectContext: true on the prompt that needs context:
```yaml
prompts:
  - text: "Create a function"
    expectContext: true
    contextFile: ".opencode/context/core/standards/code.md"
```
Problem: Content expectations not met.
Cause: File content doesn't match expectations.
Debug:
```bash
# Run with debug to see actual content
npm run eval:sdk -- --debug --filter="your-test"

# Check the file that was written
cat evals/test_tmp/your-file.ts
```
Problem: When multiple test files have the same id, the test runner loads both but only one executes (unpredictably).
Solution: Always ensure unique test IDs. Use a naming convention:
```
{category}-{feature}-{number}

ctx-code-001
ctx-docs-002
```
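An ID hygiene check matching this convention could be sketched as below; `ID_PATTERN` and `findDuplicateIds` are illustrative helpers, not part of the framework.

```typescript
// Illustrative: validate the {category}-{feature}-{number} convention
// and detect duplicate test IDs across all loaded test files.
const ID_PATTERN = /^[a-z]+-[a-z]+-\d{3}$/;

function findDuplicateIds(ids: string[]): string[] {
  const seen = new Set<string>();
  const dups = new Set<string>();
  for (const id of ids) (seen.has(id) ? dups : seen).add(id);
  return [...dups];
}
```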
Problem: OpenAgent asks for approval before execution. Single-turn tests fail because the agent never receives approval.
Solution: Always use multi-turn prompts for OpenAgent:
```yaml
prompts:
  - text: "Do the task"
  - text: "Yes, proceed."
    delayMs: 2000
```
Problem: Checking IF a tool was called doesn't verify WHAT it did.
Solution: Use content expectations to validate actual output:
```yaml
behavior:
  mustUseTools: [write]   # Checks IF write was called
  contentExpectations:    # Checks WHAT was written
    - filePath: "test.ts"
      mustContain: ["export", "function"]
```
Problem: Without tool I/O logging, debugging failures is difficult.
Solution: Enhanced event logging captures full tool inputs, outputs, and timing.
Problem: Adding new features can break existing tests.
Solution: Make all new fields optional:
```typescript
contentExpectations?: ContentExpectation[];     // Optional
delegationExpectations?: DelegationExpectation; // Optional
```
```
evals/
├── framework/                  # Test framework code
│   ├── src/
│   │   ├── evaluators/         # Validation logic
│   │   │   ├── approval-gate-evaluator.ts
│   │   │   ├── context-loading-evaluator.ts
│   │   │   ├── delegation-evaluator.ts
│   │   │   ├── tool-usage-evaluator.ts
│   │   │   ├── behavior-evaluator.ts
│   │   │   ├── subagent-evaluator.ts   # NEW
│   │   │   └── content-evaluator.ts    # NEW
│   │   ├── sdk/                # Test execution
│   │   │   ├── test-runner.ts
│   │   │   ├── test-executor.ts
│   │   │   ├── event-stream-handler.ts # Enhanced
│   │   │   ├── event-logger.ts         # Enhanced
│   │   │   └── test-case-schema.ts     # Updated
│   │   └── types/              # TypeScript types
│   └── package.json
│
├── agents/                     # Agent-specific tests
│   ├── openagent/
│   │   └── tests/
│   │       ├── 01-critical-rules/
│   │       ├── 02-workflow-stages/
│   │       ├── 03-delegation/
│   │       ├── 04-execution-paths/
│   │       ├── 05-edge-cases/
│   │       └── 06-integration/
│   └── opencoder/
│       └── tests/
│
├── results/                    # Test results
│   ├── latest.json
│   ├── history/
│   └── index.html              # Dashboard
│
└── test_tmp/                   # Temporary test files
```
Remaining Enhancements:
- Task 03: Enhanced Approval Detection (~1 hour)
- Task 04: Session Replay Utility (~1.5 hours): npm run replay <session-id>
- Task 07: Integration Testing (~1 hour)

Key files:

- evals/agents/openagent/tests/06-integration/medium/03-full-validation-example.yaml
- evals/framework/src/
- evals/results/index.html
- ~/.local/share/opencode/storage/

When adding new tests:
Framework Version: 0.1.0
Status: Production Ready ✅