Agent Evaluation Framework

Test and validate agent behavior with automated evaluations.

Quick Start

cd evals/framework

# Run golden tests (baseline - 8 tests, ~2-3 min)
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"

# Run a specific test
npm run eval:sdk -- --agent=openagent --pattern="**/smoke-test.yaml"

# Run with debug output (includes multi-agent logging)
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --debug

✨ New Features

Multi-Agent Logging (Dec 2025)

Hierarchical logging shows parent-child delegation chains:

┌────────────────────────────────────────────────────────────┐
│ 🎯 PARENT: OpenAgent (ses_xxx...)                          │
└────────────────────────────────────────────────────────────┘
  
  ┌────────────────────────────────────────────────────────────┐
  │ 🎯 CHILD: simple-responder (ses_yyy...)                    │
  │    Parent: ses_xxx...                                      │
  │    Depth: 1                                                │
  └────────────────────────────────────────────────────────────┘
  
  ✅ CHILD COMPLETE (2.9s)
✅ PARENT COMPLETE (20.9s)

Enable with the --debug flag. See MULTI_AGENT_LOGGING_COMPLETE.md for details.

Performance Improvements (Dec 2025)

10-20% faster tests - Grace period reduced from 5s to 2s
Performance metrics - Automatic collection of tool latencies, inference time
37 unit tests - Complete test coverage for logging system

Golden Tests

8 curated tests that validate core agent behaviors:

Test	What It Validates
01-smoke-test	Agent & subagent delegation (multi-agent)
02-context-loading	Agent reads context before answering
03-read-before-write	Agent inspects before modifying
04-write-with-approval	Agent asks before writing
05-multi-turn-context	Agent remembers conversation
06-task-breakdown	Agent reads standards before implementing
07-tool-selection	Agent uses dedicated tools (not bash)
08-error-handling	Agent handles errors gracefully

# Run all golden tests
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"

Creating Custom Tests

See CREATING_TESTS.md for:

Test templates (copy and modify)
Behavior options (mustUseTools, requiresApproval, etc.)
NEW: expectedContextFiles - Explicitly specify which context files to validate
Expected violations
Examples

New Feature: Explicit Context File Validation

You can now explicitly specify which context files the agent must read:

behavior:
  requiresContext: true
  expectedContextFiles:
    - .opencode/context/core/standards/code.md
    - standards/code.md

See agents/shared/tests/EXPLICIT_CONTEXT_FILES.md for a detailed guide.
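The snippet above lists the same file twice, once as a full path and once as a shorter one, which suggests that entries are matched loosely against the files the agent actually read. The exact rule is not documented here, so the following is only a sketch of one possible reading: an entry counts as satisfied when a read path equals it or ends with it. The names `normalize` and `missingContextFiles` are hypothetical and are not part of the framework's API.

```typescript
// Hypothetical helpers: not the framework's real API.

// Normalize separators and drop a leading "./" so paths compare cleanly.
function normalize(p: string): string {
  return p.replace(/\\/g, "/").replace(/^\.\//, "");
}

// Assumption: an expected entry is satisfied when some read path equals it
// or ends with "/" + entry. Returns the entries that were never read.
function missingContextFiles(expected: string[], filesRead: string[]): string[] {
  const read = filesRead.map(normalize);
  return expected.filter((entry) => {
    const want = normalize(entry);
    return !read.some((got) => got === want || got.endsWith("/" + want));
  });
}
```

Under this assumption, an empty result means every listed file was read. Consult EXPLICIT_CONTEXT_FILES.md for the matching rule the context-loading evaluator actually applies.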

Quick example:

id: my-test
name: "My Test"
description: What this tests.
category: developer

prompts:
  - text: Read evals/test_tmp/README.md and summarize it.

approvalStrategy:
  type: auto-approve

behavior:
  mustUseTools: [read]

expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error

timeout: 60000
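For readers who generate or lint test files programmatically, the fields in the quick example can be written down as a type. This is a sketch inferred only from the example above: the real schema may have more fields or different requirements, and both `TestDefinition` and `checkTestDefinition` are hypothetical names. Treating `id`, `name`, and `prompts` as required is an assumption.

```typescript
// Hypothetical shape inferred from the quick example; the real schema may differ.
interface TestDefinition {
  id: string;
  name: string;
  description?: string;
  category?: string;
  prompts: { text: string }[];
  approvalStrategy?: { type: string };
  behavior?: {
    mustUseTools?: string[];
    requiresContext?: boolean;
    expectedContextFiles?: string[];
  };
  expectedViolations?: { rule: string; shouldViolate: boolean; severity: string }[];
  timeout?: number; // milliseconds, as in the CLI --timeout option
}

// Assumption: id, name, and at least one prompt are required.
// Returns a list of human-readable problems; empty means the object looks usable.
function checkTestDefinition(test: Partial<TestDefinition>): string[] {
  const problems: string[] = [];
  if (!test.id) problems.push("missing id");
  if (!test.name) problems.push("missing name");
  if (!test.prompts || test.prompts.length === 0) problems.push("no prompts");
  if (test.timeout !== undefined && test.timeout <= 0) {
    problems.push("timeout must be positive");
  }
  return problems;
}
```

CREATING_TESTS.md remains the reference for which fields the runner accepts.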

Evaluators

Evaluator	What It Checks
approval-gate	Approval requested before risky operations
context-loading	Context files loaded before acting (supports explicit file specification)
execution-balance	Read operations before write operations
tool-usage	Dedicated tools used instead of bash
behavior	Expected tools used, forbidden tools avoided
delegation	Complex tasks delegated to subagents
stop-on-failure	Agent stops on errors instead of auto-fixing
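To make the table concrete, here is a minimal sketch of the kind of check execution-balance describes: scan the tool calls in order and flag a write that happens before any read. The `ToolCall` shape, the function name, and the default tool names other than `read` are assumptions for illustration; the framework's own evaluator may work differently.

```typescript
// Hypothetical event shape; the framework's real event types may differ.
type ToolCall = { tool: string; path?: string };

// Returns the index of the first write that is not preceded by any read,
// or -1 when every write comes after at least one read.
// Assumption: "read" is a read tool and "write"/"edit" are write tools.
function firstUnpreparedWrite(
  calls: ToolCall[],
  readTools: Set<string> = new Set(["read"]),
  writeTools: Set<string> = new Set(["write", "edit"])
): number {
  let sawRead = false;
  for (let i = 0; i < calls.length; i++) {
    const tool = calls[i].tool;
    if (readTools.has(tool)) {
      sawRead = true;
    } else if (writeTools.has(tool) && !sawRead) {
      return i;
    }
  }
  return -1;
}
```

A result of -1 corresponds to a passing run under these assumptions; any other value points at the offending call.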

Directory Structure

evals/
├── README.md                    # This file
├── CREATING_TESTS.md           # How to create custom tests
├── framework/                   # Test runner and evaluators
│   ├── src/
│   │   ├── sdk/                # Test execution
│   │   └── evaluators/         # Rule validators
│   └── README.md               # Technical details
├── agents/
│   ├── shared/tests/
│   │   ├── golden/             # 8 baseline tests
│   │   └── templates/          # Test templates
│   └── core/openagent/tests/   # Agent-specific tests
├── results/                     # Test results
│   ├── latest.json
│   └── index.html              # Dashboard
└── test_tmp/                    # Temp files (auto-cleaned)

CLI Options

npm run eval:sdk -- [options]

Options:
  --agent=NAME           Agent to test (openagent, opencoder, core/openagent)
  --subagent=NAME        Test a subagent (coder-agent, tester, reviewer, etc.)
                         Default: Standalone mode (forces mode: primary)
  --delegate             Test subagent via parent delegation (requires --subagent)
  --pattern=GLOB         Test file pattern (default: **/*.yaml)
  --debug                Enable debug output, keep sessions for inspection
  --verbose              Show full conversation (prompts + responses) after each test
                         (automatically enables --debug)
  --model=PROVIDER/MODEL Override model (default: opencode/big-pickle)
  --timeout=MS           Test timeout (default: 60000)
  --prompt-variant=NAME  Use specific prompt variant (gpt, gemini, grok, llama)
                         Auto-detects recommended model from prompt metadata
  --no-evaluators        Skip running evaluators (faster iteration)
  --core                 Run core test suite only (7 tests, ~5-8 min)

Examples

# Run golden tests with verbose output (see full conversations)
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --verbose

# Test subagent standalone (forces mode: primary)
npm run eval:sdk -- --subagent=coder-agent

# Test subagent via delegation (uses parent agent)
npm run eval:sdk -- --subagent=coder-agent --delegate

# Test with a specific model
npm run eval:sdk -- --agent=openagent --model=anthropic/claude-3-5-sonnet-20241022

# Test with a prompt variant (auto-detects model)
npm run eval:sdk -- --agent=openagent --prompt-variant=llama

# Quick iteration without evaluators
npm run eval:sdk -- --agent=openagent --pattern="**/01-smoke-test.yaml" --no-evaluators

Quick Commands (Makefile)

From the project root, you can use these shortcuts:

# Full pipeline: build, validate, run golden tests
make test-evals

# Just run golden tests (8 tests, ~3-5 min)
make test-golden

# Quick smoke test (1 test, ~30s)
make test-smoke

# Run with verbose output (see full conversations)
make test-verbose

# Test specific agent
make test-agent AGENT=opencoder

# Test subagent (standalone mode)
make test-subagent SUBAGENT=coder-agent

# Test subagent (delegation mode)
make test-subagent-delegate SUBAGENT=coder-agent

# Test with specific model
make test-model MODEL=anthropic/claude-3-5-sonnet-20241022

# Test with prompt variant
make test-variant VARIANT=llama

# View results
make view-results    # Open dashboard in browser
make show-results    # Show summary in terminal

For a detailed subagent testing guide, see SUBAGENT_TESTING.md.

Results

Results are saved to evals/results/:

latest.json - Most recent run
history/ - Historical results (by month)
index.html - Dashboard (open in browser)

# View dashboard
make view-results
# Or manually (http.server requires Python 3):
cd evals/results && python -m http.server 8080
# Open http://localhost:8080
