Test and validate agent behavior with automated evaluations.
cd evals/framework
# Run golden tests (baseline - 8 tests, ~2-3 min)
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"
# Run a specific test
npm run eval:sdk -- --agent=openagent --pattern="**/smoke-test.yaml"
# Run with debug output (includes multi-agent logging)
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --debug
Beautiful hierarchical logging shows parent-child delegation chains:
┌────────────────────────────────────────────────────────────┐
│ 🎯 PARENT: OpenAgent (ses_xxx...) │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ 🎯 CHILD: simple-responder (ses_yyy...) │
│ Parent: ses_xxx... │
│ Depth: 1 │
└────────────────────────────────────────────────────────────┘
✅ CHILD COMPLETE (2.9s)
✅ PARENT COMPLETE (20.9s)
Enable with --debug flag. See MULTI_AGENT_LOGGING_COMPLETE.md for details.
8 curated tests that validate core agent behaviors:
| Test | What It Validates |
|---|---|
| 01-smoke-test | Agent & subagent delegation (multi-agent) |
| 02-context-loading | Agent reads context before answering |
| 03-read-before-write | Agent inspects before modifying |
| 04-write-with-approval | Agent asks before writing |
| 05-multi-turn-context | Agent remembers conversation |
| 06-task-breakdown | Agent reads standards before implementing |
| 07-tool-selection | Agent uses dedicated tools (not bash) |
| 08-error-handling | Agent handles errors gracefully |
# Run all golden tests
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"
See CREATING_TESTS.md for:
expectedContextFiles - Explicitly specify which context files to validateYou can now explicitly specify which context files the agent must read:
behavior:
requiresContext: true
expectedContextFiles:
- .opencode/context/core/standards/code.md
- standards/code.md
See agents/shared/tests/EXPLICIT_CONTEXT_FILES.md for detailed guide.
Quick example:
id: my-test
name: "My Test"
description: What this tests.
category: developer
prompts:
- text: Read evals/test_tmp/README.md and summarize it.
approvalStrategy:
type: auto-approve
behavior:
mustUseTools: [read]
expectedViolations:
- rule: approval-gate
shouldViolate: false
severity: error
timeout: 60000
| Evaluator | What It Checks |
|---|---|
| approval-gate | Approval requested before risky operations |
| context-loading | Context files loaded before acting (supports explicit file specification) |
| execution-balance | Read operations before write operations |
| tool-usage | Dedicated tools used instead of bash |
| behavior | Expected tools used, forbidden tools avoided |
| delegation | Complex tasks delegated to subagents |
| stop-on-failure | Agent stops on errors instead of auto-fixing |
evals/
├── README.md # This file
├── CREATING_TESTS.md # How to create custom tests
├── framework/ # Test runner and evaluators
│ ├── src/
│ │ ├── sdk/ # Test execution
│ │ └── evaluators/ # Rule validators
│ └── README.md # Technical details
├── agents/
│ ├── shared/tests/
│ │ ├── golden/ # 8 baseline tests
│ │ └── templates/ # Test templates
│ └── core/openagent/tests/ # Agent-specific tests
├── results/ # Test results
│ ├── latest.json
│ └── index.html # Dashboard
└── test_tmp/ # Temp files (auto-cleaned)
npm run eval:sdk -- [options]
Options:
--agent=NAME Agent to test (openagent, opencoder, core/openagent)
--subagent=NAME Test a subagent (coder-agent, tester, reviewer, etc.)
Default: Standalone mode (forces mode: primary)
--delegate Test subagent via parent delegation (requires --subagent)
--pattern=GLOB Test file pattern (default: **/*.yaml)
--debug Enable debug output, keep sessions for inspection
--verbose Show full conversation (prompts + responses) after each test
(automatically enables --debug)
--model=PROVIDER/MODEL Override model (default: opencode/big-pickle)
--timeout=MS Test timeout (default: 60000)
--prompt-variant=NAME Use specific prompt variant (gpt, gemini, grok, llama)
Auto-detects recommended model from prompt metadata
--no-evaluators Skip running evaluators (faster iteration)
--core Run core test suite only (7 tests, ~5-8 min)
# Run golden tests with verbose output (see full conversations)
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --verbose
# Test subagent standalone (forces mode: primary)
npm run eval:sdk -- --subagent=coder-agent
# Test subagent via delegation (uses parent agent)
npm run eval:sdk -- --subagent=coder-agent --delegate
# Test with a specific model
npm run eval:sdk -- --agent=openagent --model=anthropic/claude-3-5-sonnet-20241022
# Test with a prompt variant (auto-detects model)
npm run eval:sdk -- --agent=openagent --prompt-variant=llama
# Quick iteration without evaluators
npm run eval:sdk -- --agent=openagent --pattern="**/01-smoke-test.yaml" --no-evaluators
From the project root, you can use these shortcuts:
# Full pipeline: build, validate, run golden tests
make test-evals
# Just run golden tests (8 tests, ~3-5 min)
make test-golden
# Quick smoke test (1 test, ~30s)
make test-smoke
# Run with verbose output (see full conversations)
make test-verbose
# Test specific agent
make test-agent AGENT=opencoder
# Test subagent (standalone mode)
make test-subagent SUBAGENT=coder-agent
# Test subagent (delegation mode)
make test-subagent-delegate SUBAGENT=coder-agent
# Test with specific model
make test-model MODEL=anthropic/claude-3-5-sonnet-20241022
# Test with prompt variant
make test-variant VARIANT=llama
# View results
make view-results # Open dashboard in browser
make show-results # Show summary in terminal
For detailed subagent testing guide, see SUBAGENT_TESTING.md
Results are saved to evals/results/:
latest.json - Most recent runhistory/ - Historical results (by month)index.html - Dashboard (open in browser)# View dashboard
make view-results
# Or manually:
cd evals/results && python -m http.server 8080
# Open http://localhost:8080