# Agent Evaluation Framework

Test and validate agent behavior with automated evaluations.

## Quick Start

```bash
cd evals/framework

# Run golden tests (baseline - 8 tests, ~2-3 min)
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"

# Run a specific test
npm run eval:sdk -- --agent=openagent --pattern="**/smoke-test.yaml"

# Run with debug output (includes multi-agent logging)
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --debug
```

## ✨ New Features

### Multi-Agent Logging (Dec 2025)

Beautiful hierarchical logging shows parent-child delegation chains:

```
┌────────────────────────────────────────────────────────────┐
│ 🎯 PARENT: OpenAgent (ses_xxx...)                          │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ 🎯 CHILD: simple-responder (ses_yyy...)                    │
│    Parent: ses_xxx...                                      │
│    Depth: 1                                                │
└────────────────────────────────────────────────────────────┘
✅ CHILD COMPLETE (2.9s)
✅ PARENT COMPLETE (20.9s)
```

Enable with the `--debug` flag. See [MULTI_AGENT_LOGGING_COMPLETE.md](MULTI_AGENT_LOGGING_COMPLETE.md) for details.

### Performance Improvements (Dec 2025)

- **10-20% faster tests** - Grace period reduced from 5s to 2s
- **Performance metrics** - Automatic collection of tool latencies and inference time
- **37 unit tests** - Complete test coverage for the logging system

## Golden Tests

8 curated tests that validate core agent behaviors:

| Test | What It Validates |
|------|-------------------|
| 01-smoke-test | **Agent & subagent delegation** (multi-agent) |
| 02-context-loading | Agent reads context before answering |
| 03-read-before-write | Agent inspects before modifying |
| 04-write-with-approval | Agent asks before writing |
| 05-multi-turn-context | Agent remembers conversation |
| 06-task-breakdown | Agent reads standards before implementing |
| 07-tool-selection | Agent uses dedicated tools (not bash) |
| 08-error-handling | Agent handles errors gracefully |

```bash
# Run all golden tests
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"
```

## Creating Custom Tests

See **[CREATING_TESTS.md](CREATING_TESTS.md)** for:

- Test templates (copy and modify)
- Behavior options (`mustUseTools`, `requiresApproval`, etc.)
- **NEW:** `expectedContextFiles` - Explicitly specify which context files to validate
- Expected violations
- Examples

### New Feature: Explicit Context File Validation

You can now explicitly specify which context files the agent must read:

```yaml
behavior:
  requiresContext: true
  expectedContextFiles:
    - .opencode/context/core/standards/code.md
    - standards/code.md
```

See **[agents/shared/tests/EXPLICIT_CONTEXT_FILES.md](agents/shared/tests/EXPLICIT_CONTEXT_FILES.md)** for a detailed guide.

Quick example:

```yaml
id: my-test
name: "My Test"
description: What this tests.
category: developer
prompts:
  - text: Read evals/test_tmp/README.md and summarize it.
approvalStrategy:
  type: auto-approve
behavior:
  mustUseTools: [read]
expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error
timeout: 60000
```
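For reference, the YAML fields above map onto a test-case shape roughly like the following. This is an illustrative sketch derived only from the example, not the framework's actual type definitions; any field semantics beyond what the example shows are assumptions.

```typescript
// Illustrative sketch of the test-case schema implied by the YAML above.
// These names mirror the YAML keys; the framework's real types may differ.

interface ExpectedViolation {
  rule: string;            // evaluator rule id, e.g. "approval-gate"
  shouldViolate: boolean;  // false = the rule must NOT fire during the run
  severity: "error" | "warning";
}

interface TestCase {
  id: string;
  name: string;
  description: string;
  category: string;                  // e.g. "developer"
  prompts: { text: string }[];       // sent to the agent in order
  approvalStrategy?: { type: string }; // e.g. "auto-approve"
  behavior?: {
    mustUseTools?: string[];         // tools the agent must call
    requiresContext?: boolean;
    expectedContextFiles?: string[]; // explicit context files to validate
  };
  expectedViolations?: ExpectedViolation[];
  timeout?: number;                  // milliseconds
}
```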
## Evaluators

| Evaluator | What It Checks |
|-----------|----------------|
| **approval-gate** | Approval requested before risky operations |
| **context-loading** | Context files loaded before acting (supports explicit file specification) |
| **execution-balance** | Read operations before write operations |
| **tool-usage** | Dedicated tools used instead of bash |
| **behavior** | Expected tools used, forbidden tools avoided |
| **delegation** | Complex tasks delegated to subagents |
| **stop-on-failure** | Agent stops on errors instead of auto-fixing |
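Conceptually, each evaluator inspects the recorded tool calls from a test run and reports violations. Below is a minimal sketch of that idea, using a hypothetical `Evaluator` interface and event shape; the framework's actual API lives in `framework/src/evaluators/` and may look quite different.

```typescript
// Hypothetical shapes for illustration only; see framework/src/evaluators/
// for the real implementations.
interface ToolCall {
  tool: string; // e.g. "read", "write", "bash"
  args: Record<string, unknown>;
}

interface Violation {
  rule: string;
  severity: "error" | "warning";
  message: string;
}

interface Evaluator {
  rule: string;
  evaluate(calls: ToolCall[]): Violation[];
}

// Sketch of an execution-balance style check: flag a write that happens
// before any read, mirroring "read operations before write operations".
const executionBalance: Evaluator = {
  rule: "execution-balance",
  evaluate(calls) {
    const firstRead = calls.findIndex((c) => c.tool === "read");
    const firstWrite = calls.findIndex((c) => c.tool === "write");
    if (firstWrite !== -1 && (firstRead === -1 || firstWrite < firstRead)) {
      return [{
        rule: "execution-balance",
        severity: "error",
        message: "Agent wrote before reading any file",
      }];
    }
    return [];
  },
};
```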
## Directory Structure

```
evals/
├── README.md                    # This file
├── CREATING_TESTS.md            # How to create custom tests
├── framework/                   # Test runner and evaluators
│   ├── src/
│   │   ├── sdk/                 # Test execution
│   │   └── evaluators/          # Rule validators
│   └── README.md                # Technical details
├── agents/
│   ├── shared/tests/
│   │   ├── golden/              # 8 baseline tests
│   │   └── templates/           # Test templates
│   └── core/openagent/tests/    # Agent-specific tests
├── results/                     # Test results
│   ├── latest.json
│   └── index.html               # Dashboard
└── test_tmp/                    # Temp files (auto-cleaned)
```

## CLI Options

```bash
npm run eval:sdk -- [options]

Options:
  --agent=NAME            Agent to test (openagent, opencoder, core/openagent)
  --subagent=NAME         Test a subagent (coder-agent, tester, reviewer, etc.)
                          Default: standalone mode (forces mode: primary)
  --delegate              Test subagent via parent delegation (requires --subagent)
  --pattern=GLOB          Test file pattern (default: **/*.yaml)
  --debug                 Enable debug output, keep sessions for inspection
  --verbose               Show full conversation (prompts + responses) after
                          each test (automatically enables --debug)
  --model=PROVIDER/MODEL  Override model (default: opencode/big-pickle)
  --timeout=MS            Test timeout (default: 60000)
  --prompt-variant=NAME   Use specific prompt variant (gpt, gemini, grok, llama)
                          Auto-detects recommended model from prompt metadata
  --no-evaluators         Skip running evaluators (faster iteration)
  --core                  Run core test suite only (7 tests, ~5-8 min)
```

### Examples

```bash
# Run golden tests with verbose output (see full conversations)
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --verbose

# Test subagent standalone (forces mode: primary)
npm run eval:sdk -- --subagent=coder-agent

# Test subagent via delegation (uses parent agent)
npm run eval:sdk -- --subagent=coder-agent --delegate

# Test with a specific model
npm run eval:sdk -- --agent=openagent --model=anthropic/claude-3-5-sonnet-20241022

# Test with a prompt variant (auto-detects model)
npm run eval:sdk -- --agent=openagent --prompt-variant=llama

# Quick iteration without evaluators
npm run eval:sdk -- --agent=openagent --pattern="**/01-smoke-test.yaml" --no-evaluators
```

## Quick Commands (Makefile)

From the project root, you can use these shortcuts:

```bash
# Full pipeline: build, validate, run golden tests
make test-evals

# Just run golden tests (8 tests, ~3-5 min)
make test-golden

# Quick smoke test (1 test, ~30s)
make test-smoke

# Run with verbose output (see full conversations)
make test-verbose

# Test specific agent
make test-agent AGENT=opencoder

# Test subagent (standalone mode)
make test-subagent SUBAGENT=coder-agent

# Test subagent (delegation mode)
make test-subagent-delegate SUBAGENT=coder-agent

# Test with specific model
make test-model MODEL=anthropic/claude-3-5-sonnet-20241022

# Test with prompt variant
make test-variant VARIANT=llama

# View results
make view-results   # Open dashboard in browser
make show-results   # Show summary in terminal
```

**For a detailed subagent testing guide, see [SUBAGENT_TESTING.md](./SUBAGENT_TESTING.md).**

## Results

Results are saved to `evals/results/`:

- `latest.json` - Most recent run
- `history/` - Historical results (by month)
- `index.html` - Dashboard (open in browser)

```bash
# View dashboard
make view-results

# Or manually:
cd evals/results && python -m http.server 8080
# Open http://localhost:8080
```
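If you want a quick pass/fail summary without opening the dashboard, you can read `latest.json` directly. The result shape used below is an assumption for illustration only; inspect the file to see the actual schema before relying on it.

```typescript
// Print a quick summary of the most recent run.
// AssumedResult is a guess at the schema of evals/results/latest.json;
// adjust it to match the real file contents.
import { readFileSync } from "node:fs";

interface AssumedResult {
  id: string;
  passed: boolean;
  violations?: { rule: string; severity: string; message: string }[];
}

const results: AssumedResult[] = JSON.parse(
  readFileSync("evals/results/latest.json", "utf8"),
);

const passed = results.filter((r) => r.passed).length;
console.log(`${passed}/${results.length} tests passed`);

// List each failing test with its reported violations.
for (const r of results.filter((r) => !r.passed)) {
  console.log(`✗ ${r.id}`);
  for (const v of r.violations ?? []) {
    console.log(`    [${v.severity}] ${v.rule}: ${v.message}`);
  }
}
```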