# OpenCode Agent Evaluation Framework

Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.

---

## 🚀 Quick Start

```bash
cd evals/framework
npm install
npm run build

# Run all tests (free model by default)
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent
npm run eval:sdk -- --agent=opencoder

# View results dashboard
cd ../results && ./serve.sh
```

**📖 New to the framework?** Start with [GETTING_STARTED.md](GETTING_STARTED.md)

---

## 📊 Current Status

### Test Coverage

| Agent | Tests | Pass Rate | Status |
|-------|-------|-----------|--------|
| **OpenAgent** | 22 tests | 100% | ✅ Production Ready |
| **Opencoder** | 4 tests | 100% | ✅ Production Ready |

### Recent Achievements (Nov 26, 2025)

- ✅ **Context Loading Tests** - 5 comprehensive tests (3 simple, 2 complex multi-turn)
- ✅ **Smart Timeout System** - Activity monitoring with absolute max timeout
- ✅ **Fixed Context Evaluator** - Properly detects context files in multi-turn sessions
- ✅ **Batch Test Runner** - Run tests in controlled batches to avoid API limits
- ✅ **Results Dashboard** - Interactive web dashboard with filtering and charts

---

## 📁 Directory Structure

```
evals/
├── framework/                        # Core evaluation framework
│   ├── src/
│   │   ├── sdk/                      # SDK-based test runner
│   │   ├── collector/                # Session data collection
│   │   ├── evaluators/               # Rule violation detection
│   │   └── types/                    # TypeScript types
│   ├── docs/                         # Framework documentation
│   ├── scripts/utils/run-tests-batch.sh  # Batch test runner
│   └── README.md                     # Framework docs
│
├── agents/                           # Agent-specific test suites
│   ├── openagent/                    # OpenAgent tests
│   │   ├── tests/
│   │   │   ├── context-loading/      # Context loading tests (NEW)
│   │   │   ├── developer/            # Developer workflow tests
│   │   │   ├── business/             # Business analysis tests
│   │   │   └── edge-case/            # Edge case tests
│   │   ├── CONTEXT_LOADING_COVERAGE.md
│   │   ├── IMPLEMENTATION_SUMMARY.md
│   │   └── README.md
│   │
│   ├── opencoder/                    # Opencoder tests
│   │   ├── tests/developer/
│   │   └── README.md
│   │
│   └── shared/                       # Shared test utilities
│
├── results/                          # Test results & dashboard
│   ├── history/                      # Historical results (60-day retention)
│   ├── index.html                    # Interactive dashboard
│   ├── serve.sh                      # One-command server
│   ├── latest.json                   # Latest test results
│   └── README.md
│
├── test_tmp/                         # Temporary test files (auto-cleaned)
│
├── GETTING_STARTED.md                # Quick start guide (START HERE)
├── HOW_TESTS_WORK.md                 # Detailed test execution guide
├── ARCHITECTURE.md                   # System architecture review
└── README.md                         # This file
```

---

## 🎯 Key Features

### ✅ SDK-Based Execution
- Uses official `@opencode-ai/sdk` for real agent interaction
- Real-time event streaming (10+ events per test)
- Actual session recording to disk

### ✅ Cost-Aware Testing
- **FREE by default** - Uses `opencode/grok-code-fast` (OpenCode Zen)
- Override per-test or via CLI: `--model=provider/model`
- No accidental API costs during development

### ✅ Smart Timeout System (NEW)
- Activity monitoring - extends timeout while agent is working
- Base timeout: 300s (5 min) of inactivity
- Absolute max: 600s (10 min) hard limit
- Prevents false timeouts on complex multi-turn tests

### ✅ Context Loading Validation (NEW)
- 5 comprehensive tests covering simple and complex scenarios
- Verifies context files loaded before execution
- Multi-turn conversation support
- Proper file path extraction from SDK events

### ✅ Rule-Based Validation
- 5 evaluators check compliance with agent rules
- Tests behavior (tool usage, approvals), not style
- Model-agnostic test design

### ✅ Results Tracking & Visualization
- Type-safe JSON result generation
- Interactive web dashboard with filtering
- Pass rate trend charts
- CSV export functionality
- 60-day retention policy

---

## 📚 Documentation

| Document | Purpose | Audience |
|----------|---------|----------|
| **[GETTING_STARTED.md](GETTING_STARTED.md)** | Quick start guide | New users |
| **[HOW_TESTS_WORK.md](HOW_TESTS_WORK.md)** | Test execution details | Test authors |
| **[ARCHITECTURE.md](ARCHITECTURE.md)** | System architecture | Developers |
| **[framework/SDK_EVAL_README.md](framework/SDK_EVAL_README.md)** | Complete SDK guide | All users |
| **[framework/docs/test-design-guide.md](framework/docs/test-design-guide.md)** | Test design philosophy | Test authors |
| **[agents/openagent/CONTEXT_LOADING_COVERAGE.md](agents/openagent/CONTEXT_LOADING_COVERAGE.md)** | Context loading tests | OpenAgent users |
| **[agents/openagent/IMPLEMENTATION_SUMMARY.md](agents/openagent/IMPLEMENTATION_SUMMARY.md)** | Recent implementation | Developers |

---

## 🔧 Agent Differences

| Feature | OpenAgent | Opencoder |
|---------|-----------|-----------|
| **Approval** | Text-based + tool permissions | Tool permissions only |
| **Workflow** | Analyze→Approve→Execute→Validate | Direct execution |
| **Context** | Mandatory before execution | On-demand |
| **Test Style** | Multi-turn (approval flow) | Single prompt |
| **Timeout** | 300s (smart timeout) | 60s (standard) |

---

## 🎨 Usage Examples

### Run Tests

```bash
# All tests with free model
npm run eval:sdk

# Specific category
npm run eval:sdk -- --pattern="context-loading/*.yaml"

# Custom model
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Debug single test
npm run eval:sdk -- --pattern="ctx-simple-coding-standards.yaml" --debug

# Batch execution (avoid API limits)
./scripts/utils/run-tests-batch.sh openagent 3 10
```

### View Results

```bash
# Interactive dashboard (one command!)
cd results && ./serve.sh

# View JSON
cat results/latest.json

# Historical results
ls results/history/2025-11/
```

### Create New Test

```yaml
# Example: context-loading/my-test.yaml
id: my-test-001
name: "My Test"
description: What this test validates
category: developer
agent: openagent
model: anthropic/claude-sonnet-4-5

prompt: "Your test prompt here"

behavior:
  mustUseTools: [read]
  requiresContext: true
  minToolCalls: 1

expectedViolations:
  - rule: context-loading
    shouldViolate: false
    severity: error

approvalStrategy:
  type: auto-approve

timeout: 60000

tags:
  - context-loading
```

See [GETTING_STARTED.md](GETTING_STARTED.md) for more examples.

---

## 🏗️ Framework Components

### SDK Test Runner
- **ServerManager** - Start/stop opencode server
- **ClientManager** - Session and prompt management
- **EventStreamHandler** - Real-time event capture
- **TestRunner** - Test orchestration with evaluators
- **ApprovalStrategies** - Auto-approve, deny, smart rules

### Evaluators
- **ApprovalGateEvaluator** - Checks approval before tool execution
- **ContextLoadingEvaluator** - Verifies context files loaded first (FIXED)
- **DelegationEvaluator** - Validates delegation for 4+ files
- **ToolUsageEvaluator** - Checks bash vs specialized tools
- **BehaviorEvaluator** - Validates test-specific behavior expectations

### Results System
- **ResultSaver** - Type-safe JSON generation
- **Dashboard** - Interactive web visualization
- **Helper Scripts** - Easy deployment (`serve.sh`)

---

## 🔬 Test Schema (v2)

```yaml
# Behavior expectations (what agent should do)
behavior:
  mustUseTools: [read, write]       # Required tools
  mustUseAnyOf: [[bash], [list]]    # Alternative tools
  requiresApproval: true            # Must ask for approval
  requiresContext: true             # Must load context
  minToolCalls: 2                   # Minimum tool calls

# Expected violations (what rules to check)
expectedViolations:
  - rule: approval-gate
    shouldViolate: false            # Should NOT violate
    severity: error

  - rule: context-loading
    shouldViolate: false
    severity: error
```

---

## 📈 Recent Improvements

### November 26, 2025

1. **Context Loading Tests** (5 tests, 100% passing)
   - 3 simple tests (single prompt, read-only)
   - 2 complex tests (multi-turn with file creation)
   - Comprehensive coverage of context loading scenarios

2. **Smart Timeout System**
   - Activity monitoring prevents false timeouts
   - Base timeout: 300s inactivity
   - Absolute max: 600s hard limit
   - Handles complex multi-turn tests gracefully

3. **Fixed Context Loading Evaluator**
   - Corrected file path extraction (`tool.data.state.input.filePath`)
   - Multi-turn session support
   - Checks context for ALL executions, not just the first

4. **Batch Test Runner**
   - `run-tests-batch.sh` script
   - Configurable batch size and delays
   - Prevents API rate limits

5. **Results Dashboard**
   - Interactive web UI with filtering
   - Pass rate trend charts
   - CSV export
   - One-command deployment

---

## 🎯 Achievements

- ✅ Full SDK integration with `@opencode-ai/sdk@1.0.90`
- ✅ Real-time event streaming (12+ events per test)
- ✅ 5 evaluators integrated and working
- ✅ YAML-based test definitions with Zod validation
- ✅ CLI runner with detailed reporting
- ✅ Free model by default (no API costs)
- ✅ Model-agnostic test design
- ✅ Both positive and negative test support
- ✅ Smart timeout with activity monitoring
- ✅ Context loading validation (100% coverage)
- ✅ Results tracking and visualization
- ✅ Batch execution support

**Status:** ✅ Production-ready for OpenAgent & Opencoder evaluation

---

## 🤝 Contributing

See [../docs/contributing/CONTRIBUTING.md](../docs/contributing/CONTRIBUTING.md)

---

## 📄 License

MIT

---

## 🆘 Support

- **Getting Started**: [GETTING_STARTED.md](GETTING_STARTED.md)
- **How Tests Work**: [HOW_TESTS_WORK.md](HOW_TESTS_WORK.md)
- **Architecture**: [ARCHITECTURE.md](ARCHITECTURE.md)
- **Issues**: Check documentation or create an issue

---

**Last Updated**: 2025-11-26  
**Framework Version**: 0.1.0  
**Test Coverage**: 26 tests (22 OpenAgent, 4 Opencoder)  
**Pass Rate**: 100%
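
The smart timeout described under Key Features (300s of inactivity, 600s absolute maximum) can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: `checkTimeout`, `ActivityMonitor`, `SmartTimeoutOptions`, and the choice to treat the limits as inclusive (`>=`) are all assumptions made for the example. Only the two durations come from this README.

```typescript
// Hypothetical sketch: every name here is illustrative, not the framework's API.

type TimeoutState = "running" | "inactivity-timeout" | "absolute-timeout";

interface SmartTimeoutOptions {
  /** Maximum time (ms) allowed without any agent activity. */
  inactivityMs: number;
  /** Hard limit (ms) on the total duration of a test. */
  absoluteMaxMs: number;
}

// Durations taken from this README: 300s inactivity, 600s absolute max.
const DEFAULT_OPTIONS: SmartTimeoutOptions = {
  inactivityMs: 300 * 1000,
  absoluteMaxMs: 600 * 1000,
};

/** Pure check: all timestamps are in milliseconds on the same clock. */
function checkTimeout(
  startedAt: number,
  lastActivityAt: number,
  now: number,
  options: SmartTimeoutOptions = DEFAULT_OPTIONS
): TimeoutState {
  // The absolute limit wins even if the agent is still active.
  if (now - startedAt >= options.absoluteMaxMs) return "absolute-timeout";
  // Otherwise the test only times out after a quiet period.
  if (now - lastActivityAt >= options.inactivityMs) return "inactivity-timeout";
  return "running";
}

/** Tracks the most recent activity (for example, a streamed event). */
class ActivityMonitor {
  private readonly startedAt: number;
  private readonly options: SmartTimeoutOptions;
  private lastActivityAt: number;

  constructor(startedAt: number, options: SmartTimeoutOptions = DEFAULT_OPTIONS) {
    this.startedAt = startedAt;
    this.options = options;
    this.lastActivityAt = startedAt;
  }

  /** Call whenever the agent does something; this resets the inactivity window. */
  recordActivity(now: number): void {
    this.lastActivityAt = now;
  }

  state(now: number): TimeoutState {
    return checkTimeout(this.startedAt, this.lastActivityAt, now, this.options);
  }
}
```

In this sketch, each recorded activity pushes the inactivity deadline forward, so a long multi-turn test keeps running as long as events arrive, while the absolute maximum still bounds the total run time.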