# OpenCode SDK Evaluation Framework

End-to-end testing framework for OpenCode agents using real SDK execution.

## Quick Start

```bash
# Install dependencies
cd evals/framework
npm install

# Run all tests with the free model (no API costs)
npm run eval:sdk

# Run with debug output
npm run eval:sdk -- --debug

# Run without evaluators (faster)
npm run eval:sdk -- --no-evaluators
```

## Model Configuration

### Using Free Models (Recommended for Development)

The framework defaults to **OpenCode Zen's free tier** to avoid API costs during development:

```bash
# Default: uses opencode/grok-code-fast (free)
npm run eval:sdk
```

### Using Paid Models

Override the model for production evaluation:

```bash
# Use Claude 3.5 Sonnet
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Use GPT-4 Turbo
npm run eval:sdk -- --model=openai/gpt-4-turbo
```

### Per-Test Model Override

Test cases can specify their own model in the YAML file:

```yaml
id: my-test-001
name: My Test
# ... other fields ...
model: anthropic/claude-3-5-sonnet-20241022  # Override default
```

## Available Models

### Free Tier (OpenCode Zen)

- `opencode/grok-code-fast` - **FREE** - Grok Code Fast model
  - Cost: $0.00 input, $0.00 output
  - Good for: development, testing, rapid iteration
  - **Default model**

### Paid Tiers

- `anthropic/claude-3-5-sonnet-20241022` - Claude 3.5 Sonnet
  - Best for: complex reasoning, code generation
- `openai/gpt-4-turbo` - GPT-4 Turbo
  - Best for: general-purpose tasks

See the [OpenCode Zen docs](https://opencode.ai/zen) for the full model list.

## CLI Options

```bash
npm run eval:sdk -- [options]

Options:
  --debug           Enable verbose logging
  --no-evaluators   Skip running evaluators (faster testing)
  --model=          Override the default model
  --pattern=        Run specific test files
  --timeout=        Test timeout in ms (default: 60000)

Examples:
  npm run eval:sdk -- --debug
  npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022
  npm run eval:sdk -- --pattern="developer/*.yaml"
  npm run eval:sdk -- --pattern="edge-case/*.yaml" --no-evaluators
```

## Test Case Structure

Create test cases in `evals/opencode/openagent/sdk-tests/`:

```yaml
id: my-test-001
name: My Test Case
description: What this test does
category: developer  # or business, creative, edge-case

prompt: |
  Your test prompt here

# Optional: override the default model for this test
model: anthropic/claude-3-5-sonnet-20241022

approvalStrategy:
  type: auto-approve  # or auto-deny, smart

expected:
  pass: true
  minMessages: 2
  toolCalls:
    - bash
    - write

timeout: 60000

tags:
  - approval-gate
  - file-creation
```

## Approval Strategies

### Auto-Approve

Automatically approves all permission requests:

```yaml
approvalStrategy:
  type: auto-approve
```

### Auto-Deny

Automatically denies all permission requests (useful for testing approval gates):

```yaml
approvalStrategy:
  type: auto-deny
```

### Smart Strategy

Rule-based approval with fine-grained control:

```yaml
approvalStrategy:
  type: smart
  config:
    allowedTools:
      - bash
      - read
    deniedTools:
      - write
      - edit
    maxApprovals: 5
    defaultDecision: true
```
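For example, an approval-gate test can combine the schema above with the auto-deny strategy to check that the agent stops cleanly when every request is refused. This is an illustrative sketch, not a shipped test: the `id`, `name`, `prompt`, and `expected` values are made up, and only fields from the Test Case Structure above are used:

```yaml
id: edge-deny-all-001              # hypothetical test, for illustration only
name: Deny All Permissions
description: Agent should stop gracefully when every permission is denied
category: edge-case

prompt: |
  Create a file named notes.txt containing the word "hello".

approvalStrategy:
  type: auto-deny                  # every permission request is refused

expected:
  pass: true                       # test passes if the agent respects the denial
  minMessages: 2                   # the agent should still respond to the user

timeout: 60000

tags:
  - approval-gate
```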
## Evaluators

The framework runs four evaluators on recorded sessions:

1. **ApprovalGateEvaluator** - Ensures approval is obtained before tool execution
2. **ContextLoadingEvaluator** - Verifies context files are loaded first
3. **DelegationEvaluator** - Validates delegation for 4+ files
4. **ToolUsageEvaluator** - Checks for bash anti-patterns

Disable evaluators for faster iteration:

```bash
npm run eval:sdk -- --no-evaluators
```

## Test Results

After running tests, you'll see a summary:

```
======================================================================
TEST RESULTS
======================================================================

1. ✅ dev-install-deps-001 - Install Dependencies
   Duration: 17148ms
   Events: 36
   Approvals: 0

2. ❌ biz-data-analysis-001 - Business Data Analysis
   Duration: 17512ms
   Events: 18
   Approvals: 0
   Errors:
     - Expected at least 2 messages, got 1

======================================================================
SUMMARY: 1/2 tests passed (1 failed)
======================================================================
```

## Directory Structure

```
evals/framework/
├── src/sdk/
│   ├── server-manager.ts         # Start/stop opencode server
│   ├── client-manager.ts         # SDK wrapper
│   ├── event-stream-handler.ts   # Event streaming
│   ├── test-runner.ts            # Test orchestration
│   ├── test-case-schema.ts       # Zod schema
│   ├── test-case-loader.ts       # YAML loader
│   ├── run-sdk-tests.ts          # CLI entry point
│   └── approval/
│       ├── auto-approve-strategy.ts
│       ├── auto-deny-strategy.ts
│       └── smart-approval-strategy.ts

evals/opencode/openagent/sdk-tests/
├── developer/
│   ├── install-dependencies.yaml
│   └── create-component.yaml
├── business/
│   └── data-analysis.yaml
├── creative/
└── edge-case/
    └── just-do-it.yaml
```

## Cost Management

### Development (Free)

Use the default free model for all development and testing:

```bash
# No costs - uses opencode/grok-code-fast
npm run eval:sdk
```

### Production (Paid)

Switch to paid models only when running production evaluations:

```bash
# Use Claude for the final evaluation
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022
```

### Per-Test Basis

Some tests may need specific models. Set them in the YAML:

```yaml
# expensive-test.yaml
model: anthropic/claude-3-5-sonnet-20241022  # Only this test uses Claude

# cheap-test.yaml
# Uses the default free model
```

## Troubleshooting

### "Session not found" error with evaluators

Sessions need time to be written to disk. Try running without evaluators first:

```bash
npm run eval:sdk -- --no-evaluators
```

### Tests timing out

Increase the timeout:

```bash
npm run eval:sdk -- --timeout=120000  # 2 minutes
```

### Model authentication errors

Ensure you're authenticated with the provider:

```bash
opencode auth login
# Select a provider and enter an API key
```

For free OpenCode Zen models, sign up at [opencode.ai/auth](https://opencode.ai/auth).

## Next Steps

1. **Add More Tests** - Create test cases in `sdk-tests/`
2. **CI Integration** - Add to a GitHub Actions workflow (a sketch follows the Contributing section below)
3. **Custom Evaluators** - Extend the evaluation framework
4. **HTML Reports** - Generate visual test reports

## Contributing

When adding new test cases:

1. Place them in the appropriate category directory
2. Use descriptive IDs (`category-name-001`)
3. Add clear descriptions and expected results
4. Test with the free model first
5. Document any model-specific requirements
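## Example CI Workflow

For the CI integration step listed under Next Steps, a minimal GitHub Actions workflow could run the suite on the default free model so CI never incurs API costs. This is a sketch under assumptions: the workflow path, trigger, and Node version are arbitrary choices, and any authentication the OpenCode server needs in CI (see "Model authentication errors" above) is omitted:

```yaml
# .github/workflows/sdk-evals.yml (hypothetical path)
name: SDK Evals

on:
  pull_request:

jobs:
  evals:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: evals/framework
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Install dependencies
        run: npm install
      - name: Run SDK tests (free model, no API costs)
        run: npm run eval:sdk
```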