End-to-end testing framework for OpenCode agents using real SDK execution.
# Install dependencies
cd evals/framework
npm install
# Run all tests with free model (no API costs)
npm run eval:sdk
# Run with debug output
npm run eval:sdk -- --debug
# Run without evaluators (faster)
npm run eval:sdk -- --no-evaluators
The framework defaults to OpenCode Zen's free tier to avoid API costs during development:
# Default: Uses opencode/grok-code-fast (free)
npm run eval:sdk
Override the model for production evaluation:
# Use Claude 3.5 Sonnet
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022
# Use GPT-4 Turbo
npm run eval:sdk -- --model=openai/gpt-4-turbo
Test cases can specify their own model in the YAML file:
id: my-test-001
name: My Test
# ... other fields ...
model: anthropic/claude-3-5-sonnet-20241022 # Override default
opencode/grok-code-fast - FREE - Grok Code Fast model
anthropic/claude-3-5-sonnet-20241022 - Claude 3.5 Sonnet
openai/gpt-4-turbo - GPT-4 Turbo
See OpenCode Zen docs for full model list.
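The way the default, the `--model` flag, and a per-test `model:` field combine can be sketched as below. This is only an illustration: `resolveModel`, `splitModelId`, and the precedence order (test YAML first, then CLI flag, then the free default) are assumptions, not taken from the framework's source.

```typescript
// Hypothetical helper names; the precedence order below is an assumption.
const DEFAULT_MODEL = "opencode/grok-code-fast";

interface ModelSources {
  testCaseModel?: string; // `model:` field from the test's YAML file
  cliModel?: string;      // value passed as --model=<provider/model>
}

// Assumed precedence: per-test YAML, then CLI flag, then the free default.
function resolveModel(sources: ModelSources): string {
  return sources.testCaseModel ?? sources.cliModel ?? DEFAULT_MODEL;
}

// Splits "<provider>/<model>" at the first slash and rejects malformed IDs.
function splitModelId(id: string): { provider: string; model: string } {
  const slash = id.indexOf("/");
  if (slash <= 0 || slash === id.length - 1) {
    throw new Error(`Expected "<provider>/<model>", got "${id}"`);
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}
```

If the real runner gives the CLI flag priority over the YAML field, swap the first two operands in `resolveModel`.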
npm run eval:sdk -- [options]
Options:
--debug Enable verbose logging
--no-evaluators Skip running evaluators (faster testing)
--model=<provider/model> Override default model
--pattern=<glob> Run specific test files
--timeout=<ms> Test timeout (default: 60000)
Examples:
npm run eval:sdk -- --debug
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022
npm run eval:sdk -- --pattern="developer/*.yaml"
npm run eval:sdk -- --pattern="edge-case/*.yaml" --no-evaluators
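For readers extending the CLI, the options above could be parsed roughly as follows. This is a minimal sketch, not the code in `run-sdk-tests.ts`: `CliOptions` and `parseCliArgs` are hypothetical names, and rejecting unknown flags is an assumed behavior.

```typescript
// Hypothetical shape for the documented flags; not the framework's real type.
interface CliOptions {
  debug: boolean;
  noEvaluators: boolean;
  model?: string;
  pattern?: string;
  timeout: number; // milliseconds
}

function parseCliArgs(argv: string[]): CliOptions {
  // 60000 ms matches the documented default for --timeout.
  const options: CliOptions = { debug: false, noEvaluators: false, timeout: 60000 };

  for (const arg of argv) {
    if (arg === "--debug") {
      options.debug = true;
    } else if (arg === "--no-evaluators") {
      options.noEvaluators = true;
    } else if (arg.startsWith("--model=")) {
      options.model = arg.slice("--model=".length);
    } else if (arg.startsWith("--pattern=")) {
      options.pattern = arg.slice("--pattern=".length);
    } else if (arg.startsWith("--timeout=")) {
      const ms = Number(arg.slice("--timeout=".length));
      if (!Number.isInteger(ms) || ms <= 0) {
        throw new Error(`--timeout expects a positive integer, got "${arg}"`);
      }
      options.timeout = ms;
    } else {
      // Assumption: unknown flags are treated as errors rather than ignored.
      throw new Error(`Unknown option: ${arg}`);
    }
  }
  return options;
}
```

In a real entry point the input would typically be `process.argv.slice(2)`.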
Create test cases in evals/opencode/openagent/sdk-tests/:
id: my-test-001
name: My Test Case
description: What this test does
category: developer # or business, creative, edge-case
prompt: |
  Your test prompt here
# Optional: Override default model for this test
model: anthropic/claude-3-5-sonnet-20241022
approvalStrategy:
  type: auto-approve # or auto-deny, smart
expected:
  pass: true
  minMessages: 2
  toolCalls:
    - bash
    - write
timeout: 60000
tags:
  - approval-gate
  - file-creation
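The fields in the test case above can be checked before a run with something like the following. The framework itself uses a Zod schema (`test-case-schema.ts`); this is a dependency-free stand-in, and which fields are required versus optional is an assumption made for the example. `validateTestCase` is a hypothetical name.

```typescript
// Plain TypeScript stand-in for the Zod schema in test-case-schema.ts.
// Which fields are required is an assumption, not taken from the real schema.
const CATEGORIES = ["developer", "business", "creative", "edge-case"];
const APPROVAL_TYPES = ["auto-approve", "auto-deny", "smart"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isStringArray(value: unknown): boolean {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Returns a list of problems; an empty list means the test case looks valid.
function validateTestCase(raw: unknown): string[] {
  if (!isRecord(raw)) return ["test case must be a mapping"];
  const errors: string[] = [];

  for (const field of ["id", "name", "description", "prompt"]) {
    const value = raw[field];
    if (typeof value !== "string" || value.trim() === "") {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  if (!CATEGORIES.includes(String(raw["category"]))) {
    errors.push(`category must be one of: ${CATEGORIES.join(", ")}`);
  }
  const strategy = raw["approvalStrategy"];
  if (!isRecord(strategy) || !APPROVAL_TYPES.includes(String(strategy["type"]))) {
    errors.push(`approvalStrategy.type must be one of: ${APPROVAL_TYPES.join(", ")}`);
  }
  const model = raw["model"];
  if (model !== undefined && typeof model !== "string") {
    errors.push("model must be a string when present");
  }
  const timeout = raw["timeout"];
  if (timeout !== undefined && !(typeof timeout === "number" && timeout > 0)) {
    errors.push("timeout must be a positive number of milliseconds");
  }
  const tags = raw["tags"];
  if (tags !== undefined && !isStringArray(tags)) {
    errors.push("tags must be a list of strings");
  }
  return errors;
}
```

The input would be the object produced by whatever YAML parser the loader uses.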
Automatically approves all permission requests:
approvalStrategy:
  type: auto-approve
Automatically denies all permission requests (for testing approval gates):
approvalStrategy:
  type: auto-deny
Rule-based approval with fine-grained control:
approvalStrategy:
  type: smart
  config:
    allowedTools:
      - bash
      - read
    deniedTools:
      - write
      - edit
    maxApprovals: 5
    defaultDecision: true
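One plausible reading of how these rules interact is sketched below. This is not the code in `smart-approval-strategy.ts`: the evaluation order (deny list first, then allow list, then `defaultDecision`, all capped by `maxApprovals`) is an assumption, and `createSmartApprover` is a hypothetical name.

```typescript
// Mirrors the `config` keys shown above; all fields treated as optional here.
interface SmartApprovalConfig {
  allowedTools?: string[];
  deniedTools?: string[];
  maxApprovals?: number;
  defaultDecision?: boolean;
}

// Returns a stateful decision function that counts the approvals it grants.
function createSmartApprover(config: SmartApprovalConfig): (tool: string) => boolean {
  const allowed = config.allowedTools ?? [];
  const denied = config.deniedTools ?? [];
  const maxApprovals = config.maxApprovals ?? Infinity;
  const defaultDecision = config.defaultDecision ?? false;
  let granted = 0;

  return (tool: string): boolean => {
    // Assumption: an explicit deny always wins.
    if (denied.includes(tool)) return false;
    // Tools on neither list fall back to defaultDecision.
    const wouldApprove = allowed.includes(tool) || defaultDecision;
    // Assumption: once the approval budget is spent, everything is denied.
    if (!wouldApprove || granted >= maxApprovals) return false;
    granted += 1;
    return true;
  };
}
```

With the configuration above, `bash` and `read` would be approved, `write` and `edit` denied, any other tool approved by default, and every request denied after the fifth approval.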
The framework runs four evaluators on recorded sessions.
Disable evaluators for faster iteration:
npm run eval:sdk -- --no-evaluators
After running tests, you'll see a summary:
======================================================================
TEST RESULTS
======================================================================
1. ✅ dev-install-deps-001 - Install Dependencies
Duration: 17148ms
Events: 36
Approvals: 0
2. ❌ biz-data-analysis-001 - Business Data Analysis
Duration: 17512ms
Events: 18
Approvals: 0
Errors:
- Expected at least 2 messages, got 1
======================================================================
SUMMARY: 1/2 tests passed (1 failed)
======================================================================
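The error line and the summary line in this report could be produced along these lines. The sketch only mirrors the output shown above; `TestResultSketch`, `checkMinMessages`, and `formatSummary` are hypothetical names, not the test runner's actual API.

```typescript
// Hypothetical result record, modeled on the fields printed in the report.
interface TestResultSketch {
  id: string;
  name: string;
  durationMs: number;
  events: number;
  approvals: number;
  errors: string[];
}

// Compares the observed message count against `expected.minMessages`.
function checkMinMessages(minMessages: number, actual: number): string[] {
  return actual >= minMessages
    ? []
    : [`Expected at least ${minMessages} messages, got ${actual}`];
}

// A test counts as passed when it recorded no errors.
function formatSummary(results: TestResultSketch[]): string {
  const passed = results.filter((r) => r.errors.length === 0).length;
  const failed = results.length - passed;
  return `SUMMARY: ${passed}/${results.length} tests passed (${failed} failed)`;
}

const exampleResults: TestResultSketch[] = [
  {
    id: "dev-install-deps-001",
    name: "Install Dependencies",
    durationMs: 17148,
    events: 36,
    approvals: 0,
    errors: checkMinMessages(2, 2),
  },
  {
    id: "biz-data-analysis-001",
    name: "Business Data Analysis",
    durationMs: 17512,
    events: 18,
    approvals: 0,
    errors: checkMinMessages(2, 1),
  },
];

console.log(formatSummary(exampleResults));
```

Deriving pass or fail from the error list is a simplification; the real runner may also compare against `expected.pass`.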
evals/framework/
├── src/sdk/
│   ├── server-manager.ts         # Start/stop opencode server
│   ├── client-manager.ts         # SDK wrapper
│   ├── event-stream-handler.ts   # Event streaming
│   ├── test-runner.ts            # Test orchestration
│   ├── test-case-schema.ts       # Zod schema
│   ├── test-case-loader.ts       # YAML loader
│   ├── run-sdk-tests.ts          # CLI entry point
│   └── approval/
│       ├── auto-approve-strategy.ts
│       ├── auto-deny-strategy.ts
│       └── smart-approval-strategy.ts

evals/opencode/openagent/sdk-tests/
├── developer/
│   ├── install-dependencies.yaml
│   └── create-component.yaml
├── business/
│   └── data-analysis.yaml
├── creative/
└── edge-case/
    └── just-do-it.yaml
Use the default free model for all development and testing:
# No costs - uses opencode/grok-code-fast
npm run eval:sdk
Switch to paid models only when running production evaluations:
# Use Claude for final evaluation
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022
Some tests might need specific models. Set them in the YAML:
# expensive-test.yaml
model: anthropic/claude-3-5-sonnet-20241022 # Only this test uses Claude
# cheap-test.yaml
# Uses default free model
Sessions need time to be written to disk. Try running without evaluators first:
npm run eval:sdk -- --no-evaluators
Increase the timeout:
npm run eval:sdk -- --timeout=120000 # 2 minutes
Ensure you're authenticated with the provider:
opencode auth login
# Select provider and enter API key
For free OpenCode Zen models, sign up at opencode.ai/auth.
When adding new test cases, place each file in the matching category directory under sdk-tests/ and give it a unique ID (e.g. category-name-001).