SDK_EVAL_README.md 6.7 KB

OpenCode SDK Evaluation Framework

End-to-end testing framework for OpenCode agents using real SDK execution.

Quick Start

# Install dependencies
cd evals/framework
npm install

# Run all tests with free model (no API costs)
npm run eval:sdk

# Run with debug output
npm run eval:sdk -- --debug

# Run without evaluators (faster)
npm run eval:sdk -- --no-evaluators

Model Configuration

Using Free Models (Recommended for Development)

The framework defaults to OpenCode Zen's free tier to avoid API costs during development:

# Default: Uses opencode/grok-code-fast (free)
npm run eval:sdk

Using Paid Models

Override the model for production evaluation:

# Use Claude 3.5 Sonnet
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Use GPT-4 Turbo
npm run eval:sdk -- --model=openai/gpt-4-turbo

Per-Test Model Override

Test cases can specify their own model in the YAML file:

id: my-test-001
name: My Test
# ... other fields ...
model: anthropic/claude-3-5-sonnet-20241022  # Override default

Available Models

Free Tier (OpenCode Zen)

  • opencode/grok-code-fast - FREE - Grok Code Fast model
    • Cost: $0.00 input, $0.00 output
    • Good for: Development, testing, rapid iteration
    • Default model

Paid Tiers

  • anthropic/claude-3-5-sonnet-20241022 - Claude 3.5 Sonnet

    • Best for: Complex reasoning, code generation
  • openai/gpt-4-turbo - GPT-4 Turbo

    • Best for: General purpose tasks

See OpenCode Zen docs for full model list.

CLI Options

npm run eval:sdk -- [options]

Options:
  --debug                    Enable verbose logging
  --no-evaluators           Skip running evaluators (faster testing)
  --model=<provider/model>  Override default model
  --pattern=<glob>          Run specific test files
  --timeout=<ms>            Test timeout (default: 60000)

Examples:
  npm run eval:sdk -- --debug
  npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022
  npm run eval:sdk -- --pattern="developer/*.yaml"
  npm run eval:sdk -- --pattern="edge-case/*.yaml" --no-evaluators

Test Case Structure

Create test cases in evals/opencode/openagent/sdk-tests/:

id: my-test-001
name: My Test Case
description: What this test does
category: developer  # or business, creative, edge-case

prompt: |
  Your test prompt here

# Optional: Override default model for this test
model: anthropic/claude-3-5-sonnet-20241022

approvalStrategy:
  type: auto-approve  # or auto-deny, smart

expected:
  pass: true
  minMessages: 2
  toolCalls:
    - bash
    - write
  
timeout: 60000
tags:
  - approval-gate
  - file-creation

Approval Strategies

Auto-Approve

Automatically approves all permission requests:

approvalStrategy:
  type: auto-approve

Auto-Deny

Automatically denies all permission requests (for testing approval gates):

approvalStrategy:
  type: auto-deny

Smart Strategy

Rule-based approval with fine-grained control:

approvalStrategy:
  type: smart
  config:
    allowedTools:
      - bash
      - read
    deniedTools:
      - write
      - edit
    maxApprovals: 5
    defaultDecision: true

Evaluators

The framework runs 4 evaluators on recorded sessions:

  1. ApprovalGateEvaluator - Ensures approval before tool execution
  2. ContextLoadingEvaluator - Verifies context files loaded first
  3. DelegationEvaluator - Validates delegation for 4+ files
  4. ToolUsageEvaluator - Checks for bash anti-patterns

Disable evaluators for faster iteration:

npm run eval:sdk -- --no-evaluators

Test Results

After running tests, you'll see a summary:

======================================================================
TEST RESULTS
======================================================================

1. ✅ dev-install-deps-001 - Install Dependencies
   Duration: 17148ms
   Events: 36
   Approvals: 0

2. ❌ biz-data-analysis-001 - Business Data Analysis
   Duration: 17512ms
   Events: 18
   Approvals: 0
   Errors:
     - Expected at least 2 messages, got 1

======================================================================
SUMMARY: 1/2 tests passed (1 failed)
======================================================================

Directory Structure

evals/framework/
├── src/sdk/
│   ├── server-manager.ts       # Start/stop opencode server
│   ├── client-manager.ts       # SDK wrapper
│   ├── event-stream-handler.ts # Event streaming
│   ├── test-runner.ts          # Test orchestration
│   ├── test-case-schema.ts     # Zod schema
│   ├── test-case-loader.ts     # YAML loader
│   ├── run-sdk-tests.ts        # CLI entry point
│   └── approval/
│       ├── auto-approve-strategy.ts
│       ├── auto-deny-strategy.ts
│       └── smart-approval-strategy.ts
│
evals/opencode/openagent/sdk-tests/
├── developer/
│   ├── install-dependencies.yaml
│   └── create-component.yaml
├── business/
│   └── data-analysis.yaml
├── creative/
└── edge-case/
    └── just-do-it.yaml

Cost Management

Development (Free)

Use the default free model for all development and testing:

# No costs - uses opencode/grok-code-fast
npm run eval:sdk

Production (Paid)

Switch to paid models only when running production evaluations:

# Use Claude for final evaluation
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

Per-Test Basis

Some tests might need specific models. Set them in the YAML:

# expensive-test.yaml
model: anthropic/claude-3-5-sonnet-20241022  # Only this test uses Claude

# cheap-test.yaml
# Uses default free model

Troubleshooting

"Session not found" error with evaluators

Sessions need time to be written to disk. Try running without evaluators first:

npm run eval:sdk -- --no-evaluators

Tests timing out

Increase the timeout:

npm run eval:sdk -- --timeout=120000  # 2 minutes

Model authentication errors

Ensure you're authenticated with the provider:

opencode auth login
# Select provider and enter API key

For free OpenCode Zen models, sign up at opencode.ai/auth.

Next Steps

  1. Add More Tests - Create test cases in sdk-tests/
  2. CI Integration - Add to GitHub Actions workflow
  3. Custom Evaluators - Extend the evaluation framework
  4. HTML Reports - Generate visual test reports

Contributing

When adding new test cases:

  1. Place in appropriate category directory
  2. Use descriptive IDs (category-name-001)
  3. Add clear descriptions and expected results
  4. Test with free model first
  5. Document any model-specific requirements