
OpenCode Agent Evaluation Framework

Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.

Quick Start

cd evals/framework
npm install
npm run build

# Run all tests (uses free model by default)
npm run eval:sdk

# Run with specific model
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Run specific tests only
npm run eval:sdk -- --pattern="developer/*.yaml"

# Debug mode
npm run eval:sdk -- --debug

Directory Structure

evals/
├── framework/                    # Core evaluation framework
│   ├── src/
│   │   ├── sdk/                 # SDK-based test runner
│   │   │   ├── server-manager.ts
│   │   │   ├── client-manager.ts
│   │   │   ├── event-stream-handler.ts
│   │   │   ├── test-runner.ts
│   │   │   ├── test-case-schema.ts
│   │   │   ├── test-case-loader.ts
│   │   │   ├── run-sdk-tests.ts        # CLI entry point
│   │   │   ├── show-test-details.ts    # Debug tool
│   │   │   └── approval/               # Approval strategies
│   │   ├── collector/           # Session data collection
│   │   ├── evaluators/          # Rule violation detection
│   │   └── types/               # TypeScript types
│   ├── docs/
│   │   └── test-design-guide.md # Test design philosophy
│   ├── SDK_EVAL_README.md       # Comprehensive SDK guide
│   ├── README.md                # Framework documentation
│   └── package.json
│
├── agents/openagent/            # OpenAgent-specific tests
│   ├── tests/                   # YAML test cases
│   │   ├── developer/           # Developer workflow tests
│   │   ├── business/            # Business analysis tests
│   │   ├── creative/            # Content creation tests
│   │   ├── edge-case/           # Edge case tests
│   │   └── simple/              # Synthetic test data
│   ├── docs/
│   │   ├── OPENAGENT_RULES.md   # Rules from openagent.md
│   │   └── TEST_SCENARIOS.md    # Test scenario catalog
│   ├── README.md                # OpenAgent test overview
│   └── TEST_RESULTS.md          # Test results summary
│
└── results/                     # Test outputs (gitignored)

Key Features

✅ SDK-Based Execution

  • Uses official @opencode-ai/sdk for real agent interaction
  • Real-time event streaming (10+ events per test)
  • Actual session recording to disk

✅ Cost-Aware Testing

  • FREE by default - Uses opencode/grok-code-fast (OpenCode Zen)
  • Override per-test or via CLI: --model=provider/model
  • No accidental API costs during development

✅ Rule-Based Validation

  • 4 evaluators check compliance with openagent.md rules
  • Tests behavior (tool usage, approvals), not style (message counts)
  • Model-agnostic test design

✅ Flexible Approval Handling

  • Auto-approve for happy path testing
  • Auto-deny for violation detection
  • Smart strategies with custom rules
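
The strategy types above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual API; `ApprovalStrategy`, `ApprovalRequest`, and `DenyListStrategy` are hypothetical names:

```typescript
// Hypothetical sketch of a "smart" approval strategy: approve everything
// except tools on a deny list. Useful for checking that the agent degrades
// gracefully when a destructive tool is refused.
interface ApprovalRequest {
  tool: string;       // tool the agent wants to run, e.g. "bash"
  command?: string;   // raw command text, when available
}

interface ApprovalStrategy {
  decide(request: ApprovalRequest): "approve" | "deny";
}

class DenyListStrategy implements ApprovalStrategy {
  constructor(private denied: Set<string>) {}

  decide(request: ApprovalRequest): "approve" | "deny" {
    return this.denied.has(request.tool) ? "deny" : "approve";
  }
}

const strategy = new DenyListStrategy(new Set(["bash"]));
console.log(strategy.decide({ tool: "bash" })); // "deny"
console.log(strategy.decide({ tool: "read" })); // "approve"
```

The same shape covers the built-in cases: auto-approve is an empty deny list, auto-deny is a strategy that always returns "deny".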

Documentation

Document                             Purpose                      Audience
SDK_EVAL_README.md                   Complete SDK testing guide   All users
docs/test-design-guide.md            Test design philosophy       Test authors
openagent/docs/OPENAGENT_RULES.md    Rules reference              Test authors
openagent/docs/TEST_SCENARIOS.md     Test scenario catalog        Test authors

Usage Examples

Run SDK Tests

# All tests with free model
npm run eval:sdk

# Specific category
npm run eval:sdk -- --pattern="developer/*.yaml"

# Custom model
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Debug single test
npx tsx src/sdk/show-test-details.ts developer/install-dependencies.yaml

Create New Tests

# Example: developer/my-test.yaml
id: dev-my-test-001
name: My Test
description: What this test does

category: developer
prompt: "Your test prompt here"

# Behavior expectations (preferred)
behavior:
  mustUseTools: [bash]
  requiresApproval: true

# Expected violations
expectedViolations:
  - rule: approval-gate
    shouldViolate: false    # Should NOT violate
    severity: error

approvalStrategy:
  type: auto-approve

timeout: 60000
tags:
  - approval-gate
  - v2-schema

See test-design-guide.md for best practices.

Framework Components

SDK Test Runner

  • ServerManager - Start/stop opencode server
  • ClientManager - Session and prompt management
  • EventStreamHandler - Real-time event capture
  • TestRunner - Test orchestration with evaluators
  • ApprovalStrategies - Auto-approve, deny, smart rules

Evaluators

  • ApprovalGateEvaluator - Checks approval before tool execution
  • ContextLoadingEvaluator - Verifies context files loaded first
  • DelegationEvaluator - Validates delegation for 4+ files
  • ToolUsageEvaluator - Checks bash vs specialized tools

Test Schema (v2)

behavior:              # What agent should do
  mustUseTools: []
  requiresApproval: bool
  shouldDelegate: bool

expectedViolations:    # What rules to check
  - rule: approval-gate
    shouldViolate: false

See SDK_EVAL_README.md for complete API.
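
As a rough TypeScript rendering of the v2 shape above (the real framework validates with Zod; the type and guard names here are illustrative):

```typescript
// Illustrative types for the v2 test schema, plus a minimal hand-rolled
// guard. The real loader uses Zod; field names follow the YAML example above.
interface Behavior {
  mustUseTools: string[];
  requiresApproval?: boolean;
  shouldDelegate?: boolean;
}

interface ExpectedViolation {
  rule: string;
  shouldViolate: boolean;
  severity?: "error" | "warning";
}

interface TestCaseV2 {
  id: string;
  prompt: string;
  behavior?: Behavior;
  expectedViolations?: ExpectedViolation[];
}

// Minimal structural check: id and prompt are the required fields.
function isTestCaseV2(value: unknown): value is TestCaseV2 {
  const v = value as Partial<TestCaseV2>;
  return typeof v?.id === "string" && typeof v?.prompt === "string";
}

const candidate = { id: "dev-my-test-001", prompt: "Your test prompt here" };
console.log(isTestCaseV2(candidate)); // true
```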

Test Results

npm run eval:sdk

# Output:
======================================================================
TEST RESULTS
======================================================================

1. ✅ dev-install-deps-002 - Install Dependencies (v2)
   Duration: 10659ms
   Events: 12
   Approvals: 0

2. ❌ biz-data-analysis-001 - Business Data Analysis
   Duration: 17512ms
   Events: 18
   Errors:
     - Expected tool calls but no approvals requested

======================================================================
SUMMARY: 1/2 tests passed (1 failed)
======================================================================

Model Configuration

Free Tier (Default)

# Uses opencode/grok-code-fast (free)
npm run eval:sdk

Paid Models

# Claude 3.5 Sonnet
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# GPT-4 Turbo
npm run eval:sdk -- --model=openai/gpt-4-turbo

Per-Test Override

# In test YAML file
model: anthropic/claude-3-5-sonnet-20241022

Development

Run Framework Tests

cd evals/framework
npm test

Build Framework

npm run build

Add New Evaluator

  1. Create in src/evaluators/
  2. Extend BaseEvaluator
  3. Implement evaluate() method
  4. Register in EvaluatorRunner
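
A minimal sketch of those four steps, assuming a `BaseEvaluator` with an abstract `evaluate()` over captured session events. All names here are illustrative; match the real signatures in `src/evaluators/` when implementing:

```typescript
// Hypothetical custom evaluator: flag any `rm` issued through the bash tool.
interface SessionEvent {
  type: string;    // e.g. "tool.start"
  tool?: string;   // e.g. "bash"
  input?: string;  // raw command text, when present
}

interface Violation {
  rule: string;
  message: string;
  severity: "error" | "warning";
}

abstract class BaseEvaluator {
  abstract evaluate(events: SessionEvent[]): Violation[];
}

class NoRmEvaluator extends BaseEvaluator {
  evaluate(events: SessionEvent[]): Violation[] {
    return events
      .filter((e) => e.tool === "bash" && /\brm\b/.test(e.input ?? ""))
      .map((e): Violation => ({
        rule: "no-rm",
        message: `Destructive command: ${e.input}`,
        severity: "error",
      }));
  }
}

const violations = new NoRmEvaluator().evaluate([
  { type: "tool.start", tool: "bash", input: "rm -rf node_modules" },
  { type: "tool.start", tool: "read" },
]);
console.log(violations.length); // 1
```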

Debug Tests

# Show detailed test execution
npx tsx src/sdk/show-test-details.ts path/to/test.yaml

# Check session files
ls ~/.local/share/opencode/storage/session/

CI/CD Integration

# .github/workflows/eval.yml
name: Agent Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
      - run: npm install && npm run build
        working-directory: evals/framework
      - run: npm run eval:sdk -- --no-evaluators
        working-directory: evals/framework

Configuration

Default Model

Edit src/sdk/test-runner.ts:

defaultModel: config.defaultModel || 'opencode/grok-code-fast'

Evaluators

Enable/disable in TestRunner:

runEvaluators: config.runEvaluators ?? true

Achievements

✅ Full SDK integration with @opencode-ai/sdk@1.0.90
✅ Real-time event streaming (12+ events per test)
✅ 4 evaluators integrated and working
✅ YAML-based test definitions with Zod validation
✅ CLI runner with detailed reporting
✅ Free model by default (no API costs)
✅ Model-agnostic test design
✅ Both positive and negative test support

Status: Production-ready for OpenAgent evaluation

Contributing

See CONTRIBUTING.md

License

MIT