# OpenCode Agent Evaluation Framework

Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.

## Quick Start

```bash
cd evals/framework
npm install
npm run build

# Run all tests (uses free model by default)
npm run eval:sdk

# Run with specific model
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Run specific tests only
npm run eval:sdk -- --pattern="developer/*.yaml"

# Debug mode
npm run eval:sdk -- --debug
```

## Directory Structure

```
evals/
├── framework/                    # Core evaluation framework
│   ├── src/
│   │   ├── sdk/                 # SDK-based test runner
│   │   │   ├── server-manager.ts
│   │   │   ├── client-manager.ts
│   │   │   ├── event-stream-handler.ts
│   │   │   ├── test-runner.ts
│   │   │   ├── test-case-schema.ts
│   │   │   ├── test-case-loader.ts
│   │   │   ├── run-sdk-tests.ts        # CLI entry point
│   │   │   ├── show-test-details.ts    # Debug tool
│   │   │   └── approval/               # Approval strategies
│   │   ├── collector/           # Session data collection
│   │   ├── evaluators/          # Rule violation detection
│   │   └── types/               # TypeScript types
│   ├── docs/
│   │   └── test-design-guide.md # Test design philosophy
│   ├── SDK_EVAL_README.md       # Comprehensive SDK guide
│   ├── README.md                # Framework documentation
│   └── package.json
│
├── agents/openagent/            # OpenAgent-specific tests
│   ├── tests/                   # YAML test cases
│   │   ├── developer/           # Developer workflow tests
│   │   ├── business/            # Business analysis tests
│   │   ├── creative/            # Content creation tests
│   │   └── edge-case/           # Edge case tests
│   ├── tests/simple/            # Synthetic test data
│   ├── docs/
│   │   ├── OPENAGENT_RULES.md   # Rules from openagent.md
│   │   └── TEST_SCENARIOS.md    # Test scenario catalog
│   ├── README.md                # OpenAgent test overview
│   └── TEST_RESULTS.md          # Test results summary
│
└── results/                     # Test outputs (gitignored)
```

## Key Features

### ✅ SDK-Based Execution
- Uses official `@opencode-ai/sdk` for real agent interaction
- Real-time event streaming (10+ events per test)
- Actual session recording to disk

### ✅ Cost-Aware Testing
- **FREE by default** - Uses `opencode/grok-code-fast` (OpenCode Zen)
- Override per-test or via CLI: `--model=provider/model`
- No accidental API costs during development

### ✅ Rule-Based Validation
- 4 evaluators check compliance with openagent.md rules
- Tests behavior (tool usage, approvals), not style (message counts)
- Model-agnostic test design

### ✅ Flexible Approval Handling
- Auto-approve for happy path testing
- Auto-deny for violation detection
- Smart strategies with custom rules
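
The three strategies can be thought of as implementations of one small interface. The sketch below is illustrative only: `ApprovalRequest`, `ApprovalStrategy`, and `smartStrategy` are hypothetical names, and the real types in `src/sdk/approval/` may look different.

```typescript
// Hypothetical shapes for illustration; the real types live in src/sdk/approval/.
interface ApprovalRequest {
  tool: string;   // e.g. "bash" or "write"
  detail: string; // e.g. the command line or file path
}

interface ApprovalStrategy {
  decide(request: ApprovalRequest): boolean; // true = approve
}

// Approves everything: suited to happy-path tests.
const autoApprove: ApprovalStrategy = { decide: () => true };

// Denies everything: suited to checking that the agent stops when refused.
const autoDeny: ApprovalStrategy = { decide: () => false };

// "Smart" strategy: approve only allow-listed tools whose detail matches no deny pattern.
function smartStrategy(allowedTools: string[], denyPatterns: RegExp[]): ApprovalStrategy {
  return {
    decide(request: ApprovalRequest): boolean {
      if (!allowedTools.includes(request.tool)) return false;
      return !denyPatterns.some((pattern) => pattern.test(request.detail));
    },
  };
}

const strategy = smartStrategy(["bash"], [/rm\s+-rf/]);
const approvedInstall = strategy.decide({ tool: "bash", detail: "npm install" });
const approvedDelete = strategy.decide({ tool: "bash", detail: "rm -rf /tmp/x" });
```

Here `approvedInstall` is `true` and `approvedDelete` is `false`, because the second command matches the deny pattern.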

## Documentation

| Document | Purpose | Audience |
|----------|---------|----------|
| **[SDK_EVAL_README.md](framework/SDK_EVAL_README.md)** | Complete SDK testing guide | All users |
| **[docs/test-design-guide.md](framework/docs/test-design-guide.md)** | Test design philosophy | Test authors |
| **[agents/openagent/docs/OPENAGENT_RULES.md](agents/openagent/docs/OPENAGENT_RULES.md)** | Rules reference | Test authors |
| **[agents/openagent/docs/TEST_SCENARIOS.md](agents/openagent/docs/TEST_SCENARIOS.md)** | Test scenario catalog | Test authors |

## Usage Examples

### Run SDK Tests

```bash
# All tests with free model
npm run eval:sdk

# Specific category
npm run eval:sdk -- --pattern="developer/*.yaml"

# Custom model
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Debug single test
npx tsx src/sdk/show-test-details.ts developer/install-dependencies.yaml
```

### Create New Tests

```yaml
# Example: developer/my-test.yaml
id: dev-my-test-001
name: My Test
description: What this test does

category: developer
prompt: "Your test prompt here"

# Behavior expectations (preferred)
behavior:
  mustUseTools: [bash]
  requiresApproval: true

# Expected violations
expectedViolations:
  - rule: approval-gate
    shouldViolate: false    # Should NOT violate
    severity: error

approvalStrategy:
  type: auto-approve

timeout: 60000
tags:
  - approval-gate
  - v2-schema
```

See [test-design-guide.md](framework/docs/test-design-guide.md) for best practices.
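
To make the role of `expectedViolations` concrete, here is a minimal sketch of how expectations might be compared with what the evaluators detect. The field names `rule`, `shouldViolate`, and `severity` come from the YAML above; `DetectedViolation` and `checkExpectations` are hypothetical and not taken from the framework.

```typescript
// Field names mirror the YAML example; everything else is assumed for illustration.
interface ExpectedViolation {
  rule: string;
  shouldViolate: boolean;
  severity?: "error" | "warning";
}

interface DetectedViolation {
  rule: string;
  message: string;
}

// Returns one message per unmet expectation; an empty array means the test passes.
function checkExpectations(
  expected: ExpectedViolation[],
  detected: DetectedViolation[],
): string[] {
  const failures: string[] = [];
  for (const expectation of expected) {
    const violated = detected.some((v) => v.rule === expectation.rule);
    if (expectation.shouldViolate && !violated) {
      failures.push(`Expected a ${expectation.rule} violation, but none was detected`);
    } else if (!expectation.shouldViolate && violated) {
      failures.push(`Unexpected ${expectation.rule} violation`);
    }
  }
  return failures;
}

// Positive test: the agent asked for approval, so no violation was detected.
const positive = checkExpectations([{ rule: "approval-gate", shouldViolate: false }], []);

// Negative test: a violation was expected but the evaluators found none.
const negative = checkExpectations([{ rule: "approval-gate", shouldViolate: true }], []);
```

With this shape, a negative test (`shouldViolate: true`) fails when the evaluator finds nothing, which is how both positive and negative tests can share one schema.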

## Framework Components

### SDK Test Runner
- **ServerManager** - Start/stop opencode server
- **ClientManager** - Session and prompt management
- **EventStreamHandler** - Real-time event capture
- **TestRunner** - Test orchestration with evaluators
- **ApprovalStrategies** - Auto-approve, deny, smart rules

### Evaluators
- **ApprovalGateEvaluator** - Checks approval before tool execution
- **ContextLoadingEvaluator** - Verifies context files loaded first
- **DelegationEvaluator** - Validates delegation for 4+ files
- **ToolUsageEvaluator** - Checks bash vs specialized tools

### Test Schema (v2)
```yaml
behavior:              # What agent should do
  mustUseTools: []         # list of tool names, e.g. [bash]
  requiresApproval: true   # boolean
  shouldDelegate: false    # boolean

expectedViolations:    # What rules to check
  - rule: approval-gate
    shouldViolate: false
```

See [SDK_EVAL_README.md](framework/SDK_EVAL_README.md) for complete API.
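
The framework validates test files with Zod. The sketch below swaps in a hand-written check so that it runs without dependencies. It covers only the three `behavior` fields shown above and assumes all of them are optional, which the real schema may not.

```typescript
// Mirrors the v2 behavior fields shown above; optionality is an assumption.
interface Behavior {
  mustUseTools?: string[];
  requiresApproval?: boolean;
  shouldDelegate?: boolean;
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Hand-written stand-in for the Zod schema: collects problems instead of throwing.
function validateBehavior(input: unknown): string[] {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return ["behavior must be a mapping"];
  }
  const behavior = input as Record<string, unknown>;
  const problems: string[] = [];
  if (behavior.mustUseTools !== undefined && !isStringArray(behavior.mustUseTools)) {
    problems.push("mustUseTools must be a list of strings");
  }
  for (const key of ["requiresApproval", "shouldDelegate"]) {
    if (behavior[key] !== undefined && typeof behavior[key] !== "boolean") {
      problems.push(`${key} must be a boolean`);
    }
  }
  return problems;
}

const validBehavior: Behavior = { mustUseTools: ["bash"], requiresApproval: true };
const validProblems = validateBehavior(validBehavior);
const invalidProblems = validateBehavior({ mustUseTools: "bash", shouldDelegate: "yes" });
```

`validProblems` is empty, while `invalidProblems` contains one entry for each of the two badly typed fields.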

## Test Results

```bash
npm run eval:sdk

# Output:
======================================================================
TEST RESULTS
======================================================================

1. ✅ dev-install-deps-002 - Install Dependencies (v2)
   Duration: 10659ms
   Events: 12
   Approvals: 0

2. ❌ biz-data-analysis-001 - Business Data Analysis
   Duration: 17512ms
   Events: 18
   Errors:
     - Expected tool calls but no approvals requested

======================================================================
SUMMARY: 1/2 tests passed (1 failed)
======================================================================
```

## Model Configuration

### Free Tier (Default)
```bash
# Uses opencode/grok-code-fast (free)
npm run eval:sdk
```

### Paid Models
```bash
# Claude 3.5 Sonnet
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# GPT-4 Turbo
npm run eval:sdk -- --model=openai/gpt-4-turbo
```

### Per-Test Override
```yaml
# In test YAML file
model: anthropic/claude-3-5-sonnet-20241022
```
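
The three configuration levels suggest a simple fallback chain. The sketch below assumes that a per-test `model` takes precedence over the `--model` flag, which in turn takes precedence over the free default. That ordering is an assumption, and `parseModelFlag` and `resolveModel` are hypothetical helpers, not the framework's code.

```typescript
const FREE_DEFAULT_MODEL = "opencode/grok-code-fast";

// Extracts the value of --model=provider/model from CLI arguments, if present.
function parseModelFlag(argv: string[]): string | undefined {
  const flag = argv.find((arg) => arg.startsWith("--model="));
  const value = flag?.slice("--model=".length);
  return value ? value : undefined; // treat an empty "--model=" as not set
}

// Assumed precedence: per-test model, then CLI flag, then the free default.
function resolveModel(testModel: string | undefined, argv: string[]): string {
  return testModel ?? parseModelFlag(argv) ?? FREE_DEFAULT_MODEL;
}

const fromDefault = resolveModel(undefined, []);
const fromCli = resolveModel(undefined, ["--debug", "--model=openai/gpt-4-turbo"]);
const fromTest = resolveModel("anthropic/claude-3-5-sonnet-20241022", ["--model=openai/gpt-4-turbo"]);
```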

## Development

### Run Framework Tests
```bash
cd evals/framework
npm test
```

### Build Framework
```bash
npm run build
```

### Add New Evaluator
1. Create in `src/evaluators/`
2. Extend `BaseEvaluator`
3. Implement `evaluate()` method
4. Register in `EvaluatorRunner`
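
The four steps above might look roughly like this. The `BaseEvaluator` and `EvaluatorRunner` classes below are simplified stand-ins written for this example (the real ones in `src/evaluators/` may have different signatures, and `evaluate()` may be asynchronous), and `NoSudoEvaluator` is a hypothetical rule.

```typescript
// Stand-ins for the framework's real classes, whose signatures may differ.
interface ToolCall {
  tool: string;
  input: string;
}

interface Violation {
  rule: string;
  severity: "error" | "warning";
  message: string;
}

abstract class BaseEvaluator {
  abstract readonly name: string;
  abstract evaluate(toolCalls: ToolCall[]): Violation[];
}

class EvaluatorRunner {
  private readonly evaluators: BaseEvaluator[] = [];

  register(evaluator: BaseEvaluator): void {
    this.evaluators.push(evaluator);
  }

  runAll(toolCalls: ToolCall[]): Violation[] {
    const all: Violation[] = [];
    for (const evaluator of this.evaluators) {
      all.push(...evaluator.evaluate(toolCalls));
    }
    return all;
  }
}

// Steps 1-3: a hypothetical rule that flags bash commands run with sudo.
class NoSudoEvaluator extends BaseEvaluator {
  readonly name = "no-sudo";

  evaluate(toolCalls: ToolCall[]): Violation[] {
    return toolCalls
      .filter((call) => call.tool === "bash" && /\bsudo\b/.test(call.input))
      .map((call) => ({
        rule: this.name,
        severity: "error" as const,
        message: `bash command uses sudo: ${call.input}`,
      }));
  }
}

// Step 4: register the evaluator and run it over recorded tool calls.
const runner = new EvaluatorRunner();
runner.register(new NoSudoEvaluator());
const violations = runner.runAll([
  { tool: "bash", input: "npm install" },
  { tool: "bash", input: "sudo apt-get install jq" },
]);
```

Only the second tool call produces a violation, so `violations` has a single entry with `rule` set to `no-sudo`.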

### Debug Tests
```bash
# Show detailed test execution
npx tsx src/sdk/show-test-details.ts path/to/test.yaml

# Check session files
ls ~/.local/share/opencode/storage/session/
```

## CI/CD Integration

```yaml
# .github/workflows/eval.yml
name: Agent Evaluation

on: [push, pull_request]

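jobs:
  evaluate:
    runs-on: ubuntu-latest
    defaults:
      run:
        # Each `run` step starts in a fresh shell, so set the directory once here
        working-directory: evals/framework
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
      - run: npm install
      - run: npm run build
      - run: npm run eval:sdk -- --no-evaluators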
```

## Configuration

### Default Model
Edit `src/sdk/test-runner.ts`:
```typescript
defaultModel: config.defaultModel || 'opencode/grok-code-fast'
```

### Evaluators
Enable/disable in `TestRunner`:
```typescript
runEvaluators: config.runEvaluators ?? true
```

## Achievements

✅ Full SDK integration with `@opencode-ai/sdk@1.0.90`  
✅ Real-time event streaming (10+ events per test)  
✅ 4 evaluators integrated and working  
✅ YAML-based test definitions with Zod validation  
✅ CLI runner with detailed reporting  
✅ Free model by default (no API costs)  
✅ Model-agnostic test design  
✅ Both positive and negative test support  

**Status:** Production-ready for OpenAgent evaluation

## Contributing

See [CONTRIBUTING.md](../docs/contributing/CONTRIBUTING.md)

## License

MIT