
OpenCode Agent Evaluation Framework

An SDK-based evaluation framework that tests OpenCode agents through real execution, event streaming, and automated rule-violation detection.

🚀 Quick Start

cd evals/framework
npm install
npm run build

# Run all tests (free model by default)
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent
npm run eval:sdk -- --agent=opencoder

# View results dashboard
cd ../results && ./serve.sh

📖 New to the framework? Start with GETTING_STARTED.md

📊 Current Status

Test Coverage

Agent	Tests	Pass Rate	Status
OpenAgent	22 tests	100%	✅ Production Ready
Opencoder	4 tests	100%	✅ Production Ready

Recent Achievements (Nov 26, 2025)

✅ Context Loading Tests - 5 comprehensive tests (3 simple, 2 complex multi-turn)
✅ Smart Timeout System - Activity monitoring with absolute max timeout
✅ Fixed Context Evaluator - Properly detects context files in multi-turn sessions
✅ Batch Test Runner - Run tests in controlled batches to avoid API limits
✅ Results Dashboard - Interactive web dashboard with filtering and charts

📁 Directory Structure

evals/
├── framework/                    # Core evaluation framework
│   ├── src/
│   │   ├── sdk/                 # SDK-based test runner
│   │   ├── collector/           # Session data collection
│   │   ├── evaluators/          # Rule violation detection
│   │   └── types/               # TypeScript types
│   ├── docs/                    # Framework documentation
│   ├── scripts/utils/run-tests-batch.sh       # Batch test runner
│   └── README.md                # Framework docs
│
├── agents/                      # Agent-specific test suites
│   ├── openagent/               # OpenAgent tests
│   │   ├── tests/
│   │   │   ├── context-loading/ # Context loading tests (NEW)
│   │   │   ├── developer/       # Developer workflow tests
│   │   │   ├── business/        # Business analysis tests
│   │   │   └── edge-case/       # Edge case tests
│   │   ├── CONTEXT_LOADING_COVERAGE.md
│   │   ├── IMPLEMENTATION_SUMMARY.md
│   │   └── README.md
│   │
│   ├── opencoder/               # Opencoder tests
│   │   ├── tests/developer/
│   │   └── README.md
│   │
│   └── shared/                  # Shared test utilities
│
├── results/                     # Test results & dashboard
│   ├── history/                 # Historical results (60-day retention)
│   ├── index.html               # Interactive dashboard
│   ├── serve.sh                 # One-command server
│   ├── latest.json              # Latest test results
│   └── README.md
│
├── test_tmp/                    # Temporary test files (auto-cleaned)
│
├── GETTING_STARTED.md           # Quick start guide (START HERE)
├── HOW_TESTS_WORK.md            # Detailed test execution guide
├── ARCHITECTURE.md              # System architecture review
└── README.md                    # This file

🎯 Key Features

✅ SDK-Based Execution

Uses official @opencode-ai/sdk for real agent interaction
Real-time event streaming (10+ events per test)
Actual session recording to disk

✅ Cost-Aware Testing

FREE by default - Uses opencode/grok-code-fast (OpenCode Zen)
Override per-test or via CLI: --model=provider/model
No accidental API costs during development

✅ Smart Timeout System (NEW)

Activity monitoring - extends timeout while agent is working
Base timeout: 300s (5 min) of inactivity
Absolute max: 600s (10 min) hard limit
Prevents false timeouts on complex multi-turn tests
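
The two limits above can be read as a single decision made each time the runner checks on a test. The following is a minimal sketch under that reading, not the framework's actual implementation: `TimeoutConfig`, `TimeoutDecision`, and `checkTimeout` are hypothetical names, and only the 300s and 600s values come from this README.

```typescript
// Hypothetical names; only the 300s / 600s values are taken from this README.
interface TimeoutConfig {
  inactivityMs: number;  // base timeout: allowed silence between agent events
  absoluteMaxMs: number; // hard cap on total run time
}

type TimeoutDecision = "continue" | "inactivity-timeout" | "absolute-timeout";

const DEFAULT_TIMEOUTS: TimeoutConfig = {
  inactivityMs: 300_000,
  absoluteMaxMs: 600_000,
};

function checkTimeout(
  nowMs: number,
  startedAtMs: number,
  lastActivityAtMs: number,
  config: TimeoutConfig = DEFAULT_TIMEOUTS
): TimeoutDecision {
  // The hard limit wins regardless of how recently the agent was active.
  if (nowMs - startedAtMs >= config.absoluteMaxMs) {
    return "absolute-timeout";
  }
  // Otherwise give up only if the agent has been silent for too long.
  if (nowMs - lastActivityAtMs >= config.inactivityMs) {
    return "inactivity-timeout";
  }
  return "continue";
}

// 400s into the run, but the last event arrived 50s ago: keep waiting.
const decision = checkTimeout(400_000, 0, 350_000); // "continue"
```

Keeping the check a pure function of three timestamps makes it easy to unit-test without real clocks; a runner would update `lastActivityAtMs` whenever an event arrives.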

✅ Context Loading Validation (NEW)

5 comprehensive tests covering simple and complex scenarios
Verifies context files loaded before execution
Multi-turn conversation support
Proper file path extraction from SDK events

✅ Rule-Based Validation

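5 evaluators: 4 check compliance with agent rules (approval gate, context loading, delegation, tool usage) and 1 checks test-specific behavior expectations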
Tests behavior (tool usage, approvals) not style
Model-agnostic test design

✅ Results Tracking & Visualization

Type-safe JSON result generation
Interactive web dashboard with filtering
Pass rate trend charts
CSV export functionality
60-day retention policy
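
As an illustration of the 60-day retention policy, the sketch below selects stored results that have aged out. It is an assumption about how such pruning could work, not the framework's code: `StoredResult`, `expiredResults`, and the file names are hypothetical.

```typescript
// Hypothetical types and names; only the 60-day figure comes from this README.
const RETENTION_DAYS = 60;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

interface StoredResult {
  file: string;      // path of a saved result (illustrative)
  timestamp: string; // ISO 8601 time at which the run finished
}

function expiredResults(
  results: StoredResult[],
  now: Date,
  retentionDays: number = RETENTION_DAYS
): StoredResult[] {
  const cutoffMs = now.getTime() - retentionDays * MS_PER_DAY;
  // Anything strictly older than the cutoff is eligible for removal.
  return results.filter((r) => new Date(r.timestamp).getTime() < cutoffMs);
}

const stored: StoredResult[] = [
  { file: "history/2025-09/run-a.json", timestamp: "2025-09-01T00:00:00Z" },
  { file: "history/2025-10/run-b.json", timestamp: "2025-10-15T00:00:00Z" },
];
// With "now" at 2025-11-26 the cutoff is 2025-09-27, so only run-a is expired.
const expired = expiredResults(stored, new Date("2025-11-26T00:00:00Z"));
```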

📚 Documentation

Document	Purpose	Audience
GETTING_STARTED.md	Quick start guide	New users
HOW_TESTS_WORK.md	Test execution details	Test authors
ARCHITECTURE.md	System architecture	Developers
framework/SDK_EVAL_README.md	Complete SDK guide	All users
framework/docs/test-design-guide.md	Test design philosophy	Test authors
agents/openagent/CONTEXT_LOADING_COVERAGE.md	Context loading tests	OpenAgent users
agents/openagent/IMPLEMENTATION_SUMMARY.md	Recent implementation	Developers

🔧 Agent Differences

Feature	OpenAgent	Opencoder
Approval	Text-based + tool permissions	Tool permissions only
Workflow	Analyze→Approve→Execute→Validate	Direct execution
Context	Mandatory before execution	On-demand
Test Style	Multi-turn (approval flow)	Single prompt
Timeout	300s (smart timeout)	60s (standard)

🎨 Usage Examples

Run Tests

# All tests with free model
npm run eval:sdk

# Specific category
npm run eval:sdk -- --pattern="context-loading/*.yaml"

# Custom model
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Debug single test
npm run eval:sdk -- --pattern="ctx-simple-coding-standards.yaml" --debug

# Batch execution (avoid API limits)
./scripts/utils/run-tests-batch.sh openagent 3 10

View Results

# Each command below is run from evals/ (from evals/framework, use ../results)

# Interactive dashboard (one command!)
cd results && ./serve.sh

# View JSON
cat results/latest.json

# Historical results
ls results/history/2025-11/

Create New Test

# Example: context-loading/my-test.yaml
id: my-test-001
name: "My Test"
description: What this test validates

category: developer
agent: openagent
model: anthropic/claude-sonnet-4-5

prompt: "Your test prompt here"

behavior:
  mustUseTools: [read]
  requiresContext: true
  minToolCalls: 1

expectedViolations:
  - rule: context-loading
    shouldViolate: false
    severity: error

approvalStrategy:
  type: auto-approve

timeout: 60000  # milliseconds (60s)

tags:
  - context-loading

See GETTING_STARTED.md for more examples.

🏗️ Framework Components

SDK Test Runner

ServerManager - Start/stop opencode server
ClientManager - Session and prompt management
EventStreamHandler - Real-time event capture
TestRunner - Test orchestration with evaluators
ApprovalStrategies - Auto-approve, deny, smart rules
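
The three approval strategies can be pictured as interchangeable functions that answer a permission request with yes or no. This is a rough sketch only: `PermissionRequest`, `ApprovalStrategy`, and `smartApprove` are hypothetical names, and the real SDK permission events likely carry more fields than shown here.

```typescript
// Hypothetical shapes; the real permission events are not documented in this README.
interface PermissionRequest {
  tool: string;    // e.g. "bash", "write"
  target?: string; // file path or command, when the tool has one
}

type ApprovalStrategy = (request: PermissionRequest) => boolean;

// Approve everything (the usual choice for positive tests).
const autoApprove: ApprovalStrategy = () => true;

// Deny everything (useful for checking how an agent handles refusal).
const autoDeny: ApprovalStrategy = () => false;

// "Smart" rules: allow listed tools, but never on protected path prefixes.
function smartApprove(
  allowedTools: string[],
  deniedPathPrefixes: string[] = []
): ApprovalStrategy {
  return (request) => {
    if (allowedTools.indexOf(request.tool) === -1) {
      return false;
    }
    const target = request.target;
    if (target === undefined) {
      return true;
    }
    return !deniedPathPrefixes.some((prefix) => target.indexOf(prefix) === 0);
  };
}

const strategy = smartApprove(["read", "write"], ["/etc/"]);
const allowed = strategy({ tool: "write", target: "test_tmp/out.txt" }); // true
const blocked = strategy({ tool: "write", target: "/etc/passwd" });      // false
```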

Evaluators

ApprovalGateEvaluator - Checks approval before tool execution
ContextLoadingEvaluator - Verifies context files loaded first (FIXED)
DelegationEvaluator - Validates that work touching 4+ files is delegated
ToolUsageEvaluator - Checks whether bash was used where a specialized tool applies
BehaviorEvaluator - Validates test-specific behavior expectations
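
To make the evaluator idea concrete, here is a simplified sketch of a context-loading check: it walks a session's tool calls in order and reports every execution that happens before any context file has been read. Everything here is illustrative. The `ToolCall` shape is a flattened stand-in for the SDK events, and both the set of "execution" tools and the `.opencode/context/` directory are assumptions, not confirmed by this README.

```typescript
// Hypothetical, flattened event shape; real SDK events are nested more deeply.
interface ToolCall {
  tool: string;
  filePath?: string;
}

interface Violation {
  rule: string;
  severity: "error" | "warning";
  message: string;
}

// Assumption: these tools count as "execution" for this rule.
const EXECUTION_TOOLS = ["write", "edit", "bash"];

function evaluateContextLoading(
  calls: ToolCall[],
  contextDir: string = ".opencode/context/" // assumed location of context files
): Violation[] {
  const violations: Violation[] = [];
  let contextLoaded = false;

  for (let i = 0; i < calls.length; i++) {
    const call = calls[i];
    const path = call.filePath;
    if (call.tool === "read" && path !== undefined && path.indexOf(contextDir) !== -1) {
      contextLoaded = true;
      continue;
    }
    // Every execution is checked, not only the first one in the session.
    if (EXECUTION_TOOLS.indexOf(call.tool) !== -1 && !contextLoaded) {
      violations.push({
        rule: "context-loading",
        severity: "error",
        message: "call #" + (i + 1) + " (" + call.tool + ") ran before any context file was read",
      });
    }
  }
  return violations;
}

const compliant = evaluateContextLoading([
  { tool: "read", filePath: ".opencode/context/core/standards.md" },
  { tool: "write", filePath: "src/index.ts" },
]); // no violations

const lateContext = evaluateContextLoading([
  { tool: "write", filePath: "src/index.ts" },
  { tool: "read", filePath: ".opencode/context/core/standards.md" },
  { tool: "edit", filePath: "src/index.ts" },
]); // one violation, for the first write
```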

Results System

ResultSaver - Type-safe JSON generation
Dashboard - Interactive web visualization
Helper Scripts - Easy deployment (serve.sh)

🔬 Test Schema (v2)

# Behavior expectations (what agent should do)
behavior:
  mustUseTools: [read, write]      # Required tools
  mustUseAnyOf: [[bash], [list]]   # Alternative tools
  requiresApproval: true            # Must ask for approval
  requiresContext: true             # Must load context
  minToolCalls: 2                   # Minimum tool calls

# Expected violations (what rules to check)
expectedViolations:
  - rule: approval-gate
    shouldViolate: false            # Should NOT violate
    severity: error
  
  - rule: context-loading
    shouldViolate: false
    severity: error
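
One way to read the schema above is as two independent checks: `behavior` is compared with the tools the agent actually used, and `expectedViolations` is compared with the rules the evaluators reported as violated. The sketch below follows that reading. The function names are hypothetical, and it assumes `mustUseAnyOf` means "at least one inner list was used in full", which this README does not spell out.

```typescript
// Field names mirror the YAML schema; the functions themselves are hypothetical.
interface BehaviorExpectation {
  mustUseTools?: string[];
  mustUseAnyOf?: string[][];
  minToolCalls?: number;
}

interface ExpectedViolation {
  rule: string;
  shouldViolate: boolean;
}

function checkBehavior(expected: BehaviorExpectation, toolsUsed: string[]): string[] {
  const failures: string[] = [];
  const used = (tool: string): boolean => toolsUsed.indexOf(tool) !== -1;

  for (const tool of expected.mustUseTools ?? []) {
    if (!used(tool)) failures.push("missing required tool: " + tool);
  }
  // Assumption: one alternative passes when every tool in that inner list was used.
  const alternatives = expected.mustUseAnyOf ?? [];
  if (alternatives.length > 0 && !alternatives.some((set) => set.every(used))) {
    failures.push("none of the alternative tool sets was used");
  }
  const minCalls = expected.minToolCalls ?? 0;
  if (toolsUsed.length < minCalls) {
    failures.push("expected at least " + minCalls + " tool calls, saw " + toolsUsed.length);
  }
  return failures;
}

function checkExpectedViolations(
  expected: ExpectedViolation[],
  violatedRules: string[]
): string[] {
  return expected
    .filter((e) => (violatedRules.indexOf(e.rule) !== -1) !== e.shouldViolate)
    .map((e) =>
      e.shouldViolate
        ? "expected a " + e.rule + " violation but none occurred"
        : "unexpected " + e.rule + " violation"
    );
}

const behaviorFailures = checkBehavior(
  { mustUseTools: ["read", "write"], mustUseAnyOf: [["bash"], ["list"]], minToolCalls: 2 },
  ["read", "write", "list"]
); // []

const violationFailures = checkExpectedViolations(
  [
    { rule: "approval-gate", shouldViolate: false },
    { rule: "context-loading", shouldViolate: false },
  ],
  ["context-loading"]
); // ["unexpected context-loading violation"]
```

Because `shouldViolate` can be `true`, the same comparison covers negative tests: a test passes only when the expected rule actually fires.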

📈 Recent Improvements

November 26, 2025

Context Loading Tests (5 tests, 100% passing)
- 3 simple tests (single prompt, read-only)
- 2 complex tests (multi-turn with file creation)
- Comprehensive coverage of context loading scenarios

Smart Timeout System
- Activity monitoring prevents false timeouts
- Base timeout: 300s inactivity
- Absolute max: 600s hard limit
- Handles complex multi-turn tests gracefully

Fixed Context Loading Evaluator
- Corrected file path extraction (tool.data.state.input.filePath)
- Multi-turn session support
- Checks context for ALL executions, not just first

Batch Test Runner
- run-tests-batch.sh script
- Configurable batch size and delays
- Prevents API rate limits

Results Dashboard
- Interactive web UI with filtering
- Pass rate trend charts
- CSV export
- One-command deployment
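
The batching idea behind run-tests-batch.sh can be sketched as a small planning step: split the test list into fixed-size groups and pause before every group after the first. This is not the script's actual logic; `planBatches` and `PlannedBatch` are hypothetical, and reading the `3 10` arguments in the usage example as "batch size 3, delay 10 seconds" is an assumption.

```typescript
// Hypothetical planner; the real batch runner is a shell script.
interface PlannedBatch<T> {
  items: T[];
  waitBeforeSeconds: number; // pause before starting this batch
}

function planBatches<T>(
  items: T[],
  batchSize: number,
  delaySeconds: number
): PlannedBatch<T>[] {
  if (batchSize < 1 || Math.floor(batchSize) !== batchSize) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: PlannedBatch<T>[] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push({
      items: items.slice(start, start + batchSize),
      // No need to wait before the very first batch.
      waitBeforeSeconds: start === 0 ? 0 : delaySeconds,
    });
  }
  return batches;
}

// Seven illustrative test files, batch size 3, 10s between batches.
const plan = planBatches(
  ["t1.yaml", "t2.yaml", "t3.yaml", "t4.yaml", "t5.yaml", "t6.yaml", "t7.yaml"],
  3,
  10
);
// plan has 3 batches of sizes 3, 3, 1 with waits of 0s, 10s, 10s.
```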

🎯 Achievements

✅ Full SDK integration with @opencode-ai/sdk@1.0.90
✅ Real-time event streaming (12+ events per test)
✅ 5 evaluators integrated and working
✅ YAML-based test definitions with Zod validation
✅ CLI runner with detailed reporting
✅ Free model by default (no API costs)
✅ Model-agnostic test design
✅ Both positive and negative test support
✅ Smart timeout with activity monitoring
✅ Context loading validation (100% coverage)
✅ Results tracking and visualization
✅ Batch execution support

Status: ✅ Production-ready for OpenAgent & Opencoder evaluation

🤝 Contributing

See ../docs/contributing/CONTRIBUTING.md

📄 License

MIT

🆘 Support

Getting Started: GETTING_STARTED.md
How Tests Work: HOW_TESTS_WORK.md
Architecture: ARCHITECTURE.md
Issues: Check the documentation first, then create an issue

Last Updated: 2025-11-26
Framework Version: 0.1.0
Test Coverage: 26 tests (22 OpenAgent, 4 Opencoder)
Pass Rate: 100%