OpenCode Agent Evaluation Framework

Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.


🚀 Quick Start

cd evals/framework
npm install
npm run build

# Run all tests (free model by default)
npm run eval:sdk

# Run specific agent
npm run eval:sdk -- --agent=openagent
npm run eval:sdk -- --agent=opencoder

# View results dashboard
cd ../results && ./serve.sh

📖 New to the framework? Start with GETTING_STARTED.md


📊 Current Status

Test Coverage

| Agent | Tests | Pass Rate | Status |
|-------|-------|-----------|--------|
| OpenAgent | 22 tests | 100% | ✅ Production Ready |
| Opencoder | 4 tests | 100% | ✅ Production Ready |

Recent Achievements (Nov 26, 2025)

✅ Context Loading Tests - 5 comprehensive tests (3 simple, 2 complex multi-turn)
✅ Smart Timeout System - Activity monitoring with absolute max timeout
✅ Fixed Context Evaluator - Properly detects context files in multi-turn sessions
✅ Batch Test Runner - Run tests in controlled batches to avoid API limits
✅ Results Dashboard - Interactive web dashboard with filtering and charts


๐Ÿ“ Directory Structure

evals/
├── framework/                    # Core evaluation framework
│   ├── src/
│   │   ├── sdk/                 # SDK-based test runner
│   │   ├── collector/           # Session data collection
│   │   ├── evaluators/          # Rule violation detection
│   │   └── types/               # TypeScript types
│   ├── docs/                    # Framework documentation
│   ├── scripts/utils/run-tests-batch.sh       # Batch test runner
│   └── README.md                # Framework docs
│
├── agents/                      # Agent-specific test suites
│   ├── openagent/               # OpenAgent tests
│   │   ├── tests/
│   │   │   ├── context-loading/ # Context loading tests (NEW)
│   │   │   ├── developer/       # Developer workflow tests
│   │   │   ├── business/        # Business analysis tests
│   │   │   └── edge-case/       # Edge case tests
│   │   ├── CONTEXT_LOADING_COVERAGE.md
│   │   ├── IMPLEMENTATION_SUMMARY.md
│   │   └── README.md
│   │
│   ├── opencoder/               # Opencoder tests
│   │   ├── tests/developer/
│   │   └── README.md
│   │
│   └── shared/                  # Shared test utilities
│
├── results/                     # Test results & dashboard
│   ├── history/                 # Historical results (60-day retention)
│   ├── index.html               # Interactive dashboard
│   ├── serve.sh                 # One-command server
│   ├── latest.json              # Latest test results
│   └── README.md
│
├── test_tmp/                    # Temporary test files (auto-cleaned)
│
├── GETTING_STARTED.md           # Quick start guide (START HERE)
├── HOW_TESTS_WORK.md            # Detailed test execution guide
├── ARCHITECTURE.md              # System architecture review
└── README.md                    # This file

🎯 Key Features

✅ SDK-Based Execution

  • Uses official @opencode-ai/sdk for real agent interaction
  • Real-time event streaming (10+ events per test)
  • Actual session recording to disk

✅ Cost-Aware Testing

  • FREE by default - Uses opencode/grok-code-fast (OpenCode Zen)
  • Override per-test or via CLI: --model=provider/model
  • No accidental API costs during development

✅ Smart Timeout System (NEW)

  • Activity monitoring - extends timeout while agent is working
  • Base timeout: 300s (5 min) of inactivity
  • Absolute max: 600s (10 min) hard limit
  • Prevents false timeouts on complex multi-turn tests
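
The two limits above can be expressed as a small pure function. This is a minimal sketch, not the framework's actual implementation: `checkTimeout`, `TimeoutVerdict`, and the constant names are hypothetical, and only the 300s and 600s values come from the list above.

```typescript
// Hypothetical names; only the 300s / 600s values are taken from the README.
const BASE_INACTIVITY_MS = 300000; // 5 min without any agent activity
const ABSOLUTE_MAX_MS = 600000; // 10 min hard limit for the whole test

type TimeoutVerdict = "running" | "inactive-timeout" | "absolute-timeout";

// All arguments are timestamps in milliseconds on the same clock.
function checkTimeout(
  startedAt: number,
  lastActivityAt: number,
  now: number,
  baseMs: number = BASE_INACTIVITY_MS,
  maxMs: number = ABSOLUTE_MAX_MS,
): TimeoutVerdict {
  // The hard limit applies no matter how recently the agent was active.
  if (now - startedAt >= maxMs) return "absolute-timeout";
  // Otherwise the test only fails after a full quiet period.
  if (now - lastActivityAt >= baseMs) return "inactive-timeout";
  return "running";
}
```

Each incoming event would update `lastActivityAt`, which is what lets a busy multi-turn test run past the base timeout while still being cut off at the absolute maximum.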

✅ Context Loading Validation (NEW)

  • 5 comprehensive tests covering simple and complex scenarios
  • Verifies context files loaded before execution
  • Multi-turn conversation support
  • Proper file path extraction from SDK events

✅ Rule-Based Validation

  • 4 rule evaluators check compliance with agent rules; a fifth, BehaviorEvaluator, checks test-specific expectations
  • Tests behavior (tool usage, approvals) not style
  • Model-agnostic test design

✅ Results Tracking & Visualization

  • Type-safe JSON result generation
  • Interactive web dashboard with filtering
  • Pass rate trend charts
  • CSV export functionality
  • 60-day retention policy
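
As an illustration of the retention rule, the sketch below decides which stored results fall outside a 60-day window. The function and constant names are hypothetical, and how the framework actually prunes `results/history/` is not described here, so treat this as one possible reading of the policy.

```typescript
// Hypothetical helper; only the 60-day figure comes from the README.
const RETENTION_DAYS = 60;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when a result is strictly older than the retention window.
function isExpired(
  resultDate: Date,
  now: Date,
  retentionDays: number = RETENTION_DAYS,
): boolean {
  const ageMs = now.getTime() - resultDate.getTime();
  return ageMs > retentionDays * MS_PER_DAY;
}

// Returns the subset of result dates that a cleanup pass could remove.
function selectExpired(dates: Date[], now: Date): Date[] {
  return dates.filter((d) => isExpired(d, now));
}
```

In this sketch a result that is exactly 60 days old is kept; whether the real cleanup treats the boundary the same way is an assumption.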

📚 Documentation

| Document | Purpose | Audience |
|----------|---------|----------|
| GETTING_STARTED.md | Quick start guide | New users |
| HOW_TESTS_WORK.md | Test execution details | Test authors |
| ARCHITECTURE.md | System architecture | Developers |
| framework/SDK_EVAL_README.md | Complete SDK guide | All users |
| framework/docs/test-design-guide.md | Test design philosophy | Test authors |
| agents/openagent/CONTEXT_LOADING_COVERAGE.md | Context loading tests | OpenAgent users |
| agents/openagent/IMPLEMENTATION_SUMMARY.md | Recent implementation | Developers |

🔧 Agent Differences

| Feature | OpenAgent | Opencoder |
|---------|-----------|-----------|
| Approval | Text-based + tool permissions | Tool permissions only |
| Workflow | Analyze→Approve→Execute→Validate | Direct execution |
| Context | Mandatory before execution | On-demand |
| Test Style | Multi-turn (approval flow) | Single prompt |
| Timeout | 300s (smart timeout) | 60s (standard) |

🎨 Usage Examples

Run Tests

# All tests with free model (run from evals/framework)
npm run eval:sdk

# Specific category
npm run eval:sdk -- --pattern="context-loading/*.yaml"

# Custom model
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Debug single test
npm run eval:sdk -- --pattern="ctx-simple-coding-standards.yaml" --debug

# Batch execution (avoid API limits)
./scripts/utils/run-tests-batch.sh openagent 3 10

View Results

# Interactive dashboard (one command; run from evals/)
cd results && ./serve.sh

# View JSON
cat results/latest.json

# Historical results
ls results/history/2025-11/

Create New Test

# Example: context-loading/my-test.yaml
id: my-test-001
name: "My Test"
description: What this test validates

category: developer
agent: openagent
model: anthropic/claude-sonnet-4-5

prompt: "Your test prompt here"

behavior:
  mustUseTools: [read]
  requiresContext: true
  minToolCalls: 1

expectedViolations:
  - rule: context-loading
    shouldViolate: false
    severity: error

approvalStrategy:
  type: auto-approve

timeout: 60000

tags:
  - context-loading

See GETTING_STARTED.md for more examples.
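
To make the `expectedViolations` block concrete, here is a minimal sketch of how a rule outcome might be compared with `shouldViolate`. The interface mirrors the YAML fields above, but `meetsExpectations` and the idea of passing violated rule names as a set are assumptions for illustration, not the framework's confirmed API.

```typescript
// Mirrors the YAML fields shown above; the function itself is hypothetical.
interface ExpectedViolation {
  rule: string;
  shouldViolate: boolean;
  severity: string;
}

// A test meets its expectations when every listed rule was violated
// exactly when shouldViolate says it should be.
function meetsExpectations(
  expected: ExpectedViolation[],
  violatedRules: Set<string>,
): boolean {
  return expected.every((e) => violatedRules.has(e.rule) === e.shouldViolate);
}
```

Under this reading, a positive test lists rules with `shouldViolate: false`, and a negative test sets `shouldViolate: true` to confirm that an evaluator catches a deliberate violation.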


๐Ÿ—๏ธ Framework Components

SDK Test Runner

  • ServerManager - Start/stop opencode server
  • ClientManager - Session and prompt management
  • EventStreamHandler - Real-time event capture
  • TestRunner - Test orchestration with evaluators
  • ApprovalStrategies - Auto-approve, deny, smart rules

Evaluators

  • ApprovalGateEvaluator - Checks approval before tool execution
  • ContextLoadingEvaluator - Verifies context files loaded first (FIXED)
  • DelegationEvaluator - Validates delegation for 4+ files
  • ToolUsageEvaluator - Checks bash vs specialized tools
  • BehaviorEvaluator - Validates test-specific behavior expectations
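
The ordering rule behind ContextLoadingEvaluator ("context files loaded first") can be sketched as a scan over the recorded tool calls. Everything here is a simplified assumption: the `ToolCall` shape, the choice of `write`, `edit`, and `bash` as execution tools, and the caller-supplied `isContextFile` predicate are all hypothetical and do not reflect the evaluator's real types.

```typescript
// Hypothetical, simplified view of a recorded tool call.
interface ToolCall {
  tool: string; // e.g. "read", "write", "bash"
  filePath?: string;
}

// Assumption: these tools count as "execution" for the purpose of the rule.
const EXECUTION_TOOLS = new Set<string>(["write", "edit", "bash"]);

// Returns false if any execution tool runs before a context file was read.
function contextLoadedFirst(
  calls: ToolCall[],
  isContextFile: (path: string) => boolean,
): boolean {
  let contextLoaded = false;
  for (const call of calls) {
    if (call.tool === "read" && call.filePath !== undefined && isContextFile(call.filePath)) {
      contextLoaded = true;
    } else if (EXECUTION_TOOLS.has(call.tool) && !contextLoaded) {
      return false; // executed before any context file was read
    }
  }
  return true;
}
```

A session with no execution tools passes trivially in this sketch, which matches the read-only test style described for the simple context-loading tests.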

Results System

  • ResultSaver - Type-safe JSON generation
  • Dashboard - Interactive web visualization
  • Helper Scripts - Easy deployment (serve.sh)

🔬 Test Schema (v2)

# Behavior expectations (what agent should do)
behavior:
  mustUseTools: [read, write]      # Required tools
  mustUseAnyOf: [[bash], [list]]   # Alternative tools
  requiresApproval: true            # Must ask for approval
  requiresContext: true             # Must load context
  minToolCalls: 2                   # Minimum tool calls

# Expected violations (what rules to check)
expectedViolations:
  - rule: approval-gate
    shouldViolate: false            # Should NOT violate
    severity: error
  
  - rule: context-loading
    shouldViolate: false
    severity: error
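
The `behavior` fields can be read as simple checks over the list of tools an agent called. The sketch below is illustrative only: `checkBehavior` is a hypothetical function, and it assumes `mustUseAnyOf` is satisfied when at least one of the listed tool sets was fully used, which the schema comment ("Alternative tools") suggests but does not spell out.

```typescript
// Subset of the behavior fields from the schema above.
interface BehaviorExpectation {
  mustUseTools?: string[];
  mustUseAnyOf?: string[][];
  minToolCalls?: number;
}

// Returns a list of human-readable failures; an empty list means the
// observed tool calls satisfy the expectations.
function checkBehavior(expected: BehaviorExpectation, toolsCalled: string[]): string[] {
  const used = new Set<string>(toolsCalled);
  const failures: string[] = [];

  for (const tool of expected.mustUseTools ?? []) {
    if (!used.has(tool)) failures.push(`missing required tool: ${tool}`);
  }

  // Assumption: one fully used alternative set is enough.
  const alternatives = expected.mustUseAnyOf ?? [];
  const anySetUsed = alternatives.some((set) => set.every((t) => used.has(t)));
  if (alternatives.length > 0 && !anySetUsed) {
    failures.push("none of the alternative tool sets was used");
  }

  if (expected.minToolCalls !== undefined && toolsCalled.length < expected.minToolCalls) {
    failures.push(`expected at least ${expected.minToolCalls} tool calls, got ${toolsCalled.length}`);
  }
  return failures;
}
```

Returning a list of failures instead of a boolean keeps the report specific about which expectation was missed.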

📈 Recent Improvements

November 26, 2025

  1. Context Loading Tests (5 tests, 100% passing)

    • 3 simple tests (single prompt, read-only)
    • 2 complex tests (multi-turn with file creation)
    • Comprehensive coverage of context loading scenarios
  2. Smart Timeout System

    • Activity monitoring prevents false timeouts
    • Base timeout: 300s inactivity
    • Absolute max: 600s hard limit
    • Handles complex multi-turn tests gracefully
  3. Fixed Context Loading Evaluator

    • Corrected file path extraction (tool.data.state.input.filePath)
    • Multi-turn session support
    • Checks context for ALL executions, not just first
  4. Batch Test Runner

    • run-tests-batch.sh script
    • Configurable batch size and delays
    • Prevents API rate limits
  5. Results Dashboard

    • Interactive web UI with filtering
    • Pass rate trend charts
    • CSV export
    • One-command deployment
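
The batching idea from item 4 (fixed batch size, pause between batches) can be sketched in TypeScript as follows. This is not a port of `run-tests-batch.sh`: the function names are hypothetical, tests within a batch are run one after another as an assumption, and the `sleep` function is injected by the caller so the sketch does not depend on any particular runtime timer.

```typescript
// Splits items into consecutive batches of at most batchSize elements.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Runs every item, pausing for delayMs between batches (not after the last).
async function runInBatches<T>(
  items: T[],
  batchSize: number,
  delayMs: number,
  runOne: (item: T) => Promise<void>,
  sleep: (ms: number) => Promise<void>,
): Promise<void> {
  let first = true;
  for (const batch of toBatches(items, batchSize)) {
    if (!first) await sleep(delayMs); // back off before the next batch
    first = false;
    for (const item of batch) {
      await runOne(item);
    }
  }
}
```

Spacing batches out this way trades total run time for a lower request rate, which is the stated goal of avoiding API rate limits.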

🎯 Achievements

✅ Full SDK integration with @opencode-ai/sdk@1.0.90
✅ Real-time event streaming (12+ events per test)
✅ 5 evaluators integrated and working
✅ YAML-based test definitions with Zod validation
✅ CLI runner with detailed reporting
✅ Free model by default (no API costs)
✅ Model-agnostic test design
✅ Both positive and negative test support
✅ Smart timeout with activity monitoring
✅ Context loading validation (100% coverage)
✅ Results tracking and visualization
✅ Batch execution support

Status: ✅ Production-ready for OpenAgent & Opencoder evaluation


๐Ÿค Contributing

See ../docs/contributing/CONTRIBUTING.md


📄 License

MIT


🆘 Support


Last Updated: 2025-11-26
Framework Version: 0.1.0
Test Coverage: 26 tests (22 OpenAgent, 4 Opencoder)
Pass Rate: 100%