Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.
cd evals/framework
npm install
npm run build
# Run all tests (free model by default)
npm run eval:sdk
# Run specific agent
npm run eval:sdk -- --agent=openagent
npm run eval:sdk -- --agent=opencoder
# View results dashboard
cd ../results && ./serve.sh
New to the framework? Start with GETTING_STARTED.md
| Agent | Tests | Pass Rate | Status |
|---|---|---|---|
| OpenAgent | 22 tests | 100% | ✅ Production Ready |
| Opencoder | 4 tests | 100% | ✅ Production Ready |
✅ Context Loading Tests - 5 comprehensive tests (3 simple, 2 complex multi-turn)
✅ Smart Timeout System - Activity monitoring with absolute max timeout
✅ Fixed Context Evaluator - Properly detects context files in multi-turn sessions
✅ Batch Test Runner - Run tests in controlled batches to avoid API limits
✅ Results Dashboard - Interactive web dashboard with filtering and charts
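The smart timeout described above combines two limits: a quiet-period limit that resets on every agent event, and a hard cap on total run time. The following is a minimal sketch of that idea, not the framework's actual implementation; `ActivityMonitor`, `SmartTimeoutOptions`, and the option names are hypothetical. Timestamps are passed in explicitly so the logic can be checked without real timers.

```typescript
// Hypothetical sketch of an activity-based timeout with an absolute maximum.
// All names here are illustrative assumptions, not the framework's API.
interface SmartTimeoutOptions {
  inactivityMs: number;  // longest allowed gap between two agent events
  absoluteMaxMs: number; // hard cap on total run time, regardless of activity
}

type TimeoutVerdict = "running" | "inactive" | "absolute-max";

class ActivityMonitor {
  private readonly opts: SmartTimeoutOptions;
  private readonly startedAt: number;
  private lastActivityAt: number;

  constructor(opts: SmartTimeoutOptions, now: number) {
    this.opts = opts;
    this.startedAt = now;
    this.lastActivityAt = now;
  }

  // Call whenever the agent emits an event (tool call, message chunk, ...).
  recordActivity(now: number): void {
    this.lastActivityAt = now;
  }

  // The absolute cap is checked first so that a busy agent cannot run forever.
  check(now: number): TimeoutVerdict {
    if (now - this.startedAt >= this.opts.absoluteMaxMs) return "absolute-max";
    if (now - this.lastActivityAt >= this.opts.inactivityMs) return "inactive";
    return "running";
  }
}
```

A runner could call `check(Date.now())` on an interval and abort the session on any verdict other than `"running"`.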
evals/
├── framework/                            # Core evaluation framework
│   ├── src/
│   │   ├── sdk/                          # SDK-based test runner
│   │   ├── collector/                    # Session data collection
│   │   ├── evaluators/                   # Rule violation detection
│   │   └── types/                        # TypeScript types
│   ├── docs/                             # Framework documentation
│   ├── scripts/utils/run-tests-batch.sh  # Batch test runner
│   └── README.md                         # Framework docs
│
├── agents/                               # Agent-specific test suites
│   ├── openagent/                        # OpenAgent tests
│   │   ├── tests/
│   │   │   ├── context-loading/          # Context loading tests (NEW)
│   │   │   ├── developer/                # Developer workflow tests
│   │   │   ├── business/                 # Business analysis tests
│   │   │   └── edge-case/                # Edge case tests
│   │   ├── CONTEXT_LOADING_COVERAGE.md
│   │   ├── IMPLEMENTATION_SUMMARY.md
│   │   └── README.md
│   │
│   ├── opencoder/                        # Opencoder tests
│   │   ├── tests/developer/
│   │   └── README.md
│   │
│   └── shared/                           # Shared test utilities
│
├── results/                              # Test results & dashboard
│   ├── history/                          # Historical results (60-day retention)
│   ├── index.html                        # Interactive dashboard
│   ├── serve.sh                          # One-command server
│   ├── latest.json                       # Latest test results
│   └── README.md
│
├── test_tmp/                             # Temporary test files (auto-cleaned)
│
├── GETTING_STARTED.md                    # Quick start guide (START HERE)
├── HOW_TESTS_WORK.md                     # Detailed test execution guide
├── ARCHITECTURE.md                       # System architecture review
└── README.md                             # This file
SDK: @opencode-ai/sdk for real agent interaction
Default model: opencode/grok-code-fast (OpenCode Zen)
Model override: --model=provider/model

| Document | Purpose | Audience |
|---|---|---|
| GETTING_STARTED.md | Quick start guide | New users |
| HOW_TESTS_WORK.md | Test execution details | Test authors |
| ARCHITECTURE.md | System architecture | Developers |
| framework/SDK_EVAL_README.md | Complete SDK guide | All users |
| framework/docs/test-design-guide.md | Test design philosophy | Test authors |
| agents/openagent/CONTEXT_LOADING_COVERAGE.md | Context loading tests | OpenAgent users |
| agents/openagent/IMPLEMENTATION_SUMMARY.md | Recent implementation | Developers |
| Feature | OpenAgent | Opencoder |
|---|---|---|
| Approval | Text-based + tool permissions | Tool permissions only |
| Workflow | Analyze → Approve → Execute → Validate | Direct execution |
| Context | Mandatory before execution | On-demand |
| Test Style | Multi-turn (approval flow) | Single prompt |
| Timeout | 300s (smart timeout) | 60s (standard) |
# All tests with free model
npm run eval:sdk
# Specific category
npm run eval:sdk -- --pattern="context-loading/*.yaml"
# Custom model
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022
# Debug single test
npm run eval:sdk -- --pattern="ctx-simple-coding-standards.yaml" --debug
# Batch execution (avoid API limits)
./scripts/utils/run-tests-batch.sh openagent 3 10
# Interactive dashboard (one command!)
cd results && ./serve.sh
# View JSON
cat results/latest.json
# Historical results
ls results/history/2025-11/
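The batch command above runs tests in controlled groups to stay under API limits. This README does not say what the two numeric arguments mean, so the sketch below only illustrates the general technique: split the test list into fixed-size batches and pause between them. `chunk`, `runInBatches`, `batchSize`, and `delayMs` are hypothetical names, and running tests sequentially within a batch is an assumption.

```typescript
// Hypothetical sketch of batched execution; not the logic of run-tests-batch.sh.

// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `run` on every item, one batch at a time, sleeping between batches.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  delayMs: number,
  run: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  const batches = chunk(items, batchSize);
  let index = 0;
  for (const batch of batches) {
    for (const item of batch) {
      results.push(await run(item)); // sequential within a batch (assumption)
    }
    index += 1;
    if (index < batches.length) {
      // Pause only between batches, not after the last one.
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return results;
}
```

Keeping the pause outside the per-test loop means the delay scales with the number of batches rather than the number of tests.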
# Example: context-loading/my-test.yaml
id: my-test-001
name: "My Test"
description: What this test validates
category: developer
agent: openagent
model: anthropic/claude-sonnet-4-5
prompt: "Your test prompt here"
behavior:
  mustUseTools: [read]
  requiresContext: true
  minToolCalls: 1
expectedViolations:
  - rule: context-loading
    shouldViolate: false
    severity: error
approvalStrategy:
  type: auto-approve
timeout: 60000
tags:
  - context-loading
See GETTING_STARTED.md for more examples.
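Test definitions like the one above are validated with Zod in the framework. The dependency-free sketch below only mirrors the kind of checks such a schema might perform on an already-parsed YAML document. The field names come from the example; which fields are required, and the rules themselves, are assumptions, and `validateTestDefinition` is a hypothetical helper rather than part of the framework.

```typescript
// Hypothetical, dependency-free sketch of test-definition validation.
// The real framework uses Zod; the rules below are illustrative assumptions.
interface TestDefinition {
  id: string;
  name: string;
  agent: string;
  prompt: string;
  timeout?: number; // 60000 in the example, presumably milliseconds
  tags?: string[];
}

// Returns a list of human-readable problems; an empty list means "valid".
function validateTestDefinition(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) {
    return ["test definition must be an object"];
  }
  const doc = raw as Record<string, unknown>;
  const errors: string[] = [];

  for (const key of ["id", "name", "agent", "prompt"]) {
    if (typeof doc[key] !== "string" || doc[key] === "") {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  if (doc.timeout !== undefined && (typeof doc.timeout !== "number" || doc.timeout <= 0)) {
    errors.push("timeout must be a positive number");
  }
  const tags = doc.tags;
  if (tags !== undefined && !(Array.isArray(tags) && tags.every((t) => typeof t === "string"))) {
    errors.push("tags must be a list of strings");
  }
  return errors;
}
```

Collecting every problem instead of stopping at the first one lets a test author fix a new YAML file in a single pass.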
# Behavior expectations (what agent should do)
behavior:
  mustUseTools: [read, write]     # Required tools
  mustUseAnyOf: [[bash], [list]]  # Alternative tools
  requiresApproval: true          # Must ask for approval
  requiresContext: true           # Must load context
  minToolCalls: 2                 # Minimum tool calls
# Expected violations (what rules to check)
expectedViolations:
  - rule: approval-gate
    shouldViolate: false          # Should NOT violate
    severity: error
  - rule: context-loading
    shouldViolate: false
    severity: error
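To make the two blocks above concrete, here is a rough sketch of how such expectations could be checked against a recorded session. It is not the framework's evaluator code: `checkBehavior` and `violationsMatch` are hypothetical names, and the reading of `mustUseAnyOf` (at least one listed group must be fully used) is an assumption.

```typescript
// Hypothetical sketch of checking a session against `behavior` and
// `expectedViolations`; the semantics are assumptions, not the real evaluators.
interface BehaviorExpectation {
  mustUseTools?: string[];
  mustUseAnyOf?: string[][];
  minToolCalls?: number;
}

interface ExpectedViolation {
  rule: string;
  shouldViolate: boolean;
}

// Returns the list of unmet behavior expectations (empty means all met).
function checkBehavior(expected: BehaviorExpectation, toolCalls: string[]): string[] {
  const used = new Set(toolCalls);
  const failures: string[] = [];

  for (const tool of expected.mustUseTools ?? []) {
    if (!used.has(tool)) failures.push(`missing required tool: ${tool}`);
  }
  const groups = expected.mustUseAnyOf ?? [];
  if (groups.length > 0 && !groups.some((group) => group.every((t) => used.has(t)))) {
    failures.push("none of the alternative tool sets was used");
  }
  const minCalls = expected.minToolCalls ?? 0;
  if (toolCalls.length < minCalls) {
    failures.push(`expected at least ${minCalls} tool calls, got ${toolCalls.length}`);
  }
  return failures;
}

// A test passes only if each rule was violated exactly when it was expected to be.
// This covers positive tests (shouldViolate: false) and negative ones (true).
function violationsMatch(expected: ExpectedViolation[], violatedRules: string[]): boolean {
  const violated = new Set(violatedRules);
  return expected.every((e) => violated.has(e.rule) === e.shouldViolate);
}
```

Under these assumptions, a session that called `read`, `write`, and `list` and triggered no violations would satisfy the example definition above.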
Context Loading Tests (5 tests, 100% passing)
Smart Timeout System
Fixed Context Loading Evaluator (tool.data.state.input.filePath)
Batch Test Runner (run-tests-batch.sh script)
Results Dashboard
✅ Full SDK integration with @opencode-ai/sdk@1.0.90
✅ Real-time event streaming (12+ events per test)
✅ 5 evaluators integrated and working
✅ YAML-based test definitions with Zod validation
✅ CLI runner with detailed reporting
✅ Free model by default (no API costs)
✅ Model-agnostic test design
✅ Both positive and negative test support
✅ Smart timeout with activity monitoring
✅ Context loading validation (100% coverage)
✅ Results tracking and visualization
✅ Batch execution support
Status: ✅ Production-ready for OpenAgent & Opencoder evaluation
See ../docs/contributing/CONTRIBUTING.md
MIT
Last Updated: 2025-11-26
Framework Version: 0.1.0
Test Coverage: 26 tests (22 OpenAgent, 4 Opencoder)
Pass Rate: 100%