Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.
# CI/CD - Smoke test (30 seconds)
npm run test:ci:openagent
# Development - Core tests (5-8 minutes)
npm run test:core
# Release - Full suite (40-80 minutes)
npm run test:openagent
# View results dashboard
cd evals/results && ./serve.sh
Complete Guide: See GUIDE.md for everything you need to know
| Tier | Tests | Time | Coverage | Use Case |
|---|---|---|---|---|
| Smoke ⚡ | 1 | ~30s | ~10% | CI/CD, every PR |
| Core ✅ | 7 | 5-8 min | ~85% | Development, pre-commit |
| Full 🔬 | 71 | 40-80 min | 100% | Release validation |
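As a rough illustration of how the tiers trade runtime for coverage, the sketch below picks the largest tier that fits a time budget. The script names, test counts, and upper time bounds are taken from this README; the `Tier`, `TierInfo`, `TIERS`, and `pickTier` names are hypothetical and not part of the framework.

```typescript
// Illustrative only: these types and functions are not part of the framework.
type Tier = "smoke" | "core" | "full";

interface TierInfo {
  script: string; // npm script that runs the tier
  tests: number; // number of tests in the tier
  maxMinutes: number; // upper bound of the runtime quoted in the table
}

const TIERS: Record<Tier, TierInfo> = {
  smoke: { script: "test:ci:openagent", tests: 1, maxMinutes: 0.5 },
  core: { script: "test:core", tests: 7, maxMinutes: 8 },
  full: { script: "test:openagent", tests: 71, maxMinutes: 80 },
};

// Return the largest tier whose worst-case runtime fits the budget,
// or undefined when even the smoke test would not fit.
function pickTier(budgetMinutes: number): Tier | undefined {
  const largestFirst: Tier[] = ["full", "core", "smoke"];
  for (let i = 0; i < largestFirst.length; i++) {
    const tier = largestFirst[i];
    if (TIERS[tier].maxMinutes <= budgetMinutes) return tier;
  }
  return undefined;
}

const chosen = pickTier(10);
if (chosen) console.log(`npm run ${TIERS[chosen].script}`);
```

With a 10-minute budget the full suite does not fit, so the sketch falls back to the core tier.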
| Agent | Tests | Status |
|---|---|---|
| OpenAgent | 71 tests | ✅ Production Ready |
| Opencoder | 4 tests | ✅ Production Ready |
evals/
├── framework/                    # Core evaluation engine
│   ├── src/
│   │   ├── sdk/                  # Test runner & execution
│   │   ├── evaluators/           # Rule validators (8 types)
│   │   └── collector/            # Session data collection
│   └── package.json
│
├── agents/                       # Agent-specific tests
│   ├── openagent/
│   │   ├── config/               # Core test configuration
│   │   ├── tests/                # 71 tests organized by category
│   │   └── docs/
│   └── opencoder/
│       └── tests/
│
├── results/                      # Test results & dashboard
│   ├── history/                  # Historical results
│   ├── index.html                # Interactive dashboard
│   └── latest.json
│
├── GUIDE.md                      # Complete guide (READ THIS)
└── README.md                     # This file
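To give a feel for what a rule validator under `framework/src/evaluators/` might do, here is a minimal sketch of the general pattern: each evaluator inspects the collected session events and returns a list of violations. Every name here (`SessionEvent`, `Violation`, `Evaluator`, `allowedToolsEvaluator`, `evaluate`) and the example rule itself are assumptions made for illustration; the real evaluator API is described in GUIDE.md and may look quite different.

```typescript
// Hypothetical shapes; the real event and violation types live in the framework.
interface SessionEvent {
  type: string; // e.g. "tool_call" or "message"
  tool?: string; // tool name, when the event is a tool call
}

interface Violation {
  rule: string;
  message: string;
}

type Evaluator = (events: SessionEvent[]) => Violation[];

// Made-up example rule: flag any tool call outside an allow-list.
function allowedToolsEvaluator(allowed: string[]): Evaluator {
  return (events) =>
    events
      .filter(
        (e) =>
          e.type === "tool_call" &&
          e.tool !== undefined &&
          allowed.indexOf(e.tool) === -1,
      )
      .map((e) => ({
        rule: "allowed-tools",
        message: `unexpected tool: ${e.tool}`,
      }));
}

// Run every evaluator over the same event stream and collect all violations.
function evaluate(events: SessionEvent[], evaluators: Evaluator[]): Violation[] {
  return evaluators.reduce<Violation[]>(
    (all, run) => all.concat(run(events)),
    [],
  );
}

const sampleEvents: SessionEvent[] = [
  { type: "message" },
  { type: "tool_call", tool: "read" },
  { type: "tool_call", tool: "bash" },
];
console.log(evaluate(sampleEvents, [allowedToolsEvaluator(["read"])]));
```

Keeping each evaluator a pure function of the event list makes it easy to run several rules over one recorded session.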
- ✅ SDK-Based Execution - Real agent interaction with event streaming
- ✅ Three-Tier Testing - Smoke (30s), Core (5-8min), Full (40-80min)
- ✅ Sequential Execution - Rate limiting protection for free tier
- ✅ Cost-Aware - FREE by default (grok-code-fast)
- ✅ 8 Evaluators - Comprehensive rule validation
- ✅ Interactive Dashboard - Results visualization and trends
- ✅ CI/CD Ready - GitHub Actions configured
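Sequential execution is what keeps a free-tier model from being rate limited. The sketch below shows the general pattern: run one test at a time, with a pause between runs. It is not the framework's actual runner; `sleep`, `runSequentially`, and the delay value are hypothetical.

```typescript
// Hypothetical helper showing the pattern, not the framework's runner.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run async tasks strictly one after another, pausing between them.
async function runSequentially<T>(
  tasks: Array<() => Promise<T>>,
  delayMs: number,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < tasks.length; i++) {
    if (i > 0) await sleep(delayMs); // pause between tests, not before the first
    results.push(await tasks[i]());
  }
  return results;
}

// Stand-in "tests" that resolve immediately.
const fakeTests = ["first", "second"].map(
  (name) => async () => `${name}: done`,
);
runSequentially(fakeTests, 50).then((results) => console.log(results));
```

Compared with `Promise.all`, this gives up parallel speed in exchange for never having more than one request in flight.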
Main Guide: GUIDE.md - Complete evaluation system guide
Includes:
# Run core tests (recommended for development)
npm run test:core
# Run with specific model
npm run test:core -- --model=anthropic/claude-sonnet-4-5
# Debug mode
npm run test:core -- --debug
# View results
cd evals/results && ./serve.sh
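When these commands are launched from a script, the flags can be assembled programmatically. The sketch below builds the command strings shown above. `RunOptions` and `buildCommand` are hypothetical names, and passing `--model` and `--debug` together is an assumption: this README only shows them used separately.

```typescript
// Hypothetical helper; only the flags shown in this README are covered.
interface RunOptions {
  model?: string; // e.g. "anthropic/claude-sonnet-4-5"
  debug?: boolean;
}

function buildCommand(script: string, options: RunOptions = {}): string {
  const flags: string[] = [];
  if (options.model) flags.push(`--model=${options.model}`);
  if (options.debug) flags.push("--debug");
  // npm forwards everything after "--" to the underlying script.
  return flags.length > 0
    ? `npm run ${script} -- ${flags.join(" ")}`
    : `npm run ${script}`;
}

console.log(buildCommand("test:core", { model: "anthropic/claude-sonnet-4-5" }));
console.log(buildCommand("test:core", { debug: true }));
```

The `--` separator matters: without it, npm would treat the flags as its own options instead of handing them to the test script.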
See GUIDE.md for complete usage examples and test schema
See GUIDE.md for details on:
Complete Guide: GUIDE.md
Issues: Create an issue on GitHub
Questions: Check GUIDE.md first
Last Updated: 2024-11-28
Framework Version: 0.1.0
Status: ✅ Production Ready (9/10)
Rating: EXCELLENT