# OpenCode Agent Evaluation Framework

Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.

---

## 🚀 Quick Start

```bash
# CI/CD - Smoke test (30 seconds)
npm run test:ci:openagent

# Development - Core tests (5-8 minutes)
npm run test:core

# Release - Full suite (40-80 minutes)
npm run test:openagent

# View results dashboard
cd evals/results && ./serve.sh
```

**📖 Complete Guide**: See [GUIDE.md](GUIDE.md) for everything you need to know

---

## 📊 Testing Strategy

### Three-Tier Approach

| Tier | Tests | Time | Coverage | Use Case |
|------|-------|------|----------|----------|
| **Smoke** ⚡ | 1 | ~30s | ~10% | CI/CD, every PR |
| **Core** ✅ | 7 | 5-8 min | ~85% | Development, pre-commit |
| **Full** 🔬 | 71 | 40-80 min | 100% | Release validation |

### Current Status

| Agent | Tests | Status |
|-------|-------|--------|
| **OpenAgent** | 71 tests | ✅ Production Ready |
| **Opencoder** | 4 tests | ✅ Production Ready |

---

## 📁 Directory Structure

```
evals/
├── framework/                    # Core evaluation engine
│   ├── src/
│   │   ├── sdk/                  # Test runner & execution
│   │   ├── evaluators/           # Rule validators (8 types)
│   │   └── collector/            # Session data collection
│   └── package.json
│
├── agents/                       # Agent-specific tests
│   ├── openagent/
│   │   ├── config/               # Core test configuration
│   │   ├── tests/                # 71 tests organized by category
│   │   └── docs/
│   └── opencoder/
│       └── tests/
│
├── results/                      # Test results & dashboard
│   ├── history/                  # Historical results
│   ├── index.html                # Interactive dashboard
│   └── latest.json
│
├── GUIDE.md                      # Complete guide (READ THIS)
└── README.md                     # This file
```

---

## 🎯 Key Features

- ✅ **SDK-Based Execution** - Real agent interaction with event streaming
- ✅ **Three-Tier Testing** - Smoke (30s), Core (5-8min), Full (40-80min)
- ✅ **Sequential Execution** - Rate limiting protection for free tier
- ✅ **Cost-Aware** - FREE by default (grok-code-fast)
- ✅ **8 Evaluators** - Comprehensive rule validation
- ✅ **Interactive Dashboard** - Results visualization and trends
- ✅ **CI/CD Ready** - GitHub Actions configured

---

## 📚 Documentation

**Main Guide**: [GUIDE.md](GUIDE.md) - Complete evaluation system guide

**Includes**:
- Quick start and installation
- Three-tier testing strategy (smoke, core, full)
- Architecture and components
- Test schema and examples
- Core tests detailed breakdown
- Results and dashboard
- CI/CD integration
- Troubleshooting
- System review and recommendations

---

## 🎨 Usage Examples

```bash
# Run core tests (recommended for development)
npm run test:core

# Run with specific model
npm run test:core -- --model=anthropic/claude-sonnet-4-5

# Debug mode
npm run test:core -- --debug

# View results
cd evals/results && ./serve.sh
```

**See [GUIDE.md](GUIDE.md) for complete usage examples and test schema**

---

## 🤝 Contributing

See [GUIDE.md](GUIDE.md) for details on:
- Adding new tests
- Creating evaluators
- Modifying core tests

---

## 🆘 Support

- **Complete Guide**: [GUIDE.md](GUIDE.md)
- **Issues**: Create an issue on GitHub
- **Questions**: Check GUIDE.md first

---

**Last Updated**: 2024-11-28

**Framework Version**: 0.1.0

**Status**: ✅ Production Ready (9/10)

**Rating**: EXCELLENT
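## 🧪 Example: Selecting a Tier from a Script

The three-tier strategy maps each tier to one npm script from the Quick Start. The sketch below shows one way a wrapper might pick the script by tier name. The function names `tier_script` and `run_tier` and the `DRY_RUN` variable are hypothetical and are not part of the framework; only the npm script names come from this README.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with the framework).

# Print the npm script name for a tier, as listed in the Quick Start.
tier_script() {
  case "$1" in
    smoke) echo "test:ci:openagent" ;;  # ~30s, CI/CD
    core)  echo "test:core" ;;          # 5-8 min, development
    full)  echo "test:openagent" ;;     # 40-80 min, release
    *)     echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Run the tier's npm script. With DRY_RUN=1, only print the command.
run_tier() {
  local script
  script="$(tier_script "$1")" || return 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "npm run $script"
  else
    npm run "$script"
  fi
}
```

After sourcing the file, `DRY_RUN=1 run_tier core` prints the command instead of running it, which is a cheap way to check the mapping before starting a long suite.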