Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.
```bash
# CI/CD - Smoke test (30 seconds)
npm run test:ci:openagent

# Development - Core tests (5-8 minutes)
npm run test:core

# Release - Full suite (40-80 minutes)
npm run test:openagent

# View results dashboard
cd evals/results && ./serve.sh
```
📖 Complete Guide: See GUIDE.md for everything you need to know
| Tier | Tests | Time | Coverage | Use Case |
|---|---|---|---|---|
| Smoke ⚡ | 1 | ~30s | ~10% | CI/CD, every PR |
| Core ✅ | 7 | 5-8 min | ~85% | Development, pre-commit |
| Full 🔬 | 71 | 40-80 min | 100% | Release validation |
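
The tier table can be read as a lookup from workflow stage to npm script. The sketch below is illustrative only: the `Stage` type, the `TIERS` map, and `commandFor` are hypothetical names rather than part of the framework, while the script names and test counts are taken from the table above.

```typescript
// Hypothetical helper: maps a workflow stage to the npm script listed in the tier table.
type Stage = "pr" | "development" | "release";

interface Tier {
  name: string;      // tier name from the table
  script: string;    // npm script named in this README
  testCount: number; // number of tests listed for the tier
}

const TIERS: Record<Stage, Tier> = {
  pr: { name: "Smoke", script: "test:ci:openagent", testCount: 1 },
  development: { name: "Core", script: "test:core", testCount: 7 },
  release: { name: "Full", script: "test:openagent", testCount: 71 },
};

// Returns the shell command a developer would run for the given stage.
function commandFor(stage: Stage): string {
  return `npm run ${TIERS[stage].script}`;
}

console.log(commandFor("pr")); // prints "npm run test:ci:openagent"
```
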
| Agent | Tests | Status |
|---|---|---|
| OpenAgent | 71 tests | ✅ Production Ready |
| Opencoder | 4 tests | ✅ Production Ready |
```
evals/
├── framework/                # Core evaluation engine
│   ├── src/
│   │   ├── sdk/              # Test runner & execution
│   │   ├── evaluators/       # Rule validators (8 types)
│   │   └── collector/        # Session data collection
│   └── package.json
│
├── agents/                   # Agent-specific tests
│   ├── openagent/
│   │   ├── config/           # Core test configuration
│   │   ├── tests/            # 71 tests organized by category
│   │   └── docs/
│   └── opencoder/
│       └── tests/
│
├── results/                  # Test results & dashboard
│   ├── history/              # Historical results
│   ├── index.html            # Interactive dashboard
│   └── latest.json
│
├── GUIDE.md                  # Complete guide (READ THIS)
└── README.md                 # This file
```
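
To give a feel for how the `collector/` and `evaluators/` directories might fit together, here is a minimal sketch of a rule validator that scans collected session events and reports violations. Every type and name in it (`SessionEvent`, `Violation`, `Evaluator`, `runEvaluators`, and the example rule itself) is an assumption made for illustration. It is not the framework's actual API, and the rule is not claimed to be one of its 8 evaluators.

```typescript
// Hypothetical shapes; the real framework's types may differ.
interface SessionEvent {
  type: string;      // illustrative values: "approval", "tool_call"
  timestamp: number;
}

interface Violation {
  rule: string;
  message: string;
  eventIndex: number;
}

interface Evaluator {
  name: string;
  evaluate(events: SessionEvent[]): Violation[];
}

const RULE_NAME = "approval-before-tool-call";

// Illustrative rule: every "tool_call" must come after at least one "approval".
const approvalBeforeToolCall: Evaluator = {
  name: RULE_NAME,
  evaluate(events: SessionEvent[]): Violation[] {
    const violations: Violation[] = [];
    let approved = false;
    events.forEach((event, index) => {
      if (event.type === "approval") approved = true;
      if (event.type === "tool_call" && !approved) {
        violations.push({
          rule: RULE_NAME,
          message: "tool call occurred before any approval event",
          eventIndex: index,
        });
      }
    });
    return violations;
  },
};

// Runs every evaluator over the same event stream and merges the findings.
function runEvaluators(evaluators: Evaluator[], events: SessionEvent[]): Violation[] {
  const all: Violation[] = [];
  for (const evaluator of evaluators) {
    for (const violation of evaluator.evaluate(events)) all.push(violation);
  }
  return all;
}
```

Keeping each rule behind one small interface means a new check can be added without touching the runner, which is one plausible reason to organize validators in their own directory.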
- ✅ SDK-Based Execution - Real agent interaction with event streaming
- ✅ Three-Tier Testing - Smoke (30s), Core (5-8min), Full (40-80min)
- ✅ Sequential Execution - Rate limiting protection for free tier
- ✅ Cost-Aware - FREE by default (grok-code-fast)
- ✅ 8 Evaluators - Comprehensive rule validation
- ✅ Interactive Dashboard - Results visualization and trends
- ✅ CI/CD Ready - GitHub Actions configured
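
The sequential-execution feature can be pictured as a loop that runs one test at a time and pauses between runs, so requests to a rate-limited model never overlap. This is a minimal sketch under assumptions: `TestTask`, `sleep`, and `runSequentially` are hypothetical names, and the delay value is arbitrary rather than taken from the framework.

```typescript
// Hypothetical task type: each test is an async function that resolves to pass/fail.
type TestTask = () => Promise<boolean>;

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Runs tasks strictly one after another, waiting delayMs between consecutive tasks.
async function runSequentially(tasks: TestTask[], delayMs: number): Promise<boolean[]> {
  const results: boolean[] = [];
  let first = true;
  for (const task of tasks) {
    if (!first) await sleep(delayMs); // no delay before the very first task
    first = false;
    results.push(await task());       // the next task starts only after this one settles
  }
  return results;
}
```

The trade-off is wall-clock time: running tests one by one is slower than running them in parallel, which is consistent with the 40-80 minute estimate for the full suite.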
Main Guide: GUIDE.md - Complete evaluation system guide
```bash
# Run core tests (recommended for development)
npm run test:core

# Run with specific model
npm run test:core -- --model=anthropic/claude-sonnet-4-5

# Debug mode
npm run test:core -- --debug

# View results
cd evals/results && ./serve.sh
```
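
In the commands above, the bare `--` tells npm to forward everything after it to the underlying test script. The sketch below assembles such an argument list programmatically. `RunOptions` and `buildCoreTestArgs` are hypothetical names, and passing `--model` and `--debug` together is an assumption, since the README only shows them separately.

```typescript
interface RunOptions {
  model?: string;  // e.g. "anthropic/claude-sonnet-4-5", as in the example above
  debug?: boolean; // adds the --debug flag shown above
}

// Builds the argument list for `npm`, forwarding flags only after a bare "--".
function buildCoreTestArgs(options: RunOptions = {}): string[] {
  const forwarded: string[] = [];
  if (options.model) forwarded.push(`--model=${options.model}`);
  if (options.debug) forwarded.push("--debug");
  const args = ["run", "test:core"];
  // npm passes flags through to the script only when they follow "--".
  return forwarded.length > 0 ? args.concat(["--"], forwarded) : args;
}

const example = buildCoreTestArgs({ model: "anthropic/claude-sonnet-4-5", debug: true });
console.log(["npm"].concat(example).join(" "));
// prints "npm run test:core -- --model=anthropic/claude-sonnet-4-5 --debug"
```
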
See GUIDE.md for complete usage examples, the test schema, and further details.
Complete Guide: GUIDE.md
Issues: Create an issue on GitHub
Questions: Check GUIDE.md first
Last Updated: 2024-11-28
Framework Version: 0.1.0
Status: ✅ Production Ready (9/10)
Rating: EXCELLENT