OpenCode Agent Evaluation Framework

Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.

🚀 Quick Start

# CI/CD - Smoke test (30 seconds)
npm run test:ci:openagent

# Development - Core tests (5-8 minutes)
npm run test:core

# Release - Full suite (40-80 minutes)
npm run test:openagent

# View results dashboard
cd evals/results && ./serve.sh

📖 Complete Guide: See GUIDE.md for everything you need to know

📊 Testing Strategy

Three-Tier Approach

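| Tier     | Tests | Time      | Coverage | Use Case                |
|----------|-------|-----------|----------|-------------------------|
| Smoke ⚡ | 1     | ~30s      | ~10%     | CI/CD, every PR         |
| Core ✅  | 7     | 5-8 min   | ~85%     | Development, pre-commit |
| Full 🔬  | 71    | 40-80 min | 100%     | Release validation      |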
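
The tier table maps directly onto the three npm scripts from the Quick Start. As a minimal sketch, a small helper can make that mapping explicit in a local script or CI step. The function name `tier_script` and the tier keywords `smoke`, `core`, and `full` are assumptions made for this example; only the npm script names come from this README.

```shell
# Map a tier keyword to the npm script listed in the Quick Start.
# `tier_script` and the keywords are illustrative, not part of the framework.
tier_script() {
  case "$1" in
    smoke) echo "test:ci:openagent" ;;  # 1 test, ~30s
    core)  echo "test:core" ;;          # 7 tests, 5-8 min
    full)  echo "test:openagent" ;;     # 71 tests, 40-80 min
    *)
      echo "unknown tier: $1" >&2
      return 1
      ;;
  esac
}
```

With this in place, `npm run "$(tier_script core)"` would run the core tier. An unknown keyword fails with a non-zero status instead of silently running the wrong suite.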

Current Status

| Agent     | Tests    | Status              |
|-----------|----------|---------------------|
| OpenAgent | 71 tests | ✅ Production Ready |
| Opencoder | 4 tests  | ✅ Production Ready |

📁 Directory Structure

evals/
├── framework/              # Core evaluation engine
│   ├── src/
│   │   ├── sdk/           # Test runner & execution
│   │   ├── evaluators/    # Rule validators (8 types)
│   │   └── collector/     # Session data collection
│   └── package.json
│
├── agents/                # Agent-specific tests
│   ├── openagent/
│   │   ├── config/        # Core test configuration
│   │   ├── tests/         # 71 tests organized by category
│   │   └── docs/
│   └── opencoder/
│       └── tests/
│
├── results/               # Test results & dashboard
│   ├── history/           # Historical results
│   ├── index.html         # Interactive dashboard
│   └── latest.json
│
├── GUIDE.md              # Complete guide (READ THIS)
└── README.md             # This file
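
A quick way to confirm that a checkout matches the tree above is to test for a few of its paths. The following is a sketch only: `check_eval_layout` is a hypothetical helper name, and the list of paths is a subset copied from the tree, not an official manifest.

```shell
# Verify that selected paths from the directory tree exist under a root.
# `check_eval_layout` is a hypothetical helper; the root defaults to "evals".
check_eval_layout() {
  root="${1:-evals}"
  missing=0
  for entry in \
    framework/src \
    agents/openagent/tests \
    agents/opencoder/tests \
    results/index.html \
    results/latest.json
  do
    if [ ! -e "$root/$entry" ]; then
      echo "missing: $root/$entry" >&2
      missing=1
    fi
  done
  # Exit status is 0 when every path exists, 1 otherwise.
  return "$missing"
}
```

Running `check_eval_layout evals` from the repository root prints each missing path to stderr and returns a non-zero status if any are absent.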

🎯 Key Features

✅ SDK-Based Execution - Real agent interaction with event streaming
✅ Three-Tier Testing - Smoke (30s), Core (5-8min), Full (40-80min)
✅ Sequential Execution - Tests run one at a time to stay within free-tier rate limits
✅ Cost-Aware - Free by default (uses grok-code-fast)
✅ 8 Evaluators - Comprehensive rule validation
✅ Interactive Dashboard - Results visualization and trends
✅ CI/CD Ready - GitHub Actions configured

📚 Documentation

Main Guide: GUIDE.md - Complete evaluation system guide

Includes:

- Quick start and installation
- Three-tier testing strategy (smoke, core, full)
- Architecture and components
- Test schema and examples
- Core tests detailed breakdown
- Results and dashboard
- CI/CD integration
- Troubleshooting
- System review and recommendations

🎨 Usage Examples

# Run core tests (recommended for development)
npm run test:core

# Run with specific model
npm run test:core -- --model=anthropic/claude-sonnet-4-5

# Debug mode
npm run test:core -- --debug

# View results
cd evals/results && ./serve.sh

See GUIDE.md for complete usage examples and test schema
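
The `--model=` and `--debug` flags shown above are passed to the test runner after npm's `--` separator. A minimal sketch of assembling that command from environment variables follows. `EVAL_MODEL` and `EVAL_DEBUG` are hypothetical names introduced for this example, and whether the two flags can be combined in one run is an assumption; GUIDE.md is the authority on supported options.

```shell
# Print the `npm run test:core` command line for the current settings.
# EVAL_MODEL and EVAL_DEBUG are hypothetical variables for this sketch;
# only the --model= and --debug flags come from the usage examples.
core_test_command() {
  cmd="npm run test:core"
  extra=""
  if [ -n "${EVAL_MODEL:-}" ]; then
    extra="$extra --model=$EVAL_MODEL"
  fi
  if [ "${EVAL_DEBUG:-0}" = "1" ]; then
    extra="$extra --debug"
  fi
  # Only add npm's "--" separator when there are flags to forward.
  if [ -n "$extra" ]; then
    cmd="$cmd --$extra"
  fi
  echo "$cmd"
}
```

The function prints the command instead of running it, so the result can be reviewed before a 5-8 minute run starts.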

🤝 Contributing

See GUIDE.md for details on:

- Adding new tests
- Creating evaluators
- Modifying core tests

🆘 Support

Complete Guide: GUIDE.md
Issues: Create an issue on GitHub
Questions: Check GUIDE.md first

Last Updated: 2024-11-28
Framework Version: 0.1.0
Status: ✅ Production Ready (9/10)
Rating: EXCELLENT