OpenCode Agent Evaluation Framework

Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.


🚀 Quick Start

# CI/CD - Smoke test (30 seconds)
npm run test:ci:openagent

# Development - Core tests (5-8 minutes)
npm run test:core

# Release - Full suite (40-80 minutes)
npm run test:openagent

# View results dashboard
cd evals/results && ./serve.sh

📖 Complete Guide: See GUIDE.md for everything you need to know


📊 Testing Strategy

Three-Tier Approach

| Tier | Tests | Time | Coverage | Use Case |
|------|-------|------|----------|----------|
| Smoke | 1 | ~30s | ~10% | CI/CD, every PR |
| Core | 7 | 5-8 min | ~85% | Development, pre-commit |
| Full 🔬 | 71 | 40-80 min | 100% | Release validation |
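
The three tiers map onto the npm scripts shown in Quick Start. As a minimal sketch, a small shell function can translate a tier name into the matching script. The function name `tier_script` is hypothetical and not part of the framework; only the script names come from this README.

```shell
# Hypothetical helper (not shipped with the framework): print the npm script
# that corresponds to a tier name, using the script names from Quick Start.
tier_script() {
  case "$1" in
    smoke) echo "test:ci:openagent" ;;  # ~30s, CI/CD on every PR
    core)  echo "test:core" ;;          # 5-8 min, development
    full)  echo "test:openagent" ;;     # 40-80 min, release validation
    *)
      # Unknown tier names are reported on stderr and signalled by exit status.
      echo "unknown tier: $1" >&2
      return 1
      ;;
  esac
}
```

With this in place, `npm run "$(tier_script core)"` would run the core tier, and an unknown tier name fails before npm is invoked.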

Current Status

| Agent | Tests | Status |
|-------|-------|--------|
| OpenAgent | 71 tests | ✅ Production Ready |
| Opencoder | 4 tests | ✅ Production Ready |

📁 Directory Structure

evals/
├── framework/              # Core evaluation engine
│   ├── src/
│   │   ├── sdk/           # Test runner & execution
│   │   ├── evaluators/    # Rule validators (8 types)
│   │   └── collector/     # Session data collection
│   └── package.json
│
├── agents/                # Agent-specific tests
│   ├── openagent/
│   │   ├── config/        # Core test configuration
│   │   ├── tests/         # 71 tests organized by category
│   │   └── docs/
│   └── opencoder/
│       └── tests/
│
├── results/               # Test results & dashboard
│   ├── history/           # Historical results
│   ├── index.html         # Interactive dashboard
│   └── latest.json
│
├── GUIDE.md              # Complete guide (READ THIS)
└── README.md             # This file

🎯 Key Features

  • SDK-Based Execution - Real agent interaction with event streaming
  • Three-Tier Testing - Smoke (30s), Core (5-8 min), Full (40-80 min)
  • Sequential Execution - Tests run one at a time to stay within free-tier rate limits
  • Cost-Aware - Free by default (default model: grok-code-fast)
  • 8 Evaluators - Comprehensive rule validation
  • Interactive Dashboard - Results visualization and trends
  • CI/CD Ready - GitHub Actions configured
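
To illustrate the idea behind sequential execution, the sketch below runs commands one at a time with a pause between them, so that calls to a rate-limited API stay spaced out. It is an illustration only and not the framework's own runner. The function name `run_sequentially` and the delay parameter are assumptions made for this example.

```shell
# Hypothetical helper (not part of the framework): run each command in turn,
# sleeping between commands, and report how many of them failed.
run_sequentially() {
  delay="$1"   # seconds to wait between commands
  shift
  failed=0
  first=1
  for cmd in "$@"; do
    # Pause before every command except the first one.
    if [ "$first" -eq 0 ]; then
      sleep "$delay"
    fi
    first=0
    # Each argument is one full command line, run in a child shell.
    sh -c "$cmd" || failed=$((failed + 1))
  done
  echo "failed: $failed"
  # Succeed only if every command succeeded.
  [ "$failed" -eq 0 ]
}
```

For example, `run_sequentially 5 'npm run test:ci:openagent' 'npm run test:core'` would run the smoke test, wait five seconds, then run the core tests. A failing command does not stop the remaining ones, so a single run still reports every failure.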


📚 Documentation

Main Guide: GUIDE.md - Complete evaluation system guide

Includes:

  • Quick start and installation
  • Three-tier testing strategy (smoke, core, full)
  • Architecture and components
  • Test schema and examples
  • Core tests detailed breakdown
  • Results and dashboard
  • CI/CD integration
  • Troubleshooting
  • System review and recommendations

🎨 Usage Examples

# Run core tests (recommended for development)
npm run test:core

# Run with specific model
npm run test:core -- --model=anthropic/claude-sonnet-4-5

# Debug mode
npm run test:core -- --debug

# View results
cd evals/results && ./serve.sh

See GUIDE.md for complete usage examples and test schema


🤝 Contributing

See GUIDE.md for details on:

  • Adding new tests
  • Creating evaluators
  • Modifying core tests

🆘 Support

Complete Guide: GUIDE.md
Issues: Create an issue on GitHub
Questions: Check GUIDE.md first


Last Updated: 2024-11-28
Framework Version: 0.1.0
Status: ✅ Production Ready (9/10)
Rating: EXCELLENT