|
|
@@ -7,41 +7,39 @@ Comprehensive SDK-based evaluation framework for testing OpenCode agents with re
|
|
|
## 🚀 Quick Start
|
|
|
|
|
|
```bash
|
|
|
-cd evals/framework
|
|
|
-npm install
|
|
|
-npm run build
|
|
|
+# CI/CD - Smoke test (30 seconds)
|
|
|
+npm run test:ci:openagent
|
|
|
|
|
|
-# Run all tests (free model by default)
|
|
|
-npm run eval:sdk
|
|
|
+# Development - Core tests (5-8 minutes)
|
|
|
+npm run test:core
|
|
|
|
|
|
-# Run specific agent
|
|
|
-npm run eval:sdk -- --agent=openagent
|
|
|
-npm run eval:sdk -- --agent=opencoder
|
|
|
+# Release - Full suite (40-80 minutes)
|
|
|
+npm run test:openagent
|
|
|
|
|
|
# View results dashboard
|
|
|
-cd ../results && ./serve.sh
|
|
|
+cd evals/results && ./serve.sh
|
|
|
```
|
|
|
|
|
|
-**📖 New to the framework?** Start with [GETTING_STARTED.md](GETTING_STARTED.md)
|
|
|
+**📖 Complete Guide**: See [GUIDE.md](GUIDE.md) for everything you need to know
|
|
|
|
|
|
---
|
|
|
|
|
|
-## 📊 Current Status
|
|
|
+## 📊 Testing Strategy
|
|
|
|
|
|
-### Test Coverage
|
|
|
+### Three-Tier Approach
|
|
|
|
|
|
-| Agent | Tests | Pass Rate | Status |
|
|
|
-|-------|-------|-----------|--------|
|
|
|
-| **OpenAgent** | 22 tests | 100% | ✅ Production Ready |
|
|
|
-| **Opencoder** | 4 tests | 100% | ✅ Production Ready |
|
|
|
+| Tier | Tests | Time | Coverage | Use Case |
|
|
|
+|------|-------|------|----------|----------|
|
|
|
+| **Smoke** ⚡ | 1 | ~30s | ~10% | CI/CD, every PR |
|
|
|
+| **Core** ✅ | 7 | 5-8 min | ~85% | Development, pre-commit |
|
|
|
+| **Full** 🔬 | 71 | 40-80 min | 100% | Release validation |
|
|
|
|
|
|
-### Recent Achievements (Nov 26, 2025)
|
|
|
+### Current Status
|
|
|
|
|
|
-✅ **Context Loading Tests** - 5 comprehensive tests (3 simple, 2 complex multi-turn)
|
|
|
-✅ **Smart Timeout System** - Activity monitoring with absolute max timeout
|
|
|
-✅ **Fixed Context Evaluator** - Properly detects context files in multi-turn sessions
|
|
|
-✅ **Batch Test Runner** - Run tests in controlled batches to avoid API limits
|
|
|
-✅ **Results Dashboard** - Interactive web dashboard with filtering and charts
|
|
|
+| Agent | Tests | Status |
|
|
|
+|-------|-------|--------|
|
|
|
+| **OpenAgent** | 71 tests | ✅ Production Ready |
|
|
|
+| **Opencoder** | 4 tests | ✅ Production Ready |
|
|
|
|
|
|
---
|
|
|
|
|
|
@@ -49,306 +47,99 @@ cd ../results && ./serve.sh
|
|
|
|
|
|
```
|
|
|
evals/
|
|
|
-├── framework/ # Core evaluation framework
|
|
|
+├── framework/ # Core evaluation engine
|
|
|
│ ├── src/
|
|
|
-│ │ ├── sdk/ # SDK-based test runner
|
|
|
-│ │ ├── collector/ # Session data collection
|
|
|
-│ │ ├── evaluators/ # Rule violation detection
|
|
|
-│ │ └── types/ # TypeScript types
|
|
|
-│ ├── docs/ # Framework documentation
|
|
|
-│ ├── scripts/utils/run-tests-batch.sh # Batch test runner
|
|
|
-│ └── README.md # Framework docs
|
|
|
+│ │ ├── sdk/ # Test runner & execution
|
|
|
+│ │ ├── evaluators/ # Rule validators (8 types)
|
|
|
+│ │ └── collector/ # Session data collection
|
|
|
+│ └── package.json
|
|
|
│
|
|
|
-├── agents/ # Agent-specific test suites
|
|
|
-│ ├── openagent/ # OpenAgent tests
|
|
|
-│ │ ├── tests/
|
|
|
-│ │ │ ├── context-loading/ # Context loading tests (NEW)
|
|
|
-│ │ │ ├── developer/ # Developer workflow tests
|
|
|
-│ │ │ ├── business/ # Business analysis tests
|
|
|
-│ │ │ └── edge-case/ # Edge case tests
|
|
|
-│ │ ├── CONTEXT_LOADING_COVERAGE.md
|
|
|
-│ │ ├── IMPLEMENTATION_SUMMARY.md
|
|
|
-│ │ └── README.md
|
|
|
-│ │
|
|
|
-│ ├── opencoder/ # Opencoder tests
|
|
|
-│ │ ├── tests/developer/
|
|
|
-│ │ └── README.md
|
|
|
-│ │
|
|
|
-│ └── shared/ # Shared test utilities
|
|
|
+├── agents/ # Agent-specific tests
|
|
|
+│ ├── openagent/
|
|
|
+│ │ ├── config/ # Core test configuration
|
|
|
+│ │ ├── tests/ # 71 tests organized by category
|
|
|
+│ │ └── docs/
|
|
|
+│ └── opencoder/
|
|
|
+│ └── tests/
|
|
|
│
|
|
|
-├── results/ # Test results & dashboard
|
|
|
-│ ├── history/ # Historical results (60-day retention)
|
|
|
-│ ├── index.html # Interactive dashboard
|
|
|
-│ ├── serve.sh # One-command server
|
|
|
-│ ├── latest.json # Latest test results
|
|
|
-│ └── README.md
|
|
|
+├── results/ # Test results & dashboard
|
|
|
+│ ├── history/ # Historical results
|
|
|
+│ ├── index.html # Interactive dashboard
|
|
|
+│ └── latest.json
|
|
|
│
|
|
|
-├── test_tmp/ # Temporary test files (auto-cleaned)
|
|
|
-│
|
|
|
-├── GETTING_STARTED.md # Quick start guide (START HERE)
|
|
|
-├── HOW_TESTS_WORK.md # Detailed test execution guide
|
|
|
-├── ARCHITECTURE.md # System architecture review
|
|
|
-└── README.md # This file
|
|
|
+├── GUIDE.md # Complete guide (READ THIS)
|
|
|
+└── README.md # This file
|
|
|
```
|
|
|
|
|
|
---
|
|
|
|
|
|
## 🎯 Key Features
|
|
|
|
|
|
-### ✅ SDK-Based Execution
|
|
|
-- Uses official `@opencode-ai/sdk` for real agent interaction
|
|
|
-- Real-time event streaming (10+ events per test)
|
|
|
-- Actual session recording to disk
|
|
|
-
|
|
|
-### ✅ Cost-Aware Testing
|
|
|
-- **FREE by default** - Uses `opencode/grok-code-fast` (OpenCode Zen)
|
|
|
-- Override per-test or via CLI: `--model=provider/model`
|
|
|
-- No accidental API costs during development
|
|
|
-
|
|
|
-### ✅ Smart Timeout System (NEW)
|
|
|
-- Activity monitoring - extends timeout while agent is working
|
|
|
-- Base timeout: 300s (5 min) of inactivity
|
|
|
-- Absolute max: 600s (10 min) hard limit
|
|
|
-- Prevents false timeouts on complex multi-turn tests
|
|
|
-
|
|
|
-### ✅ Context Loading Validation (NEW)
|
|
|
-- 5 comprehensive tests covering simple and complex scenarios
|
|
|
-- Verifies context files loaded before execution
|
|
|
-- Multi-turn conversation support
|
|
|
-- Proper file path extraction from SDK events
|
|
|
-
|
|
|
-### ✅ Rule-Based Validation
|
|
|
-- 4 evaluators check compliance with agent rules
|
|
|
-- Tests behavior (tool usage, approvals) not style
|
|
|
-- Model-agnostic test design
|
|
|
-
|
|
|
-### ✅ Results Tracking & Visualization
|
|
|
-- Type-safe JSON result generation
|
|
|
-- Interactive web dashboard with filtering
|
|
|
-- Pass rate trend charts
|
|
|
-- CSV export functionality
|
|
|
-- 60-day retention policy
|
|
|
+✅ **SDK-Based Execution** - Real agent interaction with event streaming
|
|
|
+✅ **Three-Tier Testing** - Smoke (30s), Core (5-8min), Full (40-80min)
|
|
|
+✅ **Sequential Execution** - Rate limiting protection for free tier
|
|
|
+✅ **Cost-Aware** - FREE by default (grok-code-fast)
|
|
|
+✅ **8 Evaluators** - Comprehensive rule validation
|
|
|
+✅ **Interactive Dashboard** - Results visualization and trends
|
|
|
+✅ **CI/CD Ready** - GitHub Actions configured
|
|
|
|
|
|
---
|
|
|
|
|
|
## 📚 Documentation
|
|
|
|
|
|
-| Document | Purpose | Audience |
|
|
|
-|----------|---------|----------|
|
|
|
-| **[GETTING_STARTED.md](GETTING_STARTED.md)** | Quick start guide | New users |
|
|
|
-| **[HOW_TESTS_WORK.md](HOW_TESTS_WORK.md)** | Test execution details | Test authors |
|
|
|
-| **[ARCHITECTURE.md](ARCHITECTURE.md)** | System architecture | Developers |
|
|
|
-| **[framework/SDK_EVAL_README.md](framework/SDK_EVAL_README.md)** | Complete SDK guide | All users |
|
|
|
-| **[framework/docs/test-design-guide.md](framework/docs/test-design-guide.md)** | Test design philosophy | Test authors |
|
|
|
-| **[agents/openagent/CONTEXT_LOADING_COVERAGE.md](agents/openagent/CONTEXT_LOADING_COVERAGE.md)** | Context loading tests | OpenAgent users |
|
|
|
-| **[agents/openagent/IMPLEMENTATION_SUMMARY.md](agents/openagent/IMPLEMENTATION_SUMMARY.md)** | Recent implementation | Developers |
|
|
|
-
|
|
|
----
|
|
|
-
|
|
|
-## 🔧 Agent Differences
|
|
|
+**Main Guide**: [GUIDE.md](GUIDE.md) - Complete evaluation system guide
|
|
|
|
|
|
-| Feature | OpenAgent | Opencoder |
|
|
|
-|---------|-----------|-----------|
|
|
|
-| **Approval** | Text-based + tool permissions | Tool permissions only |
|
|
|
-| **Workflow** | Analyze→Approve→Execute→Validate | Direct execution |
|
|
|
-| **Context** | Mandatory before execution | On-demand |
|
|
|
-| **Test Style** | Multi-turn (approval flow) | Single prompt |
|
|
|
-| **Timeout** | 300s (smart timeout) | 60s (standard) |
|
|
|
+**Includes**:
|
|
|
+- Quick start and installation
|
|
|
+- Three-tier testing strategy (smoke, core, full)
|
|
|
+- Architecture and components
|
|
|
+- Test schema and examples
|
|
|
+- Core tests detailed breakdown
|
|
|
+- Results and dashboard
|
|
|
+- CI/CD integration
|
|
|
+- Troubleshooting
|
|
|
+- System review and recommendations
|
|
|
|
|
|
---
|
|
|
|
|
|
## 🎨 Usage Examples
|
|
|
|
|
|
-### Run Tests
|
|
|
-
|
|
|
-```bash
|
|
|
-# All tests with free model
|
|
|
-npm run eval:sdk
|
|
|
-
|
|
|
-# Specific category
|
|
|
-npm run eval:sdk -- --pattern="context-loading/*.yaml"
|
|
|
-
|
|
|
-# Custom model
|
|
|
-npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022
|
|
|
-
|
|
|
-# Debug single test
|
|
|
-npm run eval:sdk -- --pattern="ctx-simple-coding-standards.yaml" --debug
|
|
|
-
|
|
|
-# Batch execution (avoid API limits)
|
|
|
-./scripts/utils/run-tests-batch.sh openagent 3 10
|
|
|
-```
|
|
|
-
|
|
|
-### View Results
|
|
|
-
|
|
|
```bash
|
|
|
-# Interactive dashboard (one command!)
|
|
|
-cd results && ./serve.sh
|
|
|
-
|
|
|
-# View JSON
|
|
|
-cat results/latest.json
|
|
|
-
|
|
|
-# Historical results
|
|
|
-ls results/history/2025-11/
|
|
|
-```
|
|
|
-
|
|
|
-### Create New Test
|
|
|
-
|
|
|
-```yaml
|
|
|
-# Example: context-loading/my-test.yaml
|
|
|
-id: my-test-001
|
|
|
-name: "My Test"
|
|
|
-description: What this test validates
|
|
|
-
|
|
|
-category: developer
|
|
|
-agent: openagent
|
|
|
-model: anthropic/claude-sonnet-4-5
|
|
|
-
|
|
|
-prompt: "Your test prompt here"
|
|
|
-
|
|
|
-behavior:
|
|
|
- mustUseTools: [read]
|
|
|
- requiresContext: true
|
|
|
- minToolCalls: 1
|
|
|
-
|
|
|
-expectedViolations:
|
|
|
- - rule: context-loading
|
|
|
- shouldViolate: false
|
|
|
- severity: error
|
|
|
-
|
|
|
-approvalStrategy:
|
|
|
- type: auto-approve
|
|
|
-
|
|
|
-timeout: 60000
|
|
|
-
|
|
|
-tags:
|
|
|
- - context-loading
|
|
|
-```
|
|
|
-
|
|
|
-See [GETTING_STARTED.md](GETTING_STARTED.md) for more examples.
|
|
|
-
|
|
|
----
|
|
|
+# Run core tests (recommended for development)
|
|
|
+npm run test:core
|
|
|
|
|
|
-## 🏗️ Framework Components
|
|
|
+# Run with specific model
|
|
|
+npm run test:core -- --model=anthropic/claude-sonnet-4-5
|
|
|
|
|
|
-### SDK Test Runner
|
|
|
-- **ServerManager** - Start/stop opencode server
|
|
|
-- **ClientManager** - Session and prompt management
|
|
|
-- **EventStreamHandler** - Real-time event capture
|
|
|
-- **TestRunner** - Test orchestration with evaluators
|
|
|
-- **ApprovalStrategies** - Auto-approve, deny, smart rules
|
|
|
+# Debug mode
|
|
|
+npm run test:core -- --debug
|
|
|
|
|
|
-### Evaluators
|
|
|
-- **ApprovalGateEvaluator** - Checks approval before tool execution
|
|
|
-- **ContextLoadingEvaluator** - Verifies context files loaded first (FIXED)
|
|
|
-- **DelegationEvaluator** - Validates delegation for 4+ files
|
|
|
-- **ToolUsageEvaluator** - Checks bash vs specialized tools
|
|
|
-- **BehaviorEvaluator** - Validates test-specific behavior expectations
|
|
|
-
|
|
|
-### Results System
|
|
|
-- **ResultSaver** - Type-safe JSON generation
|
|
|
-- **Dashboard** - Interactive web visualization
|
|
|
-- **Helper Scripts** - Easy deployment (`serve.sh`)
|
|
|
-
|
|
|
----
|
|
|
-
|
|
|
-## 🔬 Test Schema (v2)
|
|
|
-
|
|
|
-```yaml
|
|
|
-# Behavior expectations (what agent should do)
|
|
|
-behavior:
|
|
|
- mustUseTools: [read, write] # Required tools
|
|
|
- mustUseAnyOf: [[bash], [list]] # Alternative tools
|
|
|
- requiresApproval: true # Must ask for approval
|
|
|
- requiresContext: true # Must load context
|
|
|
- minToolCalls: 2 # Minimum tool calls
|
|
|
-
|
|
|
-# Expected violations (what rules to check)
|
|
|
-expectedViolations:
|
|
|
- - rule: approval-gate
|
|
|
- shouldViolate: false # Should NOT violate
|
|
|
- severity: error
|
|
|
-
|
|
|
- - rule: context-loading
|
|
|
- shouldViolate: false
|
|
|
- severity: error
|
|
|
+# View results
|
|
|
+cd evals/results && ./serve.sh
|
|
|
```
|
|
|
|
|
|
----
|
|
|
-
|
|
|
-## 📈 Recent Improvements
|
|
|
-
|
|
|
-### November 26, 2025
|
|
|
-
|
|
|
-1. **Context Loading Tests** (5 tests, 100% passing)
|
|
|
- - 3 simple tests (single prompt, read-only)
|
|
|
- - 2 complex tests (multi-turn with file creation)
|
|
|
- - Comprehensive coverage of context loading scenarios
|
|
|
-
|
|
|
-2. **Smart Timeout System**
|
|
|
- - Activity monitoring prevents false timeouts
|
|
|
- - Base timeout: 300s inactivity
|
|
|
- - Absolute max: 600s hard limit
|
|
|
- - Handles complex multi-turn tests gracefully
|
|
|
-
|
|
|
-3. **Fixed Context Loading Evaluator**
|
|
|
- - Corrected file path extraction (`tool.data.state.input.filePath`)
|
|
|
- - Multi-turn session support
|
|
|
- - Checks context for ALL executions, not just first
|
|
|
-
|
|
|
-4. **Batch Test Runner**
|
|
|
- - `run-tests-batch.sh` script
|
|
|
- - Configurable batch size and delays
|
|
|
- - Prevents API rate limits
|
|
|
-
|
|
|
-5. **Results Dashboard**
|
|
|
- - Interactive web UI with filtering
|
|
|
- - Pass rate trend charts
|
|
|
- - CSV export
|
|
|
- - One-command deployment
|
|
|
-
|
|
|
----
|
|
|
-
|
|
|
-## 🎯 Achievements
|
|
|
-
|
|
|
-✅ Full SDK integration with `@opencode-ai/sdk@1.0.90`
|
|
|
-✅ Real-time event streaming (12+ events per test)
|
|
|
-✅ 5 evaluators integrated and working
|
|
|
-✅ YAML-based test definitions with Zod validation
|
|
|
-✅ CLI runner with detailed reporting
|
|
|
-✅ Free model by default (no API costs)
|
|
|
-✅ Model-agnostic test design
|
|
|
-✅ Both positive and negative test support
|
|
|
-✅ Smart timeout with activity monitoring
|
|
|
-✅ Context loading validation (100% coverage)
|
|
|
-✅ Results tracking and visualization
|
|
|
-✅ Batch execution support
|
|
|
-
|
|
|
-**Status:** ✅ Production-ready for OpenAgent & Opencoder evaluation
|
|
|
+**See [GUIDE.md](GUIDE.md) for complete usage examples and test schema**
|
|
|
|
|
|
---
|
|
|
|
|
|
## 🤝 Contributing
|
|
|
|
|
|
-See [../docs/contributing/CONTRIBUTING.md](../docs/contributing/CONTRIBUTING.md)
|
|
|
-
|
|
|
----
|
|
|
-
|
|
|
-## 📄 License
|
|
|
-
|
|
|
-MIT
|
|
|
+See [GUIDE.md](GUIDE.md) for details on:
|
|
|
+- Adding new tests
|
|
|
+- Creating evaluators
|
|
|
+- Modifying core tests
|
|
|
|
|
|
---
|
|
|
|
|
|
## 🆘 Support
|
|
|
|
|
|
-- **Getting Started**: [GETTING_STARTED.md](GETTING_STARTED.md)
|
|
|
-- **How Tests Work**: [HOW_TESTS_WORK.md](HOW_TESTS_WORK.md)
|
|
|
-- **Architecture**: [ARCHITECTURE.md](ARCHITECTURE.md)
|
|
|
-- **Issues**: Check documentation or create an issue
|
|
|
+**Complete Guide**: [GUIDE.md](GUIDE.md)
|
|
|
+**Issues**: Create an issue on GitHub
|
|
|
+**Questions**: Check GUIDE.md first
|
|
|
|
|
|
---
|
|
|
|
|
|
-**Last Updated**: 2025-11-26
|
|
|
+**Last Updated**: 2024-11-28
|
|
|
**Framework Version**: 0.1.0
|
|
|
-**Test Coverage**: 26 tests (22 OpenAgent, 4 Opencoder)
|
|
|
-**Pass Rate**: 100%
|
|
|
+**Status**: ✅ Production Ready (9/10)
|
|
|
+**Rating**: EXCELLENT
|