# Agent Evaluation Framework

Comprehensive SDK-based evaluation framework for testing OpenCode agents with real execution, event streaming, and automated violation detection.
## Quick Start

```bash
cd evals/framework
npm install
npm run build

# Run all tests (uses free model by default)
npm run eval:sdk

# Run with specific model
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Run specific tests only
npm run eval:sdk -- --pattern="developer/*.yaml"

# Debug mode
npm run eval:sdk -- --debug
```
## Directory Structure

```text
evals/
├── framework/                        # Core evaluation framework
│   ├── src/
│   │   ├── sdk/                      # SDK-based test runner
│   │   │   ├── server-manager.ts
│   │   │   ├── client-manager.ts
│   │   │   ├── event-stream-handler.ts
│   │   │   ├── test-runner.ts
│   │   │   ├── test-case-schema.ts
│   │   │   ├── test-case-loader.ts
│   │   │   ├── run-sdk-tests.ts      # CLI entry point
│   │   │   ├── show-test-details.ts  # Debug tool
│   │   │   └── approval/             # Approval strategies
│   │   ├── collector/                # Session data collection
│   │   ├── evaluators/               # Rule violation detection
│   │   └── types/                    # TypeScript types
│   ├── docs/
│   │   └── test-design-guide.md      # Test design philosophy
│   ├── SDK_EVAL_README.md            # Comprehensive SDK guide
│   ├── README.md                     # Framework documentation
│   └── package.json
│
├── agents/openagent/                 # OpenAgent-specific tests
│   ├── tests/                        # YAML test cases
│   │   ├── developer/                # Developer workflow tests
│   │   ├── business/                 # Business analysis tests
│   │   ├── creative/                 # Content creation tests
│   │   └── edge-case/                # Edge case tests
│   ├── tests/simple/                 # Synthetic test data
│   ├── docs/
│   │   ├── OPENAGENT_RULES.md        # Rules from openagent.md
│   │   └── TEST_SCENARIOS.md         # Test scenario catalog
│   ├── README.md                     # OpenAgent test overview
│   └── TEST_RESULTS.md               # Test results summary
│
└── results/                          # Test outputs (gitignored)
```
## Key Points

- Uses `@opencode-ai/sdk` for real agent interaction
- Default model: `opencode/grok-code-fast` (OpenCode Zen)
- Override with `--model=provider/model`

## Documentation

| Document | Purpose | Audience |
|---|---|---|
| SDK_EVAL_README.md | Complete SDK testing guide | All users |
| docs/test-design-guide.md | Test design philosophy | Test authors |
| openagent/docs/OPENAGENT_RULES.md | Rules reference | Test authors |
| openagent/docs/TEST_SCENARIOS.md | Test scenario catalog | Test authors |
## Running Tests

```bash
# All tests with free model
npm run eval:sdk

# Specific category
npm run eval:sdk -- --pattern="developer/*.yaml"

# Custom model
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# Debug single test
npx tsx src/sdk/show-test-details.ts developer/install-dependencies.yaml
```
## Writing Tests

```yaml
# Example: developer/my-test.yaml
id: dev-my-test-001
name: My Test
description: What this test does
category: developer

prompt: "Your test prompt here"

# Behavior expectations (preferred)
behavior:
  mustUseTools: [bash]
  requiresApproval: true

# Expected violations
expectedViolations:
  - rule: approval-gate
    shouldViolate: false  # Should NOT violate
    severity: error

approvalStrategy:
  type: auto-approve
  timeout: 60000

tags:
  - approval-gate
  - v2-schema
```

See `docs/test-design-guide.md` for best practices.
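Before committing a new test case, it can help to sanity-check its shape programmatically. The framework itself validates test cases with Zod (`src/sdk/test-case-schema.ts`); the dependency-free sketch below only illustrates the kind of checks involved — the interface and function names here are hypothetical, not the framework's API:

```typescript
// Hypothetical, dependency-free validator mirroring the example test case above.
interface TestCase {
  id: string;
  name: string;
  category: string;
  prompt: string;
  behavior?: { mustUseTools?: string[]; requiresApproval?: boolean };
  expectedViolations?: { rule: string; shouldViolate: boolean }[];
}

function validateTestCase(raw: unknown): TestCase {
  const obj = raw as Record<string, unknown>;
  // Reject anything missing the required string fields.
  for (const field of ["id", "name", "category", "prompt"]) {
    if (typeof obj[field] !== "string") {
      throw new Error(`missing or invalid field: ${field}`);
    }
  }
  return obj as unknown as TestCase;
}

const tc = validateTestCase({
  id: "dev-my-test-001",
  name: "My Test",
  category: "developer",
  prompt: "Your test prompt here",
});
console.log(tc.id); // "dev-my-test-001"
```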
### Schema Overview

```yaml
behavior:              # What the agent should do
  mustUseTools: []
  requiresApproval: bool
  shouldDelegate: bool

expectedViolations:    # What rules to check
  - rule: approval-gate
    shouldViolate: false
```

See `SDK_EVAL_README.md` for the complete API.
## Example Output

```text
$ npm run eval:sdk

======================================================================
TEST RESULTS
======================================================================

1. ✅ dev-install-deps-002 - Install Dependencies (v2)
   Duration: 10659ms
   Events: 12
   Approvals: 0

2. ❌ biz-data-analysis-001 - Business Data Analysis
   Duration: 17512ms
   Events: 18
   Errors:
     - Expected tool calls but no approvals requested

======================================================================
SUMMARY: 1/2 tests passed (1 failed)
======================================================================
```
## Model Selection

```bash
# Uses opencode/grok-code-fast (free)
npm run eval:sdk

# Claude 3.5 Sonnet
npm run eval:sdk -- --model=anthropic/claude-3-5-sonnet-20241022

# GPT-4 Turbo
npm run eval:sdk -- --model=openai/gpt-4-turbo
```

Per-test models can be pinned directly in the test file:

```yaml
# In test YAML file
model: anthropic/claude-3-5-sonnet-20241022
```
## Development

```bash
cd evals/framework
npm test
npm run build
```

### Adding Evaluators

Add new evaluators in `src/evaluators/`: extend `BaseEvaluator`, implement its `evaluate()` method, and register the evaluator with the `EvaluatorRunner`.

### Debugging

```bash
# Show detailed test execution
npx tsx src/sdk/show-test-details.ts path/to/test.yaml

# Check session files
ls ~/.local/share/opencode/storage/session/
```
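A custom evaluator might look like the sketch below. The actual `BaseEvaluator` interface in `src/evaluators/` may differ; the event shape, event type strings, and return type here are assumptions for illustration:

```typescript
// Hypothetical event and result shapes — the real types live in src/types/.
interface SessionEvent { type: string; tool?: string }
interface Violation { rule: string; violated: boolean }

abstract class BaseEvaluator {
  abstract readonly rule: string;
  abstract evaluate(events: SessionEvent[]): Violation;
}

// Flags a violation when a bash tool call appears before any approval request.
class ApprovalGateEvaluator extends BaseEvaluator {
  readonly rule = "approval-gate";

  evaluate(events: SessionEvent[]): Violation {
    const firstBash = events.findIndex(
      (e) => e.type === "tool.call" && e.tool === "bash"
    );
    const firstApproval = events.findIndex(
      (e) => e.type === "permission.request"
    );
    const violated =
      firstBash !== -1 && (firstApproval === -1 || firstApproval > firstBash);
    return { rule: this.rule, violated };
  }
}

const evaluator = new ApprovalGateEvaluator();
console.log(evaluator.evaluate([{ type: "tool.call", tool: "bash" }]));
// { rule: "approval-gate", violated: true }
```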
## CI Integration

```yaml
# .github/workflows/eval.yml
name: Agent Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
      - run: cd evals/framework && npm install
      - run: npm run eval:sdk -- --no-evaluators
```
## Configuration

To change the default model, edit `src/sdk/test-runner.ts`:

```typescript
defaultModel: config.defaultModel || 'opencode/grok-code-fast'
```

Evaluators can be enabled or disabled via the `TestRunner` config:

```typescript
runEvaluators: config.runEvaluators ?? true
```
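Putting the two defaults together, a minimal sketch of how such a config might be resolved (the `RunnerConfig` shape and `resolveConfig` helper are illustrative, not the framework's actual API):

```typescript
// Hypothetical config shape inferred from the defaults above.
interface RunnerConfig {
  defaultModel?: string;
  runEvaluators?: boolean;
}

function resolveConfig(config: RunnerConfig = {}) {
  return {
    // Fall back to the free default model when none is given.
    defaultModel: config.defaultModel || "opencode/grok-code-fast",
    // `??` (not `||`) so an explicit `runEvaluators: false` is preserved.
    runEvaluators: config.runEvaluators ?? true,
  };
}

console.log(resolveConfig({}));
// { defaultModel: "opencode/grok-code-fast", runEvaluators: true }
```

Note the deliberate use of `??` for `runEvaluators`: with `||`, an explicit `false` would be overridden back to `true`.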
## Status

- ✅ Full SDK integration with `@opencode-ai/sdk@1.0.90`
- ✅ Real-time event streaming (12+ events per test)
- ✅ 4 evaluators integrated and working
- ✅ YAML-based test definitions with Zod validation
- ✅ CLI runner with detailed reporting
- ✅ Free model by default (no API costs)
- ✅ Model-agnostic test design
- ✅ Support for both positive and negative tests

**Status:** Production-ready for OpenAgent evaluation
## Contributing

See `CONTRIBUTING.md`.

## License

MIT