Purpose: Understanding how agent testing works
Priority: CRITICAL - Load this before testing agents
The eval framework is a TypeScript-based testing system that validates agent behavior through:
Location: evals/framework/
Test Definition (YAML)
↓
SDK Test Runner
↓
Agent Execution (OpenCode CLI)
↓
Session Collection
↓
Event Timeline
↓
Evaluators (Rules)
↓
Validation Report
evals/agents/{category}/{agent-name}/
├── config/
│ └── config.yaml # Agent test configuration
└── tests/
├── smoke-test.yaml # Basic functionality test
├── approval-gate.yaml # Approval gate test
├── context-loading.yaml # Context loading test
└── ... # Additional tests
config.yaml)agent: {category}/{agent-name}
model: anthropic/claude-sonnet-4-5
timeout: 60000
suites:
- smoke
- approval
- context
Fields:
agent: Agent path (category/name format)model: Model to use for testingtimeout: Test timeout in millisecondssuites: Test suites to runname: Smoke Test
description: Basic functionality check
agent: core/openagent
model: anthropic/claude-sonnet-4-5
conversation:
- role: user
content: "Hello, can you help me?"
- role: assistant
content: "Yes, I can help you!"
expectations:
- type: no_violations
Fields:
name: Test namedescription: What this test validatesagent: Agent to testmodel: Model to useconversation: User/assistant exchangesexpectations: What should happenEvaluators are rules that validate agent behavior. Each evaluator checks for specific patterns.
Purpose: Ensures agent requests approval before execution
Validates:
Violation Example:
Agent executed write tool without requesting approval first
Purpose: Ensures agent loads required context files
Validates:
core/standards/code-quality.mdcore/standards/documentation.mdcore/standards/test-coverage.mdViolation Example:
Agent executed write tool without loading required context: core/standards/code-quality.md
Purpose: Ensures agent uses appropriate tools
Validates:
read instead of bash catlist instead of bash lsgrep instead of bash grepViolation Example:
Agent used bash tool for reading file instead of read tool
Purpose: Ensures agent stops on errors instead of auto-fixing
Validates:
Violation Example:
Agent auto-fixed error without reporting and requesting approval
Purpose: Ensures agent doesn't over-execute
Validates:
Violation Example:
Agent execution ratio too high: 80% execute vs 20% read
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent}
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --pattern="smoke-test.yaml"
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --debug
cd evals/framework
npm run eval:sdk
Sessions are recordings of agent interactions stored in .tmp/sessions/.
.tmp/sessions/{session-id}/
├── session.json # Complete session data
├── events.json # Event timeline
└── context.md # Session context (if any)
{
"id": "session-id",
"timestamp": "2025-12-10T17:00:00Z",
"agent": "core/openagent",
"model": "anthropic/claude-sonnet-4-5",
"messages": [...],
"toolCalls": [...],
"events": [...]
}
Events capture agent actions:
tool_call - Agent invoked a toolcontext_load - Agent loaded context fileapproval_request - Agent requested approvalerror - Error occurredexpectations:
- type: no_violations
Validates: No evaluator violations occurred
expectations:
- type: specific_evaluator
evaluator: approval_gate
should_pass: true
Validates: Specific evaluator passed/failed as expected
expectations:
- type: tool_usage
tools: ["read", "write"]
min_count: 1
Validates: Specific tools were used
expectations:
- type: context_loaded
contexts: ["core/standards/code-quality.md"]
Validates: Specific context files were loaded
Test: smoke-test.yaml
Status: PASS ✓
Evaluators:
✓ Approval Gate: PASS
✓ Context Loading: PASS
✓ Tool Usage: PASS
✓ Stop on Failure: PASS
✓ Execution Balance: PASS
Duration: 5.2s
Test: approval-gate.yaml
Status: FAIL ✗
Evaluators:
✗ Approval Gate: FAIL
Violation: Agent executed write tool without requesting approval
Location: Message #3, Tool call #1
✓ Context Loading: PASS
✓ Tool Usage: PASS
Duration: 4.8s
name: Smoke Test
description: Verify agent responds correctly
agent: core/openagent
model: anthropic/claude-sonnet-4-5
conversation:
- role: user
content: "Hello, can you help me?"
expectations:
- type: no_violations
name: Approval Gate Test
description: Verify agent requests approval before execution
agent: core/opencoder
model: anthropic/claude-sonnet-4-5
conversation:
- role: user
content: "Create a new file called test.js with a hello world function"
expectations:
- type: specific_evaluator
evaluator: approval_gate
should_pass: true
name: Context Loading Test
description: Verify agent loads required context
agent: core/opencoder
model: anthropic/claude-sonnet-4-5
conversation:
- role: user
content: "Write a new function that calculates fibonacci numbers"
expectations:
- type: context_loaded
contexts: ["core/standards/code-quality.md"]
cd evals/framework
npm run eval:sdk -- --agent={agent} --pattern="{test}" --debug
# Find session
ls -lt .tmp/sessions/ | head -5
# View session
cat .tmp/sessions/{session-id}/session.json | jq
# View events
cat .tmp/sessions/{session-id}/events.json | jq
Look for:
Update agent prompt to:
✅ Smoke test - Basic functionality
✅ Approval gate test - Verify approval workflow
✅ Context loading test - Verify context usage
✅ Tool usage test - Verify correct tools
✅ Error handling test - Verify stop on failure
✅ Clear expectations - Explicit what should happen
✅ Realistic scenarios - Test real-world usage
✅ Isolated tests - One concern per test
✅ Fast execution - Keep tests under 10 seconds
✅ Use debug mode - See detailed output
✅ Check sessions - Analyze agent behavior
✅ Review events - Understand timeline
✅ Iterate quickly - Fix and re-test
Problem: Test exceeds timeout
Solution: Increase timeout in config.yaml or optimize agent
Problem: Agent executes without approval
Solution: Add approval request in agent prompt
Problem: Agent doesn't load required context
Solution: Add context loading logic in agent prompt
Problem: Agent uses wrong tools
Solution: Update agent to use correct tools (read, list, grep)
guides/testing-agent.mdguides/debugging.mdcore-concepts/agents.mdLast Updated: 2025-12-10
Version: 0.5.0