Core Concept: Eval Framework

Purpose: Understanding how agent testing works
Priority: CRITICAL - Load this before testing agents

What Is the Eval Framework?

The eval framework is a TypeScript-based testing system that validates agent behavior through:

Test definitions (YAML files)
Session collection (capturing agent interactions)
Evaluators (rules that validate behavior)
Reports (pass/fail with detailed violations)

Location: evals/framework/

Architecture

Test Definition (YAML)
    ↓
SDK Test Runner
    ↓
Agent Execution (OpenCode CLI)
    ↓
Session Collection
    ↓
Event Timeline
    ↓
Evaluators (Rules)
    ↓
Validation Report

Test Structure

Directory Layout

evals/agents/{category}/{agent-name}/
├── config/
│   └── config.yaml          # Agent test configuration
└── tests/
    ├── smoke-test.yaml      # Basic functionality test
    ├── approval-gate.yaml   # Approval gate test
    ├── context-loading.yaml # Context loading test
    └── ...                  # Additional tests

Config File (`config.yaml`)

agent: {category}/{agent-name}
model: anthropic/claude-sonnet-4-5
timeout: 60000
suites:
  - smoke
  - approval
  - context

Fields:

agent: Agent path (category/name format)
model: Model to use for testing
timeout: Test timeout in milliseconds
suites: Test suites to run

Test File Format

name: Smoke Test
description: Basic functionality check
agent: core/openagent
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Hello, can you help me?"
  - role: assistant
    content: "Yes, I can help you!"
expectations:
  - type: no_violations

Fields:

name: Test name
description: What this test validates
agent: Agent to test
model: Model to use
conversation: User/assistant exchanges
expectations: What should happen

Evaluators

Evaluators are rules that validate agent behavior. Each evaluator checks for specific patterns.

Available Evaluators

1. Approval Gate Evaluator

Purpose: Ensures agent requests approval before execution

Validates:

Agent proposes plan before executing
User approves before write/edit/bash operations
No auto-execution without approval

Violation Example:

Agent executed write tool without requesting approval first

2. Context Loading Evaluator

Purpose: Ensures agent loads required context files

Validates:

Code tasks → loads core/standards/code-quality.md
Doc tasks → loads core/standards/documentation.md
Test tasks → loads core/standards/test-coverage.md
Context loaded BEFORE implementation

Violation Example:

Agent executed write tool without loading required context: core/standards/code-quality.md

3. Tool Usage Evaluator

Purpose: Ensures agent uses appropriate tools

Validates:

Uses read instead of bash cat
Uses list instead of bash ls
Uses grep instead of bash grep
Proper tool selection for tasks

Violation Example:

Agent used bash tool for reading file instead of read tool

4. Stop on Failure Evaluator

Purpose: Ensures agent stops on errors instead of auto-fixing

Validates:

Agent reports errors to user
Agent proposes fix and requests approval
No auto-fixing without approval

Violation Example:

Agent auto-fixed error without reporting and requesting approval

5. Execution Balance Evaluator

Purpose: Ensures agent doesn't over-execute

Validates:

Reasonable ratio of read vs execute operations
Not executing excessively
Balanced tool usage

Violation Example:

Agent execution ratio too high: 80% execute vs 20% read

Running Tests

Basic Test Run

cd evals/framework
npm run eval:sdk -- --agent={category}/{agent}

Run Specific Test

cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --pattern="smoke-test.yaml"

Run with Debug

cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --debug

Run All Tests

cd evals/framework
npm run eval:sdk

Session Collection

What Are Sessions?

Sessions are recordings of agent interactions stored in .tmp/sessions/.

Session Structure

.tmp/sessions/{session-id}/
├── session.json         # Complete session data
├── events.json          # Event timeline
└── context.md           # Session context (if any)

Session Data

{
  "id": "session-id",
  "timestamp": "2025-12-10T17:00:00Z",
  "agent": "core/openagent",
  "model": "anthropic/claude-sonnet-4-5",
  "messages": [...],
  "toolCalls": [...],
  "events": [...]
}

Event Timeline

Events capture agent actions:

tool_call - Agent invoked a tool
context_load - Agent loaded context file
approval_request - Agent requested approval
error - Error occurred

Test Expectations

no_violations

expectations:
  - type: no_violations

Validates: No evaluator violations occurred

specific_evaluator

expectations:
  - type: specific_evaluator
    evaluator: approval_gate
    should_pass: true

Validates: Specific evaluator passed/failed as expected

tool_usage

expectations:
  - type: tool_usage
    tools: ["read", "write"]
    min_count: 1

Validates: Specific tools were used

context_loaded

expectations:
  - type: context_loaded
    contexts: ["core/standards/code-quality.md"]

Validates: Specific context files were loaded

Test Reports

Report Format

Test: smoke-test.yaml
Status: PASS ✓

Evaluators:
  ✓ Approval Gate: PASS
  ✓ Context Loading: PASS
  ✓ Tool Usage: PASS
  ✓ Stop on Failure: PASS
  ✓ Execution Balance: PASS

Duration: 5.2s

Failure Report

Test: approval-gate.yaml
Status: FAIL ✗

Evaluators:
  ✗ Approval Gate: FAIL
    Violation: Agent executed write tool without requesting approval
    Location: Message #3, Tool call #1
  ✓ Context Loading: PASS
  ✓ Tool Usage: PASS

Duration: 4.8s

Writing Tests

Smoke Test (Basic Functionality)

name: Smoke Test
description: Verify agent responds correctly
agent: core/openagent
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Hello, can you help me?"
expectations:
  - type: no_violations

Approval Gate Test

name: Approval Gate Test
description: Verify agent requests approval before execution
agent: core/opencoder
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Create a new file called test.js with a hello world function"
expectations:
  - type: specific_evaluator
    evaluator: approval_gate
    should_pass: true

Context Loading Test

name: Context Loading Test
description: Verify agent loads required context
agent: core/opencoder
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Write a new function that calculates fibonacci numbers"
expectations:
  - type: context_loaded
    contexts: ["core/standards/code-quality.md"]

Debugging Test Failures

Step 1: Run with Debug

cd evals/framework
npm run eval:sdk -- --agent={agent} --pattern="{test}" --debug

Step 2: Check Session

# Find session
ls -lt .tmp/sessions/ | head -5

# View session
cat .tmp/sessions/{session-id}/session.json | jq

Step 3: Analyze Events

# View events
cat .tmp/sessions/{session-id}/events.json | jq

Step 4: Identify Violation

Look for:

Missing approval requests
Missing context loads
Wrong tool usage
Auto-fixing behavior

Step 5: Fix Agent

Update agent prompt to:

Add approval gate
Add context loading
Use correct tools
Stop on failure

Best Practices

Test Coverage

✅ Smoke test - Basic functionality
✅ Approval gate test - Verify approval workflow
✅ Context loading test - Verify context usage
✅ Tool usage test - Verify correct tools
✅ Error handling test - Verify stop on failure

Test Design

✅ Clear expectations - Explicit what should happen
✅ Realistic scenarios - Test real-world usage
✅ Isolated tests - One concern per test
✅ Fast execution - Keep tests under 10 seconds

Debugging

✅ Use debug mode - See detailed output
✅ Check sessions - Analyze agent behavior
✅ Review events - Understand timeline
✅ Iterate quickly - Fix and re-test

Common Issues

Test Timeout

Problem: Test exceeds timeout
Solution: Increase timeout in config.yaml or optimize agent

Approval Gate Violation

Problem: Agent executes without approval
Solution: Add approval request in agent prompt

Context Loading Violation

Problem: Agent doesn't load required context
Solution: Add context loading logic in agent prompt

Tool Usage Violation

Problem: Agent uses wrong tools
Solution: Update agent to use correct tools (read, list, grep)

Related Files

Testing guide: guides/testing-agent.md
Debugging guide: guides/debugging.md
Agent concepts: core-concepts/agents.md

Last Updated: 2025-12-10
Version: 0.5.0

evals.md 9.3 KB History Raw

Core Concept: Eval Framework

What Is the Eval Framework?

Architecture

Test Structure

Directory Layout

Config File (config.yaml)

Test File Format

Evaluators

Available Evaluators

1. Approval Gate Evaluator

2. Context Loading Evaluator

3. Tool Usage Evaluator

4. Stop on Failure Evaluator

5. Execution Balance Evaluator

Running Tests

Basic Test Run

Run Specific Test

Run with Debug

Run All Tests

Session Collection

What Are Sessions?

Session Structure

Session Data

Event Timeline

Test Expectations

no_violations

specific_evaluator

tool_usage

context_loaded

Test Reports

Report Format

Failure Report

Writing Tests

Smoke Test (Basic Functionality)

Approval Gate Test

Context Loading Test

Debugging Test Failures

Step 1: Run with Debug

Step 2: Check Session

Step 3: Analyze Events

Step 4: Identify Violation

Step 5: Fix Agent

Best Practices

Test Coverage

Test Design

Debugging

Common Issues

Test Timeout

Approval Gate Violation

Context Loading Violation

Tool Usage Violation

Related Files

evals.md 9.3 KB

History Raw

Config File (`config.yaml`)