Prerequisites: Load core-concepts/evals.md first
Purpose: Step-by-step workflow for testing agents
# Run smoke test
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --pattern="smoke-test.yaml"
# Run all tests for agent
npm run eval:sdk -- --agent={category}/{agent}
# Run with debug
npm run eval:sdk -- --agent={category}/{agent} --debug
Purpose: Basic functionality check
name: Smoke Test
description: Verify agent responds correctly
agent: {category}/{agent}
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Hello, can you help me?"
expectations:
  - type: no_violations
Run:
npm run eval:sdk -- --agent={category}/{agent} --pattern="smoke-test.yaml"
Purpose: Verify agent requests approval
name: Approval Gate Test
description: Verify agent requests approval before execution
agent: {category}/{agent}
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Create a new file called test.js"
expectations:
  - type: specific_evaluator
    evaluator: approval_gate
    should_pass: true
Purpose: Verify agent loads required context
name: Context Loading Test
description: Verify agent loads required context
agent: {category}/{agent}
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Write a new function"
expectations:
  - type: context_loaded
    contexts: ["core/standards/code-quality.md"]
Purpose: Verify agent uses correct tools
name: Tool Usage Test
description: Verify agent uses appropriate tools
agent: {category}/{agent}
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Read the package.json file"
expectations:
  - type: tool_usage
    tools: ["read"]
    min_count: 1
# Run a single test
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --pattern="{test-name}.yaml"
# Run all tests for one agent
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent}
# Run all tests
cd evals/framework
npm run eval:sdk
# Run a single test with debug output
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --pattern="{test}" --debug
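The single-test command above can be repeated for a short list of test files. What follows is a minimal sketch, not part of the framework: it assumes the runner exits non-zero when a test fails, which this guide does not confirm. `run_patterns` and the `EVAL_CMD` override are hypothetical names introduced here so the loop can be dry-run without invoking npm.

```shell
# Sketch: run several test files for one agent and stop at the first failure.
# Assumption: the eval runner exits non-zero when a test fails.
# EVAL_CMD is a hypothetical override; by default the documented command is used.
run_patterns() {
  local agent="$1"
  shift
  local pattern
  for pattern in "$@"; do
    echo "Running ${pattern} for ${agent}"
    # Unquoted on purpose so the default command splits into words.
    ${EVAL_CMD:-npm run eval:sdk --} --agent="${agent}" --pattern="${pattern}" || {
      echo "Stopped: ${pattern} failed" >&2
      return 1
    }
  done
}

# Example call (run from evals/framework):
#   run_patterns "{category}/{agent}" smoke-test.yaml approval-gate.yaml
```

Stopping at the first failure keeps the feedback loop short; drop the `return 1` if you would rather see every result in one pass.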
✓ Test: smoke-test.yaml
  Status: PASS
  Duration: 5.2s
  Evaluators:
    ✓ Approval Gate: PASS
    ✓ Context Loading: PASS
    ✓ Tool Usage: PASS
    ✓ Stop on Failure: PASS
    ✓ Execution Balance: PASS
✗ Test: approval-gate.yaml
  Status: FAIL
  Duration: 4.8s
  Evaluators:
    ✗ Approval Gate: FAIL
      Violation: Agent executed write tool without requesting approval
      Location: Message #3, Tool call #1
    ✓ Context Loading: PASS
    ✓ Tool Usage: PASS
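When a run covers many tests, it can help to filter the output down to the failures. The sketch below assumes the runner prints the ✓/✗ markers and the `Violation:` / `Location:` lines in the format of the sample output above, and that the output was saved to a file first (for example with `tee`). `failure_summary` is a hypothetical helper name.

```shell
# Sketch: show only failing tests/evaluators and their violation details.
# Assumption: the saved output uses the markers shown in the sample results.
failure_summary() {
  # $1 is a file containing saved runner output.
  grep -E '✗|Violation:|Location:' "$1"
}

# Example call:
#   npm run eval:sdk -- --agent={category}/{agent} | tee results.txt
#   failure_summary results.txt
```

`grep` exits with status 1 when nothing matches, so an empty summary also signals a clean run.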
npm run eval:sdk -- --agent={category}/{agent} --pattern="{test}" --debug
# Find recent session
ls -lt .tmp/sessions/ | head -5
# View session
jq . .tmp/sessions/{session-id}/session.json
# View event timeline
jq . .tmp/sessions/{session-id}/events.json
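The "find recent session" and "view session" steps can be combined so the session ID does not have to be copied by hand. This is a sketch under one assumption: each session is a directory directly under `.tmp/sessions/`, as the commands above suggest. `latest_session` is a hypothetical helper name.

```shell
# Sketch: print the path of the most recently modified session directory.
# Assumption: sessions are directories directly under the given root.
latest_session() {
  local root="${1:-.tmp/sessions}"
  local newest
  newest="$(ls -t "$root" | head -n 1)"
  if [ -n "$newest" ]; then
    printf '%s/%s\n' "$root" "$newest"
  else
    return 1
  fi
}

# Example call:
#   jq . "$(latest_session)/session.json"
```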
Common issues:
Update the agent prompt to address the issue, then re-run the failing test.
name: Test Name
description: What this test validates
agent: {category}/{agent}
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "User message"
  - role: assistant
    content: "Expected response (optional)"
expectations:
  - type: no_violations
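To avoid retyping the template, a small shell function can print a starting skeleton. This is only a sketch: `new_test` and its argument order are hypothetical, the model line is copied from the template, and the optional assistant turn is left out. Where the file should be saved is not stated in this guide, so the function writes to standard output and leaves the destination to you.

```shell
# Sketch: print a test skeleton based on the template in this guide.
# Arguments (hypothetical order): name, description, agent, first user message.
# Note: a user message containing double quotes would need YAML escaping.
new_test() {
  cat <<EOF
name: $1
description: $2
agent: $3
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "$4"
expectations:
  - type: no_violations
EOF
}

# Example call:
#   new_test "Smoke Test" "Verify agent responds correctly" "{category}/{agent}" "Hello" > smoke-test.yaml
```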
✅ Clear name - Descriptive test name
✅ Good description - Explain what's being tested
✅ Realistic scenario - Test real-world usage
✅ Specific expectations - Clear pass/fail criteria
✅ Fast execution - Keep under 10 seconds
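One rough way to check the 10-second guideline locally is to time the command yourself. The sketch below measures wall-clock time for the whole command, which includes npm startup and so will read higher than the `Duration` the runner reports; treat it as an upper bound. `within_budget` is a hypothetical helper name.

```shell
# Sketch: run a command and report whether it succeeded within a time budget.
# Usage: within_budget <seconds> <command> [args...]
within_budget() {
  local budget="$1"
  shift
  local start status elapsed
  start="$(date +%s)"
  "$@"
  status=$?
  elapsed=$(( $(date +%s) - start ))
  echo "elapsed: ${elapsed}s (budget: ${budget}s)"
  # Succeed only if the command passed and stayed within the budget.
  [ "$status" -eq 0 ] && [ "$elapsed" -le "$budget" ]
}

# Example call:
#   within_budget 10 npm run eval:sdk -- --agent={category}/{agent} --pattern="smoke-test.yaml"
```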
# Approval gate check
conversation:
  - role: user
    content: "Create a new file"
expectations:
  - type: specific_evaluator
    evaluator: approval_gate
    should_pass: true
# Context loading check
conversation:
  - role: user
    content: "Write new code"
expectations:
  - type: context_loaded
    contexts: ["core/standards/code-quality.md"]
# Tool usage check
conversation:
  - role: user
    content: "Read the README file"
expectations:
  - type: tool_usage
    tools: ["read"]
    min_count: 1
# Setup pre-commit hook
./scripts/validation/setup-pre-commit-hook.sh
Tests run automatically on:
core-concepts/evals.md
guides/debugging.md
guides/adding-agent.md
Last Updated: 2025-12-10
Version: 0.5.0