README.md 2.4 KB

Golden Test Suite

A curated set of 8 tests that validate core agent behaviors. These tests are:

  • Safe - All operations are read-only or write to evals/test_tmp/ (gitignored)
  • Agent-Agnostic - Work with any agent, not tied to specific implementations
  • Fast - Complete in ~5-10 minutes total
  • Reliable - Designed to pass consistently

Tests

# Test Behavior Validated Evaluator(s)
01 Smoke Test Basic read operation behavior
02 Context Loading Agent reads context before answering context-loading
03 Read Before Write Agent inspects before modifying execution-balance
04 Write With Approval Agent asks before writing approval-gate
05 Multi-Turn Context Agent remembers conversation behavior
06 Task Breakdown Agent reads standards before implementing context-loading
07 Tool Selection Agent uses dedicated tools (not bash) tool-usage
08 Error Handling Agent handles errors gracefully behavior

Running the Golden Suite

# Run all golden tests for openagent
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"

# Run all golden tests for any agent
npm run eval:sdk -- --agent=<agent-name> --pattern="**/golden/*.yaml"

# Run with debug output
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --debug

# Run a specific golden test
npm run eval:sdk -- --agent=openagent --pattern="**/golden/01-smoke-test.yaml"

Expected Results

All 8 tests should pass for a well-behaved agent. If tests fail:

  • 01 fails: Basic infrastructure issue - check agent/server setup
  • 02 fails: Agent doesn't load context - check context-loading behavior
  • 03 fails: Agent writes without reading - check execution-balance
  • 04 fails: Agent doesn't ask approval - check approval workflow
  • 05 fails: Agent loses context - check conversation handling
  • 06 fails: Agent doesn't read standards - check context-loading
  • 07 fails: Agent uses bash antipatterns - check tool selection
  • 08 fails: Agent crashes on errors - check error handling

Adding New Golden Tests

Golden tests should be:

  1. Essential - Test a core behavior that all agents need
  2. Safe - No dangerous operations, writes only to test_tmp/
  3. Reliable - Pass consistently across runs
  4. Fast - Complete in under 2 minutes each
  5. Clear - Easy to understand what's being tested