# Golden Test Suite

A curated set of 8 tests that validate core agent behaviors. These tests are:

- **Safe** - All operations are read-only or write to `evals/test_tmp/` (gitignored)
- **Agent-Agnostic** - Work with any agent, not tied to specific implementations
- **Fast** - Complete in ~5-10 minutes total
- **Reliable** - Designed to pass consistently

## Tests

| # | Test | Behavior Validated | Evaluator(s) |
|---|------|-------------------|--------------|
| 01 | Smoke Test | Basic read operation | behavior |
| 02 | Context Loading | Agent reads context before answering | context-loading |
| 03 | Read Before Write | Agent inspects before modifying | execution-balance |
| 04 | Write With Approval | Agent asks before writing | approval-gate |
| 05 | Multi-Turn Context | Agent remembers conversation | behavior |
| 06 | Task Breakdown | Agent reads standards before implementing | context-loading |
| 07 | Tool Selection | Agent uses dedicated tools (not bash) | tool-usage |
| 08 | Error Handling | Agent handles errors gracefully | behavior |

## Running the Golden Suite

```bash
# Run all golden tests for openagent
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"

# Run all golden tests for any agent (replace <agent-name> with the agent to test)
npm run eval:sdk -- --agent=<agent-name> --pattern="**/golden/*.yaml"

# Run with debug output
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --debug

# Run a specific golden test
npm run eval:sdk -- --agent=openagent --pattern="**/golden/01-smoke-test.yaml"
```

## Expected Results

All 8 tests should pass for a well-behaved agent.
If tests fail:

- **01 fails**: Basic infrastructure issue - check agent/server setup
- **02 fails**: Agent doesn't load context - check context-loading behavior
- **03 fails**: Agent writes without reading - check execution-balance
- **04 fails**: Agent doesn't ask for approval - check approval workflow
- **05 fails**: Agent loses context - check conversation handling
- **06 fails**: Agent doesn't read standards - check context-loading
- **07 fails**: Agent uses bash antipatterns - check tool selection
- **08 fails**: Agent crashes on errors - check error handling

## Adding New Golden Tests

Golden tests should be:

1. **Essential** - Test a core behavior that all agents need
2. **Safe** - No dangerous operations, writes only to `evals/test_tmp/`
3. **Reliable** - Pass consistently across runs
4. **Fast** - Complete in under 2 minutes each
5. **Clear** - Easy to understand what's being tested
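
One way to check the **Reliable** and **Fast** criteria before adding a test is to run it several times and time each run. The sketch below is illustrative only: `check_repeatable` is a hypothetical helper, not part of this repository. It assumes the command it wraps exits non-zero when a test fails, which this document does not confirm for `npm run eval:sdk`.

```bash
# check_repeatable RUNS MAX_SECONDS COMMAND...
# Hypothetical helper: runs COMMAND up to RUNS times and stops at the first
# run that fails or takes longer than MAX_SECONDS.
# Assumption: COMMAND exits non-zero when the test fails.
check_repeatable() {
  runs="$1"
  max_seconds="$2"
  shift 2
  i=1
  while [ "$i" -le "$runs" ]; do
    start=$(date +%s)
    if ! "$@"; then
      echo "run $i: FAILED"
      return 1
    fi
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$max_seconds" ]; then
      echo "run $i: too slow (${elapsed}s > ${max_seconds}s)"
      return 1
    fi
    echo "run $i: ok (${elapsed}s)"
    i=$((i + 1))
  done
  echo "all $runs runs passed"
}

# Example: 3 runs of the smoke test, each limited to 120 seconds.
# check_repeatable 3 120 npm run eval:sdk -- --agent=openagent --pattern="**/golden/01-smoke-test.yaml"
```

The 120-second limit mirrors the "under 2 minutes each" guideline above. The number of runs is a judgment call, and more runs give more confidence that a test is not flaky.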