Golden Test Suite

A curated set of 8 tests that validate core agent behaviors. These tests are:

Safe - All operations are read-only or write to evals/test_tmp/ (gitignored)
Agent-Agnostic - Work with any agent, not tied to specific implementations
Fast - Complete in ~5-10 minutes total
Reliable - Designed to pass consistently
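
The "Safe" property above can be checked mechanically before a test is allowed to write anything. The function below is a minimal sketch, not part of the eval framework: `is_safe_write_path` is a hypothetical helper name, and it assumes paths are given relative to the repository root.

```shell
# Hypothetical helper (not part of the eval framework).
# Returns 0 when a repo-relative path stays inside evals/test_tmp/, 1 otherwise.
is_safe_write_path() {
  case "$1" in
    *..*) return 1 ;;               # conservative: reject anything containing ".."
    evals/test_tmp/?*) return 0 ;;  # must name something inside the directory
    *) return 1 ;;
  esac
}
```

For example, `is_safe_write_path "evals/test_tmp/out.txt"` succeeds, while `is_safe_write_path "src/main.ts"` fails.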

Tests

| #  | Test                | Behavior Validated                        | Evaluator(s)      |
|----|---------------------|-------------------------------------------|-------------------|
| 01 | Smoke Test          | Basic read operation                      | behavior          |
| 02 | Context Loading     | Agent reads context before answering      | context-loading   |
| 03 | Read Before Write   | Agent inspects before modifying           | execution-balance |
| 04 | Write With Approval | Agent asks before writing                 | approval-gate     |
| 05 | Multi-Turn Context  | Agent remembers conversation              | behavior          |
| 06 | Task Breakdown      | Agent reads standards before implementing | context-loading   |
| 07 | Tool Selection      | Agent uses dedicated tools (not bash)     | tool-usage        |
| 08 | Error Handling      | Agent handles errors gracefully           | behavior          |

Running the Golden Suite

# Run all golden tests for openagent
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"

# Run all golden tests for any agent
npm run eval:sdk -- --agent=<agent-name> --pattern="**/golden/*.yaml"

# Run with debug output
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --debug

# Run a specific golden test
npm run eval:sdk -- --agent=openagent --pattern="**/golden/00-smoke-test.yaml"
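
When one test is flaky, it can help to run every golden file on its own and see a pass/fail line per file. The loop below is a rough sketch under two assumptions: that the eval command exits non-zero when a test fails, and that you pass in the path of the golden directory yourself. `run_golden_individually` and the `EVAL_CMD` override are hypothetical names introduced for illustration.

```shell
# Hypothetical wrapper: run each golden test file separately.
# Usage: run_golden_individually <agent-name> <path-to-golden-dir>
# Assumes the eval command exits non-zero on failure.
# EVAL_CMD is an illustrative override hook, not a framework setting.
run_golden_individually() {
  local agent="$1" dir="$2"
  local failed=0 file name
  for file in "$dir"/*.yaml; do
    [ -e "$file" ] || continue   # skip when the glob matches nothing
    name="$(basename "$file")"
    if ${EVAL_CMD:-npm run eval:sdk --} --agent="$agent" --pattern="**/golden/$name"; then
      echo "PASS $name"
    else
      echo "FAIL $name"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]            # succeed only if every file passed
}
```

Running the files one at a time is slower than a single glob run, but each failure is tied to exactly one file.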

Expected Results

All 8 tests should pass for a well-behaved agent. If tests fail:

01 fails: Basic infrastructure issue - check agent/server setup
02 fails: Agent doesn't load context - check context-loading behavior
03 fails: Agent writes without reading - check execution-balance
04 fails: Agent doesn't ask approval - check approval workflow
05 fails: Agent loses context - check conversation handling
06 fails: Agent doesn't read standards - check context-loading
07 fails: Agent uses bash antipatterns - check tool selection
08 fails: Agent crashes on errors - check error handling
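
The triage list above can be turned into a small lookup for use in scripts or CI output. This is only a sketch: `golden_triage_hint` is a hypothetical function name, and the hints are copied from the list above.

```shell
# Hypothetical helper: map a failing golden test number to the area to check.
golden_triage_hint() {
  case "$1" in
    01) echo "agent/server setup" ;;
    02) echo "context-loading behavior" ;;
    03) echo "execution-balance" ;;
    04) echo "approval workflow" ;;
    05) echo "conversation handling" ;;
    06) echo "context-loading" ;;
    07) echo "tool selection" ;;
    08) echo "error handling" ;;
    *)  echo "unknown test number: $1" >&2; return 1 ;;
  esac
}
```

For example, `golden_triage_hint 04` prints `approval workflow`.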

Adding New Golden Tests

Golden tests should be:

Essential - Test a core behavior that all agents need
Safe - No dangerous operations, writes only to test_tmp/
Reliable - Pass consistently across runs
Fast - Complete in under 2 minutes each
Clear - Easy to understand what's being tested
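
To check a candidate test against the "Fast" criterion, one option is to time the run and compare it to a budget. The wrapper below is a minimal sketch, with `within_budget` as a hypothetical name; it measures whole seconds with `date +%s` and does not interrupt a command that overruns.

```shell
# Hypothetical helper: run a command and fail if it exceeds a time budget.
# Usage: within_budget <seconds> <command> [args...]
within_budget() {
  local budget="$1"
  shift
  local start end elapsed
  start="$(date +%s)"
  "$@" || return $?              # propagate the command's own failure
  end="$(date +%s)"
  elapsed=$((end - start))
  echo "elapsed: ${elapsed}s (budget: ${budget}s)"
  [ "$elapsed" -lt "$budget" ]   # "under N seconds" means strictly less
}
```

A new test could then be checked with `within_budget 120` followed by the command that runs it.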
