A curated set of 8 tests that validate core agent behaviors. These tests are:
evals/test_tmp/ (gitignored)

| # | Test | Behavior Validated | Evaluator(s) |
|---|---|---|---|
| 01 | Smoke Test | Basic read operation | behavior |
| 02 | Context Loading | Agent reads context before answering | context-loading |
| 03 | Read Before Write | Agent inspects before modifying | execution-balance |
| 04 | Write With Approval | Agent asks before writing | approval-gate |
| 05 | Multi-Turn Context | Agent remembers conversation | behavior |
| 06 | Task Breakdown | Agent reads standards before implementing | context-loading |
| 07 | Tool Selection | Agent uses dedicated tools (not bash) | tool-usage |
| 08 | Error Handling | Agent handles errors gracefully | behavior |
```bash
# Run all golden tests for openagent
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"

# Run all golden tests for any agent
npm run eval:sdk -- --agent=<agent-name> --pattern="**/golden/*.yaml"

# Run with debug output
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml" --debug

# Run a specific golden test
npm run eval:sdk -- --agent=openagent --pattern="**/golden/01-smoke-test.yaml"
```
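To run the golden suite against several agents in one go, the commands above can be wrapped in a small helper. This is only a sketch: the function name `run_golden` and the `DRY_RUN` variable are hypothetical and not part of this repository, and `openagent` is the only agent name this README confirms. It relies solely on the `--agent` and `--pattern` flags shown above.

```bash
# Hypothetical helper (not part of the repo): run the golden tests for each
# agent name passed as an argument. With DRY_RUN=1 it only prints the
# commands it would run.
run_golden() {
  status=0
  for agent in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf '%s\n' "npm run eval:sdk -- --agent=${agent} --pattern=\"**/golden/*.yaml\""
    else
      # Record a failure but keep going so every agent gets exercised.
      npm run eval:sdk -- "--agent=${agent}" "--pattern=**/golden/*.yaml" || status=1
    fi
  done
  return "$status"
}
```

For example, `run_golden openagent` runs the suite once, and setting `DRY_RUN=1` first lets you check the generated command before executing anything. The function returns a non-zero status if any agent's run failed.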
All 8 tests should pass for a well-behaved agent. If tests fail:
Golden tests should be: