Total Tests: 22 (migrated) + new tests to be added
Estimated Full Suite Runtime: 40-80 minutes
Last Updated: Nov 26, 2024
# Run core tests (RECOMMENDED - 7 tests, ~5-8 min)
npm run test:core
# Run all tests (full suite - 71 tests, ~40-80 min)
npm run test:openagent
# Run critical tests only
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/**/*.yaml"
# Run specific category
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/approval-gate/*.yaml"
# Debug mode (keeps sessions, verbose output)
npm run eval:sdk -- --agent=openagent --debug
NEW: We now have a core test suite with 7 carefully selected tests that provide ~85% coverage in just 5-8 minutes!
# NPM (from root)
npm run test:core
# Script
./scripts/test.sh openagent --core
# Direct
cd evals/framework && npm run eval:sdk:core -- --agent=openagent
| # | Test | Category | Time | Priority |
|---|---|---|---|---|
| 1 | Approval Gate | Critical Rules | 30-60s | โก CRITICAL |
| 2 | Context Loading (Simple) | Critical Rules | 60-90s | โก CRITICAL |
| 3 | Context Loading (Multi-Turn) | Critical Rules | 120-180s | ๐ฅ HIGH |
| 4 | Stop on Failure | Critical Rules | 60-90s | โก CRITICAL |
| 5 | Simple Task (No Delegation) | Delegation | 30-60s | ๐ฅ HIGH |
| 6 | Subagent Delegation | Integration | 90-120s | ๐ฅ HIGH |
| 7 | Tool Usage | Tool Usage | 30-60s | ๐ MEDIUM |
Total Runtime: 5-8 minutes
Coverage: ~85% of critical functionality
Use Core Suite (7 tests, 5-8 min):
Use Full Suite (71 tests, 40-80 min):
See: ../CORE_TESTS.md for detailed documentation
tests/
โโโ 01-critical-rules/ # MUST PASS - Core safety requirements
โ โโโ approval-gate/ # 3 tests - Approval before execution
โ โโโ context-loading/ # 11 tests - Load context before execution
โ โโโ stop-on-failure/ # 1 test - Stop on errors, don't auto-fix
โ โโโ report-first/ # 0 tests - TODO: Add error reporting workflow
โ โโโ confirm-cleanup/ # 0 tests - TODO: Add cleanup confirmation
โ
โโโ 02-workflow-stages/ # Workflow stage validation
โ โโโ analyze/ # 0 tests - TODO
โ โโโ approve/ # 0 tests - TODO
โ โโโ execute/ # 2 tests - Task execution
โ โโโ validate/ # 0 tests - TODO
โ โโโ summarize/ # 0 tests - TODO
โ โโโ confirm/ # 0 tests - TODO
โ
โโโ 03-delegation/ # Delegation scenarios
โ โโโ scale/ # 0 tests - TODO: 4+ files delegation
โ โโโ expertise/ # 0 tests - TODO: Specialized knowledge
โ โโโ complexity/ # 0 tests - TODO: Multi-step dependencies
โ โโโ review/ # 0 tests - TODO: Multi-component review
โ โโโ context-bundles/ # 0 tests - TODO: Bundle creation/passing
โ
โโโ 04-execution-paths/ # Conversational vs Task paths
โ โโโ conversational/ # 0 tests - (covered in approval-gate)
โ โโโ task/ # 2 tests - Task execution path
โ โโโ hybrid/ # 0 tests - TODO
โ
โโโ 05-edge-cases/ # Edge cases and boundaries
โ โโโ tier-conflicts/ # 0 tests - TODO: Tier 1 vs 2/3 conflicts
โ โโโ boundary/ # 0 tests - TODO: Boundary conditions
โ โโโ overrides/ # 1 test - "Just do it" override
โ โโโ negative/ # 0 tests - TODO: Negative tests
โ
โโโ 06-integration/ # Complex multi-turn scenarios
โโโ simple/ # 0 tests - TODO: 1-2 turns
โโโ medium/ # 2 tests - 3-5 turns
โโโ complex/ # 0 tests - TODO: 6+ turns
Priority: HIGHEST
Timeout: 60-120s
Must Pass: YES
Core safety requirements from OpenAgent prompt:
Run: npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/**/*.yaml"
Priority: HIGH
Timeout: 60-180s
Must Pass: SHOULD
Validates workflow stage progression:
Run: npm run eval:sdk -- --agent=openagent --pattern="02-workflow-stages/**/*.yaml"
Priority: MEDIUM
Timeout: 90-180s
Must Pass: SHOULD
Delegation scenarios (4+ files, specialized knowledge, etc.)
Run: npm run eval:sdk -- --agent=openagent --pattern="03-delegation/**/*.yaml"
Priority: MEDIUM
Timeout: 30-90s
Must Pass: SHOULD
Conversational vs Task execution paths.
Run: npm run eval:sdk -- --agent=openagent --pattern="04-execution-paths/**/*.yaml"
Priority: MEDIUM
Timeout: 60-120s
Must Pass: SHOULD
Edge cases, boundaries, overrides, negative tests.
Run: npm run eval:sdk -- --agent=openagent --pattern="05-edge-cases/**/*.yaml"
Priority: LOW
Timeout: 120-300s
Must Pass: NICE TO HAVE
Complex multi-turn scenarios testing multiple features together.
Run: npm run eval:sdk -- --agent=openagent --pattern="06-integration/**/*.yaml"
Tests run in priority order:
Critical Rules: 50% (2/4 tested)
Context Loading: 100% (5/5 task types)
Delegation Rules: 0% (0/7 tested)
Workflow Stages: 17% (1/6 tested)
01-critical-rules/report-first/01-error-report-workflow.yaml01-critical-rules/report-first/02-auto-fix-negative.yaml01-critical-rules/confirm-cleanup/01-session-cleanup.yaml01-critical-rules/confirm-cleanup/02-temp-files-cleanup.yaml03-delegation/scale/01-exactly-4-files.yaml03-delegation/scale/02-3-files-negative.yaml03-delegation/expertise/01-security-audit.yaml03-delegation/context-bundles/01-bundle-creation.yaml02-workflow-stages/validate/01-quality-check.yaml02-workflow-stages/validate/02-additional-checks-prompt.yaml02-workflow-stages/summarize/01-format-validation.yaml05-edge-cases/boundary/01-bash-ls-approval.yaml05-edge-cases/tier-conflicts/01-context-override-negative.yaml05-edge-cases/negative/01-skip-context-negative.yamlAll tests MUST use safe paths:
# โ
CORRECT - Test files
prompt: |
Create a file at evals/test_tmp/test-output.txt
# โ
CORRECT - Agent creates these automatically
.tmp/sessions/{session-id}/
.tmp/context/{session-id}/bundle.md
# โ WRONG - Don't use these
/tmp/
~/
/Users/
| Category | Simple | Multi-turn | Complex |
|---|---|---|---|
| Critical Rules | 60s | 120s | - |
| Workflow Stages | 60s | 120s | 180s |
| Delegation | 90s | 120s | 180s |
| Execution Paths | 30s | 60s | 90s |
| Edge Cases | 60s | 120s | - |
| Integration | 120s | 180s | 300s |
โ Migration Complete (Nov 26, 2024)
Next Steps:
To remove old folders (after verification):
cd evals/agents/openagent/tests
rm -rf business/ context-loading/ developer/ edge-case/
# Run critical tests only (fast)
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/**/*.yaml"
# Run critical + workflow tests
npm run eval:sdk -- --agent=openagent --pattern="0[1-2]-*/**/*.yaml"
# Run full suite
npm run eval:sdk -- --agent=openagent
Run with --debug flag:
npm run eval:sdk -- --agent=openagent --pattern="path/to/test.yaml" --debug
Check session files (preserved in debug mode):
ls ~/.local/share/opencode/storage/session/
Review event timeline in test output
Check test_tmp/ for created files:
ls -la evals/test_tmp/
{sequence}-{description}-{type}.yamlid: category-description-001
name: Human Readable Test Name
description: |
What this test validates and why it matters.
Expected behavior:
- Step 1
- Step 2
category: category-name
agent: openagent
model: anthropic/claude-sonnet-4-5
prompt: |
Test prompt here
behavior:
mustUseTools: [read, write]
requiresApproval: true
requiresContext: true
minToolCalls: 2
expectedViolations:
- rule: approval-gate
shouldViolate: false
severity: error
description: Must ask approval before writing
approvalStrategy:
type: auto-approve
timeout: 60000
tags:
- tag1
- tag2
.opencode/agent/openagent.mdevals/framework/evals/HOW_TESTS_WORK.mdevals/agents/openagent/docs/OPENAGENT_RULES.mdFOLDER_STRUCTURE.md (this directory)