evals/agents/openagent/tests/
├── 01-critical-rules/            # Tier 1: Critical rules (MUST pass)
│   ├── approval-gate/            # @approval_gate rule tests
│   ├── context-loading/          # @critical_context_requirement tests
│   ├── stop-on-failure/          # @stop_on_failure rule tests
│   ├── report-first/             # @report_first rule tests
│   └── confirm-cleanup/          # @confirm_cleanup rule tests
│
├── 02-workflow-stages/           # Tier 2: Workflow validation
│   ├── analyze/                  # Stage 1: Analyze
│   ├── approve/                  # Stage 2: Approve
│   ├── execute/                  # Stage 3: Execute (routing, context loading)
│   ├── validate/                 # Stage 4: Validate
│   ├── summarize/                # Stage 5: Summarize
│   └── confirm/                  # Stage 6: Confirm
│
├── 03-delegation/                # Delegation scenarios
│   ├── scale/                    # 4+ files delegation
│   ├── expertise/                # Specialized knowledge delegation
│   ├── complexity/               # Multi-step dependencies
│   ├── review/                   # Multi-component review
│   └── context-bundles/          # Context bundle creation/passing
│
├── 04-execution-paths/           # Conversational vs Task paths
│   ├── conversational/           # Pure questions (no approval)
│   ├── task/                     # Execution tasks (requires approval)
│   └── hybrid/                   # Mixed scenarios
│
├── 05-edge-cases/                # Edge cases and boundary conditions
│   ├── tier-conflicts/           # Tier 1 vs Tier 2/3 priority conflicts
│   ├── boundary/                 # Boundary conditions (exactly 4 files, etc.)
│   ├── overrides/                # "Just do it" and other overrides
│   └── negative/                 # Negative tests (what should NOT happen)
│
└── 06-integration/               # Complex multi-turn scenarios
    ├── simple/                   # 1-2 turns, single context
    ├── medium/                   # 3-5 turns, multiple contexts
    └── complex/                  # 6+ turns, delegation + validation
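The tier folders above can be scaffolded in one pass. A minimal sketch, assuming the suite root shown at the top of the tree:

```shell
# Create all tier folders and their subfolders (root path per the tree above)
root="evals/agents/openagent/tests"
for d in \
  01-critical-rules/approval-gate    01-critical-rules/context-loading \
  01-critical-rules/stop-on-failure  01-critical-rules/report-first \
  01-critical-rules/confirm-cleanup \
  02-workflow-stages/analyze   02-workflow-stages/approve   02-workflow-stages/execute \
  02-workflow-stages/validate  02-workflow-stages/summarize 02-workflow-stages/confirm \
  03-delegation/scale  03-delegation/expertise  03-delegation/complexity \
  03-delegation/review 03-delegation/context-bundles \
  04-execution-paths/conversational 04-execution-paths/task 04-execution-paths/hybrid \
  05-edge-cases/tier-conflicts 05-edge-cases/boundary \
  05-edge-cases/overrides      05-edge-cases/negative \
  06-integration/simple 06-integration/medium 06-integration/complex
do
  mkdir -p "$root/$d"   # -p creates parents and is a no-op if the folder exists
done
```

`mkdir -p` makes the script safe to re-run against an existing checkout.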
All tests MUST use these paths for file creation:
# ✅ CORRECT
prompt: |
Create a file at evals/test_tmp/test-output.txt
# ❌ WRONG
prompt: |
Create a file at /tmp/test-output.txt
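The `/tmp/` rule can be linted mechanically. A sketch, assuming the tests root shown in the tree above (adjust to your layout):

```shell
# Flag any test YAML that hardcodes /tmp/ (tests must use evals/test_tmp/ instead)
root="evals/agents/openagent/tests"
bad=$(grep -rl --include='*.yaml' '/tmp/' "$root" 2>/dev/null || true)
if [ -n "$bad" ]; then
  printf 'hardcoded /tmp in:\n%s\n' "$bad"
else
  echo "no hardcoded /tmp paths"
fi
```

Note that `evals/test_tmp/` does not contain the substring `/tmp/`, so correct usages are not flagged.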
# ✅ CORRECT - Agent creates these automatically
# Tests verify creation at:
.tmp/sessions/{session-id}/
.tmp/context/{session-id}/bundle.md
# ❌ WRONG - Don't hardcode paths
evals/test_tmp/ is cleaned before/after test runs.
.tmp/ is managed by the agent (tests verify, don't create).

File naming convention: {sequence}-{description}-{type}.yaml
Examples:
01-approval-before-bash-positive.yaml
02-approval-missing-negative.yaml
03-just-do-it-override.yaml
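Names of this shape can be spot-checked mechanically. A sketch (the `check` helper is hypothetical):

```shell
# Check a filename against {sequence}-{description}-{type}.yaml
# type must be one of: positive | negative | boundary | override
pattern='^[0-9]{2}-[a-z0-9]+(-[a-z0-9]+)*-(positive|negative|boundary|override)\.yaml$'
check() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "ok:  $1"
  else
    echo "bad: $1"
  fi
}
check 01-approval-before-bash-positive.yaml
check 03-just-do-it-override.yaml
check conv-simple-001.yaml          # old naming scheme, should fail
```

Some test names elsewhere in this document omit the type suffix, so treat the check as advisory rather than a hard gate.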
Sequence: 01, 02, 03... (execution order within folder)
Description: Brief description (kebab-case)
Type:
  positive - Expected to pass
  negative - Expected to catch violations
  boundary - Boundary condition test
  override - Tests override behavior

# Current structure → New structure
business/conv-simple-001.yaml → 04-execution-paths/conversational/01-simple-question.yaml
edge-case/no-approval-negative.yaml → 01-critical-rules/approval-gate/02-skip-approval-detection.yaml
edge-case/missing-approval-negative.yaml → 01-critical-rules/approval-gate/03-missing-approval-negative.yaml
edge-case/just-do-it.yaml → 05-edge-cases/overrides/01-just-do-it.yaml
developer/fail-stop-001.yaml → 01-critical-rules/stop-on-failure/01-test-failure-stop.yaml
developer/ctx-code-001.yaml → 01-critical-rules/context-loading/01-code-task.yaml
developer/ctx-docs-001.yaml → 01-critical-rules/context-loading/02-docs-task.yaml
developer/ctx-tests-001.yaml → 01-critical-rules/context-loading/03-tests-task.yaml
developer/ctx-delegation-001.yaml → 01-critical-rules/context-loading/04-delegation-task.yaml
developer/ctx-review-001.yaml → 01-critical-rules/context-loading/05-review-task.yaml
context-loading/* → 01-critical-rules/context-loading/
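The mapping above can be applied with a small helper. A sketch (the `move` function is hypothetical; run it from the tests root):

```shell
# Move a test to its new home, creating the target tier folder on demand
move() {
  if [ -e "$1" ]; then
    mkdir -p "$(dirname "$2")"
    mv "$1" "$2"
  else
    echo "skip (missing): $1"
  fi
}
move business/conv-simple-001.yaml        04-execution-paths/conversational/01-simple-question.yaml
move edge-case/no-approval-negative.yaml  01-critical-rules/approval-gate/02-skip-approval-detection.yaml
move developer/fail-stop-001.yaml         01-critical-rules/stop-on-failure/01-test-failure-stop.yaml
# ...remaining mappings follow the same pattern
```

If the suite lives in git, prefer `git mv` over `mv` so file history follows the rename.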
New tests (no existing test to migrate):
01-critical-rules/report-first/01-error-report-workflow.yaml
01-critical-rules/report-first/02-auto-fix-negative.yaml
01-critical-rules/confirm-cleanup/01-session-cleanup.yaml
01-critical-rules/confirm-cleanup/02-temp-files-cleanup.yaml
03-delegation/scale/01-exactly-4-files.yaml
03-delegation/scale/02-3-files-negative.yaml
03-delegation/expertise/01-security-audit.yaml
03-delegation/context-bundles/01-bundle-creation.yaml
02-workflow-stages/validate/01-quality-check.yaml
02-workflow-stages/validate/02-additional-checks-prompt.yaml
06-integration/complex/01-multi-turn-delegation.yaml
# Run all critical rule tests (fast, must pass)
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/**/*.yaml"
# Run specific critical rule category
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/approval-gate/*.yaml"
# Run delegation tests (slower)
npm run eval:sdk -- --agent=openagent --pattern="03-delegation/**/*.yaml"
# Run integration tests (slowest, run last)
npm run eval:sdk -- --agent=openagent --pattern="06-integration/**/*.yaml"
# Run all tests in order (CI/CD)
npm run eval:sdk -- --agent=openagent
When running all tests, the numbered folders execute in order, 01-critical-rules first through 06-integration last.
Total estimated time: 40-80 minutes for full suite