# OpenAgent Test Folder Structure ## Design Principles 1. **Organized by Priority & Complexity** - Critical rules first, then by test complexity 2. **Manageable Execution** - Complex tests isolated with appropriate timeouts 3. **Safe File Creation** - All file operations use `evals/test_tmp/` or `.tmp/` 4. **Scalable** - Easy to add new tests in the right category 5. **Clear Naming** - Folder names indicate purpose and execution characteristics ## Folder Structure ``` evals/agents/openagent/tests/ ├── 01-critical-rules/ # Tier 1: Critical rules (MUST pass) │ ├── approval-gate/ # @approval_gate rule tests │ ├── context-loading/ # @critical_context_requirement tests │ ├── stop-on-failure/ # @stop_on_failure rule tests │ ├── report-first/ # @report_first rule tests │ └── confirm-cleanup/ # @confirm_cleanup rule tests │ ├── 02-workflow-stages/ # Tier 2: Workflow validation │ ├── analyze/ # Stage 1: Analyze │ ├── approve/ # Stage 2: Approve │ ├── execute/ # Stage 3: Execute (routing, context loading) │ ├── validate/ # Stage 4: Validate │ ├── summarize/ # Stage 5: Summarize │ └── confirm/ # Stage 6: Confirm │ ├── 03-delegation/ # Delegation scenarios │ ├── scale/ # 4+ files delegation │ ├── expertise/ # Specialized knowledge delegation │ ├── complexity/ # Multi-step dependencies │ ├── review/ # Multi-component review │ └── context-bundles/ # Context bundle creation/passing │ ├── 04-execution-paths/ # Conversational vs Task paths │ ├── conversational/ # Pure questions (no approval) │ ├── task/ # Execution tasks (requires approval) │ └── hybrid/ # Mixed scenarios │ ├── 05-edge-cases/ # Edge cases and boundary conditions │ ├── tier-conflicts/ # Tier 1 vs Tier 2/3 priority conflicts │ ├── boundary/ # Boundary conditions (exactly 4 files, etc.) │ ├── overrides/ # "Just do it" and other overrides │ └── negative/ # Negative tests (what should NOT happen) │ └── 06-integration/ # Complex multi-turn scenarios ├── simple/ # 1-2 turns, single context ├── medium/ # 3-5 turns, multiple contexts └── complex/ # 6+ turns, delegation + validation ``` ## Timeout Guidelines by Category ### Critical Rules (01-critical-rules/) - **Simple tests**: 60s (60000ms) - **Multi-turn tests**: 120s (120000ms) - **Rationale**: Core functionality, should be fast ### Workflow Stages (02-workflow-stages/) - **Simple tests**: 60s - **Multi-turn tests**: 120s - **Complex validation**: 180s (180000ms) ### Delegation (03-delegation/) - **Simple delegation**: 90s (90000ms) - **With context bundles**: 120s - **Complex multi-agent**: 180s - **Rationale**: Delegation involves subagent coordination ### Execution Paths (04-execution-paths/) - **Conversational**: 30s (30000ms) - **Task execution**: 60s - **Hybrid**: 90s ### Edge Cases (05-edge-cases/) - **Simple edge cases**: 60s - **Complex edge cases**: 120s ### Integration (06-integration/) - **Simple (1-2 turns)**: 120s - **Medium (3-5 turns)**: 180s - **Complex (6+ turns)**: 300s (5 minutes) - **Rationale**: Multi-turn scenarios need time for user interaction simulation ## File Creation Rules All tests MUST use these paths for file creation: ### Temporary Test Files ```yaml # ✅ CORRECT prompt: | Create a file at evals/test_tmp/test-output.txt # ❌ WRONG prompt: | Create a file at /tmp/test-output.txt ``` ### Session/Context Files ```yaml # ✅ CORRECT - Agent creates these automatically # Tests verify creation at: .tmp/sessions/{session-id}/ .tmp/context/{session-id}/bundle.md # ❌ WRONG - Don't hardcode paths ``` ### Cleanup - `evals/test_tmp/` is cleaned before/after test runs - `.tmp/` is managed by the agent (tests verify, don't create) - Session files deleted after tests (unless --debug flag) ## Test Naming Convention ``` {sequence}-{description}-{type}.yaml Examples: 01-approval-before-bash-positive.yaml 02-approval-missing-negative.yaml 03-just-do-it-override.yaml ``` **Sequence**: 01, 02, 03... (execution order within folder) **Description**: Brief description (kebab-case) **Type**: - `positive` - Expected to pass - `negative` - Expected to catch violations - `boundary` - Boundary condition test - `override` - Tests override behavior ## Migration Plan ### Phase 1: Move Existing Tests (Immediate) ```bash # Current structure → New structure business/conv-simple-001.yaml → 04-execution-paths/conversational/01-simple-question.yaml edge-case/no-approval-negative.yaml → 01-critical-rules/approval-gate/02-skip-approval-detection.yaml edge-case/missing-approval-negative.yaml → 01-critical-rules/approval-gate/03-missing-approval-negative.yaml edge-case/just-do-it.yaml → 05-edge-cases/overrides/01-just-do-it.yaml developer/fail-stop-001.yaml → 01-critical-rules/stop-on-failure/01-test-failure-stop.yaml developer/ctx-code-001.yaml → 01-critical-rules/context-loading/01-code-task.yaml developer/ctx-docs-001.yaml → 01-critical-rules/context-loading/02-docs-task.yaml developer/ctx-tests-001.yaml → 01-critical-rules/context-loading/03-tests-task.yaml developer/ctx-delegation-001.yaml → 01-critical-rules/context-loading/04-delegation-task.yaml developer/ctx-review-001.yaml → 01-critical-rules/context-loading/05-review-task.yaml context-loading/* → 01-critical-rules/context-loading/ ``` ### Phase 2: Add Missing Critical Tests (High Priority) ``` 01-critical-rules/report-first/01-error-report-workflow.yaml 01-critical-rules/report-first/02-auto-fix-negative.yaml 01-critical-rules/confirm-cleanup/01-session-cleanup.yaml 01-critical-rules/confirm-cleanup/02-temp-files-cleanup.yaml ``` ### Phase 3: Add Delegation Tests (Medium Priority) ``` 03-delegation/scale/01-exactly-4-files.yaml 03-delegation/scale/02-3-files-negative.yaml 03-delegation/expertise/01-security-audit.yaml 03-delegation/context-bundles/01-bundle-creation.yaml ``` ### Phase 4: Add Workflow & Integration Tests (Lower Priority) ``` 02-workflow-stages/validate/01-quality-check.yaml 02-workflow-stages/validate/02-additional-checks-prompt.yaml 06-integration/complex/01-multi-turn-delegation.yaml ``` ## Running Tests by Category ```bash # Run all critical rule tests (fast, must pass) npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/**/*.yaml" # Run specific critical rule category npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/approval-gate/*.yaml" # Run delegation tests (slower) npm run eval:sdk -- --agent=openagent --pattern="03-delegation/**/*.yaml" # Run integration tests (slowest, run last) npm run eval:sdk -- --agent=openagent --pattern="06-integration/**/*.yaml" # Run all tests in order (CI/CD) npm run eval:sdk -- --agent=openagent ``` ## Test Execution Order When running all tests, they execute in this order: 1. **01-critical-rules/** - Fast, foundational (5-10 min) 2. **02-workflow-stages/** - Medium speed (5-10 min) 3. **04-execution-paths/** - Fast (2-5 min) 4. **05-edge-cases/** - Medium speed (5-10 min) 5. **03-delegation/** - Slower, involves subagents (10-15 min) 6. **06-integration/** - Slowest, complex scenarios (15-30 min) **Total estimated time**: 40-80 minutes for full suite ## Benefits of This Structure 1. **Priority-based** - Critical tests run first, fail fast 2. **Isolated complexity** - Complex tests don't slow down simple tests 3. **Easy navigation** - Clear folder names indicate purpose 4. **Scalable** - Easy to add new tests in right category 5. **CI/CD friendly** - Can run subsets based on priority 6. **Debugging** - Easy to isolate and debug specific categories 7. **Documentation** - Structure itself documents test organization ## Next Steps 1. Create folder structure 2. Migrate existing tests 3. Add missing critical tests 4. Update CI/CD to run by priority 5. Document test patterns in each category