
OpenAgent Test Folder Structure

Design Principles

  1. Organized by Priority & Complexity - Critical rules first, then by test complexity
  2. Manageable Execution - Complex tests isolated with appropriate timeouts
  3. Safe File Creation - All file operations use evals/test_tmp/ or .tmp/
  4. Scalable - Easy to add new tests in the right category
  5. Clear Naming - Folder names indicate purpose and execution characteristics

Folder Structure

evals/agents/openagent/tests/
├── 01-critical-rules/          # Tier 1: Critical rules (MUST pass)
│   ├── approval-gate/          # @approval_gate rule tests
│   ├── context-loading/        # @critical_context_requirement tests
│   ├── stop-on-failure/        # @stop_on_failure rule tests
│   ├── report-first/           # @report_first rule tests
│   └── confirm-cleanup/        # @confirm_cleanup rule tests
│
├── 02-workflow-stages/         # Tier 2: Workflow validation
│   ├── analyze/                # Stage 1: Analyze
│   ├── approve/                # Stage 2: Approve
│   ├── execute/                # Stage 3: Execute (routing, context loading)
│   ├── validate/               # Stage 4: Validate
│   ├── summarize/              # Stage 5: Summarize
│   └── confirm/                # Stage 6: Confirm
│
├── 03-delegation/              # Delegation scenarios
│   ├── scale/                  # 4+ files delegation
│   ├── expertise/              # Specialized knowledge delegation
│   ├── complexity/             # Multi-step dependencies
│   ├── review/                 # Multi-component review
│   └── context-bundles/        # Context bundle creation/passing
│
├── 04-execution-paths/         # Conversational vs Task paths
│   ├── conversational/         # Pure questions (no approval)
│   ├── task/                   # Execution tasks (requires approval)
│   └── hybrid/                 # Mixed scenarios
│
├── 05-edge-cases/              # Edge cases and boundary conditions
│   ├── tier-conflicts/         # Tier 1 vs Tier 2/3 priority conflicts
│   ├── boundary/               # Boundary conditions (exactly 4 files, etc.)
│   ├── overrides/              # "Just do it" and other overrides
│   └── negative/               # Negative tests (what should NOT happen)
│
└── 06-integration/             # Complex multi-turn scenarios
    ├── simple/                 # 1-2 turns, single context
    ├── medium/                 # 3-5 turns, multiple contexts
    └── complex/                # 6+ turns, delegation + validation
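
The tree above could be scaffolded with a short script. This is a minimal Python sketch, not part of the eval framework: `LAYOUT` and `create_structure` are hypothetical names, and the folder names are simply copied from the tree.

```python
from pathlib import Path

# Category folder -> subfolders, copied from the tree above.
LAYOUT = {
    "01-critical-rules": ["approval-gate", "context-loading", "stop-on-failure",
                          "report-first", "confirm-cleanup"],
    "02-workflow-stages": ["analyze", "approve", "execute", "validate",
                           "summarize", "confirm"],
    "03-delegation": ["scale", "expertise", "complexity", "review",
                      "context-bundles"],
    "04-execution-paths": ["conversational", "task", "hybrid"],
    "05-edge-cases": ["tier-conflicts", "boundary", "overrides", "negative"],
    "06-integration": ["simple", "medium", "complex"],
}


def create_structure(tests_root: Path) -> list[Path]:
    """Create every category/subfolder under tests_root; return the leaf folders."""
    created = []
    for category, subfolders in LAYOUT.items():
        for sub in subfolders:
            leaf = tests_root / category / sub
            leaf.mkdir(parents=True, exist_ok=True)  # safe to re-run
            created.append(leaf)
    return created
```

Calling `create_structure(Path("evals/agents/openagent/tests"))` from the repository root would create the 26 leaf folders shown above. Re-running it leaves existing folders untouched.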

Timeout Guidelines by Category

Critical Rules (01-critical-rules/)

  • Simple tests: 60s (60000ms)
  • Multi-turn tests: 120s (120000ms)
  • Rationale: Core functionality, should be fast

Workflow Stages (02-workflow-stages/)

  • Simple tests: 60s
  • Multi-turn tests: 120s
  • Complex validation: 180s (180000ms)

Delegation (03-delegation/)

  • Simple delegation: 90s (90000ms)
  • With context bundles: 120s
  • Complex multi-agent: 180s
  • Rationale: Delegation involves subagent coordination

Execution Paths (04-execution-paths/)

  • Conversational: 30s (30000ms)
  • Task execution: 60s
  • Hybrid: 90s

Edge Cases (05-edge-cases/)

  • Simple edge cases: 60s
  • Complex edge cases: 120s

Integration (06-integration/)

  • Simple (1-2 turns): 120s
  • Medium (3-5 turns): 180s
  • Complex (6+ turns): 300s (5 minutes)
  • Rationale: Multi-turn scenarios need time for user interaction simulation
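
The guidelines above can be collected into one lookup table, which makes it easier to check a test's timeout against its category. This is a sketch only: the values are copied from the lists above, while the kind labels (`"simple"`, `"multi-turn"`, and so on) are hypothetical shorthand and not identifiers used by the eval framework.

```python
# Timeouts in milliseconds, keyed by category folder and test kind.
# Values come from the guidelines above; the kind labels are assumed names.
TIMEOUTS_MS = {
    "01-critical-rules": {"simple": 60_000, "multi-turn": 120_000},
    "02-workflow-stages": {"simple": 60_000, "multi-turn": 120_000,
                           "complex-validation": 180_000},
    "03-delegation": {"simple": 90_000, "context-bundles": 120_000,
                      "multi-agent": 180_000},
    "04-execution-paths": {"conversational": 30_000, "task": 60_000,
                           "hybrid": 90_000},
    "05-edge-cases": {"simple": 60_000, "complex": 120_000},
    "06-integration": {"simple": 120_000, "medium": 180_000,
                       "complex": 300_000},
}


def timeout_ms(category: str, kind: str) -> int:
    """Look up the recommended timeout; raises KeyError for unknown names."""
    return TIMEOUTS_MS[category][kind]
```

For example, `timeout_ms("06-integration", "complex")` gives the 300000 ms (5 minute) ceiling for the longest integration scenarios.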

File Creation Rules

All tests MUST use these paths for file creation:

Temporary Test Files

# ✅ CORRECT
prompt: |
  Create a file at evals/test_tmp/test-output.txt

# ❌ WRONG
prompt: |
  Create a file at /tmp/test-output.txt

Session/Context Files

# ✅ CORRECT - Agent creates these automatically
# Tests verify creation at:
.tmp/sessions/{session-id}/
.tmp/context/{session-id}/bundle.md

# ❌ WRONG - Don't hardcode paths
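
The path rules above lend themselves to an automated check, for instance as a lint step over test prompts. The sketch below assumes paths are written relative to the repository root; `ALLOWED_ROOTS` and `is_allowed_path` are hypothetical names, not part of the eval framework.

```python
from pathlib import PurePosixPath

# Directories in which tests may create or verify files, per the rules above.
ALLOWED_ROOTS = (PurePosixPath("evals/test_tmp"), PurePosixPath(".tmp"))


def is_allowed_path(path: str) -> bool:
    """Return True if a repo-relative path sits under an allowed root."""
    p = PurePosixPath(path)
    if p.is_absolute() or ".." in p.parts:
        return False  # rejects /tmp/... and attempts to escape via ..
    return any(root == p or root in p.parents for root in ALLOWED_ROOTS)
```

With this check, `evals/test_tmp/test-output.txt` is accepted and `/tmp/test-output.txt` is rejected, matching the two examples under Temporary Test Files.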

Cleanup

  • evals/test_tmp/ is cleaned before/after test runs
  • .tmp/ is managed by the agent (tests verify, don't create)
  • Session files are deleted after tests, unless the --debug flag is set
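
The cleanup behavior described above might look roughly like the following. This is an illustrative sketch, not the runner's actual implementation: both function names are hypothetical, and the `debug` parameter stands in for the `--debug` flag.

```python
import shutil
from pathlib import Path


def reset_test_tmp(repo_root: Path) -> Path:
    """Empty evals/test_tmp/ by removing and recreating it."""
    test_tmp = repo_root / "evals" / "test_tmp"
    shutil.rmtree(test_tmp, ignore_errors=True)
    test_tmp.mkdir(parents=True, exist_ok=True)
    return test_tmp


def remove_sessions(repo_root: Path, debug: bool = False) -> bool:
    """Delete .tmp/sessions/ unless debug is set; return True if deleted."""
    sessions = repo_root / ".tmp" / "sessions"
    if debug or not sessions.exists():
        return False  # keep session files for inspection, or nothing to do
    shutil.rmtree(sessions)
    return True
```

Under these assumptions, a run would call `reset_test_tmp` before and after the suite, and `remove_sessions` once at the end.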

Test Naming Convention

{sequence}-{description}-{type}.yaml

Examples:
01-approval-before-bash-positive.yaml
02-approval-missing-negative.yaml
03-just-do-it-override.yaml

Sequence: 01, 02, 03... (execution order within folder)

Description: Brief description (kebab-case)

Type:

  • positive - Expected to pass
  • negative - Expected to catch violations
  • boundary - Boundary condition test
  • override - Tests override behavior
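
The convention above can be checked mechanically. The sketch below parses a filename into its three parts; it assumes a two-digit sequence (inferred from the examples) and a lowercase kebab-case description. `parse_test_name` and `ParsedName` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

TEST_TYPES = ("positive", "negative", "boundary", "override")

# {sequence}-{description}-{type}.yaml with a two-digit sequence and a
# kebab-case description (the two-digit width is an assumption).
_NAME_RE = re.compile(
    r"^(?P<sequence>\d{2})-"
    r"(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)-"
    r"(?P<type>" + "|".join(TEST_TYPES) + r")\.yaml$"
)


class ParsedName(NamedTuple):
    sequence: int
    description: str
    type: str


def parse_test_name(filename: str) -> Optional[ParsedName]:
    """Split a test filename into its parts, or return None if it does not conform."""
    m = _NAME_RE.match(filename)
    if m is None:
        return None
    return ParsedName(int(m["sequence"]), m["description"], m["type"])
```

The three example filenames above all parse successfully. A name that lacks one of the four type suffixes returns `None`, which makes the function usable as a simple lint check.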

Migration Plan

Phase 1: Move Existing Tests (Immediate)

# Current structure → New structure
business/conv-simple-001.yaml → 04-execution-paths/conversational/01-simple-question.yaml
edge-case/no-approval-negative.yaml → 01-critical-rules/approval-gate/02-skip-approval-detection.yaml
edge-case/missing-approval-negative.yaml → 01-critical-rules/approval-gate/03-missing-approval-negative.yaml
edge-case/just-do-it.yaml → 05-edge-cases/overrides/01-just-do-it.yaml
developer/fail-stop-001.yaml → 01-critical-rules/stop-on-failure/01-test-failure-stop.yaml
developer/ctx-code-001.yaml → 01-critical-rules/context-loading/01-code-task.yaml
developer/ctx-docs-001.yaml → 01-critical-rules/context-loading/02-docs-task.yaml
developer/ctx-tests-001.yaml → 01-critical-rules/context-loading/03-tests-task.yaml
developer/ctx-delegation-001.yaml → 01-critical-rules/context-loading/04-delegation-task.yaml
developer/ctx-review-001.yaml → 01-critical-rules/context-loading/05-review-task.yaml
context-loading/* → 01-critical-rules/context-loading/
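
The Phase 1 moves could be scripted rather than done by hand. The sketch below includes only three of the mappings above for brevity and defaults to a dry run; `MOVES` and `migrate` are hypothetical names. In a git checkout, `git mv` may be preferable to a plain file move.

```python
import shutil
from pathlib import Path

# A few old -> new pairs from the Phase 1 list, relative to the tests root.
MOVES = {
    "business/conv-simple-001.yaml":
        "04-execution-paths/conversational/01-simple-question.yaml",
    "edge-case/just-do-it.yaml":
        "05-edge-cases/overrides/01-just-do-it.yaml",
    "developer/fail-stop-001.yaml":
        "01-critical-rules/stop-on-failure/01-test-failure-stop.yaml",
}


def migrate(tests_root: Path, moves: dict[str, str],
            dry_run: bool = True) -> list[tuple[Path, Path]]:
    """Move each existing source file to its new path; return the (src, dst) pairs handled."""
    handled = []
    for old, new in moves.items():
        src, dst = tests_root / old, tests_root / new
        if not src.is_file():
            continue  # already moved, or not present in this checkout
        if dst.exists():
            raise FileExistsError(f"refusing to overwrite {dst}")
        if not dry_run:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
        handled.append((src, dst))
    return handled
```

Running `migrate(root, MOVES)` first lists what would move without touching anything; passing `dry_run=False` performs the moves.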

Phase 2: Add Missing Critical Tests (High Priority)

01-critical-rules/report-first/01-error-report-workflow.yaml
01-critical-rules/report-first/02-auto-fix-negative.yaml
01-critical-rules/confirm-cleanup/01-session-cleanup.yaml
01-critical-rules/confirm-cleanup/02-temp-files-cleanup.yaml

Phase 3: Add Delegation Tests (Medium Priority)

03-delegation/scale/01-exactly-4-files.yaml
03-delegation/scale/02-3-files-negative.yaml
03-delegation/expertise/01-security-audit.yaml
03-delegation/context-bundles/01-bundle-creation.yaml

Phase 4: Add Workflow & Integration Tests (Lower Priority)

02-workflow-stages/validate/01-quality-check.yaml
02-workflow-stages/validate/02-additional-checks-prompt.yaml
06-integration/complex/01-multi-turn-delegation.yaml

Running Tests by Category

# Run all critical rule tests (fast, must pass)
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/**/*.yaml"

# Run specific critical rule category
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/approval-gate/*.yaml"

# Run delegation tests (slower)
npm run eval:sdk -- --agent=openagent --pattern="03-delegation/**/*.yaml"

# Run integration tests (slowest, run last)
npm run eval:sdk -- --agent=openagent --pattern="06-integration/**/*.yaml"

# Run all tests in order (CI/CD)
npm run eval:sdk -- --agent=openagent

Test Execution Order

When running all tests, they execute in this order:

  1. 01-critical-rules/ - Fast, foundational (5-10 min)
  2. 02-workflow-stages/ - Medium speed (5-10 min)
  3. 04-execution-paths/ - Fast (2-5 min)
  4. 05-edge-cases/ - Medium speed (5-10 min)
  5. 03-delegation/ - Slower, involves subagents (10-15 min)
  6. 06-integration/ - Slowest, complex scenarios (15-30 min)

Total estimated time: roughly 40-80 minutes for the full suite
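
The execution order and time estimates above can be written down as data, which makes the total easy to verify. How the runner actually enforces this order is not described here, so the sorting helper below is an assumption; all names are hypothetical.

```python
# (folder, min minutes, max minutes) in the execution order listed above.
EXECUTION_ORDER = [
    ("01-critical-rules", 5, 10),
    ("02-workflow-stages", 5, 10),
    ("04-execution-paths", 2, 5),
    ("05-edge-cases", 5, 10),
    ("03-delegation", 10, 15),
    ("06-integration", 15, 30),
]

_RANK = {folder: i for i, (folder, _, _) in enumerate(EXECUTION_ORDER)}


def sort_by_execution_order(test_paths: list[str]) -> list[str]:
    """Order paths by category rank, then alphabetically within a category."""
    def key(path: str):
        category = path.split("/", 1)[0]
        return (_RANK.get(category, len(_RANK)), path)  # unknown folders go last
    return sorted(test_paths, key=key)


def total_minutes() -> tuple[int, int]:
    """Sum the per-category estimates into a (min, max) pair."""
    return (sum(lo for _, lo, _ in EXECUTION_ORDER),
            sum(hi for _, _, hi in EXECUTION_ORDER))
```

Summing the per-category ranges gives 42 to 80 minutes, in line with the rounded estimate above.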

Benefits of This Structure

  1. Priority-based - Critical tests run first, fail fast
  2. Isolated complexity - Complex tests don't slow down simple tests
  3. Easy navigation - Clear folder names indicate purpose
  4. Scalable - Easy to add new tests in the right category
  5. CI/CD friendly - Can run subsets based on priority
  6. Debugging - Easy to isolate and debug specific categories
  7. Documentation - Structure itself documents test organization

Next Steps

  1. Create folder structure
  2. Migrate existing tests
  3. Add missing critical tests
  4. Update CI/CD to run by priority
  5. Document test patterns in each category