evals/agents/openagent/tests/
├── 01-critical-rules/            # Tier 1: Critical rules (MUST pass)
│   ├── approval-gate/            # @approval_gate rule tests
│   ├── context-loading/          # @critical_context_requirement tests
│   ├── stop-on-failure/          # @stop_on_failure rule tests
│   ├── report-first/             # @report_first rule tests
│   └── confirm-cleanup/          # @confirm_cleanup rule tests
│
├── 02-workflow-stages/           # Tier 2: Workflow validation
│   ├── analyze/                  # Stage 1: Analyze
│   ├── approve/                  # Stage 2: Approve
│   ├── execute/                  # Stage 3: Execute (routing, context loading)
│   ├── validate/                 # Stage 4: Validate
│   ├── summarize/                # Stage 5: Summarize
│   └── confirm/                  # Stage 6: Confirm
│
├── 03-delegation/                # Delegation scenarios
│   ├── scale/                    # 4+ files delegation
│   ├── expertise/                # Specialized knowledge delegation
│   ├── complexity/               # Multi-step dependencies
│   ├── review/                   # Multi-component review
│   └── context-bundles/          # Context bundle creation/passing
│
├── 04-execution-paths/           # Conversational vs Task paths
│   ├── conversational/           # Pure questions (no approval)
│   ├── task/                     # Execution tasks (requires approval)
│   └── hybrid/                   # Mixed scenarios
│
├── 05-edge-cases/                # Edge cases and boundary conditions
│   ├── tier-conflicts/           # Tier 1 vs Tier 2/3 priority conflicts
│   ├── boundary/                 # Boundary conditions (exactly 4 files, etc.)
│   ├── overrides/                # "Just do it" and other overrides
│   └── negative/                 # Negative tests (what should NOT happen)
│
└── 06-integration/               # Complex multi-turn scenarios
    ├── simple/                   # 1-2 turns, single context
    ├── medium/                   # 3-5 turns, multiple contexts
    └── complex/                  # 6+ turns, delegation + validation
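The tier folders above can be scaffolded in one pass. A minimal sketch, assuming the suite root shown at the top of the tree:

```shell
# Create all tier folders and their subfolders (root path per the tree above)
root="evals/agents/openagent/tests"
for d in \
  01-critical-rules/approval-gate    01-critical-rules/context-loading \
  01-critical-rules/stop-on-failure  01-critical-rules/report-first \
  01-critical-rules/confirm-cleanup \
  02-workflow-stages/analyze   02-workflow-stages/approve   02-workflow-stages/execute \
  02-workflow-stages/validate  02-workflow-stages/summarize 02-workflow-stages/confirm \
  03-delegation/scale  03-delegation/expertise  03-delegation/complexity \
  03-delegation/review 03-delegation/context-bundles \
  04-execution-paths/conversational 04-execution-paths/task 04-execution-paths/hybrid \
  05-edge-cases/tier-conflicts 05-edge-cases/boundary \
  05-edge-cases/overrides      05-edge-cases/negative \
  06-integration/simple 06-integration/medium 06-integration/complex
do
  mkdir -p "$root/$d"   # -p creates parents and is a no-op if the folder exists
done
```

`mkdir -p` makes the script safe to re-run against an existing checkout.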
All tests MUST use these paths for file creation:
# ✅ CORRECT
prompt: |
Create a file at evals/test_tmp/test-output.txt
# ❌ WRONG
prompt: |
Create a file at /tmp/test-output.txt
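The `/tmp/` rule can be linted mechanically. A sketch, assuming the tests root shown in the tree above (adjust to your layout):

```shell
# Flag any test YAML that hardcodes /tmp/ (tests must use evals/test_tmp/ instead)
root="evals/agents/openagent/tests"
bad=$(grep -rl --include='*.yaml' '/tmp/' "$root" 2>/dev/null || true)
if [ -n "$bad" ]; then
  printf 'hardcoded /tmp in:\n%s\n' "$bad"
else
  echo "no hardcoded /tmp paths"
fi
```

Note that `evals/test_tmp/` does not contain the substring `/tmp/`, so correct usages are not flagged.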
# ✅ CORRECT - Agent creates these automatically
# Tests verify creation at:
.tmp/sessions/{session-id}/
.tmp/context/{session-id}/bundle.md
# ❌ WRONG - Don't hardcode paths
evals/test_tmp/ is cleaned before/after test runs.
.tmp/ is managed by the agent (tests verify, don't create).

File naming convention: {sequence}-{description}-{type}.yaml
Examples:
01-approval-before-bash-positive.yaml
02-approval-missing-negative.yaml
03-just-do-it-override.yaml
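Names of this shape can be spot-checked mechanically. A sketch (the `check` helper is hypothetical):

```shell
# Check a filename against {sequence}-{description}-{type}.yaml
# type must be one of: positive | negative | boundary | override
pattern='^[0-9]{2}-[a-z0-9]+(-[a-z0-9]+)*-(positive|negative|boundary|override)\.yaml$'
check() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "ok:  $1"
  else
    echo "bad: $1"
  fi
}
check 01-approval-before-bash-positive.yaml
check 03-just-do-it-override.yaml
check conv-simple-001.yaml          # old naming scheme, should fail
```

Some test names elsewhere in this document omit the type suffix, so treat the check as advisory rather than a hard gate.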
Sequence: 01, 02, 03... (execution order within folder)
Description: Brief description (kebab-case)
Type:
  positive - Expected to pass
  negative - Expected to catch violations
  boundary - Boundary condition test
  override - Tests override behavior

# Current structure → New structure
business/conv-simple-001.yaml → 04-execution-paths/conversational/01-simple-question.yaml
edge-case/no-approval-negative.yaml → 01-critical-rules/approval-gate/02-skip-approval-detection.yaml
edge-case/missing-approval-negative.yaml → 01-critical-rules/approval-gate/03-missing-approval-negative.yaml
edge-case/just-do-it.yaml → 05-edge-cases/overrides/01-just-do-it.yaml
developer/fail-stop-001.yaml → 01-critical-rules/stop-on-failure/01-test-failure-stop.yaml
developer/ctx-code-001.yaml → 01-critical-rules/context-loading/01-code-task.yaml
developer/ctx-docs-001.yaml → 01-critical-rules/context-loading/02-docs-task.yaml
developer/ctx-tests-001.yaml → 01-critical-rules/context-loading/03-tests-task.yaml
developer/ctx-delegation-001.yaml → 01-critical-rules/context-loading/04-delegation-task.yaml
developer/ctx-review-001.yaml → 01-critical-rules/context-loading/05-review-task.yaml
context-loading/* → 01-critical-rules/context-loading/
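The mapping above can be applied with a small helper. A sketch (the `move` function is hypothetical; run it from the tests root):

```shell
# Move a test to its new home, creating the target tier folder on demand
move() {
  if [ -e "$1" ]; then
    mkdir -p "$(dirname "$2")"
    mv "$1" "$2"
  else
    echo "skip (missing): $1"
  fi
}
move business/conv-simple-001.yaml        04-execution-paths/conversational/01-simple-question.yaml
move edge-case/no-approval-negative.yaml  01-critical-rules/approval-gate/02-skip-approval-detection.yaml
move developer/fail-stop-001.yaml         01-critical-rules/stop-on-failure/01-test-failure-stop.yaml
# ...remaining mappings follow the same pattern
```

If the suite lives in git, prefer `git mv` over `mv` so file history follows the rename.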
New tests (no existing test to migrate):
01-critical-rules/report-first/01-error-report-workflow.yaml
01-critical-rules/report-first/02-auto-fix-negative.yaml
01-critical-rules/confirm-cleanup/01-session-cleanup.yaml
01-critical-rules/confirm-cleanup/02-temp-files-cleanup.yaml
03-delegation/scale/01-exactly-4-files.yaml
03-delegation/scale/02-3-files-negative.yaml
03-delegation/expertise/01-security-audit.yaml
03-delegation/context-bundles/01-bundle-creation.yaml
02-workflow-stages/validate/01-quality-check.yaml
02-workflow-stages/validate/02-additional-checks-prompt.yaml
06-integration/complex/01-multi-turn-delegation.yaml
# Run all critical rule tests (fast, must pass)
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/**/*.yaml"
# Run specific critical rule category
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/approval-gate/*.yaml"
# Run delegation tests (slower)
npm run eval:sdk -- --agent=openagent --pattern="03-delegation/**/*.yaml"
# Run integration tests (slowest, run last)
npm run eval:sdk -- --agent=openagent --pattern="06-integration/**/*.yaml"
# Run all tests in order (CI/CD)
npm run eval:sdk -- --agent=openagent
When running all tests, the numbered folders execute in order, 01-critical-rules first through 06-integration last.
Total estimated time: 40-80 minutes for full suite