
OpenAgent Test Suite

Total Tests: 22 (migrated) + new tests to be added
Estimated Full Suite Runtime: 40-80 minutes
Last Updated: Nov 26, 2024

Quick Start


# Run core tests (RECOMMENDED - 7 tests, ~5-8 min)
npm run test:core

# Run all tests (full suite - 71 tests, ~40-80 min)
npm run test:openagent

# Run critical tests only
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/**/*.yaml"

# Run specific category
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/approval-gate/*.yaml"

# Debug mode (keeps sessions, verbose output)
npm run eval:sdk -- --agent=openagent --debug

Core Test Suite ⚡

NEW: The core test suite contains 7 carefully selected tests that cover ~85% of critical functionality in 5-8 minutes.

Quick Commands

# NPM (from root)
npm run test:core

# Script
./scripts/test.sh openagent --core

# Direct
cd evals/framework && npm run eval:sdk:core -- --agent=openagent

What's Included?

| # | Test | Category | Time | Priority |
|---|------|----------|------|----------|
| 1 | Approval Gate | Critical Rules | 30-60s | ⚡ CRITICAL |
| 2 | Context Loading (Simple) | Critical Rules | 60-90s | ⚡ CRITICAL |
| 3 | Context Loading (Multi-Turn) | Critical Rules | 120-180s | 🔥 HIGH |
| 4 | Stop on Failure | Critical Rules | 60-90s | ⚡ CRITICAL |
| 5 | Simple Task (No Delegation) | Delegation | 30-60s | 🔥 HIGH |
| 6 | Subagent Delegation | Integration | 90-120s | 🔥 HIGH |
| 7 | Tool Usage | Tool Usage | 30-60s | 📋 MEDIUM |

Total Runtime: 5-8 minutes
Coverage: ~85% of critical functionality

When to Use Core vs Full?

Use Core Suite (7 tests, 5-8 min):

  • ✅ Prompt iteration and testing
  • ✅ Development and quick validation
  • ✅ Pre-commit hooks
  • ✅ PR validation in CI/CD

Use Full Suite (71 tests, 40-80 min):

  • 🔬 Release validation
  • 🔬 Comprehensive testing
  • 🔬 Edge case coverage
  • 🔬 Regression testing

See: ../CORE_TESTS.md for detailed documentation


Folder Structure

tests/
├── 01-critical-rules/          # MUST PASS - Core safety requirements
│   ├── approval-gate/          # 3 tests - Approval before execution
│   ├── context-loading/        # 11 tests - Load context before execution
│   ├── stop-on-failure/        # 1 test - Stop on errors, don't auto-fix
│   ├── report-first/           # 0 tests - TODO: Add error reporting workflow
│   └── confirm-cleanup/        # 0 tests - TODO: Add cleanup confirmation
│
├── 02-workflow-stages/         # Workflow stage validation
│   ├── analyze/                # 0 tests - TODO
│   ├── approve/                # 0 tests - TODO
│   ├── execute/                # 2 tests - Task execution
│   ├── validate/               # 0 tests - TODO
│   ├── summarize/              # 0 tests - TODO
│   └── confirm/                # 0 tests - TODO
│
├── 03-delegation/              # Delegation scenarios
│   ├── scale/                  # 0 tests - TODO: 4+ files delegation
│   ├── expertise/              # 0 tests - TODO: Specialized knowledge
│   ├── complexity/             # 0 tests - TODO: Multi-step dependencies
│   ├── review/                 # 0 tests - TODO: Multi-component review
│   └── context-bundles/        # 0 tests - TODO: Bundle creation/passing
│
├── 04-execution-paths/         # Conversational vs Task paths
│   ├── conversational/         # 0 tests - (covered in approval-gate)
│   ├── task/                   # 2 tests - Task execution path
│   └── hybrid/                 # 0 tests - TODO
│
├── 05-edge-cases/              # Edge cases and boundaries
│   ├── tier-conflicts/         # 0 tests - TODO: Tier 1 vs 2/3 conflicts
│   ├── boundary/               # 0 tests - TODO: Boundary conditions
│   ├── overrides/              # 1 test - "Just do it" override
│   └── negative/               # 0 tests - TODO: Negative tests
│
└── 06-integration/             # Complex multi-turn scenarios
    ├── simple/                 # 0 tests - TODO: 1-2 turns
    ├── medium/                 # 2 tests - 3-5 turns
    └── complex/                # 0 tests - TODO: 6+ turns

Test Categories

01-critical-rules/ (15 tests)

Priority: HIGHEST
Timeout: 60-120s
Must Pass: YES

Core safety requirements from OpenAgent prompt:

  • ✅ approval-gate (3 tests) - Request approval before execution
  • ✅ context-loading (11 tests) - Load context files before execution
  • ✅ stop-on-failure (1 test) - Stop on errors, don't auto-fix
  • ❌ report-first (0 tests) - Error reporting workflow
  • ❌ confirm-cleanup (0 tests) - Cleanup confirmation

Run: npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/**/*.yaml"

02-workflow-stages/ (2 tests)

Priority: HIGH
Timeout: 60-180s
Must Pass: SHOULD

Validates workflow stage progression:

  • Analyze → Approve → Execute → Validate → Summarize → Confirm

Run: npm run eval:sdk -- --agent=openagent --pattern="02-workflow-stages/**/*.yaml"

03-delegation/ (0 tests)

Priority: MEDIUM
Timeout: 90-180s
Must Pass: SHOULD

Delegation scenarios (4+ files, specialized knowledge, etc.)

Run: npm run eval:sdk -- --agent=openagent --pattern="03-delegation/**/*.yaml"

04-execution-paths/ (2 tests)

Priority: MEDIUM
Timeout: 30-90s
Must Pass: SHOULD

Conversational vs Task execution paths.

Run: npm run eval:sdk -- --agent=openagent --pattern="04-execution-paths/**/*.yaml"

05-edge-cases/ (1 test)

Priority: MEDIUM
Timeout: 60-120s
Must Pass: SHOULD

Edge cases, boundaries, overrides, negative tests.

Run: npm run eval:sdk -- --agent=openagent --pattern="05-edge-cases/**/*.yaml"

06-integration/ (2 tests)

Priority: LOW
Timeout: 120-300s
Must Pass: NICE TO HAVE

Complex multi-turn scenarios testing multiple features together.

Run: npm run eval:sdk -- --agent=openagent --pattern="06-integration/**/*.yaml"

Test Execution Order

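Tests run in the order below, which differs from the folder numbering: delegation runs after edge cases because it involves subagents and is slower.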

  1. 01-critical-rules/ (5-10 min) - Fast, foundational
  2. 02-workflow-stages/ (5-10 min) - Medium speed
  3. 04-execution-paths/ (2-5 min) - Fast
  4. 05-edge-cases/ (5-10 min) - Medium speed
  5. 03-delegation/ (10-15 min) - Slower, involves subagents
  6. 06-integration/ (15-30 min) - Slowest, complex scenarios
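
To run the categories by hand in the order listed above, the patterns can be generated rather than typed. This is a minimal sketch: the function names `ordered_patterns` and `print_ordered_commands` are hypothetical and not part of the framework, and the commands are printed rather than executed.

```shell
#!/bin/sh
# Print one eval pattern per line, in the execution order listed above.
ordered_patterns() {
  for category in 01-critical-rules 02-workflow-stages 04-execution-paths \
                  05-edge-cases 03-delegation 06-integration; do
    printf '%s/**/*.yaml\n' "$category"
  done
}

# Print the eval command for each category (hypothetical helper).
# Pipe the output to `sh` to actually run the commands.
print_ordered_commands() {
  ordered_patterns | while IFS= read -r pattern; do
    echo "npm run eval:sdk -- --agent=openagent --pattern=\"$pattern\""
  done
}

print_ordered_commands
```

Printing first makes it easy to check the order before committing to a long run.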

Coverage Analysis

Current Coverage (22 tests)

Critical Rules: 50% (2/4 tested)

  • ✅ approval_gate (3 tests)
  • ⚠️ stop_on_failure (1 test - partial)
  • ❌ report_first (0 tests)
  • ❌ confirm_cleanup (0 tests)

Context Loading: 100% (5/5 task types)

  • ✅ code.md (2 tests)
  • ✅ docs.md (2 tests)
  • ✅ tests.md (2 tests)
  • ✅ delegation.md (1 test)
  • ✅ review.md (1 test)
  • ✅ Multi-context (3 tests)

Delegation Rules: 0% (0/7 tested)

  • โŒ 4+ files
  • โŒ specialized knowledge
  • โŒ multi-component review
  • โŒ complexity
  • โŒ fresh eyes
  • โŒ simulation
  • โŒ user request

Workflow Stages: 17% (1/6 tested)

  • โŒ Analyze
  • โŒ Approve
  • โš ๏ธ Execute (2 tests - partial)
  • โŒ Validate
  • โŒ Summarize
  • โŒ Confirm

Target Coverage: 80%+
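
The percentages above appear to be tested items divided by total items, rounded to the nearest whole number. A minimal sketch of that arithmetic, assuming that rounding rule (the helper name `coverage_pct` is hypothetical):

```shell
#!/bin/sh
# coverage_pct TESTED TOTAL
# Prints the integer percentage, rounded to nearest. Prints 0 when TOTAL is 0.
coverage_pct() {
  tested=$1
  total=$2
  if [ "$total" -eq 0 ]; then
    echo 0
    return 0
  fi
  echo $(( (100 * tested + total / 2) / total ))
}

coverage_pct 2 4   # Critical Rules   -> 50
coverage_pct 5 5   # Context Loading  -> 100
coverage_pct 0 7   # Delegation Rules -> 0
coverage_pct 1 6   # Workflow Stages  -> 17
```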

Missing Tests (High Priority)

Critical Rules (MUST ADD)

  1. 01-critical-rules/report-first/01-error-report-workflow.yaml
  2. 01-critical-rules/report-first/02-auto-fix-negative.yaml
  3. 01-critical-rules/confirm-cleanup/01-session-cleanup.yaml
  4. 01-critical-rules/confirm-cleanup/02-temp-files-cleanup.yaml

Delegation (SHOULD ADD)

  1. 03-delegation/scale/01-exactly-4-files.yaml
  2. 03-delegation/scale/02-3-files-negative.yaml
  3. 03-delegation/expertise/01-security-audit.yaml
  4. 03-delegation/context-bundles/01-bundle-creation.yaml

Workflow Stages (SHOULD ADD)

  1. 02-workflow-stages/validate/01-quality-check.yaml
  2. 02-workflow-stages/validate/02-additional-checks-prompt.yaml
  3. 02-workflow-stages/summarize/01-format-validation.yaml

Edge Cases (NICE TO HAVE)

  1. 05-edge-cases/boundary/01-bash-ls-approval.yaml
  2. 05-edge-cases/tier-conflicts/01-context-override-negative.yaml
  3. 05-edge-cases/negative/01-skip-context-negative.yaml

File Creation Rules

All tests MUST use safe paths:

# ✅ CORRECT - Test files
prompt: |
  Create a file at evals/test_tmp/test-output.txt

# ✅ CORRECT - Agent creates these automatically
.tmp/sessions/{session-id}/
.tmp/context/{session-id}/bundle.md

# โŒ WRONG - Don't use these
/tmp/
~/
/Users/

Timeout Guidelines

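| Category | Simple | Multi-turn | Complex |
|----------|--------|------------|---------|
| Critical Rules | 60s | 120s | - |
| Workflow Stages | 60s | 120s | 180s |
| Delegation | 90s | 120s | 180s |
| Execution Paths | 30s | 60s | 90s |
| Edge Cases | 60s | 120s | - |
| Integration | 120s | 180s | 300s |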
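
The guideline values can be turned into a lookup when filling in a test's `timeout` field. The sketch below assumes that field is in milliseconds, as the template's `timeout: 60000` suggests. The helper name `timeout_ms` and the category and complexity keys are hypothetical.

```shell
#!/bin/sh
# timeout_ms CATEGORY COMPLEXITY
# Prints the guideline timeout in milliseconds.
# Returns 1 (printing nothing) where the guidelines give no value.
timeout_ms() {
  case "$1:$2" in
    critical-rules:simple)      seconds=60 ;;
    critical-rules:multi-turn)  seconds=120 ;;
    workflow-stages:simple)     seconds=60 ;;
    workflow-stages:multi-turn) seconds=120 ;;
    workflow-stages:complex)    seconds=180 ;;
    delegation:simple)          seconds=90 ;;
    delegation:multi-turn)      seconds=120 ;;
    delegation:complex)         seconds=180 ;;
    execution-paths:simple)     seconds=30 ;;
    execution-paths:multi-turn) seconds=60 ;;
    execution-paths:complex)    seconds=90 ;;
    edge-cases:simple)          seconds=60 ;;
    edge-cases:multi-turn)      seconds=120 ;;
    integration:simple)         seconds=120 ;;
    integration:multi-turn)     seconds=180 ;;
    integration:complex)        seconds=300 ;;
    *) return 1 ;;
  esac
  echo $(( seconds * 1000 ))
}

timeout_ms integration complex   # -> 300000
```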

Migration Status

✅ Migration Complete (Nov 26, 2024)

  • 22 tests migrated to new structure
  • Original folders preserved for verification
  • All tests copied (not moved)

Next Steps:

  1. ✅ Verify migrated tests run correctly
  2. ⬜ Add missing critical tests (Priority 1)
  3. ⬜ Add delegation tests (Priority 2)
  4. ⬜ Remove old folders after verification
  5. ⬜ Update CI/CD to use new structure

To remove old folders (after verification):

cd evals/agents/openagent/tests
rm -rf business/ context-loading/ developer/ edge-case/

CI/CD Integration

Pre-commit Hook

# Run critical tests only (fast)
npm run eval:sdk -- --agent=openagent --pattern="01-critical-rules/**/*.yaml"

PR Validation

# Run critical + workflow tests
npm run eval:sdk -- --agent=openagent --pattern="0[1-2]-*/**/*.yaml"

Release Validation

# Run full suite
npm run eval:sdk -- --agent=openagent

Debugging Failed Tests

  1. Run with --debug flag:

    npm run eval:sdk -- --agent=openagent --pattern="path/to/test.yaml" --debug
    
  2. Check session files (preserved in debug mode):

    ls ~/.local/share/opencode/storage/session/
    
  3. Review event timeline in test output

  4. Check test_tmp/ for created files:

    ls -la evals/test_tmp/
    

Contributing

Adding New Tests

  1. Choose the right category based on what you're testing
  2. Follow naming convention: {sequence}-{description}-{type}.yaml
  3. Set appropriate timeout based on category guidelines
  4. Use safe file paths (evals/test_tmp/)
  5. Add to category README if introducing new pattern
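
File names can be checked against the naming convention in step 2 before a test is committed. This is a minimal sketch with a hypothetical helper, `is_valid_test_name`. It assumes a two-digit sequence and lowercase hyphen-separated words, which matches the planned file names in this README but is not confirmed as a framework rule.

```shell
#!/bin/sh
# is_valid_test_name FILENAME
# Exit status 0 if FILENAME looks like {sequence}-{description}-{type}.yaml:
# two digits, then at least two lowercase hyphen-separated segments.
is_valid_test_name() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{2}(-[a-z0-9]+){2,}\.yaml$'
}

for name in 01-error-report-workflow.yaml 02-3-files-negative.yaml ErrorReport.yml; do
  if is_valid_test_name "$name"; then
    echo "ok:  $name"
  else
    echo "bad: $name"
  fi
done
```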

Test Template

id: category-description-001
name: Human Readable Test Name
description: |
  What this test validates and why it matters.
  
  Expected behavior:
  - Step 1
  - Step 2

category: category-name
agent: openagent
model: anthropic/claude-sonnet-4-5

prompt: |
  Test prompt here

behavior:
  mustUseTools: [read, write]
  requiresApproval: true
  requiresContext: true
  minToolCalls: 2

expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error
    description: Must ask approval before writing

approvalStrategy:
  type: auto-approve

timeout: 60000

tags:
  - tag1
  - tag2
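
A quick completeness check can catch a test file that drops a field from the template. This is a rough sketch: `missing_keys` is a hypothetical helper, the list of keys it treats as required is an assumption drawn from the template above (not a documented schema), and it matches line prefixes with grep rather than parsing YAML.

```shell
#!/bin/sh
# missing_keys
# Reads a test definition on stdin and prints each expected top-level key
# that does not appear at the start of a line. The key list is an assumption.
missing_keys() {
  input=$(cat)
  for key in id name description category agent prompt timeout; do
    printf '%s\n' "$input" | grep -q "^$key:" || echo "$key"
  done
}

# Example: a definition that has only id and name.
printf 'id: demo-001\nname: Demo\n' | missing_keys
```

For a real file, the equivalent call would be `missing_keys < path/to/test.yaml`. Empty output means every expected key was found.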

Resources

  • OpenAgent Prompt: .opencode/agent/openagent.md
  • Test Framework: evals/framework/
  • How Tests Work: evals/HOW_TESTS_WORK.md
  • OpenAgent Rules: evals/agents/openagent/docs/OPENAGENT_RULES.md
  • Folder Structure: FOLDER_STRUCTURE.md (this directory)