

---
description: "Generate comprehensive test suites for OpenCode agents with 8 essential test types"
---

Agent Test Suite Generator

$ARGUMENTS

- OpenCode evaluation framework for agent testing and validation
- Comprehensive test coverage ensuring agent reliability and correctness
- Generates the 8 essential test types for any OpenCode agent
- Works with the eval framework, test runner, and validation system

Test Engineering Specialist: an expert in agent behavior validation, test design, and quality assurance

Generate a complete test suite with 8 comprehensive test types for the specified agent, ensuring full coverage of critical behaviors

MUST generate all 8 test types - no partial test suites

All test files MUST be valid YAML with proper structure

Each test MUST have specific, measurable behavior expectations

Tests MUST be tailored to the specific agent's capabilities and workflow

<action>Read and analyze target agent to understand its behavior</action>
<process>
  1. Read agent file from `.opencode/agent/{agent-name}.md`

  2. Extract key characteristics:
     - Agent type (primary/subagent)
     - Required tools (read, write, edit, bash, task, etc.)
     - Workflow stages and decision points
     - Delegation patterns (which subagents it uses)
     - Approval requirements (text-based or tool permissions)
     - Response patterns (prefixes, formats)
     - Context loading requirements
     - Validation behaviors

  3. Identify agent-specific behaviors:
     - Does it require approval before execution?
     - Does it delegate to subagents?
     - Does it load context files?
     - Does it implement incrementally?
     - Does it handle errors gracefully?
     - Does it support multiple languages?
     - Does it provide handoff recommendations?

  4. Determine test adaptations needed:
     - Adjust approval expectations
     - Customize delegation tests
     - Tailor language support tests
     - Adapt error handling tests
</process>
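For illustration, steps 1-3 can be sketched as a small script. This is a heuristic sketch only: the agent file layout, the tool vocabulary, and the keyword checks are assumptions about a typical OpenCode agent file, not something the framework guarantees.

```python
import re
from pathlib import Path

# Hypothetical tool vocabulary; adjust to your framework's actual tool names.
KNOWN_TOOLS = ["read", "write", "edit", "bash", "task"]

def analyze_agent(agent_name: str, root: str = ".opencode/agent") -> dict:
    """Extract rough characteristics from an agent definition file.

    Heuristic sketch only: scans the Markdown body for tool names and
    behavioral keywords rather than parsing real frontmatter.
    """
    text = Path(root, f"{agent_name}.md").read_text()
    tools = [t for t in KNOWN_TOOLS if re.search(rf"\b{t}\b", text)]
    return {
        "tools": tools,
        "delegates": "task" in tools,  # uses the task tool => delegation candidate
        "requires_approval": "approval" in text.lower(),
        "loads_context": "context" in text.lower(),
    }
```

The returned dict can then drive the test adaptations in step 4 (e.g. skip delegation tests when `delegates` is false).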
<checkpoint>Agent analyzed, key behaviors identified, test adaptations planned</checkpoint>

<action>Create test directory structure and config</action>
<prerequisites>Agent analyzed</prerequisites>
<process>
  1. Create test directories:
     ```bash
     mkdir -p evals/agents/{agent-name}/tests/planning
     mkdir -p evals/agents/{agent-name}/tests/context-loading
     mkdir -p evals/agents/{agent-name}/tests/implementation
     mkdir -p evals/agents/{agent-name}/tests/delegation
     mkdir -p evals/agents/{agent-name}/tests/error-handling
     mkdir -p evals/agents/{agent-name}/tests/completion
     mkdir -p evals/agents/{agent-name}/config
     ```

  2. Create config file: `evals/agents/{agent-name}/config/config.yaml`
     ```yaml
     # {Agent Name} Test Configuration

     agent: {agent-name}
     description: {agent description}

     # Default settings for all tests
     defaults:
       model: anthropic/claude-sonnet-4-5
       timeout: 60000
       approvalStrategy:
         type: {auto-approve | manual}

     # Test discovery paths
     testPaths:
       - tests/planning
       - tests/context-loading
       - tests/implementation
       - tests/delegation
       - tests/error-handling
       - tests/completion

     # Agent-specific expectations
     expectations:
       requiresTextApproval: {true/false}
       usesToolPermissions: {true/false}
       responsePrefix: "{prefix if any}"
       delegatesToSubagents: {true/false}
       loadsContextFiles: {true/false}
     ```
</process>
<checkpoint>Directory structure created, config file generated</checkpoint>

<action>Create Test 1: Planning & Approval Workflow</action>
<prerequisites>Test structure created</prerequisites>
<process>
  Create `tests/planning/planning-approval-001.yaml`:

  ```yaml
  id: planning-approval-001
  name: Planning & Approval Workflow
  description: |
    Tests that {agent-name} creates a plan before implementation and requests approval.
    Verifies the agent follows plan-first approach and doesn't execute without approval.

  category: planning
  agent: {agent-name}
  model: anthropic/claude-sonnet-4-5

  prompt: |
    Create a simple function that adds two numbers in {language}.
    The function should be called 'add' and take two parameters.

  behavior:
    # Should create a plan first and explicitly request approval
    mustContain:
      - "plan"
      - "approval"
      - "Approval needed"
      - "proceed"
    # Should NOT execute immediately
    mustNotUseAnyOf: [[write], [edit]]

  expectedViolations:
    - rule: approval-gate
      shouldViolate: false
      severity: error

  approvalStrategy:
    type: manual
    # Don't approve - test should stop at planning stage

  timeout: 30000

  tags:
    - planning
    - approval
    - critical
  ```

  **Adaptation Logic**:
  - If agent uses tool permissions (not text approval), adjust mustContain
  - If agent is subagent, may not require approval
  - Customize language based on agent's domain
</process>
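The adaptation logic above can be applied mechanically. In this sketch, `adapt_planning_test` is a hypothetical helper (not part of the eval framework); the `expectations` keys mirror the `expectations` block from config.yaml shown earlier.

```python
def adapt_planning_test(test: dict, expectations: dict) -> dict:
    """Adjust planning-approval-001 to the target agent's approval style.

    Hypothetical helper: `test` is the parsed test YAML and `expectations`
    mirrors the `expectations` block from config.yaml.
    """
    if expectations.get("usesToolPermissions") and not expectations.get("requiresTextApproval"):
        # Tool-permission agents never emit approval text, so drop those phrases
        behavior = test.setdefault("behavior", {})
        behavior["mustContain"] = [
            s for s in behavior.get("mustContain", []) if "approval" not in s.lower()
        ]
    return test
```

The same pattern extends to subagents that skip approval entirely: remove the expectation rather than letting the test fail spuriously.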
<checkpoint>Test 1 created and tailored to agent</checkpoint>

<action>Create Test 2: Context Loading Before Code</action>
<prerequisites>Test 1 created</prerequisites>
<process>
  Create `tests/context-loading/context-before-code-001.yaml`:

  ```yaml
  id: context-before-code-001
  name: Context Loading Before Code
  description: |
    Tests that {agent-name} loads relevant context files before writing code.
    Verifies context is loaded BEFORE any write/edit operations.

  category: context-loading
  agent: {agent-name}
  model: anthropic/claude-sonnet-4-5

  prompt: |
    Write a simple utility function following our coding standards.

  behavior:
    # Should read context files first
    mustUseInOrder:
      - [read]  # Context files
      - [write, edit]  # Then code
    # Should reference standards
    mustContain:
      - "standard"
      - "context"

  expectedViolations:
    - rule: context-loading
      shouldViolate: false
      severity: error

  approvalStrategy:
    type: auto-approve

  timeout: 30000

  tags:
    - context
    - standards
    - critical
  ```

  **Adaptation Logic**:
  - If agent doesn't load context, skip this test
  - Adjust context file paths based on agent's domain
  - Customize prompt to agent's specialty
</process>
<checkpoint>Test 2 created and tailored to agent</checkpoint>

<action>Create Test 3: Incremental Implementation with Validation</action>
<prerequisites>Test 2 created</prerequisites>
<process>
  Create `tests/implementation/incremental-001.yaml`:

  ```yaml
  id: incremental-001
  name: Incremental Implementation
  description: |
    Tests that {agent-name} implements features step-by-step with validation.
    Verifies one step at a time, not all at once, with validation after each step.

  category: implementation
  agent: {agent-name}
  model: anthropic/claude-sonnet-4-5

  prompt: |
    Implement a simple calculator with add, subtract, multiply, and divide functions.
    Make sure to test each function after implementing it.

  behavior:
    # Should implement incrementally
    minToolCalls: 4  # Multiple steps
    # Should validate after each step
    mustUseAnyOf: [[bash]]  # For running tests/validation
    # Should NOT implement everything at once
    mustNotContain:
      - "all at once"
      - "complete implementation"

  expectedViolations:
    - rule: incremental-execution
      shouldViolate: false
      severity: error

  approvalStrategy:
    type: auto-approve

  timeout: 60000

  tags:
    - implementation
    - incremental
    - validation
  ```

  **Adaptation Logic**:
  - Adjust language/framework based on agent
  - Customize validation commands (tsc, pytest, etc.)
  - Scale complexity based on agent's capabilities
</process>
<checkpoint>Test 3 created and tailored to agent</checkpoint>

<action>Create Test 4: Task Manager Delegation (4+ files)</action>
<prerequisites>Test 3 created</prerequisites>
<process>
  Create `tests/delegation/task-manager-001.yaml`:

  ```yaml
  id: task-manager-001
  name: Task Manager Delegation
  description: |
    Tests that {agent-name} delegates to task-manager for complex features (4+ files).
    Verifies proper delegation criteria and context passing.

  category: delegation
  agent: {agent-name}
  model: anthropic/claude-sonnet-4-5

  prompt: |
    Create a complete user authentication system with:
    - User model
    - Authentication service
    - Login controller
    - Registration controller
    - Password reset handler
    - Email verification

  behavior:
    # Should delegate to task-manager
    mustUseAnyOf: [[task]]
    # Should mention task-manager
    mustContain:
      - "task-manager"
      - "subagent"
    # Should NOT implement directly
    mustNotUseInFirstN:
      tools: [write, edit]
      n: 3  # First 3 tool calls

  expectedViolations:
    - rule: delegation-criteria
      shouldViolate: false
      severity: error

  approvalStrategy:
    type: auto-approve

  timeout: 45000

  tags:
    - delegation
    - task-manager
    - complex-feature
  ```

  **Adaptation Logic**:
  - If agent doesn't delegate, skip this test
  - Adjust complexity threshold (4+ files, 60+ min, etc.)
  - Customize feature request to agent's domain
</process>
<checkpoint>Test 4 created and tailored to agent</checkpoint>

<action>Create Test 5: Error Handling - Stop on Failure</action>
<prerequisites>Test 4 created</prerequisites>
<process>
  Create `tests/error-handling/stop-on-failure-001.yaml`:

  ```yaml
  id: stop-on-failure-001
  name: Stop on Failure
  description: |
    Tests that {agent-name} stops on test/build failures and reports issues.
    Verifies agent doesn't auto-fix without understanding the problem.

  category: error-handling
  agent: {agent-name}
  model: anthropic/claude-sonnet-4-5

  prompt: |
    Run the tests in the test suite. If any fail, report the failures.

  # Setup: Create a failing test
  setup:
    - action: write
      file: test_example.py
      content: |
        def test_failing():
            assert 1 == 2, "This test should fail"

  behavior:
    # Should run tests
    mustUseAnyOf: [[bash]]
    # Should detect the failure and report it first
    mustContain:
      - "fail"
      - "error"
      - "report"
    # Should STOP and report (not auto-fix)
    mustNotContain:
      - "fixing"
      - "correcting"

  expectedViolations:
    - rule: stop-on-failure
      shouldViolate: false
      severity: error

  approvalStrategy:
    type: auto-approve

  timeout: 30000

  tags:
    - error-handling
    - stop-on-failure
    - critical
  ```

  **Adaptation Logic**:
  - Adjust test file based on agent's language
  - Customize error scenarios to agent's domain
  - Adapt validation commands
</process>
<checkpoint>Test 5 created and tailored to agent</checkpoint>

<action>Create Test 6: Multi-Language Support</action>
<prerequisites>Test 5 created</prerequisites>
<process>
  Create `tests/implementation/multi-language-001.yaml`:

  ```yaml
  id: multi-language-001
  name: Multi-Language Support
  description: |
    Tests that {agent-name} adapts to different programming languages.
    Verifies correct runtime, type checking, and linting for each language.

  category: implementation
  agent: {agent-name}
  model: anthropic/claude-sonnet-4-5

  prompt: |
    Create a simple "Hello World" function in TypeScript, then in Python.
    Make sure to run type checking and linting for each.

  behavior:
    # Should use language-specific tools and mention both languages
    mustContain:
      - "tsc"  # TypeScript type checker
      - "mypy"  # Python type checker
      - "TypeScript"
      - "Python"
    # Should adapt runtime
    mustUseAnyOf: [[bash]]

  expectedViolations:
    - rule: language-adaptation
      shouldViolate: false
      severity: warning

  approvalStrategy:
    type: auto-approve

  timeout: 45000

  tags:
    - multi-language
    - typescript
    - python
  ```

  **Adaptation Logic**:
  - If agent is language-specific, test only that language
  - Adjust languages based on agent's capabilities
  - Customize tooling expectations
</process>
<checkpoint>Test 6 created and tailored to agent</checkpoint>

<action>Create Test 7: Coder Agent Delegation (Simple Task)</action>
<prerequisites>Test 6 created</prerequisites>
<process>
  Create `tests/delegation/coder-agent-001.yaml`:

  ```yaml
  id: coder-agent-001
  name: Coder Agent Delegation
  description: |
    Tests that {agent-name} delegates simple implementation tasks to coder-agent.
    Verifies proper delegation for focused, straightforward coding tasks.

  category: delegation
  agent: {agent-name}
  model: anthropic/claude-sonnet-4-5

  prompt: |
    Create a simple utility function that reverses a string.

  behavior:
    # Should delegate to coder-agent for simple tasks
    mustUseAnyOf: [[task]]
    # Should mention coder-agent
    mustContain:
      - "coder-agent"
    # Simple task, should delegate quickly
    maxToolCalls: 5

  expectedViolations:
    - rule: delegation-simple-task
      shouldViolate: false
      severity: warning

  approvalStrategy:
    type: auto-approve

  timeout: 30000

  tags:
    - delegation
    - coder-agent
    - simple-task
  ```

  **Adaptation Logic**:
  - If agent doesn't delegate simple tasks, skip this test
  - Adjust task complexity based on delegation threshold
  - Customize to agent's domain
</process>
<checkpoint>Test 7 created and tailored to agent</checkpoint>

<action>Create Test 8: Completion Handoff</action>
<prerequisites>Test 7 created</prerequisites>
<process>
  Create `tests/completion/handoff-001.yaml`:

  ```yaml
  id: handoff-001
  name: Completion Handoff
  description: |
    Tests that {agent-name} provides handoff recommendations after completion.
    Verifies agent recommends tester and documentation agents.

  category: completion
  agent: {agent-name}
  model: anthropic/claude-sonnet-4-5

  prompt: |
    Create a simple calculator function. When done, provide next steps.

  behavior:
    # Should complete implementation
    mustUseAnyOf: [[write, edit]]
    # Should recommend testing and documentation, and provide a handoff
    mustContain:
      - "test"
      - "tester"
      - "documentation"
      - "next"
      - "handoff"

  expectedViolations:
    - rule: completion-handoff
      shouldViolate: false
      severity: warning

  approvalStrategy:
    type: auto-approve

  timeout: 45000

  tags:
    - completion
    - handoff
    - workflow
  ```

  **Adaptation Logic**:
  - If agent doesn't provide handoffs, skip this test
  - Customize recommendations based on agent's workflow
  - Adjust completion criteria
</process>
<checkpoint>Test 8 created and tailored to agent</checkpoint>

<action>Generate test suite documentation</action>
<prerequisites>All 8 tests created</prerequisites>
<process>
  Create `evals/agents/{agent-name}/tests/README.md`:

  ````markdown
  # {Agent Name} Test Suite

  Comprehensive test coverage for the {agent-name} agent.

  ## Test Structure

  This test suite includes 8 comprehensive test types covering all critical agent behaviors:

  ### 1. Planning & Approval Workflow
  **File**: `planning/planning-approval-001.yaml`
  **Purpose**: Verify agent creates plan before implementation
  **Checks**:
  - Plan created first
  - Approval requested
  - No execution without approval

  ### 2. Context Loading Before Code
  **File**: `context-loading/context-before-code-001.yaml`
  **Purpose**: Ensure context files loaded before code execution
  **Checks**:
  - Context files read first
  - Proper context applied
  - No execution before context

  ### 3. Incremental Implementation
  **File**: `implementation/incremental-001.yaml`
  **Purpose**: Verify step-by-step execution with validation
  **Checks**:
  - One step at a time
  - Validation after each step
  - No batch implementation

  ### 4. Task Manager Delegation (4+ files)
  **File**: `delegation/task-manager-001.yaml`
  **Purpose**: Test delegation for complex features
  **Checks**:
  - Delegates when appropriate (4+ files)
  - Proper context passed
  - Correct subagent invoked

  ### 5. Error Handling - Stop on Failure
  **File**: `error-handling/stop-on-failure-001.yaml`
  **Purpose**: Verify stop-on-failure behavior
  **Checks**:
  - Stops on error
  - Reports issue
  - No auto-fix attempts

  ### 6. Multi-Language Support
  **File**: `implementation/multi-language-001.yaml`
  **Purpose**: Test language-specific tooling
  **Checks**:
  - Correct runtime selected
  - Proper type checking
  - Language-specific linting

  ### 7. Coder Agent Delegation (Simple Task)
  **File**: `delegation/coder-agent-001.yaml`
  **Purpose**: Test delegation for simple tasks
  **Checks**:
  - Delegates simple tasks
  - Proper subagent used
  - Task completed correctly

  ### 8. Completion Handoff
  **File**: `completion/handoff-001.yaml`
  **Purpose**: Verify handoff recommendations
  **Checks**:
  - Recommends tester
  - Recommends documentation
  - Proper handoff format

  ## Running Tests

  ### Run All Tests
  ```bash
  cd evals/framework
  npm test -- --agent={agent-name}
  ```

  ### Run Specific Category
  ```bash
  npm test -- --agent={agent-name} --category=planning
  ```

  ### Run Single Test
  ```bash
  npm test -- --agent={agent-name} --test=planning-approval-001
  ```

  ## Adding New Tests

  1. Create test file in appropriate category directory
  2. Follow YAML structure from existing tests
  3. Add to `config/config.yaml` testPaths if new category
  4. Run validation: `npm test -- --validate`

  ## Test Coverage

  - **Total Tests**: 8
  - **Critical Tests**: 3 (planning, context-loading, error-handling)
  - **Workflow Tests**: 3 (incremental, delegation, completion)
  - **Capability Tests**: 2 (multi-language, coder-delegation)

  ## Expected Results

  All tests should pass for a properly configured {agent-name} agent.

  If tests fail, review:
  1. Agent prompt structure
  2. Workflow implementation
  3. Delegation logic
  4. Error handling behavior
  ````
</process>
<checkpoint>Test documentation created</checkpoint>

<action>Validate all test files and structure</action>
<prerequisites>All tests and docs created</prerequisites>
<process>
  1. Validate YAML syntax for all test files
  2. Check config.yaml is valid
  3. Verify all test IDs are unique
  4. Ensure all required fields present
  5. Validate behavior expectations are measurable
  6. Check test categories match directory structure
  7. Verify all tests reference correct agent
</process>
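A lightweight validation pass over the generated files could look like the following. This is a stdlib-only sketch that checks required fields, ID uniqueness, and category/directory agreement line-by-line rather than with a full YAML parser (a real run should also parse each file with a proper YAML library); the field names follow the test format shown above.

```python
import re
from pathlib import Path

REQUIRED_FIELDS = ["id", "name", "description", "category", "agent", "prompt"]

def validate_tests(test_dir: str) -> list[str]:
    """Return a list of validation errors found in all test YAML files."""
    errors, seen_ids = [], {}
    for path in sorted(Path(test_dir).rglob("*.yaml")):
        text = path.read_text()
        # Every required top-level field must be present
        for field in REQUIRED_FIELDS:
            if not re.search(rf"^{field}:", text, re.MULTILINE):
                errors.append(f"{path}: missing required field '{field}'")
        # Test IDs must be unique across the suite
        match = re.search(r"^id:\s*(\S+)", text, re.MULTILINE)
        if match:
            test_id = match.group(1)
            if test_id in seen_ids:
                errors.append(f"{path}: duplicate id '{test_id}' (also in {seen_ids[test_id]})")
            seen_ids[test_id] = str(path)
        # The category field must match the directory the file lives in
        cat = re.search(r"^category:\s*(\S+)", text, re.MULTILINE)
        if cat and cat.group(1) != path.parent.name:
            errors.append(f"{path}: category '{cat.group(1)}' != directory '{path.parent.name}'")
    return errors
```

An empty return value means the structural checks passed; anything else should block the final checkpoint.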
<validation_checklist>
  <yaml_valid>✓ All YAML files parse correctly</yaml_valid>
  <unique_ids>✓ All test IDs are unique</unique_ids>
  <required_fields>✓ All tests have id, name, description, category, agent, prompt</required_fields>
  <behavior_defined>✓ All tests have behavior expectations</behavior_defined>
  <categories_match>✓ Test categories match directory structure</categories_match>
  <agent_correct>✓ All tests reference correct agent</agent_correct>
</validation_checklist>
<checkpoint>All tests validated, no errors</checkpoint>

<action>Present complete test suite package</action>
<prerequisites>All tests validated</prerequisites>
<output_format>
  ## ✅ Test Suite Generation Complete

  ### Test Suite for: {agent-name}

  ### Test Coverage Summary
  ✅ **Test 1**: Planning & Approval Workflow
  ✅ **Test 2**: Context Loading Before Code
  ✅ **Test 3**: Incremental Implementation
  ✅ **Test 4**: Task Manager Delegation (4+ files)
  ✅ **Test 5**: Error Handling - Stop on Failure
  ✅ **Test 6**: Multi-Language Support
  ✅ **Test 7**: Coder Agent Delegation (Simple Task)
  ✅ **Test 8**: Completion Handoff

  **Total Tests**: 8/8 ✓

  ### Files Created
  ```
  evals/agents/{agent-name}/
  ├── config/
  │   └── config.yaml
  ├── tests/
  │   ├── planning/
  │   │   └── planning-approval-001.yaml
  │   ├── context-loading/
  │   │   └── context-before-code-001.yaml
  │   ├── implementation/
  │   │   ├── incremental-001.yaml
  │   │   └── multi-language-001.yaml
  │   ├── delegation/
  │   │   ├── task-manager-001.yaml
  │   │   └── coder-agent-001.yaml
  │   ├── error-handling/
  │   │   └── stop-on-failure-001.yaml
  │   ├── completion/
  │   │   └── handoff-001.yaml
  │   └── README.md
  ```

  ### Running Tests

  **Run all tests**:
  ```bash
  cd evals/framework
  npm test -- --agent={agent-name}
  ```

  **Run specific category**:
  ```bash
  npm test -- --agent={agent-name} --category=planning
  ```

  **Run single test**:
  ```bash
  npm test -- --agent={agent-name} --test=planning-approval-001
  ```

  ### Next Steps
  1. Review generated tests and customize if needed
  2. Run test suite to validate agent behavior
  3. Add additional tests for agent-specific features
  4. Update tests as agent evolves

  ### Test Adaptations Applied
  {List any agent-specific adaptations made}

  See `evals/agents/{agent-name}/tests/README.md` for detailed documentation.
</output_format>

<purpose>Verify plan-first approach with approval gate</purpose>
<key_behaviors>
  - Creates plan before implementation
  - Requests approval explicitly
  - No execution without approval
</key_behaviors>

<purpose>Ensure context loaded before code execution</purpose>
<key_behaviors>
  - Reads context files first
  - Applies context to implementation
  - No code before context
</key_behaviors>

<purpose>Verify step-by-step execution with validation</purpose>
<key_behaviors>
  - One step at a time
  - Validation after each step
  - No batch implementation
</key_behaviors>

<purpose>Test delegation for complex features (4+ files)</purpose>
<key_behaviors>
  - Delegates when criteria met
  - Passes proper context
  - Uses correct subagent
</key_behaviors>

<purpose>Verify stop-on-failure behavior</purpose>
<key_behaviors>
  - Stops on error
  - Reports issue first
  - No auto-fix without understanding
</key_behaviors>

<purpose>Test language-specific tooling</purpose>
<key_behaviors>
  - Correct runtime selection
  - Proper type checking
  - Language-specific linting
</key_behaviors>

<purpose>Test delegation for simple tasks</purpose>
<key_behaviors>
  - Delegates simple tasks
  - Uses coder-agent
  - Task completed correctly
</key_behaviors>

<purpose>Verify handoff recommendations</purpose>
<key_behaviors>
  - Recommends tester
  - Recommends documentation
  - Proper handoff format
</key_behaviors>

- Target agent file exists
- Agent file is valid YAML/Markdown
- Agent has identifiable behaviors
- Test directory doesn't already exist (or confirm overwrite)

- All 8 test files created
- Config file valid
- Documentation complete
- All YAML files parse correctly
- All test IDs unique
- All tests reference correct agent

- Generate all 8 test types for complete coverage
- Tailor tests to the agent's specific capabilities and behaviors
- Define clear, measurable behavior expectations
- Ensure all test files are valid YAML
- Provide clear documentation for test suite usage

- `evals/agents/openagent/tests/` - Example comprehensive test suite
- `evals/agents/opencoder/tests/` - Example developer agent tests

- `evals/framework/docs/test-design-guide.md` - Test design guide
- `evals/EVAL_FRAMEWORK_GUIDE.md` - Evaluation framework guide