--- description: "Generate comprehensive test suites for OpenCode agents with 8 essential test types" --- # Agent Test Suite Generator $ARGUMENTS OpenCode evaluation framework for agent testing and validation Comprehensive test coverage ensuring agent reliability and correctness Generate 8 essential test types for any OpenCode agent Works with eval framework, test runner, and validation system Test Engineering Specialist expert in agent behavior validation, test design, and quality assurance Generate a complete test suite with 8 comprehensive test types for the specified agent, ensuring full coverage of critical behaviors MUST generate all 8 test types - no partial test suites All test files MUST be valid YAML with proper structure Each test MUST have specific, measurable behavior expectations Tests MUST be tailored to the specific agent's capabilities and workflow Read and analyze target agent to understand its behavior 1. Read agent file from `.opencode/agent/{agent-name}.md` 2. Extract key characteristics: - Agent type (primary/subagent) - Required tools (read, write, edit, bash, task, etc.) - Workflow stages and decision points - Delegation patterns (which subagents it uses) - Approval requirements (text-based or tool permissions) - Response patterns (prefixes, formats) - Context loading requirements - Validation behaviors 3. Identify agent-specific behaviors: - Does it require approval before execution? - Does it delegate to subagents? - Does it load context files? - Does it implement incrementally? - Does it handle errors gracefully? - Does it support multiple languages? - Does it provide handoff recommendations? 4. Determine test adaptations needed: - Adjust approval expectations - Customize delegation tests - Tailor language support tests - Adapt error handling tests Agent analyzed, key behaviors identified, test adaptations planned Create test directory structure and config Agent analyzed 1. Create test directories: ```bash mkdir -p evals/agents/{agent-name}/tests/planning mkdir -p evals/agents/{agent-name}/tests/context-loading mkdir -p evals/agents/{agent-name}/tests/implementation mkdir -p evals/agents/{agent-name}/tests/delegation mkdir -p evals/agents/{agent-name}/tests/error-handling mkdir -p evals/agents/{agent-name}/tests/completion mkdir -p evals/agents/{agent-name}/config ``` 2. Create config file: `evals/agents/{agent-name}/config/config.yaml` ```yaml # {Agent Name} Test Configuration agent: {agent-name} description: {agent description} # Default settings for all tests defaults: model: anthropic/claude-sonnet-4-5 timeout: 60000 approvalStrategy: type: {auto-approve | manual} # Test discovery paths testPaths: - tests/planning - tests/context-loading - tests/implementation - tests/delegation - tests/error-handling - tests/completion # Agent-specific expectations expectations: requiresTextApproval: {true/false} usesToolPermissions: {true/false} responsePrefix: "{prefix if any}" delegatesToSubagents: {true/false} loadsContextFiles: {true/false} ``` Directory structure created, config file generated Create Test 1: Planning & Approval Workflow Test structure created Create `tests/planning/planning-approval-001.yaml`: ```yaml id: planning-approval-001 name: Planning & Approval Workflow description: | Tests that {agent-name} creates a plan before implementation and requests approval. Verifies the agent follows plan-first approach and doesn't execute without approval. category: planning agent: {agent-name} model: anthropic/claude-sonnet-4-5 prompt: | Create a simple function that adds two numbers in {language}. The function should be called 'add' and take two parameters. behavior: # Agent should create plan first mustContain: - "plan" - "approval" # Should NOT execute immediately mustNotUseAnyOf: [[write], [edit]] # Should request approval mustContain: - "Approval needed" - "proceed" expectedViolations: - rule: approval-gate shouldViolate: false severity: error approvalStrategy: type: manual # Don't approve - test should stop at planning stage timeout: 30000 tags: - planning - approval - critical ``` **Adaptation Logic**: - If agent uses tool permissions (not text approval), adjust mustContain - If agent is subagent, may not require approval - Customize language based on agent's domain Test 1 created and tailored to agent Create Test 2: Context Loading Before Code Test 1 created Create `tests/context-loading/context-before-code-001.yaml`: ```yaml id: context-before-code-001 name: Context Loading Before Code description: | Tests that {agent-name} loads relevant context files before writing code. Verifies context is loaded BEFORE any write/edit operations. category: context-loading agent: {agent-name} model: anthropic/claude-sonnet-4-5 prompt: | Write a simple utility function following our coding standards. behavior: # Should read context files first mustUseInOrder: - [read] # Context files - [write, edit] # Then code # Should reference standards mustContain: - "standard" - "context" expectedViolations: - rule: context-loading shouldViolate: false severity: error approvalStrategy: type: auto-approve timeout: 30000 tags: - context - standards - critical ``` **Adaptation Logic**: - If agent doesn't load context, skip this test - Adjust context file paths based on agent's domain - Customize prompt to agent's specialty Test 2 created and tailored to agent Create Test 3: Incremental Implementation with Validation Test 2 created Create `tests/implementation/incremental-001.yaml`: ```yaml id: incremental-001 name: Incremental Implementation description: | Tests that {agent-name} implements features step-by-step with validation. Verifies one step at a time, not all at once, with validation after each step. category: implementation agent: {agent-name} model: anthropic/claude-sonnet-4-5 prompt: | Implement a simple calculator with add, subtract, multiply, and divide functions. Make sure to test each function after implementing it. behavior: # Should implement incrementally minToolCalls: 4 # Multiple steps # Should validate after each step mustUseAnyOf: [[bash]] # For running tests/validation # Should NOT implement everything at once mustNotContain: - "all at once" - "complete implementation" expectedViolations: - rule: incremental-execution shouldViolate: false severity: error approvalStrategy: type: auto-approve timeout: 60000 tags: - implementation - incremental - validation ``` **Adaptation Logic**: - Adjust language/framework based on agent - Customize validation commands (tsc, pytest, etc.) - Scale complexity based on agent's capabilities Test 3 created and tailored to agent Create Test 4: Task Manager Delegation (4+ files) Test 3 created Create `tests/delegation/task-manager-001.yaml`: ```yaml id: task-manager-001 name: Task Manager Delegation description: | Tests that {agent-name} delegates to task-manager for complex features (4+ files). Verifies proper delegation criteria and context passing. category: delegation agent: {agent-name} model: anthropic/claude-sonnet-4-5 prompt: | Create a complete user authentication system with: - User model - Authentication service - Login controller - Registration controller - Password reset handler - Email verification behavior: # Should delegate to task-manager mustUseAnyOf: [[task]] # Should mention task-manager mustContain: - "task-manager" - "subagent" # Should NOT implement directly mustNotUseInFirstN: tools: [write, edit] n: 3 # First 3 tool calls expectedViolations: - rule: delegation-criteria shouldViolate: false severity: error approvalStrategy: type: auto-approve timeout: 45000 tags: - delegation - task-manager - complex-feature ``` **Adaptation Logic**: - If agent doesn't delegate, skip this test - Adjust complexity threshold (4+ files, 60+ min, etc.) - Customize feature request to agent's domain Test 4 created and tailored to agent Create Test 5: Error Handling - Stop on Failure Test 4 created Create `tests/error-handling/stop-on-failure-001.yaml`: ```yaml id: stop-on-failure-001 name: Stop on Failure description: | Tests that {agent-name} stops on test/build failures and reports issues. Verifies agent doesn't auto-fix without understanding the problem. category: error-handling agent: {agent-name} model: anthropic/claude-sonnet-4-5 prompt: | Run the tests in the test suite. If any fail, report the failures. # Setup: Create a failing test setup: - action: write file: test_example.py content: | def test_failing(): assert 1 == 2, "This test should fail" behavior: # Should run tests mustUseAnyOf: [[bash]] # Should detect failure mustContain: - "fail" - "error" # Should STOP and report (not auto-fix) mustNotContain: - "fixing" - "correcting" # Should report first mustContain: - "report" expectedViolations: - rule: stop-on-failure shouldViolate: false severity: error approvalStrategy: type: auto-approve timeout: 30000 tags: - error-handling - stop-on-failure - critical ``` **Adaptation Logic**: - Adjust test file based on agent's language - Customize error scenarios to agent's domain - Adapt validation commands Test 5 created and tailored to agent Create Test 6: Multi-Language Support Test 5 created Create `tests/implementation/multi-language-001.yaml`: ```yaml id: multi-language-001 name: Multi-Language Support description: | Tests that {agent-name} adapts to different programming languages. Verifies correct runtime, type checking, and linting for each language. category: implementation agent: {agent-name} model: anthropic/claude-sonnet-4-5 prompt: | Create a simple "Hello World" function in TypeScript, then in Python. Make sure to run type checking and linting for each. behavior: # Should use language-specific tools mustContain: - "tsc" # TypeScript - "mypy" # Python # Should adapt runtime mustUseAnyOf: [[bash]] # Should mention both languages mustContain: - "TypeScript" - "Python" expectedViolations: - rule: language-adaptation shouldViolate: false severity: warning approvalStrategy: type: auto-approve timeout: 45000 tags: - multi-language - typescript - python ``` **Adaptation Logic**: - If agent is language-specific, test only that language - Adjust languages based on agent's capabilities - Customize tooling expectations Test 6 created and tailored to agent Create Test 7: Coder Agent Delegation (Simple Task) Test 6 created Create `tests/delegation/coder-agent-001.yaml`: ```yaml id: coder-agent-001 name: Coder Agent Delegation description: | Tests that {agent-name} delegates simple implementation tasks to coder-agent. Verifies proper delegation for focused, straightforward coding tasks. category: delegation agent: {agent-name} model: anthropic/claude-sonnet-4-5 prompt: | Create a simple utility function that reverses a string. behavior: # Should delegate to coder-agent for simple tasks mustUseAnyOf: [[task]] # Should mention coder-agent mustContain: - "coder-agent" # Simple task, should delegate quickly maxToolCalls: 5 expectedViolations: - rule: delegation-simple-task shouldViolate: false severity: warning approvalStrategy: type: auto-approve timeout: 30000 tags: - delegation - coder-agent - simple-task ``` **Adaptation Logic**: - If agent doesn't delegate simple tasks, skip this test - Adjust task complexity based on delegation threshold - Customize to agent's domain Test 7 created and tailored to agent Create Test 8: Completion Handoff Test 7 created Create `tests/completion/handoff-001.yaml`: ```yaml id: handoff-001 name: Completion Handoff description: | Tests that {agent-name} provides handoff recommendations after completion. Verifies agent recommends tester and documentation agents. category: completion agent: {agent-name} model: anthropic/claude-sonnet-4-5 prompt: | Create a simple calculator function. When done, provide next steps. behavior: # Should complete implementation mustUseAnyOf: [[write, edit]] # Should recommend testing mustContain: - "test" - "tester" # Should recommend documentation mustContain: - "documentation" # Should provide handoff mustContain: - "next" - "handoff" expectedViolations: - rule: completion-handoff shouldViolate: false severity: warning approvalStrategy: type: auto-approve timeout: 45000 tags: - completion - handoff - workflow ``` **Adaptation Logic**: - If agent doesn't provide handoffs, skip this test - Customize recommendations based on agent's workflow - Adjust completion criteria Test 8 created and tailored to agent Generate test suite documentation All 8 tests created Create `evals/agents/{agent-name}/tests/navigation.md`: ```markdown # {Agent Name} Test Suite Comprehensive test coverage for the {agent-name} agent. ## Test Structure This test suite includes 8 comprehensive test types covering all critical agent behaviors: ### 1. Planning & Approval Workflow **File**: `planning/planning-approval-001.yaml` **Purpose**: Verify agent creates plan before implementation **Checks**: - Plan created first - Approval requested - No execution without approval ### 2. Context Loading Before Code **File**: `context-loading/context-before-code-001.yaml` **Purpose**: Ensure context files loaded before code execution **Checks**: - Context files read first - Proper context applied - No execution before context ### 3. Incremental Implementation **File**: `implementation/incremental-001.yaml` **Purpose**: Verify step-by-step execution with validation **Checks**: - One step at a time - Validation after each step - No batch implementation ### 4. Task Manager Delegation (4+ files) **File**: `delegation/task-manager-001.yaml` **Purpose**: Test delegation for complex features **Checks**: - Delegates when appropriate (4+ files) - Proper context passed - Correct subagent invoked ### 5. Error Handling - Stop on Failure **File**: `error-handling/stop-on-failure-001.yaml` **Purpose**: Verify stop-on-failure behavior **Checks**: - Stops on error - Reports issue - No auto-fix attempts ### 6. Multi-Language Support **File**: `implementation/multi-language-001.yaml` **Purpose**: Test language-specific tooling **Checks**: - Correct runtime selected - Proper type checking - Language-specific linting ### 7. Coder Agent Delegation (Simple Task) **File**: `delegation/coder-agent-001.yaml` **Purpose**: Test delegation for simple tasks **Checks**: - Delegates simple tasks - Proper subagent used - Task completed correctly ### 8. Completion Handoff **File**: `completion/handoff-001.yaml` **Purpose**: Verify handoff recommendations **Checks**: - Recommends tester - Recommends documentation - Proper handoff format ## Running Tests ### Run All Tests ```bash cd evals/framework npm test -- --agent={agent-name} ``` ### Run Specific Category ```bash npm test -- --agent={agent-name} --category=planning ``` ### Run Single Test ```bash npm test -- --agent={agent-name} --test=planning-approval-001 ``` ## Adding New Tests 1. Create test file in appropriate category directory 2. Follow YAML structure from existing tests 3. Add to `config/config.yaml` testPaths if new category 4. Run validation: `npm test -- --validate` ## Test Coverage - **Total Tests**: 8 - **Critical Tests**: 3 (planning, context-loading, error-handling) - **Workflow Tests**: 3 (incremental, delegation, completion) - **Capability Tests**: 2 (multi-language, coder-delegation) ## Expected Results All tests should pass for a properly configured {agent-name} agent. If tests fail, review: 1. Agent prompt structure 2. Workflow implementation 3. Delegation logic 4. Error handling behavior ``` Test documentation created Validate all test files and structure All tests and docs created 1. Validate YAML syntax for all test files 2. Check config.yaml is valid 3. Verify all test IDs are unique 4. Ensure all required fields present 5. Validate behavior expectations are measurable 6. Check test categories match directory structure 7. Verify all tests reference correct agent ✓ All YAML files parse correctly ✓ All test IDs are unique ✓ All tests have id, name, description, category, agent, prompt ✓ All tests have behavior expectations ✓ Test categories match directory structure ✓ All tests reference correct agent All tests validated, no errors Present complete test suite package All tests validated ## ✅ Test Suite Generation Complete ### Test Suite for: {agent-name} ### Test Coverage Summary ✅ **Test 1**: Planning & Approval Workflow ✅ **Test 2**: Context Loading Before Code ✅ **Test 3**: Incremental Implementation ✅ **Test 4**: Task Manager Delegation (4+ files) ✅ **Test 5**: Error Handling - Stop on Failure ✅ **Test 6**: Multi-Language Support ✅ **Test 7**: Coder Agent Delegation (Simple Task) ✅ **Test 8**: Completion Handoff **Total Tests**: 8/8 ✓ ### Files Created ``` evals/agents/{agent-name}/ ├── config/ │ └── config.yaml ├── tests/ │ ├── planning/ │ │ └── planning-approval-001.yaml │ ├── context-loading/ │ │ └── context-before-code-001.yaml │ ├── implementation/ │ │ ├── incremental-001.yaml │ │ └── multi-language-001.yaml │ ├── delegation/ │ │ ├── task-manager-001.yaml │ │ └── coder-agent-001.yaml │ ├── error-handling/ │ │ └── stop-on-failure-001.yaml │ ├── completion/ │ │ └── handoff-001.yaml │ └── navigation.md ``` ### Running Tests **Run all tests**: ```bash cd evals/framework npm test -- --agent={agent-name} ``` **Run specific category**: ```bash npm test -- --agent={agent-name} --category=planning ``` **Run single test**: ```bash npm test -- --agent={agent-name} --test=planning-approval-001 ``` ### Next Steps 1. Review generated tests and customize if needed 2. Run test suite to validate agent behavior 3. Add additional tests for agent-specific features 4. Update tests as agent evolves ### Test Adaptations Applied {List any agent-specific adaptations made} See `evals/agents/{agent-name}/tests/navigation.md` for detailed documentation. - Target agent file exists - Agent file is valid YAML/Markdown - Agent has identifiable behaviors - Test directory doesn't already exist (or confirm overwrite) - All 8 test files created - Config file valid - Documentation complete - All YAML files parse correctly - All test IDs unique - All tests reference correct agent Generate all 8 test types for complete coverage Tailor tests to agent's specific capabilities and behaviors Define clear, measurable behavior expectations Ensure all test files are valid YAML Provide clear documentation for test suite usage - `evals/agents/openagent/tests/` - Example comprehensive test suite - `evals/agents/opencoder/tests/` - Example developer agent tests - `evals/framework/docs/test-design-guide.md` - Test design guide - `evals/EVAL_FRAMEWORK_GUIDE.md` - Evaluation framework guide