# Integration Tests - Eval Pipeline ## Overview Comprehensive integration tests for the OpenCode evaluation framework that validate the complete pipeline from test execution through evaluation and reporting. ## Test File **Location**: `src/__tests__/eval-pipeline-integration.test.ts` **Test Count**: 14 comprehensive integration tests **Status**: ✅ All tests passing (14/14) ## Test Coverage ### 1. Single Test Execution (3 tests) Tests basic test case execution and evaluation: - **Simple test case end-to-end**: Validates basic prompt execution, event capture, evaluation, and scoring - **Test with tool execution**: Validates tool execution detection and evaluation - **Approval gate violations**: Validates approval denial detection ### 2. Multiple Test Execution (1 test) Tests batch execution capabilities: - **Execute multiple tests in sequence**: Validates sequential test execution with proper session isolation ### 3. Evaluator Integration (2 tests) Tests evaluator coordination and aggregation: - **Multiple evaluators on same session**: Validates that multiple evaluators can analyze the same session - **Violation aggregation**: Validates that violations from multiple evaluators are properly aggregated and counted ### 4. Session Data Collection (2 tests) Tests session data collection and timeline building: - **Complete session timeline**: Validates timeline building from session data - **Session with no tool execution**: Validates handling of text-only sessions ### 5. Error Handling (2 tests) Tests error scenarios and edge cases: - **Test timeout handling**: Validates graceful timeout handling - **Invalid session ID**: Validates error handling for non-existent sessions ### 6. Result Validation (2 tests) Tests result structure and validation: - **Result structure validation**: Validates complete result object structure - **Overall score calculation**: Validates score aggregation from multiple evaluators ### 7. Report Generation (2 tests) Tests report generation capabilities: - **Text report generation**: Validates single-session report generation - **Batch summary report**: Validates multi-session summary generation ## Running the Tests ### Run Integration Tests Only ```bash cd evals/framework SKIP_INTEGRATION=false npm test -- src/__tests__/eval-pipeline-integration.test.ts --run ``` ### Run All Tests (Including Integration) ```bash cd evals/framework SKIP_INTEGRATION=false npm test -- --run ``` ### Skip Integration Tests (Default) Integration tests are skipped by default in CI environments and when `SKIP_INTEGRATION=true`: ```bash cd evals/framework npm test -- --run # Integration tests skipped ``` ## Test Requirements Integration tests require: 1. **OpenCode CLI installed**: The `opencode` command must be available 2. **Running server**: Tests start their own server instance 3. **Network access**: Tests communicate with the local server 4. **Time**: Integration tests take ~60 seconds to complete ## Test Architecture ### Components Tested 1. **TestRunner**: Orchestrates test execution 2. **TestExecutor**: Executes individual test cases 3. **SessionReader**: Reads session data from storage 4. **TimelineBuilder**: Builds event timelines from sessions 5. **EvaluatorRunner**: Runs evaluators and aggregates results 6. **Individual Evaluators**: ApprovalGate, ContextLoading, ToolUsage, etc. ### Test Flow ``` Test Case → TestRunner → TestExecutor → Agent Execution ↓ Session Data ↓ SessionReader ↓ TimelineBuilder ↓ EvaluatorRunner ↓ Multiple Evaluators ↓ Aggregated Results ↓ Report Generation ``` ## Key Validations ### Execution Phase - ✅ Session creation and management - ✅ Event stream handling - ✅ Approval strategy execution - ✅ Tool execution detection - ✅ Timeout handling - ✅ Error handling ### Evaluation Phase - ✅ Timeline building from session data - ✅ Multiple evaluators running on same session - ✅ Violation detection and tracking - ✅ Evidence collection - ✅ Score calculation - ✅ Pass/fail determination ### Reporting Phase - ✅ Result structure validation - ✅ Violation aggregation - ✅ Score aggregation - ✅ Text report generation - ✅ Batch summary generation ## Test Isolation Each test: - Creates its own session - Runs independently - Cleans up after completion - Does not affect other tests Sessions are tracked in `sessionIds` array and cleaned up in `afterAll` hook. ## Performance - **Total Duration**: ~60 seconds for all 14 tests - **Average per test**: ~4 seconds - **Longest test**: Batch execution (~8 seconds) - **Shortest test**: Error handling (~2 seconds) ## Debugging To enable debug output: ```bash cd evals/framework DEBUG_VERBOSE=true SKIP_INTEGRATION=false npm test -- src/__tests__/eval-pipeline-integration.test.ts --run ``` This will show: - Detailed event logs - Evaluator execution details - Session data - Timeline events - Violation details ## Future Enhancements Potential additions to integration tests: 1. **Multi-turn conversation tests**: Test complex multi-message interactions 2. **Delegation tests**: Test subagent delegation scenarios 3. **Context loading tests**: Test context file loading and validation 4. **Performance benchmarks**: Test execution speed and resource usage 5. **Parallel execution**: Test concurrent test execution 6. **Custom evaluator tests**: Test custom evaluator registration and execution ## Related Documentation - [Eval Framework README](./README.md) - [Creating Tests Guide](../CREATING_TESTS.md) - [Migration Guide](../MIGRATION_GUIDE.md) - [Subagent Testing](../SUBAGENT_TESTING.md)