Integration Tests - Eval Pipeline
Overview
Comprehensive integration tests for the OpenCode evaluation framework that validate the complete pipeline from test execution through evaluation and reporting.
Test File
Location: src/__tests__/eval-pipeline-integration.test.ts
Test Count: 14 integration tests
Status: ✅ All tests passing (14/14)
Test Coverage
1. Single Test Execution (3 tests)
Tests basic test case execution and evaluation:
- Simple test case end-to-end: Validates basic prompt execution, event capture, evaluation, and scoring
- Test with tool execution: Validates tool execution detection and evaluation
- Approval gate violations: Validates approval denial detection
2. Multiple Test Execution (1 test)
Tests batch execution capabilities:
- Execute multiple tests in sequence: Validates sequential test execution with proper session isolation
3. Evaluator Integration (2 tests)
Tests evaluator coordination and aggregation:
- Multiple evaluators on same session: Validates that multiple evaluators can analyze the same session
- Violation aggregation: Validates that violations from multiple evaluators are properly aggregated and counted
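The aggregation checked by the last test can be sketched as follows. This is a minimal illustration, not the framework's implementation: the `Violation`, `EvaluatorResult`, and `ViolationSummary` shapes, the severity levels, and the evaluator names in the sample data are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the framework's real types may differ.
type Severity = "error" | "warning" | "info";

interface Violation {
  type: string;
  severity: Severity;
  message: string;
}

interface EvaluatorResult {
  evaluator: string;
  violations: Violation[];
}

interface ViolationSummary {
  total: number;
  bySeverity: Record<Severity, number>;
  byEvaluator: Record<string, number>;
}

// Merge the violations reported by every evaluator into one set of counts.
function aggregateViolations(results: EvaluatorResult[]): ViolationSummary {
  const summary: ViolationSummary = {
    total: 0,
    bySeverity: { error: 0, warning: 0, info: 0 },
    byEvaluator: {},
  };
  for (const result of results) {
    summary.byEvaluator[result.evaluator] = result.violations.length;
    for (const violation of result.violations) {
      summary.total += 1;
      summary.bySeverity[violation.severity] += 1;
    }
  }
  return summary;
}

// Sample data (invented): two evaluators report one violation each, one reports none.
const violationSummary = aggregateViolations([
  {
    evaluator: "approval-gate",
    violations: [
      { type: "missing-approval", severity: "error", message: "write ran without approval" },
    ],
  },
  {
    evaluator: "tool-usage",
    violations: [
      { type: "tool-misuse", severity: "warning", message: "shell used where a file tool fits" },
    ],
  },
  { evaluator: "context-loading", violations: [] },
]);
console.log(violationSummary.total, violationSummary.bySeverity);
```

Keeping a per-evaluator count alongside the total makes it possible to assert both that nothing was dropped and that an evaluator with no findings still appears in the result.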
4. Session Data Collection (2 tests)
Tests session data collection and timeline building:
- Complete session timeline: Validates timeline building from session data
- Session with no tool execution: Validates handling of text-only sessions
5. Error Handling (2 tests)
Tests error scenarios and edge cases:
- Test timeout handling: Validates graceful timeout handling
- Invalid session ID: Validates error handling for non-existent sessions
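One common way to get the graceful timeout behavior described above is to race the work against a timer and return a tagged outcome instead of throwing. The sketch below shows that general technique; `withTimeout` and `Outcome` are hypothetical names, and the source does not say how the framework itself implements timeouts.

```typescript
// Hypothetical helper: resolves with a tagged outcome rather than rejecting on timeout.
type Outcome<T> = { timedOut: false; value: T } | { timedOut: true };

function withTimeout<T>(work: Promise<T>, ms: number): Promise<Outcome<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolves with a timeout marker once the deadline passes.
  const deadline = new Promise<Outcome<T>>((resolve) => {
    timer = setTimeout(() => resolve({ timedOut: true }), ms);
  });

  // Wrap the real result so both branches share one return type.
  const wrapped = work.then((value): Outcome<T> => ({ timedOut: false, value }));

  // Whichever settles first wins; the timer is cleared so it cannot keep the process alive.
  return Promise.race([wrapped, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// A task that needs 50 ms, given only 10 ms.
const slowTask = new Promise<string>((resolve) => setTimeout(() => resolve("done"), 50));
const slowOutcome = withTimeout(slowTask, 10);

// A task that finishes well inside its budget.
const fastOutcome = withTimeout(Promise.resolve("ok"), 1000);
```

Returning a value on timeout lets a test assert on `timedOut` directly, with no need to catch an exception.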
6. Result Validation (2 tests)
Tests result structure and validation:
- Result structure validation: Validates complete result object structure
- Overall score calculation: Validates score aggregation from multiple evaluators
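As an illustration of what the score aggregation test exercises, the sketch below combines per-evaluator scores into one overall result. The 0-100 scale, the unweighted mean, and the `ScoredResult` and `OverallResult` shapes are assumptions; the framework may weight evaluators differently.

```typescript
// Hypothetical result shape; a 0-100 score per evaluator is an assumption.
interface ScoredResult {
  evaluator: string;
  score: number;
  passed: boolean;
}

interface OverallResult {
  overallScore: number;
  overallPassed: boolean;
}

// Combine per-evaluator scores with an unweighted mean (an assumed policy).
function summarizeScores(results: ScoredResult[]): OverallResult {
  if (results.length === 0) {
    // Assumption: with no evaluators there is nothing to fail.
    return { overallScore: 100, overallPassed: true };
  }
  const total = results.reduce((sum, r) => sum + r.score, 0);
  return {
    overallScore: Math.round(total / results.length),
    // The run passes only if every evaluator passed.
    overallPassed: results.every((r) => r.passed),
  };
}

const overall = summarizeScores([
  { evaluator: "approval-gate", score: 100, passed: true },
  { evaluator: "tool-usage", score: 75, passed: true },
  { evaluator: "context-loading", score: 50, passed: false },
]);
console.log(overall); // mean of 100, 75, 50 is 75; one failure fails the run
```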
7. Report Generation (2 tests)
Tests report generation capabilities:
- Text report generation: Validates single-session report generation
- Batch summary report: Validates multi-session summary generation
Running the Tests
Run Integration Tests Only
cd evals/framework
SKIP_INTEGRATION=false npm test -- src/__tests__/eval-pipeline-integration.test.ts --run
Run All Tests (Including Integration)
cd evals/framework
SKIP_INTEGRATION=false npm test -- --run
Skip Integration Tests (Default)
Integration tests are skipped by default in CI environments and when SKIP_INTEGRATION=true:
cd evals/framework
npm test -- --run # Integration tests skipped
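The commands above suggest that integration tests run only when SKIP_INTEGRATION is explicitly set to false. A minimal sketch of such a gate follows; `shouldSkipIntegration` is a hypothetical helper, and the exact rule (including how CI is detected) is an assumption, not taken from the test file.

```typescript
type Env = Record<string, string | undefined>;

// Assumed policy: integration tests run only when SKIP_INTEGRATION is
// explicitly "false"; an unset variable or any other value skips them.
function shouldSkipIntegration(env: Env): boolean {
  return env["SKIP_INTEGRATION"] !== "false";
}

console.log(shouldSkipIntegration({}));                            // variable unset
console.log(shouldSkipIntegration({ SKIP_INTEGRATION: "true" }));  // explicit skip
console.log(shouldSkipIntegration({ SKIP_INTEGRATION: "false" })); // explicit opt-in
```

In a real suite the function would receive `process.env`. If the tests run under Vitest (the `--run` flag hints at this, but the source does not confirm it), the result could be passed to `describe.skipIf(...)` to skip the whole suite.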
Test Requirements
Integration tests require:
- OpenCode CLI installed: The opencode command must be available
- Local server: Tests start their own server instance, so no separately running server is needed
- Network access: Tests communicate with the local server
- Time: Integration tests take ~60 seconds to complete
Test Architecture
Components Tested
- TestRunner: Orchestrates test execution
- TestExecutor: Executes individual test cases
- SessionReader: Reads session data from storage
- TimelineBuilder: Builds event timelines from sessions
- EvaluatorRunner: Runs evaluators and aggregates results
- Individual Evaluators: ApprovalGate, ContextLoading, ToolUsage, etc.
Test Flow
Test Case → TestRunner → TestExecutor → Agent Execution → Session Data
→ SessionReader → TimelineBuilder → EvaluatorRunner → Multiple Evaluators
→ Aggregated Results → Report Generation
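The evaluation half of this flow (session data to timeline to evaluators to results) can be sketched in a few lines. Everything here is illustrative: the event types, the `Evaluator` interface, and both toy evaluators are assumptions, and the real TimelineBuilder and EvaluatorRunner are likely richer.

```typescript
// Hypothetical event and result shapes, for illustration only.
interface TimelineEvent {
  timestamp: number;
  type: "user_message" | "assistant_message" | "approval_request" | "tool_call";
  tool?: string;
}

interface EvalResult {
  evaluator: string;
  passed: boolean;
  violations: string[];
}

interface Evaluator {
  name: string;
  evaluate(timeline: TimelineEvent[]): EvalResult;
}

// TimelineBuilder role: order raw session events chronologically.
function buildTimeline(events: TimelineEvent[]): TimelineEvent[] {
  return [...events].sort((a, b) => a.timestamp - b.timestamp);
}

// Toy evaluator: a tool call must be preceded by an approval request.
const approvalGateSketch: Evaluator = {
  name: "approval-gate",
  evaluate(timeline) {
    const violations: string[] = [];
    let approvalSeen = false;
    for (const event of timeline) {
      if (event.type === "approval_request") approvalSeen = true;
      if (event.type === "tool_call" && !approvalSeen) {
        violations.push(`tool "${event.tool ?? "unknown"}" ran before any approval request`);
      }
    }
    return { evaluator: this.name, passed: violations.length === 0, violations };
  },
};

// Toy evaluator: the session must contain at least one assistant message.
const hasResponseSketch: Evaluator = {
  name: "has-response",
  evaluate(timeline) {
    const ok = timeline.some((e) => e.type === "assistant_message");
    return { evaluator: this.name, passed: ok, violations: ok ? [] : ["no assistant message"] };
  },
};

// EvaluatorRunner role: run every evaluator over the same timeline.
function runEvaluators(timeline: TimelineEvent[], evaluators: Evaluator[]): EvalResult[] {
  return evaluators.map((e) => e.evaluate(timeline));
}

// Events arrive out of order; the tool call (t=2) precedes the approval request (t=3).
const timeline = buildTimeline([
  { timestamp: 3, type: "approval_request" },
  { timestamp: 1, type: "user_message" },
  { timestamp: 2, type: "tool_call", tool: "write" },
  { timestamp: 4, type: "assistant_message" },
]);
const pipelineResults = runEvaluators(timeline, [approvalGateSketch, hasResponseSketch]);
console.log(pipelineResults);
```

Because every evaluator receives the same immutable timeline, evaluators stay independent of each other, which is what allows several of them to analyze one session.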
Key Validations
Execution Phase
- ✅ Session creation and management
- ✅ Event stream handling
- ✅ Approval strategy execution
- ✅ Tool execution detection
- ✅ Timeout handling
- ✅ Error handling
Evaluation Phase
- ✅ Timeline building from session data
- ✅ Multiple evaluators running on same session
- ✅ Violation detection and tracking
- ✅ Evidence collection
- ✅ Score calculation
- ✅ Pass/fail determination
Reporting Phase
- ✅ Result structure validation
- ✅ Violation aggregation
- ✅ Score aggregation
- ✅ Text report generation
- ✅ Batch summary generation
Test Isolation
Each test:
- Creates its own session
- Runs independently
- Cleans up after completion
- Does not affect other tests
Sessions are tracked in the sessionIds array and cleaned up in the afterAll hook.
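The track-then-clean-up pattern can be sketched without a test runner. `SessionTracker` and the `deleteSession` callback are hypothetical stand-ins; in the actual suite the array lives in the test file and the cleanup runs inside the afterAll hook.

```typescript
// Hypothetical tracker mirroring the sessionIds array + afterAll pattern.
class SessionTracker {
  private readonly sessionIds: string[] = [];

  // Called by each test right after it creates a session.
  track(sessionId: string): void {
    this.sessionIds.push(sessionId);
  }

  // Delete every tracked session; one failed deletion must not block the rest.
  // Returns the IDs that could not be deleted.
  async cleanup(deleteSession: (id: string) => Promise<void>): Promise<string[]> {
    const failed: string[] = [];
    for (const id of this.sessionIds) {
      try {
        await deleteSession(id);
      } catch {
        failed.push(id);
      }
    }
    this.sessionIds.length = 0; // reset so a second cleanup is a no-op
    return failed;
  }
}

const tracker = new SessionTracker();
tracker.track("session-a");
tracker.track("session-b");

// Stand-in for a real delete call; "session-b" simulates a failure.
const deleted: string[] = [];
const fakeDelete = async (id: string): Promise<void> => {
  if (id === "session-b") throw new Error("delete failed");
  deleted.push(id);
};

const cleanupDone = tracker.cleanup(fakeDelete);
cleanupDone.then((failed) => console.log("deleted:", deleted, "failed:", failed));
```

Collecting failures instead of throwing on the first one keeps a single stuck session from leaving the others behind.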
Performance
- Total Duration: ~60 seconds for all 14 tests
- Average per test: ~4 seconds
- Longest test: Batch execution (~8 seconds)
- Shortest test: Error handling (~2 seconds)
Debugging
To enable debug output:
cd evals/framework
DEBUG_VERBOSE=true SKIP_INTEGRATION=false npm test -- src/__tests__/eval-pipeline-integration.test.ts --run
This will show:
- Detailed event logs
- Evaluator execution details
- Session data
- Timeline events
- Violation details
Future Enhancements
Potential additions to integration tests:
- Multi-turn conversation tests: Test complex multi-message interactions
- Delegation tests: Test subagent delegation scenarios
- Context loading tests: Test context file loading and validation
- Performance benchmarks: Test execution speed and resource usage
- Parallel execution: Test concurrent test execution
- Custom evaluator tests: Test custom evaluator registration and execution
Related Documentation