INTEGRATION_TESTS.md 6.1 KB

Integration Tests - Eval Pipeline

Overview

Comprehensive integration tests for the OpenCode evaluation framework that validate the complete pipeline from test execution through evaluation and reporting.

Test File

Location: src/__tests__/eval-pipeline-integration.test.ts

Test Count: 14 comprehensive integration tests

Status: ✅ All tests passing (14/14)

Test Coverage

1. Single Test Execution (3 tests)

Tests basic test case execution and evaluation:

  • Simple test case end-to-end: Validates basic prompt execution, event capture, evaluation, and scoring
  • Test with tool execution: Validates tool execution detection and evaluation
  • Approval gate violations: Validates approval denial detection

2. Multiple Test Execution (1 test)

Tests batch execution capabilities:

  • Execute multiple tests in sequence: Validates sequential test execution with proper session isolation

3. Evaluator Integration (2 tests)

Tests evaluator coordination and aggregation:

  • Multiple evaluators on same session: Validates that multiple evaluators can analyze the same session
  • Violation aggregation: Validates that violations from multiple evaluators are properly aggregated and counted

4. Session Data Collection (2 tests)

Tests session data collection and timeline building:

  • Complete session timeline: Validates timeline building from session data
  • Session with no tool execution: Validates handling of text-only sessions

5. Error Handling (2 tests)

Tests error scenarios and edge cases:

  • Test timeout handling: Validates graceful timeout handling
  • Invalid session ID: Validates error handling for non-existent sessions

6. Result Validation (2 tests)

Tests result structure and validation:

  • Result structure validation: Validates complete result object structure
  • Overall score calculation: Validates score aggregation from multiple evaluators

7. Report Generation (2 tests)

Tests report generation capabilities:

  • Text report generation: Validates single-session report generation
  • Batch summary report: Validates multi-session summary generation

Running the Tests

Run Integration Tests Only

cd evals/framework
SKIP_INTEGRATION=false npm test -- src/__tests__/eval-pipeline-integration.test.ts --run

Run All Tests (Including Integration)

cd evals/framework
SKIP_INTEGRATION=false npm test -- --run

Skip Integration Tests (Default)

Integration tests are skipped by default in CI environments and when SKIP_INTEGRATION=true:

cd evals/framework
npm test -- --run  # Integration tests skipped

Test Requirements

Integration tests require:

  1. OpenCode CLI installed: The opencode command must be available
  2. Running server: Tests start their own server instance
  3. Network access: Tests communicate with the local server
  4. Time: Integration tests take ~60 seconds to complete

Test Architecture

Components Tested

  1. TestRunner: Orchestrates test execution
  2. TestExecutor: Executes individual test cases
  3. SessionReader: Reads session data from storage
  4. TimelineBuilder: Builds event timelines from sessions
  5. EvaluatorRunner: Runs evaluators and aggregates results
  6. Individual Evaluators: ApprovalGate, ContextLoading, ToolUsage, etc.

Test Flow

Test Case → TestRunner → TestExecutor → Agent Execution
                                              ↓
                                        Session Data
                                              ↓
                                     SessionReader
                                              ↓
                                    TimelineBuilder
                                              ↓
                                    EvaluatorRunner
                                              ↓
                                    Multiple Evaluators
                                              ↓
                                    Aggregated Results
                                              ↓
                                    Report Generation

Key Validations

Execution Phase

  • ✅ Session creation and management
  • ✅ Event stream handling
  • ✅ Approval strategy execution
  • ✅ Tool execution detection
  • ✅ Timeout handling
  • ✅ Error handling

Evaluation Phase

  • ✅ Timeline building from session data
  • ✅ Multiple evaluators running on same session
  • ✅ Violation detection and tracking
  • ✅ Evidence collection
  • ✅ Score calculation
  • ✅ Pass/fail determination

Reporting Phase

  • ✅ Result structure validation
  • ✅ Violation aggregation
  • ✅ Score aggregation
  • ✅ Text report generation
  • ✅ Batch summary generation

Test Isolation

Each test:

  • Creates its own session
  • Runs independently
  • Cleans up after completion
  • Does not affect other tests

Sessions are tracked in sessionIds array and cleaned up in afterAll hook.

Performance

  • Total Duration: ~60 seconds for all 14 tests
  • Average per test: ~4 seconds
  • Longest test: Batch execution (~8 seconds)
  • Shortest test: Error handling (~2 seconds)

Debugging

To enable debug output:

cd evals/framework
DEBUG_VERBOSE=true SKIP_INTEGRATION=false npm test -- src/__tests__/eval-pipeline-integration.test.ts --run

This will show:

  • Detailed event logs
  • Evaluator execution details
  • Session data
  • Timeline events
  • Violation details

Future Enhancements

Potential additions to integration tests:

  1. Multi-turn conversation tests: Test complex multi-message interactions
  2. Delegation tests: Test subagent delegation scenarios
  3. Context loading tests: Test context file loading and validation
  4. Performance benchmarks: Test execution speed and resource usage
  5. Parallel execution: Test concurrent test execution
  6. Custom evaluator tests: Test custom evaluator registration and execution

Related Documentation