TESTING.md 14 KB

V2 Provider Testing Strategy

Context

The v2 provider architecture introduces out-of-process providers communicating via gRPC. This architectural shift adds network communication, connection pooling, TLS/mTLS handling, and distributed system concerns that were absent in the v1 in-process model.

Testing must validate:

  • Functional Equivalence: V2 providers produce identical results to v1 providers
  • Performance Characteristics: Quantify the impact of the network hop and identify bottlenecks
  • Reliability: Handle network failures, certificate rotation, and provider restarts gracefully
  • Security: Enforce TLS requirements, namespace isolation, and certificate validation
  • Operational Patterns: Support rolling deployments, scaling, and common operational scenarios

The existing e2e test suite validates provider behavior in the v1 in-process model. We must ensure these tests pass against v2 providers while adding new tests for v2-specific concerns.

Problem Description

V2 testing faces challenges that v1 testing did not:

  1. Distributed System Complexity: Failures can occur in network communication, TLS handshakes, connection pooling, or remote provider processes
  2. Performance Unknowns: The network hop introduces latency, but connection pooling and caching should mitigate impact. We lack quantitative data.
  3. Operational Scenarios: Certificate rotation, rolling updates, and provider scaling are new operational concerns requiring validation
  4. Conformance Across Providers: With out-of-tree providers, we need standardized validation that each provider correctly implements the protocol
  5. Test Environment Complexity: Tests must deploy and manage provider processes, service networking, and TLS infrastructure

Without comprehensive testing, we risk:

  • Silent correctness issues where v2 produces different results than v1
  • Performance regressions that degrade user experience
  • Production failures during certificate rotation or rolling updates
  • Provider implementations that deviate from expected behavior
  • Inability to confidently recommend v2 adoption

Decision

Implement a multi-layered testing strategy covering unit, integration, e2e, conformance, performance, and disruption testing.

1. E2E Test Migration

Objective: Ensure existing provider behavior works identically with v2 architecture.

Approach:

  • Run the complete existing e2e test suite against v2 provider configurations
  • Each test case that currently uses v1 SecretStore gets a v2 equivalent using Provider resources
  • Deploy providers as separate services in the test environment
  • Configure TLS for provider communication in test clusters

Test Matrix:

Provider (AWS, GCP, Vault, etc.)
  × Store Type (SecretStore, ClusterSecretStore)
  × Auth Mode (ManifestNamespace, ProviderNamespace)
  × Feature (GetSecret, GetAllSecrets, PushSecret, DeleteSecret)
  × Data Format (string, binary, JSON templating, dataFrom)

Implementation:

  • Create test utilities that generate v2 provider configurations from existing v1 tests
  • Establish provider deployment helpers for test environments
  • Run v1 and v2 tests in parallel during transition period to detect regressions

Success Criteria:

  • 100% of v1 e2e tests pass with v2 providers
  • No behavioral differences between v1 and v2 results

2. Certificate Management Tests

Objective: Validate TLS certificate lifecycle operations work without service disruption.

Test Scenarios:

2.1 Certificate Rotation

  • Issue initial certificates for ESO→Provider communication
  • Rotate server certificate while maintaining client connections
  • Rotate client certificate while maintaining connectivity
  • Rotate CA certificate with overlap period
  • Verify: Zero failed reconciliations during rotation
  • Verify: Connection pool handles certificate updates gracefully

2.2 Certificate Expiration

  • Configure short-lived certificates (5 minutes)
  • Run reconciliation loop through multiple certificate renewals
  • Verify: Automatic certificate refresh before expiration
  • Verify: Clear error messages if certificate expires

2.3 Certificate Validation Failures

  • Present invalid server certificate (wrong CN, expired, self-signed)
  • Present mismatched CA certificate
  • Present revoked certificate
  • Verify: Connections rejected with clear error messages
  • Verify: Controller status reflects TLS validation failures

2.4 mTLS Configuration

  • Enable mutual TLS with client certificates
  • Verify: Provider rejects connections without valid client cert
  • Verify: ESO successfully authenticates with client cert
  • Rotate both client and server certificates

Implementation:

  • Integrate cert-manager in test environments for automated certificate issuance
  • Create test scenarios with Kubernetes TLS secrets
  • Use short certificate lifetimes to accelerate rotation testing

3. Provider Conformance Suite

Objective: Standardize validation that providers correctly implement the v2 protocol.

Approach: Create a reusable test library (providers/v2/conformance) that provider implementers run against their implementations.

Core Conformance Tests:

3.1 Secret Operations

  • GetSecret: Retrieve secret by key, handle missing secrets, decode strategies
  • GetSecretMap: Retrieve key-value maps (deprecated but supported)
  • GetAllSecrets: Find secrets by path, tags, regex, conversion strategies
  • PushSecret: Write secrets with properties, metadata, idempotency
  • DeleteSecret: Remove secrets, handle non-existent deletions
  • SecretExists: Check existence without retrieving data

3.2 Error Semantics

  • Return NoSecretError for missing secrets with correct deletion policy behavior
  • Return validation errors for malformed requests
  • Return permission errors when authentication fails
  • Propagate timeouts and retryable vs non-retryable errors correctly

3.3 Authentication

  • Respect namespace boundaries for secret references
  • Support multiple authentication methods (IAM, service accounts, static credentials)
  • Handle credential refresh and expiration
  • Reject cross-namespace access when not permitted

3.4 Protocol Compliance

  • Respond to health checks correctly
  • Implement graceful shutdown
  • Handle concurrent requests safely
  • Respect context cancellation
  • Return proper gRPC status codes

3.5 Provider Metadata

  • Return correct capabilities (ReadOnly, WriteOnly, ReadWrite)
  • Validate provider configuration
  • Provide meaningful validation warnings

Implementation:

  • Conformance tests as Go package: import "github.com/external-secrets/external-secrets/providers/v2/conformance"
  • Provider tests instantiate conformance suite with their provider implementation
  • CI integration: gate provider releases on conformance pass
  • Versioned conformance suite (v2alpha1, v2beta1) for compatibility testing

Success Criteria:

  • All official providers pass 100% of conformance tests
  • Conformance suite executable in <2 minutes
  • Clear failure messages identifying non-compliant behavior

4. Performance and Load Tests

Objective: Quantify performance impact of v2 architecture and identify breaking points.

4.1 Baseline Performance Comparison

Compare v1 (in-process) vs v2 (gRPC) on identical workloads:

Metrics:

  • Secret fetch latency (p50, p95, p99)
  • Reconciliation latency (ExternalSecret update to Secret ready)
  • Throughput (secrets/second)
  • CPU usage (controller and provider)
  • Memory usage (controller and provider)
  • Network bandwidth
  • Connection pool statistics (active, idle, created, reused)

Test Scenarios:

Small: 100 ExternalSecrets, 1 secret each, 5m refresh
Medium: 1000 ExternalSecrets, 3 secrets each, 5m refresh
Large: 5000 ExternalSecrets, 10 secrets each, 5m refresh
Burst: 1000 ExternalSecrets created simultaneously

Implementation:

  • Deploy both v1 and v2 configurations in identical clusters
  • Use Prometheus to capture metrics
  • Generate ExternalSecrets with controlled characteristics
  • Run tests for 30 minutes to measure steady-state performance
  • Export results to comparable format (CSV, JSON)

4.2 Connection Pool Behavior

Validate connection pooling effectiveness:

Test Cases:

  • Measure connection reuse rate under steady load
  • Verify connection pool respects max connection limits
  • Validate idle connection timeout and cleanup
  • Measure connection establishment latency
  • Test connection recovery after network blip

4.3 Cache Effectiveness

Measure client manager cache hit rates:

Metrics:

  • Cache hit ratio by provider type
  • Cache invalidation frequency and causes
  • Memory consumption by cache size
  • Impact of generation changes on cache invalidation

4.4 Breaking Point Analysis

Identify system limits:

Approach:

  • Incrementally increase load until degradation or failure
  • Vary: Number of ExternalSecrets, secrets per ExternalSecret, refresh frequency
  • Measure: When does latency exceed SLO? When do errors begin? What fails first?
  • Compare: Where does v2 break compared to v1?

Implementation:

  • Use load generation tools (custom or k6/locust adapted for Kubernetes)
  • Monitor resource exhaustion (CPU, memory, file descriptors, connections)
  • Capture system behavior at breaking point (logs, metrics, traces)

Success Criteria:

  • v2 adds ≤50ms p95 latency compared to v1 under medium load
  • v2 throughput within 80% of v1 throughput
  • v2 handles ≥1000 concurrent ExternalSecrets per controller
  • Connection pool prevents connection exhaustion
  • Clear documentation of performance characteristics and limits

5. Disruption and Chaos Tests

Objective: Validate system resilience during operational disruptions.

5.1 Rolling Deployments

Test rolling updates of ESO controller and providers:

Scenarios:

  • Roll ESO controller pods while providers run
  • Roll provider pods while ESO reconciles
  • Roll both simultaneously
  • Vary: Deployment strategy (RollingUpdate, Recreate), replica count, update velocity

Measurements:

  • Reconciliation success rate during rollout
  • Latency increase during rollout
  • Connection pool behavior during pod replacement
  • Error rate and recovery time
  • Number of failed secret fetches

5.2 Network Failures

Simulate network issues between ESO and providers:

Test Cases:

  • Complete network partition (10s, 60s, 5m)
  • Packet loss (5%, 20%, 50%)
  • Latency injection (+50ms, +500ms, +5s)
  • DNS resolution failures
  • Service endpoint unavailability

Measurements:

  • Retry behavior and backoff
  • Circuit breaking (if implemented)
  • Error propagation to ExternalSecret status
  • Recovery time after network restoration
  • Connection pool health monitoring effectiveness

5.3 Provider Failures

Test provider process failures:

Scenarios:

  • Graceful shutdown (SIGTERM)
  • Forced termination (SIGKILL)
  • Provider panic/crash
  • Provider deadlock/hang
  • OOM kill

Measurements:

  • Health check detection time
  • Connection pool marking unhealthy connections
  • Automatic reconnection attempts
  • User-visible error messages in ExternalSecret status
  • Time to recovery after provider restart

5.4 Certificate Issues

Inject certificate problems:

Test Cases:

  • Expire server certificate mid-operation
  • Revoke certificate
  • Change CA without updating client
  • TLS handshake timeout
  • Certificate chain validation failure

Measurements:

  • Error detection latency
  • Error message clarity
  • Automatic recovery after certificate fix

5.5 Resource Contention

Test behavior under resource pressure:

Scenarios:

  • CPU throttling (limit provider to 100m CPU)
  • Memory pressure (limit provider to 128Mi)
  • Disk I/O saturation
  • High concurrent request load

Measurements:

  • Graceful degradation vs hard failure
  • Request timeout behavior
  • Resource limit enforcement
  • Queue buildup and backpressure

Implementation:

  • Use chaos engineering tools (Chaos Mesh, Litmus)
  • Automate disruption injection in test suite
  • Run continuously in staging environments
  • Generate chaos test reports with metrics and logs

Success Criteria:

  • Zero data corruption during disruptions
  • <5% error rate during rolling updates
  • Automatic recovery within 2 minutes after disruption ends
  • Clear error messages visible in ExternalSecret status
  • No panics or crashes in ESO controller

6. Additional Recommended Tests

6.1 Version Skew Tests

Validate compatibility across version combinations:

Matrix:

  • ESO version N with Provider version N-1, N, N+1
  • Protocol version compatibility
  • Deprecated field handling
  • Forward/backward compatibility

6.2 Metrics Validation

Ensure observability correctness:

Tests:

  • Verify all metrics are emitted correctly
  • Validate metric labels and cardinality
  • Check metrics match actual system behavior
  • Ensure no metrics memory leaks

6.3 Concurrent Operations

Test race conditions and concurrent access:

Scenarios:

  • Multiple ExternalSecrets referencing same Provider simultaneously
  • Rapid ExternalSecret create/delete cycles
  • Concurrent provider client cache access
  • Connection pool concurrent get/release

6.4 Error Recovery

Test recovery from error states:

Scenarios:

  • Provider becomes healthy after being unhealthy
  • Invalid configuration fixed
  • Credentials updated after auth failure
  • Network restored after partition

6.5 Migration Tests

Validate v1 to v2 migration:

Tests:

  • Switch ExternalSecret from v1 to v2 store without data loss
  • Run mixed v1/v2 workloads simultaneously
  • Gradual provider migration
  • Rollback from v2 to v1

Consequences

Positive

  • Confidence: Comprehensive testing enables confident v2 recommendation to users
  • Quality: Conformance suite ensures provider consistency
  • Performance Insight: Quantitative data informs optimization priorities
  • Operational Readiness: Disruption tests validate production scenarios
  • Regression Prevention: Automated testing catches regressions early

Negative

  • Test Infrastructure Complexity: Managing provider deployments increases test environment complexity
  • Execution Time: Comprehensive testing takes longer than v1 tests
  • Maintenance Burden: More tests require ongoing maintenance
  • Resource Cost: Performance and chaos tests consume significant compute resources

Neutral

  • Gradual Rollout: Testing strategy supports phased v2 adoption
  • Provider Responsibility: Out-of-tree providers own their conformance test execution
  • Tooling Requirements: Requires investment in test tooling and infrastructure