
feat: design out-of-process providers

Signed-off-by: Moritz Johner <beller.moritz@googlemail.com>
Moritz Johner 7 months ago
parent
commit
a3cbbd8233

+ 127 - 0
design/014/ADAPTER.md

@@ -0,0 +1,127 @@
+# Out-of-Process Provider Adapter Pattern
+
+## Context
+
+External Secrets Operator reconciles `ExternalSecret` and `PushSecret` resources by fetching secrets from, or pushing secrets to, external secret management systems through provider implementations. Historically, all providers have run in-process within the controller binary. This architecture requires provider code to be statically linked at compile time and limits deployment flexibility.
+
+We are introducing a v2 provider architecture where providers run as separate gRPC server processes, enabling:
+- Independent deployment and scaling of providers
+- Heterogeneous language support for provider implementations
+- Reduced controller binary size and memory footprint
+- Runtime provider discovery and configuration
+
+This architectural shift introduces a network hop between the controller and secret providers. A critical requirement is maintaining a single codebase for provider implementations—we cannot fork provider implementations into separate "in-process" and "out-of-process" versions.
+
+## Problem Description
+
+The provider abstraction is defined by the `SecretsClient` interface, which provides methods for secret operations (`GetSecret`, `PushSecret`, `DeleteSecret`, etc.). The `clientmanager` is responsible for instantiating and caching these clients for use during reconciliation.
+
+Introducing out-of-process providers creates two challenges:
+
+1. **Interface Compatibility:** The controller expects all providers to implement `SecretsClient`, but out-of-process providers communicate via gRPC rather than direct method calls.
+
+2. **Code Reuse:** Provider implementations must work both as standalone gRPC servers and as libraries usable by in-process controllers without maintaining duplicate codebases.
+
+## Decision
+
+Implement a **bidirectional adapter pattern** with two complementary components:
+
+### Client-Side Adapter
+
+A client-side adapter wraps a gRPC client and implements the `esv1.SecretsClient` interface. When the `clientmanager` requests a provider client, it receives this adapter which:
+
+1. Accepts method calls matching the `SecretsClient` interface
+2. Converts parameters to protobuf messages
+3. Sends gRPC requests to the remote provider server
+4. Converts protobuf responses back to expected return types
+
+The adapter is transparent to the reconciliation logic. Controllers interact with remote providers using the same interface as in-process providers.
+
+Integration point:
+
+```go
+// clientmanager/manager.go
+func (m *Manager) getV2ProviderClient(ctx context.Context, providerName, namespace string) (esv1.SecretsClient, error) {
+    // Get gRPC connection from pool (lookup of address and tlsConfig elided)
+    grpcClient, err := pool.Get(ctx, address, tlsConfig)
+    if err != nil {
+        return nil, fmt.Errorf("getting gRPC client for provider %q: %w", providerName, err)
+    }
+
+    // Wrap with client-side adapter
+    wrappedClient := adapterstore.NewClient(grpcClient, providerRef, authNamespace)
+
+    // Cache and return - reconciler sees SecretsClient interface
+    return wrappedClient, nil
+}
+```
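
The four adapter steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the actual adapter: `SecretsClient` is cut down to one method, `GetSecretRequest` and `GetSecretResponse` stand in for generated protobuf messages, and `remoteProvider`, `echoProvider`, and `fetchViaAdapter` are hypothetical names introduced only for this example.

```go
package main

import (
	"context"
	"errors"
)

// SecretsClient is a one-method stand-in for esv1.SecretsClient (assumption).
type SecretsClient interface {
	GetSecret(ctx context.Context, key string) ([]byte, error)
}

// GetSecretRequest and GetSecretResponse stand in for generated protobuf messages.
type GetSecretRequest struct {
	Key       string
	Namespace string
}

type GetSecretResponse struct {
	Value []byte
}

// remoteProvider stands in for the generated gRPC client stub.
type remoteProvider interface {
	GetSecret(ctx context.Context, req *GetSecretRequest) (*GetSecretResponse, error)
}

// clientAdapter implements SecretsClient by delegating to a remote provider.
type clientAdapter struct {
	remote    remoteProvider
	namespace string
}

// Compile-time check that the adapter satisfies the interface the reconciler expects.
var _ SecretsClient = (*clientAdapter)(nil)

func (a *clientAdapter) GetSecret(ctx context.Context, key string) ([]byte, error) {
	if key == "" {
		return nil, errors.New("secret key must not be empty")
	}
	// Step 2: convert parameters to a request message.
	req := &GetSecretRequest{Key: key, Namespace: a.namespace}
	// Step 3: send the request (a real adapter would issue a gRPC call here).
	resp, err := a.remote.GetSecret(ctx, req)
	if err != nil {
		return nil, err
	}
	// Step 4: convert the response back to the interface's return type.
	return resp.Value, nil
}

// echoProvider is a fake remote that answers without any network.
type echoProvider struct{}

func (echoProvider) GetSecret(_ context.Context, req *GetSecretRequest) (*GetSecretResponse, error) {
	return &GetSecretResponse{Value: []byte(req.Namespace + "/" + req.Key)}, nil
}

// fetchViaAdapter shows the reconciler's view: it only sees SecretsClient.
func fetchViaAdapter(namespace, key string) (string, error) {
	var client SecretsClient = &clientAdapter{remote: echoProvider{}, namespace: namespace}
	value, err := client.GetSecret(context.Background(), key)
	if err != nil {
		return "", err
	}
	return string(value), nil
}
```

Because the caller holds only a `SecretsClient`, swapping `echoProvider` for a real stub would not change `fetchViaAdapter`, which is the transparency property the section describes.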
+
+### Server-Side Adapter
+
+The server-side adapter receives gRPC requests and translates them into `SecretsClient` interface calls. The adapter:
+
+1. Implements gRPC service interfaces (`SecretStoreProviderServer`, `GeneratorProviderServer`)
+2. Receives protobuf request messages
+3. Constructs provider instances using existing v1 provider factories
+4. Converts protobuf parameters to interface types
+5. Invokes methods on the provider's `SecretsClient` implementation
+6. Converts results to protobuf responses
+
+Provider implementations remain unchanged—they implement `ProviderInterface.NewClient()` and return `SecretsClient` instances exactly as they do for in-process use.
+
+Integration point:
+
+```go
+// providers/v2/aws/main.go (generated)
+func main() {
+    // Existing v1 provider implementation
+    v1Provider := store.NewProvider()
+    
+    // Map provider by GVK
+    providerMapping := adapterstore.ProviderMapping{
+        schema.GroupVersionKind{...}: v1Provider,
+    }
+    
+    // Adapter wraps v1 provider as gRPC server
+    adapterServer := adapter.NewServer(kubeClient, scheme, providerMapping, specMapper, generatorMapping)
+    pb.RegisterSecretStoreProviderServer(grpcServer, adapterServer)
+}
+```
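
The server-side translation can be sketched in the same spirit. The types below are assumptions made for illustration: `Provider` is a reduced stand-in for the v1 factory interface, the request and response structs replace protobuf messages, and providers are keyed by a plain kind string rather than a `schema.GroupVersionKind`.

```go
package main

import (
	"context"
	"fmt"
)

// SecretsClient is a one-method stand-in for the v1 client interface (assumption).
type SecretsClient interface {
	GetSecret(ctx context.Context, key string) ([]byte, error)
}

// Provider is a reduced stand-in for the v1 provider factory interface.
type Provider interface {
	NewClient(ctx context.Context, namespace string) (SecretsClient, error)
}

// GetSecretRequest and GetSecretResponse stand in for protobuf messages.
type GetSecretRequest struct {
	ProviderKind string
	Namespace    string
	Key          string
}

type GetSecretResponse struct {
	Value []byte
}

// adapterServer maps a provider kind to an unchanged v1 provider.
type adapterServer struct {
	providers map[string]Provider
}

// GetSecret builds a v1 client, invokes it, and wraps the result in a response.
func (s *adapterServer) GetSecret(ctx context.Context, req *GetSecretRequest) (*GetSecretResponse, error) {
	p, ok := s.providers[req.ProviderKind]
	if !ok {
		return nil, fmt.Errorf("no provider registered for kind %q", req.ProviderKind)
	}
	client, err := p.NewClient(ctx, req.Namespace)
	if err != nil {
		return nil, fmt.Errorf("constructing client: %w", err)
	}
	value, err := client.GetSecret(ctx, req.Key)
	if err != nil {
		return nil, err
	}
	return &GetSecretResponse{Value: value}, nil
}

// staticProvider is an in-memory v1 provider used only for illustration.
type staticProvider struct {
	data map[string]string
}

func (p staticProvider) NewClient(_ context.Context, _ string) (SecretsClient, error) {
	return staticClient(p.data), nil
}

type staticClient map[string]string

func (c staticClient) GetSecret(_ context.Context, key string) ([]byte, error) {
	v, ok := c[key]
	if !ok {
		return nil, fmt.Errorf("secret %q not found", key)
	}
	return []byte(v), nil
}

// serveOnce pushes one request through the adapter and returns the payload.
func serveOnce(kind, key string) (string, error) {
	srv := &adapterServer{providers: map[string]Provider{
		"SecretsManager": staticProvider{data: map[string]string{"db-password": "s3cr3t"}},
	}}
	req := &GetSecretRequest{ProviderKind: kind, Namespace: "default", Key: key}
	resp, err := srv.GetSecret(context.Background(), req)
	if err != nil {
		return "", err
	}
	return string(resp.Value), nil
}
```

Note that `staticProvider` knows nothing about requests or responses; only `adapterServer` does, which is how the provider code can stay unchanged.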
+
+### Data Flow
+
+1. Controller reconciles `ExternalSecret` referencing a v2 `Provider`
+2. `clientmanager.Get()` detects v2 provider kind
+3. Manager creates client-side adapter wrapping gRPC connection
+4. Reconciler calls `client.GetSecret(ctx, ref)`
+5. Client-side adapter converts call to `pb.GetSecretRequest`
+6. gRPC request sent to remote provider server
+7. Server-side adapter receives request
+8. Server-side adapter constructs v1 provider client
+9. Server-side adapter calls `client.GetSecret(ctx, ref)` on v1 implementation
+10. Server-side adapter converts result to `pb.GetSecretResponse`
+11. Client-side adapter converts response to `[]byte`
+12. Reconciler receives secret data
+
+### Connection Management
+
+The architecture uses a global connection pool (`grpc.ConnectionPool`) so that connections are reused across reconciliations. The `clientmanager` tracks the pooled connections it has borrowed. Calling `Close()` does not close the underlying connection; it returns the connection to the pool for subsequent use.
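
The release-instead-of-close behaviour can be illustrated with a small reference-counting pool. This is a sketch under stated assumptions, not the `grpc.ConnectionPool` implementation: `conn` is a placeholder for a gRPC connection, connections are keyed by address only, and idle eviction, TLS, and health checking are left out.

```go
package main

import (
	"errors"
	"sync"
)

// conn is a placeholder for a gRPC client connection (hypothetical type).
type conn struct {
	address string
}

type entry struct {
	c    *conn
	refs int
}

// Pool hands out one shared connection per address and counts borrowers.
type Pool struct {
	mu      sync.Mutex
	entries map[string]*entry
	dials   int
}

func NewPool() *Pool {
	return &Pool{entries: make(map[string]*entry)}
}

// Get returns the pooled connection for address, dialing only on first use.
func (p *Pool) Get(address string) *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	e, ok := p.entries[address]
	if !ok {
		e = &entry{c: &conn{address: address}}
		p.entries[address] = e
		p.dials++
	}
	e.refs++
	return e.c
}

// Release gives a borrowed connection back; the connection stays open for reuse.
func (p *Pool) Release(address string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	e, ok := p.entries[address]
	if !ok || e.refs == 0 {
		return errors.New("release without matching get")
	}
	e.refs--
	return nil
}

// Dials reports how many connections were actually established.
func (p *Pool) Dials() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.dials
}
```

Two consecutive `Get` and `Release` cycles for the same address yield the same `*conn` and a single dial, which is the reuse the pool is meant to provide.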
+
+## Consequences
+
+### Positive
+
+- **Single Codebase:** Provider implementations exist once and work both in-process and out-of-process through adapters
+- **Interface Stability:** Reconciliation logic remains unchanged; the adapter pattern is transparent
+- **Flexibility:** Providers can be deployed in-process (legacy), out-of-process (v2), or mixed
+- **Testability:** v1 provider implementations can be tested directly without gRPC infrastructure
+- **Gradual Migration:** Existing providers migrate individually without disrupting others
+
+### Negative
+
+- **Performance Overhead:** Network hop adds latency compared to in-process calls (mitigated by connection pooling and client caching)
+- **Serialization Cost:** Data must be serialized/deserialized at adapter boundaries
+- **Complexity:** The additional layer of indirection means that debugging requires an understanding of the adapter pattern
+- **Error Propagation:** gRPC errors must be properly mapped to provider errors for consistent behavior
+
+### Neutral
+
+- **Interface Constraints:** The adapter pattern requires protobuf definitions to match the `SecretsClient` interface capabilities
+- **Versioning:** Changes to `SecretsClient` interface require coordinated updates to protobuf definitions and both adapters

+ 141 - 0
design/014/API.md

@@ -0,0 +1,141 @@
+# API Design for Out-of-Process Providers
+
+## Decision
+
+How should ExternalSecret and PushSecret resources reference out-of-process providers?
+
+## Context
+
+Out-of-process providers run in separate pods and communicate with ESO Core via gRPC. We need an API design that enables:
+
+- Service discovery: ESO Core must locate the provider's gRPC endpoint
+- Provider-specific configuration: Each provider needs its own configuration (credentials, endpoints, etc.)
+- Cluster-scoped and namespace-scoped resources
+- Common provider behaviors (authentication scope, conditions, status)
+
+## Options
+
+### Option 1: Provider Indirection Layer
+
+ExternalSecret references a `Provider` or `ClusterProvider` resource, which then references a provider-specific custom resource.
+
+**Architecture:**
+```
+ExternalSecret
+  ├─> secretStoreRef.kind: Provider
+  └─> secretStoreRef.name: my-provider
+        └─> Provider
+              ├─> spec.address: grpc://provider-service:8080
+              └─> spec.providerRef
+                    ├─> kind: SecretsManager
+                    └─> name: team-blue-eu-west-2
+```
+
+![Out of Process Providers](./out-of-process-api-provider.png)
+
+**Example:**
+```yaml
+apiVersion: external-secrets.io/v1
+kind: ExternalSecret
+metadata:
+  name: database-credentials
+  namespace: external-secrets-system
+spec:
+  refreshInterval: 1h
+  secretStoreRef:
+    kind: Provider
+    name: aws-provider
+```
+
+**Pros:**
+- Common fields and behaviors are defined once on `Provider`/`ClusterProvider`
+- ESO Core interprets shared fields (authentication scope, conditions)
+- Service discovery is explicit via `spec.address`
+- No need for cluster-scoped variants of provider-specific CRDs
+- Clear separation: `Provider` handles connectivity, provider-specific CR handles configuration
+
+**Cons:**
+- Additional layer of indirection increases complexity
+- Users must understand two resource types instead of one
+- Naming and responsibilities may be unclear to new users
+- More verbose configuration
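
To make the two-step lookup of Option 1 concrete, here is a minimal sketch of how a reference could be resolved. All Go types are hypothetical shapes derived from the architecture diagram above, and `Index` is an in-memory stand-in for a Kubernetes API lookup.

```go
package main

import "fmt"

// StoreRef mirrors the secretStoreRef fields shown in the example.
type StoreRef struct{ Kind, Name string }

// ProviderRef points at the provider-specific resource.
type ProviderRef struct{ Kind, Name string }

// Provider is a trimmed-down, hypothetical shape of the Provider resource.
type Provider struct {
	Address     string
	ProviderRef ProviderRef
}

// Target is what ESO Core needs to place a call: where to connect and
// which provider-specific resource to pass along.
type Target struct {
	Address     string
	ProviderRef ProviderRef
}

type objectKey struct{ Namespace, Name string }

// Index is an in-memory stand-in for a Kubernetes API lookup.
type Index map[objectKey]Provider

// Resolve follows ExternalSecret -> Provider -> provider-specific reference.
func (idx Index) Resolve(namespace string, ref StoreRef) (Target, error) {
	if ref.Kind != "Provider" {
		return Target{}, fmt.Errorf("unsupported secretStoreRef kind %q", ref.Kind)
	}
	p, ok := idx[objectKey{Namespace: namespace, Name: ref.Name}]
	if !ok {
		return Target{}, fmt.Errorf("provider %s/%s not found", namespace, ref.Name)
	}
	if p.Address == "" {
		return Target{}, fmt.Errorf("provider %s/%s has no spec.address", namespace, ref.Name)
	}
	return Target{Address: p.Address, ProviderRef: p.ProviderRef}, nil
}

// exampleIndex holds the single Provider used in the example above.
func exampleIndex() Index {
	return Index{
		{Namespace: "external-secrets-system", Name: "aws-provider"}: {
			Address:     "grpc://provider-service:8080",
			ProviderRef: ProviderRef{Kind: "SecretsManager", Name: "team-blue-eu-west-2"},
		},
	}
}
```

The sketch shows why ESO Core never needs to read the provider-specific resource under this option: it only forwards `ProviderRef` to the address it found.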
+
+### Option 2: Direct Reference
+
+ExternalSecret directly references provider-specific custom resources.
+
+**Architecture:**
+```
+ExternalSecret
+  ├─> secretStoreRef.apiVersion: aws.provider.external-secrets.io/v2alpha1
+  ├─> secretStoreRef.kind: SecretsManager
+  └─> secretStoreRef.name: team-blue-eu-west-2
+```
+
+![Out of Process Providers](./out-of-process-api-direct-ref.png)
+
+**Example:**
+
+```yaml
+apiVersion: external-secrets.io/v1
+kind: ExternalSecret
+metadata:
+  name: database-credentials
+  namespace: external-secrets-system
+spec:
+  refreshInterval: 1h
+  secretStoreRef:
+    apiVersion: aws.provider.external-secrets.io/v2alpha1
+    kind: SecretsManager
+    name: team-blue-eu-west-2
+```
+
+**Pros:**
+- Simpler mental model: one reference instead of two
+- Fewer resources to manage
+- More intuitive for users familiar with Kubernetes patterns
+- Cleaner YAML configuration
+
+**Cons:**
+- Each provider must implement cluster-scoped variants (e.g., `ClusterSecretsManager`, `ClusterParameterStore`)
+- Significant boilerplate code for cluster-scoped resources
+- Service discovery requires convention-based approach (e.g., labels on Service objects)
+- Common fields must be duplicated across all provider CRDs
+- No single place to configure authentication scope or shared behaviors
+- ESO Core cannot interpret provider-specific fields
+
+## Service Discovery
+
+### Option 1: Explicit Address
+The `Provider` resource contains `spec.address` pointing to the gRPC endpoint.
+
+### Option 2: Convention-Based Discovery
+ESO Core discovers services by:
+- Label selectors on Service objects
+- Naming conventions (e.g., `<provider-kind>-<name>`)
+- Namespace discovery rules
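
As an illustration of the naming-convention variant, the sketch below derives a gRPC target from the `<provider-kind>-<name>` pattern. The lower-casing of the kind, the use of cluster DNS (`<service>.<namespace>.svc`), and the port are assumptions for this example; the design does not fix any of them.

```go
package main

import (
	"fmt"
	"strings"
)

// ServiceAddress derives a gRPC target from the `<provider-kind>-<name>` convention.
// Lower-casing the kind and appending `.svc` are assumptions, not part of the design.
func ServiceAddress(kind, name, namespace string, port int) (string, error) {
	if kind == "" || name == "" || namespace == "" {
		return "", fmt.Errorf("kind, name and namespace are required")
	}
	service := strings.ToLower(kind) + "-" + name
	// Kubernetes Service names are DNS labels and limited to 63 characters.
	if len(service) > 63 {
		return "", fmt.Errorf("service name %q exceeds 63 characters", service)
	}
	return fmt.Sprintf("%s.%s.svc:%d", service, namespace, port), nil
}
```

The length check hints at one cost of conventions: long resource names can produce Service names that Kubernetes rejects, a failure mode that an explicit `spec.address` avoids.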
+
+## Trade-offs
+
+| Aspect | Option 1: Indirection | Option 2: Direct Reference |
+|--------|----------------------|----------------------------|
+| User Experience | More complex | Simpler |
+| Implementation | Less boilerplate | More boilerplate |
+| Extensibility | Centralized common fields | Duplicated fields |
+| Service Discovery | Explicit | Convention-based |
+| Maintenance | Single controller pattern | Multiple cluster-scoped controllers |
+
+## Recommendation
+
+[To be filled in after discussion]
+
+## Open Questions
+
+1. How do we handle provider authentication? Should it be configured on the outer `Provider` resource or on the provider-specific CR?
+2. What happens when a provider needs to expose custom status conditions?
+3. How do we version provider-specific APIs independently from ESO Core?
+4. Can we provide a middle-ground solution that combines benefits of both approaches?
+

+ 175 - 0
design/014/PROVIDER_AUTOGEN.md

@@ -0,0 +1,175 @@
+# Provider Code Generation
+
+## Context
+
+V2 providers run as standalone gRPC server processes, each requiring a `main.go` file with startup logic and a `Dockerfile` for containerization. This startup code follows a consistent pattern across all providers:
+
+- Parse command-line flags (port, TLS settings, metrics)
+- Initialize Kubernetes client with appropriate schemes
+- Set up metrics collection and health checks
+- Configure gRPC server with TLS/mTLS
+- Register provider and generator services
+- Implement graceful shutdown handling
+
+Provider-specific configuration includes:
+- Which v1 provider implementations to wrap
+- GVK (Group/Version/Kind) mappings for stores and generators
+- Package imports for API types and implementations
+- Spec mapper logic to convert provider resources to `SecretStoreSpec`
+
+Manually maintaining this boilerplate across multiple providers is a maintenance burden and invites errors. It also means that cross-cutting changes, such as adding a flag or improving shutdown logic, must be applied to every provider.
+
+## Problem Description
+
+Without code generation, each provider requires:
+
+1. **Repetitive Boilerplate**: 150+ lines of identical startup code duplicated across providers
+2. **Maintenance Overhead**: Changes to common patterns require updating every provider's `main.go`
+3. **Error Susceptibility**: Manual construction of import statements and GVK mappings is error-prone
+4. **Inconsistency Risk**: Providers may drift from standard patterns over time
+5. **Slow Provider Addition**: Creating new providers requires copying and adapting existing `main.go` files
+
+The only true differences between providers are:
+- Provider name and display name
+- Which v1 implementations to instantiate
+- GVK mappings for stores and generators
+- Import paths for API types
+
+## Decision
+
+Implement a template-based code generator that produces `main.go` and `Dockerfile` from declarative YAML configuration.
+
+### Architecture
+
+Introduce a **Generator Tool**:
+- Discovers all `provider.yaml` files in the providers directory
+- Validates each against JSON schema
+- Executes Go templates with provider-specific data
+- Manages import aliases to avoid conflicts
+- Formats generated code with `goimports`
+
+**Provider Configuration** (`provider.yaml`):
+```yaml
+provider:
+  name: aws
+  displayName: "AWS Provider"
+  v2Package: "github.com/.../apis/provider/aws/v2alpha1"
+
+stores:
+  - gvk:
+      group: "provider.external-secrets.io"
+      version: "v2alpha1"
+      kind: "SecretsManager"
+    v1Provider: "github.com/.../providers/v2/aws/store"
+    v1ProviderFunc: "NewProvider"
+
+generators:
+  - gvk:
+      group: "generators.external-secrets.io"
+      version: "v1alpha1"
+      kind: "ECRAuthorizationToken"
+    v1Generator: "github.com/.../providers/v2/aws/generator"
+    v1GeneratorFunc: "NewECRGenerator"
+
+configPackage: "."
+```
+
+**Manual Component** (`config.go`):
+
+Provider-specific logic that cannot be templated remains in manually written `config.go`:
+
+```go
+func GetSpecMapper(kubeClient client.Client) func(*pb.ProviderReference) (*v1.SecretStoreSpec, error) {
+    return func(ref *pb.ProviderReference) (*v1.SecretStoreSpec, error) {
+        var provider awsv2alpha1.SecretsManager
+        err := kubeClient.Get(context.Background(), client.ObjectKey{
+            Namespace: ref.Namespace,
+            Name:      ref.Name,
+        }, &provider)
+        if err != nil {
+            return nil, err
+        }
+        return &v1.SecretStoreSpec{
+            Provider: &v1.SecretStoreProvider{
+                AWS: &provider.Spec,
+            },
+        }, nil
+    }
+}
+```
+
+### Generation Process
+
+1. **Discovery**: Walk `providers/v2/` to find all `provider.yaml` files
+2. **Validation**: Validate YAML against JSON schema to ensure correctness
+3. **Template Data Preparation**:
+   - Parse YAML into structured configuration
+   - Deduplicate imports and generate aliases
+   - Build template data with GVK mappings and import information
+4. **Code Generation**:
+   - Execute `main.go.tmpl` template
+   - Execute `Dockerfile.tmpl` template
+5. **Formatting**: Run `goimports` to format and organize imports
+6. **Output**: Write generated files to provider directory
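
Steps 3 to 5 can be sketched with the standard library. This is a toy version under stated assumptions: `Config` and `Store` are hypothetical slices of the parsed `provider.yaml`, `mainTmpl` is far smaller than a real `main.go.tmpl`, and `go/format` is used in place of `goimports` so that the example has no external dependency.

```go
package main

import (
	"bytes"
	"go/format"
	"text/template"
)

// Store is a hypothetical slice of the provider.yaml data handed to the template.
type Store struct {
	Kind  string
	Alias string // import alias for the v1 provider package
	Path  string // import path of the v1 provider package
	Func  string // constructor name
}

// Config is the template data for one provider.
type Config struct {
	Name   string
	Stores []Store
}

// mainTmpl is a toy template, much smaller than a real main.go.tmpl.
const mainTmpl = `// Code generated for provider {{.Name}}. DO NOT EDIT.
package main

import (
{{- range .Stores}}
	{{.Alias}} "{{.Path}}"
{{- end}}
)

var providers = map[string]any{
{{- range .Stores}}
	"{{.Kind}}": {{.Alias}}.{{.Func}}(),
{{- end}}
}
`

// Render executes the template and formats the result with gofmt rules.
// Formatting doubles as a syntax check: output that does not parse is rejected.
func Render(cfg Config) (string, error) {
	tmpl, err := template.New("main.go").Parse(mainTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, cfg); err != nil {
		return "", err
	}
	src, err := format.Source(buf.Bytes())
	if err != nil {
		return "", err
	}
	return string(src), nil
}
```

Running the formatter on the output is useful beyond cosmetics: a configuration value that produces invalid Go, such as an import alias starting with a digit, fails at generation time rather than at compile time.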
+
+### Schema Validation
+
+The JSON schema enforces:
+- Required fields (`provider.name`, `provider.displayName`)
+- At least one of `stores` or `generators` must be defined
+- Proper GVK structure for all mappings
+- Valid package paths
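
The rules above are enforced by a JSON schema in the design. To show what they amount to, the sketch below expresses the same checks as hand-written Go; `ProviderConfig`, `Mapping`, and `Validate` are hypothetical names, and the package-path check is reduced to a non-empty test.

```go
package main

import (
	"errors"
	"fmt"
)

// GVK identifies a store or generator kind.
type GVK struct{ Group, Version, Kind string }

// Mapping ties a GVK to the package implementing it.
type Mapping struct {
	GVK     GVK
	Package string
}

// ProviderConfig is a hypothetical in-memory form of provider.yaml.
type ProviderConfig struct {
	Name        string
	DisplayName string
	Stores      []Mapping
	Generators  []Mapping
}

// Validate collects every rule violation instead of stopping at the first.
func Validate(cfg ProviderConfig) error {
	var errs []error
	if cfg.Name == "" {
		errs = append(errs, errors.New("provider.name is required"))
	}
	if cfg.DisplayName == "" {
		errs = append(errs, errors.New("provider.displayName is required"))
	}
	if len(cfg.Stores)+len(cfg.Generators) == 0 {
		errs = append(errs, errors.New("at least one of stores or generators must be defined"))
	}
	for i, m := range cfg.Stores {
		errs = append(errs, checkMapping(fmt.Sprintf("stores[%d]", i), m)...)
	}
	for i, m := range cfg.Generators {
		errs = append(errs, checkMapping(fmt.Sprintf("generators[%d]", i), m)...)
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

// checkMapping verifies GVK completeness and that a package path is present.
func checkMapping(path string, m Mapping) []error {
	var errs []error
	if m.GVK.Group == "" || m.GVK.Version == "" || m.GVK.Kind == "" {
		errs = append(errs, fmt.Errorf("%s.gvk needs group, version and kind", path))
	}
	if m.Package == "" {
		errs = append(errs, fmt.Errorf("%s: package path is required", path))
	}
	return errs
}
```

Reporting all violations at once matches how schema validators usually behave and spares the author a fix-and-rerun loop.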
+
+### Integration
+
+Makefile targets provide the interface to the generator:
+
+```bash
+make generate-providers  # Generate all provider files
+make verify-providers    # Check if files are up-to-date
+```
+
+CI verification ensures generated files remain synchronized with configuration.
+
+## Consequences
+
+### Positive
+
+- **Reduced Boilerplate**: Eliminates 150+ lines of repetitive code per provider
+- **Centralized Evolution**: Improvements to startup logic propagate to all providers via template updates
+- **Type Safety**: Schema validation catches configuration errors before code generation
+- **Consistency**: All providers follow identical patterns, reducing cognitive load
+- **Fast Onboarding**: New providers require only YAML configuration and spec mapper logic
+- **Import Management**: Generator handles deduplication and aliasing automatically
+- **Verifiable**: CI can detect drift between configuration and generated code
+
+### Negative
+
+- **Indirection**: Debugging requires understanding template system
+- **Build Complexity**: Additional step in development workflow
+- **Tool Dependency**: Requires `goimports` for formatting
+- **Schema Maintenance**: Changes to common patterns require schema and template updates
+- **Generated Code Friction**: Cannot directly edit `main.go`—must modify template or configuration
+
+### Neutral
+
+- **Hybrid Approach**: Provider-specific logic (`GetSpecMapper`) remains manual, requiring judgment about what to template
+- **Template Language**: Go templates have limitations compared to programmatic generation
+- **Verification Required**: CI must enforce that generated files match configuration
+
+## Alternatives Considered
+
+### Full Manual Implementation
+Rejected because maintenance burden scales linearly with provider count and cross-cutting changes become expensive.
+
+### Pure Library Approach
+Rejected because providers need different combinations of stores and generators, and compile-time type safety requires different imports and initialization code per provider.
+
+### Runtime Configuration
+Rejected because Go's static typing requires compile-time knowledge of which provider implementations to link, and dynamic loading has security and deployment implications.
+
+## Notes
+
+The generator intentionally keeps the spec mapper logic manual because it involves provider-specific type conversions that vary significantly between providers. Templating this logic would create more complexity than it eliminates.
+
+Future enhancements may include automatic discovery of v1 providers to reduce configuration requirements further.

+ 58 - 0
design/014/README.md

@@ -0,0 +1,58 @@
+# Out of Process Providers
+
+This design document is split into separate sub-documents. Each sub-document analyzes different approaches and discusses trade-offs.
+
+## Overview
+
+- [API.md](./API.md) - API Design for Out-of-Process Providers
+- [mTLS.md](./mTLS.md) - mTLS & Service Discovery for Out-of-Process Providers
+- [ADAPTER.md](./ADAPTER.md) - Out-of-Process Provider Adapter Pattern
+- [PROVIDER_AUTOGEN.md](./PROVIDER_AUTOGEN.md) - Provider Code Generation
+- [TESTING.md](./TESTING.md) - V2 Provider Testing Strategy
+- [THREAT_MODEL.md](./THREAT_MODEL.md) - Threat Model for Out-of-Process Providers
+
+## Scope
+
+This document covers the technical aspects of out-of-process providers. 
+
+Multi-repository structure, user migration and graduation criteria are out of scope and will be covered in a separate design document.
+
+## Problem Statements
+
+For detailed discussion, see: [recording](https://zoom.us/rec/share/fioR9a-blopYRqALdvphBU2rXN-To3BAuMOZK1FdpxUIIe31Qam3oWDxGWFHKTT9.UmkUa-3W5abZww8-), [meeting notes](https://zoom-lfx.platform.linuxfoundation.org/meeting/95703312466-1761822000000/summaries?password=47fac69c-506a-408d-a424-ff3c1c69a0dc), [drawings](https://excalidraw.com/#room=bf94104fddcd05c32c20,BTAv5naTXFc-6m0275D0_g), and [this previous PR](https://github.com/external-secrets/external-secrets/pull/4792). There's also an [old issue](https://github.com/external-secrets/external-secrets/issues/696) with more research and history, although outdated.
+
+#### Risk of Supply-Chain Attacks
+
+ESO bundles many dependencies, creating a large attack surface. We need to reduce this surface area to improve security.
+
+#### Users want to maintain their own providers without merging into core ESO
+
+ESO provides limited support for providers built and maintained outside the ESO domain. Enterprises with custom APIs or workflows for managing secrets cannot integrate them effectively. Currently, only the webhook provider supports such integrations, but it lacks features like `GetAllSecrets()` and advanced authentication methods such as OAuth flows.
+
+#### Independence of Provider Maintainers
+
+Provider maintainers need the freedom to change their APIs. This is not possible today because everything is bundled in ESO and the GA `SecretStore` CRD cannot be changed in breaking ways.
+
+#### The ESO core team does not want to maintain everything
+
+With approximately 40 providers bundled in ESO, maintaining them has become a burden. We want to shift responsibility to the community for provider maintenance.
+
+## Requirements
+
+#### A Provider must run in a separate Pod
+
+Each provider must run in a separate Pod to enable network isolation. When users run multiple providers, each can have its own network policy. This enforces the principle of least privilege at the network level. Core ESO does not need access to anything outside the cluster. All secrets flow through provider pods.
+
+Core ESO and providers must also run with separate RBAC permissions to enforce least privilege. We discourage sidecar deployments. ESO must provide a secure-by-default deployment that allows users to apply NetworkPolicies post-deployment.
+
+#### A Provider must be built with a minimum set of dependencies
+
+A provider binary should typically import only the ESO `/api` package and the provider-specific SDK. This keeps providers lightweight, easy to audit, and easy to keep up to date.
+
+#### ESO Core must be built without provider dependencies
+
+Removing provider dependencies from the ESO Core build reduces its attack surface.
+
+#### Provider code must stay coherent
+
+We must not fork or branch the existing provider source code for the out-of-process provider implementation. Provider code must remain coherent to avoid the maintenance burden of keeping multiple versions in sync.

+ 404 - 0
design/014/TESTING.md

@@ -0,0 +1,404 @@
+# V2 Provider Testing Strategy
+
+## Context
+
+The v2 provider architecture introduces out-of-process providers communicating via gRPC. This architectural shift adds network communication, connection pooling, TLS/mTLS handling, and distributed system concerns that were absent in the v1 in-process model.
+
+Testing must validate:
+- **Functional Equivalence**: V2 providers produce identical results to v1 providers
+- **Performance Characteristics**: Quantify the impact of the network hop and identify bottlenecks
+- **Reliability**: Handle network failures, certificate rotation, and provider restarts gracefully
+- **Security**: Enforce TLS requirements, namespace isolation, and certificate validation
+- **Operational Patterns**: Support rolling deployments, scaling, and common operational scenarios
+
+The existing e2e test suite validates provider behavior in the v1 in-process model. We must ensure these tests pass against v2 providers while adding new tests for v2-specific concerns.
+
+## Problem Description
+
+V2 testing faces challenges that v1 testing did not:
+
+1. **Distributed System Complexity**: Failures can occur in network communication, TLS handshakes, connection pooling, or remote provider processes
+2. **Performance Unknowns**: The network hop introduces latency, but connection pooling and caching should mitigate impact. We lack quantitative data.
+3. **Operational Scenarios**: Certificate rotation, rolling updates, and provider scaling are new operational concerns requiring validation
+4. **Conformance Across Providers**: With out-of-tree providers, we need standardized validation that each provider correctly implements the protocol
+5. **Test Environment Complexity**: Tests must deploy and manage provider processes, service networking, and TLS infrastructure
+
+Without comprehensive testing, we risk:
+- Silent correctness issues where v2 produces different results than v1
+- Performance regressions that degrade user experience
+- Production failures during certificate rotation or rolling updates
+- Provider implementations that deviate from expected behavior
+- Inability to confidently recommend v2 adoption
+
+## Decision
+
+Implement a multi-layered testing strategy covering unit, integration, e2e, conformance, performance, and disruption testing.
+
+### 1. E2E Test Migration
+
+**Objective**: Ensure existing provider behavior works identically with v2 architecture.
+
+**Approach**:
+- Run the complete existing e2e test suite against v2 provider configurations
+- Each test case that currently uses v1 `SecretStore` gets a v2 equivalent using `Provider` resources
+- Deploy providers as separate services in the test environment
+- Configure TLS for provider communication in test clusters
+
+**Test Matrix**:
+```
+Provider (AWS, GCP, Vault, etc.)
+  × Store Type (SecretStore, ClusterSecretStore)
+  × Auth Mode (ManifestNamespace, ProviderNamespace)
+  × Feature (GetSecret, GetAllSecrets, PushSecret, DeleteSecret)
+  × Data Format (string, binary, JSON templating, dataFrom)
+```
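
The matrix is a Cartesian product, so its size grows quickly. The sketch below enumerates it for the dimension values named above. It is only an illustration: `Case`, `Matrix`, and `exampleMatrix` are hypothetical names, and the provider list is limited to the three providers spelled out in the matrix.

```go
package main

// Case is one cell of the test matrix.
type Case struct {
	Provider, StoreType, AuthMode, Feature, DataFormat string
}

// Matrix returns the Cartesian product of all dimensions.
func Matrix(providers, storeTypes, authModes, features, formats []string) []Case {
	var cases []Case
	for _, p := range providers {
		for _, s := range storeTypes {
			for _, a := range authModes {
				for _, f := range features {
					for _, d := range formats {
						cases = append(cases, Case{p, s, a, f, d})
					}
				}
			}
		}
	}
	return cases
}

// exampleMatrix uses the dimension values listed in the matrix above.
func exampleMatrix() []Case {
	return Matrix(
		[]string{"AWS", "GCP", "Vault"},
		[]string{"SecretStore", "ClusterSecretStore"},
		[]string{"ManifestNamespace", "ProviderNamespace"},
		[]string{"GetSecret", "GetAllSecrets", "PushSecret", "DeleteSecret"},
		[]string{"string", "binary", "JSON templating", "dataFrom"},
	)
}
```

With three providers this already yields 3 × 2 × 2 × 4 × 4 = 192 cases, which suggests generating the cases rather than writing them by hand.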
+
+**Implementation**:
+- Create test utilities that generate v2 provider configurations from existing v1 tests
+- Establish provider deployment helpers for test environments
+- Run v1 and v2 tests in parallel during transition period to detect regressions
+
+**Success Criteria**:
+- 100% of v1 e2e tests pass with v2 providers
+- No behavioral differences between v1 and v2 results
+
+### 2. Certificate Management Tests
+
+**Objective**: Validate TLS certificate lifecycle operations work without service disruption.
+
+**Test Scenarios**:
+
+**2.1 Certificate Rotation**
+- Issue initial certificates for ESO→Provider communication
+- Rotate server certificate while maintaining client connections
+- Rotate client certificate while maintaining connectivity
+- Rotate CA certificate with overlap period
+- Verify: Zero failed reconciliations during rotation
+- Verify: Connection pool handles certificate updates gracefully
+
+**2.2 Certificate Expiration**
+- Configure short-lived certificates (5 minutes)
+- Run reconciliation loop through multiple certificate renewals
+- Verify: Automatic certificate refresh before expiration
+- Verify: Clear error messages if certificate expires
+
+**2.3 Certificate Validation Failures**
+- Present invalid server certificate (wrong CN, expired, self-signed)
+- Present mismatched CA certificate
+- Present revoked certificate
+- Verify: Connections rejected with clear error messages
+- Verify: Controller status reflects TLS validation failures
+
+**2.4 mTLS Configuration**
+- Enable mutual TLS with client certificates
+- Verify: Provider rejects connections without valid client cert
+- Verify: ESO successfully authenticates with client cert
+- Rotate both client and server certificates
+
+**Implementation**:
+- Integrate cert-manager in test environments for automated certificate issuance
+- Create test scenarios with Kubernetes TLS secrets
+- Use short certificate lifetimes to accelerate rotation testing
+
+### 3. Provider Conformance Suite
+
+**Objective**: Standardize validation that providers correctly implement the v2 protocol.
+
+**Approach**: Create a reusable test library (`providers/v2/conformance`) that provider implementers run against their implementations.
+
+**Core Conformance Tests**:
+
+**3.1 Secret Operations**
+- `GetSecret`: Retrieve secret by key, handle missing secrets, decode strategies
+- `GetSecretMap`: Retrieve key-value maps (deprecated but supported)
+- `GetAllSecrets`: Find secrets by path, tags, regex, conversion strategies
+- `PushSecret`: Write secrets with properties, metadata, idempotency
+- `DeleteSecret`: Remove secrets, handle non-existent deletions
+- `SecretExists`: Check existence without retrieving data
+
+**3.2 Error Semantics**
+- Return `NoSecretError` for missing secrets with correct deletion policy behavior
+- Return validation errors for malformed requests
+- Return permission errors when authentication fails
+- Propagate timeouts and retryable vs non-retryable errors correctly
+
+**3.3 Authentication**
+- Respect namespace boundaries for secret references
+- Support multiple authentication methods (IAM, service accounts, static credentials)
+- Handle credential refresh and expiration
+- Reject cross-namespace access when not permitted
+
+**3.4 Protocol Compliance**
+- Respond to health checks correctly
+- Implement graceful shutdown
+- Handle concurrent requests safely
+- Respect context cancellation
+- Return proper gRPC status codes
+
+**3.5 Provider Metadata**
+- Return correct capabilities (ReadOnly, WriteOnly, ReadWrite)
+- Validate provider configuration
+- Provide meaningful validation warnings
+
+**Implementation**:
+- Conformance tests as Go package: `import "github.com/external-secrets/external-secrets/providers/v2/conformance"`
+- Provider tests instantiate conformance suite with their provider implementation
+- CI integration: gate provider releases on conformance pass
+- Versioned conformance suite (v2alpha1, v2beta1) for compatibility testing
+
+**Success Criteria**:
+- All official providers pass 100% of conformance tests
+- Conformance suite runs in under 2 minutes
+- Clear failure messages identifying non-compliant behavior
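
A reusable suite of this kind could take any client and return one message per violated expectation. The sketch below is a minimal illustration, not the proposed `providers/v2/conformance` package: `SecretsClient` is reduced to three methods, `ErrNoSecret` stands in for `NoSecretError`, and `memClient` and `runAgainstMemory` are hypothetical helpers.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
)

// ErrNoSecret is a stand-in for the NoSecretError mentioned above.
var ErrNoSecret = errors.New("secret not found")

// SecretsClient is a reduced, hypothetical version of the provider interface.
type SecretsClient interface {
	GetSecret(ctx context.Context, key string) ([]byte, error)
	PushSecret(ctx context.Context, key string, value []byte) error
	DeleteSecret(ctx context.Context, key string) error
}

type check struct {
	name string
	run  func(ctx context.Context, c SecretsClient) error
}

var checks = []check{
	{"push then get round-trips", func(ctx context.Context, c SecretsClient) error {
		if err := c.PushSecret(ctx, "conformance/a", []byte("v1")); err != nil {
			return err
		}
		got, err := c.GetSecret(ctx, "conformance/a")
		if err != nil {
			return err
		}
		if !bytes.Equal(got, []byte("v1")) {
			return fmt.Errorf("got %q, want %q", got, "v1")
		}
		return nil
	}},
	{"missing secret returns ErrNoSecret", func(ctx context.Context, c SecretsClient) error {
		if _, err := c.GetSecret(ctx, "conformance/missing"); !errors.Is(err, ErrNoSecret) {
			return fmt.Errorf("got error %v, want ErrNoSecret", err)
		}
		return nil
	}},
	{"deleting a non-existent secret succeeds", func(ctx context.Context, c SecretsClient) error {
		return c.DeleteSecret(ctx, "conformance/missing")
	}},
}

// Run executes every check and returns one message per failure.
func Run(ctx context.Context, c SecretsClient) []string {
	var failures []string
	for _, ch := range checks {
		if err := ch.run(ctx, c); err != nil {
			failures = append(failures, ch.name+": "+err.Error())
		}
	}
	return failures
}

// memClient is an in-memory provider; missingErr controls what a miss returns.
type memClient struct {
	data       map[string][]byte
	missingErr error
}

func (m *memClient) GetSecret(_ context.Context, key string) ([]byte, error) {
	v, ok := m.data[key]
	if !ok {
		return nil, m.missingErr
	}
	return v, nil
}

func (m *memClient) PushSecret(_ context.Context, key string, value []byte) error {
	m.data[key] = value
	return nil
}

func (m *memClient) DeleteSecret(_ context.Context, key string) error {
	delete(m.data, key)
	return nil
}

// runAgainstMemory runs the suite against memClient with the given miss error.
func runAgainstMemory(missingErr error) []string {
	return Run(context.Background(), &memClient{data: map[string][]byte{}, missingErr: missingErr})
}
```

Calling `runAgainstMemory(ErrNoSecret)` reports no failures, while `runAgainstMemory(nil)` models a provider that silently returns nothing for a missing secret and is flagged by the second check.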
+
+### 4. Performance and Load Tests
+
+**Objective**: Quantify performance impact of v2 architecture and identify breaking points.
+
+**4.1 Baseline Performance Comparison**
+
+Compare v1 (in-process) vs v2 (gRPC) on identical workloads:
+
+**Metrics**:
+- Secret fetch latency (p50, p95, p99)
+- Reconciliation latency (ExternalSecret update to Secret ready)
+- Throughput (secrets/second)
+- CPU usage (controller and provider)
+- Memory usage (controller and provider)
+- Network bandwidth
+- Connection pool statistics (active, idle, created, reused)
+
+**Test Scenarios**:
+```
+Small: 100 ExternalSecrets, 1 secret each, 5m refresh
+Medium: 1000 ExternalSecrets, 3 secrets each, 5m refresh
+Large: 5000 ExternalSecrets, 10 secrets each, 5m refresh
+Burst: 1000 ExternalSecrets created simultaneously
+```
+
+**Implementation**:
+- Deploy both v1 and v2 configurations in identical clusters
+- Use Prometheus to capture metrics
+- Generate ExternalSecrets with controlled characteristics
+- Run tests for 30 minutes to measure steady-state performance
+- Export results to comparable format (CSV, JSON)
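
When latency samples are exported rather than read from Prometheus histograms, the p50, p95, and p99 values listed under Metrics can be computed directly. The sketch below uses the nearest-rank method on latencies in milliseconds; the function name and the choice of method are assumptions, and Prometheus' own `histogram_quantile` interpolates and can give slightly different numbers.

```go
package main

import "sort"

// Percentile returns the nearest-rank percentile p (1..100) of samples.
// It returns 0 for an empty input or an out-of-range p.
func Percentile(samples []float64, p int) float64 {
	if len(samples) == 0 || p < 1 || p > 100 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	// Nearest rank: ceil(p/100 * n), computed with integers to avoid rounding issues.
	rank := (p*len(sorted) + 99) / 100
	return sorted[rank-1]
}
```

For 100 samples of 1 ms to 100 ms, this returns 50, 95, and 99 for p50, p95, and p99 respectively.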
+
+**4.2 Connection Pool Behavior**
+
+Validate connection pooling effectiveness:
+
+**Test Cases**:
+- Measure connection reuse rate under steady load
+- Verify connection pool respects max connection limits
+- Validate idle connection timeout and cleanup
+- Measure connection establishment latency
+- Test connection recovery after network blip
+
+**4.3 Cache Effectiveness**
+
+Measure client manager cache hit rates:
+
+**Metrics**:
+- Cache hit ratio by provider type
+- Cache invalidation frequency and causes
+- Memory consumption by cache size
+- Impact of generation changes on cache invalidation
+
+**4.4 Breaking Point Analysis**
+
+Identify system limits:
+
+**Approach**:
+- Incrementally increase load until degradation or failure
+- Vary: Number of ExternalSecrets, secrets per ExternalSecret, refresh frequency
+- Measure: When does latency exceed SLO? When do errors begin? What fails first?
+- Compare: Where does v2 break compared to v1?
+
+**Implementation**:
+- Use load generation tools (custom or k6/locust adapted for Kubernetes)
+- Monitor resource exhaustion (CPU, memory, file descriptors, connections)
+- Capture system behavior at breaking point (logs, metrics, traces)
+
+**Success Criteria**:
+- v2 adds ≤50ms p95 latency compared to v1 under medium load
+- v2 throughput is at least 80% of v1 throughput
+- v2 handles ≥1000 concurrent ExternalSecrets per controller
+- Connection pool prevents connection exhaustion
+- Clear documentation of performance characteristics and limits
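
The two quantitative criteria above (added p95 latency and relative throughput) could be checked mechanically once latency samples are exported. The following is a minimal sketch, assuming latencies are exported in milliseconds and that "within 80%" means v2 reaches at least 80% of v1 throughput; `percentile` and `meetsCriteria` are hypothetical helper names, not part of ESO.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (p in 1..100) of the
// samples. It returns 0 for an empty slice.
func percentile(samplesMs []float64, p int) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samplesMs...) // do not mutate the input
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer ceil(p*n/100)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// meetsCriteria checks the latency and throughput criteria:
// v2 adds at most 50ms at p95, and v2 throughput is at least 80% of v1.
// The 80% interpretation is an assumption of this sketch.
func meetsCriteria(v1LatMs, v2LatMs []float64, v1Tput, v2Tput float64) bool {
	addedP95 := percentile(v2LatMs, 95) - percentile(v1LatMs, 95)
	return addedP95 <= 50 && v2Tput >= 0.8*v1Tput
}

func main() {
	// Illustrative numbers only, not measured results.
	v1 := []float64{10, 12, 15, 40}
	v2 := []float64{25, 30, 35, 80}
	fmt.Println("v1 p95 (ms):", percentile(v1, 95))
	fmt.Println("v2 p95 (ms):", percentile(v2, 95))
	fmt.Println("criteria met:", meetsCriteria(v1, v2, 100, 85))
}
```

The sample values are made up for illustration; real runs would feed in the data exported from Prometheus.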
+
+### 5. Disruption and Chaos Tests
+
+**Objective**: Validate system resilience during operational disruptions.
+
+**5.1 Rolling Deployments**
+
+Test rolling updates of ESO controller and providers:
+
+**Scenarios**:
+- Roll ESO controller pods while providers run
+- Roll provider pods while ESO reconciles
+- Roll both simultaneously
+- Vary: Deployment strategy (RollingUpdate, Recreate), replica count, update velocity
+
+**Measurements**:
+- Reconciliation success rate during rollout
+- Latency increase during rollout
+- Connection pool behavior during pod replacement
+- Error rate and recovery time
+- Number of failed secret fetches
+
+**5.2 Network Failures**
+
+Simulate network issues between ESO and providers:
+
+**Test Cases**:
+- Complete network partition (10s, 60s, 5m)
+- Packet loss (5%, 20%, 50%)
+- Latency injection (+50ms, +500ms, +5s)
+- DNS resolution failures
+- Service endpoint unavailability
+
+**Measurements**:
+- Retry behavior and backoff
+- Circuit breaking (if implemented)
+- Error propagation to ExternalSecret status
+- Recovery time after network restoration
+- Connection pool health monitoring effectiveness
+
+**5.3 Provider Failures**
+
+Test provider process failures:
+
+**Scenarios**:
+- Graceful shutdown (SIGTERM)
+- Forced termination (SIGKILL)
+- Provider panic/crash
+- Provider deadlock/hang
+- OOM kill
+
+**Measurements**:
+- Health check detection time
+- Connection pool marking unhealthy connections
+- Automatic reconnection attempts
+- User-visible error messages in ExternalSecret status
+- Time to recovery after provider restart
+
+**5.4 Certificate Issues**
+
+Inject certificate problems:
+
+**Test Cases**:
+- Expire server certificate mid-operation
+- Revoke certificate
+- Change CA without updating client
+- TLS handshake timeout
+- Certificate chain validation failure
+
+**Measurements**:
+- Error detection latency
+- Error message clarity
+- Automatic recovery after certificate fix
+
+**5.5 Resource Contention**
+
+Test behavior under resource pressure:
+
+**Scenarios**:
+- CPU throttling (limit provider to 100m CPU)
+- Memory pressure (limit provider to 128Mi)
+- Disk I/O saturation
+- High concurrent request load
+
+**Measurements**:
+- Graceful degradation vs hard failure
+- Request timeout behavior
+- Resource limit enforcement
+- Queue buildup and backpressure
+
+**Implementation**:
+- Use chaos engineering tools (Chaos Mesh, Litmus)
+- Automate disruption injection in test suite
+- Run continuously in staging environments
+- Generate chaos test reports with metrics and logs
+
+**Success Criteria**:
+- Zero data corruption during disruptions
+- <5% error rate during rolling updates
+- Automatic recovery within 2 minutes after disruption ends
+- Clear error messages visible in ExternalSecret status
+- No panics or crashes in ESO controller
+
+### 6. Additional Recommended Tests
+
+**6.1 Version Skew Tests**
+
+Validate compatibility across version combinations:
+
+**Matrix**:
+- ESO version N with Provider version N-1, N, N+1
+- Protocol version compatibility
+- Deprecated field handling
+- Forward/backward compatibility
+
+**6.2 Metrics Validation**
+
+Ensure observability correctness:
+
+**Tests**:
+- Verify all metrics are emitted correctly
+- Validate metric labels and cardinality
+- Check metrics match actual system behavior
+- Ensure no metrics memory leaks
+
+**6.3 Concurrent Operations**
+
+Test race conditions and concurrent access:
+
+**Scenarios**:
+- Multiple ExternalSecrets referencing same Provider simultaneously
+- Rapid ExternalSecret create/delete cycles
+- Concurrent provider client cache access
+- Connection pool concurrent get/release
+
+**6.4 Error Recovery**
+
+Test recovery from error states:
+
+**Scenarios**:
+- Provider becomes healthy after being unhealthy
+- Invalid configuration fixed
+- Credentials updated after auth failure
+- Network restored after partition
+
+**6.5 Migration Tests**
+
+Validate v1 to v2 migration:
+
+**Tests**:
+- Switch ExternalSecret from v1 to v2 store without data loss
+- Run mixed v1/v2 workloads simultaneously
+- Gradual provider migration
+- Rollback from v2 to v1
+
+## Consequences
+
+### Positive
+
+- **Confidence**: Comprehensive testing enables confident v2 recommendation to users
+- **Quality**: Conformance suite ensures provider consistency
+- **Performance Insight**: Quantitative data informs optimization priorities
+- **Operational Readiness**: Disruption tests validate production scenarios
+- **Regression Prevention**: Automated testing catches regressions early
+
+### Negative
+
+- **Test Infrastructure Complexity**: Managing provider deployments increases test environment complexity
+- **Execution Time**: Comprehensive testing takes longer than v1 tests
+- **Maintenance Burden**: More tests require ongoing maintenance
+- **Resource Cost**: Performance and chaos tests consume significant compute resources
+
+### Neutral
+
+- **Gradual Rollout**: Testing strategy supports phased v2 adoption
+- **Provider Responsibility**: Out-of-tree providers own their conformance test execution
+- **Tooling Requirements**: Requires investment in test tooling and infrastructure

+ 329 - 0
design/014/THREAT_MODEL.md

@@ -0,0 +1,329 @@
+# Threat Model for Out-of-Process Providers
+
+## Overview
+
+This document analyzes security threats for the out-of-process provider architecture where providers run in separate pods and communicate with ESO Core via gRPC over mTLS.
+
+## Architecture Summary
+
+**Components:**
+- **ESO Core**: Kubernetes controller that reconciles ExternalSecret resources
+- **Out-of-Process Providers**: Separate pods that fetch secrets from external APIs (AWS, Azure, etc.)
+- **gRPC Communication**: mTLS-secured channel between ESO Core and providers
+- **Certificate Authority**: ESO Core manages certificates for provider communication
+
+**Key Properties:**
+- Providers run in separate pods with isolated network policies
+- ESO Core and providers run with different RBAC permissions
+- ESO Core acts as CA and distributes certificates via Kubernetes secrets
+- Providers fetch secrets from external APIs using their own credentials
+- ESO Core does not have direct access to external secret systems
+
+## Assets
+
+What needs protection:
+
+1. **Secrets in Transit**: Secrets returned by providers to ESO Core over gRPC
+2. **TLS Certificates**: CA private key, server certificates, client certificates
+3. **Provider Configuration**: Credentials and configuration stored in provider-specific CRs
+4. **ESO Core Configuration**: Controller configuration and RBAC permissions
+5. **External API Credentials**: AWS IAM roles, Azure service principals, etc. used by providers
+6. **Kubernetes Resources**: ExternalSecret, Provider, and provider-specific custom resources
+
+## Trust Boundaries
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ Kubernetes Cluster                                          │
+│                                                             │
+│  ┌─────────────────┐                  ┌──────────────────┐  │
+│  │ ESO Core        │  mTLS/gRPC       │ Provider Pods    │  │
+│  │ Namespace       │◄────────────────►│ (various NS)     │  │
+│  │                 │                  │                  │  │
+│  │ - Controller    │                  │ - AWS Provider   │──┼──► AWS API
+│  │ - CA Secret     │                  │ - Azure Provider │──┼──► Azure API
+│  └─────────────────┘                  └──────────────────┘  │
+│         │                                      │            │
+│         │                                      │            │
+│         ▼                                      ▼            │
+│  ┌─────────────────┐                  ┌──────────────────┐  │
+│  │ ESO Secrets     │                  │ Provider Secrets │  │
+│  │ - CA private key│                  │ - TLS certs      │  │
+│  └─────────────────┘                  │ - Config/creds   │  │
+│                                       └──────────────────┘  │
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Trust Boundaries:**
+1. Between ESO Core and Kubernetes API
+2. Between ESO Core and provider pods (mTLS)
+3. Between provider pods and external secret systems
+4. Between namespaces (RBAC and network policies)
+5. Between pod and mounted secrets
+
+## Threat Actors
+
+### External Attackers
+
+**Capabilities:**
+- Network access to cluster (varies by deployment)
+- Ability to observe network traffic
+- Social engineering or supply chain attacks
+
+**Goals:**
+- Steal secrets
+- Compromise workloads
+- Establish persistence
+
+### Malicious Pod in Cluster
+
+**Capabilities:**
+- Limited RBAC permissions
+- Access to pod network
+- Access to secrets mounted into the pod
+- Ability to make API requests within assigned permissions
+
+**Goals:**
+- Escalate privileges
+- Access secrets from other namespaces
+- Intercept communication between ESO and providers
+
+### Compromised Provider
+
+**Capabilities:**
+- Valid mTLS server certificates
+- Access to external secret system credentials
+- Ability to communicate with ESO Core
+
+**Goals:**
+- Return malicious secrets or cause denial of service (DoS)
+- Exfiltrate secret requests
+- Pivot to compromise other providers or ESO Core
+
+### Compromised ESO Core
+
+**Capabilities:**
+- CA private key access
+- Cross-namespace secret write permissions
+- Ability to generate arbitrary certificates
+
+**Goals:**
+- Access all secrets across all providers
+- Modify secrets in target namespaces
+- Impersonate providers
+
+## Threat Analysis
+
+### T1: Man-in-the-Middle Attack on gRPC Communication
+
+**Threat:** Attacker intercepts communication between ESO Core and provider to steal secrets in transit.
+
+**Attack Vector:**
+- Network position between ESO Core and provider pods
+- ARP spoofing or DNS hijacking within cluster
+- Compromised network infrastructure
+
+**Mitigation:**
+- ✅ mTLS enforced for all ESO-to-provider communication
+- ✅ Mutual authentication prevents impersonation
+- ✅ Certificate validation ensures proper identity
+- ⚠️ NetworkPolicies should be configured to restrict communication paths
+
+**Residual Risk:** Low (with proper NetworkPolicy configuration)
+
+---
+
+### T2: Certificate Authority Compromise
+
+**Threat:** Attacker gains access to CA private key and can issue arbitrary certificates.
+
+**Attack Vector:**
+- Compromise ESO Core pod
+- Direct access to ESO namespace secrets
+- Kubernetes API server compromise
+
+**Mitigation:**
+- ✅ CA private key only stored in ESO namespace
+- ✅ CA private key never distributed to provider namespaces
+- ✅ Strict RBAC limits access to ESO namespace
+- ⚠️ Consider using external system for CA key storage (KMS/HSM)
+
+**Impact:** Critical - full compromise of provider authentication
+
+**Residual Risk:** Medium (depends on cluster hardening)
+
+---
+
+### T3: Rogue Provider Registration
+
+**Threat:** Attacker deploys malicious provider and gets ESO to communicate with it.
+
+**Attack Vector:**
+- Deploy service with `external-secrets.io/provider: "true"` label
+- ESO generates certificates for the rogue service
+- Create Provider resource pointing to rogue service
+- Intercept or manipulate secret requests
+
+**Mitigation:**
+- ✅ Provider-specific configuration still required (can't serve AWS secrets without AWS credentials)
+- ✅ RBAC controls who can create Provider resources
+- ⚠️ Consider admission webhook to validate Provider resources
+- ⚠️ Monitor for unexpected service labels
+
+**Impact:** Low - the attacker can't access external secrets without proper credentials. Provider impersonation may be a DoS vector.
+
+**Residual Risk:** Low
+
+---
+
+### T4: Provider Impersonation
+
+**Threat:** Attacker impersonates legitimate provider to return malicious secrets.
+
+**Attack Vector:**
+- Steal provider TLS certificates
+- Deploy service at same address as legitimate provider
+- Respond to ESO requests with malicious data
+
+**Mitigation:**
+- ✅ mTLS with certificate validation prevents impersonation
+- ✅ Certificates tied to specific DNS names
+- ✅ NetworkPolicies can restrict communication paths
+- ⚠️ Certificate rotation limits window of compromised certs
+
+**Residual Risk:** Low
+
+---
+
+### T5: Supply Chain Attack via Provider Dependencies
+
+**Threat:** Malicious code in provider dependencies exfiltrates secrets or credentials.
+
+**Attack Vector:**
+- Compromised provider SDK
+- Vulnerable dependencies in provider binary
+- Backdoored base images
+
+**Mitigation:**
+- ✅ Providers built with minimal dependencies
+- ✅ Each provider isolated - compromise doesn't affect others
+- ⚠️ Dependency scanning and SBOM generation recommended
+- ⚠️ Image signing and verification
+- ⚠️ Regular security audits of provider code
+
+**Impact:** High - a provider can access external secret systems and may hold broad RBAC permissions within Kubernetes.
+
+**Residual Risk:** Medium (ongoing supply chain risk)
+
+---
+
+### T6: Denial of Service via Certificate Exhaustion
+
+**Threat:** Attacker creates many labeled services to exhaust ESO Core resources.
+
+**Attack Vector:**
+- Deploy numerous services with provider label
+- ESO generates certificates for all services
+- Resource exhaustion prevents legitimate operations
+
+**Mitigation:**
+- ⚠️ Rate limiting on certificate generation (not yet implemented)
+- ⚠️ RBAC limits who can create services with the `external-secrets.io/provider` label
+- ⚠️ Monitoring and alerting on certificate generation rate
+- ⚠️ Resource quotas on ESO Core pod
+
+**Residual Risk:** Medium
+
+---
+
+### T7: Secrets at Rest Exposure
+
+**Threat:** Attacker gains access to secrets stored in Kubernetes etcd.
+
+**Attack Vector:**
+- Direct etcd access
+- Backup compromise
+- Kubernetes API server vulnerability
+
+**Mitigation:**
+- ⚠️ Kubernetes encryption at rest (cluster configuration)
+- ⚠️ Secure etcd access controls
+- ⚠️ Regular key rotation
+- ✅ Provider TLS certificates are short-lived (90-day validity, rotated 35 days before expiry)
+
+**Impact:** Critical - all certificates and potentially secret data exposed
+
+**Residual Risk:** Medium (depends on cluster configuration)
+
+---
+
+### T8: Malicious Secret Injection
+
+**Threat:** Provider returns malicious or incorrect secrets to ESO Core.
+
+**Attack Vector:**
+- Compromised provider pod
+- Vulnerability in provider code
+- Misconfiguration in provider
+
+**Mitigation:**
+- ⚠️ No built-in validation of secret content (by design)
+- ✅ Provider compromise doesn't grant access to ESO's secrets
+- ✅ Isolation between providers limits blast radius
+- ⚠️ External system auditing (e.g., CloudTrail for AWS)
+- ⚠️ Secret validation at application level
+
+**Impact:** Medium - workloads receive incorrect secrets
+
+**Residual Risk:** Medium
+
+---
+
+### T9: RBAC Misconfiguration
+
+**Threat:** Overly permissive RBAC allows unauthorized access to Provider resources or secrets.
+
+**Attack Vector:**
+- Misconfigured ClusterRole bindings
+- Overly broad service account permissions
+- Default namespaces with excessive permissions
+
+**Mitigation:**
+- ✅ Clear RBAC requirements documented
+- ⚠️ Principle of least privilege enforcement recommended
+- ⚠️ Regular RBAC audits
+- ⚠️ Use of namespace-scoped resources where possible
+
+**Residual Risk:** Medium (depends on user configuration)
+
+---
+
+### T10: Certificate Rotation Failure
+
+**Threat:** Certificate rotation fails, causing service disruption or security degradation.
+
+**Attack Vector:**
+- ESO Core pod failure during rotation
+- Bugs in rotation logic
+- Resource constraints preventing secret updates
+
+**Mitigation:**
+- ✅ 35-day rotation lookahead provides recovery window
+- ✅ Prometheus metrics for monitoring rotation status
+- ⚠️ Automated alerts on rotation failures
+- ⚠️ Manual intervention procedures documented
+
+**Impact:** High - service disruption and potential security degradation
+
+**Residual Risk:** Low
+
+## Assumptions
+
+This threat model assumes:
+
+1. Kubernetes cluster is properly hardened (CIS benchmarks, etc.)
+2. Network infrastructure is trusted within cluster
+3. Kubernetes RBAC is properly configured
+4. Users follow deployment best practices
+5. External secret systems (AWS, Azure, etc.) have their own security controls
+6. Workloads consuming secrets perform their own validation

+ 419 - 0
design/014/mTLS.md

@@ -0,0 +1,419 @@
+# mTLS & Service Discovery for Out-of-Process Providers
+
+## Problem Description
+
+ESO Core needs to establish a secure and authenticated communication channel with out-of-process providers. The channel has to be encrypted in transit because we transmit secrets over the network. It must also be authenticated; otherwise a malicious actor could call GetSecret() to retrieve secrets and abuse the out-of-process provider's service account to read secrets or other resources within the cluster.
+
+We have to take the following into consideration:
+
+1. How do we establish a secure connection with a provider?
+2. How do we discover providers within a Kubernetes cluster?
+3. Do we support providers outside of a Kubernetes cluster? How?
+4. How do we handle certificate rotation?
+
+## Context
+
+Out-of-process providers run in separate pods and communicate with ESO Core via gRPC. We need a certificate management system that:
+
+- Provides mutual authentication between ESO Core and providers
+- Works with in-cluster provider deployments
+- Minimizes configuration burden on users and provider developers
+- Supports automatic certificate rotation
+- Maintains security by default
+
+We do want to integrate with `cert-manager` eventually, but this is out of scope for the proposal.
+
+## Key Decisions
+
+### 1. mTLS Distribution Model: Push vs. Pull
+
+**Decision: Push Model**
+
+ESO Core generates and distributes certificates to provider namespaces.
+
+#### Architecture
+
+```
+User labels Service → ESO discovers Service → ESO generates certificates → 
+ESO creates Secret in provider namespace → Provider mounts Secret → 
+Provider starts with mTLS → ESO connects with mTLS
+```
+
+#### Push Model
+
+**How it works:**
+- ESO acts as Certificate Authority
+- ESO generates server certificates for the provider
+- ESO generates client certificates for ESO core to authenticate with a provider
+- ESO creates secrets in provider namespaces which contain the CA certificate and the server certificate and key
+- Providers mount secrets and start gRPC servers
+- ESO connects using client certificates
+
+**Pros:**
+- Simple provider implementation (mount secret, start server)
+- No circular dependencies (providers don't authenticate before receiving certs)
+- Centralized certificate management and rotation
+- Providers restart independently
+- ESO controls security policy
+- Uses standard Kubernetes secret mounting
+
+**Cons:**
+- Requires cross-namespace secret write permissions
+- Secrets stored at rest in kube-apiserver/etcd
+- Providers must implement hot certificate reload
+
+**Why chosen:** Simplicity for provider developers, no chicken-and-egg authentication problem, leverages Kubernetes primitives.
+
+#### Pull Model
+
+**How it works:**
+- Providers generate Certificate Signing Requests (CSRs) at startup
+- Providers request certificates from ESO
+- ESO signs and returns certificates
+
+**Pros:**
+- Certificates contain actual runtime addresses (DNS SANs)
+- Providers control their own keys
+- Short-lived certificates possible
+- Dynamic addressing
+
+**Cons:**
+- Complex provider implementation
+- Circular dependency: how does provider authenticate to ESO before it has certs?
+- Provider startup depends on ESO availability
+- Requires ESO to expose certificate signing API
+- Additional attack surface
+- Non-standard pattern in Kubernetes
+
+**Why rejected:** Circular dependency problem, excessive complexity, tight coupling between ESO and providers.
+
+---
+
+### 2. mTLS Discovery Method: How ESO Finds Providers
+
+**Decision: Label-Based Discovery**
+
+Services are labeled with `external-secrets.io/provider: "true"`.
+
+#### Label-Based Discovery
+
+**Configuration:**
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: provider-aws
+  namespace: external-secrets-system
+  labels:
+    external-secrets.io/provider: "true"
+spec:
+  ports:
+    - port: 8080
+```
+
+**Pros:**
+- Explicit contract (no magic)
+- Kubernetes-native pattern
+- Dynamic discovery (no restarts needed)
+- Clear intent (label means "manage my certs")
+- Works with any deployment method
+- Supports multiple providers per namespace
+
+**Cons:**
+- Users must remember to label services
+- No validation of label correctness
+
+**Why chosen:** Explicit contract over implicit behavior, Kubernetes-native, clear intent.
+
+#### Alternative 1: Parse Address from Provider Resource
+
+**How it works:**
+Extract namespace from `Provider.spec.address` (e.g., `provider-aws.external-secrets-system.svc:8080`).
+
+**Pros:**
+- Zero configuration
+- Fully automatic
+
+**Cons:**
+- Only works for in-cluster services
+- Too "magical" - no explicit contract
+- Fails on non-standard addresses
+- Doesn't handle multiple Providers pointing to same pod
+
+**Why rejected:** Too implicit, preference for explicit contracts.
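
To illustrate why address parsing is brittle, here is a minimal sketch of what such a parser might look like. `namespaceFromAddress` is a hypothetical helper, and the assumption that addresses follow `<service>.<namespace>.svc[...]:<port>` is exactly the implicit contract this alternative would rely on.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// namespaceFromAddress extracts the namespace from an in-cluster service
// address of the form <service>.<namespace>.svc[.<cluster-domain>]:<port>.
// Anything else (IPs, short names, external hosts) is rejected.
func namespaceFromAddress(addr string) (string, error) {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	if net.ParseIP(host) != nil {
		return "", errors.New("address is an IP, not a service DNS name")
	}
	parts := strings.Split(host, ".")
	if len(parts) < 3 || parts[1] == "" || parts[2] != "svc" {
		return "", fmt.Errorf("%q is not of the form <service>.<namespace>.svc", host)
	}
	return parts[1], nil
}

func main() {
	for _, addr := range []string{
		"provider-aws.external-secrets-system.svc:8080", // works
		"provider-aws:8080",                             // short name: namespace unknown
		"10.0.0.5:8080",                                 // IP address
		"provider.example.com:443",                      // external host
	} {
		ns, err := namespaceFromAddress(addr)
		fmt.Printf("%-48s namespace=%q err=%v\n", addr, ns, err)
	}
}
```

Only the first address yields a namespace; the other three are valid gRPC targets that this approach cannot handle, which is the failure mode listed under the cons.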
+
+#### Alternative 2: Static Controller Configuration
+
+**How it works:**
+Configure providers via controller flags or ConfigMap.
+
+**Pros:**
+- Simple, centralized
+- Explicit configuration
+
+**Cons:**
+- Static (requires controller restart)
+- Manual maintenance
+- Doesn't scale with dynamic deployments
+- Not Kubernetes-native
+
+**Why rejected:** Not dynamic, requires manual maintenance.
+
+---
+
+### 3. Scope: In-Cluster vs. External Providers
+
+**Decision: In-Cluster Only**
+
+Automatic certificate management only supports providers running inside the Kubernetes cluster.
+
+We will support out-of-cluster providers, but we don't manage mTLS credentials for them.
+
+**Rationale:**
+- ESO focuses on in-cluster architecture
+- External providers are edge cases
+- External providers add significant complexity
+- Users can implement their own CA infrastructure for external providers (cert-manager, etc.) and need to take care of distributing the CSRs/Certificates anyway
+
+**Out of scope:**
+- Providers running outside the cluster
+- Providers in different clusters
+- Providers with custom CA requirements
+
+---
+
+### 4. Hot Certificate Reload
+
+**Decision: Required**
+
+Providers must reload certificates without restarting when secrets are updated.
+
+**Implementation requirement:**
+```go
+// Watch certificate files for changes
+// Use tls.Config.GetCertificate callback for dynamic loading
+// Reload certificates in-memory when files change
+// Use fsnotify or similar to detect file changes
+```
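
The requirement above could be met along these lines. This is a minimal sketch using only the standard library: it polls the certificate file's modification time on each handshake as a stand-in for an fsnotify watcher, and `certReloader` is a hypothetical type, not an ESO API. The paths assume the `/etc/provider/certs` mount and the `tls.crt`/`tls.key` keys described elsewhere in this document.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"
	"sync"
	"time"
)

// certReloader caches a key pair and reloads it from disk whenever the
// certificate file's modification time changes.
type certReloader struct {
	certPath, keyPath string

	mu      sync.RWMutex
	cert    *tls.Certificate
	modTime time.Time
}

// reloadIfChanged re-reads the key pair if the certificate file changed
// since the last successful load.
func (r *certReloader) reloadIfChanged() error {
	info, err := os.Stat(r.certPath)
	if err != nil {
		return err
	}
	r.mu.RLock()
	unchanged := r.cert != nil && info.ModTime().Equal(r.modTime)
	r.mu.RUnlock()
	if unchanged {
		return nil
	}
	pair, err := tls.LoadX509KeyPair(r.certPath, r.keyPath)
	if err != nil {
		return err
	}
	r.mu.Lock()
	r.cert, r.modTime = &pair, info.ModTime()
	r.mu.Unlock()
	return nil
}

// GetCertificate matches the tls.Config.GetCertificate callback signature.
// If a reload fails, it keeps serving the last successfully loaded pair.
func (r *certReloader) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	err := r.reloadIfChanged()
	r.mu.RLock()
	defer r.mu.RUnlock()
	if r.cert != nil {
		return r.cert, nil
	}
	return nil, err
}

func main() {
	r := &certReloader{
		certPath: "/etc/provider/certs/tls.crt",
		keyPath:  "/etc/provider/certs/tls.key",
	}
	cfg := &tls.Config{GetCertificate: r.GetCertificate, MinVersion: tls.VersionTLS12}
	if _, err := cfg.GetCertificate(nil); err != nil {
		fmt.Println("no certificate loaded yet:", err)
		return
	}
	fmt.Println("certificate loaded")
}
```

A real provider would likely replace the per-handshake `os.Stat` with a file watcher and increment the hot-reload metrics on each reload or failure; both are left out here to keep the sketch short.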
+
+**Rationale:**
+- Avoids connection disruption during rotation
+- Enables seamless certificate updates
+- Standard practice for production systems
+
+**Metrics requirement:**
+- `eso_provider_certificate_hot_reload_total`
+- `eso_provider_certificate_hot_reload_failures_total`
+
+## Certificate Structure
+
+### Secret Distribution
+
+**In ESO Namespace:**
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: eso-provider-ca-internal
+  namespace: external-secrets-system
+data:
+  ca.crt: <CA certificate>
+  ca.key: <CA private key>  # ONLY in ESO namespace
+```
+
+**In Provider Namespaces:**
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: external-secrets-provider-tls
+  namespace: <provider-namespace>
+data:
+  ca.crt: <CA certificate>
+  tls.crt: <Server certificate with DNS SANs>
+  tls.key: <Server private key>
+  # note: no client certs/keys!
+```
+
+**Security:**
+1. The CA private key is never distributed to provider namespaces.
+2. Client certificates and keys are never distributed to provider namespaces.
+
+### Certificate Validity Periods
+
+| Certificate Type   | Validity | Rotation Lookahead |
+| ------------------ | -------- | ------------------ |
+| CA Certificate     | 1 year   | 60 days            |
+| Server Certificate | 90 days  | 35 days            |
+| Client Certificate | 90 days  | 35 days            |
+
+**Reconciliation Interval:** 10 minutes
+
+### DNS Subject Alternative Names (SANs)
+
+E.g. for service `provider-aws` in namespace `provider-system`:
+
+```
+DNS SANs in server certificate:
+  - provider-aws
+  - provider-aws.provider-system
+  - provider-aws.provider-system.svc
+  - provider-aws.provider-system.svc.cluster.local
+```
+
+These names cover all in-cluster Kubernetes DNS forms for a Service. The `cluster.local` suffix must be configurable, as some clusters use a custom cluster domain.
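
The SAN list can be derived from the service name, namespace, and cluster domain. A minimal sketch, assuming a hypothetical helper `serviceDNSNames` with the cluster domain passed in as a parameter:

```go
package main

import "fmt"

// serviceDNSNames returns the DNS SANs for a Service, from the short name
// up to the fully qualified name under the given cluster domain.
func serviceDNSNames(service, namespace, clusterDomain string) []string {
	return []string{
		service,
		service + "." + namespace,
		service + "." + namespace + ".svc",
		service + "." + namespace + ".svc." + clusterDomain,
	}
}

func main() {
	for _, name := range serviceDNSNames("provider-aws", "provider-system", "cluster.local") {
		fmt.Println(name)
	}
}
```

The resulting slice can be assigned to the `DNSNames` field of an `x509.Certificate` template when the server certificate is issued.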
+
+## Certificate Lifecycle
+
+### Controller: Rotation Triggers
+
+1. Service labeled for first time (initial generation)
+2. Certificate expires within 35 days (rotation)
+3. Service DNS changes (regeneration with new SANs)
+4. Secret deleted (recreation)
+5. CA certificate rotated (regenerate all)
+6. Leaf certificate doesn't match CA (regeneration)
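
The triggers above can be folded into a single decision function that the controller evaluates on each reconciliation. This is a sketch under stated assumptions: `certState`, `rotationReason`, and the reason strings are hypothetical and only meant to show how the triggers might map onto a `reason` label; how the controller actually computes each field is not specified here.

```go
package main

import (
	"fmt"
	"time"
)

// leafLookahead is the rotation lookahead for server and client certificates.
const leafLookahead = 35 * 24 * time.Hour

// exampleNow is a fixed reference time so the example is deterministic.
var exampleNow = time.Date(2025, time.January, 1, 0, 0, 0, 0, time.UTC)

// certState summarizes what the controller observed for one labeled Service.
type certState struct {
	SecretExists      bool      // false on first labeling or after deletion
	NotAfter          time.Time // expiry of the current leaf certificate
	SANsMatch         bool      // SANs still match the Service DNS names
	SignedByCurrentCA bool      // leaf verifies against the current CA
}

// rotationReason returns a non-empty reason if the certificate must be
// (re)generated, or "" if the current certificate can be kept.
func rotationReason(s certState, now time.Time) string {
	switch {
	case !s.SecretExists:
		return "missing" // triggers 1 and 4
	case !s.SignedByCurrentCA:
		return "ca-mismatch" // triggers 5 and 6
	case !s.SANsMatch:
		return "san-changed" // trigger 3
	case !s.NotAfter.After(now.Add(leafLookahead)):
		return "expiring" // trigger 2
	default:
		return ""
	}
}

func main() {
	// A 90-day certificate issued 60 days ago has 30 days left.
	s := certState{
		SecretExists:      true,
		NotAfter:          exampleNow.AddDate(0, 0, 30),
		SANsMatch:         true,
		SignedByCurrentCA: true,
	}
	fmt.Printf("reason=%q\n", rotationReason(s, exampleNow))
}
```

Checking the CA relationship before expiry means a CA rotation always wins over a plain renewal, so the reported reason reflects the root cause.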
+
+### Rotation Strategy
+
+- CA certificate preserved when valid
+- ESO verifies leaf certificates were signed by current CA
+- If CA changes or is deleted, regenerate all leaf certificates
+- Providers hot-reload from mounted volume (no restart)
+
+## Security Model
+
+### Trust Boundaries
+
+1. ESO acts as Certificate Authority for provider communication only. The CA certificate for out-of-process Providers must be distinct from any other certificates used by ESO (e.g. conversion/validating webhook)
+2. Providers present server certificates to ESO (ESO is the client)
+3. ESO presents client certificates to providers
+4. Mutual authentication is enforced (mTLS required)
+5. Providers retrieve secrets from external APIs using their own RBAC
+6. ESO does not send secret values to providers when fetching - providers fetch them from external APIs and return them to ESO over the mTLS channel
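
In Go, the mutual authentication described above maps onto two `tls.Config` values: the provider requires and verifies client certificates, and ESO verifies the provider's server certificate against the same CA. The sketch below assumes a single CA signs both server and client certificates; `providerServerConfig`, `esoClientConfig`, and `demoCAPEM` are hypothetical names, and the throwaway CA exists only to make the example runnable.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// caPool parses the PEM-encoded CA certificate (ca.crt) into a pool.
func caPool(caPEM []byte) (*x509.CertPool, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificate found in PEM input")
	}
	return pool, nil
}

// providerServerConfig is the provider side: present the server certificate
// and reject any client that lacks a certificate signed by the CA.
func providerServerConfig(caPEM []byte, getCert func(*tls.ClientHelloInfo) (*tls.Certificate, error)) (*tls.Config, error) {
	pool, err := caPool(caPEM)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		GetCertificate: getCert,
		ClientCAs:      pool,
		ClientAuth:     tls.RequireAndVerifyClientCert,
		MinVersion:     tls.VersionTLS12,
	}, nil
}

// esoClientConfig is the ESO side: present the client certificate and verify
// the provider's certificate against the CA and the expected DNS name.
func esoClientConfig(caPEM []byte, serverName string, getCert func(*tls.CertificateRequestInfo) (*tls.Certificate, error)) (*tls.Config, error) {
	pool, err := caPool(caPEM)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		GetClientCertificate: getCert,
		RootCAs:              pool,
		ServerName:           serverName,
		MinVersion:           tls.VersionTLS12,
	}, nil
}

// demoCAPEM creates a throwaway self-signed CA for this example only.
func demoCAPEM() ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "demo-ca"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	ca, err := demoCAPEM()
	if err != nil {
		fmt.Println("generate demo CA:", err)
		return
	}
	srv, _ := providerServerConfig(ca, nil) // nil callbacks: config shape only
	cli, _ := esoClientConfig(ca, "provider-aws.provider-system.svc", nil)
	fmt.Println("server requires client certs:", srv.ClientAuth == tls.RequireAndVerifyClientCert)
	fmt.Println("client verifies server name:", cli.ServerName)
}
```

With gRPC, each config would typically be wrapped via `credentials.NewTLS` and passed to the server or dial options; that wiring is omitted to keep the example free of third-party imports.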
+
+### Why ESO Can Trust Providers
+
+ESO can trust any provider, and by extension anyone who is able to create a Service object with the appropriate label, for the following reasons:
+
+- Providers do not have access to ESO's secrets
+- Providers only return data they fetch from external APIs
+- Provider RBAC is separate (AWS IAM roles, Kubernetes service accounts, etc.)
+- ESO signing **server** certificates for providers doesn't grant additional privileges
+- mTLS ensures only authenticated providers can communicate
+
+**However:** We must not distribute client certificates anywhere, because anyone with access to a client certificate and key can fetch secrets from a provider.
+
+## RBAC Requirements
+
+ESO controller requires:
+
+```yaml
+# Watch services cluster-wide
+- apiGroups: [""]
+  resources: ["services"]
+  verbs: ["get", "list", "watch"]
+
+# Manage secrets in any namespace
+- apiGroups: [""]
+  resources: ["secrets"]
+  verbs: ["create", "update", "patch", "get", "list", "watch"]
+```
+
+**Security consideration:** Cross-namespace secret write is privileged. ESO only writes to the fixed secret name `external-secrets-provider-tls` and only in namespaces with labeled services.
+
+## User Experience
+
+### For Provider Users (Zero Config)
+
+1. Deploy provider with labeled service:
+```yaml
+labels:
+  external-secrets.io/provider: "true"
+```
+
+2. Create Provider resource:
+```yaml
+apiVersion: external-secrets.io/v1
+kind: Provider
+metadata:
+  name: my-aws
+spec:
+  address: provider-aws.provider-system.svc:8080
+```
+
+**That's it.** Certificates are automatic.
+
+### For Provider Developers
+
+1. Label your service with `external-secrets.io/provider: "true"`
+
+2. Mount secret in pod:
+```yaml
+volumeMounts:
+  - name: certs
+    mountPath: /etc/provider/certs
+volumes:
+  - name: certs
+    secret:
+      secretName: external-secrets-provider-tls
+```
+
+3. Configure gRPC server to use certs from `/etc/provider/certs/`
+
+4. Implement hot certificate reload (required)
+
+5. Expose Prometheus metrics
+
+## Monitoring
+
+### Required Metrics
+
+```
+# Certificate expiration (seconds until expiry)
+eso_provider_certificate_expiry_seconds{namespace, service}
+
+# CA certificate expiration
+eso_ca_certificate_expiry_seconds
+
+# Certificate rotations
+eso_provider_certificate_rotation_total{namespace, service, reason}
+
+# Rotation failures
+eso_provider_certificate_rotation_failures_total{namespace, service}
+
+# Hot reload events (in provider pods)
+eso_provider_certificate_hot_reload_total{namespace, service}
+
+# Hot reload failures (in provider pods)
+eso_provider_certificate_hot_reload_failures_total{namespace, service}
+```
+
+### Recommended Alerts
+
+- Certificate expiring in <3 days
+- Certificate rotation failure
+- Hot reload failure
+- CA certificate expiring in <30 days
+
+## Open Questions
+
+### 1. gRPC connection handling during rotation
+
+**Options:**
+- A: Automatically re-establish connections when certificates rotate
+- B: Let existing connections complete, new connections use new certs
+- C: Force reconnect on certificate rotation
+
+**Impact:** Affects connection stability during rotation.
+
+### 2. Graceful certificate rollover
+
+Should ESO briefly accept both old and new certificates during rotation?
+
+**Pros:**
+- Smoother rotation experience
+- Reduces connection errors
+
+**Cons:**
+- More complex implementation

BIN
design/014/out-of-process-api-direct-ref.png


BIN
design/014/out-of-process-api-provider.png