# Evaluation Framework Alignment Analysis

**Date:** November 22, 2025
**Reference:** Building Best-in-Class AI Evals for Deterministic Multi-Agent Workflows (November 2025)

## Executive Summary

Our SDK-based evaluation framework aligns well with **Tier 2 (Integration Tests)** best practices but has gaps in **Tier 1 (Unit Tests)** and **Tier 3 (Multi-Agent Collaboration)**. We excel at trace-based testing and deterministic workflow validation but lack multi-agent communication metrics and production monitoring capabilities.

**Overall Alignment Score: 53/100**

---

## ✅ What We're Doing Right

### 1. **Deterministic Workflow Testing** ✅ (Best Practice: Sections 1, 3)
- **What we have:** SDK-based execution with real session recording
- **Alignment:** Perfect match for deterministic multi-agent systems
- **Evidence:** `ServerManager`, `ClientManager`, `EventStreamHandler` provide full trace capture
- **Score:** 10/10

**Quote from guide:**
> "Deterministic workflows demand deterministic evaluation... you can now test agent behavior with the same rigor as traditional software"

**Our implementation:**
```typescript
// test-runner.ts - Real SDK execution
const result = await this.clientManager.sendPrompt(
  sessionId,
  testCase.prompt,
  { agent: testCase.agent }
);
```
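
One way to exploit this determinism is to replay a test case and diff the traces. A minimal sketch (`runTestCase` and the `Trace` shape are illustrative assumptions, not framework APIs):

```typescript
// Sketch: determinism check by replaying the same test case twice.
// `runTestCase` and the `Trace` shape are assumed helpers, not framework APIs.
interface Trace {
  toolCalls: { name: string; args: Record<string, unknown> }[];
}

async function assertDeterministic(
  runTestCase: (prompt: string) => Promise<Trace>,
  prompt: string
): Promise<void> {
  const first = await runTestCase(prompt);
  const second = await runTestCase(prompt);

  // In a deterministic workflow the tool-call sequence should be identical.
  const a = JSON.stringify(first.toolCalls);
  const b = JSON.stringify(second.toolCalls);
  if (a !== b) {
    throw new Error(`Non-deterministic tool sequence:\n${a}\nvs\n${b}`);
  }
}
```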

---

### 2. **Trace-Based Testing** ✅ (Best Practice: Trick 5)
- **What we have:** Event streaming with 10+ events per test
- **Alignment:** Matches "inspect reasoning chain, not just result" pattern
- **Evidence:** `EventStreamHandler` captures tool calls, approvals, context loading
- **Score:** 9/10

**Quote from guide:**
> "Move beyond output validation to trace validation. Inspect the reasoning chain, not just the result"

**Our implementation:**
```typescript
// event-stream-handler.ts
for await (const event of stream) {
  this.events.push({
    type: event.type,
    data: event.data,
    timestamp: Date.now()
  });
}
```
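
With events captured as plain data, trace validation reduces to assertions over the array. A minimal sketch (the event `type` strings are assumptions, not the SDK's actual names):

```typescript
// Sketch: trace-level assertions over the captured events.
// The `type` strings are illustrative assumptions, not the SDK's actual names.
interface CapturedEvent {
  type: string;
  data: unknown;
  timestamp: number;
}

function assertApprovalBeforeToolCall(events: CapturedEvent[]): void {
  const firstTool = events.findIndex((e) => e.type === 'tool-call');
  const firstApproval = events.findIndex((e) => e.type === 'approval-request');

  // Validate the reasoning chain, not just the final output: an approval
  // request must appear before the first tool call.
  if (firstTool !== -1 && (firstApproval === -1 || firstApproval > firstTool)) {
    throw new Error('Tool call observed before any approval request');
  }
}
```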

---

### 3. **Behavior-Based Testing (Not Message Counts)** ✅ (Best Practice: Section 2, test-design-guide.md)
- **What we have:** v2 schema with `behavior` + `expectedViolations`
- **Alignment:** Perfect match for model-agnostic testing
- **Evidence:** `BehaviorExpectationSchema` tests tool usage, approvals, delegation
- **Score:** 10/10

**Quote from guide:**
> "BAD: 'Agent must send exactly 3 messages' GOOD: 'Agent must ask for approval before running bash commands'"

**Our implementation:**
```yaml
# v2 schema
behavior:
  mustUseTools: [bash]
  requiresApproval: true

expectedViolations:
  - rule: approval-gate
    shouldViolate: false
```
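
For reference, a plausible shape for `BehaviorExpectationSchema`, assuming zod (field names follow the YAML above; the real schema file may differ):

```typescript
import { z } from 'zod';

// Sketch of what BehaviorExpectationSchema might look like for the v2 YAML
// above. Field names follow the example; the actual schema may differ.
export const BehaviorExpectationSchema = z.object({
  behavior: z.object({
    mustUseTools: z.array(z.string()).optional(),
    requiresApproval: z.boolean().optional(),
  }),
  expectedViolations: z
    .array(
      z.object({
        rule: z.string(),
        shouldViolate: z.boolean(),
      })
    )
    .optional(),
});

export type BehaviorExpectation = z.infer<typeof BehaviorExpectationSchema>;
```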

---

### 4. **Cost-Aware Testing** ✅ (Best Practice: Implicit in production systems)
- **What we have:** Free model by default (`opencode/grok-code-fast`)
- **Alignment:** Prevents accidental API costs during development
- **Evidence:** CLI `--model` override, per-test model config
- **Score:** 8/10

**Our implementation:**
```typescript
// test-runner.ts
const model = testCase.model || config.model || 'opencode/grok-code-fast';
```

---

### 5. **Rule-Based Evaluation** ✅ (Best Practice: Section 3.E - Safety & Compliance)
- **What we have:** 4 evaluators checking openagent.md compliance
- **Alignment:** Maps to "Policy Compliance" metrics
- **Evidence:** `ApprovalGateEvaluator`, `ContextLoadingEvaluator`, `DelegationEvaluator`, `ToolUsageEvaluator`
- **Score:** 7/10

**Quote from guide:**
> "Policy Compliance: Outputs align with organizational/regulatory constraints - Target: 100% for critical workflows"

**Our implementation:**
```typescript
// approval-gate-evaluator.ts
if (toolCall && !hasApprovalRequest) {
  violations.push({
    type: 'approval-gate-missing',
    severity: 'error',
    message: `Tool ${toolCall.name} executed without approval`
  });
}
```

---

## ⚠️ What We're Missing (Critical Gaps)

### 1. **Three-Tier Testing Framework** ⚠️ (Best Practice: Section 2)

**Current State:**
- ✅ **Tier 2 (Integration):** Single-agent multi-step workflows - HAVE THIS
- ❌ **Tier 1 (Unit):** Tool-level isolation - MISSING
- ❌ **Tier 3 (E2E):** Multi-agent collaboration - MISSING

**Gap Analysis:**

| Tier | What We Need | What We Have | Gap |
|------|-------------|--------------|-----|
| **Tier 1: Unit** | Test individual tools in isolation | Nothing | 100% gap |
| **Tier 2: Integration** | Single-agent workflows | SDK test runner | ✅ Complete |
| **Tier 3: E2E** | Multi-agent coordination metrics | Nothing | 100% gap |

**Impact:** We can't catch tool failures before agent execution, and we can't measure multi-agent efficiency.

**Recommendation:**
```typescript
// NEW: evals/framework/src/unit/tool-tester.ts
import assert from 'node:assert';

export class ToolTester {
  async testTool(toolName: string, params: any, expected: any) {
    // executeTool: the framework's tool dispatch (to be provided)
    const result = await executeTool(toolName, params);
    assert.deepEqual(result, expected);
  }
}

// Example unit test
await toolTester.testTool('fetch_product_price',
  { productId: '123' },
  { price: 99.99, currency: 'USD' }
);
```

**Score:** 3/10 (only have 1 of 3 tiers)

---

### 2. **Multi-Agent Communication Metrics** ❌ (Best Practice: Section 3.B - GEMMAS)

**What's Missing:**
- Information Diversity Score (IDS)
- Unnecessary Path Ratio (UPR)
- Communication efficiency tracking
- Decision synchronization metrics

**Quote from guide:**
> "GEMMAS breakthrough: The Information Diversity Score (IDS) quantifies semantic variation in inter-agent messages. High IDS means agents are exchanging diverse, non-redundant information."

**Why This Matters:**
> "Research from GEMMAS reveals that systems with only a 2.1% difference in task accuracy can differ by **12.8% in Information Diversity Score and 80% in Unnecessary Path Ratio**"

**Current State:** We have NO multi-agent metrics. Our evaluators only check single-agent behavior.

**Recommendation:**
```typescript
// NEW: evals/framework/src/evaluators/multi-agent-evaluator.ts
export class MultiAgentEvaluator extends BaseEvaluator {
  async evaluate(timeline: TimelineEvent[]) {
    // Build DAG of agent interactions
    const dag = this.buildInteractionDAG(timeline);

    // Calculate IDS (semantic diversity of messages)
    const ids = this.calculateInformationDiversityScore(dag);

    // Calculate UPR (redundant reasoning paths)
    const upr = this.calculateUnnecessaryPathRatio(dag);

    return {
      ids,
      upr,
      passed: upr < 0.20 // Target: <20% redundancy
    };
  }
}
```
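
To make `calculateUnnecessaryPathRatio` concrete, here is one possible decomposition, assuming the DAG records which nodes contributed to the final answer (GEMMAS defines the metric; this implementation is only a sketch):

```typescript
// Sketch: Unnecessary Path Ratio over an interaction DAG.
// Assumes each node records its children and whether it contributed to the
// final answer; this decomposition is illustrative, not GEMMAS reference code.
interface DagNode {
  id: string;
  children: string[];
  contributedToAnswer: boolean;
}

function unnecessaryPathRatio(nodes: Map<string, DagNode>, rootId: string): number {
  let total = 0;
  let unnecessary = 0;

  const walk = (id: string, path: string[]): void => {
    const node = nodes.get(id);
    if (!node) return;
    if (node.children.length === 0) {
      total++;
      // A root-to-leaf path is "unnecessary" if no node on it fed the answer.
      if (![...path, id].some((n) => nodes.get(n)?.contributedToAnswer)) {
        unnecessary++;
      }
      return;
    }
    for (const child of node.children) walk(child, [...path, id]);
  };

  walk(rootId, []);
  return total === 0 ? 0 : unnecessary / total;
}
```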

**Score:** 0/10 (completely missing)

---

### 3. **LLM-as-Judge Evaluation** ⚠️ (Best Practice: Section 4 - DeepEval, G-Eval)

**What's Missing:**
- Semantic quality scoring
- Hallucination detection
- Answer relevancy metrics
- Faithfulness scoring

**Quote from guide:**
> "DeepEval Metrics: RAGas (Answer Relevancy, Faithfulness, Contextual Precision, Contextual Recall) - Benchmark: 96% faithfulness, 93% relevancy"

**Current State:** We only have rule-based evaluators. No LLM judges for semantic quality.

**Gap:** Can't detect:
- Hallucinations (agent making up facts)
- Low-quality responses (technically correct but unhelpful)
- Semantic errors (wrong interpretation of user intent)

**Recommendation:**
```typescript
// NEW: evals/framework/src/evaluators/llm-judge-evaluator.ts
export class LLMJudgeEvaluator extends BaseEvaluator {
  async evaluate(timeline: TimelineEvent[], sessionInfo: SessionInfo) {
    const finalResponse = this.extractFinalResponse(timeline);

    // G-Eval pattern: LLM generates evaluation steps
    const rubric = await this.generateEvaluationRubric(sessionInfo.prompt);

    // Score response against rubric
    const score = await this.scoreWithLLM(finalResponse, rubric);

    return {
      score,
      passed: score >= 0.85,
      violations: score < 0.85 ? [{
        type: 'quality-below-threshold',
        severity: 'warning',
        message: `Response quality ${score} below 0.85 threshold`
      }] : []
    };
  }
}
```

**Score:** 2/10 (have basic structure, missing LLM judges)

---

### 4. **Production Monitoring & Guardrails** ❌ (Best Practice: Trick 6)

**What's Missing:**
- Real-time scoring on live requests
- Hallucination guards
- Policy violation detection
- Latency guards
- Quality regression alerts

**Quote from guide:**
> "Evals don't stop at deployment. Set up real-time scoring on live requests"

**Current State:** We only run evals on test cases. No production monitoring.

**Recommendation:**
```typescript
// NEW: evals/framework/src/monitoring/guardrails.ts
export class ProductionGuardrails {
  async scoreRequest(sessionId: string) {
    const timeline = await this.getTimeline(sessionId);

    // Run evaluators in real-time
    const result = await this.evaluatorRunner.runAll(sessionId);

    // Check guardrails
    if (result.violationsBySeverity.error > 0) {
      await this.escalateToHuman(sessionId);
    }

    if (result.overallScore < 70) {
      await this.alertQualityRegression(sessionId);
    }
  }
}
```

**Score:** 0/10 (completely missing)

---

### 5. **Canary Releases & A/B Testing** ❌ (Best Practice: Trick 4)

**What's Missing:**
- Shadow mode testing
- Gradual rollout (1% → 5% → 50% → 100%)
- Automated rollback on regression
- Feature flag integration

**Quote from guide:**
> "Week 1: Shadow mode - New agent runs in parallel to old agent; compare outputs silently"

**Current State:** We have no deployment pipeline integration.

**Recommendation:**
```typescript
// NEW: evals/framework/src/deployment/canary.ts
export class CanaryDeployment {
  async runShadowMode(newAgent: string, oldAgent: string, duration: number) {
    // Run both agents on same traffic
    const results = await this.runParallel(newAgent, oldAgent, duration);

    // Compare metrics
    const drift = this.calculateDrift(results.new, results.old);

    // Decision gate
    if (drift.accuracy > 0.05 || drift.latency > 0.10) {
      throw new Error('Shadow mode failed: metrics drifted too much');
    }
  }
}
```
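
`calculateDrift` could start as relative deltas over aggregate metrics. A sketch with assumed metric fields:

```typescript
// Sketch: relative drift between shadow-mode aggregates.
// The metric fields are illustrative; real comparisons would cover more axes.
interface AgentMetrics {
  accuracy: number;   // fraction of passing evals
  latencyMs: number;  // p50 or p95 latency
}

function calculateDrift(next: AgentMetrics, prev: AgentMetrics) {
  return {
    // Positive values mean the new agent is worse on that axis.
    accuracy: (prev.accuracy - next.accuracy) / prev.accuracy,
    latency: (next.latencyMs - prev.latencyMs) / prev.latencyMs,
  };
}
```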

**Score:** 0/10 (completely missing)

---

### 6. **Dataset Curation from Production Failures** ⚠️ (Best Practice: Trick 7)

**What's Missing:**
- Automatic logging of failures
- Failure pattern analysis
- Continuous eval dataset updates
- Hard case identification

**Quote from guide:**
> "The best eval datasets aren't lab-created; they come from real agent failures"

**Current State:** We have static YAML test cases. No feedback loop from production.

**Recommendation:**
```typescript
// NEW: evals/framework/src/curation/failure-collector.ts
export class FailureCollector {
  async collectFailures(since: Date) {
    const sessions = await this.sessionReader.getSessionsSince(since);

    // Find failures
    const failures = sessions.filter(s =>
      s.userFeedback === 'unhelpful' ||
      s.escalatedToHuman ||
      s.taskSuccess < 0.70
    );

    // Convert to test cases
    for (const failure of failures) {
      await this.createTestCase(failure);
    }
  }
}
```
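
`createTestCase` could serialize a failing session back into the same v2 YAML schema the runner already consumes. A sketch with assumed session fields and output path:

```typescript
import { writeFile } from 'node:fs/promises';

// Sketch: turning a failed production session into a v2-style YAML test case.
// The session fields and output path are assumptions for illustration.
interface FailedSession {
  id: string;
  prompt: string;
  agent: string;
  toolsUsed: string[];
}

async function createTestCase(failure: FailedSession): Promise<void> {
  const yaml = [
    `# Auto-curated from production session ${failure.id}`,
    `prompt: ${JSON.stringify(failure.prompt)}`,
    `agent: ${failure.agent}`,
    'behavior:',
    `  mustUseTools: [${failure.toolsUsed.join(', ')}]`,
    '',
  ].join('\n');

  await writeFile(`evals/curated/${failure.id}.yaml`, yaml, 'utf8');
}
```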

**Score:** 2/10 (have test structure, missing automation)

---

### 7. **Benchmark Validation** ⚠️ (Best Practice: Section 4 - Bottom table)

**What's Missing:**
- WebArena (web browsing tasks)
- OSWorld (desktop control)
- BFCL (function calling accuracy)
- MARBLE (multi-agent collaboration)

**Quote from guide:**
> "Top Agentic Benchmarks (2025): WebArena, OSWorld, BFCL, MARBLE"

**Current State:** We have custom tests but no standard benchmark integration.

**Recommendation:**
```bash
# Add benchmark tests
evals/agents/openagent/benchmarks/
  ├── webarena/
  ├── bfcl/
  └── marble/
```
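
Each benchmark would also need an adapter mapping its cases onto our test schema. A sketch for BFCL-style function-calling cases (the record shape is an assumption; the real dataset format may differ):

```typescript
// Sketch: adapting BFCL-style function-calling cases into our test schema.
// The BFCL record shape here is an assumption, not the actual dataset format.
interface BfclCase {
  question: string;
  expectedFunction: string;
}

interface FrameworkTestCase {
  prompt: string;
  agent: string;
  behavior: { mustUseTools: string[] };
}

function adaptBfclCase(c: BfclCase): FrameworkTestCase {
  return {
    prompt: c.question,
    agent: 'openagent',
    // Function-calling accuracy maps naturally onto mustUseTools assertions.
    behavior: { mustUseTools: [c.expectedFunction] },
  };
}
```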

**Score:** 1/10 (have test infrastructure, missing benchmarks)

---

## 📊 Detailed Scoring Matrix

| Category | Best Practice | Our Score | Weight | Weighted Score |
|----------|--------------|-----------|--------|----------------|
| **Deterministic Workflow Testing** | Sections 1, 3 | 10/10 | 15% | 1.50 |
| **Trace-Based Testing** | Trick 5 | 9/10 | 10% | 0.90 |
| **Behavior-Based Testing** | Section 2 | 10/10 | 10% | 1.00 |
| **Cost-Aware Testing** | Implicit | 8/10 | 5% | 0.40 |
| **Rule-Based Evaluation** | Section 3.E | 7/10 | 10% | 0.70 |
| **Three-Tier Framework** | Section 2 | 3/10 | 15% | 0.45 |
| **Multi-Agent Metrics** | Section 3.B (GEMMAS) | 0/10 | 5% | 0.00 |
| **LLM-as-Judge** | Section 4 (DeepEval) | 2/10 | 10% | 0.20 |
| **Production Monitoring** | Trick 6 | 0/10 | 5% | 0.00 |
| **Canary Releases** | Trick 4 | 0/10 | 5% | 0.00 |
| **Dataset Curation** | Trick 7 | 2/10 | 5% | 0.10 |
| **Benchmark Validation** | Section 4 | 1/10 | 5% | 0.05 |

**Total Weighted Score: 5.30 / 10.00 = 53%**
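
The total can be reproduced mechanically from the table (weighted score = score × weight):

```typescript
// Reproducing the weighted total from the matrix rows above, in table order.
const rows: [score: number, weight: number][] = [
  [10, 0.15], [9, 0.10], [10, 0.10], [8, 0.05], [7, 0.10], [3, 0.15],
  [0, 0.05], [2, 0.10], [0, 0.05], [0, 0.05], [2, 0.05], [1, 0.05],
];

const total = rows.reduce((sum, [score, weight]) => sum + score * weight, 0);
console.log(total.toFixed(2)); // 5.30, i.e. 53%
```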

---

## 🎯 Priority Recommendations (Ranked by Impact)

### **Priority 1: Add LLM-as-Judge Evaluators** (High Impact, Medium Effort)
**Why:** Catches semantic errors our rule-based evaluators miss
**Effort:** 2-3 days
**Impact:** +15% coverage

**Implementation:**
```typescript
// evals/framework/src/evaluators/llm-judge-evaluator.ts
import { BaseEvaluator } from './base-evaluator.js';

export class LLMJudgeEvaluator extends BaseEvaluator {
  name = 'llm-judge';

  async evaluate(timeline, sessionInfo) {
    // Use G-Eval pattern
    const rubric = this.generateRubric(sessionInfo.prompt);
    const score = await this.scoreWithLLM(timeline, rubric);

    return {
      evaluator: this.name,
      passed: score >= 0.85,
      score: score * 100,
      violations: []
    };
  }
}
```
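
`scoreWithLLM` is the key stub. One way it might look, with `completeJSON` standing in for any chat-completion client that returns parsed JSON (the prompt wording and 0-1 scale are illustrative choices):

```typescript
// Sketch: scoring a response against a rubric with an LLM judge.
// `completeJSON` stands in for any chat-completion client that returns parsed
// JSON; the prompt wording and 0-1 scale are illustrative choices.
interface JudgeVerdict {
  score: number;      // 0.0 - 1.0
  reasoning: string;
}

async function scoreWithLLM(
  completeJSON: (prompt: string) => Promise<JudgeVerdict>,
  response: string,
  rubric: string[]
): Promise<number> {
  const prompt = [
    'You are grading an AI agent response against a rubric.',
    `Rubric:\n${rubric.map((r, i) => `${i + 1}. ${r}`).join('\n')}`,
    `Response:\n${response}`,
    'Return JSON: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}',
  ].join('\n\n');

  const verdict = await completeJSON(prompt);
  // Clamp defensively; judges occasionally return out-of-range values.
  return Math.min(1, Math.max(0, verdict.score));
}
```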

---

### **Priority 2: Add Multi-Agent Communication Metrics** (High Impact, High Effort)
**Why:** Critical for multi-agent systems (80% efficiency difference per GEMMAS)
**Effort:** 1 week
**Impact:** +20% coverage

**Implementation:**
```typescript
// evals/framework/src/evaluators/multi-agent-evaluator.ts
export class MultiAgentEvaluator extends BaseEvaluator {
  name = 'multi-agent';

  async evaluate(timeline, sessionInfo) {
    const dag = this.buildInteractionDAG(timeline);
    const ids = this.calculateIDS(dag); // Information Diversity Score
    const upr = this.calculateUPR(dag); // Unnecessary Path Ratio

    return {
      evaluator: this.name,
      passed: upr < 0.20,
      score: (1 - upr) * 100,
      violations: upr >= 0.20 ? [{
        type: 'high-redundancy',
        severity: 'warning',
        message: `UPR ${upr} exceeds 20% threshold`
      }] : []
    };
  }
}
```
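
`calculateIDS` could be approximated as mean pairwise cosine distance between message embeddings. A sketch assuming an injected `embed` function; GEMMAS's exact formulation may differ:

```typescript
// Sketch: Information Diversity Score as mean pairwise cosine distance between
// inter-agent message embeddings. `embed` is an assumed dependency.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function informationDiversityScore(
  embed: (text: string) => Promise<number[]>,
  messages: string[]
): Promise<number> {
  const vectors = await Promise.all(messages.map(embed));
  let distance = 0, pairs = 0;
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      distance += 1 - cosine(vectors[i], vectors[j]);
      pairs++;
    }
  }
  // 0 = all messages semantically identical; higher = more diverse exchange.
  return pairs === 0 ? 0 : distance / pairs;
}
```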

---

### **Priority 3: Add Unit Testing Layer (Tier 1)** (Medium Impact, Low Effort)
**Why:** Catches tool failures before agent execution
**Effort:** 1-2 days
**Impact:** +10% coverage

**Implementation:**
```typescript
// evals/framework/src/unit/tool-tester.ts
import { isDeepStrictEqual } from 'node:util';

export class ToolTester {
  async testTool(toolName: string, params: any, expected: any) {
    const result = await this.executeTool(toolName, params);

    if (!isDeepStrictEqual(result, expected)) {
      throw new Error(
        `Tool ${toolName} failed: expected ${JSON.stringify(expected)}, got ${JSON.stringify(result)}`
      );
    }
  }
}

// Usage in tests
await toolTester.testTool('bash', { command: 'echo hello' }, { stdout: 'hello\n' });
```

---

### **Priority 4: Add Production Monitoring** (High Impact, High Effort)
**Why:** Evals don't stop at deployment
**Effort:** 1 week
**Impact:** +15% coverage

**Implementation:**
```typescript
// evals/framework/src/monitoring/production-monitor.ts
export class ProductionMonitor {
  async monitorSession(sessionId: string) {
    const result = await this.evaluatorRunner.runAll(sessionId);

    // Guardrails
    if (result.violationsBySeverity.error > 0) {
      await this.escalateToHuman(sessionId);
    }

    // Quality regression
    if (result.overallScore < this.baseline - 5) {
      await this.alertRegression(sessionId, result.overallScore);
    }
  }
}
```
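
The `this.baseline` reference implies some baseline tracking. A rolling mean over recent session scores is one simple option (a sketch, not part of the current framework):

```typescript
// Sketch: rolling-mean baseline over the last N session scores, so the
// regression check compares against recent behavior rather than a constant.
class RollingBaseline {
  private scores: number[] = [];

  constructor(private readonly windowSize = 100) {}

  record(score: number): void {
    this.scores.push(score);
    if (this.scores.length > this.windowSize) this.scores.shift();
  }

  get value(): number {
    if (this.scores.length === 0) return 0;
    return this.scores.reduce((a, b) => a + b, 0) / this.scores.length;
  }
}
```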

---

### **Priority 5: Add Dataset Curation Pipeline** (Medium Impact, Medium Effort)
**Why:** Continuous improvement from production failures
**Effort:** 3-4 days
**Impact:** +10% coverage

**Implementation:**
```typescript
// evals/framework/src/curation/auto-curator.ts
export class AutoCurator {
  async curateFromProduction(since: Date) {
    const failures = await this.collectFailures(since);

    for (const failure of failures) {
      const testCase = this.convertToTestCase(failure);
      await this.saveTestCase(testCase);
    }
  }
}
```

---

## 📋 Implementation Roadmap

### **Phase 1: Fill Critical Gaps (2 weeks)**
- [ ] Week 1: Add LLM-as-Judge evaluator
- [ ] Week 2: Add unit testing layer (Tier 1)

**Expected Score After Phase 1: 75%**

---

### **Phase 2: Multi-Agent Support (2 weeks)**
- [ ] Week 3: Implement GEMMAS-style metrics (IDS, UPR)
- [ ] Week 4: Add multi-agent test cases

**Expected Score After Phase 2: 85%**

---

### **Phase 3: Production Readiness (2 weeks)**
- [ ] Week 5: Add production monitoring
- [ ] Week 6: Add canary deployment support

**Expected Score After Phase 3: 92%**

---

### **Phase 4: Continuous Improvement (Ongoing)**
- [ ] Add dataset curation pipeline
- [ ] Integrate standard benchmarks (WebArena, BFCL)
- [ ] Add A/B testing framework

**Expected Score After Phase 4: 95%+**

---

## 🎓 Key Learnings from Best Practices Guide

### **1. Don't Test Message Counts** ✅ (We got this right)
> "BAD: 'Agent must send exactly 3 messages' GOOD: 'Agent must ask for approval before running bash commands'"

**Our v2 schema nails this.**

---

### **2. Multi-Agent Systems Hide Failures** ⚠️ (We need to address this)
> "A single agent may perform perfectly in isolation but create bottlenecks or miscommunications when collaborating"

**We need Tier 3 tests.**

---

### **3. Outcome Metrics Are Insufficient** ⚠️ (We need to address this)
> "Systems with only a 2.1% difference in task accuracy can differ by 12.8% in Information Diversity Score and 80% in Unnecessary Path Ratio"

**We need GEMMAS-style metrics.**

---

### **4. Evals Are Continuous, Not One-Time** ❌ (We're missing this)
> "Evals don't stop at deployment. Set up real-time scoring on live requests"

**We need production monitoring.**

---

### **5. Best Datasets Come from Production** ⚠️ (We need to address this)
> "The best eval datasets aren't lab-created; they come from real agent failures"

**We need automated curation.**

---

## ✅ Conclusion

**Current State:** We have a **solid Tier 2 (Integration Testing) foundation** with excellent trace-based testing and behavior validation.

**Gaps:** We're missing **Tier 1 (Unit)**, **Tier 3 (Multi-Agent)**, **LLM-as-Judge**, and **Production Monitoring**.

**Recommendation:** Follow the 4-phase roadmap to reach 95%+ alignment with best practices.

**Immediate Next Steps:**
1. Add LLM-as-Judge evaluator (Priority 1)
2. Add unit testing layer (Priority 3)
3. Expand test coverage to 14+ tests (from the current 6)

**Long-Term Vision:**
- Full three-tier testing framework
- Multi-agent communication metrics (GEMMAS)
- Production monitoring with guardrails
- Continuous dataset curation from production failures

---

**Overall Assessment: 53/100 - Strong foundation, clear path to excellence**