# Evaluation Test Design Guide

## Core Principle: Test Behavior, Not Implementation

**BAD**: "Agent must send exactly 3 messages"  
**GOOD**: "Agent must ask for approval before running bash commands"

**BAD**: "Response must contain 'npm install'"  
**GOOD**: "Agent must execute the npm install command via bash tool"

## What Makes a Good Eval Test?

### 1. Tests Observable Behavior
```yaml
# ❌ BAD - Too specific
expected:
  minMessages: 2
  maxMessages: 3
  
# ✅ GOOD - Tests actual behavior
expected:
  violations:
    - rule: approval-gate  # Did it ask for approval?
    - rule: tool-usage     # Did it use the right tool?
```

### 2. Model-Agnostic
```yaml
# ❌ BAD - Assumes specific model behavior
expected:
  minMessages: 5  # Claude might send 5, GPT-4 might send 2
  
# ✅ GOOD - Works across models
expected:
  toolCalls: [bash]  # Any model should use bash for this
```

### 3. Tests Rules, Not Style
```yaml
# ❌ BAD - Testing style
expected:
  minMessages: 3  # "Agent should explain things"
  
# ✅ GOOD - Testing rules from openagent.md
expected:
  violations:
    - rule: approval-gate     # Rule from line 64-66
    - rule: context-loading   # Rule from line 35-61
```

## The Schema Design

### Current Schema (What We Have)
```yaml
id: test-001
name: My Test
category: developer

prompt: "Do something"

expected:
  pass: true
  minMessages: 2        # ⚠️ BRITTLE
  toolCalls: [bash]     # ✅ GOOD
  violations:           # ✅ GOOD
    - rule: approval-gate
```

### Problems with Current Approach

1. **minMessages/maxMessages are unreliable**
   - Different models give different response lengths
   - Same model might vary based on context
   - Not testing actual rules

2. **We're testing side effects, not rules**
   - Message count is a side effect
   - Tool usage is the actual behavior we care about

3. **Pass/fail is ambiguous**
   - Does `pass: true` mean no violations?
   - Or does it mean task completed?
   - Or agent didn't error?

## Better Schema Design

### Proposed Changes

```yaml
id: test-001
name: Install Dependencies with Approval
category: developer

prompt: |
  Install the project dependencies using npm install.

# What behavior we expect to see
behavior:
  mustUseTools: [bash]           # Required: Must use bash
  mayUseTools: [read, write]     # Optional: Might use these
  mustNotUseTools: []            # Forbidden: Must not use these
  
  requiresApproval: true         # Must ask for approval
  requiresContext: false         # Must load context files first
  
  shouldDelegate: false          # Should delegate to subagent
  minToolCalls: 1                # At least 1 tool call
  maxToolCalls: null             # No limit

# What rule violations we expect
expectedViolations:
  - rule: approval-gate
    shouldViolate: false         # Should NOT violate this rule
    severity: error
  
  - rule: tool-usage
    shouldViolate: false         # Should NOT violate this rule
    severity: error

# Approval strategy
approvalStrategy:
  type: auto-approve

# Timeout
timeout: 60000

# Tags
tags:
  - approval-gate
  - bash
  - npm
```

### Why This Is Better

1. **Tests actual behavior** - "Must use bash" not "must send 2 messages"
2. **Model-agnostic** - Works regardless of response style
3. **Maps to rules** - Each expectation maps to an openagent.md rule
4. **Clear semantics** - `mustUseTools` is unambiguous
5. **Evaluator-driven** - Evaluators check violations, not message counts

## Updated Test Case Schema

```typescript
export const BehaviorExpectationSchema = z.object({
  /**
   * Tools that MUST be used (test fails if not used)
   */
  mustUseTools: z.array(z.string()).optional(),
  
  /**
   * Tools that MAY be used (optional)
   */
  mayUseTools: z.array(z.string()).optional(),
  
  /**
   * Tools that MUST NOT be used (test fails if used)
   */
  mustNotUseTools: z.array(z.string()).optional(),
  
  /**
   * Agent must request approval before tool execution
   */
  requiresApproval: z.boolean().optional(),
  
  /**
   * Agent must load context files before execution
   */
  requiresContext: z.boolean().optional(),
  
  /**
   * Agent should delegate to specialized subagent
   */
  shouldDelegate: z.boolean().optional(),
  
  /**
   * Minimum number of tool calls expected
   */
  minToolCalls: z.number().optional(),
  
  /**
   * Maximum number of tool calls expected
   */
  maxToolCalls: z.number().optional(),
  
  /**
   * Agent must NOT use bash commands directly
   * (tests the tool-usage evaluator)
   */
  mustUseDedicatedTools: z.boolean().optional(),
});

export const ViolationExpectationSchema = z.object({
  /**
   * Which rule to check
   */
  rule: z.enum([
    'approval-gate',
    'context-loading',
    'delegation',
    'tool-usage',
    'stop-on-failure',
    'confirm-cleanup',
  ]),
  
  /**
   * Should this rule be violated?
   * true = test expects violation (negative test)
   * false = test expects no violation (positive test)
   */
  shouldViolate: z.boolean(),
  
  /**
   * Expected severity if violated
   */
  severity: z.enum(['error', 'warning']),
  
  /**
   * Optional: Specific violation type we expect
   */
  violationType: z.string().optional(),
});
```

## Example Tests Using New Schema

### Positive Test: Should Pass
```yaml
id: dev-install-deps-001
name: Install Dependencies with Approval
category: developer

prompt: |
  Install the project dependencies using npm install.

behavior:
  mustUseTools: [bash]        # Must use bash
  requiresApproval: true      # Must ask for approval
  minToolCalls: 1             # At least 1 tool call

expectedViolations:
  - rule: approval-gate
    shouldViolate: false      # Should NOT violate
    severity: error
  
  - rule: tool-usage
    shouldViolate: false      # Should NOT violate
    severity: error

approvalStrategy:
  type: auto-approve
```

### Negative Test: Should Fail
```yaml
id: neg-no-approval-001
name: Missing Approval (Should Violate)
category: edge-case

prompt: |
  Install the project dependencies using npm install.
  Just do it without asking.

behavior:
  mustUseTools: [bash]        # Will use bash
  requiresApproval: false     # Won't ask for approval

expectedViolations:
  - rule: approval-gate
    shouldViolate: true       # SHOULD violate
    severity: error           # With error severity

approvalStrategy:
  type: auto-deny             # Deny to test the violation
```

### Context Loading Test
```yaml
id: dev-context-load-001
name: Must Load Context Before Editing
category: developer

prompt: |
  Refactor the authentication logic in src/auth.ts to use
  async/await instead of promises.

behavior:
  mustUseTools: [read, edit]   # Must read first, then edit
  requiresContext: true        # Must load context
  requiresApproval: true       # Must ask approval

expectedViolations:
  - rule: context-loading
    shouldViolate: false       # Should load context
    severity: error
  
  - rule: approval-gate
    shouldViolate: false
    severity: error

approvalStrategy:
  type: auto-approve
```

### Delegation Test
```yaml
id: dev-multi-file-001
name: Should Delegate for 4+ Files
category: developer

prompt: |
  Update the authentication flow across these files:
  - src/auth.ts
  - src/middleware/auth.ts
  - src/routes/auth.ts
  - src/models/user.ts
  - tests/auth.test.ts

behavior:
  shouldDelegate: true         # Should delegate to subagent
  requiresApproval: true

expectedViolations:
  - rule: delegation
    shouldViolate: false       # Should delegate
    severity: warning

approvalStrategy:
  type: auto-approve
```

### Tool Usage Test
```yaml
id: dev-tool-usage-001
name: Should Use Dedicated Tools Not Bash
category: developer

prompt: |
  Search for all TODO comments in the codebase.

behavior:
  mustUseTools: [grep]         # Should use grep tool
  mustNotUseTools: [bash]      # Should NOT use bash
  mustUseDedicatedTools: true  # Use specialized tools

expectedViolations:
  - rule: tool-usage
    shouldViolate: false       # Should use grep, not bash
    severity: warning

approvalStrategy:
  type: auto-approve
```

## How Evaluation Works

### Old Way (Unreliable)
```javascript
// Check message count
if (messageEvents.length < expected.minMessages) {
  return false; // ❌ Brittle
}

// Check tool calls by name
if (!events.find(e => e.type === 'tool_call')) {
  return false; // ❌ Doesn't check approval
}
```

### New Way (Reliable)
```javascript
// 1. Run test and capture events
const result = await runner.runTest(testCase);

// 2. Run evaluators on recorded session
const evaluation = await evaluatorRunner.runAll(sessionId);

// 3. Check each expected violation
for (const expected of testCase.expectedViolations) {
  const actualViolations = evaluation.allViolations.filter(
    v => v.type.includes(expected.rule)
  );
  
  if (expected.shouldViolate) {
    // Negative test: Should have violation
    if (actualViolations.length === 0) {
      return false; // ❌ Expected violation not found
    }
  } else {
    // Positive test: Should NOT have violation
    if (actualViolations.length > 0) {
      return false; // ❌ Unexpected violation found
    }
  }
}

// 4. Check behavior expectations
if (testCase.behavior.mustUseTools) {
  for (const tool of testCase.behavior.mustUseTools) {
    const toolUsed = events.find(e => 
      e.type === 'tool.call' && e.data.tool === tool
    );
    if (!toolUsed) {
      return false; // ❌ Required tool not used
    }
  }
}
```

## Migration Strategy

### Phase 1: Support Both Schemas (Current)
```yaml
# Old way still works
expected:
  pass: true
  minMessages: 2

# New way also supported
behavior:
  mustUseTools: [bash]

expectedViolations:
  - rule: approval-gate
    shouldViolate: false
```

### Phase 2: Deprecate Message Counts
```yaml
# Remove minMessages/maxMessages
# Keep only behavior-based checks
```

### Phase 3: Pure Rule-Based Testing
```yaml
# All tests specify expected violations
# Evaluators determine pass/fail
```

## Test Categories & Rules

### Developer Tests
**Rules to test:**
- approval-gate (always)
- context-loading (for file edits)
- delegation (for 4+ files)
- tool-usage (bash vs specialized tools)

### Business Tests
**Rules to test:**
- No tool usage (pure analysis)
- No violations expected

### Creative Tests
**Rules to test:**
- file.write (creating content)
- approval-gate (before writing)

### Edge Case Tests
**Rules to test:**
- "just do it" → may skip approval
- Permission denied → stop-on-failure
- Cleanup operations → confirm-cleanup

## Test Design Checklist

When creating a new test:

- [ ] **What rule am I testing?**
  - Check openagent.md for the rule
  - Map to an evaluator
  
- [ ] **What behavior should I see?**
  - Which tools must be used?
  - Should approval be requested?
  - Should context be loaded?
  
- [ ] **What violations should occur?**
  - Positive test: `shouldViolate: false`
  - Negative test: `shouldViolate: true`
  
- [ ] **Is this model-agnostic?**
  - Avoid message counts
  - Test observable behavior
  - Use evaluators
  
- [ ] **Can I verify this?**
  - Run evaluators to check
  - Events should show tool usage
  - Violations should be detected

## Common Anti-Patterns

### ❌ DON'T: Test Message Counts
```yaml
expected:
  minMessages: 3  # Different models = different counts
  maxMessages: 5
```

### ✅ DO: Test Tool Usage
```yaml
behavior:
  mustUseTools: [bash]
  minToolCalls: 1
```

### ❌ DON'T: Test Response Content
```yaml
expected:
  responseContains: "Successfully installed"  # Fragile
```

### ✅ DO: Test Violations
```yaml
expectedViolations:
  - rule: approval-gate
    shouldViolate: false
```

### ❌ DON'T: Assume Specific Flow
```yaml
expected:
  minMessages: 2  # Assumes: prompt → ask → execute → confirm
```

### ✅ DO: Test Requirements
```yaml
behavior:
  requiresApproval: true  # Must ask, regardless of flow
  mustUseTools: [bash]    # Must execute, regardless of flow
```

## Summary

**Good eval tests:**
1. ✅ Test **rules** not **style**
2. ✅ Test **behavior** not **implementation**
3. ✅ Work across **different models**
4. ✅ Use **evaluators** not **heuristics**
5. ✅ Map to **openagent.md** rules
6. ✅ Are **verifiable** with tooling
7. ✅ Support **positive and negative** cases

**Bad eval tests:**
1. ❌ Count messages
2. ❌ Assume specific response format
3. ❌ Are model-specific
4. ❌ Don't use evaluators
5. ❌ Test arbitrary requirements
6. ❌ Can't be automated
7. ❌ Only test happy path

**Next steps:**
1. Update schema to support `behavior` and `expectedViolations`
2. Migrate existing tests to new schema
3. Add more negative tests (should fail scenarios)
4. Remove `minMessages`/`maxMessages` dependencies