# Creating Custom Tests

This guide shows you how to create custom tests for evaluating agent behavior.

## Quick Start

1. Copy a template from `evals/agents/shared/tests/templates/`
2. Modify the prompts and expectations
3. Run with `npm run eval:sdk -- --agent=<agent> --pattern="**/your-test.yaml"`

## Templates

| Template | Use Case |
|----------|----------|
| `read-only.yaml` | Tests that only read files |
| `write-with-approval.yaml` | Tests that create/modify files |
| `read-then-write.yaml` | Tests that inspect before modifying |
| `multi-turn.yaml` | Multi-message conversations |
| `context-loading.yaml` | Tests that require loading context |

## Test Structure

```yaml
id: unique-test-id
name: "Human Readable Name"
description: |
  What this test validates.
category: developer  # developer, business, creative, edge-case

# Single prompt OR multi-turn prompts
prompt: |
  Single message to send.

# OR for multi-turn:
prompts:
  - text: |
      First message.
  - text: |
      Second message (e.g., "Yes, proceed").
    delayMs: 2000  # Wait before sending

approvalStrategy:
  type: auto-approve  # auto-approve, auto-deny, or smart

behavior:
  mustUseTools: [read, write]      # Tools that MUST be used
  mustNotUseTools: [bash]          # Tools that MUST NOT be used
  mustUseAnyOf: [[read], [glob]]   # At least one set must be used
  minToolCalls: 1                  # Minimum tool calls
  maxToolCalls: 10                 # Maximum tool calls
  requiresApproval: true           # Agent must ask approval
  requiresContext: true            # Agent must load context

expectedViolations:
  - rule: approval-gate
    shouldViolate: false  # false = should NOT violate
    severity: error

timeout: 60000  # Milliseconds

tags:
  - my-tag
```

## Behavior Options

### mustUseTools
Tools the agent MUST use. Test fails if any are missing.
```yaml
behavior:
  mustUseTools:
    - read
    - write
```

### mustNotUseTools
Tools the agent MUST NOT use. Test fails if any are used.
```yaml
behavior:
  mustNotUseTools:
    - bash  # Prevent bash usage
```

### mustUseAnyOf
Alternative tool sets - at least ONE set must be fully used.
```yaml
behavior:
  mustUseAnyOf:
    - [read]           # Either use read
    - [glob, read]     # OR use glob AND read
    - [list, read]     # OR use list AND read
```

### requiresApproval
Agent must ask for approval before executing.
```yaml
behavior:
  requiresApproval: true
```

### requiresContext
Agent must load context files before acting.
```yaml
behavior:
  requiresContext: true
```

### expectedContextFiles (NEW)
Explicitly specify which context files the agent must read. This overrides auto-detection.

**Use this when:**
- Testing custom context files
- Enforcing critical file requirements (compliance, security)
- You need precise control over which file is validated

**Pattern matching:** Uses substring matching (`includes()` or `endsWith()`)
- `code.md` - Matches any path ending with "code.md"
- `standards/code.md` - Matches any path containing "standards/code.md"
- `.opencode/context/core/standards/code.md` - Matches full relative path

```yaml
behavior:
  requiresContext: true
  expectedContextFiles:
    - .opencode/context/core/standards/code.md  # Full path
    - standards/code.md                         # Partial path
    - code.md                                   # Just filename
```

**Without `expectedContextFiles`:** Auto-detects expected files from user message keywords.
**With `expectedContextFiles`:** Uses explicit files (takes precedence).

See [EXPLICIT_CONTEXT_FILES.md](agents/shared/tests/EXPLICIT_CONTEXT_FILES.md) for detailed guide.

## Expected Violations

Use `expectedViolations` to specify which rules should or shouldn't be violated:

```yaml
expectedViolations:
  # Positive test: should NOT violate
  - rule: approval-gate
    shouldViolate: false
    severity: error

  # Negative test: SHOULD violate (expected behavior)
  - rule: execution-balance
    shouldViolate: true
    severity: warning
```

### Available Rules

| Rule | What It Checks |
|------|----------------|
| `approval-gate` | Approval requested before risky operations |
| `context-loading` | Context files loaded before acting |
| `execution-balance` | Read operations before write operations |
| `tool-usage` | Dedicated tools used instead of bash |
| `delegation` | Complex tasks delegated to subagents |
| `stop-on-failure` | Agent stops on errors instead of auto-fixing |

## Examples

### Simple Read Test
```yaml
id: read-readme
name: "Read README"
description: Agent reads a file and summarizes it.
category: developer

prompts:
  - text: Read evals/test_tmp/README.md and summarize it.

approvalStrategy:
  type: auto-approve

behavior:
  mustUseTools: [read]
  minToolCalls: 1

expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error

timeout: 60000
```

### Write With Approval
```yaml
id: create-file
name: "Create File With Approval"
description: Agent asks approval before creating file.
category: developer

prompts:
  - text: Create a file at evals/test_tmp/test.txt with "hello".
  - text: Yes, proceed.
    delayMs: 2000

approvalStrategy:
  type: auto-approve

behavior:
  mustUseTools: [write]
  requiresApproval: true

expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error

timeout: 90000
```

### Context-Aware Task (Auto-Detect)
```yaml
id: coding-standards
name: "Load Coding Standards"
description: Agent loads context before answering.
category: developer

prompts:
  - text: What are the coding standards? Check the project docs.

approvalStrategy:
  type: auto-approve

behavior:
  mustUseAnyOf:
    - [read]
    - [glob, read]
  requiresContext: true

expectedViolations:
  - rule: context-loading
    shouldViolate: false
    severity: error

timeout: 90000
```

### Context-Aware Task (Explicit File)
```yaml
id: coding-standards-explicit
name: "Load Specific Coding Standards File"
description: Agent must read the exact context file specified.
category: developer

prompts:
  - text: What are the coding standards? Check the project docs.

approvalStrategy:
  type: auto-approve

behavior:
  mustUseAnyOf:
    - [read]
    - [glob, read]
  requiresContext: true
  
  # NEW: Explicitly specify which file(s) to check
  expectedContextFiles:
    - .opencode/context/core/standards/code.md
    - standards/code.md
    - code.md

expectedViolations:
  - rule: context-loading
    shouldViolate: false
    severity: error

timeout: 90000
```

## Running Tests

```bash
# Run your test
npm run eval:sdk -- --agent=openagent --pattern="**/your-test.yaml"

# Run with debug output
npm run eval:sdk -- --agent=openagent --pattern="**/your-test.yaml" --debug

# Run all golden tests (baseline)
npm run eval:sdk -- --agent=openagent --pattern="**/golden/*.yaml"
```

## Tips

1. **Start with templates** - Copy and modify, don't write from scratch
2. **Use test_tmp/** - All writes should go to `evals/test_tmp/` (auto-cleaned)
3. **Multi-turn for writes** - Always include approval message for write operations
4. **Keep tests focused** - One behavior per test
5. **Use tags** - Makes filtering easier