OpenAgent Test Suite

Comprehensive test suite for OpenAgent, with a focus on context loading, approval workflows, and multi-turn conversations.


📊 Test Coverage

Total Tests: 22
Pass Rate: 100% ✅
Last Updated: 2025-11-26

Test Categories

| Category | Tests | Status | Description |
|---|---|---|---|
| context-loading | 5 | ✅ 100% | Context file loading validation |
| developer | 12 | ✅ 100% | Developer workflow tests |
| business | 2 | ✅ 100% | Business analysis tests |
| edge-case | 3 | ✅ 100% | Edge cases and error handling |

🎯 Context Loading Tests (NEW)

Overview

Five tests validate that OpenAgent loads context files before execution:

| Test | Type | Duration | Status |
|---|---|---|---|
| ctx-simple-testing-approach | Simple | ~38s | ✅ PASS |
| ctx-simple-documentation-format | Simple | ~26s | ✅ PASS |
| ctx-simple-coding-standards | Simple | ~21s | ✅ PASS |
| ctx-multi-standards-to-docs | Complex | ~116s | ✅ PASS |
| ctx-multi-error-handling-to-tests | Complex | ~148s | ✅ PASS |

Total Duration: ~6 minutes for all 5 tests
Pass Rate: 100% (5/5)

What They Test

Simple Tests (Read-Only)

  1. ctx-simple-coding-standards - Asks about coding standards

    • Validates: Loads code.md before responding
    • Tools: read
  2. ctx-simple-documentation-format - Asks about documentation format

    • Validates: Loads docs.md before responding
    • Tools: read
  3. ctx-simple-testing-approach - Asks about testing strategy

    • Validates: Loads testing-related files before responding
    • Tools: read (multiple files)

Complex Tests (Multi-Turn with File Creation)

  1. ctx-multi-standards-to-docs - Standards → Documentation creation

    • Turn 1: "What are our coding standards?"
    • Turn 2: "Create documentation about these standards"
    • Validates: Loads code.md + docs.md before writing
    • Tools: read, write
  2. ctx-multi-error-handling-to-tests - Error handling → Test creation

    • Turn 1: "How should we handle errors?"
    • Turn 2: "Write tests for error handling"
    • Validates: Loads code.md + tests.md before writing
    • Tools: read, write, grep, list, glob

See: CONTEXT_LOADING_COVERAGE.md for detailed documentation
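
The property these tests describe (context files are read before anything is written) can be pictured as an ordering check over the recorded tool calls. The sketch below is illustrative only and is not the evaluator's code: `ToolCall`, `context_loaded_first`, and the two paths marked as illustrative are assumptions made for the example.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation, in call order."""
    tool: str  # e.g. "read", "write"
    path: str  # file the tool touched


def context_loaded_first(calls, context_files, execution_tools=("write",)):
    """Return True if every required context file was read before the
    first execution tool ran (or by the end, if no execution tool ran)."""
    pending = set(context_files)
    for call in calls:
        if call.tool == "read":
            # Compare by file name so the check is independent of directory.
            pending.discard(PurePosixPath(call.path).name)
        elif call.tool in execution_tools:
            return not pending
    return not pending


calls = [
    ToolCall("read", ".opencode/context/core/standards/code.md"),
    ToolCall("read", "context/docs.md"),        # illustrative path
    ToolCall("write", "out/standards-doc.md"),  # illustrative path
]
print(context_loaded_first(calls, ["code.md", "docs.md"]))  # True
```

For the read-only simple tests no execution tool ever runs, so the function falls through to the final check and only confirms that the expected files were read.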


🚀 Running Tests

All OpenAgent Tests

cd evals/framework
npm run eval:sdk -- --agent=openagent

Context Loading Tests Only

npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml"

Specific Test

npm run eval:sdk -- --agent=openagent --pattern="context-loading/ctx-simple-coding-standards.yaml"
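
The `--pattern` values above read like glob patterns relative to the agent's `tests/` directory. Assuming ordinary shell-style glob semantics (an assumption; the framework's own matcher may behave differently), the selection can be sketched with Python's standard `fnmatch` module. `select_tests` is a hypothetical helper, and the file list is a subset of the paths listed in this README.

```python
from fnmatch import fnmatchcase

# A few test paths from this README, relative to tests/.
TEST_FILES = [
    "context-loading/ctx-simple-coding-standards.yaml",
    "context-loading/ctx-multi-standards-to-docs.yaml",
    "developer/create-component.yaml",
    "edge-case/just-do-it.yaml",
]


def select_tests(files, pattern):
    """Keep the files whose relative path matches a shell-style glob."""
    return [path for path in files if fnmatchcase(path, pattern)]


for path in select_tests(TEST_FILES, "context-loading/*.yaml"):
    print(path)
```

Under these semantics a full file path is simply a pattern with no wildcard, which is why the "Specific Test" command selects exactly one file.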

Debug Mode

npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml" --debug

Batch Execution (Avoid API Limits)

./scripts/utils/run-tests-batch.sh openagent 3 10
# Args: agent, batch_size, delay_seconds
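
The script's arguments suggest a chunk-and-pause loop: run `batch_size` tests, wait `delay_seconds`, repeat. The sketch below illustrates that idea only; it is not the contents of `run-tests-batch.sh`, and `chunk`, `run_in_batches`, and the `run_batch` callback are hypothetical names.

```python
import time


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_in_batches(tests, batch_size, delay_seconds, run_batch, sleep=time.sleep):
    """Run each batch, pausing between batches (not after the last one).

    Returns the number of batches that were run.
    """
    batches = chunk(tests, batch_size)
    for index, batch in enumerate(batches):
        run_batch(batch)
        if index < len(batches) - 1:
            sleep(delay_seconds)
    return len(batches)


# Dry run with 22 placeholder test names and no real sleeping.
count = run_in_batches([f"test-{n}" for n in range(22)], 3, 10,
                       run_batch=lambda batch: None, sleep=lambda s: None)
print(count)  # 8
```

With 22 tests and a batch size of 3 this gives 8 batches and 7 pauses, so a 10-second delay adds about 70 seconds to the run.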

📁 Test Structure

tests/
├── context-loading/              # Context loading tests (NEW)
│   ├── ctx-simple-coding-standards.yaml
│   ├── ctx-simple-documentation-format.yaml
│   ├── ctx-simple-testing-approach.yaml
│   ├── ctx-multi-standards-to-docs.yaml
│   └── ctx-multi-error-handling-to-tests.yaml
│
├── developer/                    # Developer workflow tests
│   ├── ctx-code-001.yaml        # Code task with context
│   ├── ctx-docs-001.yaml        # Docs task with context
│   ├── ctx-tests-001.yaml       # Tests task with context
│   ├── ctx-review-001.yaml      # Review task with context
│   ├── ctx-delegation-001.yaml  # Delegation task
│   ├── ctx-multi-turn-001.yaml  # Multi-turn conversation
│   ├── create-component.yaml    # Component creation
│   ├── install-dependencies.yaml
│   ├── install-dependencies-v2.yaml
│   ├── task-simple-001.yaml
│   └── fail-stop-001.yaml
│
├── business/                     # Business analysis tests
│   ├── conv-simple-001.yaml
│   └── data-analysis.yaml
│
└── edge-case/                    # Edge cases
    ├── just-do-it.yaml
    ├── missing-approval-negative.yaml
    └── no-approval-negative.yaml

🔧 Test Features

Multi-Turn Support

OpenAgent tests use multi-turn prompts to simulate the approval workflow:

prompts:
  - text: "What are our coding standards?"
    expectContext: true
    contextFile: "standards.md"
  
  - text: "approve"
    delayMs: 2000
  
  - text: "Create documentation about these standards"
    expectContext: true
    contextFile: "docs.md"
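
One way to read this structure: each entry is sent as its own turn, and `delayMs` is a pause applied before that turn is sent. That reading of `delayMs`, the `send` callback, and the `run_turns` helper are all assumptions made for illustration; this is not the SDK runner's code.

```python
import time


def run_turns(prompts, send, sleep=time.sleep):
    """Send each prompt in order, honoring an optional per-turn delay.

    `prompts` mirrors the YAML list above: dicts with a "text" key and an
    optional "delayMs" key. `send` is a hypothetical callback that
    delivers one turn to the agent and returns its reply.
    """
    replies = []
    for prompt in prompts:
        delay_ms = prompt.get("delayMs", 0)
        if delay_ms:
            sleep(delay_ms / 1000)  # delayMs is in milliseconds
        replies.append(send(prompt["text"]))
    return replies


prompts = [
    {"text": "What are our coding standards?",
     "expectContext": True, "contextFile": "standards.md"},
    {"text": "approve", "delayMs": 2000},
    {"text": "Create documentation about these standards",
     "expectContext": True, "contextFile": "docs.md"},
]

# Dry run: echo each turn instead of calling an agent, and skip sleeping.
replies = run_turns(prompts, send=lambda text: f"sent: {text}", sleep=lambda s: None)
print(len(replies))  # 3
```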

Smart Timeout

Complex tests use a smart timeout system:

  • Base timeout: 300s (5 min) of inactivity
  • Absolute max: 600s (10 min) hard limit
  • Activity monitoring: Extends the timeout while the agent is working

timeout: 300000  # 5 minutes
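
The two limits interact as follows: activity resets the inactivity window, but nothing can push a run past the hard limit. A minimal sketch of that logic, assuming the limits listed above; the `SmartTimeout` class and its method names are hypothetical and not taken from the framework.

```python
import time


class SmartTimeout:
    """Inactivity timeout that activity can extend, capped by a hard limit."""

    def __init__(self, inactivity_s=300.0, absolute_max_s=600.0, clock=time.monotonic):
        self._clock = clock
        self._inactivity_s = inactivity_s
        self._absolute_max_s = absolute_max_s
        self._started = clock()
        self._last_activity = self._started

    def record_activity(self):
        """Call whenever the agent does something (e.g. a tool call)."""
        self._last_activity = self._clock()

    def expired(self):
        now = self._clock()
        if now - self._started >= self._absolute_max_s:
            return True  # hard limit reached, regardless of activity
        return now - self._last_activity >= self._inactivity_s


# Simulated clock so the example runs instantly.
now = [0.0]
timer = SmartTimeout(clock=lambda: now[0])
now[0] = 299.0
timer.record_activity()   # activity just before the base timeout
now[0] = 450.0
print(timer.expired())    # False: only 151s since the last activity
now[0] = 600.0
print(timer.expired())    # True: absolute maximum reached
```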

Context Validation

Tests verify context files are loaded before execution:

behavior:
  mustUseTools: [read, write]
  requiresContext: true
  minToolCalls: 2

expectedViolations:
  - rule: context-loading
    shouldViolate: false
    severity: error
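
An `expectedViolations` entry pairs a rule with whether it is expected to fire. Presumably the runner compares these expectations with the rules the evaluator actually flagged; the sketch below shows one way to do that comparison. The function name and the shape of `violated_rules` are assumptions, not the framework's API.

```python
def check_expected_violations(expected, violated_rules):
    """Compare expectations against the set of rules that actually fired.

    `expected` mirrors the YAML above: dicts with "rule" and
    "shouldViolate". `violated_rules` is a hypothetical set of rule names
    reported by the evaluator. Returns a list of mismatch messages; an
    empty list means the expectations were met.
    """
    mismatches = []
    for item in expected:
        violated = item["rule"] in violated_rules
        if violated != item["shouldViolate"]:
            wanted = "a violation" if item["shouldViolate"] else "no violation"
            got = "a violation" if violated else "no violation"
            mismatches.append(f"{item['rule']}: expected {wanted}, got {got}")
    return mismatches


expected = [{"rule": "context-loading", "shouldViolate": False, "severity": "error"}]
print(check_expected_violations(expected, set()))  # []
```

A negative test would flip the expectation to `shouldViolate: true` and pass only when the rule does fire.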

📊 Test Results

Latest Run (2025-11-26)

======================================================================
SUMMARY: 5/5 context loading tests passed (0 failed)
======================================================================

✅ ctx-simple-testing-approach          (38s)
✅ ctx-simple-documentation-format      (26s)
✅ ctx-simple-coding-standards          (21s)
✅ ctx-multi-standards-to-docs         (116s)
✅ ctx-multi-error-handling-to-tests   (148s)

Total Duration: 349 seconds (~6 minutes)
Pass Rate: 100%
Violations: 0
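
The totals in this run follow directly from the per-test lines. As a quick check of the arithmetic, using the durations reported above (`summarize` is a hypothetical helper, not part of the framework):

```python
# (test id, passed, duration in seconds) as reported in the latest run.
RESULTS = [
    ("ctx-simple-testing-approach", True, 38),
    ("ctx-simple-documentation-format", True, 26),
    ("ctx-simple-coding-standards", True, 21),
    ("ctx-multi-standards-to-docs", True, 116),
    ("ctx-multi-error-handling-to-tests", True, 148),
]


def summarize(results):
    """Aggregate pass counts, pass rate, and total duration."""
    passed = sum(1 for _, ok, _ in results if ok)
    failed = len(results) - passed
    total_s = sum(seconds for _, _, seconds in results)
    return {
        "summary": f"{passed}/{len(results)} passed ({failed} failed)",
        "total_seconds": total_s,
        "pass_rate": 100 * passed / len(results) if results else 0.0,
    }


report = summarize(RESULTS)
print(report["summary"])        # 5/5 passed (0 failed)
print(report["total_seconds"])  # 349
```

349 seconds is about 5.8 minutes, which matches the "~6 minutes" figure.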

Context Loading Details

Context Loading:
  ✓ Loaded: .opencode/context/core/standards/code.md
  ✓ Timing: Context loaded 44317ms before execution

🎯 Key Achievements

November 26, 2025

✅ Context Loading Tests - 5 comprehensive tests (3 simple, 2 complex)
✅ 100% Pass Rate - All tests passing
✅ Smart Timeout - Handles complex multi-turn tests
✅ Fixed Evaluator - Properly detects context files
✅ Cleanup System - Auto-cleans test artifacts
✅ Documentation - Complete coverage documentation


📚 Documentation

| Document | Purpose |
|---|---|
| CONTEXT_LOADING_COVERAGE.md | Detailed context loading test documentation |
| IMPLEMENTATION_SUMMARY.md | Recent implementation details and fixes |
| docs/OPENAGENT_RULES.md | OpenAgent rules reference |

🔍 Test Design

Simple Test Example

id: ctx-simple-coding-standards
name: "Context Loading: Coding Standards"
description: |
  Simple test: Ask about coding standards and verify agent loads context file.

category: developer
agent: openagent
model: anthropic/claude-sonnet-4-5

prompt: "What are our coding standards for this project?"

behavior:
  mustUseAnyOf: [[read]]
  requiresContext: true
  minToolCalls: 1

expectedViolations:
  - rule: context-loading
    shouldViolate: false
    severity: error

approvalStrategy:
  type: auto-approve

timeout: 60000

tags:
  - context-loading
  - simple-test

Complex Test Example

id: ctx-multi-standards-to-docs
name: "Context Loading: Multi-Turn Standards to Documentation"
description: |
  Complex multi-turn test: Standards question → Documentation request

category: developer
agent: openagent
model: anthropic/claude-sonnet-4-5

prompts:
  - text: "What are our coding standards?"
    expectContext: true
    contextFile: "standards.md"
  
  - text: "approve"
    delayMs: 2000
  
  - text: "Can you create documentation about these standards?"
    expectContext: true
    contextFile: "docs.md"
  
  - text: "approve"
    delayMs: 2000

behavior:
  mustUseTools: [read, write]
  requiresApproval: true
  requiresContext: true
  minToolCalls: 3

expectedViolations:
  - rule: approval-gate
    shouldViolate: false
    severity: error
  
  - rule: context-loading
    shouldViolate: false
    severity: error

approvalStrategy:
  type: auto-approve

timeout: 300000  # 5 minutes

tags:
  - context-loading
  - multi-turn
  - complex-test

🛠️ Troubleshooting

Test Timeout

Issue: Test times out on complex multi-turn scenarios
Solution: Increase timeout to 300000ms (5 minutes)

Context Not Loaded

Issue: Evaluator reports "no context loaded"
Solution: Ensure the test uses multi-turn prompts that include an approval turn

Files Not Cleaned Up

Issue: Test artifacts remain in test_tmp/
Solution: Check the cleanup logic in run-sdk-tests.ts


📈 Next Steps

  1. Add More Edge Cases

    • Test with missing context files
    • Test with multiple context directories
    • Test with file attachments
  2. Performance Metrics

    • Track context load time vs execution time
    • Measure API response times
    • Monitor rate limit usage
  3. Test Coverage Expansion

    • Add tests for other agent behaviors
    • Test delegation scenarios
    • Test error handling paths

🤝 Contributing

To add new tests:

  1. Create a YAML file in the appropriate category directory
  2. Follow the test schema (see the examples above)
  3. Run the test to verify it works
  4. Update this README if you add a new category
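
Before running a new test, a quick structural check can catch missing fields. The sketch below is based only on the keys that appear in both examples in this README; the framework's real schema may require more or fewer fields, and `COMMON_KEYS` and `lint_test_definition` are hypothetical names.

```python
# Keys present in both example tests in this README (an assumption about
# what a test needs, not the framework's authoritative schema).
COMMON_KEYS = ("id", "name", "description", "category", "agent", "behavior", "timeout")


def lint_test_definition(test):
    """Return a list of problems found in a parsed test definition (a dict)."""
    problems = [f"missing key: {key}" for key in COMMON_KEYS if key not in test]
    # The examples use either a single `prompt` or a multi-turn `prompts` list.
    if ("prompt" in test) == ("prompts" in test):
        problems.append("expected exactly one of 'prompt' or 'prompts'")
    return problems


draft = {
    "id": "ctx-simple-coding-standards",
    "name": "Context Loading: Coding Standards",
    "description": "Ask about coding standards and verify the context file is loaded.",
    "category": "developer",
    "agent": "openagent",
    "prompt": "What are our coding standards for this project?",
    "behavior": {"requiresContext": True, "minToolCalls": 1},
    "timeout": 60000,
}
print(lint_test_definition(draft))  # []
```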

Last Updated: 2025-11-26
Test Framework Version: 0.1.0
OpenAgent Tests: 22
Pass Rate: 100%