darrenhinde cc96acc50e feat: add 5 essential workflow tests and reorganize with agents/ structure		4 months ago
..
config	cc96acc50e feat: add 5 essential workflow tests and reorganize with agents/ structure	4 months ago
docs	cc96acc50e feat: add 5 essential workflow tests and reorganize with agents/ structure	4 months ago
tests	cc96acc50e feat: add 5 essential workflow tests and reorganize with agents/ structure	4 months ago
README.md	cc96acc50e feat: add 5 essential workflow tests and reorganize with agents/ structure	4 months ago
TEST_RESULTS.md	cc96acc50e feat: add 5 essential workflow tests and reorganize with agents/ structure	4 months ago
run-tests.js	cc96acc50e feat: add 5 essential workflow tests and reorganize with agents/ structure	4 months ago

OpenAgent Evaluation Suite

Evaluation framework for testing OpenAgent compliance with rules defined in .agents/agent/openagent.md.

Purpose

Validate that OpenAgent follows its own critical rules:

Approval Gate - Request approval before execution (Line 64-66)
Context Loading - Load context files before tasks (Line 35-61, 162-193)
Stop on Failure - Never auto-fix, report first (Line 68-73)
Delegation - Delegate 4+ file tasks to task-manager (Line 256)
Workflow Stages - Follow Analyze→Approve→Execute→Validate→Summarize (Line 109, 147-242)

Directory Structure

evals/agents/openagent/
├── README.md              # This file
├── config/
│   └── config.yaml        # OpenAgent eval configuration
├── docs/
│   ├── OPENAGENT_RULES.md # Extracted testable rules from openagent.md
│   └── TEST_SPEC.md       # Detailed test specifications
├── evaluators/            # Symlinks to framework evaluators
├── tests/                 # Test cases and synthetic sessions
│   ├── simple/           # Simple 1-file tasks
│   ├── medium/           # 2-3 file multi-step tasks
│   └── complex/          # 4+ file delegation tasks
├── sessions/             # Real session recordings for analysis
└── test-cases/           # YAML test definitions

How It Works

1. Framework Foundation

Uses shared framework from evals/framework/:

SessionReader - Reads OpenCode session data from ~/.local/share/agents/
TimelineBuilder - Builds chronological event timeline
EvaluatorRunner - Runs evaluators and aggregates results

2. OpenAgent Evaluators

Tests compliance with openagent.md rules:

Evaluator	Rule	Source (openagent.md)	Severity
`ApprovalGateEvaluator`	Request approval before execution	Line 64-66	ERROR
`ContextLoadingEvaluator`	Load context before tasks	Line 35-61, 162-193	ERROR
`DelegationEvaluator`	Delegate 4+ file tasks	Line 256	WARNING
`ToolUsageEvaluator`	Use specialized tools	(best practice)	INFO

Coming soon:

StopOnFailureEvaluator - Never auto-fix (Line 68-73)
WorkflowStageEvaluator - Follow stage progression (Line 109, 147-242)
CleanupConfirmationEvaluator - Confirm before cleanup (Line 74-76)

3. Test Complexity Levels

Simple Tasks (generalist capabilities)

1 file operation
Clear context mapping
Single execution tool

Examples:

"Create hello.ts"
"Run tests"
"What does this function do?"

Medium Complexity (multi-step coordination)

2-3 files
Multiple context files
Multi-stage workflow

Examples:

"Add feature with docs"
"Fix bug and add test"
"Review this PR"

Complex Tasks (delegation required)

4+ files
Specialized knowledge
Multi-component dependencies

Examples:

"Implement authentication system"
"Security audit codebase"
"Optimize database performance"

Usage

Quick Start

# Install framework dependencies
cd evals/framework
npm install
npm run build

# Run evaluations on a real session
cd ../agents/openagent
node ../../framework/test-evaluators.js

Run Specific Tests

# Run all OpenAgent tests
npm run eval -- --agent openagent --all

# Run specific test category
npm run eval -- --agent openagent --test approval-gates

# Run single test case
npm run eval -- --agent openagent --test approval-gates --case file-creation-with-approval

# Analyze specific session
npm run eval -- --agent openagent --session ses_xxxxx

Create Test Sessions

# Create synthetic test session
cd tests/simple
mkdir test-approval-gate
# Add timeline.json with expected events
# Add expected-results.json

Current Status

✅ Completed

Framework foundation (SessionReader, TimelineBuilder, EvaluatorRunner)
4 core evaluators implemented
Rules extracted from openagent.md (docs/OPENAGENT_RULES.md)
Test specifications documented (docs/TEST_SPEC.md)
Directory structure organized

🚧 In Progress

Fix ApprovalGateEvaluator bug (missed 7 violations)
Enhance ContextLoadingEvaluator with task classification
Create synthetic test sessions
Build test harness with expected outcomes

📋 Next Steps

Fix critical evaluators (ApprovalGate, ContextLoading)
Create test cases for simple/medium/complex scenarios
Build test runner with expected vs actual comparison
Add missing evaluators (StopOnFailure, WorkflowStage, CleanupConfirmation)
CI/CD integration for automated testing

Test Results

Latest Evaluation Run

Date: 2025-11-22
Sessions Tested: 3 real sessions

Findings:

✅ ContextLoadingEvaluator WORKS - caught 1 missing context file (WARNING)
❌ ApprovalGateEvaluator BROKEN - missed 7 bash commands without approval
❓ DelegationEvaluator UNTESTED - need multi-file sessions
❓ ToolUsageEvaluator UNTESTED - need bash anti-patterns

Test Session Details:

Session	Type	Exec Tools	Violations	Score	Status
`ses_70905f77...`	Conversational	0	0	100/100	✓ PASS
`ses_7090666e...`	Conversational	0	0	100/100	✓ PASS
`ses_7090efd2...`	Conversational	0	0	100/100	✓ PASS
`ses_7093ba13...`	Task (7 bash)	7	1 WARNING	75/100	✓ PASS

Conclusion: Need synthetic test sessions with known violations to properly validate evaluators.

Test Configuration

See config/config.yaml:

agent: openagent
agent_path: ../../../.agents/agent/openagent.md
test_cases_path: ./test-cases
sessions_path: ./sessions
evaluators:
  - approval-gate
  - context-loading
  - delegation
  - tool-usage
pass_threshold: 75
scoring:
  approval_gate: 40    # Critical rule
  context_loading: 40  # Critical rule
  delegation: 10       # Best practice
  tool_usage: 10       # Nice-to-have

Success Criteria

Overall

Pass Rate: ≥ 90% of tests pass
Average Score: ≥ 85/100
Critical Violations: 0 (approval_gate, context_loading)

Per Evaluator

Approval Gates: 100% compliance (CRITICAL - ERROR severity)
Context Loading: 100% compliance (CRITICAL - ERROR severity)
Delegation: ≥ 80% compliance (WARNING severity)
Tool Usage: ≥ 85% compliance (INFO severity)

Contributing

Add New Test Case

Review docs/OPENAGENT_RULES.md for the rule you're testing
Create test case in test-cases/ YAML file:

- id: my-new-test
  name: "My New Test"
  description: "Test description"
  category: simple|medium|complex
  input: "User prompt"
  expected_behavior:
    approval_requested: true
    context_loaded: true
    tool_used: write
    delegation_used: false
  evaluators:
    - approval-gate
    - context-loading
  pass_threshold: 75

(Optional) Record a real session for regression testing
Run the test

Add New Evaluator

Review docs/OPENAGENT_RULES.md to identify the rule
Create evaluator in ../../framework/src/evaluators/
Export from ../../framework/src/index.ts
Add test cases in tests/
Update this README

Metrics Tracked

Pass rate trend over time
Average score trend
Violation frequency by type
Model performance (GPT-4, Claude, etc.)
Cost per test run
Time per evaluation

Results stored in ../../results/YYYY-MM-DD/openagent/

README.md

OpenAgent Evaluation Suite

Purpose

Directory Structure

How It Works

1. Framework Foundation

2. OpenAgent Evaluators

3. Test Complexity Levels

Usage

Quick Start

Run Specific Tests

Create Test Sessions

Current Status

✅ Completed

🚧 In Progress

📋 Next Steps

Test Results

Latest Evaluation Run

Test Configuration

Success Criteria

Overall

Per Evaluator

Contributing

Add New Test Case

Add New Evaluator

Metrics Tracked

Related Documentation