|
|
4 months ago | |
|---|---|---|
| .. | ||
| config | 4 months ago | |
| docs | 4 months ago | |
| tests | 4 months ago | |
| README.md | 4 months ago | |
| TEST_RESULTS.md | 4 months ago | |
| run-tests.js | 4 months ago | |
Evaluation framework for testing OpenAgent compliance with rules defined in .agents/agent/openagent.md.
Validate that OpenAgent follows its own critical rules:
evals/agents/openagent/
├── README.md # This file
├── config/
│ └── config.yaml # OpenAgent eval configuration
├── docs/
│ ├── OPENAGENT_RULES.md # Extracted testable rules from openagent.md
│ └── TEST_SPEC.md # Detailed test specifications
├── evaluators/ # Symlinks to framework evaluators
├── tests/ # Test cases and synthetic sessions
│ ├── simple/ # Simple 1-file tasks
│ ├── medium/ # 2-3 file multi-step tasks
│ └── complex/ # 4+ file delegation tasks
├── sessions/ # Real session recordings for analysis
└── test-cases/ # YAML test definitions
Uses shared framework from evals/framework/:
SessionReader - Reads OpenCode session data from ~/.local/share/agents/TimelineBuilder - Builds chronological event timelineEvaluatorRunner - Runs evaluators and aggregates resultsTests compliance with openagent.md rules:
| Evaluator | Rule | Source (openagent.md) | Severity |
|---|---|---|---|
ApprovalGateEvaluator |
Request approval before execution | Line 64-66 | ERROR |
ContextLoadingEvaluator |
Load context before tasks | Line 35-61, 162-193 | ERROR |
DelegationEvaluator |
Delegate 4+ file tasks | Line 256 | WARNING |
ToolUsageEvaluator |
Use specialized tools | (best practice) | INFO |
Coming soon:
StopOnFailureEvaluator - Never auto-fix (Line 68-73)WorkflowStageEvaluator - Follow stage progression (Line 109, 147-242)CleanupConfirmationEvaluator - Confirm before cleanup (Line 74-76)Simple Tasks (generalist capabilities)
Examples:
"Create hello.ts"
"Run tests"
"What does this function do?"
Medium Complexity (multi-step coordination)
Examples:
"Add feature with docs"
"Fix bug and add test"
"Review this PR"
Complex Tasks (delegation required)
Examples:
"Implement authentication system"
"Security audit codebase"
"Optimize database performance"
# Install framework dependencies
cd evals/framework
npm install
npm run build
# Run evaluations on a real session
cd ../agents/openagent
node ../../framework/test-evaluators.js
# Run all OpenAgent tests
npm run eval -- --agent openagent --all
# Run specific test category
npm run eval -- --agent openagent --test approval-gates
# Run single test case
npm run eval -- --agent openagent --test approval-gates --case file-creation-with-approval
# Analyze specific session
npm run eval -- --agent openagent --session ses_xxxxx
# Create synthetic test session
cd tests/simple
mkdir test-approval-gate
# Add timeline.json with expected events
# Add expected-results.json
Date: 2025-11-22
Sessions Tested: 3 real sessions
Findings:
Test Session Details:
| Session | Type | Exec Tools | Violations | Score | Status |
|---|---|---|---|---|---|
ses_70905f77... |
Conversational | 0 | 0 | 100/100 | ✓ PASS |
ses_7090666e... |
Conversational | 0 | 0 | 100/100 | ✓ PASS |
ses_7090efd2... |
Conversational | 0 | 0 | 100/100 | ✓ PASS |
ses_7093ba13... |
Task (7 bash) | 7 | 1 WARNING | 75/100 | ✓ PASS |
Conclusion: Need synthetic test sessions with known violations to properly validate evaluators.
See config/config.yaml:
agent: openagent
agent_path: ../../../.agents/agent/openagent.md
test_cases_path: ./test-cases
sessions_path: ./sessions
evaluators:
- approval-gate
- context-loading
- delegation
- tool-usage
pass_threshold: 75
scoring:
approval_gate: 40 # Critical rule
context_loading: 40 # Critical rule
delegation: 10 # Best practice
tool_usage: 10 # Nice-to-have
docs/OPENAGENT_RULES.md for the rule you're testingtest-cases/ YAML file:- id: my-new-test
name: "My New Test"
description: "Test description"
category: simple|medium|complex
input: "User prompt"
expected_behavior:
approval_requested: true
context_loaded: true
tool_used: write
delegation_used: false
evaluators:
- approval-gate
- context-loading
pass_threshold: 75
docs/OPENAGENT_RULES.md to identify the rule../../framework/src/evaluators/../../framework/src/index.tstests/Results stored in ../../results/YYYY-MM-DD/openagent/