Eval Framework - Summary

Date: November 28, 2025
Status: ✅ Ready to Test

What Was Done

1. Enhanced Evaluators ✅

ApprovalGateEvaluator - Added confidence levels, approval text capture
ContextLoadingEvaluator - Already validates correct context file for task type
All 8 evaluators working

2. Cleaned Up Tests ✅

Before: 71 files, 42 directories, 20 duplicates
After: 49 unique tests, 18 directories, 0 duplicates
Archived 22 duplicates to _archive/

3. Model Testing ✅

Grok Code Fast: ❌ CONFIRMED - Does NOT execute tools (tested 3 times)
Claude Sonnet 4.5: ✅ Works perfectly
Use Claude for all testing

Core Test Suite (8 tests - RECOMMENDED)

Minimum tests to validate OpenAgent's 4 critical rules:

Approval Gate (2 tests):

05-approval-before-execution-positive.yaml
02-missing-approval-negative.yaml

Context Loading (3 tests):

01-code-task.yaml
02-docs-task.yaml
11-wrong-context-file-negative.yaml

Stop on Failure (2 tests):

02-stop-and-report-positive.yaml
03-auto-fix-negative.yaml

Report First (1 test):

01-correct-workflow-positive.yaml

Cost: ~$0.35 | Time: ~4 min | Token savings: 84%

Full Test Structure

01-critical-rules/     22 tests (Approval, Context, Stop, Report)
06-integration/         6 tests
06-negative/            5 tests (Violation detection)
07-behavior/            4 tests
05-edge-cases/          3 tests
02-workflow-stages/     2 tests
04-execution-paths/     2 tests
08-delegation/          2 tests
09-tool-usage/          2 tests
smoke-test.yaml         1 test

Total: 49 unique tests

Run Tests

Core Suite (8 tests - START HERE)

cd evals/framework

# Run all 8 core tests
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/{approval-gate/05*,approval-gate/02*,context-loading/01*,context-loading/02*,context-loading/11*,stop-on-failure/02*,stop-on-failure/03*,report-first/01*}" \
  --model=anthropic/claude-sonnet-4-5

Cost: ~$0.35 | Time: ~4 min

All Critical Rules (22 tests)

npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/**/*.yaml" \
  --model=anthropic/claude-sonnet-4-5

Cost: ~$1 | Time: ~10 min

Full Suite (49 tests)

npm run eval:sdk -- --agent=openagent \
  --model=anthropic/claude-sonnet-4-5

Cost: ~$2 | Time: ~20 min

Key Findings

✅ Framework is production-ready
✅ Tests are clean and organized (49 unique)
✅ Core suite identified (8 tests, 84% token savings)
❌ Grok confirmed broken (0 tool calls on all tests)
✅ Claude works perfectly and is affordable

Recommendation: Start with core 8 tests, expand if needed.

SUMMARY.md 2.7 KB History Raw