SIMPLE_TEST_PLAN.md 9.7 KB

Simple Test Plan - OpenAgent Workflow Validation

Goal: Validate that OpenAgent follows the workflows defined in openagent.md
Approach: Keep it simple - test one workflow at a time
Focus: Behavior compliance, not complexity


Core Workflows to Test (from openagent.md)

Workflow Stages (Lines 147-242)

Stage 1: Analyze    → Assess request type
Stage 2: Approve    → Request approval (if task path)
Stage 3: Execute    → Load context → Route → Run
Stage 4: Validate   → Check quality → Stop on failure
Stage 5: Summarize  → Report results
Stage 6: Confirm    → Cleanup confirmation

Test Scenarios (Simple & Focused)

Category 1: Conversational Path (No Execution)

Workflow: Analyze → Answer directly (skip approval)

Test ID Scenario Expected Behavior Current Status
conv-001 "What does this code do?" Read file → Answer (no approval) ✅ Have similar test
conv-002 "How do I use git rebase?" Answer directly (no tools) ❌ Need to add
conv-003 "Explain this error message" Analyze → Answer (no approval) ❌ Need to add

Key Rule: No approval needed for pure questions (Line 136-139)


Category 2: Task Path - Simple Execution

Workflow: Analyze → Approve → Execute → Validate → Summarize

Test ID Scenario Expected Behavior Current Status
task-001 "Run npm install" Ask approval → Execute bash → Report ✅ Have this
task-002 "Create hello.ts file" Ask approval → Load code.md → Write → Report ✅ Have similar
task-003 "List files in current dir" Ask approval → Run ls → Report ❌ Need to add

Key Rules:

  • Approval required (Line 64-66)
  • Context loading for code/docs (Line 162-193)

Category 3: Context Loading Compliance

Workflow: Analyze → Approve → Load Context → Execute → Validate

Test ID Scenario Expected Behavior Current Status
ctx-001 "Write a React component" Approve → Load code.md → Write → Report ❌ Need to add
ctx-002 "Update README.md" Approve → Load docs.md → Edit → Report ❌ Need to add
ctx-003 "Add unit test" Approve → Load tests.md → Write → Report ❌ Need to add
ctx-004 "Run bash command only" Approve → Execute (no context needed) ✅ Have this

Key Rule: Context MUST be loaded before code/docs/tests (Line 41-44, 162-193)


Category 4: Stop on Failure

Workflow: Execute → Validate → Stop on Error → Report → Propose → Approve → Fix

Test ID Scenario Expected Behavior Current Status
fail-001 "Run tests" (tests fail) Execute → STOP → Report error → Propose fix → Wait ❌ Need to add
fail-002 "Build project" (build fails) Execute → STOP → Report → Propose → Wait ❌ Need to add
fail-003 "Run linter" (errors found) Execute → STOP → Report → Don't auto-fix ❌ Need to add

Key Rules:

  • Stop on failure (Line 68-70)
  • Report → Propose → Approve → Fix (Line 71-73)
  • NEVER auto-fix

Category 5: Edge Cases

Workflow: Handle special cases correctly

Test ID Scenario Expected Behavior Current Status
edge-001 "Just do it, create file" Skip approval (user override) → Execute ✅ Have this
edge-002 "Delete temp files" Ask cleanup confirmation → Delete ❌ Need to add
edge-003 "What files are here?" Needs bash (ls) → Ask approval ❌ Need to add

Key Rules:

  • "Just do it" bypasses approval (user override)
  • Cleanup requires confirmation (Line 74-76)
  • "What files?" needs bash → requires approval (Line 119-123)

Simplified Test Coverage Matrix

Workflow Stage Rule Being Tested # Tests Needed # Tests Have Gap
Analyze Conversational vs Task path 3 1 2
Approve Approval gate enforcement 3 2 1
Execute → Load Context Context loading compliance 4 0 4
Execute → Route Delegation (future) 0 0 0
Validate Stop on failure 3 0 3
Confirm Cleanup confirmation 1 0 1
Edge Cases Special handling 3 1 2

Total: 17 tests needed, 4 tests have, 13 gap


Phase 1: Essential Tests (Start Here)

Focus on the most critical workflows first:

Week 1: Core Workflow Compliance (5 tests)

  1. task-simple-001 - Simple bash execution

    • Prompt: "Run npm install"
    • Expected: Approve → Execute → Report
    • Tests: Approval gate
  2. ctx-code-001 - Code with context loading

    • Prompt: "Create a simple TypeScript function"
    • Expected: Approve → Load code.md → Write → Report
    • Tests: Context loading for code
  3. ctx-docs-001 - Docs with context loading

    • Prompt: "Update the README with installation steps"
    • Expected: Approve → Load docs.md → Edit → Report
    • Tests: Context loading for docs
  4. fail-stop-001 - Stop on test failure

    • Prompt: "Run the test suite" (with failing tests)
    • Expected: Execute → STOP → Report → Don't auto-fix
    • Tests: Stop on failure rule
  5. conv-simple-001 - Conversational (no approval)

    • Prompt: "What does the main function do?"
    • Expected: Read → Answer (no approval needed)
    • Tests: Conversational path detection

Why these 5?

  • Cover all critical rules (approval, context, stop-on-failure)
  • Cover both paths (conversational vs task)
  • Simple to implement
  • High value for validation

Test Design Template (Keep It Simple)

id: test-id-001
name: Human-readable test name
description: What workflow we're testing

category: developer  # or business, creative, edge-case
prompt: "The exact prompt to send"

# What should the agent do?
behavior:
  mustUseTools: [bash]           # Required tools
  requiresApproval: true         # Must ask first?
  requiresContext: false         # Must load context?

# What rules should NOT be violated?
expectedViolations:
  - rule: approval-gate
    shouldViolate: false         # Should NOT violate
    severity: error

approvalStrategy:
  type: auto-approve             # or auto-deny, smart

timeout: 60000
tags:
  - approval-gate
  - workflow-validation

Success Criteria (Simple)

For each test, we check:

  1. Did the agent follow the workflow stages?

    • Analyze → Approve → Execute → Validate → Summarize
  2. Did the agent ask for approval when required?

    • Task path → Must ask
    • Conversational path → No approval needed
  3. Did the agent load context when required?

    • Code task → Must load code.md
    • Docs task → Must load docs.md
    • Bash-only → No context needed
  4. Did the agent stop on failure?

    • Test fails → STOP → Report → Don't auto-fix
  5. Did the agent handle edge cases correctly?

    • "Just do it" → Skip approval
    • Cleanup → Ask confirmation

What We're NOT Testing (Keep It Simple)

Not testing (for now):

  • Multi-agent coordination (too complex)
  • Semantic quality of responses (need LLM-as-judge)
  • Performance/latency metrics
  • Token usage optimization
  • Production monitoring
  • Canary deployments

Only testing:

  • Workflow compliance (does it follow the stages?)
  • Rule enforcement (does it follow the critical rules?)
  • Behavior validation (does it do what openagent.md says?)

Implementation Plan

Step 1: Define Test Scenarios ✅ (This document)

  • Map workflows to test cases
  • Identify gaps in current coverage
  • Prioritize essential tests

Step 2: Create 5 Essential Tests (Next)

  • Write YAML test cases
  • Use existing v2 schema
  • Keep prompts simple and clear

Step 3: Run Tests & Validate (After Step 2)

  • Run with free model (no costs)
  • Check evaluator results
  • Fix any issues

Step 4: Expand Coverage (Future)

  • Add remaining 8 tests
  • Cover all workflow stages
  • Add more edge cases

Current Test Inventory

What we have (6 tests):

  1. biz-data-analysis-001 - Business analysis (conversational)
  2. dev-create-component-001 - Create React component
  3. dev-install-deps-002 - Install dependencies (v2 schema)
  4. dev-install-deps-001 - Install dependencies (v1 schema)
  5. edge-just-do-it-001 - "Just do it" bypass
  6. neg-no-approval-001 - Negative test (should violate)

What we need (5 essential tests):

  1. task-simple-001 - Simple bash execution
  2. ctx-code-001 - Code with context loading
  3. ctx-docs-001 - Docs with context loading
  4. fail-stop-001 - Stop on test failure
  5. conv-simple-001 - Conversational (no approval)

Gap: 5 tests to add for complete workflow coverage


Next Steps

  1. Review this plan - Does it make sense? Too simple? Too complex?
  2. Create 5 essential tests - Start with the core workflows
  3. Run tests - Validate with free model
  4. Iterate - Fix issues, refine tests
  5. Expand - Add remaining tests once core is solid

Keep it simple. Test workflows. Validate behavior. Build confidence.


Questions to Answer Before Proceeding

  1. ✅ Are these the right workflows to test?
  2. ✅ Are the 5 essential tests the right starting point?
  3. ✅ Is the test design template clear enough?
  4. ✅ Should we add/remove any test categories?
  5. ✅ Ready to create the 5 essential tests?