Simple Test Plan - OpenAgent Workflow Validation

Goal: Validate that OpenAgent follows the workflows defined in openagent.md
Approach: Keep it simple - test one workflow at a time
Focus: Behavior compliance, not complexity

Core Workflows to Test (from openagent.md)

Workflow Stages (Lines 147-242)

Stage 1: Analyze    → Assess request type
Stage 2: Approve    → Request approval (if task path)
Stage 3: Execute    → Load context → Route → Run
Stage 4: Validate   → Check quality → Stop on failure
Stage 5: Summarize  → Report results
Stage 6: Confirm    → Cleanup confirmation

Test Scenarios (Simple & Focused)

Category 1: Conversational Path (No Execution)

Workflow: Analyze → Answer directly (skip approval)

Test ID	Scenario	Expected Behavior	Current Status
`conv-001`	"What does this code do?"	Read file → Answer (no approval)	✅ Have similar test
`conv-002`	"How do I use git rebase?"	Answer directly (no tools)	❌ Need to add
`conv-003`	"Explain this error message"	Analyze → Answer (no approval)	❌ Need to add

Key Rule: No approval needed for pure questions (Line 136-139)

Category 2: Task Path - Simple Execution

Workflow: Analyze → Approve → Execute → Validate → Summarize

Test ID	Scenario	Expected Behavior	Current Status
`task-001`	"Run npm install"	Ask approval → Execute bash → Report	✅ Have this
`task-002`	"Create hello.ts file"	Ask approval → Load code.md → Write → Report	✅ Have similar
`task-003`	"List files in current dir"	Ask approval → Run ls → Report	❌ Need to add

Key Rules:

Approval required (Line 64-66)
Context loading for code/docs (Line 162-193)

Category 3: Context Loading Compliance

Workflow: Analyze → Approve → Load Context → Execute → Validate

Test ID	Scenario	Expected Behavior	Current Status
`ctx-001`	"Write a React component"	Approve → Load code.md → Write → Report	❌ Need to add
`ctx-002`	"Update README.md"	Approve → Load docs.md → Edit → Report	❌ Need to add
`ctx-003`	"Add unit test"	Approve → Load tests.md → Write → Report	❌ Need to add
`ctx-004`	"Run bash command only"	Approve → Execute (no context needed)	✅ Have this

Key Rule: Context MUST be loaded before code/docs/tests (Line 41-44, 162-193)

Category 4: Stop on Failure

Workflow: Execute → Validate → Stop on Error → Report → Propose → Approve → Fix

Test ID	Scenario	Expected Behavior	Current Status
`fail-001`	"Run tests" (tests fail)	Execute → STOP → Report error → Propose fix → Wait	❌ Need to add
`fail-002`	"Build project" (build fails)	Execute → STOP → Report → Propose → Wait	❌ Need to add
`fail-003`	"Run linter" (errors found)	Execute → STOP → Report → Don't auto-fix	❌ Need to add

Key Rules:

Stop on failure (Line 68-70)
Report → Propose → Approve → Fix (Line 71-73)
NEVER auto-fix

Category 5: Edge Cases

Workflow: Handle special cases correctly

Test ID	Scenario	Expected Behavior	Current Status
`edge-001`	"Just do it, create file"	Skip approval (user override) → Execute	✅ Have this
`edge-002`	"Delete temp files"	Ask cleanup confirmation → Delete	❌ Need to add
`edge-003`	"What files are here?"	Needs bash (ls) → Ask approval	❌ Need to add

Key Rules:

"Just do it" bypasses approval (user override)
Cleanup requires confirmation (Line 74-76)
"What files?" needs bash → requires approval (Line 119-123)

Simplified Test Coverage Matrix

Workflow Stage	Rule Being Tested	# Tests Needed	# Tests Have	Gap
Analyze	Conversational vs Task path	3	1	2
Approve	Approval gate enforcement	3	2	1
Execute → Load Context	Context loading compliance	4	0	4
Execute → Route	Delegation (future)	0	0	0
Validate	Stop on failure	3	0	3
Confirm	Cleanup confirmation	1	0	1
Edge Cases	Special handling	3	1	2

Total: 17 tests needed, 4 tests have, 13 gap

Phase 1: Essential Tests (Start Here)

Focus on the most critical workflows first:

Week 1: Core Workflow Compliance (5 tests)

task-simple-001 - Simple bash execution
- Prompt: "Run npm install"
- Expected: Approve → Execute → Report
- Tests: Approval gate
ctx-code-001 - Code with context loading
- Prompt: "Create a simple TypeScript function"
- Expected: Approve → Load code.md → Write → Report
- Tests: Context loading for code
ctx-docs-001 - Docs with context loading
- Prompt: "Update the README with installation steps"
- Expected: Approve → Load docs.md → Edit → Report
- Tests: Context loading for docs
fail-stop-001 - Stop on test failure
- Prompt: "Run the test suite" (with failing tests)
- Expected: Execute → STOP → Report → Don't auto-fix
- Tests: Stop on failure rule
conv-simple-001 - Conversational (no approval)
- Prompt: "What does the main function do?"
- Expected: Read → Answer (no approval needed)
- Tests: Conversational path detection

Why these 5?

Cover all critical rules (approval, context, stop-on-failure)
Cover both paths (conversational vs task)
Simple to implement
High value for validation

Test Design Template (Keep It Simple)

id: test-id-001
name: Human-readable test name
description: What workflow we're testing

category: developer  # or business, creative, edge-case
prompt: "The exact prompt to send"

# What should the agent do?
behavior:
  mustUseTools: [bash]           # Required tools
  requiresApproval: true         # Must ask first?
  requiresContext: false         # Must load context?

# What rules should NOT be violated?
expectedViolations:
  - rule: approval-gate
    shouldViolate: false         # Should NOT violate
    severity: error

approvalStrategy:
  type: auto-approve             # or auto-deny, smart

timeout: 60000
tags:
  - approval-gate
  - workflow-validation

Success Criteria (Simple)

For each test, we check:

✅ Did the agent follow the workflow stages?
- Analyze → Approve → Execute → Validate → Summarize
✅ Did the agent ask for approval when required?
- Task path → Must ask
- Conversational path → No approval needed
✅ Did the agent load context when required?
- Code task → Must load code.md
- Docs task → Must load docs.md
- Bash-only → No context needed
✅ Did the agent stop on failure?
- Test fails → STOP → Report → Don't auto-fix
✅ Did the agent handle edge cases correctly?
- "Just do it" → Skip approval
- Cleanup → Ask confirmation

What We're NOT Testing (Keep It Simple)

❌ Not testing (for now):

Multi-agent coordination (too complex)
Semantic quality of responses (need LLM-as-judge)
Performance/latency metrics
Token usage optimization
Production monitoring
Canary deployments

✅ Only testing:

Workflow compliance (does it follow the stages?)
Rule enforcement (does it follow the critical rules?)
Behavior validation (does it do what openagent.md says?)

Implementation Plan

Step 1: Define Test Scenarios ✅ (This document)

Map workflows to test cases
Identify gaps in current coverage
Prioritize essential tests

Step 2: Create 5 Essential Tests (Next)

Write YAML test cases
Use existing v2 schema
Keep prompts simple and clear

Step 3: Run Tests & Validate (After Step 2)

Run with free model (no costs)
Check evaluator results
Fix any issues

Step 4: Expand Coverage (Future)

Add remaining 8 tests
Cover all workflow stages
Add more edge cases

Current Test Inventory

What we have (6 tests):

✅ biz-data-analysis-001 - Business analysis (conversational)
✅ dev-create-component-001 - Create React component
✅ dev-install-deps-002 - Install dependencies (v2 schema)
✅ dev-install-deps-001 - Install dependencies (v1 schema)
✅ edge-just-do-it-001 - "Just do it" bypass
✅ neg-no-approval-001 - Negative test (should violate)

What we need (5 essential tests):

❌ task-simple-001 - Simple bash execution
❌ ctx-code-001 - Code with context loading
❌ ctx-docs-001 - Docs with context loading
❌ fail-stop-001 - Stop on test failure
❌ conv-simple-001 - Conversational (no approval)

Gap: 5 tests to add for complete workflow coverage

Next Steps

Review this plan - Does it make sense? Too simple? Too complex?
Create 5 essential tests - Start with the core workflows
Run tests - Validate with free model
Iterate - Fix issues, refine tests
Expand - Add remaining tests once core is solid

Keep it simple. Test workflows. Validate behavior. Build confidence.

Questions to Answer Before Proceeding

✅ Are these the right workflows to test?
✅ Are the 5 essential tests the right starting point?
✅ Is the test design template clear enough?
✅ Should we add/remove any test categories?
✅ Ready to create the 5 essential tests?

SIMPLE_TEST_PLAN.md 9.7 KB History Raw

Simple Test Plan - OpenAgent Workflow Validation

Core Workflows to Test (from openagent.md)

Workflow Stages (Lines 147-242)

Test Scenarios (Simple & Focused)

Category 1: Conversational Path (No Execution)

Category 2: Task Path - Simple Execution

Category 3: Context Loading Compliance

Category 4: Stop on Failure

Category 5: Edge Cases

Simplified Test Coverage Matrix

Phase 1: Essential Tests (Start Here)

Week 1: Core Workflow Compliance (5 tests)

Test Design Template (Keep It Simple)

Success Criteria (Simple)

What We're NOT Testing (Keep It Simple)

Implementation Plan

Step 1: Define Test Scenarios ✅ (This document)

Step 2: Create 5 Essential Tests (Next)

Step 3: Run Tests & Validate (After Step 2)

Step 4: Expand Coverage (Future)

Current Test Inventory

Next Steps

Questions to Answer Before Proceeding

SIMPLE_TEST_PLAN.md 9.7 KB

History Raw