TESTING_CONFIDENCE.md 4.2 KB

Testing System Confidence Assessment

Current State: Honest Evaluation

What Works Well ✅

Feature Opencoder OpenAgent Notes
Agent Selection ✅ Verified ✅ Verified Both agents correctly identified
Single Tool Calls ✅ Works ✅ Works list, read, glob, bash all captured
Multi-Tool Chains ✅ Works ⚠️ Partial glob→read works, but approval blocks chains
Event Capture ✅ 18-56 events ✅ 18-29 events Real-time streaming works
Tool Verification ✅ Accurate ✅ Accurate Tool names and inputs captured
File Cleanup ✅ Works ✅ Works test_tmp/ cleaned before/after

What Needs Work ⚠️

1. OpenAgent Approval Workflow Issue

Problem: OpenAgent reads context but then stops and waits for text approval before executing write/edit tools.

Evidence:

Tool Call Details:
  1. read: {"filePath":".opencode/context/core/standards/code.md"}
  
Violations:
  - missing-required-tool: Required tool 'write' was not used

Root Cause: OpenAgent's system prompt requires text-based approval before execution. Single-prompt tests don't provide this approval.

Solution Options:

  1. ✅ Use multi-turn prompts (already implemented for task-simple-001)
  2. ⚠️ Need to update ALL openagent tests that expect write/edit to use multi-turn

2. Tool Flexibility

Problem: Agents sometimes use list instead of bash ls.

Solution: ✅ Fixed with mustUseAnyOf - allows alternative tools.

3. Approval Count Always 0

Observation: Approvals given: 0 even when tools execute.

Reason: The permission.request events are for tool-level permissions (dangerous commands), not text-based approval. OpenAgent's text approval is different.

Confidence Levels

Test Type Confidence Reason
Opencoder - Read Operations 🟢 HIGH Works perfectly, verified
Opencoder - Multi-tool Chains 🟢 HIGH glob→read verified
Opencoder - Bash/List 🟢 HIGH Both tools work
OpenAgent - Read Operations 🟢 HIGH Context loading verified
OpenAgent - Multi-turn Approval 🟡 MEDIUM Works but needs more testing
OpenAgent - Write/Edit 🔴 LOW Blocked by approval workflow
OpenAgent - Context→Write Chain 🔴 LOW Stops after context read

Tests That Need Multi-Turn Updates

These openagent tests expect write/edit but use single prompts:

  1. ctx-code-001.yaml - Expects read→write
  2. ctx-code-001-claude.yaml - Expects read→write
  3. ctx-docs-001.yaml - Expects read→edit
  4. ctx-tests-001.yaml - Expects read→write
  5. ctx-multi-turn-001.yaml - Already multi-turn ✅
  6. create-component.yaml - Expects write

Recommended Actions

Immediate (High Priority)

  1. Update openagent write/edit tests to multi-turn: ```yaml prompts:

    • text: "Create a file..."
    • text: "Yes, proceed" delayMs: 2000 ```
  2. Add mustUseAnyOf where tools are interchangeable:

    behavior:
     mustUseAnyOf: [[bash], [list]]
    

Future Improvements

  1. Add text content verification - Check agent's text output contains expected phrases
  2. Add timing verification - Ensure context loaded BEFORE execution
  3. Add file creation verification - Check test_tmp/ for expected files

Multi-Step Workflow Testing

What We CAN Test Now

  1. Read chains: glob → read (verified ✅)
  2. Context loading: read context file (verified ✅)
  3. Multi-turn conversations: prompt → approval → execute (verified ✅)

What We CANNOT Test Yet

  1. Full write workflows: Need multi-turn for openagent
  2. Edit workflows: Need multi-turn for openagent
  3. Delegation chains: task tool → subagent (not tested)

Summary

Agent Simple Tasks Multi-Step Write/Edit Confidence
Opencoder 🟢 HIGH
OpenAgent ⚠️ 🟡 MEDIUM

Bottom Line:

  • Opencoder tests are reliable and working
  • OpenAgent tests need multi-turn prompts for write/edit operations
  • The framework itself is solid, but test cases need updating