@@ -0,0 +1,120 @@
+# Testing System Confidence Assessment
+
+## Current State: Honest Evaluation
+
+### What Works Well ✅
+
+| Feature | Opencoder | OpenAgent | Notes |
+|---------|-----------|-----------|-------|
+| Agent Selection | ✅ Verified | ✅ Verified | Both agents correctly identified |
+| Single Tool Calls | ✅ Works | ✅ Works | list, read, glob, bash all captured |
+| Multi-Tool Chains | ✅ Works | ⚠️ Partial | glob→read works, but approval blocks chains |
+| Event Capture | ✅ 18-56 events | ✅ 18-29 events | Real-time streaming works |
+| Tool Verification | ✅ Accurate | ✅ Accurate | Tool names and inputs captured |
+| File Cleanup | ✅ Works | ✅ Works | test_tmp/ cleaned before/after |
+
+### What Needs Work ⚠️
+
+#### 1. OpenAgent Approval Workflow Issue
+
+**Problem**: OpenAgent reads the context file, then **stops and waits for text approval** before executing write/edit tools.
+
+**Evidence**:
+```
+Tool Call Details:
+ 1. read: {"filePath":".opencode/context/core/standards/code.md"}
+
+Violations:
+ - missing-required-tool: Required tool 'write' was not used
+```
+
+**Root Cause**: OpenAgent's system prompt requires text-based approval before execution, and single-prompt tests never supply it.
+
+**Solution Options**:
+1. ✅ Use multi-turn prompts (already implemented for task-simple-001)
+2. ⚠️ Still to do: convert every openagent test that expects write/edit to multi-turn prompts
+
+#### 2. Tool Flexibility
+
+**Problem**: Agents sometimes call the `list` tool where a test expects `bash` running `ls`, so a test that requires one specific tool can fail on correct behavior.
+
+**Solution**: ✅ Fixed with `mustUseAnyOf`, which lets a test accept any of several alternative tools.
+
+#### 3. Approval Count Always 0
+
+**Observation**: The report shows `Approvals given: 0` even when tools execute.
+
+**Reason**: `permission.request` events cover tool-level permissions (for example, dangerous commands), not text-based approval. OpenAgent's text approval happens in the conversation itself, so this counter does not reflect it.
+
+### Confidence Levels
+
+| Test Type | Confidence | Reason |
+|-----------|------------|--------|
+| **Opencoder - Read Operations** | 🟢 HIGH | Works perfectly, verified |
+| **Opencoder - Multi-tool Chains** | 🟢 HIGH | glob→read verified |
+| **Opencoder - Bash/List** | 🟢 HIGH | Both tools work |
+| **OpenAgent - Read Operations** | 🟢 HIGH | Context loading verified |
+| **OpenAgent - Multi-turn Approval** | 🟡 MEDIUM | Works but needs more testing |
+| **OpenAgent - Write/Edit** | 🔴 LOW | Blocked by approval workflow |
+| **OpenAgent - Context→Write Chain** | 🔴 LOW | Stops after context read |
+
+### Tests That Need Multi-Turn Updates
+
+These openagent tests expect write/edit. All except `ctx-multi-turn-001.yaml` still use a single prompt:
+
+1. `ctx-code-001.yaml` - Expects read→write
+2. `ctx-code-001-claude.yaml` - Expects read→write
+3. `ctx-docs-001.yaml` - Expects read→edit
+4. `ctx-tests-001.yaml` - Expects read→write
+5. `ctx-multi-turn-001.yaml` - Already multi-turn ✅
+6. `create-component.yaml` - Expects write
+
+### Recommended Actions
+
+#### Immediate (High Priority)
+
+1. **Update openagent write/edit tests to multi-turn**:
+   ```yaml
+   prompts:
+     - text: "Create a file..."
+     - text: "Yes, proceed"
+       delayMs: 2000
+   ```
+
+2. **Add `mustUseAnyOf` where tools are interchangeable**:
+   ```yaml
+   behavior:
+     mustUseAnyOf: [[bash], [list]]
+   ```
+
+#### Future Improvements
+
+1. **Add text content verification** - Check that the agent's text output contains expected phrases
+2. **Add timing verification** - Ensure the context file is read BEFORE any write/edit tool executes
+3. **Add file creation verification** - Check test_tmp/ for expected files
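+
+One way the three proposed checks might be expressed in a test file is sketched below. Only `behavior` appears in the existing tests; `mustContainText`, `mustUseInOrder`, and `expectFiles` are hypothetical names chosen for illustration, not options the framework supports today, and the phrase and file path are placeholders.
+
+```yaml
+behavior:
+  # Hypothetical key: phrases that must appear in the agent's text output
+  mustContainText:
+    - "placeholder phrase"
+  # Hypothetical key: tools that must be called in this relative order
+  mustUseInOrder: [read, write]
+  # Hypothetical key: files that must exist under test_tmp/ after the run
+  expectFiles:
+    - test_tmp/placeholder.txt
+```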
+
+### Multi-Step Workflow Testing
+
+#### What We CAN Test Now
+
+1. **Read chains**: glob → read (verified ✅)
+2. **Context loading**: read context file (verified ✅)
+3. **Multi-turn conversations**: prompt → approval → execute (verified ✅)
+
+#### What We CANNOT Test Yet
+
+1. **Full write workflows**: blocked for openagent until its write tests are converted to multi-turn
+2. **Edit workflows**: blocked for openagent until its edit tests are converted to multi-turn
+3. **Delegation chains**: task tool → subagent (not tested)
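+
+A first delegation test could start from a sketch like the one below. It reuses the `prompts` and `behavior.mustUseAnyOf` keys from the recommended actions; the prompt wording is a placeholder, and it assumes a delegation is recorded as a call to a tool named `task`, which has not been verified. For openagent, a second approval prompt may also be needed, as with write/edit.
+
+```yaml
+prompts:
+  - text: "Delegate this work to a subagent..."
+behavior:
+  # Assumption: a delegation shows up in the event stream as a `task` tool call
+  mustUseAnyOf: [[task]]
+```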
+
+### Summary
+
+| Agent | Simple Tasks | Multi-Step | Write/Edit | Confidence |
+|-------|--------------|------------|------------|------------|
+| **Opencoder** | ✅ | ✅ | ✅ | 🟢 HIGH |
+| **OpenAgent** | ✅ | ⚠️ | ❌ | 🟡 MEDIUM |
+
+**Bottom Line**:
+- Opencoder tests are reliable and working
+- OpenAgent tests need multi-turn prompts for write/edit operations
+- The framework itself is solid; the openagent test cases are what need updating