4 months ago · f872007919
--- a/evals/TESTING_CONFIDENCE.md
+++ b/evals/TESTING_CONFIDENCE.md
@@ -0,0 +1,120 @@
 
				+# Testing System Confidence Assessment
			
 
				+
			
 
				+## Current State: Honest Evaluation
			
 
				+
			
 
				+### What Works Well ✅
			
 
				+
			
 
				+| Feature | Opencoder | OpenAgent | Notes |
			
 
				+|---------|-----------|-----------|-------|
			
 
				+| Agent Selection | ✅ Verified | ✅ Verified | Both agents correctly identified |
			
 
				+| Single Tool Calls | ✅ Works | ✅ Works | list, read, glob, bash all captured |
			
 
				+| Multi-Tool Chains | ✅ Works | ⚠️ Partial | glob→read works, but approval blocks chains |
			
 
				+| Event Capture | ✅ 18-56 events | ✅ 18-29 events | Real-time streaming works |
			
 
				+| Tool Verification | ✅ Accurate | ✅ Accurate | Tool names and inputs captured |
			
 
				+| File Cleanup | ✅ Works | ✅ Works | test_tmp/ cleaned before/after |
			
 
				+
			
 
				+### What Needs Work ⚠️
			
 
				+
			
 
				+#### 1. OpenAgent Approval Workflow Issue
			
 
				+
			
 
				+**Problem**: OpenAgent reads context but then **stops and waits for text approval** before executing write/edit tools.
			
 
				+
			
 
				+**Evidence**:
			
 
				+```
			
 
				+Tool Call Details:
			
 
				+  1. read: {"filePath":".opencode/context/core/standards/code.md"}
			
 
				+  
			
 
				+Violations:
			
 
				+  - missing-required-tool: Required tool 'write' was not used
			
 
				+```
			
 
				+
			
 
				+**Root Cause**: OpenAgent's system prompt requires text-based approval before execution. Single-prompt tests never send that approval, so the run ends before any write/edit tool call.
			
 
				+
			
 
				+**Solution Options**:
			
 
				+1. ✅ Use multi-turn prompts (already implemented for task-simple-001)
			
 
				+2. ⚠️ Need to update ALL openagent tests that expect write/edit to use multi-turn
			
 
				+
			
 
				+#### 2. Tool Flexibility
			
 
				+
			
 
				+**Problem**: Agents sometimes use `list` instead of `bash ls`.
			
 
				+
			
 
				+**Solution**: ✅ Fixed with `mustUseAnyOf` - allows alternative tools.
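
To make the `mustUseAnyOf` semantics concrete, here is a minimal sketch of how such a check could be evaluated. It assumes each inner list is a set of tools that must all appear, and that any one set is enough to pass. The function name `satisfies_any_of` is hypothetical and is not taken from the framework's source.

```python
from typing import Iterable, Sequence


def satisfies_any_of(
    alternatives: Sequence[Sequence[str]], used_tools: Iterable[str]
) -> bool:
    """Return True if at least one alternative tool set was fully used.

    Assumption: every tool in an inner list must appear among the tools
    the agent called; the order of the calls is ignored.
    """
    used = set(used_tools)
    return any(all(tool in used for tool in option) for option in alternatives)


# Mirrors `mustUseAnyOf: [[bash], [list]]`: either tool is acceptable.
list_or_bash = [["bash"], ["list"]]
print(satisfies_any_of(list_or_bash, ["list"]))  # True
print(satisfies_any_of(list_or_bash, ["read"]))  # False
```

Under the same assumption, `[[read, write], [read, edit]]` would pass for a run that called `read` and `edit`, but not for a run that only called `read`.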
			
 
				+
			
 
				+#### 3. Approval Count Always 0
			
 
				+
			
 
				+**Observation**: `Approvals given: 0` even when tools execute.
			
 
				+
			
 
				+**Reason**: `permission.request` events cover tool-level permissions (for example, dangerous commands). OpenAgent's text-based approval happens in the conversation itself, so it is not counted as an approval.
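
The difference between the two approval mechanisms can be sketched as two separate checks over the captured events. This is only an illustration: the event dictionaries, the `message` and `tool.call` type names, and both function names are assumptions. Only `permission.request` comes from the observation above.

```python
from typing import Iterable, Mapping, Sequence


def count_permission_requests(events: Iterable[Mapping[str, object]]) -> int:
    """Count tool-level permission events (what `Approvals given` reflects)."""
    return sum(1 for event in events if event.get("type") == "permission.request")


def asks_for_text_approval(
    events: Iterable[Mapping[str, object]], phrases: Sequence[str]
) -> bool:
    """Detect a text-based approval request by scanning message text.

    Assumption: assistant messages arrive as events of type "message"
    with a "text" field. Matching is case-insensitive.
    """
    lowered = [phrase.lower() for phrase in phrases]
    for event in events:
        if event.get("type") != "message":
            continue
        text = str(event.get("text", "")).lower()
        if any(phrase in text for phrase in lowered):
            return True
    return False


# Hypothetical capture: the agent read context, then asked in plain text.
events = [
    {"type": "tool.call", "tool": "read"},
    {"type": "message", "text": "Here is my plan. Should I proceed?"},
]
print(count_permission_requests(events))             # 0
print(asks_for_text_approval(events, ["proceed"]))   # True
```

A run like this one would report zero approvals even though the agent clearly asked for one, which matches the observation.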
			
 
				+
			
 
				+### Confidence Levels
			
 
				+
			
 
				+| Test Type | Confidence | Reason |
			
 
				+|-----------|------------|--------|
			
 
				+| **Opencoder - Read Operations** | 🟢 HIGH | Works perfectly, verified |
			
 
				+| **Opencoder - Multi-tool Chains** | 🟢 HIGH | glob→read verified |
			
 
				+| **Opencoder - Bash/List** | 🟢 HIGH | Both tools work |
			
 
				+| **OpenAgent - Read Operations** | 🟢 HIGH | Context loading verified |
			
 
				+| **OpenAgent - Multi-turn Approval** | 🟡 MEDIUM | Works but needs more testing |
			
 
				+| **OpenAgent - Write/Edit** | 🔴 LOW | Blocked by approval workflow |
			
 
				+| **OpenAgent - Context→Write Chain** | 🔴 LOW | Stops after context read |
			
 
				+
			
 
				+### Tests That Need Multi-Turn Updates
			
 
				+
			
 
				+These openagent tests expect write/edit; all except `ctx-multi-turn-001.yaml` still use single prompts:
			
 
				+
			
 
				+1. `ctx-code-001.yaml` - Expects read→write
			
 
				+2. `ctx-code-001-claude.yaml` - Expects read→write
			
 
				+3. `ctx-docs-001.yaml` - Expects read→edit
			
 
				+4. `ctx-tests-001.yaml` - Expects read→write
			
 
				+5. `ctx-multi-turn-001.yaml` - Already multi-turn ✅
			
 
				+6. `create-component.yaml` - Expects write
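
A list like the one above could also be derived mechanically. The sketch below flags a parsed test definition when it has fewer than two prompts but expects `write` or `edit`. It assumes the YAML has already been loaded into a dictionary with the `prompt`, `prompts`, and `behavior` keys seen in these test files. The helper name `needs_multi_turn` is hypothetical.

```python
MUTATING_TOOLS = {"write", "edit"}


def needs_multi_turn(test: dict) -> bool:
    """Return True for single-prompt tests that expect a write or edit.

    Assumption: a test counts as multi-turn when `prompts` has at least
    two entries. A lone `prompt` key means single-turn.
    """
    prompts = test.get("prompts") or []
    single_turn = len(prompts) < 2

    behavior = test.get("behavior") or {}
    expected = set(behavior.get("mustUseTools") or [])
    for option in behavior.get("mustUseAnyOf") or []:
        expected.update(option)

    return single_turn and bool(expected & MUTATING_TOOLS)


single = {
    "prompt": "Create a file...",
    "behavior": {"mustUseTools": ["read", "write"]},
}
multi = {
    "prompts": [
        {"text": "Create a file..."},
        {"text": "Yes, proceed", "delayMs": 2000},
    ],
    "behavior": {"mustUseTools": ["read", "write"]},
}
print(needs_multi_turn(single))  # True
print(needs_multi_turn(multi))   # False
```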
			
 
				+
			
 
				+### Recommended Actions
			
 
				+
			
 
				+#### Immediate (High Priority)
			
 
				+
			
 
				+1. **Update openagent write/edit tests to multi-turn**:
			
 
				+   ```yaml
			
 
				+   prompts:
			
 
				+     - text: "Create a file..."
			
 
				+     - text: "Yes, proceed"
			
 
				+       delayMs: 2000
			
 
				+   ```
			
 
				+
			
 
				+2. **Add `mustUseAnyOf` where tools are interchangeable**:
			
 
				+   ```yaml
			
 
				+   behavior:
			
 
				+     mustUseAnyOf: [[bash], [list]]
			
 
				+   ```
			
 
				+
			
 
				+#### Future Improvements
			
 
				+
			
 
				+1. **Add text content verification** - Check agent's text output contains expected phrases
			
 
				+2. **Add timing verification** - Ensure context loaded BEFORE execution
			
 
				+3. **Add file creation verification** - Check test_tmp/ for expected files
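
The timing check in item 2 amounts to an ordering test over the captured tool calls. A minimal sketch follows, assuming tool calls are available as an ordered list of `(tool_name, input)` pairs and that `read` inputs carry a `filePath` key, as in the evidence shown earlier. The function name `context_loaded_before_write` is hypothetical.

```python
from typing import Mapping, Sequence, Tuple

ToolCall = Tuple[str, Mapping[str, object]]


def context_loaded_before_write(
    tool_calls: Sequence[ToolCall], context_file: str
) -> bool:
    """Check that the context file was read before the first write/edit.

    Returns True when no write/edit occurs at all, since nothing was
    executed before the context could be loaded.
    """
    context_seen = False
    for name, tool_input in tool_calls:
        if name == "read" and tool_input.get("filePath") == context_file:
            context_seen = True
        elif name in ("write", "edit"):
            # The first mutating call decides the outcome.
            return context_seen
    return True


CONTEXT = ".opencode/context/core/standards/code.md"
good_run = [
    ("read", {"filePath": CONTEXT}),
    ("write", {"filePath": "evals/test_tmp/math.ts"}),
]
bad_run = list(reversed(good_run))
print(context_loaded_before_write(good_run, CONTEXT))  # True
print(context_loaded_before_write(bad_run, CONTEXT))   # False
```

Treating a run without any write/edit as passing keeps this check independent of the `missing-required-tool` violation, which already covers that case.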
			
 
				+
			
 
				+### Multi-Step Workflow Testing
			
 
				+
			
 
				+#### What We CAN Test Now
			
 
				+
			
 
				+1. **Read chains**: glob → read (verified ✅)
			
 
				+2. **Context loading**: read context file (verified ✅)
			
 
				+3. **Multi-turn conversations**: prompt → approval → execute (verified ✅)
			
 
				+
			
 
				+#### What We CANNOT Test Yet
			
 
				+
			
 
				+1. **Full write workflows**: blocked for openagent until its write tests are converted to multi-turn
			
 
				+2. **Edit workflows**: blocked for openagent until its edit tests are converted to multi-turn
			
 
				+3. **Delegation chains**: task tool → subagent (not tested)
			
 
				+
			
 
				+### Summary
			
 
				+
			
 
				+| Agent | Simple Tasks | Multi-Step | Write/Edit | Confidence |
			
 
				+|-------|--------------|------------|------------|------------|
			
 
				+| **Opencoder** | ✅ | ✅ | ✅ | 🟢 HIGH |
			
 
				+| **OpenAgent** | ✅ | ⚠️ | ❌ | 🟡 MEDIUM |
			
 
				+
			
 
				+**Bottom Line**: 
			
 
				+- Opencoder tests are reliable and working
			
 
				+- OpenAgent tests need multi-turn prompts for write/edit operations
			
 
				+- The framework itself is solid, but test cases need updating
			
--- a/evals/agents/openagent/tests/developer/ctx-code-001.yaml
+++ b/evals/agents/openagent/tests/developer/ctx-code-001.yaml
@@ -9,10 +9,19 @@ description: |
 
				 
			
 
				 category: developer
			
 
				 agent: openagent
			
 
				+model: anthropic/claude-sonnet-4-5
			
 
				 
			
 
				-prompt: |
			
 
				-  Create a simple TypeScript function called 'add' that takes two numbers and returns their sum.
			
 
				-  Save it to evals/test_tmp/math.ts
			
 
				+# Multi-turn: OpenAgent requires text approval before writing
			
 
				+prompts:
			
 
				+  - text: |
			
 
				+      Create a simple TypeScript function called 'add' that takes two numbers and returns their sum.
			
 
				+      Save it to evals/test_tmp/math.ts
			
 
				+    expectContext: true
			
 
				+    contextFile: ".opencode/context/core/standards/code.md"
			
 
				+  
			
 
				+  - text: |
			
 
				+      Yes, proceed with the plan. Execute it now.
			
 
				+    delayMs: 2000
			
 
				 
			
 
				 # Expected behavior
			
 
				 behavior:
			
@@ -37,7 +46,7 @@ expectedViolations:
 
				 approvalStrategy:
			
 
				   type: auto-approve
			
 
				 
			
 
				-timeout: 60000
			
 
				+timeout: 120000
			
 
				 
			
 
				 tags:
			
 
				   - workflow-validation
			
--- a/evals/agents/openagent/tests/developer/ctx-docs-001.yaml
+++ b/evals/agents/openagent/tests/developer/ctx-docs-001.yaml
@@ -9,17 +9,26 @@ description: |
 
				 
			
 
				 category: developer
			
 
				 agent: openagent
			
 
				+model: anthropic/claude-sonnet-4-5
			
 
				 
			
 
				-prompt: |
			
 
				-  Create a README.md file at evals/test_tmp/README.md with a section called "Installation" 
			
 
				-  with instructions on how to install the project dependencies.
			
 
				+# Multi-turn: OpenAgent requires text approval before writing
			
 
				+prompts:
			
 
				+  - text: |
			
 
				+      Create a README.md file at evals/test_tmp/README.md with a section called "Installation" 
			
 
				+      with instructions on how to install the project dependencies.
			
 
				+    expectContext: true
			
 
				+    contextFile: ".opencode/context/core/standards/docs.md"
			
 
				+  
			
 
				+  - text: |
			
 
				+      Yes, proceed with the plan. Execute it now.
			
 
				+    delayMs: 2000
			
 
				 
			
 
				 # Expected behavior
			
 
				 behavior:
			
 
				-  mustUseTools: [read, edit]   # Must read context + README, then edit
			
 
				+  mustUseAnyOf: [[read, write], [read, edit]]  # May use write or edit
			
 
				   requiresApproval: true
			
 
				   requiresContext: true         # MUST load docs.md before editing
			
 
				-  minToolCalls: 2               # At least: read context + edit file
			
 
				+  minToolCalls: 2               # At least: read context + write/edit file
			
 
				 
			
 
				 # Expected violations
			
 
				 expectedViolations:
			
@@ -37,7 +46,7 @@ expectedViolations:
 
				 approvalStrategy:
			
 
				   type: auto-approve
			
 
				 
			
 
				-timeout: 60000
			
 
				+timeout: 120000
			
 
				 
			
 
				 tags:
			
 
				   - workflow-validation
			
--- a/evals/agents/openagent/tests/developer/ctx-tests-001.yaml
+++ b/evals/agents/openagent/tests/developer/ctx-tests-001.yaml
@@ -9,10 +9,19 @@ description: |
 
				 
			
 
				 category: developer
			
 
				 agent: openagent
			
 
				+model: anthropic/claude-sonnet-4-5
			
 
				 
			
 
				-prompt: |
			
 
				-  Write a test for the add function in evals/test_tmp/math.ts.
			
 
				-  Create the test file at evals/test_tmp/math.test.ts
			
 
				+# Multi-turn: OpenAgent requires text approval before writing
			
 
				+prompts:
			
 
				+  - text: |
			
 
				+      Write a test for the add function in evals/test_tmp/math.ts.
			
 
				+      Create the test file at evals/test_tmp/math.test.ts
			
 
				+    expectContext: true
			
 
				+    contextFile: ".opencode/context/core/standards/tests.md"
			
 
				+  
			
 
				+  - text: |
			
 
				+      Yes, proceed with the plan. Execute it now.
			
 
				+    delayMs: 2000
			
 
				 
			
 
				 # Expected behavior
			
 
				 behavior:
			
@@ -37,7 +46,7 @@ expectedViolations:
 
				 approvalStrategy:
			
 
				   type: auto-approve
			
 
				 
			
 
				-timeout: 60000
			
 
				+timeout: 120000
			
 
				 
			
 
				 tags:
			
 
				   - workflow-validation
			
--- a/evals/agents/openagent/tests/developer/task-simple-001.yaml
+++ b/evals/agents/openagent/tests/developer/task-simple-001.yaml
@@ -25,7 +25,7 @@ prompts:
 
				 
			
 
				 # Expected behavior after approval
			
 
				 behavior:
			
 
				-  mustUseTools: [bash]
			
 
				+  mustUseAnyOf: [[bash], [list]]  # Agent may use list instead of bash
			
 
				   minToolCalls: 1
			
 
				   # First response should contain approval request
			
 
				   shouldContainInAnyMessage: