fix(evals): update openagent tests to use multi-turn prompts for write operations

OpenAgent requires text-based approval before executing write/edit tools.
Updated tests to use multi-turn prompts:
1. First prompt: Request the task
2. Second prompt: 'Yes, proceed' to approve

Also added TESTING_CONFIDENCE.md documenting current test reliability:
- Opencoder: HIGH confidence (4/4 tests passing)
- OpenAgent: MEDIUM confidence (multi-turn works, context loading needs verification)
darrenhinde 4 months ago
parent
commit
f872007919

+ 120 - 0
evals/TESTING_CONFIDENCE.md

@@ -0,0 +1,120 @@
+# Testing System Confidence Assessment
+
+## Current State: Honest Evaluation
+
+### What Works Well ✅
+
+| Feature | Opencoder | OpenAgent | Notes |
+|---------|-----------|-----------|-------|
+| Agent Selection | ✅ Verified | ✅ Verified | Both agents correctly identified |
+| Single Tool Calls | ✅ Works | ✅ Works | list, read, glob, bash all captured |
+| Multi-Tool Chains | ✅ Works | ⚠️ Partial | glob→read works, but the approval step blocks chains that include write/edit |
+| Event Capture | ✅ 18-56 events | ✅ 18-29 events | Real-time streaming works |
+| Tool Verification | ✅ Accurate | ✅ Accurate | Tool names and inputs captured |
+| File Cleanup | ✅ Works | ✅ Works | test_tmp/ cleaned before/after |
+
+### What Needs Work ⚠️
+
+#### 1. OpenAgent Approval Workflow Issue
+
+**Problem**: OpenAgent reads context but then **stops and waits for text approval** before executing write/edit tools.
+
+**Evidence**:
+```
+Tool Call Details:
+  1. read: {"filePath":".opencode/context/core/standards/code.md"}
+  
+Violations:
+  - missing-required-tool: Required tool 'write' was not used
+```
+
+**Root Cause**: OpenAgent's system prompt requires text-based approval before execution. Single-prompt tests don't provide this approval.
+
+**Solution Options**:
+1. ✅ Use multi-turn prompts (already implemented for task-simple-001)
+2. ⚠️ Still required: update ALL openagent tests that expect write/edit to use multi-turn prompts
+
+#### 2. Tool Flexibility
+
+**Problem**: Agents sometimes use `list` instead of `bash ls`.
+
+**Solution**: ✅ Fixed with `mustUseAnyOf`, which accepts any one of several alternative tool sets.
+
+#### 3. Approval Count Always 0
+
+**Observation**: `Approvals given: 0` even when tools execute.
+
+**Reason**: `permission.request` events cover tool-level permissions (dangerous commands) only. OpenAgent's text-based approval is a separate mechanism and is not counted as a `permission.request` event.
+
+### Confidence Levels
+
+| Test Type | Confidence | Reason |
+|-----------|------------|--------|
+| **Opencoder - Read Operations** | 🟢 HIGH | Works perfectly, verified |
+| **Opencoder - Multi-tool Chains** | 🟢 HIGH | glob→read verified |
+| **Opencoder - Bash/List** | 🟢 HIGH | Both tools work |
+| **OpenAgent - Read Operations** | 🟢 HIGH | Context loading verified |
+| **OpenAgent - Multi-turn Approval** | 🟡 MEDIUM | Works but needs more testing |
+| **OpenAgent - Write/Edit** | 🔴 LOW | Blocked by approval workflow |
+| **OpenAgent - Context→Write Chain** | 🔴 LOW | Stops after context read |
+
+### Tests That Need Multi-Turn Updates
+
+These openagent tests expect write/edit. All except the one marked ✅ still use a single prompt:
+
+1. `ctx-code-001.yaml` - Expects read→write
+2. `ctx-code-001-claude.yaml` - Expects read→write
+3. `ctx-docs-001.yaml` - Expects read→edit
+4. `ctx-tests-001.yaml` - Expects read→write
+5. `ctx-multi-turn-001.yaml` - Already multi-turn ✅
+6. `create-component.yaml` - Expects write
+
+### Recommended Actions
+
+#### Immediate (High Priority)
+
+1. **Update openagent write/edit tests to multi-turn**:
+   ```yaml
+   prompts:
+     - text: "Create a file..."
+     - text: "Yes, proceed"
+       delayMs: 2000
+   ```
+
+2. **Add `mustUseAnyOf` where tools are interchangeable**:
+   ```yaml
+   behavior:
+     mustUseAnyOf: [[bash], [list]]
+   ```
+
+#### Future Improvements
+
+1. **Add text content verification** - Check agent's text output contains expected phrases
+2. **Add timing verification** - Ensure context loaded BEFORE execution
+3. **Add file creation verification** - Check test_tmp/ for expected files
+
+### Multi-Step Workflow Testing
+
+#### What We CAN Test Now
+
+1. **Read chains**: glob → read (verified ✅)
+2. **Context loading**: read context file (verified ✅)
+3. **Multi-turn conversations**: prompt → approval → execute (verified ✅)
+
+#### What We CANNOT Test Yet
+
+1. **Full write workflows**: Blocked for openagent until the tests use multi-turn prompts
+2. **Edit workflows**: Blocked for openagent until the tests use multi-turn prompts
+3. **Delegation chains**: task tool → subagent (not tested)
+
+### Summary
+
+| Agent | Simple Tasks | Multi-Step | Write/Edit | Confidence |
+|-------|--------------|------------|------------|------------|
+| **Opencoder** | ✅ | ✅ | ✅ | 🟢 HIGH |
+| **OpenAgent** | ✅ | ⚠️ | ❌ | 🟡 MEDIUM |
+
+**Bottom Line**: 
+- Opencoder tests are reliable and working
+- OpenAgent tests need multi-turn prompts for write/edit operations
+- The framework itself is solid, but test cases need updating
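
Reviewer note: the multi-turn flow described in this file (first prompt requests the task, second prompt approves after `delayMs`) can be sketched roughly as below. This is a minimal illustration in Python, not the eval runner's actual code: `Prompt`, `run_prompts`, and the `send` callable are hypothetical names, and it assumes `delayMs` is waited out before the prompt it is attached to is sent.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Prompt:
    """One turn of a multi-turn test (hypothetical mirror of the YAML entry)."""
    text: str
    delay_ms: int = 0  # assumed to correspond to `delayMs` in the test YAML


def run_prompts(
    prompts: List[Prompt],
    send: Callable[[str], str],
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Send prompts in order, pausing before any prompt that has a delay.

    `send` stands in for whatever delivers a message to the agent and
    returns its reply; it is an assumption, not an API from this repo.
    """
    replies = []
    for prompt in prompts:
        if prompt.delay_ms > 0:
            # Give the agent time to finish its plan before approving.
            sleep(prompt.delay_ms / 1000)
        replies.append(send(prompt.text))
    return replies
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a real runner would more likely wait for the agent's turn to complete rather than rely on a fixed delay alone.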

+ 13 - 4
evals/agents/openagent/tests/developer/ctx-code-001.yaml

@@ -9,10 +9,19 @@ description: |
 
 category: developer
 agent: openagent
+model: anthropic/claude-sonnet-4-5
 
-prompt: |
-  Create a simple TypeScript function called 'add' that takes two numbers and returns their sum.
-  Save it to evals/test_tmp/math.ts
+# Multi-turn: OpenAgent requires text approval before writing
+prompts:
+  - text: |
+      Create a simple TypeScript function called 'add' that takes two numbers and returns their sum.
+      Save it to evals/test_tmp/math.ts
+    expectContext: true
+    contextFile: ".opencode/context/core/standards/code.md"
+  
+  - text: |
+      Yes, proceed with the plan. Execute it now.
+    delayMs: 2000
 
 # Expected behavior
 behavior:
@@ -37,7 +46,7 @@ expectedViolations:
 approvalStrategy:
   type: auto-approve
 
-timeout: 60000
+timeout: 120000
 
 tags:
   - workflow-validation

+ 15 - 6
evals/agents/openagent/tests/developer/ctx-docs-001.yaml

@@ -9,17 +9,26 @@ description: |
 
 category: developer
 agent: openagent
+model: anthropic/claude-sonnet-4-5
 
-prompt: |
-  Create a README.md file at evals/test_tmp/README.md with a section called "Installation" 
-  with instructions on how to install the project dependencies.
+# Multi-turn: OpenAgent requires text approval before writing
+prompts:
+  - text: |
+      Create a README.md file at evals/test_tmp/README.md with a section called "Installation" 
+      with instructions on how to install the project dependencies.
+    expectContext: true
+    contextFile: ".opencode/context/core/standards/docs.md"
+  
+  - text: |
+      Yes, proceed with the plan. Execute it now.
+    delayMs: 2000
 
 # Expected behavior
 behavior:
-  mustUseTools: [read, edit]   # Must read context + README, then edit
+  mustUseAnyOf: [[read, write], [read, edit]]  # May use write or edit
   requiresApproval: true
   requiresContext: true         # MUST load docs.md before editing
-  minToolCalls: 2               # At least: read context + edit file
+  minToolCalls: 2               # At least: read context + write/edit file
 
 # Expected violations
 expectedViolations:
@@ -37,7 +46,7 @@ expectedViolations:
 approvalStrategy:
   type: auto-approve
 
-timeout: 60000
+timeout: 120000
 
 tags:
   - workflow-validation

+ 13 - 4
evals/agents/openagent/tests/developer/ctx-tests-001.yaml

@@ -9,10 +9,19 @@ description: |
 
 category: developer
 agent: openagent
+model: anthropic/claude-sonnet-4-5
 
-prompt: |
-  Write a test for the add function in evals/test_tmp/math.ts.
-  Create the test file at evals/test_tmp/math.test.ts
+# Multi-turn: OpenAgent requires text approval before writing
+prompts:
+  - text: |
+      Write a test for the add function in evals/test_tmp/math.ts.
+      Create the test file at evals/test_tmp/math.test.ts
+    expectContext: true
+    contextFile: ".opencode/context/core/standards/tests.md"
+  
+  - text: |
+      Yes, proceed with the plan. Execute it now.
+    delayMs: 2000
 
 # Expected behavior
 behavior:
@@ -37,7 +46,7 @@ expectedViolations:
 approvalStrategy:
   type: auto-approve
 
-timeout: 60000
+timeout: 120000
 
 tags:
   - workflow-validation

+ 1 - 1
evals/agents/openagent/tests/developer/task-simple-001.yaml

@@ -25,7 +25,7 @@ prompts:
 
 # Expected behavior after approval
 behavior:
-  mustUseTools: [bash]
+  mustUseAnyOf: [[bash], [list]]  # Agent may use list instead of bash
   minToolCalls: 1
   # First response should contain approval request
   shouldContainInAnyMessage:
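
Reviewer note: the `mustUseAnyOf` values in this commit (`[[bash], [list]]` and `[[read, write], [read, edit]]`) read as a list of alternative tool groups. A minimal sketch of that check follows, assuming each inner list is a group whose tools must all appear and the outer list means "any one group is enough". The evaluator itself is not part of this diff, so the function name and these semantics are assumptions.

```python
from typing import Iterable, List


def satisfies_any_of(used_tools: Iterable[str], alternatives: List[List[str]]) -> bool:
    """Return True if at least one alternative group is fully covered.

    Assumed semantics: every tool in a group must have been used,
    and a single satisfied group is sufficient.
    """
    used = set(used_tools)
    return any(set(group) <= used for group in alternatives)


# The agent used `list` instead of `bash`: still acceptable.
print(satisfies_any_of(["list"], [["bash"], ["list"]]))

# The agent only read context and never wrote or edited: not acceptable.
print(satisfies_any_of(["read"], [["read", "write"], ["read", "edit"]]))
```

Under these assumptions the first call prints `True` and the second prints `False`, which matches the behaviour the YAML comments describe ("Agent may use list instead of bash", "May use write or edit").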