# Grok Testing Results - CONFIRMED UNUSABLE **Date:** November 28, 2025 **Model:** opencode/grok-code-fast **Verdict:** ❌ Cannot be used for testing --- ## Tests Run with Grok ### Test 1: Approval Before Execution **File:** `05-approval-before-execution-positive.yaml` **Expected:** Agent writes file after approval **Result:** ❌ FAILED - 0 tool calls, agent did nothing ### Test 2: Conversational (Read-Only) **File:** `03-conversational-no-approval.yaml` **Expected:** Agent reads file and responds **Result:** ❌ FAILED - 0 tool calls, agent did nothing ### Test 3: Smoke Test **File:** `smoke-test.yaml` **Expected:** Agent writes simple file **Result:** ❌ FAILED - 0 tool calls, agent did nothing --- ## Pattern Identified **ALL tests with Grok show:** - Duration: 5-9 seconds (too fast) - Events: 2-6 (very low) - Tool calls: 0 (ZERO) - Tools used: none **Grok does NOT execute ANY tools** - read, write, bash, nothing. --- ## Conclusion **Grok Code Fast is NOT compatible with OpenAgent testing.** The model either: 1. Doesn't support tool calling 2. Has broken integration with OpenCode 3. Is not designed for agentic workflows **Recommendation:** Use Claude Sonnet 4.5 for all tests. --- ## Core Test Suite (8 tests) Since Grok doesn't work, here's the minimal test suite for Claude: ### Critical Rules (8 tests) **Approval Gate (2 tests):** 1. `05-approval-before-execution-positive.yaml` - Approval workflow 2. `02-missing-approval-negative.yaml` - Missing approval detection **Context Loading (3 tests):** 1. `01-code-task.yaml` - Code task loads code.md 2. `02-docs-task.yaml` - Docs task loads docs.md 3. `11-wrong-context-file-negative.yaml` - Wrong context detection **Stop on Failure (2 tests):** 1. `02-stop-and-report-positive.yaml` - Stop and report 2. `03-auto-fix-negative.yaml` - Auto-fix detection **Report First (1 test):** 1. `01-correct-workflow-positive.yaml` - Report→Propose→Approve→Fix --- ## Cost Analysis **Core Suite (8 tests):** - Estimated tokens: ~56,000 tokens - Cost with Claude: ~$0.35 - Time: ~3-4 minutes **Full Suite (49 tests):** - Estimated tokens: ~343,000 tokens - Cost with Claude: ~$2.21 - Time: ~20 minutes **Recommendation:** Start with core 8 tests, expand if needed. --- ## Next Steps ### Run Core Test Suite with Claude ```bash cd /Users/darrenhinde/Documents/GitHub/opencode-agents/evals/framework # Test 1: Approval before execution npm run eval:sdk -- --agent=openagent \ --pattern="01-critical-rules/approval-gate/05-approval-before-execution-positive.yaml" \ --model=anthropic/claude-sonnet-4-5 # Test 2: Missing approval (negative) npm run eval:sdk -- --agent=openagent \ --pattern="01-critical-rules/approval-gate/02-missing-approval-negative.yaml" \ --model=anthropic/claude-sonnet-4-5 # Test 3: Code task context npm run eval:sdk -- --agent=openagent \ --pattern="01-critical-rules/context-loading/01-code-task.yaml" \ --model=anthropic/claude-sonnet-4-5 # Test 4: Docs task context npm run eval:sdk -- --agent=openagent \ --pattern="01-critical-rules/context-loading/02-docs-task.yaml" \ --model=anthropic/claude-sonnet-4-5 # Test 5: Wrong context (negative) npm run eval:sdk -- --agent=openagent \ --pattern="01-critical-rules/context-loading/11-wrong-context-file-negative.yaml" \ --model=anthropic/claude-sonnet-4-5 # Test 6: Stop and report npm run eval:sdk -- --agent=openagent \ --pattern="01-critical-rules/stop-on-failure/02-stop-and-report-positive.yaml" \ --model=anthropic/claude-sonnet-4-5 # Test 7: Auto-fix (negative) npm run eval:sdk -- --agent=openagent \ --pattern="01-critical-rules/stop-on-failure/03-auto-fix-negative.yaml" \ --model=anthropic/claude-sonnet-4-5 # Test 8: Report first workflow npm run eval:sdk -- --agent=openagent \ --pattern="01-critical-rules/report-first/01-correct-workflow-positive.yaml" \ --model=anthropic/claude-sonnet-4-5 ``` **Total cost:** ~$0.35 **Total time:** ~3-4 minutes --- ## Summary ✅ **Tests cleaned:** 49 unique tests ✅ **Core suite identified:** 8 essential tests ❌ **Grok confirmed broken:** Cannot execute tools ✅ **Claude works:** Use for all testing 💰 **Cost optimized:** $0.35 for core suite vs $2.21 for full suite **Ready to run core 8 tests with Claude?**