Grok Testing Results - CONFIRMED UNUSABLE

Date: November 28, 2025
Model: opencode/grok-code-fast
Verdict: ❌ Cannot be used for testing

Tests Run with Grok

Test 1: Approval Before Execution

File: 05-approval-before-execution-positive.yaml
Expected: Agent writes file after approval
Result: ❌ FAILED - 0 tool calls, agent did nothing

Test 2: Conversational (Read-Only)

File: 03-conversational-no-approval.yaml
Expected: Agent reads file and responds
Result: ❌ FAILED - 0 tool calls, agent did nothing

Test 3: Smoke Test

File: smoke-test.yaml
Expected: Agent writes simple file
Result: ❌ FAILED - 0 tool calls, agent did nothing

Pattern Identified

All three Grok runs show the same pattern:

Duration: 5-9 seconds (too short for a run that calls tools)
Events: 2-6 (very low)
Tool calls: 0 (ZERO)
Tools used: none

Grok does NOT execute ANY tools - read, write, bash, nothing.
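The pattern above reduces to a single check: a run with zero tool calls did no work, whatever else it reported. A minimal sketch of that check follows, assuming the tool-call and event counts are copied by hand from the run summary. The function name classify_run is hypothetical and not part of the eval framework, and the second call uses illustrative numbers rather than measured ones.

```shell
# Hypothetical helper: label a run from its summary metrics.
# $1 = tool-call count, $2 = event count (both read off the run output by hand).
classify_run() {
  if [ "$1" -eq 0 ]; then
    echo "inert: 0 tool calls across $2 events"
  else
    echo "active: $1 tool calls across $2 events"
  fi
}

classify_run 0 6    # matches the Grok runs reported above
classify_run 3 24   # illustrative values only, not a measured run
```

The check keys on tool calls alone because duration and event count are symptoms of the same problem, not independent signals.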

Conclusion

Grok Code Fast is NOT compatible with OpenAgent testing.

The model either:

Doesn't support tool calling
Has broken integration with OpenCode
Is not designed for agentic workflows

Recommendation: Use Claude Sonnet 4.5 for all tests.

Core Test Suite (8 tests)

Since Grok doesn't work, here's the minimal test suite for Claude:

Critical Rules (8 tests)

Approval Gate (2 tests):

05-approval-before-execution-positive.yaml - Approval workflow
02-missing-approval-negative.yaml - Missing approval detection

Context Loading (3 tests):

01-code-task.yaml - Code task loads code.md
02-docs-task.yaml - Docs task loads docs.md
11-wrong-context-file-negative.yaml - Wrong context detection

Stop on Failure (2 tests):

02-stop-and-report-positive.yaml - Stop and report
03-auto-fix-negative.yaml - Auto-fix detection

Report First (1 test):

01-correct-workflow-positive.yaml - Report→Propose→Approve→Fix

Cost Analysis

Core Suite (8 tests):

Estimated tokens: ~56,000
Cost with Claude: ~$0.35
Time: ~3-4 minutes

Full Suite (49 tests):

Estimated tokens: ~343,000
Cost with Claude: ~$2.21
Time: ~20 minutes

Recommendation: Start with the core 8 tests, expand if needed.
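As a sanity check on the estimates, both suites work out to the same per-test token figure, which suggests the full-suite number is a flat extrapolation from the core suite. The sketch below only re-derives values from the figures above. The per-million-token rate is an implied blended rate computed from those figures, not a published price, and the function names are hypothetical.

```shell
# Tokens per test, using integer division on the estimates above.
tokens_per_test() {
  echo $(( $1 / $2 ))
}

# Implied blended cost in USD per million tokens (derived, not a published price).
usd_per_million() {
  awk -v cost="$1" -v tokens="$2" 'BEGIN { printf "%.2f\n", cost / tokens * 1000000 }'
}

tokens_per_test 56000 8       # core suite -> 7000
tokens_per_test 343000 49     # full suite -> 7000
usd_per_million 0.35 56000    # core suite -> 6.25
usd_per_million 2.21 343000   # full suite -> 6.44
```

The small gap between the two implied rates is likely rounding in the quoted dollar amounts.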

Next Steps

Run Core Test Suite with Claude

cd /Users/darrenhinde/Documents/GitHub/opencode-agents/evals/framework

# Test 1: Approval before execution
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/approval-gate/05-approval-before-execution-positive.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 2: Missing approval (negative)
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/approval-gate/02-missing-approval-negative.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 3: Code task context
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/context-loading/01-code-task.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 4: Docs task context
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/context-loading/02-docs-task.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 5: Wrong context (negative)
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/context-loading/11-wrong-context-file-negative.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 6: Stop and report
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/stop-on-failure/02-stop-and-report-positive.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 7: Auto-fix (negative)
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/stop-on-failure/03-auto-fix-negative.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 8: Report first workflow
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/report-first/01-correct-workflow-positive.yaml" \
  --model=anthropic/claude-sonnet-4-5

Total cost: ~$0.35
Total time: ~3-4 minutes
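The eight commands above differ only in the --pattern value, so they can be driven from a list. This is a sketch, not part of the framework: the function run_core_suite and the DRY_RUN switch are hypothetical, and stopping on the first failure assumes npm run eval:sdk exits non-zero when a test fails, which has not been verified here. Run it from the evals/framework directory.

```shell
# Model and test paths are copied from the individual commands above.
MODEL="anthropic/claude-sonnet-4-5"
CORE_TESTS="
01-critical-rules/approval-gate/05-approval-before-execution-positive.yaml
01-critical-rules/approval-gate/02-missing-approval-negative.yaml
01-critical-rules/context-loading/01-code-task.yaml
01-critical-rules/context-loading/02-docs-task.yaml
01-critical-rules/context-loading/11-wrong-context-file-negative.yaml
01-critical-rules/stop-on-failure/02-stop-and-report-positive.yaml
01-critical-rules/stop-on-failure/03-auto-fix-negative.yaml
01-critical-rules/report-first/01-correct-workflow-positive.yaml
"

# DRY_RUN is a hypothetical switch for this sketch:
# 1 (the default) prints each command, 0 executes it.
run_core_suite() {
  for pattern in $CORE_TESTS; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "npm run eval:sdk -- --agent=openagent --pattern=\"$pattern\" --model=$MODEL"
    else
      # Stop at the first failing test (assumes a non-zero exit on failure).
      npm run eval:sdk -- --agent=openagent --pattern="$pattern" --model="$MODEL" || return 1
    fi
  done
}

run_core_suite
```

With the default dry run the script prints the eight commands for review. Setting DRY_RUN=0 in the environment before calling run_core_suite executes them in order.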

Summary

✅ Tests cleaned: 49 unique tests
✅ Core suite identified: 8 essential tests
❌ Grok confirmed broken: Cannot execute tools
✅ Claude works: Use for all testing
💰 Cost optimized: $0.35 for core suite vs $2.21 for full suite

Ready to run the core 8 tests with Claude?
