GROK_TEST_RESULTS.md 4.2 KB

Grok Testing Results - CONFIRMED UNUSABLE

Date: November 28, 2025
Model: opencode/grok-code-fast
Verdict: ❌ Cannot be used for testing


Tests Run with Grok

Test 1: Approval Before Execution

File: 05-approval-before-execution-positive.yaml
Expected: Agent writes file after approval
Result: ❌ FAILED - 0 tool calls, agent did nothing

Test 2: Conversational (Read-Only)

File: 03-conversational-no-approval.yaml
Expected: Agent reads file and responds
Result: ❌ FAILED - 0 tool calls, agent did nothing

Test 3: Smoke Test

File: smoke-test.yaml
Expected: Agent writes simple file
Result: ❌ FAILED - 0 tool calls, agent did nothing


Pattern Identified

ALL tests with Grok show:

  • Duration: 5-9 seconds (too fast)
  • Events: 2-6 (very low)
  • Tool calls: 0 (ZERO)
  • Tools used: none

Grok does NOT execute ANY tools - read, write, bash, nothing.


Conclusion

Grok Code Fast is NOT compatible with OpenAgent testing.

The model either:

  1. Doesn't support tool calling
  2. Has broken integration with OpenCode
  3. Is not designed for agentic workflows

Recommendation: Use Claude Sonnet 4.5 for all tests.


Core Test Suite (8 tests)

Since Grok doesn't work, here's the minimal test suite for Claude:

Critical Rules (8 tests)

Approval Gate (2 tests):

  1. 05-approval-before-execution-positive.yaml - Approval workflow
  2. 02-missing-approval-negative.yaml - Missing approval detection

Context Loading (3 tests):

  1. 01-code-task.yaml - Code task loads code.md
  2. 02-docs-task.yaml - Docs task loads docs.md
  3. 11-wrong-context-file-negative.yaml - Wrong context detection

Stop on Failure (2 tests):

  1. 02-stop-and-report-positive.yaml - Stop and report
  2. 03-auto-fix-negative.yaml - Auto-fix detection

Report First (1 test):

  1. 01-correct-workflow-positive.yaml - Report→Propose→Approve→Fix

Cost Analysis

Core Suite (8 tests):

  • Estimated tokens: ~56,000 tokens
  • Cost with Claude: ~$0.35
  • Time: ~3-4 minutes

Full Suite (49 tests):

  • Estimated tokens: ~343,000 tokens
  • Cost with Claude: ~$2.21
  • Time: ~20 minutes

Recommendation: Start with core 8 tests, expand if needed.


Next Steps

Run Core Test Suite with Claude

cd /Users/darrenhinde/Documents/GitHub/opencode-agents/evals/framework

# Test 1: Approval before execution
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/approval-gate/05-approval-before-execution-positive.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 2: Missing approval (negative)
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/approval-gate/02-missing-approval-negative.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 3: Code task context
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/context-loading/01-code-task.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 4: Docs task context
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/context-loading/02-docs-task.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 5: Wrong context (negative)
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/context-loading/11-wrong-context-file-negative.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 6: Stop and report
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/stop-on-failure/02-stop-and-report-positive.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 7: Auto-fix (negative)
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/stop-on-failure/03-auto-fix-negative.yaml" \
  --model=anthropic/claude-sonnet-4-5

# Test 8: Report first workflow
npm run eval:sdk -- --agent=openagent \
  --pattern="01-critical-rules/report-first/01-correct-workflow-positive.yaml" \
  --model=anthropic/claude-sonnet-4-5

Total cost: ~$0.35
Total time: ~3-4 minutes


Summary

Tests cleaned: 49 unique tests
Core suite identified: 8 essential tests
Grok confirmed broken: Cannot execute tools
Claude works: Use for all testing
💰 Cost optimized: $0.35 for core suite vs $2.21 for full suite

Ready to run core 8 tests with Claude?