ContextScout Integration Test Findings

Date: 2026-01-09
Status: Tests Created, Initial Run Complete

Summary

We created comprehensive tests for ContextScout integration and ran initial tests. Here's what we learned:

Key Findings

1. OpenAgent Does NOT Use ContextScout Proactively ❌

Test: 04-implicit-discovery.yaml - "How does the registry system work?"

Expected: OpenAgent should delegate to ContextScout to discover registry context files

Actual: OpenAgent used grep and read directly to find information

Used grep to search for "registry"
Read registry.json directly
Used grep to search for "auto-detect"
Eventually found .opencode/context/openagents-repo/core-concepts/registry.md
Did NOT use the task tool to delegate to ContextScout

Conclusion: OpenAgent is NOT proactively using ContextScout for discovery. It's doing its own searching.

2. OpenAgent Provides Minimal Responses for Discovery Requests ⚠️

Test: 02-unknown-domain-discovery.yaml - "Explain how the eval framework works"

Expected: OpenAgent should use ContextScout and provide comprehensive answer

Actual: OpenAgent made only 1 tool call and provided minimal response

Very short execution time (14.5s)
Only 1 tool call (insufficient for discovery)
Did NOT delegate to ContextScout

Conclusion: OpenAgent may be treating discovery requests as conversational rather than requiring deep context loading.

3. ContextScout Cannot Be Tested in True Standalone Mode 🔧

Test: 01-code-standards-discovery.yaml via --subagent=contextscout

Expected: ContextScout runs as mode: primary and uses glob/read directly

Actual: Even in "standalone" mode, ContextScout is wrapped by a parent agent

Test runner forces mode: primary (confirmed in debug logs)
But ContextScout still tries to delegate to itself via task tool
Tool calls show: task → ContextScout (recursive!)
The test framework captures parent agent's tool calls, not nested subagent's

Conclusion: The eval framework's "standalone" mode doesn't truly run subagents standalone. They're still invoked through a parent wrapper.

4. Test Framework Limitations 📊

Issue: When testing subagents, the framework captures the parent agent's tool usage, not the subagent's internal tool usage.

Impact:

Can't validate that ContextScout uses glob/read/grep internally
Can only validate that parent agent delegates to ContextScout
Behavior expectations (mustUseTools) check parent, not subagent

Workaround Needed:

Option A: Test ContextScout via delegation mode (--subagent=contextscout --delegate)
Option B: Modify test framework to capture nested subagent tool calls
Option C: Create integration tests that check end-to-end behavior

Test Results Summary

Test	Agent	Expected Behavior	Actual Behavior	Status
04-implicit-discovery	OpenAgent	Use ContextScout	Used grep/read directly	❌ FAIL
02-unknown-domain	OpenAgent	Use ContextScout	Minimal response, no delegation	❌ FAIL
01-code-standards	ContextScout	Use glob/read	Delegated to itself (wrapper issue)	❌ FAIL

Root Cause Analysis

Why OpenAgent Doesn't Use ContextScout

Looking at OpenAgent's prompt (.opencode/agent/core/openagent.md):

Stage 3.0: DiscoverContext is marked as optional="true"

<step id="3.0" name="DiscoverContext" optional="true">
  OPTIONAL: Use ContextScout to discover relevant context files intelligently
  
  When to use ContextScout:
  - Unfamiliar with project structure or domain
  - Need to find domain-specific patterns or standards
  - Looking for examples, guides, or error solutions
  - Want to ensure you have all relevant context before proceeding

Problem: The step is optional, and the criteria are vague. OpenAgent can easily skip this step and use its own tools (grep/read) instead.

Why this happens:

OpenAgent has access to grep and read tools
Using grep/read is faster than delegating to ContextScout
The "when to use" criteria are subjective
No enforcement mechanism to ensure ContextScout is used

Recommendations

Short Term: Fix OpenAgent Prompt

Option 1: Make ContextScout usage more explicit for unfamiliar domains

<step id="3.0" name="DiscoverContext" required="conditional">
  Check if task involves unfamiliar domain:
  - Registry system, eval framework, agent creation, etc.
  
  IF unfamiliar domain:
    MUST delegate to ContextScout for discovery
  ELSE:
    MAY load known context directly

Option 2: Add a decision tree

Is this a standard task (code/docs/tests)? 
  YES → Load known context directly (.opencode/context/core/standards/*)
  NO → Is this domain-specific (registry, evals, agents)?
    YES → MUST use ContextScout for discovery
    NO → Proceed with available context

Medium Term: Improve Test Framework

Issue: Can't test subagents in true standalone mode

Solution: Add nested tool call capture

Track tool calls from parent AND subagents
Separate validation for parent vs subagent behavior
Add nestedToolCalls to test expectations

Long Term: Evaluate ContextScout Value

Question: Is ContextScout actually needed?

Evidence:

OpenAgent can discover context using grep/read directly
ContextScout adds delegation overhead
Agents aren't using it proactively

Options:

Keep ContextScout: Make usage mandatory for specific scenarios
Simplify: Remove ContextScout, improve direct discovery
Hybrid: Use ContextScout only for complex multi-domain queries

Next Steps

Immediate Actions

✅ Document findings (this file)
⏳ Decide on ContextScout strategy:
- Make it mandatory for unfamiliar domains?
- Remove it and improve direct discovery?
- Keep it optional but improve prompts?
⏳ Fix test framework to support true subagent testing
⏳ Update OpenAgent/OpenCoder prompts based on decision

Test Strategy Going Forward

For now, focus on integration tests:

Test that OpenAgent CAN delegate to ContextScout (delegation mode)
Test end-to-end behavior (does agent find correct context?)
Don't worry about internal tool usage until framework supports it

Create simpler tests:

Does OpenAgent find correct context files? (regardless of how)
Does OpenAgent load context before execution?
Does OpenAgent provide accurate answers?

Files Created

OpenAgent Integration Tests

evals/agents/core/openagent/tests/contextscout-integration/04-implicit-discovery.yaml
evals/agents/core/openagent/tests/contextscout-integration/05-multi-domain-comprehensive.yaml

OpenCoder Integration Tests

evals/agents/core/opencoder/tests/contextscout-integration/01-implicit-pattern-discovery.yaml
evals/agents/core/opencoder/tests/contextscout-integration/README.md

ContextScout Functionality Tests

evals/agents/ContextScout/tests/01-code-standards-discovery.yaml
evals/agents/ContextScout/tests/02-domain-specific-discovery.yaml
evals/agents/ContextScout/tests/03-bad-request-handling.yaml
evals/agents/ContextScout/tests/04-multi-domain-comprehensive.yaml
evals/agents/ContextScout/tests/05-tool-usage-validation.yaml
evals/agents/ContextScout/tests/README.md

Documentation

evals/agents/CONTEXTSCOUT_INTEGRATION_TESTS.md - Comprehensive test suite overview
evals/agents/CONTEXTSCOUT_TEST_FINDINGS.md - This file

Framework Updates

Updated evals/framework/src/sdk/run-sdk-tests.ts - Added contextscout to subagent maps
Updated evals/framework/src/sdk/test-runner.ts - Added contextscout to agent map
Updated evals/agents/ContextScout/config/config.yaml - Added test suites

Conclusion

The tests revealed that OpenAgent and OpenCoder are NOT proactively using ContextScout as intended. They're using their own discovery tools (grep/read) instead, which is actually faster but potentially less comprehensive.

Decision needed: Should we:

Enforce ContextScout usage for unfamiliar domains?
Remove ContextScout and improve direct discovery?
Keep current approach (optional ContextScout)?

Test framework limitation: Can't validate subagent internal behavior in standalone mode. Need to either fix framework or focus on integration/delegation tests.

Next conversation: Decide on ContextScout strategy and update agent prompts accordingly.

CONTEXTSCOUT_TEST_FINDINGS.md 8.5 KB History Raw