Last commit: f669cac34c by Darren Hinde, "feat: repository review and MVI context system implementation (#85)", 3 months ago

Files:
01-known-context-direct-load.yaml
02-unknown-domain-discovery.yaml
03-accuracy-correct-files.yaml
04-implicit-discovery.yaml
05-multi-domain-comprehensive.yaml
06-agent-creation-uses-contextscout.yaml
07-content-creation-uses-contextscout.yaml
08-known-domain-no-contextscout.yaml
README.md
TEST_RESULTS.md

ContextScout Integration Tests

Purpose: Validate that OpenAgent uses ContextScout effectively: at the right time, for the right reasons, and with the right outcomes.

Created: 2026-01-07
Status: Ready to Run

The Big Question

Should we use ContextScout to help agents discover context, or is it adding unnecessary complexity?

These tests answer:

✅ When should agents use ContextScout vs. direct loading?
✅ Does ContextScout improve accuracy and completeness?
✅ What's the performance trade-off?
✅ Are agents using it predictably?

Test Suite

Test 1: Known Context - Direct Loading ⚡

File: 01-known-context-direct-load.yaml

Scenario: "Write a fibonacci function"

Expected: Agent should DIRECTLY load code.md without using ContextScout

Why: For standard tasks (code/docs/tests), the context path is already known, so ContextScout would add overhead without adding value.

Success Criteria:

✅ Loads .opencode/context/core/standards/code.md
✅ Does NOT use the task tool (no ContextScout delegation)
✅ Completes in <30 seconds
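
The three criteria above can be expressed as a small check over a run summary. This is an illustrative sketch only: `RunSummary` and its fields are hypothetical names, not types from the eval framework, and it assumes that ContextScout delegation appears as a `task` tool call, as the second criterion implies.

```python
from dataclasses import dataclass, field

CODE_STANDARDS = ".opencode/context/core/standards/code.md"


@dataclass
class RunSummary:
    """Hypothetical record of one agent run (not a framework type)."""
    loaded_files: list[str] = field(default_factory=list)
    tools_used: list[str] = field(default_factory=list)
    duration_seconds: float = 0.0


def check_test1(run: RunSummary) -> dict[str, bool]:
    """Evaluate the three Test 1 success criteria for a single run."""
    return {
        "loads_code_md": CODE_STANDARDS in run.loaded_files,
        "no_task_tool": "task" not in run.tools_used,
        "under_30_seconds": run.duration_seconds < 30,
    }


# Made-up example run, for illustration only
run = RunSummary(
    loaded_files=[CODE_STANDARDS],
    tools_used=["read", "write"],
    duration_seconds=12.5,
)
results = check_test1(run)
print(results)
print("PASS" if all(results.values()) else "FAIL")
```

Reporting each criterion separately, rather than a single boolean, makes it easier to see which expectation a failing run violated.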

Test 2: Unknown Domain - Discovery 🔍

File: 02-unknown-domain-discovery.yaml

Scenario: "Explain how the eval framework works"

Expected: Agent should USE ContextScout to discover eval-specific context

Why: For domain-specific topics, agents don't know what context exists. ContextScout discovers relevant files.

Success Criteria:

✅ Delegates to ContextScout
✅ Finds .opencode/context/openagents-repo/core-concepts/evals.md
✅ Loads discovered files
✅ Provides a comprehensive answer

Test 3: Accuracy - Correct Files 🎯

File: 03-accuracy-correct-files.yaml

Scenario: "What are the MVI principles?"

Expected: ContextScout finds the CORRECT file (mvi.md), not random files

Why: Validates ContextScout's search accuracy and relevance filtering.

Success Criteria:

✅ Uses ContextScout to search
✅ Finds .opencode/context/core/context-system/standards/mvi.md
✅ No irrelevant files loaded
✅ Accurate explanation of MVI
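
The "correct file, no irrelevant files" criteria amount to a set comparison between the files a run loaded and the files it was expected to load. The sketch below is a minimal illustration, assuming that any loaded file outside the expected set counts as irrelevant; the function name and return shape are hypothetical, not part of the framework.

```python
EXPECTED = {".opencode/context/core/context-system/standards/mvi.md"}


def file_accuracy(loaded: set[str], expected: set[str]) -> dict[str, object]:
    """Compare the files a run loaded with the files it should have loaded."""
    missing = expected - loaded      # expected files that were never loaded
    irrelevant = loaded - expected   # loaded files that were not expected
    return {
        "found_expected": not missing,
        "irrelevant": sorted(irrelevant),
        "passed": not missing and not irrelevant,
    }


# Made-up runs: one clean, one that also loads an unrelated file
clean = file_accuracy(set(EXPECTED), EXPECTED)
noisy = file_accuracy(EXPECTED | {".opencode/context/core/standards/code.md"}, EXPECTED)
print(clean["passed"], noisy["passed"], noisy["irrelevant"])
```

The noisy run still finds the expected file but fails overall, which corresponds to the "finds correct files but also loads irrelevant ones" outcome described under Scenario C.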

Running the Tests

Run All Integration Tests

cd evals/framework
npm run eval:sdk -- --agent=core/openagent --pattern="contextscout-integration/*.yaml"

Run Individual Tests

# Test 1: Known context (should NOT use ContextScout)
npm run eval:sdk -- --agent=core/openagent --pattern="01-known-context-direct-load.yaml"

# Test 2: Unknown domain (should use ContextScout)
npm run eval:sdk -- --agent=core/openagent --pattern="02-unknown-domain-discovery.yaml"

# Test 3: Accuracy (ContextScout finds correct files)
npm run eval:sdk -- --agent=core/openagent --pattern="03-accuracy-correct-files.yaml"

Run Without Evaluators (Focus on Behavior)

npm run eval:sdk -- --agent=core/openagent --pattern="contextscout-integration/*.yaml" --no-evaluators
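
To run the three Phase 1 files one after another from a script, the commands above can be assembled programmatically. The sketch below only builds and prints the command lines; it uses the flags shown in this section (`--agent`, `--pattern`, `--no-evaluators`) and nothing else. The helper name `build_command` is hypothetical.

```python
import shlex

AGENT = "core/openagent"
PHASE1_FILES = [
    "01-known-context-direct-load.yaml",
    "02-unknown-domain-discovery.yaml",
    "03-accuracy-correct-files.yaml",
]


def build_command(pattern: str, evaluators: bool = True) -> list[str]:
    """Build the npm argument list for one eval run."""
    cmd = ["npm", "run", "eval:sdk", "--", f"--agent={AGENT}", f"--pattern={pattern}"]
    if not evaluators:
        cmd.append("--no-evaluators")
    return cmd


# Print the commands instead of executing them
for name in PHASE1_FILES:
    print(shlex.join(build_command(name)))
```

Each list can be passed to `subprocess.run(cmd, cwd="evals/framework", check=True)` to execute it from the framework directory.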

Interpreting Results

Scenario A: ContextScout is "The Way Forward" ✅

If tests show:

Test 1: Agent loads code.md directly (fast, no ContextScout)
Test 2: Agent uses ContextScout, finds eval context (accurate discovery)
Test 3: ContextScout finds mvi.md correctly (high accuracy)

Conclusion: ContextScout is valuable for discovery, and agents use it intelligently

Action: Keep ContextScout, refine when/how agents use it

Scenario B: ContextScout Adds Overhead Without Value ❌

If tests show:

Test 1: Agent uses ContextScout for simple task (unnecessary overhead)
Test 2: Agent doesn't use ContextScout, misses domain context (inconsistent)
Test 3: ContextScout finds wrong files (poor accuracy)

Conclusion: ContextScout isn't helping, and agents aren't using it predictably

Action: Remove ContextScout OR fix agent decision-making

Scenario C: Mixed Results - Needs Refinement ⚠️

If tests show:

Test 1: Sometimes uses ContextScout, sometimes doesn't (unpredictable)
Test 2: Uses ContextScout but takes too long (>60s)
Test 3: Finds correct files but also loads irrelevant ones (noisy)

Conclusion: ContextScout has potential but needs tuning

Action: Refine agent prompts, improve ContextScout search, add decision criteria

Decision Matrix

| Test Result | Interpretation | Action |
|---|---|---|
| All 3 pass | ✅ ContextScout working as designed | Keep it, document best practices |
| Test 1 fails (uses ContextScout) | ⚠️ Agents using it when they shouldn't | Refine agent decision logic |
| Test 2 fails (no ContextScout) | ⚠️ Agents not using it when they should | Update agent prompts |
| Test 3 fails (wrong files) | ❌ ContextScout search accuracy poor | Improve ContextScout search logic |
| All 3 fail | ❌ ContextScout not ready | Disable or redesign |
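
The matrix can be read as a lookup from three pass/fail results to one or more actions. The sketch below is one possible encoding. The matrix has no row for exactly two failures, so the sketch assumes the single-failure actions simply combine in that case; that combination rule is an assumption, not something the matrix states.

```python
def decide(test1_pass: bool, test2_pass: bool, test3_pass: bool) -> list[str]:
    """Return the decision-matrix actions that apply to a set of Phase 1 results."""
    if test1_pass and test2_pass and test3_pass:
        return ["Keep it, document best practices"]
    if not (test1_pass or test2_pass or test3_pass):
        return ["Disable or redesign"]

    # One or two failures: collect the action for each failing test
    # (combining actions for two failures is an assumption).
    actions = []
    if not test1_pass:
        actions.append("Refine agent decision logic")
    if not test2_pass:
        actions.append("Update agent prompts")
    if not test3_pass:
        actions.append("Improve ContextScout search logic")
    return actions


print(decide(True, True, True))
print(decide(False, True, True))
print(decide(True, False, False))
```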

Success Metrics

ContextScout is "Worth It" if:

Accuracy: Finds correct files 90%+ of the time (Test 3)
Efficiency: Agents skip it for known tasks (Test 1)
Discovery: Agents use it for unknown domains (Test 2)
Speed: Adds <30s overhead for discovery (Test 2)
Predictability: Consistent behavior across runs

ContextScout is "Not Worth It" if:

Overhead: Used for simple tasks where direct loading is faster
Inaccuracy: Finds wrong files >20% of the time
Unpredictability: Random usage, no clear pattern
Complexity: Makes workflows harder to understand
Slowness: Adds >60s to task completion
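
The numeric thresholds in the two lists above leave a gap (for example, accuracy between 80% and 90%, or overhead between 30s and 60s). The sketch below shows one way to turn the lists into a three-way verdict, treating that gap as "needs refinement" in line with Scenario C. The `Metrics` fields are hypothetical aggregates, the ">60s" limit is applied to discovery overhead as a simplification, and the "Complexity" criterion is left out because it is not directly measurable.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical aggregate over repeated runs (field names are illustrative)."""
    correct_file_rate: float     # fraction of Test 3 runs that found the correct files
    discovery_overhead_s: float  # seconds ContextScout added in Test 2
    skipped_when_known: bool     # Test 1: ContextScout not used for known tasks
    used_when_unknown: bool      # Test 2: ContextScout used for unknown domains
    consistent: bool             # same behavior across repeated runs


def verdict(m: Metrics) -> str:
    """Classify results as 'worth it', 'not worth it', or 'needs refinement'."""
    worth_it = (
        m.correct_file_rate >= 0.90
        and m.discovery_overhead_s < 30
        and m.skipped_when_known
        and m.used_when_unknown
        and m.consistent
    )
    not_worth_it = (
        m.correct_file_rate < 0.80      # wrong files more than 20% of the time
        or m.discovery_overhead_s > 60
        or not m.skipped_when_known     # used for simple tasks
        or not m.consistent             # no clear usage pattern
    )
    if worth_it:
        return "worth it"
    if not_worth_it:
        return "not worth it"
    return "needs refinement"


# Made-up numbers, for illustration only
print(verdict(Metrics(0.95, 20.0, True, True, True)))
print(verdict(Metrics(0.85, 45.0, True, True, True)))
```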

Next Steps

1. Run Phase 1 tests (these 3 tests)
2. Analyze results using decision matrix above
3. Decide: Is ContextScout improving workflows?
   If YES: Document best practices, add more tests
   If NO: Disable ContextScout, use direct loading only
   If MIXED: Refine agent prompts and ContextScout search
