# ContextScout Evaluation Tests

**Purpose**: Validate that ContextScout correctly discovers context files, uses appropriate tools, and handles various request types.

**Created**: 2026-01-09

**Status**: Ready to Run

---

## Test Suite Overview

### Test 1: Code Standards Discovery ⭐

**File**: `01-code-standards-discovery.yaml`

**Validates**:
- Uses glob/read/grep for discovery (not bash)
- Finds `.opencode/context/core/standards/code-quality.md`
- Returns exact paths with priority ratings
- Includes line ranges for key sections

**Expected**: ✅ Finds code-quality.md with ⭐⭐⭐⭐⭐ priority

---

### Test 2: Domain-Specific Discovery 🎯

**File**: `02-domain-specific-discovery.yaml`

**Validates**:
- Searches domain-specific directories (openagents-repo)
- Checks navigation.md first
- Finds eval framework context
- Prioritizes domain files over generic ones

**Expected**: ✅ Finds evals.md and related files with correct priorities

---

### Test 3: Bad Request Handling ⚠️

**File**: `03-bad-request-handling.yaml`

**Validates**:
- Handles vague/invalid queries gracefully
- Doesn't fabricate non-existent files
- Reports honestly when nothing found
- Suggests alternatives or clarifications

**Expected**: ✅ Reports "no files found" without fabricating paths

---

### Test 4: Multi-Domain Comprehensive 🌐

**File**: `04-multi-domain-comprehensive.yaml`

**Validates**:
- Discovers files across multiple domains
- Finds all relevant files (agent, code, test, eval)
- Prioritizes correctly (critical > high > medium)
- Provides comprehensive loading strategy

**Expected**: ✅ Finds 4-5 files across domains with correct priorities

---

### Test 5: Tool Usage Validation 🔒

**File**: `05-tool-usage-validation.yaml`

**Validates**:
- Read-only constraint enforcement
- Uses glob/read/grep appropriately
- NEVER uses write/edit/bash
- Respects tool permissions

**Expected**: ✅ Only uses read/glob/grep, never write/edit/bash

---

## Running Tests

### Run All ContextScout Tests

```bash
cd evals/framework
npm run eval:sdk -- --agent=ContextScout
```

### Run Individual Tests

```bash
# Test 1: Code standards discovery
npm run eval:sdk -- --agent=ContextScout --pattern="01-code-standards-discovery.yaml"

# Test 2: Domain-specific discovery
npm run eval:sdk -- --agent=ContextScout --pattern="02-domain-specific-discovery.yaml"

# Test 3: Bad request handling
npm run eval:sdk -- --agent=ContextScout --pattern="03-bad-request-handling.yaml"

# Test 4: Multi-domain comprehensive
npm run eval:sdk -- --agent=ContextScout --pattern="04-multi-domain-comprehensive.yaml"

# Test 5: Tool usage validation
npm run eval:sdk -- --agent=ContextScout --pattern="05-tool-usage-validation.yaml"
```

### Run with Debug

```bash
npm run eval:sdk -- --agent=ContextScout --debug
```

---

## Success Criteria

### ContextScout is "Working Correctly" if:

1. ✅ **Discovery**: Uses glob/read/grep to find files (Test 1, 2, 4)
2. ✅ **Accuracy**: Finds correct files 95%+ of the time (Test 1, 2, 4)
3. ✅ **Priorities**: Rates files correctly (critical > high > medium) (Test 2, 4)
4. ✅ **Read-Only**: Never uses write/edit/bash (Test 5)
5. ✅ **Error Handling**: Handles bad requests gracefully (Test 3)
6. ✅ **Comprehensive**: Finds all relevant files for multi-domain queries (Test 4)

### ContextScout is "Broken" if:

1. ❌ **Fabrication**: Makes up file paths without verification
2. ❌ **Wrong Tools**: Uses bash instead of glob/read/grep
3. ❌ **Violations**: Uses write/edit tools (read-only violation)
4. ❌ **Incomplete**: Misses critical files in multi-domain search
5. ❌ **Poor Accuracy**: Finds wrong files >20% of the time
6. ❌ **Bad Errors**: Crashes or provides unhelpful errors on bad requests

---

## Expected Outcomes

### Test 1: Code Standards Discovery

```
✅ PASS
- Used glob to search for "code" and "standards"
- Found: .opencode/context/core/standards/code-quality.md
- Priority: ⭐⭐⭐⭐⭐ (critical)
- Included line ranges for key sections
- No write/edit/bash tools used
```

### Test 2: Domain-Specific Discovery

```
✅ PASS
- Checked navigation.md first
- Found: .opencode/context/openagents-repo/core-concepts/evals.md (⭐⭐⭐⭐⭐)
- Also found: guides/testing-agent.md (⭐⭐⭐⭐)
- Prioritized domain-specific over generic
- Provided loading strategy
```

### Test 3: Bad Request Handling

```
✅ PASS
- Used glob to search for "quantum blockchain AI"
- Found no relevant files
- Reported honestly: "No context files found for this topic"
- Suggested alternatives: "Available topics: agents, evals, registry..."
- Did NOT fabricate paths
```

### Test 4: Multi-Domain Comprehensive

```
✅ PASS
- Found 5 files across domains:
  1. ⭐⭐⭐⭐⭐ guides/adding-agent.md
  2. ⭐⭐⭐⭐⭐ core-concepts/agents.md
  3. ⭐⭐⭐⭐ standards/code-quality.md
  4. ⭐⭐⭐⭐ standards/test-coverage.md
  5. ⭐⭐⭐⭐ core-concepts/evals.md
- Correct priority order
- Provided loading strategy
```

### Test 5: Tool Usage Validation

```
✅ PASS
- Used glob for discovery
- Used read for content
- Did NOT use write/edit/bash
- Respected read-only constraints
```

---

## Debugging Failed Tests

### If Test 1 Fails (Code Standards)

**Problem**: ContextScout didn't find code-quality.md

**Check**:
- Did it use glob to search?
- Did it search in .opencode/context/core/standards/?
- Did it verify file exists before returning?

### If Test 2 Fails (Domain-Specific)

**Problem**: ContextScout didn't find eval context

**Check**:
- Did it check navigation.md first?
- Did it search in openagents-repo directory?
- Did it prioritize domain files correctly?

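
The "did it verify file exists" check above can also be done by hand against the paths ContextScout returned. The following is a minimal sketch, not part of the eval framework: `verify_paths` is a hypothetical helper name, and it assumes you pass the reported paths as arguments from the repository root.

```bash
# Hypothetical helper (not part of the eval framework): confirm that every
# path reported by ContextScout exists as a regular file on disk.
# Prints one "MISSING: <path>" line per absent file and returns non-zero
# if any path is missing, which points to a fabricated or stale path.
verify_paths() {
  local missing=0
  local path
  for path in "$@"; do
    if [ ! -f "$path" ]; then
      echo "MISSING: $path"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `verify_paths .opencode/context/core/standards/code-quality.md` prints nothing and returns 0 when the file is present.
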
### If Test 3 Fails (Bad Request)

**Problem**: ContextScout fabricated files or crashed

**Check**:
- Did it still use tools to search?
- Did it report honestly when nothing was found?
- Did it provide helpful suggestions?

### If Test 4 Fails (Multi-Domain)

**Problem**: ContextScout missed files or assigned wrong priorities

**Check**:
- Did it search multiple directories?
- Did it find all 4-5 expected files?
- Were priorities correct (critical first)?

### If Test 5 Fails (Tool Usage)

**Problem**: ContextScout used forbidden tools

**Check**:
- Did it use write or edit? (VIOLATION)
- Did it use bash instead of glob/read? (VIOLATION)
- Check tool call logs for violations

---

## Related Documentation

- [ContextScout Agent](.opencode/agent/ContextScout.md)
- [OpenAgent Integration Tests](../../openagent/tests/contextscout-integration/)
- [Eval Framework Concepts](.opencode/context/openagents-repo/core-concepts/evals.md)

---

**Key Insight**: ContextScout must be a reliable, read-only discovery tool that uses appropriate tools (glob/read/grep), finds correct files, and handles errors gracefully. These tests validate all critical behaviors.
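
When checking tool call logs for the read-only violations described under Test 5, a small filter can help. This is a sketch under stated assumptions: the log format is not specified here, so the hypothetical `check_tools` helper takes tool names as plain arguments, and extracting those names from the actual logs is left to you.

```bash
# Hypothetical helper (not part of the eval framework): flag forbidden tools.
# Arguments are tool names already extracted from a tool call log; the log
# format itself is an assumption and is not parsed here.
# write, edit and bash are the forbidden tools named in this document.
# Any other name (glob, read, grep, ...) passes silently.
check_tools() {
  local violations=0
  local tool
  for tool in "$@"; do
    case "$tool" in
      write|edit|bash)
        echo "VIOLATION: $tool"
        violations=1
        ;;
    esac
  done
  return "$violations"
}
```

Calling `check_tools glob read grep` prints nothing and returns 0, while `check_tools read write` prints `VIOLATION: write` and returns 1.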