4 months ago · f98a23c1d4
--- a/evals/DOCUMENTATION_CLEANUP.md
+++ b/evals/DOCUMENTATION_CLEANUP.md
@@ -1,273 +0,0 @@
 
				-# Documentation Cleanup Summary
			
 
				-
			
 
				-**Date**: 2025-11-26  
			
 
				-**Status**: ✅ Complete
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Changes Made
			
 
				-
			
 
				-### Files Deleted (3)
			
 
				-
			
 
				-1. **`evals/framework/SESSION_STORAGE_FIX.md`** (173 lines)
			
 
				-   - **Reason**: Historical fix documentation, no longer relevant
			
 
				-   - **Status**: ✅ Deleted
			
 
				-
			
 
				-2. **`evals/TESTING_CONFIDENCE.md`** (121 lines)
			
 
				-   - **Reason**: Outdated, superseded by IMPLEMENTATION_SUMMARY.md
			
 
				-   - **Content**: Old test confidence assessment from before context loading fixes
			
 
				-   - **Status**: ✅ Deleted
			
 
				-
			
 
				-3. **`evals/agents/openagent/TEST_REVIEW.md`** (325 lines)
			
 
				-   - **Reason**: Outdated test review from Nov 25 (before context loading fixes)
			
 
				-   - **Content**: Old test results, superseded by CONTEXT_LOADING_COVERAGE.md and IMPLEMENTATION_SUMMARY.md
			
 
				-   - **Status**: ✅ Deleted
			
 
				-
			
 
				-### Files Renamed (1)
			
 
				-
			
 
				-1. **`evals/SYSTEM_REVIEW.md` → `evals/ARCHITECTURE.md`**
			
 
				-   - **Reason**: More descriptive name for system architecture review
			
 
				-   - **Content**: Comprehensive architecture review (456 lines)
			
 
				-   - **Status**: ✅ Renamed
			
 
				-
			
 
				-### Files Created (2)
			
 
				-
			
 
				-1. **`evals/GETTING_STARTED.md`** (NEW - 450 lines)
			
 
				-   - **Purpose**: Consolidated quick start guide
			
 
				-   - **Content**: 
			
 
				-     - Running tests
			
 
				-     - Understanding results
			
 
				-     - Creating new tests
			
 
				-     - Debugging
			
 
				-     - Common issues
			
 
				-   - **Replaces**: Scattered information from README.md and HOW_TESTS_WORK.md
			
 
				-   - **Status**: ✅ Created
			
 
				-
			
 
				-2. **`evals/DOCUMENTATION_CLEANUP.md`** (THIS FILE)
			
 
				-   - **Purpose**: Track documentation cleanup changes
			
 
				-   - **Status**: ✅ Created
			
 
				-
			
 
				-### Files Updated (3)
			
 
				-
			
 
				-1. **`evals/README.md`** (322 → 280 lines)
			
 
				-   - **Changes**:
			
 
				-     - More concise overview
			
 
				-     - Points to GETTING_STARTED.md for details
			
 
				-     - Updated with recent achievements (Nov 26)
			
 
				-     - Added context loading tests section
			
 
				-     - Added smart timeout system section
			
 
				-     - Updated test coverage numbers
			
 
				-   - **Status**: ✅ Updated
			
 
				-
			
 
				-2. **`evals/agents/openagent/README.md`** (85 → 350 lines)
			
 
				-   - **Changes**:
			
 
				-     - Comprehensive test coverage section
			
 
				-     - Detailed context loading tests documentation
			
 
				-     - Test structure overview
			
 
				-     - Running instructions
			
 
				-     - Test design examples
			
 
				-     - Troubleshooting section
			
 
				-   - **Status**: ✅ Updated
			
 
				-
			
 
				-3. **`evals/HOW_TESTS_WORK.md`** (308 lines)
			
 
				-   - **Changes**: None (kept as-is for detailed technical reference)
			
 
				-   - **Status**: ✅ Kept
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Documentation Structure (After Cleanup)
			
 
				-
			
 
				-### Top-Level Documentation
			
 
				-
			
 
				-```
			
 
				-evals/
			
 
				-├── README.md                     # System overview (UPDATED)
			
 
				-├── GETTING_STARTED.md            # Quick start guide (NEW)
			
 
				-├── HOW_TESTS_WORK.md             # Detailed test execution guide
			
 
				-├── ARCHITECTURE.md               # System architecture review (RENAMED)
			
 
				-└── DOCUMENTATION_CLEANUP.md      # This file (NEW)
			
 
				-```
			
 
				-
			
 
				-### Framework Documentation
			
 
				-
			
 
				-```
			
 
				-evals/framework/
			
 
				-├── README.md                     # Framework documentation
			
 
				-├── SDK_EVAL_README.md            # Complete SDK guide
			
 
				-├── docs/
			
 
				-│   ├── architecture-overview.md # Framework architecture
			
 
				-│   └── test-design-guide.md     # Test design philosophy
			
 
				-└── run-tests-batch.sh            # Batch test runner
			
 
				-```
			
 
				-
			
 
				-### Agent Documentation
			
 
				-
			
 
				-```
			
 
				-evals/agents/openagent/
			
 
				-├── README.md                     # OpenAgent test suite (UPDATED)
			
 
				-├── CONTEXT_LOADING_COVERAGE.md   # Context loading tests
			
 
				-├── IMPLEMENTATION_SUMMARY.md     # Recent implementation
			
 
				-└── docs/
			
 
				-    └── OPENAGENT_RULES.md        # OpenAgent rules reference
			
 
				-```
			
 
				-
			
 
				-### Results Documentation
			
 
				-
			
 
				-```
			
 
				-evals/results/
			
 
				-├── README.md                     # Results dashboard guide
			
 
				-├── index.html                    # Interactive dashboard
			
 
				-└── serve.sh                      # One-command server
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Documentation Flow
			
 
				-
			
 
				-### For New Users
			
 
				-
			
 
				-1. **Start**: `README.md` - System overview
			
 
				-2. **Next**: `GETTING_STARTED.md` - Quick start guide
			
 
				-3. **Then**: Run tests and view results
			
 
				-4. **Deep Dive**: `HOW_TESTS_WORK.md` - Detailed explanations
			
 
				-
			
 
				-### For Test Authors
			
 
				-
			
 
				-1. **Start**: `GETTING_STARTED.md` - Creating tests section
			
 
				-2. **Reference**: `framework/docs/test-design-guide.md` - Design philosophy
			
 
				-3. **Examples**: `agents/openagent/README.md` - Test examples
			
 
				-4. **Rules**: `agents/openagent/docs/OPENAGENT_RULES.md` - Agent rules
			
 
				-
			
 
				-### For Developers
			
 
				-
			
 
				-1. **Start**: `ARCHITECTURE.md` - System architecture
			
 
				-2. **Framework**: `framework/SDK_EVAL_README.md` - Complete SDK guide
			
 
				-3. **Implementation**: `agents/openagent/IMPLEMENTATION_SUMMARY.md` - Recent changes
			
 
				-4. **Technical**: `HOW_TESTS_WORK.md` - Execution details
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Benefits of Cleanup
			
 
				-
			
 
				-### Before Cleanup
			
 
				-
			
 
				-- ❌ 19 markdown files (excluding node_modules)
			
 
				-- ❌ Outdated information (Nov 25 test reviews)
			
 
				-- ❌ Duplicate content (testing confidence in multiple places)
			
 
				-- ❌ Unclear entry point for new users
			
 
				-- ❌ Historical fix documentation cluttering framework/
			
 
				-
			
 
				-### After Cleanup
			
 
				-
			
 
				-- ✅ 18 markdown files (3 deleted, 2 new, net -1)
			
 
				-- ✅ All information current (Nov 26)
			
 
				-- ✅ No duplicate content
			
 
				-- ✅ Clear entry point (GETTING_STARTED.md)
			
 
				-- ✅ Clean framework directory
			
 
				-- ✅ Better organization
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Documentation Quality Metrics
			
 
				-
			
 
				-### Coverage
			
 
				-
			
 
				-| Audience | Documentation | Status |
			
 
				-|----------|---------------|--------|
			
 
				-| New Users | GETTING_STARTED.md | ✅ Complete |
			
 
				-| Test Authors | test-design-guide.md | ✅ Complete |
			
 
				-| Developers | ARCHITECTURE.md | ✅ Complete |
			
 
				-| OpenAgent Users | agents/openagent/README.md | ✅ Complete |
			
 
				-| Results Users | results/README.md | ✅ Complete |
			
 
				-
			
 
				-### Accuracy
			
 
				-
			
 
				-| Document | Last Updated | Accuracy |
			
 
				-|----------|--------------|----------|
			
 
				-| README.md | 2025-11-26 | ✅ Current |
			
 
				-| GETTING_STARTED.md | 2025-11-26 | ✅ Current |
			
 
				-| HOW_TESTS_WORK.md | 2025-11-26 | ✅ Current |
			
 
				-| ARCHITECTURE.md | 2025-11-26 | ✅ Current |
			
 
				-| agents/openagent/README.md | 2025-11-26 | ✅ Current |
			
 
				-| CONTEXT_LOADING_COVERAGE.md | 2025-11-26 | ✅ Current |
			
 
				-| IMPLEMENTATION_SUMMARY.md | 2025-11-26 | ✅ Current |
			
 
				-
			
 
				-### Maintainability
			
 
				-
			
 
				-- ✅ Clear naming conventions
			
 
				-- ✅ Logical organization
			
 
				-- ✅ No duplicate content
			
 
				-- ✅ Cross-references between docs
			
 
				-- ✅ Easy to find information
			
 
				-- ✅ Easy to update
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Maintenance Guidelines
			
 
				-
			
 
				-### When to Update Documentation
			
 
				-
			
 
				-1. **After Major Features**
			
 
				-   - Update README.md with new features
			
 
				-   - Update GETTING_STARTED.md with new usage examples
			
 
				-   - Create/update implementation summaries
			
 
				-
			
 
				-2. **After Bug Fixes**
			
 
				-   - Update relevant documentation
			
 
				-   - Add to troubleshooting sections if needed
			
 
				-
			
 
				-3. **Monthly Review**
			
 
				-   - Check for outdated information
			
 
				-   - Update test coverage numbers
			
 
				-   - Review and consolidate if needed
			
 
				-
			
 
				-### What to Delete
			
 
				-
			
 
				-- Historical fix documentation (after 3 months)
			
 
				-- Outdated test reviews (superseded by new ones)
			
 
				-- Duplicate content (consolidate instead)
			
 
				-- Temporary investigation notes
			
 
				-
			
 
				-### What to Keep
			
 
				-
			
 
				-- Architecture documentation
			
 
				-- Test design guides
			
 
				-- Getting started guides
			
 
				-- Current implementation summaries
			
 
				-- Troubleshooting guides
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Next Review
			
 
				-
			
 
				-**Scheduled**: 2025-12-26 (1 month)
			
 
				-
			
 
				-**Review Checklist**:
			
 
				-- [ ] Check for outdated information
			
 
				-- [ ] Update test coverage numbers
			
 
				-- [ ] Review new features added
			
 
				-- [ ] Check for duplicate content
			
 
				-- [ ] Verify all links work
			
 
				-- [ ] Update "Last Updated" dates
			
 
				-
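The "Verify all links work" item in the checklist above could be partly automated. The following is a minimal TypeScript sketch, not part of the eval framework: `extractRelativeLinks` and `findBrokenLinks` are hypothetical names, and the regular expression only covers plain inline links of the form `[text](target)`.

```typescript
// Collects relative link targets from markdown text, skipping absolute URLs
// (anything with a scheme such as https:) and in-page anchors (#section).
function extractRelativeLinks(markdown: string): string[] {
  const pattern = /\[[^\]]*\]\(([^)\s]+)\)/g;
  const targets: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    const target = match[1].split("#")[0]; // drop any #fragment
    if (target !== "" && !/^[a-z][a-z0-9+.-]*:/i.test(target)) {
      targets.push(target);
    }
  }
  return targets;
}

// Returns the targets for which `exists` reports false. The caller supplies
// `exists`, so this function itself never touches the file system.
function findBrokenLinks(
  markdown: string,
  exists: (target: string) => boolean
): string[] {
  return extractRelativeLinks(markdown).filter((target) => !exists(target));
}
```

In a real check, `exists` would likely wrap Node's `fs.existsSync`, with each target resolved against the directory of the document being scanned.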
			
 
				----
			
 
				-
			
 
				-## Summary
			
 
				-
			
 
				-✅ **3 files deleted** (outdated/duplicate content)  
			
 
				-✅ **1 file renamed** (better clarity)  
			
 
				-✅ **2 files created** (better organization)  
			
 
				-✅ **3 files updated** (current information)  
			
 
				-✅ **Net result**: Cleaner, more organized, more maintainable documentation
			
 
				-
			
 
				-**Documentation is now**:
			
 
				-- Current (all Nov 26, 2025)
			
 
				-- Well-organized (clear structure)
			
 
				-- Easy to navigate (clear entry points)
			
 
				-- Comprehensive (covers all audiences)
			
 
				-- Maintainable (no duplicates, clear guidelines)
			
 
				-
			
 
				----
			
 
				-
			
 
				-**Cleanup Completed**: 2025-11-26  
			
 
				-**Next Review**: 2025-12-26
			
--- a/evals/SCRIPTS_ORGANIZATION.md
+++ b/evals/SCRIPTS_ORGANIZATION.md
@@ -1,367 +0,0 @@
 
				-# Scripts Organization Summary
			
 
				-
			
 
				-**Date**: 2025-11-26  
			
 
				-**Status**: ✅ Complete
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Changes Made
			
 
				-
			
 
				-### Before Organization
			
 
				-
			
 
				-```
			
 
				-evals/framework/
			
 
				-├── check-agent.mjs
			
 
				-├── debug-claude-session.mjs
			
 
				-├── debug-session.mjs
			
 
				-├── debug-session.ts
			
 
				-├── inspect-session.mjs
			
 
				-├── run-tests-batch.sh
			
 
				-├── test-agent-direct.ts
			
 
				-├── test-event-inspector.js
			
 
				-├── test-session-reader.mjs
			
 
				-├── test-simplified-approach.mjs
			
 
				-├── test-timeline.ts
			
 
				-├── verify-timeline.ts
			
 
				-└── ... (other framework files)
			
 
				-```
			
 
				-
			
 
				-**Issues**:
			
 
				-- ❌ 12 scripts cluttering framework root
			
 
				-- ❌ No clear organization
			
 
				-- ❌ Hard to find specific scripts
			
 
				-- ❌ Unclear which scripts are for what purpose
			
 
				-
			
 
				----
			
 
				-
			
 
				-### After Organization
			
 
				-
			
 
				-```
			
 
				-evals/framework/
			
 
				-├── scripts/
			
 
				-│   ├── debug/                    # Debugging scripts (4 files)
			
 
				-│   │   ├── debug-session.mjs
			
 
				-│   │   ├── debug-session.ts
			
 
				-│   │   ├── debug-claude-session.mjs
			
 
				-│   │   └── inspect-session.mjs
			
 
				-│   │
			
 
				-│   ├── test/                     # Test scripts (6 files)
			
 
				-│   │   ├── test-agent-direct.ts
			
 
				-│   │   ├── test-event-inspector.js
			
 
				-│   │   ├── test-session-reader.mjs
			
 
				-│   │   ├── test-simplified-approach.mjs
			
 
				-│   │   ├── test-timeline.ts
			
 
				-│   │   └── verify-timeline.ts
			
 
				-│   │
			
 
				-│   ├── utils/                    # Utility scripts (2 files)
			
 
				-│   │   ├── run-tests-batch.sh
			
 
				-│   │   └── check-agent.mjs
			
 
				-│   │
			
 
				-│   └── README.md                 # Script documentation
			
 
				-│
			
 
				-└── ... (other framework files)
			
 
				-```
			
 
				-
			
 
				-**Benefits**:
			
 
				-- ✅ Clean framework root
			
 
				-- ✅ Clear organization by purpose
			
 
				-- ✅ Easy to find scripts
			
 
				-- ✅ Comprehensive documentation
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Script Categories
			
 
				-
			
 
				-### Debug Scripts (4 files)
			
 
				-
			
 
				-Scripts for debugging sessions, events, and agent behavior.
			
 
				-
			
 
				-| Script | Purpose | Lines |
			
 
				-|--------|---------|-------|
			
 
				-| `debug-session.mjs` | Debug session data and timeline | ~40 |
			
 
				-| `debug-session.ts` | TypeScript version of session debugger | ~100 |
			
 
				-| `debug-claude-session.mjs` | Debug Claude-specific sessions | ~50 |
			
 
				-| `inspect-session.mjs` | Inspect most recent session events | ~80 |
			
 
				-
			
 
				-**Usage**:
			
 
				-```bash
			
 
				-node scripts/debug/inspect-session.mjs
			
 
				-node scripts/debug/debug-session.mjs <session-id>
			
 
				-npx tsx scripts/debug/debug-session.ts <session-id>
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-### Test Scripts (6 files)
			
 
				-
			
 
				-Scripts for testing framework components during development.
			
 
				-
			
 
				-| Script | Purpose | Lines |
			
 
				-|--------|---------|-------|
			
 
				-| `test-agent-direct.ts` | Direct agent execution test | ~150 |
			
 
				-| `test-event-inspector.js` | Test event capture system | ~40 |
			
 
				-| `test-session-reader.mjs` | Test session reader | ~60 |
			
 
				-| `test-simplified-approach.mjs` | Test simplified test approach | ~100 |
			
 
				-| `test-timeline.ts` | Test timeline builder | ~90 |
			
 
				-| `verify-timeline.ts` | Verify timeline accuracy | ~100 |
			
 
				-
			
 
				-**Usage**:
			
 
				-```bash
			
 
				-npx tsx scripts/test/test-agent-direct.ts
			
 
				-node scripts/test/test-event-inspector.js
			
 
				-npx tsx scripts/test/verify-timeline.ts
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-### Utility Scripts (2 files)
			
 
				-
			
 
				-General utility scripts for running tests and managing the framework.
			
 
				-
			
 
				-| Script | Purpose | Lines |
			
 
				-|--------|---------|-------|
			
 
				-| `run-tests-batch.sh` | Run tests in batches | ~100 |
			
 
				-| `check-agent.mjs` | Check agent availability | ~30 |
			
 
				-
			
 
				-**Usage**:
			
 
				-```bash
			
 
				-./scripts/utils/run-tests-batch.sh openagent 3 10
			
 
				-node scripts/utils/check-agent.mjs
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Documentation Updates
			
 
				-
			
 
				-### Files Updated
			
 
				-
			
 
				-1. **`evals/README.md`**
			
 
				-   - Updated `run-tests-batch.sh` path references
			
 
				-   - Updated directory structure
			
 
				-
			
 
				-2. **`evals/GETTING_STARTED.md`**
			
 
				-   - Updated batch execution examples
			
 
				-   - Updated script paths
			
 
				-
			
 
				-3. **`evals/agents/openagent/README.md`**
			
 
				-   - Updated batch execution examples
			
 
				-   - Updated script paths
			
 
				-
			
 
				-4. **`evals/agents/openagent/IMPLEMENTATION_SUMMARY.md`**
			
 
				-   - Updated script references
			
 
				-   - Updated directory structure
			
 
				-
			
 
				-5. **`evals/DOCUMENTATION_CLEANUP.md`**
			
 
				-   - Updated directory structure
			
 
				-
			
 
				-6. **`evals/framework/README.md`**
			
 
				-   - Added scripts section
			
 
				-   - Added quick examples
			
 
				-
			
 
				-### New Documentation
			
 
				-
			
 
				-1. **`evals/framework/scripts/README.md`** (NEW - 200 lines)
			
 
				-   - Comprehensive script documentation
			
 
				-   - Usage examples for all scripts
			
 
				-   - Development workflow guide
			
 
				-   - Script templates
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Path Changes
			
 
				-
			
 
				-### Old Paths → New Paths
			
 
				-
			
 
				-| Old Path | New Path |
			
 
				-|----------|----------|
			
 
				-| `run-tests-batch.sh` | `scripts/utils/run-tests-batch.sh` |
			
 
				-| `check-agent.mjs` | `scripts/utils/check-agent.mjs` |
			
 
				-| `debug-session.mjs` | `scripts/debug/debug-session.mjs` |
			
 
				-| `debug-session.ts` | `scripts/debug/debug-session.ts` |
			
 
				-| `debug-claude-session.mjs` | `scripts/debug/debug-claude-session.mjs` |
			
 
				-| `inspect-session.mjs` | `scripts/debug/inspect-session.mjs` |
			
 
				-| `test-agent-direct.ts` | `scripts/test/test-agent-direct.ts` |
			
 
				-| `test-event-inspector.js` | `scripts/test/test-event-inspector.js` |
			
 
				-| `test-session-reader.mjs` | `scripts/test/test-session-reader.mjs` |
			
 
				-| `test-simplified-approach.mjs` | `scripts/test/test-simplified-approach.mjs` |
			
 
				-| `test-timeline.ts` | `scripts/test/test-timeline.ts` |
			
 
				-| `verify-timeline.ts` | `scripts/test/verify-timeline.ts` |
			
 
				-
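The path table above amounts to a lookup from script name to category. As a minimal TypeScript sketch of that mapping, assuming a hypothetical helper named `migrateScriptPath` (it is not part of the framework), old references could be rewritten like this:

```typescript
// Category of each relocated script, copied from the path table above.
type ScriptCategory = "debug" | "test" | "utils";

const SCRIPT_CATEGORIES = new Map<string, ScriptCategory>([
  ["run-tests-batch.sh", "utils"],
  ["check-agent.mjs", "utils"],
  ["debug-session.mjs", "debug"],
  ["debug-session.ts", "debug"],
  ["debug-claude-session.mjs", "debug"],
  ["inspect-session.mjs", "debug"],
  ["test-agent-direct.ts", "test"],
  ["test-event-inspector.js", "test"],
  ["test-session-reader.mjs", "test"],
  ["test-simplified-approach.mjs", "test"],
  ["test-timeline.ts", "test"],
  ["verify-timeline.ts", "test"],
]);

// Hypothetical helper (not a framework API): maps an old framework-root path
// to its new location and returns the input unchanged for unknown files.
function migrateScriptPath(oldPath: string): string {
  const hasDotSlash = oldPath.startsWith("./");
  const name = hasDotSlash ? oldPath.slice(2) : oldPath;
  const category = SCRIPT_CATEGORIES.get(name);
  if (category === undefined) return oldPath;
  return `${hasDotSlash ? "./" : ""}scripts/${category}/${name}`;
}
```

Paths are relative to `evals/framework/`, as in the table. Files that were not moved pass through untouched, so the helper can be run over any list of references.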
			
 
				----
			
 
				-
			
 
				-## Migration Guide
			
 
				-
			
 
				-### For Users
			
 
				-
			
 
				-If you have scripts or documentation referencing the old paths:
			
 
				-
			
 
				-```bash
			
 
				-# Old
			
 
				-./run-tests-batch.sh openagent 3 10
			
 
				-
			
 
				-# New
			
 
				-./scripts/utils/run-tests-batch.sh openagent 3 10
			
 
				-```
			
 
				-
			
 
				-### For Developers
			
 
				-
			
 
				-If you have custom scripts importing from these files:
			
 
				-
			
 
				-```javascript
			
 
				-// Old
			
 
				-import { SessionReader } from './dist/collector/session-reader.js';
			
 
				-
			
 
				-// New (from a script in scripts/<category>/, two levels below the framework root)
			
 
				-import { SessionReader } from '../../dist/collector/session-reader.js';
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Benefits
			
 
				-
			
 
				-### Organization
			
 
				-
			
 
				-- ✅ **Clear structure** - Scripts grouped by purpose
			
 
				-- ✅ **Easy navigation** - Know where to find scripts
			
 
				-- ✅ **Clean root** - Framework root no longer cluttered
			
 
				-- ✅ **Scalable** - Easy to add new scripts
			
 
				-
			
 
				-### Documentation
			
 
				-
			
 
				-- ✅ **Comprehensive README** - All scripts documented
			
 
				-- ✅ **Usage examples** - Clear examples for each script
			
 
				-- ✅ **Development workflow** - Guide for using scripts
			
 
				-- ✅ **Templates** - Easy to create new scripts
			
 
				-
			
 
				-### Maintainability
			
 
				-
			
 
				-- ✅ **Easier to maintain** - Clear organization
			
 
				-- ✅ **Easier to find** - Logical grouping
			
 
				-- ✅ **Easier to update** - Centralized documentation
			
 
				-- ✅ **Easier to extend** - Clear patterns
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Statistics
			
 
				-
			
 
				-### Before
			
 
				-
			
 
				-- **Total scripts**: 12
			
 
				-- **In framework root**: 12
			
 
				-- **Organized**: 0
			
 
				-- **Documented**: Minimal
			
 
				-
			
 
				-### After
			
 
				-
			
 
				-- **Total scripts**: 12 (same)
			
 
				-- **In framework root**: 0
			
 
				-- **Organized**: 12 (100%)
			
 
				-- **Documented**: Comprehensive (200+ lines)
			
 
				-
			
 
				-### File Count
			
 
				-
			
 
				-- **Debug scripts**: 4
			
 
				-- **Test scripts**: 6
			
 
				-- **Utility scripts**: 2
			
 
				-- **Documentation**: 1 (README.md)
			
 
				-- **Total**: 13 files (12 scripts + 1 doc)
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Maintenance Guidelines
			
 
				-
			
 
				-### Adding New Scripts
			
 
				-
			
 
				-1. **Determine category**:
			
 
				-   - Debug? → `scripts/debug/`
			
 
				-   - Test? → `scripts/test/`
			
 
				-   - Utility? → `scripts/utils/`
			
 
				-
			
 
				-2. **Create script** in appropriate directory
			
 
				-
			
 
				-3. **Update `scripts/README.md`**:
			
 
				-   - Add to table
			
 
				-   - Add usage example
			
 
				-
			
 
				-4. **Test the script**:
			
 
				-   ```bash
			
 
				-   npm run build
			
 
				-   node scripts/debug/my-script.mjs
			
 
				-   ```
			
 
				-
			
 
				-### Removing Obsolete Scripts
			
 
				-
			
 
				-1. **Delete the script file**
			
 
				-
			
 
				-2. **Update `scripts/README.md`**:
			
 
				-   - Remove from table
			
 
				-   - Remove usage example
			
 
				-
			
 
				-3. **Check for references**:
			
 
				-   ```bash
			
 
				-   rg "my-script" --type md
			
 
				-   ```
			
 
				-
			
 
				-### Updating Scripts
			
 
				-
			
 
				-1. **Make changes to script**
			
 
				-
			
 
				-2. **Test changes**:
			
 
				-   ```bash
			
 
				-   npm run build
			
 
				-   node scripts/debug/my-script.mjs
			
 
				-   ```
			
 
				-
			
 
				-3. **Update documentation** if usage changed
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Next Steps
			
 
				-
			
 
				-### Immediate
			
 
				-
			
 
				-- ✅ Scripts organized
			
 
				-- ✅ Documentation updated
			
 
				-- ✅ References updated
			
 
				-- ✅ README created
			
 
				-
			
 
				-### Future Enhancements
			
 
				-
			
 
				-1. **Add more debug scripts**
			
 
				-   - Session comparison tool
			
 
				-   - Event diff tool
			
 
				-   - Performance profiler
			
 
				-
			
 
				-2. **Add more test scripts**
			
 
				-   - Integration test runner
			
 
				-   - Performance benchmarks
			
 
				-   - Stress tests
			
 
				-
			
 
				-3. **Add more utilities**
			
 
				-   - Test result analyzer
			
 
				-   - Coverage reporter
			
 
				-   - Cleanup utilities
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Summary
			
 
				-
			
 
				-✅ **12 scripts organized** into 3 categories  
			
 
				-✅ **Framework root cleaned** (0 scripts remaining)  
			
 
				-✅ **Comprehensive documentation** (200+ lines)  
			
 
				-✅ **All references updated** (6 files)  
			
 
				-✅ **Clear structure** for future additions
			
 
				-
			
 
				-**Organization is now**:
			
 
				-- Clean and organized
			
 
				-- Well-documented
			
 
				-- Easy to navigate
			
 
				-- Easy to maintain
			
 
				-- Easy to extend
			
 
				-
			
 
				----
			
 
				-
			
 
				-**Organization Completed**: 2025-11-26  
			
 
				-**Scripts Organized**: 12  
			
 
				-**Documentation Created**: 1 README (200+ lines)  
			
 
				-**Files Updated**: 6
			
--- a/evals/agents/AGENT_TESTING_GUIDE.md
+++ b/evals/agents/AGENT_TESTING_GUIDE.md
@@ -1,417 +0,0 @@
 
				-# Agent Testing Guide - Agent-Agnostic Architecture
			
 
				-
			
 
				-## Overview
			
 
				-
			
 
				-Our evaluation framework is designed to be **agent-agnostic**, making it easy to test multiple agents with the same infrastructure.
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Architecture Layers
			
 
				-
			
 
				-### **Layer 1: Framework (Agent-Agnostic)**
			
 
				-```
			
 
				-evals/framework/
			
 
				-└── src/
			
 
				-    ├── sdk/              # Test runner (works with any agent)
			
 
				-    ├── evaluators/       # Generic behavior checks
			
 
				-    └── types/            # Shared types
			
 
				-```
			
 
				-
			
 
				-**Purpose:** Shared infrastructure that works with **any agent**
			
 
				-
			
 
				-**Key Components:**
			
 
				-- `TestRunner` - Executes tests for any agent
			
 
				-- `Evaluators` - Check generic behaviors (approval, context, tools)
			
 
				-- `EventStreamHandler` - Captures events from any agent
			
 
				-- `TestCaseSchema` - Universal test format
			
 
				-
			
 
				----
			
 
				-
			
 
				-### **Layer 2: Agent-Specific Tests**
			
 
				-```
			
 
				-evals/agents/
			
 
				-├── openagent/           # OpenAgent-specific tests
			
 
				-│   ├── tests/
			
 
				-│   └── docs/
			
 
				-├── opencoder/           # OpenCoder-specific tests (future)
			
 
				-│   ├── tests/
			
 
				-│   └── docs/
			
 
				-└── shared/              # Tests for ANY agent
			
 
				-    └── tests/
			
 
				-```
			
 
				-
			
 
				-**Purpose:** Organize tests by agent for easy management
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Directory Structure
			
 
				-
			
 
				-```
			
 
				-evals/
			
 
				-├── framework/                          # SHARED FRAMEWORK
			
 
				-│   ├── src/
			
 
				-│   │   ├── sdk/
			
 
				-│   │   │   ├── test-runner.ts         # Reads 'agent' field from YAML
			
 
				-│   │   │   ├── client-manager.ts      # Routes to correct agent
			
 
				-│   │   │   └── test-case-schema.ts    # Universal schema
			
 
				-│   │   └── evaluators/
			
 
				-│   │       ├── approval-gate-evaluator.ts    # Works for any agent
			
 
				-│   │       ├── context-loading-evaluator.ts  # Works for any agent
			
 
				-│   │       └── tool-usage-evaluator.ts       # Works for any agent
			
 
				-│   └── package.json
			
 
				-│
			
 
				-├── agents/
			
 
				-│   ├── openagent/                      # OPENAGENT TESTS
			
 
				-│   │   ├── tests/
			
 
				-│   │   │   ├── developer/
			
 
				-│   │   │   │   ├── task-simple-001.yaml      # agent: openagent
			
 
				-│   │   │   │   ├── ctx-code-001.yaml         # agent: openagent
			
 
				-│   │   │   │   └── ctx-docs-001.yaml         # agent: openagent
			
 
				-│   │   │   ├── business/
			
 
				-│   │   │   │   └── conv-simple-001.yaml      # agent: openagent
			
 
				-│   │   │   └── edge-case/
			
 
				-│   │   │       └── fail-stop-001.yaml        # agent: openagent
			
 
				-│   │   └── docs/
			
 
				-│   │       └── OPENAGENT_RULES.md            # OpenAgent-specific rules
			
 
				-│   │
			
 
				-│   ├── opencoder/                      # OPENCODER TESTS (future)
			
 
				-│   │   ├── tests/
			
 
				-│   │   │   ├── developer/
			
 
				-│   │   │   │   ├── refactor-code-001.yaml    # agent: opencoder
			
 
				-│   │   │   │   └── optimize-perf-001.yaml    # agent: opencoder
			
 
				-│   │   └── docs/
			
 
				-│   │       └── OPENCODER_RULES.md            # OpenCoder-specific rules
			
 
				-│   │
			
 
				-│   └── shared/                         # SHARED TESTS (any agent)
			
 
				-│       ├── tests/
			
 
				-│       │   └── common/
			
 
				-│       │       ├── approval-gate-basic.yaml  # agent: ${AGENT}
			
 
				-│       │       └── tool-usage-basic.yaml     # agent: ${AGENT}
			
 
				-│       └── README.md
			
 
				-│
			
 
				-└── README.md
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## How Agent Selection Works
			
 
				-
			
 
				-### **1. Test Specifies Agent**
			
 
				-
			
 
				-```yaml
			
 
				-# openagent/tests/developer/task-simple-001.yaml
			
 
				-id: task-simple-001
			
 
				-name: Simple Bash Execution
			
 
				-agent: openagent              # ← Specifies which agent to test
			
 
				-prompt: "Run npm install"
			
 
				-```
			
 
				-
			
 
				-### **2. Test Runner Routes to Agent**
			
 
				-
			
 
				-```typescript
			
 
				-// framework/src/sdk/test-runner.ts
			
 
				-async runTest(testCase: TestCase) {
			
 
				-  // Get agent from test case
			
 
				-  const agent = testCase.agent || 'openagent';
			
 
				-  
			
 
				-  // Route to specified agent
			
 
				-  const result = await this.clientManager.sendPrompt(
			
 
				-    sessionId,
			
 
				-    testCase.prompt,
			
 
				-    { agent }  // ← SDK routes to correct agent
			
 
				-  );
			
 
				-}
			
 
				-```
			
 
				-
			
 
				-### **3. Evaluators Check Generic Behaviors**
			
 
				-
			
 
				-```typescript
			
 
				-// framework/src/evaluators/approval-gate-evaluator.ts
			
 
				-export class ApprovalGateEvaluator extends BaseEvaluator {
			
 
				-  async evaluate(timeline: TimelineEvent[]) {
			
 
				-    // Check if ANY agent asked for approval
			
 
				-    // Works for openagent, opencoder, or any future agent
			
 
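				-    const violations: { type: string; severity: string; message: string }[] = [];
			
 
				-    const approvalRequested = timeline.some(event =>
			
 
				-      event.type === 'approval_request'
			
 
				-    );
			
 
				-    
			
 
				-    if (!approvalRequested) {
			
 
				-      violations.push({
			
 
				-        type: 'approval-gate-missing',
			
 
				-        severity: 'error',
			
 
				-        message: 'Agent executed without requesting approval'
			
 
				-      });
			
 
				-    }
			
 
				-    return violations; // assumed: evaluate() reports the collected violations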
			
 
				-  }
			
 
				-}
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Running Tests Per Agent
			
 
				-
			
 
				-### **Run All Tests for Specific Agent**
			
 
				-
			
 
				-```bash
			
 
				-# Run ALL OpenAgent tests
			
 
				-npm run eval:sdk -- --pattern="openagent/**/*.yaml"
			
 
				-
			
 
				-# Run ALL OpenCoder tests
			
 
				-npm run eval:sdk -- --pattern="opencoder/**/*.yaml"
			
 
				-```
			
 
				-
			
 
				-### **Run Specific Category**
			
 
				-
			
 
				-```bash
			
 
				-# Run OpenAgent developer tests
			
 
				-npm run eval:sdk -- --pattern="openagent/developer/*.yaml"
			
 
				-
			
 
				-# Run OpenCoder developer tests
			
 
				-npm run eval:sdk -- --pattern="opencoder/developer/*.yaml"
			
 
				-```
			
 
				-
			
 
				-### **Run Shared Tests for Different Agents**
			
 
				-
			
 
				-```bash
			
 
				-# Run shared tests for OpenAgent
			
 
				-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=openagent
			
 
				-
			
 
				-# Run shared tests for OpenCoder
			
 
				-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=opencoder
			
 
				-```
			
 
				-
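The `--agent` flag in the commands above implies an order of precedence between the command line and the `agent` field in a test file. The guide does not spell that order out, so the following TypeScript sketch is an assumption: `resolveAgent` is a hypothetical helper, the CLI value is assumed to win over the YAML field, and an unexpanded `${AGENT}` placeholder is assumed to count as unset.

```typescript
// Minimal shape of a test case; only `agent` matters for this sketch.
interface TestCase {
  id: string;
  prompt: string;
  agent?: string;
}

// Same fallback value that the test-runner.ts excerpt uses.
const DEFAULT_AGENT = "openagent";

// Hypothetical helper: assumed precedence is CLI override, then the YAML
// `agent` field, then the default. A literal "${AGENT}" placeholder is
// treated as "not set" (an assumption about how shared tests are written).
function resolveAgent(testCase: TestCase, cliAgent?: string): string {
  const fromYaml = testCase.agent === "${AGENT}" ? undefined : testCase.agent;
  return cliAgent || fromYaml || DEFAULT_AGENT;
}
```

Under these assumptions, a shared test declaring `agent: openagent` would run against OpenCoder when invoked with `--agent=opencoder`, and against OpenAgent otherwise.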
			
 
				-### **Run Single Test**
			
 
				-
			
 
				-```bash
			
 
				-# Run specific test
			
 
				-npx tsx src/sdk/show-test-details.ts openagent/developer/task-simple-001.yaml
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Adding a New Agent
			
 
				-
			
 
				-### **Step 1: Create Agent Directory**
			
 
				-
			
 
				-```bash
			
 
				-mkdir -p evals/agents/my-new-agent/tests/{developer,business,edge-case}
			
 
				-mkdir -p evals/agents/my-new-agent/docs
			
 
				-```
			
 
				-
			
 
				-### **Step 2: Create Agent Rules Document**
			
 
				-
			
 
				-```bash
			
 
				-# Document agent-specific rules
			
 
				-touch evals/agents/my-new-agent/docs/MY_NEW_AGENT_RULES.md
			
 
				-```
			
 
				-
			
 
				-### **Step 3: Copy Shared Tests**
			
 
				-
			
 
				-```bash
			
 
				-# Copy shared tests as starting point
			
 
				-cp evals/agents/shared/tests/common/*.yaml \
			
 
				-   evals/agents/my-new-agent/tests/developer/
			
 
				-
			
 
				-# Update agent field
			
 
				-sed -i 's/agent: openagent/agent: my-new-agent/g' \
			
 
				-  evals/agents/my-new-agent/tests/developer/*.yaml
			
 
				-```
			
 
				-
			
 
				-### **Step 4: Add Agent-Specific Tests**
			
 
				-
			
 
				-```yaml
			
 
				-# my-new-agent/tests/developer/custom-test-001.yaml
			
 
				-id: custom-test-001
			
 
				-name: My New Agent Custom Test
			
 
				-agent: my-new-agent           # ← Your new agent
			
 
				-prompt: "Agent-specific prompt"
			
 
				-
			
 
				-behavior:
			
 
				-  mustUseTools: [bash]
			
 
				-  requiresApproval: true
			
 
				-
			
 
				-expectedViolations:
			
 
				-  - rule: approval-gate
			
 
				-    shouldViolate: false
			
 
				-```
			
 
				-
			
 
				-### **Step 5: Run Tests**
			
 
				-
			
 
				-```bash
			
 
				-npm run eval:sdk -- --pattern="my-new-agent/**/*.yaml"
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Test Organization Best Practices
			
 
				-
			
 
				-### **1. Agent-Specific Tests**
			
 
				-Put in `agents/{agent}/tests/`
			
 
				-
			
 
				-**When to use:**
			
 
				-- Tests specific to agent's unique features
			
 
				-- Tests for agent-specific rules
			
 
				-- Tests that won't work for other agents
			
 
				-
			
 
				-**Example:**
			
 
				-```yaml
			
 
				-# openagent/tests/developer/ctx-code-001.yaml
			
 
				-# OpenAgent-specific: Tests context loading from openagent.md
			
 
				-agent: openagent
			
 
				-behavior:
			
 
				-  requiresContext: true  # OpenAgent-specific rule
			
 
				-```
			
 
				-
			
 
				-### **2. Shared Tests**
			
 
				-Put in `agents/shared/tests/common/`
			
 
				-
			
 
				-**When to use:**
			
 
				-- Tests that work for ANY agent
			
 
				-- Tests for universal rules (approval, tool usage)
			
 
				-- Tests you want to run across multiple agents
			
 
				-
			
 
				-**Example:**
			
 
				-```yaml
			
 
				-# shared/tests/common/approval-gate-basic.yaml
			
 
				-# Works for ANY agent
			
 
				-agent: openagent  # Default, can be overridden
			
 
				-behavior:
			
 
				-  requiresApproval: true  # Universal rule
			
 
				-```
			
 
				-
			
 
				-### **3. Category Organization**
			
 
				-
			
 
				-```
			
 
				-tests/
			
 
				-├── developer/      # Developer workflow tests
			
 
				-├── business/       # Business/analysis tests
			
 
				-├── creative/       # Content creation tests
			
 
				-└── edge-case/      # Edge cases and error handling
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Evaluator Design (Agent-Agnostic)
			
 
				-
			
 
				-### **Good: Generic Behavior Check**
			
 
				-
			
 
				-```typescript
			
 
				-// ✅ Works for any agent
			
 
				-export class ApprovalGateEvaluator extends BaseEvaluator {
			
 
				-  async evaluate(timeline: TimelineEvent[]) {
			
 
				-    // Check generic behavior: did agent ask for approval?
			
 
				-    const hasApproval = timeline.some(e => e.type === 'approval_request');
			
 
				-    
			
 
				-    if (!hasApproval) {
			
 
				-      violations.push({
			
 
				-        type: 'approval-gate-missing',
			
 
				-        message: 'Agent did not request approval'
			
 
				-      });
			
 
				-    }
			
 
				-  }
			
 
				-}
			
 
				-```
			
 
				-
			
 
				-### **Bad: Agent-Specific Logic**
			
 
				-
			
 
				-```typescript
			
 
				-// ❌ Hardcoded to specific agent
			
 
				-export class OpenAgentSpecificEvaluator extends BaseEvaluator {
			
 
				-  async evaluate(timeline: TimelineEvent[]) {
			
 
				-    // Don't do this - ties evaluator to specific agent
			
 
				-    if (sessionInfo.agent === 'openagent') {
			
 
				-      // OpenAgent-specific checks
			
 
				-    }
			
 
				-  }
			
 
				-}
			
 
				-```
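The contrast above can be shown in a self-contained form. This is a minimal sketch, not the framework's code: `TimelineEvent`, `Violation`, and `checkApprovalGate` are stand-in names assumed for the example. Only the `approval_request` event type and the `approval-gate-missing` violation type come from the snippet above.

```typescript
// Stand-in types; the real framework's timeline and violation shapes may differ.
interface TimelineEvent {
  type: string;
}

interface Violation {
  type: string;
  message: string;
}

// Agent-agnostic check: inspects timeline events only, never the agent's name.
function checkApprovalGate(timeline: TimelineEvent[]): Violation[] {
  const violations: Violation[] = [];
  const hasApproval = timeline.some((e) => e.type === "approval_request");
  if (!hasApproval) {
    violations.push({
      type: "approval-gate-missing",
      message: "Agent did not request approval",
    });
  }
  return violations;
}
```

Because the function never branches on an agent identifier, the same check can be reused for any agent that emits the same kind of timeline.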
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Benefits of Agent-Agnostic Design
			
 
				-
			
 
				-### **1. Easy to Add New Agents**
			
 
				-- Copy shared tests
			
 
				-- Update `agent` field
			
 
				-- Add agent-specific tests
			
 
				-- Run tests
			
 
				-
			
 
				-### **2. Consistent Behavior Across Agents**
			
 
				-- Same evaluators check all agents
			
 
				-- Same test format for all agents
			
 
				-- Easy to compare agent behaviors
			
 
				-
			
 
				-### **3. Reduced Duplication**
			
 
				-- Shared tests written once
			
 
				-- Evaluators work for all agents
			
 
				-- Framework code reused
			
 
				-
			
 
				-### **4. Easy Maintenance**
			
 
				-- Update evaluator once, affects all agents
			
 
				-- Update shared test once, affects all agents
			
 
				-- Clear separation of concerns
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Example: Testing Two Agents
			
 
				-
			
 
				-### **OpenAgent Test**
			
 
				-```yaml
			
 
				-# openagent/tests/developer/create-file.yaml
			
 
				-id: openagent-create-file-001
			
 
				-agent: openagent
			
 
				-prompt: "Create hello.ts"
			
 
				-
			
 
				-behavior:
			
 
				-  requiresContext: true  # OpenAgent loads code.md
			
 
				-```
			
 
				-
			
 
				-### **OpenCoder Test**
			
 
				-```yaml
			
 
				-# opencoder/tests/developer/create-file.yaml
			
 
				-id: opencoder-create-file-001
			
 
				-agent: opencoder
			
 
				-prompt: "Create hello.ts"
			
 
				-
			
 
				-behavior:
			
 
				-  requiresContext: false  # OpenCoder might not need context
			
 
				-```
			
 
				-
			
 
				-### **Shared Test (Works for Both)**
			
 
				-```yaml
			
 
				-# shared/tests/common/create-file.yaml
			
 
				-id: shared-create-file-001
			
 
				-agent: openagent  # Default
			
 
				-prompt: "Create hello.ts"
			
 
				-
			
 
				-behavior:
			
 
				-  requiresApproval: true  # Both agents should ask
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Summary
			
 
				-
			
 
				-**Framework Layer:**
			
 
				-- ✅ Agent-agnostic test runner
			
 
				-- ✅ Generic evaluators
			
 
				-- ✅ Universal test schema
			
 
				-
			
 
				-**Agent Layer:**
			
 
				-- ✅ Agent-specific tests in `agents/{agent}/`
			
 
				-- ✅ Shared tests in `agents/shared/`
			
 
				-- ✅ Agent-specific rules in `docs/`
			
 
				-
			
 
				-**Benefits:**
			
 
				-- ✅ Easy to add new agents
			
 
				-- ✅ Consistent behavior validation
			
 
				-- ✅ Reduced duplication
			
 
				-- ✅ Clear organization
			
 
				-
			
 
				-**To test a new agent:**
			
 
				-1. Create directory: `agents/my-agent/`
			
 
				-2. Copy shared tests
			
 
				-3. Update `agent` field
			
 
				-4. Add agent-specific tests
			
 
				-5. Run: `npm run eval:sdk -- --pattern="my-agent/**/*.yaml"`
			
--- a/evals/agents/openagent/CONTEXT_LOADING_COVERAGE.md
+++ b/evals/agents/openagent/CONTEXT_LOADING_COVERAGE.md
@@ -1,298 +0,0 @@
 
				-# Context Loading Test Coverage
			
 
				-
			
 
				-## Overview
			
 
				-
			
 
				-This document describes the context loading tests created to verify OpenAgent correctly loads context files before responding to user queries and executing tasks.
			
 
				-
			
 
				-**Test Location**: `evals/agents/openagent/tests/context-loading/`
			
 
				-
			
 
				-**Total Tests**: 5 (3 simple, 2 complex multi-turn)
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Test Results Summary
			
 
				-
			
 
				-**Run Date**: 2025-11-26  
			
 
				-**Pass Rate**: 3/5 (60%)  
			
 
				-**Total Duration**: 430 seconds (~7 minutes)
			
 
				-
			
 
				-| Test ID | Type | Status | Duration | Notes |
			
 
				-|---------|------|--------|----------|-------|
			
 
				-| ctx-simple-testing-approach | Simple | ✅ PASS | 35s | Loaded testing docs correctly |
			
 
				-| ctx-simple-documentation-format | Simple | ✅ PASS | 19s | Loaded docs.md correctly |
			
 
				-| ctx-simple-coding-standards | Simple | ✅ PASS | 20s | Loaded code.md correctly |
			
 
				-| ctx-multi-standards-to-docs | Complex | ❌ FAIL | 109s | No context loaded before execution |
			
 
				-| ctx-multi-error-handling-to-tests | Complex | ❌ FAIL | 246s | Timeout on prompt 4 |
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Test Descriptions
			
 
				-
			
 
				-### Simple Tests (Read-Only)
			
 
				-
			
 
				-#### 1. `ctx-simple-coding-standards.yaml`
			
 
				-**Prompt**: "What are our coding standards for this project?"
			
 
				-
			
 
				-**Expected Behavior**:
			
 
				-- Load `code.md` or `standards.md` before responding
			
 
				-- Reference project-specific standards
			
 
				-
			
 
				-**Result**: ✅ **PASSED**
			
 
				-- Agent loaded `.opencode/context/core/standards/code.md`
			
 
				-- 1 read operation performed
			
 
				-- No violations detected
			
 
				-
			
 
				----
			
 
				-
			
 
				-#### 2. `ctx-simple-documentation-format.yaml`
			
 
				-**Prompt**: "What format should I use for documentation in this project?"
			
 
				-
			
 
				-**Expected Behavior**:
			
 
				-- Load `docs.md` or `documentation.md` before responding
			
 
				-- Reference project-specific documentation standards
			
 
				-
			
 
				-**Result**: ✅ **PASSED**
			
 
				-- Agent loaded `.opencode/context/core/standards/docs.md`
			
 
				-- 1 read operation performed
			
 
				-- No violations detected
			
 
				-
			
 
				----
			
 
				-
			
 
				-#### 3. `ctx-simple-testing-approach.yaml`
			
 
				-**Prompt**: "What's our testing strategy for this project?"
			
 
				-
			
 
				-**Expected Behavior**:
			
 
				-- Load `tests.md` or `testing.md` before responding
			
 
				-- Reference project-specific testing standards
			
 
				-
			
 
				-**Result**: ✅ **PASSED**
			
 
				-- Agent loaded multiple testing-related files:
			
 
				-  - `evals/HOW_TESTS_WORK.md`
			
 
				-  - `evals/README.md`
			
 
				-  - `evals/TESTING_CONFIDENCE.md`
			
 
				-  - `evals/agents/AGENT_TESTING_GUIDE.md`
			
 
				-- 4 read operations performed
			
 
				-- No violations detected
			
 
				-
			
 
				----
			
 
				-
			
 
				-### Complex Tests (Multi-Turn with File Creation)
			
 
				-
			
 
				-#### 4. `ctx-multi-standards-to-docs.yaml`
			
 
				-**Scenario**: Standards question → Documentation request → Format question
			
 
				-
			
 
				-**Turn 1**: "What are our coding standards?"
			
 
				-- Expected: Load `standards.md` or `code.md`
			
 
				-
			
 
				-**Turn 2**: "Can you create documentation about these standards in evals/test_tmp/coding-standards-doc.md?"
			
 
				-- Expected: Load `docs.md` (documentation format)
			
 
				-- Expected: Write file to `evals/test_tmp/`
			
 
				-
			
 
				-**Turn 3**: "What will the documentation structure look like?"
			
 
				-- Expected: Reference both standards and docs context
			
 
				-
			
 
				-**Result**: ❌ **FAILED**
			
 
				-- Agent loaded context files correctly:
			
 
				-  - `.opencode/context/core/standards/code.md` (2x)
			
 
				-  - `.opencode/context/core/standards/docs.md` (1x)
			
 
				-- Agent wrote file successfully
			
 
				-- **Violation**: "No context loaded before execution" (warning)
			
 
				-- **Issue**: The context loading evaluator flagged a timing issue even though the context files had been loaded
			
 
				-
			
 
				-**Files Created**: `evals/test_tmp/coding-standards-doc.md` (cleaned up after test)
			
 
				-
			
 
				----
			
 
				-
			
 
				-#### 5. `ctx-multi-error-handling-to-tests.yaml`
			
 
				-**Scenario**: Error handling question → Test request → Coverage policy
			
 
				-
			
 
				-**Turn 1**: "How should we handle errors in this project?"
			
 
				-- Expected: Load `standards.md` or `processes.md`
			
 
				-
			
 
				-**Turn 2**: "Can you write tests for error handling in evals/test_tmp/error-handling.test.ts?"
			
 
				-- Expected: Load `tests.md` (testing standards)
			
 
				-- Expected: Write test file to `evals/test_tmp/`
			
 
				-
			
 
				-**Turn 3**: "What's our test coverage policy?"
			
 
				-- Expected: Reference test-related context
			
 
				-
			
 
				-**Result**: ❌ **FAILED**
			
 
				-- **Error**: "Prompt 4 execution timed out"
			
 
				-- Test exceeded 180-second timeout
			
 
				-- Likely due to complex multi-turn conversation with file creation
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Cleanup Verification
			
 
				-
			
 
				-✅ **Cleanup System Working Correctly**
			
 
				-
			
 
				-**Before Tests**:
			
 
				-- Cleaned up 1 file from previous runs
			
 
				-
			
 
				-**After Tests**:
			
 
				-- Cleaned up 2 files created during tests
			
 
				-- `test_tmp/` contains only:
			
 
				-  - `.gitignore`
			
 
				-  - `README.md`
			
 
				-
			
 
				-**Cleanup Logic**: `evals/framework/src/sdk/run-sdk-tests.ts`
			
 
				-- Runs before test execution
			
 
				-- Runs after test execution
			
 
				-- Preserves only `.gitignore` and `README.md`
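The cleanup behaviour listed above could look roughly like the following. This is a sketch under assumptions, not the contents of `run-sdk-tests.ts`: the names `entriesToRemove` and `cleanTestTmp` are hypothetical, and only the two preserved file names come from the text.

```typescript
import { readdirSync, rmSync } from "node:fs";
import { join } from "node:path";

// Files the text says are kept in test_tmp/.
const PRESERVED = new Set([".gitignore", "README.md"]);

// Pure helper: which directory entries should be removed?
function entriesToRemove(entries: string[], preserved: Set<string> = PRESERVED): string[] {
  return entries.filter((name) => !preserved.has(name));
}

// Hypothetical wrapper, intended to be called once before and once after a test run.
function cleanTestTmp(dir: string): string[] {
  const removed = entriesToRemove(readdirSync(dir));
  for (const name of removed) {
    rmSync(join(dir, name), { recursive: true, force: true });
  }
  return removed;
}
```

Keeping the filtering step separate from the filesystem calls makes the preserve list easy to check without touching the disk.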
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Key Findings
			
 
				-
			
 
				-### ✅ Positive Results
			
 
				-
			
 
				-1. **Simple Context Loading Works**: All 3 simple tests passed
			
 
				-   - Agent correctly identifies and loads relevant context files
			
 
				-   - Agent reads context BEFORE responding
			
 
				-   - No violations in simple scenarios
			
 
				-
			
 
				-2. **Cleanup System Reliable**: 
			
 
				-   - Files created during tests are properly cleaned up
			
 
				-   - No test artifacts left in project root
			
 
				-   - `test_tmp/` directory isolation working
			
 
				-
			
 
				-3. **Context File Discovery**:
			
 
				-   - Agent successfully finds context files in `.opencode/context/core/standards/`
			
 
				-   - Agent loads multiple relevant files when appropriate
			
 
				-
			
 
				-### ⚠️ Issues Identified
			
 
				-
			
 
				-1. **Multi-Turn Context Loading**: 
			
 
				-   - Complex multi-turn tests show timing issues
			
 
				-   - Context loading evaluator flagging warnings even when files are loaded
			
 
				-   - May need to adjust evaluator logic for multi-turn scenarios
			
 
				-
			
 
				-2. **Timeout on Complex Tests**:
			
 
				-   - 180-second timeout insufficient for some multi-turn tests
			
 
				-   - Test 5 timed out on prompt 4
			
 
				-   - May need to increase timeout or simplify test scenarios
			
 
				-
			
 
				-3. **False Positive Warning**:
			
 
				-   - Test 4 loaded context correctly but still got "no-context-loaded" warning
			
 
				-   - Evaluator may not be detecting context loads in multi-turn conversations
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Recommendations
			
 
				-
			
 
				-### Immediate Actions
			
 
				-
			
 
				-1. **Increase Timeout for Complex Tests**
			
 
				-   - Change from 180s to 300s (5 minutes)
			
 
				-   - Add timeout configuration per test
			
 
				-
			
 
				-2. **Fix Context Loading Evaluator**
			
 
				-   - Review timing detection logic for multi-turn tests
			
 
				-   - Ensure evaluator tracks context loads across all prompts
			
 
				-
			
 
				-3. **Simplify Complex Tests**
			
 
				-   - Reduce number of turns in multi-turn tests
			
 
				-   - Focus on specific context loading scenarios
			
 
				-
			
 
				-### Future Enhancements
			
 
				-
			
 
				-1. **Add More Edge Cases**
			
 
				-   - Test context loading with missing files
			
 
				-   - Test context loading with multiple context directories
			
 
				-   - Test context loading with file attachments
			
 
				-
			
 
				-2. **Add Performance Metrics**
			
 
				-   - Track time between context load and execution
			
 
				-   - Measure context file read performance
			
 
				-   - Monitor API rate limits
			
 
				-
			
 
				-3. **Batch Test Execution**
			
 
				-   - Run tests in smaller batches to avoid API timeouts
			
 
				-   - Add retry logic for transient failures
			
 
				-   - Implement test result caching
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Running These Tests
			
 
				-
			
 
				-### Run All Context Loading Tests
			
 
				-```bash
			
 
				-cd evals/framework
			
 
				-npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml"
			
 
				-```
			
 
				-
			
 
				-### Run Individual Test
			
 
				-```bash
			
 
				-npm run eval:sdk -- --agent=openagent --pattern="context-loading/ctx-simple-coding-standards.yaml"
			
 
				-```
			
 
				-
			
 
				-### Run with Debug Output
			
 
				-```bash
			
 
				-npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml" --debug
			
 
				-```
			
 
				-
			
 
				-### View Results Dashboard
			
 
				-```bash
			
 
				-cd ../results
			
 
				-./serve.sh
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Test File Structure
			
 
				-
			
 
				-Each test follows this structure:
			
 
				-
			
 
				-```yaml
			
 
				-id: test-id
			
 
				-name: "Test Name"
			
 
				-description: |
			
 
				-  Detailed description of what the test validates
			
 
				-  
			
 
				-category: developer
			
 
				-agent: openagent
			
 
				-model: anthropic/claude-sonnet-4-5
			
 
				-
			
 
				-# Single prompt OR multi-turn prompts
			
 
				-prompt: "Single prompt text"
			
 
				-# OR
			
 
				-prompts:
			
 
				-  - text: "First prompt"
			
 
				-    expectContext: true
			
 
				-    contextFile: "standards.md"
			
 
				-  - text: "approve"
			
 
				-    delayMs: 2000
			
 
				-
			
 
				-# Expected behavior
			
 
				-behavior:
			
 
				-  mustUseTools: [read, write]
			
 
				-  requiresContext: true
			
 
				-  minToolCalls: 1
			
 
				-
			
 
				-# Expected violations
			
 
				-expectedViolations:
			
 
				-  - rule: context-loading
			
 
				-    shouldViolate: false
			
 
				-    severity: error
			
 
				-
			
 
				-# Approval strategy
			
 
				-approvalStrategy:
			
 
				-  type: auto-approve
			
 
				-
			
 
				-timeout: 60000
			
 
				-
			
 
				-tags:
			
 
				-  - context-loading
			
 
				-  - simple-test
			
 
				-```
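The structure above can also be written as a type, which makes the "single prompt OR multi-turn prompts" rule checkable. This is an illustrative sketch: the field names come from the YAML above, while the optionality of each field and the `validatePrompts` helper are assumptions, not the framework's actual schema.

```typescript
// Field names are taken from the YAML above; which fields are optional is an assumption.
interface PromptTurn {
  text: string;
  expectContext?: boolean;
  contextFile?: string;
  delayMs?: number;
}

interface TestCase {
  id: string;
  name: string;
  description?: string;
  category: string;
  agent: string;
  model?: string;
  prompt?: string;
  prompts?: PromptTurn[];
  behavior?: { mustUseTools?: string[]; requiresContext?: boolean; minToolCalls?: number };
  expectedViolations?: { rule: string; shouldViolate: boolean; severity?: string }[];
  approvalStrategy?: { type: string };
  timeout?: number; // presumably milliseconds (60000 in the example)
  tags?: string[];
}

// Returns a list of problems; an empty list means the prompt fields are consistent.
function validatePrompts(test: TestCase): string[] {
  const hasSingle = typeof test.prompt === "string" && test.prompt.length > 0;
  const hasMulti = Array.isArray(test.prompts) && test.prompts.length > 0;
  if (hasSingle && hasMulti) return ["use either `prompt` or `prompts`, not both"];
  if (!hasSingle && !hasMulti) return ["one of `prompt` or `prompts` is required"];
  return [];
}
```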
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Maintenance
			
 
				-
			
 
				-**Last Updated**: 2025-11-26  
			
 
				-**Test Framework Version**: 0.1.0  
			
 
				-**OpenAgent Version**: Latest  
			
 
				-
			
 
				-**Next Review**: After fixing context loading evaluator timing logic
			
--- a/evals/agents/openagent/IMPLEMENTATION_SUMMARY.md
+++ b/evals/agents/openagent/IMPLEMENTATION_SUMMARY.md
@@ -1,256 +0,0 @@
 
				-# Context Loading Tests - Implementation Summary
			
 
				-
			
 
				-**Date**: 2025-11-26  
			
 
				-**Status**: ✅ **COMPLETE - ALL TESTS PASSING (5/5)**
			
 
				-
			
 
				----
			
 
				-
			
 
				-## What We Built
			
 
				-
			
 
				-### 1. **5 Context Loading Tests** ✅
			
 
				-Created comprehensive test suite to verify OpenAgent loads context files correctly:
			
 
				-
			
 
				-**Simple Tests (3)** - Single prompt, read-only
			
 
				-- `ctx-simple-coding-standards.yaml` - Coding standards query
			
 
				-- `ctx-simple-documentation-format.yaml` - Documentation format query  
			
 
				-- `ctx-simple-testing-approach.yaml` - Testing strategy query
			
 
				-
			
 
				-**Complex Tests (2)** - Multi-turn with file creation
			
 
				-- `ctx-multi-standards-to-docs.yaml` - Standards → Documentation creation
			
 
				-- `ctx-multi-error-handling-to-tests.yaml` - Error handling → Test creation
			
 
				-
			
 
				-### 2. **Smart Timeout System** ✅
			
 
				-Implemented intelligent timeout handling for multi-turn tests:
			
 
				-- **Activity monitoring**: Checks if events are still streaming
			
 
				-- **Base timeout**: 300s (5 minutes) of inactivity triggers timeout
			
 
				-- **Absolute max**: 600s (10 minutes) hard limit
			
 
				-- **Prevents false timeouts**: Extends timeout while agent is active
			
 
				-
			
 
				-**Code**: `evals/framework/src/sdk/test-runner.ts` - `withSmartTimeout()` method
			
 
				-
			
 
				-### 3. **Fixed Context Loading Evaluator** ✅
			
 
				-Corrected evaluator to properly detect context files in multi-turn sessions:
			
 
				-
			
 
				-**Issues Fixed**:
			
 
				-- ❌ **Before**: File paths extracted from wrong location (`tool.data.input.filePath`)
			
 
				-- ✅ **After**: Correctly extracts from `tool.data.state.input.filePath`
			
 
				-- ❌ **Before**: Only checked context before FIRST execution
			
 
				-- ✅ **After**: Checks context for ALL executions requiring it
			
 
				-- ❌ **Before**: False positives on multi-turn tests
			
 
				-- ✅ **After**: Properly tracks context across multiple prompts
			
 
				-
			
 
				-**Code**: `evals/framework/src/evaluators/context-loading-evaluator.ts`
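To illustrate what "checks context for ALL executions" might mean in practice, here is a simplified sketch. It is not the evaluator's code: the `ToolEvent` shape, the set of execution tools, and the rule that a context file is anything read from `.opencode/context/` are all assumptions made for the example.

```typescript
// Hypothetical, simplified tool event; the real timeline shape may differ.
interface ToolEvent {
  tool: string; // e.g. "read", "write", "bash"
  timestamp: number; // milliseconds
  filePath?: string;
}

interface ContextCheck {
  execution: ToolEvent;
  contextFile?: string; // most recent context file read before the execution
  gapMs?: number; // time between that read and the execution
}

// Assumption: these tools count as "execution".
const EXECUTION_TOOLS = new Set(["write", "edit", "bash"]);

// Assumption: a context file is any file read from .opencode/context/.
function isContextRead(e: ToolEvent): boolean {
  return e.tool === "read" && (e.filePath ?? "").includes(".opencode/context/");
}

// Produce one result per execution, not just for the first one.
function checkContextBeforeExecutions(events: ToolEvent[]): ContextCheck[] {
  const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
  const results: ContextCheck[] = [];
  let lastContext: ToolEvent | undefined;
  for (const e of sorted) {
    if (isContextRead(e)) {
      lastContext = e;
    } else if (EXECUTION_TOOLS.has(e.tool)) {
      if (lastContext) {
        results.push({
          execution: e,
          contextFile: lastContext.filePath,
          gapMs: e.timestamp - lastContext.timestamp,
        });
      } else {
        results.push({ execution: e });
      }
    }
  }
  return results;
}
```

In this sketch a result without `contextFile` would count as a violation, and `gapMs` corresponds to the kind of "seconds before execution" figure a results table could report.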
			
 
				-
			
 
				-### 4. **Batch Test Runner** ✅
			
 
				-Created helper script for running tests in controlled batches:
			
 
				-- Configurable batch size (default: 3 tests)
			
 
				-- Configurable delay between batches (default: 10s)
			
 
				-- Prevents API rate limits
			
 
				-- Better resource management
			
 
				-
			
 
				-**Script**: `evals/framework/scripts/utils/run-tests-batch.sh`
			
 
				-
			
 
				-**Usage**:
			
 
				-```bash
			
 
				-cd evals/framework
			
 
				-./scripts/utils/run-tests-batch.sh openagent 3 10
			
 
				-```
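The batching idea (a fixed batch size with a pause between batches) can be sketched in TypeScript as follows. The actual helper is a shell script whose contents are not shown here, so `toBatches`, `runInBatches`, and the `runBatch` callback are hypothetical names used only to illustrate the logic.

```typescript
// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical driver: `runBatch` stands in for whatever launches the tests.
async function runInBatches<T>(
  items: T[],
  size: number,
  delaySeconds: number,
  runBatch: (batch: T[]) => Promise<void>,
): Promise<void> {
  const batches = toBatches(items, size);
  for (let i = 0; i < batches.length; i++) {
    await runBatch(batches[i]);
    // Pause between batches, but not after the last one.
    if (i < batches.length - 1) await sleep(delaySeconds * 1000);
  }
}
```

With the defaults mentioned above (batch size 3, delay 10s), five tests would run as one batch of three and one batch of two, with a single pause in between.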
			
 
				-
			
 
				-### 5. **Cleanup System Verified** ✅
			
 
				-Confirmed automatic cleanup working correctly:
			
 
				-- Cleans `test_tmp/` before tests
			
 
				-- Cleans `test_tmp/` after tests
			
 
				-- Preserves only `.gitignore` and `README.md`
			
 
				-- No test artifacts left behind
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Test Results
			
 
				-
			
 
				-### Final Run: 100% Pass Rate 🎉
			
 
				-
			
 
				-| Test | Type | Duration | Status | Context Files Loaded |
			
 
				-|------|------|----------|--------|---------------------|
			
 
				-| ctx-simple-testing-approach | Simple | 38s | ✅ PASS | 4 files (README, HOW_TESTS_WORK, etc.) |
			
 
				-| ctx-simple-documentation-format | Simple | 26s | ✅ PASS | docs.md |
			
 
				-| ctx-simple-coding-standards | Simple | 21s | ✅ PASS | code.md |
			
 
				-| ctx-multi-standards-to-docs | Complex | 116s | ✅ PASS | code.md, docs.md (44s before execution) |
			
 
				-| ctx-multi-error-handling-to-tests | Complex | 148s | ✅ PASS | code.md, tests.md (58s before execution) |
			
 
				-
			
 
				-**Total Duration**: 349 seconds (~6 minutes)  
			
 
				-**Pass Rate**: 5/5 (100%)  
			
 
				-**Violations**: 0
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Key Findings
			
 
				-
			
 
				-### ✅ **OpenAgent Context Loading Works Correctly**
			
 
				-
			
 
				-1. **Simple queries**: Agent loads appropriate context files before responding
			
 
				-2. **Multi-turn conversations**: Agent loads context for each execution phase
			
 
				-3. **File creation**: Agent loads both standards AND format context before writing
			
 
				-4. **Timing**: Context loaded 44-58 seconds before execution (plenty of time)
			
 
				-
			
 
				-### ✅ **Test Infrastructure is Solid**
			
 
				-
			
 
				-1. **Same session tracking**: Multi-turn tests use single session (verified)
			
 
				-2. **Smart timeout**: Prevents false timeouts while catching real hangs
			
 
				-3. **Cleanup**: No test artifacts left behind
			
 
				-4. **Evaluators**: Accurately detect context loading behavior
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Technical Details
			
 
				-
			
 
				-### Session Tracking (Multi-Turn)
			
 
				-```typescript
			
 
				-// Single session created once
			
 
				-const session = await this.client.createSession({ title: testCase.name });
			
 
				-sessionId = session.id;
			
 
				-
			
 
				-// All prompts use SAME session
			
 
				-for (let i = 0; i < testCase.prompts.length; i++) {
			
 
				-  await this.client.sendPrompt(sessionId, { text: msg.text, ... });
			
 
				-}
			
 
				-```
			
 
				-
			
 
				-### Smart Timeout Logic
			
 
				-```typescript
			
 
				-// Base timeout: 300s of inactivity
			
 
				-// Max timeout: 600s absolute
			
 
				-await this.withSmartTimeout(
			
 
				-  promptPromise,
			
 
				-  300000,  // 5 min activity timeout
			
 
				-  600000,  // 10 min absolute max
			
 
				-  `Prompt ${i + 1} execution timed out`
			
 
				-);
			
 
				-```
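A minimal sketch of how an activity-aware timeout like this could be built is given below. The internals of the real `withSmartTimeout()` are not shown in this document, so this version is an assumption: it takes an extra, hypothetical `getLastActivity` callback that reports when the last event was streamed, and `decideTimeout` and `pollMs` are illustrative names.

```typescript
type TimeoutDecision = "none" | "inactivity" | "absolute";

// Pure decision: has either limit been reached at time `now`? All values are in ms.
function decideTimeout(
  now: number,
  startedAt: number,
  lastActivityAt: number,
  inactivityMs: number,
  absoluteMs: number,
): TimeoutDecision {
  if (now - startedAt >= absoluteMs) return "absolute";
  if (now - lastActivityAt >= inactivityMs) return "inactivity";
  return "none";
}

// Hypothetical wrapper: polls the decision while the work is still pending.
function withSmartTimeout<T>(
  work: Promise<T>,
  inactivityMs: number,
  absoluteMs: number,
  message: string,
  getLastActivity: () => number, // assumed: timestamp of the most recent streamed event
  pollMs = 1000,
): Promise<T> {
  const startedAt = Date.now();
  return new Promise<T>((resolve, reject) => {
    const timer = setInterval(() => {
      const decision = decideTimeout(Date.now(), startedAt, getLastActivity(), inactivityMs, absoluteMs);
      if (decision !== "none") {
        clearInterval(timer);
        reject(new Error(`${message} (${decision} limit reached)`));
      }
    }, pollMs);
    work.then(
      (value) => { clearInterval(timer); resolve(value); },
      (error) => { clearInterval(timer); reject(error); },
    );
  });
}
```

Separating the decision from the polling keeps the two limits easy to reason about: ongoing activity postpones the inactivity limit, while the absolute limit applies regardless of activity.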
			
 
				-
			
 
				-### Context File Detection
			
 
				-```typescript
			
 
				-// Fixed file path extraction
			
 
				-const filePath = tool.data?.state?.input?.filePath ||  // ✅ NEW
			
 
				-                tool.data?.state?.input?.path ||
			
 
				-                tool.data?.input?.filePath ||          // Old fallback
			
 
				-                tool.data?.input?.path;
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Files Modified
			
 
				-
			
 
				-### New Files Created
			
 
				-```
			
 
				-evals/agents/openagent/tests/context-loading/
			
 
				-├── ctx-simple-coding-standards.yaml
			
 
				-├── ctx-simple-documentation-format.yaml
			
 
				-├── ctx-simple-testing-approach.yaml
			
 
				-├── ctx-multi-standards-to-docs.yaml
			
 
				-└── ctx-multi-error-handling-to-tests.yaml
			
 
				-
			
 
				-evals/agents/openagent/
			
 
				-├── CONTEXT_LOADING_COVERAGE.md
			
 
				-└── IMPLEMENTATION_SUMMARY.md (this file)
			
 
				-
			
 
				-evals/framework/
			
 
				-└── scripts/
			
 
				-```
			
 
				-
			
 
				-### Files Modified
			
 
				-```
			
 
				-evals/framework/src/sdk/test-runner.ts
			
 
				-  - Added withSmartTimeout() method
			
 
				-  - Updated multi-turn test execution to use smart timeout
			
 
				-
			
 
				-evals/framework/src/evaluators/context-loading-evaluator.ts
			
 
				-  - Fixed file path extraction (tool.data.state.input.filePath)
			
 
				-  - Added multi-turn execution checking
			
 
				-  - Improved violation detection
			
 
				-
			
 
				-evals/agents/openagent/tests/context-loading/*.yaml
			
 
				-  - Increased timeout from 180s to 300s for complex tests
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Recommendations Completed
			
 
				-
			
 
				-### ✅ Recommendation 1: Fix Timeout Issue
			
 
				-- **Status**: COMPLETE
			
 
				-- **Solution**: Implemented smart timeout with activity monitoring
			
 
				-- **Result**: No more false timeouts, complex tests complete successfully
			
 
				-
			
 
				-### ✅ Recommendation 2: Fix Context Loading Evaluator  
			
 
				-- **Status**: COMPLETE
			
 
				-- **Solution**: Fixed file path extraction and multi-turn tracking
			
 
				-- **Result**: Evaluator correctly detects context loading in all scenarios
			
 
				-
			
 
				-### ✅ Recommendation 3: Batch Test Execution
			
 
				-- **Status**: COMPLETE
			
 
				-- **Solution**: Created `run-tests-batch.sh` script
			
 
				-- **Result**: Can run tests in controlled batches with delays
			
 
				-
			
 
				----
			
 
				-
			
 
				-## How to Use
			
 
				-
			
 
				-### Run All Context Loading Tests
			
 
				-```bash
			
 
				-cd evals/framework
			
 
				-npm run eval:sdk -- --agent=openagent --pattern="context-loading/*.yaml"
			
 
				-```
			
 
				-
			
 
				-### Run Single Test
			
 
				-```bash
			
 
				-npm run eval:sdk -- --agent=openagent --pattern="context-loading/ctx-simple-coding-standards.yaml"
			
 
				-```
			
 
				-
			
 
				-### Run in Batches (Avoid API Limits)
			
 
				-```bash
			
 
				-./scripts/utils/run-tests-batch.sh openagent 3 10
			
 
				-# Args: agent, batch_size, delay_seconds
			
 
				-```
			
 
				-
			
 
				-### View Results Dashboard
			
 
				-```bash
			
 
				-cd ../results
			
 
				-./serve.sh
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Next Steps (Optional Enhancements)
			
 
				-
			
 
				-1. **Add More Edge Cases**
			
 
				-   - Test with missing context files
			
 
				-   - Test with multiple context directories
			
 
				-   - Test with file attachments
			
 
				-
			
 
				-2. **Performance Metrics**
			
 
				-   - Track context load time vs execution time
			
 
				-   - Measure API response times
			
 
				-   - Monitor rate limit usage
			
 
				-
			
 
				-3. **Test Coverage Expansion**
			
 
				-   - Add tests for other agent behaviors
			
 
				-   - Test delegation scenarios
			
 
				-   - Test error handling paths
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Conclusion
			
 
				-
			
 
				-✅ **All objectives achieved**  
			
 
				-✅ **100% test pass rate**  
			
 
				-✅ **OpenAgent context loading verified working correctly**  
			
 
				-✅ **Test infrastructure improved and reliable**  
			
 
				-✅ **Documentation complete**
			
 
				-
			
 
				-The context loading test suite is production-ready and provides comprehensive coverage of OpenAgent's context file loading behavior across both simple and complex multi-turn scenarios.
			
 
				-
			
 
				----
			
 
				-
			
 
				-**Maintained by**: OpenCode Agents Team  
			
 
				-**Last Updated**: 2025-11-26  
			
 
				-**Test Framework Version**: 0.1.0
			
--- a/evals/agents/opencoder/README.md
+++ b/evals/agents/opencoder/README.md
@@ -1,41 +0,0 @@
 
				-# Opencoder Agent Tests
			
 
				-
			
 
				-Tests for the `opencoder` agent - a development-focused agent that executes code tasks directly.
			
 
				-
			
 
				-## Agent Characteristics
			
 
				-
			
 
				-- **Mode**: Primary development agent
			
 
				-- **Behavior**: Executes tools directly without text-based approval workflow
			
 
				-- **Best for**: Code implementation, bash commands, file operations
			
 
				-- **Approval**: Uses tool permission system (auto-approve in tests)
			
 
				-
			
 
				-## Test Categories
			
 
				-
			
 
				-### Developer Tests (`tests/developer/`)
			
 
				-- Bash command execution
			
 
				-- File operations
			
 
				-- Code implementation tasks
			
 
				-
			
 
				-### Business Tests (`tests/business/`)
			
 
				-- Data analysis tasks
			
 
				-- Report generation
			
 
				-
			
 
				-### Edge Cases (`tests/edge-case/`)
			
 
				-- Error handling
			
 
				-- Permission boundaries
			
 
				-
			
 
				-## Running Tests
			
 
				-
			
 
				-```bash
			
 
				-cd evals/framework
			
 
				-npx tsx src/sdk/run-sdk-tests.ts --agent opencoder
			
 
				-```
			
 
				-
			
 
				-## Key Differences from OpenAgent
			
 
				-
			
 
				-| Feature | Opencoder | OpenAgent |
			
 
				-|---------|-----------|-----------|
			
 
				-| Approval | Tool permission system | Text-based + tool permission |
			
 
				-| Workflow | Direct execution | Analyze→Approve→Execute→Validate |
			
 
				-| Context Loading | On-demand | Mandatory before execution |
			
 
				-| Best for | Simple tasks | Complex workflows |
			
--- a/evals/agents/shared/README.md
+++ b/evals/agents/shared/README.md
@@ -1,74 +0,0 @@
 
				-# Shared Test Cases
			
 
				-
			
 
				-Tests in this directory are **agent-agnostic** and can be used to test **any agent** that follows the same core rules.
			
 
				-
			
 
				-## Purpose
			
 
				-
			
 
				-Shared tests validate **universal behaviors** that all agents should follow:
			
 
				-- Approval gate enforcement
			
 
				-- Tool usage patterns
			
 
				-- Basic workflow compliance
			
 
				-- Error handling
			
 
				-
			
 
				-## Usage
			
 
				-
			
 
				-### Run Shared Tests for OpenAgent
			
 
				-```bash
			
 
				-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=openagent
			
 
				-```
			
 
				-
			
 
				-### Run Shared Tests for OpenCoder
			
 
				-```bash
			
 
				-npm run eval:sdk -- --pattern="shared/**/*.yaml" --agent=opencoder
			
 
				-```
			
 
				-
			
 
				-### Override Agent in Test File
			
 
				-```yaml
			
 
				-# In the YAML file
			
 
				-agent: openagent  # Change to opencoder, or any other agent
			
 
				-```
			
 
				-
			
 
				-## Test Categories
			
 
				-
			
 
				-### `common/` - Universal Rules
			
 
				-Tests that apply to **all agents**:
			
 
				-- `approval-gate-basic.yaml` - Basic approval enforcement
			
 
				-- `tool-usage-basic.yaml` - Basic tool selection (future)
			
 
				-- `error-handling-basic.yaml` - Basic error handling (future)
			
 
				-
			
 
				-## Adding New Shared Tests
			
 
				-
			
 
				-1. Create test in `shared/tests/common/`
			
 
				-2. Use generic prompts (not agent-specific)
			
 
				-3. Test universal behaviors only
			
 
				-4. Tag with `shared-test` and `agent-agnostic`
			
 
				-5. Document which agents it applies to
			
 
				-
			
 
				-## Example
			
 
				-
			
 
				-```yaml
			
 
				-id: shared-example-001
			
 
				-name: Example Shared Test
			
 
				-category: edge-case
			
 
				-agent: openagent  # Default, can be overridden
			
 
				-
			
 
				-prompt: "Generic prompt that works for any agent"
			
 
				-
			
 
				-behavior:
			
 
				-  requiresApproval: true  # Universal rule
			
 
				-
			
 
				-expectedViolations:
			
 
				-  - rule: approval-gate
			
 
				-    shouldViolate: false
			
 
				-
			
 
				-tags:
			
 
				-  - shared-test
			
 
				-  - agent-agnostic
			
 
				-```
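
The tagging rule from "Adding New Shared Tests" can be checked mechanically. The following is a minimal sketch, not part of the framework: the `SharedTestCase` interface and the `missingSharedTags` helper are hypothetical names, and the interface models only a few of the fields shown in the example above.

```typescript
// Hypothetical shape: only some of the fields from the YAML example.
interface SharedTestCase {
  id: string;
  agent: string;
  tags: string[];
}

// The two tags the README asks every shared test to carry.
const REQUIRED_SHARED_TAGS: string[] = ["shared-test", "agent-agnostic"];

// Returns the required tags that are missing from a test case.
function missingSharedTags(test: SharedTestCase): string[] {
  return REQUIRED_SHARED_TAGS.filter((tag) => test.tags.indexOf(tag) === -1);
}

const exampleTest: SharedTestCase = {
  id: "shared-example-001",
  agent: "openagent",
  tags: ["shared-test", "agent-agnostic"],
};

// Empty for the example above, because both required tags are present.
const missingTags = missingSharedTags(exampleTest);
```

A check like this could run before a test file is accepted into `shared/tests/common/`, assuming the YAML has already been parsed into an object.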
			
 
				-
			
 
				-## Benefits
			
 
				-
			
 
				-1. **Reduce Duplication** - Write once, test multiple agents
			
 
				-2. **Consistency** - Same tests ensure consistent behavior
			
 
				-3. **Easy Comparison** - Compare agent behaviors side-by-side
			
 
				-4. **Faster Onboarding** - New agents inherit core test suite
			
--- a/evals/framework/scripts/README.md
+++ b/evals/framework/scripts/README.md
@@ -1,195 +0,0 @@
 
				-# Framework Scripts
			
 
				-
			
 
				-Utility scripts for debugging, testing, and development.
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Directory Structure
			
 
				-
			
 
				-```
			
 
				-scripts/
			
 
				-├── debug/          # Debugging scripts for sessions and events
			
 
				-├── test/           # Test scripts for framework development
			
 
				-├── utils/          # Utility scripts (batch runner, etc.)
			
 
				-└── README.md       # This file
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Debug Scripts (`debug/`)
			
 
				-
			
 
				-Scripts for debugging sessions, events, and agent behavior.
			
 
				-
			
 
				-| Script | Purpose | Usage |
			
 
				-|--------|---------|-------|
			
 
				-| `debug-session.mjs` | Debug session data and timeline | `node scripts/debug/debug-session.mjs <session-id>` |
			
 
				-| `debug-session.ts` | TypeScript version of session debugger | `npx tsx scripts/debug/debug-session.ts <session-id>` |
			
 
				-| `debug-claude-session.mjs` | Debug Claude-specific sessions | `node scripts/debug/debug-claude-session.mjs <session-id>` |
			
 
				-| `inspect-session.mjs` | Inspect most recent session events | `node scripts/debug/inspect-session.mjs` |
			
 
				-
			
 
				-### Examples
			
 
				-
			
 
				-```bash
			
 
				-# Debug a specific session
			
 
				-node scripts/debug/debug-session.mjs ses_abc123
			
 
				-
			
 
				-# Inspect latest session
			
 
				-node scripts/debug/inspect-session.mjs
			
 
				-
			
 
				-# Debug with TypeScript
			
 
				-npx tsx scripts/debug/debug-session.ts ses_abc123
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Test Scripts (`test/`)
			
 
				-
			
 
				-Scripts for testing framework components during development.
			
 
				-
			
 
				-| Script | Purpose | Usage |
			
 
				-|--------|---------|-------|
			
 
				-| `test-agent-direct.ts` | Direct agent execution test | `npx tsx scripts/test/test-agent-direct.ts` |
			
 
				-| `test-event-inspector.js` | Test event capture system | `node scripts/test/test-event-inspector.js` |
			
 
				-| `test-session-reader.mjs` | Test session reader | `node scripts/test/test-session-reader.mjs` |
			
 
				-| `test-simplified-approach.mjs` | Exercise the simplified test approach | `node scripts/test/test-simplified-approach.mjs` |
			
 
				-| `test-timeline.ts` | Test timeline builder | `npx tsx scripts/test/test-timeline.ts` |
			
 
				-| `verify-timeline.ts` | Verify timeline accuracy | `npx tsx scripts/test/verify-timeline.ts` |
			
 
				-
			
 
				-### Examples
			
 
				-
			
 
				-```bash
			
 
				-# Test agent execution
			
 
				-npx tsx scripts/test/test-agent-direct.ts
			
 
				-
			
 
				-# Test event capture
			
 
				-node scripts/test/test-event-inspector.js
			
 
				-
			
 
				-# Verify timeline
			
 
				-npx tsx scripts/test/verify-timeline.ts
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Utility Scripts (`utils/`)
			
 
				-
			
 
				-General utility scripts for running tests and managing the framework.
			
 
				-
			
 
				-| Script | Purpose | Usage |
			
 
				-|--------|---------|-------|
			
 
				-| `run-tests-batch.sh` | Run tests in batches | `./scripts/utils/run-tests-batch.sh <agent> <batch-size> <delay>` |
			
 
				-| `check-agent.mjs` | Check agent availability | `node scripts/utils/check-agent.mjs` |
			
 
				-
			
 
				-### Examples
			
 
				-
			
 
				-```bash
			
 
				-# Run tests in batches of 3 with 10s delay
			
 
				-./scripts/utils/run-tests-batch.sh openagent 3 10
			
 
				-
			
 
				-# Check if agent is available
			
 
				-node scripts/utils/check-agent.mjs
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Development Workflow
			
 
				-
			
 
				-### Debugging a Failed Test
			
 
				-
			
 
				-1. Run test with debug flag:
			
 
				-   ```bash
			
 
				-   npm run eval:sdk -- --pattern="my-test.yaml" --debug
			
 
				-   ```
			
 
				-
			
 
				-2. Note the session ID from output
			
 
				-
			
 
				-3. Inspect the session:
			
 
				-   ```bash
			
 
				-   node scripts/debug/inspect-session.mjs
			
 
				-   # or
			
 
				-   node scripts/debug/debug-session.mjs <session-id>
			
 
				-   ```
			
 
				-
			
 
				-4. Check timeline events:
			
 
				-   ```bash
			
 
				-   npx tsx scripts/debug/debug-session.ts <session-id>
			
 
				-   ```
			
 
				-
			
 
				-### Testing Framework Changes
			
 
				-
			
 
				-1. Make changes to framework code
			
 
				-
			
 
				-2. Build:
			
 
				-   ```bash
			
 
				-   npm run build
			
 
				-   ```
			
 
				-
			
 
				-3. Test specific component:
			
 
				-   ```bash
			
 
				-   npx tsx scripts/test/test-timeline.ts
			
 
				-   ```
			
 
				-
			
 
				-4. Run full test suite:
			
 
				-   ```bash
			
 
				-   npm run eval:sdk
			
 
				-   ```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Script Dependencies
			
 
				-
			
 
				-All scripts require the framework to be built first:
			
 
				-
			
 
				-```bash
			
 
				-npm run build
			
 
				-```
			
 
				-
			
 
				-Some scripts use:
			
 
				-- `@opencode-ai/sdk` - For SDK client
			
 
				-- `tsx` - For TypeScript execution
			
 
				-- Framework dist files - Built TypeScript output
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Adding New Scripts
			
 
				-
			
 
				-### Debug Script Template
			
 
				-
			
 
				-```javascript
			
 
				-// scripts/debug/my-debug-script.mjs
			
 
				-import { SessionReader } from '../../dist/collector/session-reader.js';
			
 
				-import { createOpencodeClient } from '@opencode-ai/sdk';
			
 
				-
			
 
				-const client = createOpencodeClient({
			
 
				-  baseUrl: 'http://localhost:3721'
			
 
				-});
			
 
				-
			
 
				-// Your debug logic here
			
 
				-```
			
 
				-
			
 
				-### Test Script Template
			
 
				-
			
 
				-```typescript
			
 
				-#!/usr/bin/env -S npx tsx
			
 
				-// scripts/test/my-test-script.ts
			
 
				-
			
 
				-import { TestRunner } from '../../dist/sdk/test-runner.js';
			
 
				-
			
 
				-async function runTest() {
			
 
				-  // Your test logic here
			
 
				-}
			
 
				-
			
 
				-runTest().catch(console.error);
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Maintenance
			
 
				-
			
 
				-- **Keep scripts organized** - Put debug scripts in `debug/`, test scripts in `test/`
			
 
				-- **Update this README** - When adding new scripts
			
 
				-- **Remove obsolete scripts** - Delete scripts that are no longer needed
			
 
				-- **Document usage** - Add clear usage examples
			
 
				-
			
 
				----
			
 
				-
			
 
				-**Last Updated**: 2025-11-26
			
--- a/evals/results/README.md
+++ b/evals/results/README.md
@@ -1,279 +0,0 @@
 
				-# 📊 Test Results Dashboard
			
 
				-
			
 
				-Interactive dashboard for visualizing OpenCode agent test results.
			
 
				-
			
 
				-## ⚡ Quick Reference
			
 
				-
			
 
				-```bash
			
 
				-# Run tests
			
 
				-cd evals/framework && npm run eval:sdk -- --agent=opencoder
			
 
				-
			
 
				-# View dashboard (auto-opens browser, auto-shuts down)
			
 
				-cd evals/results && ./serve.sh
			
 
				-```
			
 
				-
			
 
				-That's it! 🎉
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Quick Start
			
 
				-
			
 
				-1. **Run Tests:**
			
 
				-   ```bash
			
 
				-   cd evals/framework
			
 
				-   npm run eval:sdk -- --agent=opencoder
			
 
				-   npm run eval:sdk -- --agent=openagent
			
 
				-   ```
			
 
				-
			
 
				-2. **View Dashboard:**
			
 
				-   
			
 
				-   **Option A: One-Command Solution (Easiest)** ⭐
			
 
				-   ```bash
			
 
				-   cd evals/results
			
 
				-   ./serve.sh
			
 
				-   ```
			
 
				-   - Auto-opens browser
			
 
				-   - Loads dashboard
			
 
				-   - Auto-shuts down after 15 seconds
			
 
				-   - Dashboard stays cached in browser!
			
 
				-   
			
 
				-   **Custom timeout:**
			
 
				-   ```bash
			
 
				-   ./serve.sh 8000 30  # Port 8000, 30 second timeout
			
 
				-   ```
			
 
				-   
			
 
				-   **Option B: Keep Server Running**
			
 
				-   ```bash
			
 
				-   cd evals/results
			
 
				-   python3 -m http.server 8000
			
 
				-   ```
			
 
				-   Press Ctrl+C to stop manually
			
 
				-   
			
 
				-   **Option C: Direct File Access**
			
 
				-   ```bash
			
 
				-   open evals/results/index.html
			
 
				-   ```
			
 
				-   ⚠️ Note: Some browsers block loading JSON from local files. If you see an error, use Option A or B.
			
 
				-
			
 
				-## Features
			
 
				-
			
 
				-### 📈 Overview Stats
			
 
				-- **Total Tests** - Count across all agents
			
 
				-- **Pass Rate** - Percentage of passing tests
			
 
				-- **Failed Tests** - Number of failures
			
 
				-- **Avg Duration** - Average test execution time
			
 
				-
			
 
				-### 📊 Trend Chart
			
 
				-- Visual representation of pass rate over time
			
 
				-- Shows last 30 days of test runs
			
 
				-- Helps identify regressions
			
 
				-
			
 
				-### 🔍 Filters
			
 
				-- **Agent** - Filter by openagent, opencoder, etc.
			
 
				-- **Category** - Developer, business, creative, edge-case
			
 
				-- **Status** - All, passed only, or failed only
			
 
				-- **Time Range** - Latest, today, last 7 days, last 30 days
			
 
				-
			
 
				-### 🔎 Search
			
 
				-- Real-time search across test IDs
			
 
				-- Case-insensitive matching
			
 
				-
			
 
				-### 📋 Test Table
			
 
				-- **Sortable Columns** - Click any header to sort
			
 
				-- **Expandable Rows** - Click a row to see details
			
 
				-- **Violation Details** - See error messages and severity
			
 
				-
			
 
				-### 🌙 Dark Mode
			
 
				-- Toggle with moon/sun icon in header
			
 
				-- Preference saved to localStorage
			
 
				-- Easy on the eyes for long sessions
			
 
				-
			
 
				-### 📥 Export
			
 
				-- Export filtered results to CSV
			
 
				-- Includes all test metadata
			
 
				-- Perfect for external analysis
			
 
				-
			
 
				-## File Structure
			
 
				-
			
 
				-```
			
 
				-results/
			
 
				-├── index.html              # Dashboard (open this)
			
 
				-├── serve.sh                # Helper script to start HTTP server
			
 
				-├── latest.json             # Most recent test run
			
 
				-├── history/
			
 
				-│   └── 2025-11/
			
 
				-│       ├── 26-115759-opencoder.json
			
 
				-│       └── 26-115850-openagent.json
			
 
				-├── .gitignore              # Retention policy
			
 
				-└── README.md               # This file
			
 
				-```
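
The history file names in the tree above appear to encode the run date, time, and agent. The sketch below parses such a path under that assumption. The `history/YYYY-MM/DD-HHMMSS-<agent>.json` layout is inferred from the two example names only, and `parseHistoryPath` and `HistoryEntry` are hypothetical names, not framework APIs.

```typescript
interface HistoryEntry {
  year: number;
  month: number; // 1-12
  day: number;
  time: string; // "HHMMSS", exactly as it appears in the file name
  agent: string;
}

// Assumed layout: history/YYYY-MM/DD-HHMMSS-<agent>.json
const HISTORY_PATH = /^history\/(\d{4})-(\d{2})\/(\d{2})-(\d{6})-(.+)\.json$/;

// Returns the parsed parts, or null when the path does not match the layout.
function parseHistoryPath(path: string): HistoryEntry | null {
  const m = HISTORY_PATH.exec(path);
  if (m === null) return null;
  return {
    year: Number(m[1]),
    month: Number(m[2]),
    day: Number(m[3]),
    time: m[4],
    agent: m[5],
  };
}

const parsedEntry = parseHistoryPath("history/2025-11/26-115759-opencoder.json");
```

Returning `null` for non-matching paths lets a caller skip files such as `latest.json` without special-casing them.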
			
 
				-
			
 
				-## JSON Format
			
 
				-
			
 
				-Each result file contains:
			
 
				-
			
 
				-```json
			
 
				-{
			
 
				-  "meta": {
			
 
				-    "timestamp": "2025-11-26T11:59:36.365Z",
			
 
				-    "agent": "openagent",
			
 
				-    "model": "opencode/grok-code-fast",
			
 
				-    "framework_version": "0.1.0",
			
 
				-    "git_commit": "f872007"
			
 
				-  },
			
 
				-  "summary": {
			
 
				-    "total": 8,
			
 
				-    "passed": 6,
			
 
				-    "failed": 2,
			
 
				-    "duration_ms": 32450,
			
 
				-    "pass_rate": 0.75
			
 
				-  },
			
 
				-  "by_category": {
			
 
				-    "developer": { "passed": 5, "total": 6 },
			
 
				-    "business": { "passed": 1, "total": 1 },
			
 
				-    "edge-case": { "passed": 0, "total": 1 }
			
 
				-  },
			
 
				-  "tests": [
			
 
				-    {
			
 
				-      "id": "task-simple-001",
			
 
				-      "category": "developer",
			
 
				-      "passed": true,
			
 
				-      "duration_ms": 4200,
			
 
				-      "events": 23,
			
 
				-      "approvals": 2,
			
 
				-      "violations": {
			
 
				-        "total": 0,
			
 
				-        "errors": 0,
			
 
				-        "warnings": 0
			
 
				-      }
			
 
				-    }
			
 
				-  ]
			
 
				-}
			
 
				-```
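
The counts in a result file are redundant, so they can be cross-checked. The sketch below assumes that `pass_rate` is `passed / total` stored unrounded and that the `by_category` counts sum to the `summary` counts; the sample values above are consistent with both, but the README does not state either rule. `ResultFile` and `checkResultFile` are hypothetical names that model only the fields used here.

```typescript
interface CategoryCount {
  passed: number;
  total: number;
}

interface ResultFile {
  summary: { total: number; passed: number; failed: number; duration_ms: number; pass_rate: number };
  by_category: { [name: string]: CategoryCount };
}

// Returns a list of inconsistencies; an empty list means the numbers add up.
function checkResultFile(file: ResultFile): string[] {
  const problems: string[] = [];
  const s = file.summary;
  if (s.passed + s.failed !== s.total) problems.push("passed + failed != total");
  // Assumption: pass_rate is passed / total, stored without rounding.
  const expectedRate = s.total === 0 ? 0 : s.passed / s.total;
  if (Math.abs(s.pass_rate - expectedRate) > 1e-9) problems.push("pass_rate != passed / total");
  let categoryTotal = 0;
  let categoryPassed = 0;
  for (const name in file.by_category) {
    categoryTotal += file.by_category[name].total;
    categoryPassed += file.by_category[name].passed;
  }
  if (categoryTotal !== s.total) problems.push("category totals != summary total");
  if (categoryPassed !== s.passed) problems.push("category passed != summary passed");
  return problems;
}

// Values copied from the sample JSON above.
const sampleFile: ResultFile = {
  summary: { total: 8, passed: 6, failed: 2, duration_ms: 32450, pass_rate: 0.75 },
  by_category: {
    developer: { passed: 5, total: 6 },
    business: { passed: 1, total: 1 },
    "edge-case": { passed: 0, total: 1 },
  },
};
const sampleProblems = checkResultFile(sampleFile);
```

If real result files round `pass_rate`, the tolerance would need to be loosened accordingly.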
			
 
				-
			
 
				-## Retention Policy
			
 
				-
			
 
				-Results are automatically managed:
			
 
				-
			
 
				-- ✅ **Latest Run** - Always kept (`latest.json`)
			
 
				-- ✅ **Current Month** - All results committed to git
			
 
				-- ✅ **Previous Month** - All results committed to git
			
 
				-- ❌ **Older than the previous month** (roughly 60 days) - Kept locally, not committed
			
 
				-
			
 
				-This keeps the repo size manageable while preserving recent history.
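
The month-based rule can be expressed as a small predicate. This is only an illustration of the rule as listed above, not how the repository enforces it (the file tree attributes the policy to `.gitignore`). The function name `isCommittedMonth` and the use of UTC are assumptions.

```typescript
// Returns true when a history directory such as "2025-11" is the current
// or the previous calendar month relative to `now` (evaluated in UTC).
function isCommittedMonth(dir: string, now: Date): boolean {
  const m = /^(\d{4})-(\d{2})$/.exec(dir);
  if (m === null) return false;
  // Convert both dates to a running month count so year boundaries work.
  const dirIndex = Number(m[1]) * 12 + (Number(m[2]) - 1);
  const nowIndex = now.getUTCFullYear() * 12 + now.getUTCMonth();
  const age = nowIndex - dirIndex;
  return age === 0 || age === 1;
}

// Relative to 26 November 2025, November and October are kept in git.
const referenceDate = new Date(Date.UTC(2025, 10, 26));
const novemberKept = isCommittedMonth("2025-11", referenceDate);
const septemberKept = isCommittedMonth("2025-09", referenceDate);
```

Counting months instead of days avoids the ambiguity between "previous month" and "60 days" for months of different lengths.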
			
 
				-
			
 
				-## Tips
			
 
				-
			
 
				-### Quick View Workflow
			
 
				-The fastest way to view results:
			
 
				-```bash
			
 
				-cd evals/results && ./serve.sh
			
 
				-```
			
 
				-- ✅ Opens browser automatically
			
 
				-- ✅ Loads all data
			
 
				-- ✅ Shuts down after 15 seconds
			
 
				-- ✅ Dashboard stays functional (data cached)
			
 
				-- ✅ No manual cleanup needed
			
 
				-
			
 
				-**Want to keep exploring?** Press Ctrl+C during countdown to keep server running.
			
 
				-
			
 
				-### Comparing Agents
			
 
				-1. Set **Time Range** to "Latest Run"
			
 
				-2. Set **Agent** to "All Agents"
			
 
				-3. Compare pass rates and durations
			
 
				-
			
 
				-### Finding Flaky Tests
			
 
				-1. Set **Time Range** to "Last 30 Days"
			
 
				-2. Look for tests that alternate between pass/fail
			
 
				-3. Check violation details for patterns
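
Step 2 can also be approximated outside the dashboard. The sketch below counts how often one test's outcome changes across runs; it assumes the `passed` values for a single test ID have already been collected from the history files in chronological order. `countFlips`, `looksFlaky`, and the default threshold of two flips are hypothetical choices, not framework behavior.

```typescript
// One test's pass/fail outcomes, ordered oldest to newest.
type Outcomes = boolean[];

// Counts how often the outcome changes between consecutive runs.
function countFlips(outcomes: Outcomes): number {
  let flips = 0;
  for (let i = 1; i < outcomes.length; i++) {
    if (outcomes[i] !== outcomes[i - 1]) flips++;
  }
  return flips;
}

// Hypothetical threshold: a single flip looks like one regression or one fix,
// while two or more flips suggest the test alternates.
function looksFlaky(outcomes: Outcomes, minFlips: number = 2): boolean {
  return countFlips(outcomes) >= minFlips;
}

const alternating = looksFlaky([true, false, true, false]);
const singleRegression = looksFlaky([true, true, false, false]);
```

A test that fails once and stays failed is reported as not flaky, which keeps genuine regressions out of the flaky list.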
			
 
				-
			
 
				-### Tracking Improvements
			
 
				-1. Run tests regularly (daily/weekly)
			
 
				-2. Watch the trend chart for improvements
			
 
				-3. Export CSV for deeper analysis
			
 
				-
			
 
				-### Debugging Failures
			
 
				-1. Filter **Status** to "Failed Only"
			
 
				-2. Click on a failed test row
			
 
				-3. Review violation details
			
 
				-4. Check error messages and severity
			
 
				-
			
 
				-## Browser Compatibility
			
 
				-
			
 
				-- ✅ Chrome/Edge (recommended)
			
 
				-- ✅ Firefox
			
 
				-- ✅ Safari
			
 
				-- ❌ IE11 (not supported)
			
 
				-
			
 
				-## Performance
			
 
				-
			
 
				-- **Dashboard Size:** ~31KB (no dependencies except Chart.js CDN)
			
 
				-- **Load Time:** < 1 second for 100 tests
			
 
				-- **Memory:** Minimal (pure JavaScript, no frameworks)
			
 
				-
			
 
				-## How It Works
			
 
				-
			
 
				-### Auto-Shutdown Feature
			
 
				-The `serve.sh` script:
			
 
				-1. Starts HTTP server on port 8000
			
 
				-2. Opens dashboard in your browser
			
 
				-3. Waits 15 seconds for data to load
			
 
				-4. Shuts down server automatically
			
 
				-5. Dashboard continues working (data cached in browser)
			
 
				-
			
 
				-**Why does it still work after shutdown?**
			
 
				-- The browser caches the JSON data
			
 
				-- All filtering/sorting happens in JavaScript
			
 
				-- No server needed after initial load
			
 
				-- To load new data, restart the server and then refresh the page
			
 
				-
			
 
				-### Stopping Manually
			
 
				-If you start the server manually:
			
 
				-```bash
			
 
				-# Find the process
			
 
				-lsof -ti:8000
			
 
				-
			
 
				-# Kill it
			
 
				-kill $(lsof -ti:8000)
			
 
				-```
			
 
				-
			
 
				-Or just press Ctrl+C in the terminal.
			
 
				-
			
 
				-## Troubleshooting
			
 
				-
			
 
				-### Dashboard shows "No results found"
			
 
				-- Run tests first: `npm run eval:sdk`
			
 
				-- Check that `latest.json` exists
			
 
				-- Refresh the page
			
 
				-
			
 
				-### Chart not displaying
			
 
				-- Check browser console for errors
			
 
				-- Ensure Chart.js CDN is accessible
			
 
				-- Try refreshing the page
			
 
				-
			
 
				-### Dark mode not persisting
			
 
				-- Check browser localStorage is enabled
			
 
				-- Clear cache and try again
			
 
				-
			
 
				-## Future Enhancements
			
 
				-
			
 
				-Potential improvements:
			
 
				-- [ ] Historical comparison (compare two runs)
			
 
				-- [ ] Test duration trends per test
			
 
				-- [ ] Violation type breakdown chart
			
 
				-- [ ] Agent performance comparison chart
			
 
				-- [ ] Auto-refresh option
			
 
				-- [ ] Shareable URLs with filters
			
 
				-- [ ] CI/CD badge generation
			
 
				-
			
 
				-## Contributing
			
 
				-
			
 
				-To improve the dashboard:
			
 
				-
			
 
				-1. Edit `index.html` (all code is in one file)
			
 
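				-2. Test locally by running `./serve.sh` and opening the dashboard in a browser (opening `index.html` directly may fail in browsers that block loading JSON from local files)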
			
 
				-3. Submit PR with description of changes
			
 
				-
			
 
				-## License
			
 
				-
			
 
				-MIT - Same as OpenCode Agents project
			
--- a/evals/test_tmp/README.md
+++ b/evals/test_tmp/README.md
@@ -1,29 +0,0 @@
 
				-# Test Artifacts
			
 
				-
			
 
				-This directory contains temporary files created during test execution.
			
 
				-It should be cleaned up after tests complete.
			
 
				-
			
 
				-**DO NOT COMMIT FILES IN THIS DIRECTORY**
			
 
				-
			
 
				-## Installation
			
 
				-
			
 
				-To install the project dependencies, navigate to the evaluation framework directory and run:
			
 
				-
			
 
				-```bash
			
 
				-cd evals/framework
			
 
				-npm install
			
 
				-```
			
 
				-
			
 
				-This will install all required dependencies including:
			
 
				-- `@opencode-ai/sdk` - OpenCode AI SDK
			
 
				-- `yaml` - YAML parser for test cases
			
 
				-- `zod` - Schema validation
			
 
				-- `glob` - File pattern matching
			
 
				-
			
 
				-### Development Dependencies
			
 
				-
			
 
				-For development and testing, the following tools are also installed:
			
 
				-- TypeScript compiler
			
 
				-- Vitest testing framework
			
 
				-- ESLint for code linting
			
 
				-- tsx for TypeScript execution