Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.1.0-alpha.1 - 2025-11-26

[0.3.0] - 2025-11-26

Changes

feat(ci): skip redundant tests on PR merge

Tests only run on:

Pull requests (CI checks)
Direct pushes to main
Manual workflow dispatch

PR merges skip the tests, which already passed on the pull request, but still trigger the version bump.
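
The rule above can be sketched as a small decision function. This is an illustration only: the Trigger type, the fromPrMerge flag, and shouldRunTests are hypothetical names, and how the real workflow tells a PR merge apart from a direct push is not described in this changelog.

```typescript
// Hypothetical model of the events that can start the workflow.
// The actual workflow conditions are not shown in this changelog.
type Trigger =
  | { kind: "pull_request" }
  | { kind: "workflow_dispatch" }
  | { kind: "push"; branch: string; fromPrMerge: boolean };

// Returns true when the test job should run for the given trigger.
function shouldRunTests(trigger: Trigger): boolean {
  switch (trigger.kind) {
    case "pull_request":
    case "workflow_dispatch":
      return true;
    case "push":
      // Direct pushes to main are tested; PR merges were already
      // tested on the pull request, so they are skipped.
      return trigger.branch === "main" && !trigger.fromPrMerge;
  }
}
```

Under these assumptions, a push to main with fromPrMerge set to true returns false, while the same push with fromPrMerge set to false returns true.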

[0.2.0] - 2025-11-26

Changes

feat(ci): auto-create GitHub releases on version bump

Releases now appear in the GitHub sidebar. Each release's notes are extracted from that version's section of CHANGELOG.md.
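
One way such an extraction could work is sketched below. It assumes version sections start with a heading of the form "## [x.y.z]", which is the usual Keep a Changelog layout; the function name extractReleaseNotes is hypothetical and this is not necessarily how the workflow does it.

```typescript
// Returns the text between the heading for `version` and the next
// version heading, or null if the version has no section.
// Assumes headings look like "## [0.2.0] - 2025-11-26".
function extractReleaseNotes(changelog: string, version: string): string | null {
  const lines = changelog.split(/\r?\n/);
  const isVersionHeading = (line: string): boolean => /^##\s+\[[^\]]+\]/.test(line);

  const start = lines.findIndex((line) => line.startsWith(`## [${version}]`));
  if (start === -1) return null;

  // Everything after the heading, cut off at the next version heading.
  const rest = lines.slice(start + 1);
  const next = rest.findIndex(isVersionHeading);
  const body = next === -1 ? rest : rest.slice(0, next);
  return body.join("\n").trim();
}
```

Returning null for an unknown version lets the caller decide whether to fail the release step or publish with empty notes.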

[0.1.0] - 2025-11-26

Changes

fix(ci): add contents:write permission for auto version bump

The GITHUB_TOKEN needs explicit write permission to push commits and tags back to the repository.

Added

SDK-Based Evaluation Framework

Complete test execution framework using OpenCode SDK
Support for openagent and opencoder testing
Real agent testing with session management
Smart timeout system with activity monitoring
Multi-turn conversation support
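
As an illustration of what a timeout with activity monitoring can look like, the sketch below keeps a run alive while events keep arriving and stops it after a quiet period or an absolute limit. All names (createActivityTimeout, idleLimitMs, hardLimitMs) are hypothetical; the framework's actual timeout logic is not shown in this changelog.

```typescript
interface ActivityTimeout {
  // Record that the agent produced an event just now.
  touch(): void;
  // "idle" if no activity for idleLimitMs, "hard" if the absolute
  // limit was reached, null if the run may continue.
  expired(): "idle" | "hard" | null;
}

// `now` is injectable so the logic can be exercised without real timers.
function createActivityTimeout(
  idleLimitMs: number,
  hardLimitMs: number,
  now: () => number = Date.now,
): ActivityTimeout {
  const startedAt = now();
  let lastActivityAt = startedAt;
  return {
    touch() {
      lastActivityAt = now();
    },
    expired() {
      const t = now();
      if (t - startedAt >= hardLimitMs) return "hard";
      if (t - lastActivityAt >= idleLimitMs) return "idle";
      return null;
    },
  };
}
```

A runner would call touch() on every session event and poll expired() on an interval, so a slow but active agent is not cut off while a stalled one is.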

Modular Architecture

Refactored test-runner.ts (884 lines → 4 focused modules):
- test-runner.ts (411 lines): Thin orchestrator
- test-executor.ts (392 lines): Core execution logic
- result-validator.ts (253 lines): Validation logic
- event-logger.ts (128 lines): Logging utilities
Improved Single Responsibility Principle compliance
Enhanced testability through dependency injection

Test Infrastructure

20+ test cases across multiple categories:
- OpenAgent: Developer (12), Context Loading (5), Business (2), Edge Cases (3)
- OpenCoder: Developer (4)
BehaviorEvaluator for validating expected agent actions
Comprehensive evaluators: approval-gate, context-loading, delegation, tool-usage

Interactive Results Dashboard

Real-time test results visualization
Filtering by agent, category, status
Detailed violation tracking
CSV export functionality
Historical results tracking
One-command deployment (./serve.sh)

Documentation

ARCHITECTURE.md: Comprehensive system review (456 lines)
GETTING_STARTED.md: Quick start guide (435 lines)
SDK_EVAL_README.md: Complete SDK guide (298 lines)
Test design guide and architecture overview
Documentation cleanup (removed 3 outdated files)

Script Organization

Organized 12 scripts into logical directories:
- scripts/debug/: Session debugging tools (4 files)
- scripts/test/: Test execution scripts (6 files)
- scripts/utils/: Utility scripts (2 files)
Comprehensive scripts/README.md with usage examples

Monorepo Structure

Root package.json with convenient npm scripts
Easy agent selection (openagent, opencoder)
Easy model selection (grok, claude, gpt-4)
Quick dashboard access from root
No folder navigation required

CI/CD

GitHub Actions workflow for automated testing
Pre-merge validation for agent changes
Fast smoke tests for both agents
Automated test result reporting

Agent Improvements

Enhanced openagent with better context loading
New opencoder agent with test suite
Improved subagent invocation patterns
Ultra-compact context index system

Changed

Reorganized evaluation framework structure
Improved test case schema with behavior expectations
Enhanced context loading detection

Removed

Outdated documentation files (TESTING_CONFIDENCE.md, TEST_REVIEW.md, SESSION_STORAGE_FIX.md)
Redundant test files

Fixed

Context loading evaluator detection accuracy
Multi-turn prompt handling
Test artifact cleanup

Version Format

v0.1.0-alpha.1
│ │ │  │      │
│ │ │  │      └─ Build/Iteration number
│ │ │  └──────── Release stage (alpha, beta, rc)
│ │ └─────────── Patch version
│ └───────────── Minor version
└─────────────── Major version (0 = pre-release)
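
The format above can be parsed with a single regular expression. The following is a minimal sketch; parseVersion and ParsedVersion are illustrative names, not part of the project.

```typescript
type Stage = "alpha" | "beta" | "rc";

interface ParsedVersion {
  major: number;
  minor: number;
  patch: number;
  stage: Stage | null;      // null when there is no stage suffix
  iteration: number | null; // null when there is no stage suffix
}

// Accepts strings such as "v0.1.0-alpha.1" or "0.3.0"; the leading
// "v" is optional. Returns null when the input does not match.
function parseVersion(input: string): ParsedVersion | null {
  const match = /^v?(\d+)\.(\d+)\.(\d+)(?:-(alpha|beta|rc)\.(\d+))?$/.exec(input);
  if (match === null) return null;
  const [, major, minor, patch, stage, iteration] = match;
  return {
    major: Number(major),
    minor: Number(minor),
    patch: Number(patch),
    stage: (stage as Stage | undefined) ?? null,
    iteration: iteration === undefined ? null : Number(iteration),
  };
}
```

For "v0.1.0-alpha.1" this yields major 0, minor 1, patch 0, stage "alpha", and iteration 1.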

Version Progression

Alpha (v0.x.0-alpha.N): Early development, unstable
Beta (v0.x.0-beta.N): Feature complete, testing
RC (v0.x.0-rc.N): Release candidate, stable
Stable (v1.x.x): Production ready
