Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.1.0-alpha.1 - 2025-11-26

[0.3.0] - 2025-11-26

Changes

  • feat(ci): skip redundant tests on PR merge

Tests now run only on:

  • Pull requests (CI checks)
  • Direct pushes to main
  • Manual workflow dispatch

PR merges skip tests, since they already passed on the pull request, but still trigger the version bump.
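
The rules above can be sketched as a small predicate. This is an illustration only: the CiEvent type, the fromPrMerge flag, and the shouldRunTests name are hypothetical, and the real workflow presumably expresses the same conditions in GitHub Actions YAML rather than TypeScript. How a merge is told apart from a direct push is not stated here, so the sketch takes it as an input.

```typescript
// Hypothetical event shape; the real values would come from the workflow context.
type CiEvent =
  | { kind: "pull_request" }
  | { kind: "workflow_dispatch" }
  | { kind: "push"; branch: string; fromPrMerge: boolean };

// Returns true when the test job should run for this event.
function shouldRunTests(event: CiEvent): boolean {
  if (event.kind === "pull_request" || event.kind === "workflow_dispatch") {
    return true;
  }
  // Remaining case is a push: run tests for direct pushes to main,
  // skip them when the push is the result of merging a PR.
  return event.branch === "main" && !event.fromPrMerge;
}
```

With this predicate, a pull request, a manual dispatch, and a direct push to main all run tests, while a push produced by a PR merge does not.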

[0.2.0] - 2025-11-26

Changes

  • feat(ci): auto-create GitHub releases on version bump

Releases now appear in the GitHub sidebar. Release notes for each version are extracted from that version's section of CHANGELOG.md.
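
A minimal sketch of how the notes for one version could be pulled out of the changelog text. It assumes Markdown headings of the form "## [0.2.0] - 2025-11-26" as used by Keep a Changelog. The extractReleaseNotes name is hypothetical, and the actual workflow may use a different mechanism.

```typescript
// Returns the body of one version's section, or null if the version is not found.
// Assumes each version starts with a heading like "## [0.2.0] - 2025-11-26".
function extractReleaseNotes(changelog: string, version: string): string | null {
  const lines = changelog.split(/\r?\n/);
  const start = lines.findIndex((line) => line.startsWith(`## [${version}]`));
  if (start === -1) return null;

  // The section ends at the next version heading, or at end of file.
  const rest = lines.slice(start + 1);
  const next = rest.findIndex((line) => /^##\s+\[/.test(line));
  const body = next === -1 ? rest : rest.slice(0, next);
  return body.join("\n").trim();
}
```

Matching on the closing bracket keeps "0.1.0" from also matching a heading such as "## [0.1.0-alpha.1]".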

[0.1.0] - 2025-11-26

Changes

  • fix(ci): add contents:write permission for auto version bump

The GITHUB_TOKEN needs explicit write permission to push commits and tags back to the repository.

Added

SDK-Based Evaluation Framework

  • Complete test execution framework using OpenCode SDK
  • Support for openagent and opencoder testing
  • Real agent testing with session management
  • Smart timeout system with activity monitoring
  • Multi-turn conversation support

Modular Architecture

  • Refactored test-runner.ts (884 lines → 4 focused modules):
    • test-runner.ts (411 lines): Thin orchestrator
    • test-executor.ts (392 lines): Core execution logic
    • result-validator.ts (253 lines): Validation logic
    • event-logger.ts (128 lines): Logging utilities
  • Improved Single Responsibility Principle compliance
  • Enhanced testability through dependency injection

Test Infrastructure

  • 20+ test cases across multiple categories:
    • OpenAgent: Developer (12), Context Loading (5), Business (2), Edge Cases (3)
    • OpenCoder: Developer (4)
  • BehaviorEvaluator for validating expected agent actions
  • Comprehensive evaluators: approval-gate, context-loading, delegation, tool-usage

Interactive Results Dashboard

  • Real-time test results visualization
  • Filtering by agent, category, status
  • Detailed violation tracking
  • CSV export functionality
  • Historical results tracking
  • One-command deployment (./serve.sh)

Documentation

  • ARCHITECTURE.md: Comprehensive system review (456 lines)
  • GETTING_STARTED.md: Quick start guide (435 lines)
  • SDK_EVAL_README.md: Complete SDK guide (298 lines)
  • Test design guide and architecture overview
  • Documentation cleanup (removed 3 outdated files)

Script Organization

  • Organized 12 scripts into logical directories:
    • scripts/debug/: Session debugging tools (4 files)
    • scripts/test/: Test execution scripts (6 files)
    • scripts/utils/: Utility scripts (2 files)
  • Comprehensive scripts/README.md with usage examples

Monorepo Structure

  • Root package.json with convenient npm scripts
  • Easy agent selection (openagent, opencoder)
  • Easy model selection (grok, claude, gpt-4)
  • Quick dashboard access from root
  • No folder navigation required

CI/CD

  • GitHub Actions workflow for automated testing
  • Pre-merge validation for agent changes
  • Fast smoke tests for both agents
  • Automated test result reporting

Agent Improvements

  • Enhanced openagent with better context loading
  • New opencoder agent with test suite
  • Improved subagent invocation patterns
  • Ultra-compact context index system

Changed

  • Reorganized evaluation framework structure
  • Improved test case schema with behavior expectations
  • Enhanced context loading detection

Removed

  • Outdated documentation files (TESTING_CONFIDENCE.md, TEST_REVIEW.md, SESSION_STORAGE_FIX.md)
  • Redundant test files

Fixed

  • Context loading evaluator detection accuracy
  • Multi-turn prompt handling
  • Test artifact cleanup

Version Format

v0.1.0-alpha.1
 │ │ │ │     │
 │ │ │ │     └─ Build/Iteration number
 │ │ │ └─────── Release stage (alpha, beta, rc)
 │ │ └───────── Patch version
 │ └─────────── Minor version
 └───────────── Major version (0 = pre-release)
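
The breakdown above can be sketched as a small parser. The ParsedVersion type and the parseVersion name are hypothetical and not part of the project. The sketch assumes the stage is one of alpha, beta, or rc, and that the leading "v" is optional.

```typescript
interface ParsedVersion {
  major: number;
  minor: number;
  patch: number;
  stage: "alpha" | "beta" | "rc" | null; // null when no release stage is present
  iteration: number | null;              // build/iteration number, null without a stage
}

// Parses tags such as "v0.1.0-alpha.1" or "0.3.0"; returns null for anything else.
function parseVersion(tag: string): ParsedVersion | null {
  const match = /^v?(\d+)\.(\d+)\.(\d+)(?:-(alpha|beta|rc)\.(\d+))?$/.exec(tag);
  if (!match) return null;
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
    stage: (match[4] ?? null) as ParsedVersion["stage"],
    iteration: match[5] !== undefined ? Number(match[5]) : null,
  };
}
```

For example, parseVersion("v0.1.0-alpha.1") yields major 0, minor 1, patch 0, stage "alpha", and iteration 1.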

Version Progression

  • Alpha (v0.x.0-alpha.N): Early development, unstable
  • Beta (v0.x.0-beta.N): Feature complete, testing
  • RC (v0.x.0-rc.N): Release candidate, stable
  • Stable (v1.x.x): Production ready
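
Because the project follows Semantic Versioning, the progression above also matches version precedence: for the same core version, alpha sorts before beta, beta before rc, and any pre-release before the plain release. The comparator below is a minimal sketch of those precedence rules. The splitVersion and compareVersions names are hypothetical, and build metadata ("+...") is not handled.

```typescript
// Splits "v0.2.0-beta.3" into core [0, 2, 0] and pre-release identifiers ["beta", "3"].
function splitVersion(version: string): { core: number[]; pre: string[] } {
  const [corePart, prePart] = version.replace(/^v/, "").split("-", 2);
  return {
    core: corePart.split(".").map(Number),
    pre: prePart ? prePart.split(".") : [],
  };
}

// Returns -1, 0, or 1 following Semantic Versioning precedence.
function compareVersions(a: string, b: string): number {
  const va = splitVersion(a);
  const vb = splitVersion(b);
  for (let i = 0; i < 3; i++) {
    const diff = (va.core[i] ?? 0) - (vb.core[i] ?? 0);
    if (diff !== 0) return Math.sign(diff);
  }
  // With equal cores, a version without a pre-release tag ranks higher.
  if (va.pre.length === 0 || vb.pre.length === 0) {
    return Math.sign(vb.pre.length - va.pre.length);
  }
  const count = Math.max(va.pre.length, vb.pre.length);
  for (let i = 0; i < count; i++) {
    const x = va.pre[i];
    const y = vb.pre[i];
    if (x === undefined) return -1; // fewer identifiers sorts first
    if (y === undefined) return 1;
    if (x === y) continue;
    const xNumeric = /^\d+$/.test(x);
    const yNumeric = /^\d+$/.test(y);
    if (xNumeric && yNumeric) return Math.sign(Number(x) - Number(y));
    if (xNumeric !== yNumeric) return xNumeric ? -1 : 1; // numeric sorts before text
    return x < y ? -1 : 1; // plain ASCII comparison: alpha < beta < rc
  }
  return 0;
}
```

Passing compareVersions to Array.prototype.sort orders a list of tags from least to most mature.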