safe-methodology.md 20 KB

Safe Refactoring Methodology Reference

Strategies for large and small refactorings that preserve behavior, minimize risk, and provide rollback safety.


Test-Driven Refactoring

The Core Loop

Red-Green-Refactor (for new code)
│
├─ RED: Write a failing test for the desired behavior
├─ GREEN: Write the simplest code that makes the test pass
└─ REFACTOR: Clean up while keeping tests green
   └─ This is where refactoring happens safely

Characterization-Then-Refactor (for existing code)
│
├─ Step 1: Write Characterization Tests
│  │  Run the existing code and capture its actual output
│  │  Assert on that output, even if it seems wrong
│  │  Goal: document what the code DOES, not what it SHOULD do
│  │
│  │  Example:
│  │  def test_calculate_tax_current_behavior():
│  │      # This may be "wrong" but it's what the code does today
│  │      assert calculate_tax(100) == 8.25  # captures actual behavior
│  │
│  └─ Coverage: ensure every branch you will touch is covered
│
├─ Step 2: Refactor Under Test Safety
│  │  Make one small change
│  │  Run all characterization tests
│  │  If tests pass → commit and continue
│  │  If tests fail → revert and try smaller change
│  └─ Never refactor and change behavior in the same step
│
├─ Step 3: Replace Characterization Tests
│  │  Once code is clean, write proper intention-revealing tests
│  │  The characterization tests served as scaffolding
│  └─ Now you can safely fix behavioral bugs you discovered
│
└─ Step 4: Fix Behavioral Issues (if any)
   Now that you have proper tests and clean code,
   fix any bugs discovered during characterization
   Each fix gets its own test + commit

Writing Effective Characterization Tests

# Strategy: Use the code itself to tell you what to assert

# 1. Call the function with representative inputs
result = process_order(sample_order)

# 2. Print the result
print(result)  # {'total': 108.25, 'tax': 8.25, 'status': 'pending'}

# 3. Assert on the printed output
def test_process_order_characterization():
    result = process_order(sample_order)
    assert result['total'] == 108.25
    assert result['tax'] == 8.25
    assert result['status'] == 'pending'

# 4. Cover edge cases the same way
def test_process_order_empty_items():
    result = process_order(Order(items=[]))
    # Even if this behavior is "wrong", capture it
    assert result['total'] == 0
    assert result['status'] == 'pending'  # maybe should be 'invalid'?

When You Cannot Write Tests First

Sometimes characterization tests are impractical (tightly coupled UI, external service dependencies, time-based logic). Alternatives:

Cannot write characterization tests?
│
├─ Too coupled to test → Extract the testable parts first
│  Use "Sprout Method" or "Sprout Class":
│  1. Write the new logic in a new, testable function
│  2. Call it from the old code
│  3. Test the new function
│  4. Gradually move more logic into testable functions
│
├─ External service dependency → Record and replay
│  Use VCR/Polly/nock to record real responses
│  Replay them in tests
│
├─ UI-heavy → Snapshot/Approval tests
│  Capture screenshots or HTML output
│  Compare against approved baseline
│
└─ Time-based logic → Inject a clock
   Pass a clock/timer as a parameter
   Use a fake clock in tests

Strangler Fig Pattern

For replacing a large legacy system or module incrementally, without a risky big-bang rewrite.

Strangler Fig Strategy
│
├─ Phase 1: Identify boundaries
│  │  Map the legacy system's entry points (API routes, function calls, events)
│  │  Each entry point is a candidate for strangling
│  └─ Prioritize by: risk (low first), value (high first), coupling (loose first)
│
├─ Phase 2: Build new implementation alongside old
│  │  New code lives in a new module/service
│  │  Both old and new exist simultaneously
│  └─ No modification to legacy code yet
│
├─ Phase 3: Route traffic to new implementation
│  │  Use a router/proxy/feature flag to direct requests
│  │  Start with a small percentage (canary)
│  │  Monitor for errors and performance differences
│  └─ Gradually increase percentage
│
├─ Phase 4: Remove legacy code
│  │  Once 100% traffic goes to new implementation
│  │  Keep legacy code for one release cycle (rollback safety)
│  └─ Then delete it
│
└─ Repeat for each entry point

Example: Strangling a Legacy API Endpoint

// Phase 2: New implementation alongside old
// old: /api/v1/users (legacy monolith)
// new: /api/v2/users (new service)

// Phase 3: Router decides which to call
app.get('/api/users', async (req, res) => {
  const useNewImplementation = await featureFlag('new-users-api', {
    userId: req.user?.id,
    percentage: 25,  // Start with 25% of traffic
  });

  if (useNewImplementation) {
    return newUsersService.getUsers(req, res);
  }
  return legacyUsersController.getUsers(req, res);
});

// Phase 4: Once at 100%, simplify
app.get('/api/users', (req, res) => newUsersService.getUsers(req, res));

Parallel Change (Expand-Migrate-Contract)

For changing an interface without breaking consumers. Three phases: expand (add new), migrate (move consumers), contract (remove old).

Parallel Change Phases
│
├─ EXPAND: Add the new interface alongside the old
│  │  Both old and new work simultaneously
│  │  Old interface delegates to new implementation internally
│  └─ All existing tests continue to pass
│
├─ MIGRATE: Update all consumers to use the new interface
│  │  One consumer at a time
│  │  Each migration is a separate commit/PR
│  │  Old interface still works (backward compatible)
│  └─ Monitor for issues after each migration
│
└─ CONTRACT: Remove the old interface
   │  All consumers now use the new interface
   │  Delete old code and update tests
   └─ This is the only "breaking" change

Example: Renaming a Function

# EXPAND: Add new name, keep old as alias
def calculate_shipping_cost(order: Order) -> Money:
    """New name with improved logic."""
    # ... implementation ...

def calcShipping(order: Order) -> Money:
    """Deprecated: Use calculate_shipping_cost instead."""
    import warnings
    warnings.warn("calcShipping is deprecated, use calculate_shipping_cost", DeprecationWarning)
    return calculate_shipping_cost(order)

# MIGRATE: Update all call sites one by one
# grep for calcShipping, replace with calculate_shipping_cost
# Run tests after each file

# CONTRACT: Remove old function
# Delete calcShipping entirely
# Remove deprecation warning

Example: Changing a Database Schema

-- EXPAND: Add new column alongside old
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Application code writes to BOTH columns
-- UPDATE users SET full_name = first_name || ' ' || last_name, ...

-- MIGRATE: Backfill existing data
-- UPDATE users SET full_name = first_name || ' ' || last_name WHERE full_name IS NULL;
-- Update all queries to read from full_name

-- CONTRACT: Remove old columns
-- ALTER TABLE users DROP COLUMN first_name, DROP COLUMN last_name;

Branch by Abstraction

For replacing an internal implementation without feature branches. Introduce an abstraction layer, swap the implementation behind it.

Branch by Abstraction
│
├─ Step 1: Create abstraction (interface/protocol/trait)
│  │  Define the contract that both old and new implementations satisfy
│  └─ All existing code uses the abstraction, not the concrete implementation
│
├─ Step 2: Wrap existing implementation
│  │  Make existing code implement the new abstraction
│  └─ All tests pass -- no behavior change
│
├─ Step 3: Build new implementation
│  │  New implementation also satisfies the abstraction
│  │  Test new implementation independently
│  └─ Old implementation is still the default
│
├─ Step 4: Switch
│  │  Change the wiring to use new implementation
│  │  Feature flag or config toggle for easy rollback
│  └─ Monitor in production
│
└─ Step 5: Clean up
   Remove old implementation
   Remove abstraction if only one implementation remains
   Remove feature flag

Example:

// Step 1: Define abstraction
interface PaymentGateway {
  charge(amount: Money, card: CardInfo): Promise<PaymentResult>;
  refund(paymentId: string, amount: Money): Promise<RefundResult>;
}

// Step 2: Wrap existing implementation
class StripeGateway implements PaymentGateway {
  async charge(amount: Money, card: CardInfo): Promise<PaymentResult> {
    // existing Stripe code, now behind the interface
  }
  async refund(paymentId: string, amount: Money): Promise<RefundResult> {
    // existing Stripe refund code
  }
}

// Step 3: Build new implementation
class SquareGateway implements PaymentGateway {
  async charge(amount: Money, card: CardInfo): Promise<PaymentResult> {
    // new Square implementation
  }
  async refund(paymentId: string, amount: Money): Promise<RefundResult> {
    // new Square refund implementation
  }
}

// Step 4: Switch via configuration
function createPaymentGateway(): PaymentGateway {
  if (config.paymentProvider === 'square') {
    return new SquareGateway();
  }
  return new StripeGateway(); // default/fallback
}

Small Commits Strategy

Every commit during a refactoring must satisfy two invariants:

  1. Code compiles (type-checks, no syntax errors)
  2. All tests pass (no behavioral regressions)

Commit Granularity Guide

Refactoring Commit Patterns
│
├─ Rename → 1 commit
│  "refactor: rename calcShipping to calculateShippingCost"
│
├─ Extract Function → 1 commit
│  "refactor: extract validateOrderItems from processOrder"
│
├─ Move File → 1 commit
│  "refactor: move utils/helpers.ts to lib/string-utils.ts"
│
├─ Extract Class → 2-3 commits
│  1. "refactor: extract PriceCalculator interface"
│  2. "refactor: implement PriceCalculator, delegate from OrderService"
│  3. "refactor: remove pricing logic from OrderService"
│
├─ Replace Algorithm → 2 commits
│  1. "test: add characterization tests for sorting"
│  2. "refactor: replace bubble sort with merge sort"
│
└─ Large Restructure → Many small commits
   Each file move or extraction is its own commit
   Never batch unrelated changes

Git Workflow for Refactoring

# Start a refactoring session
git checkout -b refactor/extract-payment-service

# After each small refactoring step
git add -p  # Stage only the relevant changes
git commit -m "refactor: extract PaymentValidator from PaymentService"

# Verify at each step
npm test  # or pytest, cargo test, go test ./...

# If a step goes wrong, revert just that step
git revert HEAD

# When done, create a clean PR
# Each commit should be reviewable independently

Feature Flags for Gradual Rollout

When a refactoring affects runtime behavior (e.g., new algorithm, new data flow), use feature flags to control rollout.

Feature Flag Strategy
│
├─ Before refactoring
│  │  Add a feature flag that defaults to OFF (old behavior)
│  └─ Deploy the flag infrastructure
│
├─ During refactoring
│  │  New code path guarded by the flag
│  │  Old code path remains the default
│  └─ Both paths are tested
│
├─ Rollout
│  │  Enable for internal users first
│  │  Enable for 1% → 10% → 50% → 100%
│  │  Monitor error rates, latency, correctness
│  └─ Rollback = disable the flag (instant, no deploy needed)
│
└─ Cleanup
   Remove the flag and old code path
   This is a separate PR after the rollout is complete

Implementation Pattern

// Simple feature flag check
async function searchProducts(query: string): Promise<Product[]> {
  if (await featureFlags.isEnabled('new-search-algorithm', { userId })) {
    return newSearchAlgorithm(query);
  }
  return legacySearch(query);
}

Feature Flag Hygiene

Rule Why
Remove flags within 2 sprints of 100% rollout Stale flags accumulate and confuse
Name flags descriptively new-search-algorithm not flag-123
Log flag evaluations Debug which path was taken
Test both paths Both old and new must have coverage
Flag owner documented Someone must clean up the flag

Approval Testing / Snapshot Testing

Capture the output of existing code and use it as the test assertion. Ideal for characterization testing before refactoring.

How It Works

Approval Testing Flow
│
├─ First run: Capture output → save as "approved" baseline
│  ├─ HTML output → screenshot or HTML snapshot
│  ├─ JSON output → save formatted JSON
│  ├─ Console output → save text
│  └─ API response → save response body
│
├─ Subsequent runs: Compare output against baseline
│  ├─ Match → test passes
│  └─ Mismatch → test fails, show diff
│     ├─ If expected change → approve new baseline
│     └─ If unexpected change → regression, investigate
│
└─ During refactoring: any output change is flagged
   You decide if the change is intentional or a bug

Tools

Language Tool Type
JavaScript Jest snapshots expect(result).toMatchSnapshot()
JavaScript Storybook Chromatic Visual regression
Python pytest-snapshot snapshot.assert_match(result)
Python Approval Tests verify(result)
Go go-snaps snaps.MatchSnapshot(t, result)
Rust insta insta::assert_snapshot!(result)
Any screenshot comparison Playwright, Cypress, Percy

Jest Snapshot Example

// Before refactoring: create baseline
test('renders user profile', () => {
  const { container } = render(<UserProfile user={mockUser} />);
  expect(container.innerHTML).toMatchSnapshot();
});

// During refactoring: any HTML change will fail this test
// If the change is intentional:
//   npx jest --updateSnapshot

Python Approval Test Example

from approvaltests import verify

def test_generate_report():
    report = generate_report(sample_data)
    verify(report)  # First run saves "approved" file
                    # Subsequent runs compare against it

Rollback Strategies

Rollback Options (fastest to slowest)
│
├─ Feature flag toggle (seconds)
│  └─ Disable the flag → old code path runs instantly
│     No deployment needed
│
├─ Git revert (minutes)
│  └─ git revert <commit-hash>
│     Creates a new commit that undoes the change
│     Deploy the revert
│
├─ Redeploy previous version (minutes-hours)
│  └─ Roll back to previous container image / release tag
│     CI/CD pipeline handles the rest
│
├─ Database rollback (hours-days)
│  └─ If schema changed: run reverse migration
│     If data changed: restore from backup
│     Most disruptive, avoid if possible
│
└─ Cannot rollback (prevention only)
   Deleted data, sent emails, external API calls
   Design for forward-fix instead

Forward-Fix vs Rollback Decision

Should you rollback or fix forward?
│
├─ Is the bug causing data loss or corruption?
│  └─ ROLLBACK immediately, fix later
│
├─ Is the bug affecting > 10% of users?
│  └─ ROLLBACK, then fix forward on a branch
│
├─ Is the fix obvious and small (< 5 lines)?
│  └─ FIX FORWARD with expedited review
│
├─ Is the bug cosmetic or low-severity?
│  └─ FIX FORWARD in next regular release
│
└─ Are you unsure of the scope?
   └─ ROLLBACK (when in doubt, be safe)

Code Review Checklist for Refactoring PRs

Reviewer Checklist
│
├─ Behavior Preservation
│  [ ] No functional changes mixed with structural changes
│  [ ] Test suite passes (check CI, not just author's word)
│  [ ] Snapshot/approval tests show no unexpected diffs
│  [ ] Public API unchanged (or deprecated properly)
│
├─ Quality of Refactoring
│  [ ] Each commit is atomic and independently valid
│  [ ] Naming improves clarity (not just different)
│  [ ] Abstraction level is appropriate (not over-engineered)
│  [ ] No new duplication introduced
│  [ ] No circular dependencies introduced
│
├─ Safety
│  [ ] Characterization tests exist for changed code
│  [ ] Feature flag or rollback plan documented (if applicable)
│  [ ] Performance-sensitive code benchmarked before/after
│  [ ] No dead code left behind (old implementations removed)
│
└─ Completeness
   [ ] All references updated (imports, configs, docs, tests)
   [ ] Deprecation warnings added for public API changes
   [ ] Migration guide for downstream consumers (if applicable)

Measuring Refactoring Success

Refactoring is an investment. Measure whether it paid off.

Quantitative Metrics

Metric Before/After Tool
Cyclomatic complexity Should decrease radon, eslint, gocyclo
Cognitive complexity Should decrease SonarQube
File length Should decrease (god files → smaller modules) tokei, wc -l
Test coverage Should increase or stay the same coverage.py, istanbul, tarpaulin
Build time Should not increase significantly CI pipeline timing
Bundle size Should not increase (may decrease with dead code removal) webpack-bundle-analyzer
Deployment frequency Should increase (easier to ship) DORA metrics
Change failure rate Should decrease (fewer regressions) DORA metrics

Qualitative Indicators

Signal Meaning
Fewer merge conflicts in the area Code is better organized, less contention
New features in the area are faster to build Reduced coupling and clear boundaries
Fewer bug reports in the area Cleaner code, better error handling
Team members are less afraid to change the code Improved testability and readability
Code review comments shift from "I don't understand" to "looks good" Better naming and structure

Before/After Comparison Template

# Capture BEFORE metrics
echo "=== BEFORE ==="
tokei src/module-to-refactor/          # Line counts
radon cc src/module-to-refactor/ -a    # Cyclomatic complexity (Python)
npx knip --reporter compact            # Unused code (JS/TS)

# ... do the refactoring ...

# Capture AFTER metrics
echo "=== AFTER ==="
tokei src/module-to-refactor/
radon cc src/module-to-refactor/ -a
npx knip --reporter compact

# Compare
# Complexity should go down
# Line count may go up slightly (more files, smaller each)
# Unused code count should go down

Anti-Patterns in Refactoring Methodology

Anti-pattern Problem Better Approach
Big-bang rewrite High risk, nothing works for weeks Strangler fig: replace incrementally
Refactoring without a goal Endless polishing, no business value Define success criteria before starting
Refactoring everything at once Merge conflicts, hard to review, hard to rollback One module at a time, one PR at a time
Skipping characterization tests No safety net, cannot verify behavior preserved Always capture current behavior first
Mixing refactoring with features Cannot tell which caused a regression Separate PRs: refactor first, then add feature
Not measuring improvement Cannot justify the time investment Capture before/after metrics
Stopping halfway Half-old, half-new is worse than either Plan for completion, or don't start
Over-designing for the future YAGNI -- you are not going to need it Refactor for today's needs, not hypothetical future
Refactoring shared library without coordinating consumers Breaks downstream teams Parallel change + deprecation period
No rollback plan Stuck if something goes wrong in production Always have a path back: feature flag, git revert, or previous deploy