Safe Refactoring Methodology Reference

Strategies for large and small refactorings that preserve behavior, minimize risk, and provide rollback safety.

Test-Driven Refactoring

The Core Loop

Red-Green-Refactor (for new code)
│
├─ RED: Write a failing test for the desired behavior
├─ GREEN: Write the simplest code that makes the test pass
└─ REFACTOR: Clean up while keeping tests green
   └─ This is where refactoring happens safely

Characterization-Then-Refactor (for existing code)
│
├─ Step 1: Write Characterization Tests
│  │  Run the existing code and capture its actual output
│  │  Assert on that output, even if it seems wrong
│  │  Goal: document what the code DOES, not what it SHOULD do
│  │
│  │  Example:
│  │  def test_calculate_tax_current_behavior():
│  │      # This may be "wrong" but it's what the code does today
│  │      assert calculate_tax(100) == 8.25  # captures actual behavior
│  │
│  └─ Coverage: ensure every branch you will touch is covered
│
├─ Step 2: Refactor Under Test Safety
│  │  Make one small change
│  │  Run all characterization tests
│  │  If tests pass → commit and continue
│  │  If tests fail → revert and try smaller change
│  └─ Never refactor and change behavior in the same step
│
├─ Step 3: Replace Characterization Tests
│  │  Once code is clean, write proper intention-revealing tests
│  │  The characterization tests served as scaffolding
│  └─ Now you can safely fix behavioral bugs you discovered
│
└─ Step 4: Fix Behavioral Issues (if any)
   Now that you have proper tests and clean code,
   fix any bugs discovered during characterization
   Each fix gets its own test + commit

Writing Effective Characterization Tests

# Strategy: Use the code itself to tell you what to assert

# 1. Call the function with representative inputs
result = process_order(sample_order)

# 2. Print the result
print(result)  # {'total': 108.25, 'tax': 8.25, 'status': 'pending'}

# 3. Assert on the printed output
def test_process_order_characterization():
    result = process_order(sample_order)
    assert result['total'] == 108.25
    assert result['tax'] == 8.25
    assert result['status'] == 'pending'

# 4. Cover edge cases the same way
def test_process_order_empty_items():
    result = process_order(Order(items=[]))
    # Even if this behavior is "wrong", capture it
    assert result['total'] == 0
    assert result['status'] == 'pending'  # maybe should be 'invalid'?

When You Cannot Write Tests First

Sometimes characterization tests are impractical (tightly coupled UI, external service dependencies, time-based logic). Alternatives:

Cannot write characterization tests?
│
├─ Too coupled to test → Extract the testable parts first
│  Use "Sprout Method" or "Sprout Class":
│  1. Write the new logic in a new, testable function
│  2. Call it from the old code
│  3. Test the new function
│  4. Gradually move more logic into testable functions
│
├─ External service dependency → Record and replay
│  Use VCR/Polly/nock to record real responses
│  Replay them in tests
│
├─ UI-heavy → Snapshot/Approval tests
│  Capture screenshots or HTML output
│  Compare against approved baseline
│
└─ Time-based logic → Inject a clock
   Pass a clock/timer as a parameter
   Use a fake clock in tests

Strangler Fig Pattern

For replacing a large legacy system or module incrementally, without a risky big-bang rewrite.

Strangler Fig Strategy
│
├─ Phase 1: Identify boundaries
│  │  Map the legacy system's entry points (API routes, function calls, events)
│  │  Each entry point is a candidate for strangling
│  └─ Prioritize by: risk (low first), value (high first), coupling (loose first)
│
├─ Phase 2: Build new implementation alongside old
│  │  New code lives in a new module/service
│  │  Both old and new exist simultaneously
│  └─ No modification to legacy code yet
│
├─ Phase 3: Route traffic to new implementation
│  │  Use a router/proxy/feature flag to direct requests
│  │  Start with a small percentage (canary)
│  │  Monitor for errors and performance differences
│  └─ Gradually increase percentage
│
├─ Phase 4: Remove legacy code
│  │  Once 100% traffic goes to new implementation
│  │  Keep legacy code for one release cycle (rollback safety)
│  └─ Then delete it
│
└─ Repeat for each entry point

Example: Strangling a Legacy API Endpoint

// Phase 2: New implementation alongside old
// old: /api/v1/users (legacy monolith)
// new: /api/v2/users (new service)

// Phase 3: Router decides which to call
app.get('/api/users', async (req, res) => {
  const useNewImplementation = await featureFlag('new-users-api', {
    userId: req.user?.id,
    percentage: 25,  // Start with 25% of traffic
  });

  if (useNewImplementation) {
    return newUsersService.getUsers(req, res);
  }
  return legacyUsersController.getUsers(req, res);
});

// Phase 4: Once at 100%, simplify
app.get('/api/users', (req, res) => newUsersService.getUsers(req, res));

Parallel Change (Expand-Migrate-Contract)

For changing an interface without breaking consumers. Three phases: expand (add new), migrate (move consumers), contract (remove old).

Parallel Change Phases
│
├─ EXPAND: Add the new interface alongside the old
│  │  Both old and new work simultaneously
│  │  Old interface delegates to new implementation internally
│  └─ All existing tests continue to pass
│
├─ MIGRATE: Update all consumers to use the new interface
│  │  One consumer at a time
│  │  Each migration is a separate commit/PR
│  │  Old interface still works (backward compatible)
│  └─ Monitor for issues after each migration
│
└─ CONTRACT: Remove the old interface
   │  All consumers now use the new interface
   │  Delete old code and update tests
   └─ This is the only "breaking" change

Example: Renaming a Function

# EXPAND: Add new name, keep old as alias
def calculate_shipping_cost(order: Order) -> Money:
    """New name with improved logic."""
    # ... implementation ...

def calcShipping(order: Order) -> Money:
    """Deprecated: Use calculate_shipping_cost instead."""
    import warnings
    warnings.warn("calcShipping is deprecated, use calculate_shipping_cost", DeprecationWarning)
    return calculate_shipping_cost(order)

# MIGRATE: Update all call sites one by one
# grep for calcShipping, replace with calculate_shipping_cost
# Run tests after each file

# CONTRACT: Remove old function
# Delete calcShipping entirely
# Remove deprecation warning

Example: Changing a Database Schema

-- EXPAND: Add new column alongside old
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Application code writes to BOTH columns
-- UPDATE users SET full_name = first_name || ' ' || last_name, ...

-- MIGRATE: Backfill existing data
-- UPDATE users SET full_name = first_name || ' ' || last_name WHERE full_name IS NULL;
-- Update all queries to read from full_name

-- CONTRACT: Remove old columns
-- ALTER TABLE users DROP COLUMN first_name, DROP COLUMN last_name;

Branch by Abstraction

For replacing an internal implementation without feature branches. Introduce an abstraction layer, swap the implementation behind it.

Branch by Abstraction
│
├─ Step 1: Create abstraction (interface/protocol/trait)
│  │  Define the contract that both old and new implementations satisfy
│  └─ All existing code uses the abstraction, not the concrete implementation
│
├─ Step 2: Wrap existing implementation
│  │  Make existing code implement the new abstraction
│  └─ All tests pass -- no behavior change
│
├─ Step 3: Build new implementation
│  │  New implementation also satisfies the abstraction
│  │  Test new implementation independently
│  └─ Old implementation is still the default
│
├─ Step 4: Switch
│  │  Change the wiring to use new implementation
│  │  Feature flag or config toggle for easy rollback
│  └─ Monitor in production
│
└─ Step 5: Clean up
   Remove old implementation
   Remove abstraction if only one implementation remains
   Remove feature flag

Example:

// Step 1: Define abstraction
interface PaymentGateway {
  charge(amount: Money, card: CardInfo): Promise<PaymentResult>;
  refund(paymentId: string, amount: Money): Promise<RefundResult>;
}

// Step 2: Wrap existing implementation
class StripeGateway implements PaymentGateway {
  async charge(amount: Money, card: CardInfo): Promise<PaymentResult> {
    // existing Stripe code, now behind the interface
  }
  async refund(paymentId: string, amount: Money): Promise<RefundResult> {
    // existing Stripe refund code
  }
}

// Step 3: Build new implementation
class SquareGateway implements PaymentGateway {
  async charge(amount: Money, card: CardInfo): Promise<PaymentResult> {
    // new Square implementation
  }
  async refund(paymentId: string, amount: Money): Promise<RefundResult> {
    // new Square refund implementation
  }
}

// Step 4: Switch via configuration
function createPaymentGateway(): PaymentGateway {
  if (config.paymentProvider === 'square') {
    return new SquareGateway();
  }
  return new StripeGateway(); // default/fallback
}

Small Commits Strategy

Every commit during a refactoring must satisfy two invariants:

Code compiles (type-checks, no syntax errors)
All tests pass (no behavioral regressions)

Commit Granularity Guide

Refactoring Commit Patterns
│
├─ Rename → 1 commit
│  "refactor: rename calcShipping to calculateShippingCost"
│
├─ Extract Function → 1 commit
│  "refactor: extract validateOrderItems from processOrder"
│
├─ Move File → 1 commit
│  "refactor: move utils/helpers.ts to lib/string-utils.ts"
│
├─ Extract Class → 2-3 commits
│  1. "refactor: extract PriceCalculator interface"
│  2. "refactor: implement PriceCalculator, delegate from OrderService"
│  3. "refactor: remove pricing logic from OrderService"
│
├─ Replace Algorithm → 2 commits
│  1. "test: add characterization tests for sorting"
│  2. "refactor: replace bubble sort with merge sort"
│
└─ Large Restructure → Many small commits
   Each file move or extraction is its own commit
   Never batch unrelated changes

Git Workflow for Refactoring

# Start a refactoring session
git checkout -b refactor/extract-payment-service

# After each small refactoring step
git add -p  # Stage only the relevant changes
git commit -m "refactor: extract PaymentValidator from PaymentService"

# Verify at each step
npm test  # or pytest, cargo test, go test ./...

# If a step goes wrong, revert just that step
git revert HEAD

# When done, create a clean PR
# Each commit should be reviewable independently

Feature Flags for Gradual Rollout

When a refactoring affects runtime behavior (e.g., new algorithm, new data flow), use feature flags to control rollout.

Feature Flag Strategy
│
├─ Before refactoring
│  │  Add a feature flag that defaults to OFF (old behavior)
│  └─ Deploy the flag infrastructure
│
├─ During refactoring
│  │  New code path guarded by the flag
│  │  Old code path remains the default
│  └─ Both paths are tested
│
├─ Rollout
│  │  Enable for internal users first
│  │  Enable for 1% → 10% → 50% → 100%
│  │  Monitor error rates, latency, correctness
│  └─ Rollback = disable the flag (instant, no deploy needed)
│
└─ Cleanup
   Remove the flag and old code path
   This is a separate PR after the rollout is complete

Implementation Pattern

// Simple feature flag check
async function searchProducts(query: string): Promise<Product[]> {
  if (await featureFlags.isEnabled('new-search-algorithm', { userId })) {
    return newSearchAlgorithm(query);
  }
  return legacySearch(query);
}

Feature Flag Hygiene

Rule	Why
Remove flags within 2 sprints of 100% rollout	Stale flags accumulate and confuse
Name flags descriptively	`new-search-algorithm` not `flag-123`
Log flag evaluations	Debug which path was taken
Test both paths	Both old and new must have coverage
Flag owner documented	Someone must clean up the flag

Approval Testing / Snapshot Testing

Capture the output of existing code and use it as the test assertion. Ideal for characterization testing before refactoring.

How It Works

Approval Testing Flow
│
├─ First run: Capture output → save as "approved" baseline
│  ├─ HTML output → screenshot or HTML snapshot
│  ├─ JSON output → save formatted JSON
│  ├─ Console output → save text
│  └─ API response → save response body
│
├─ Subsequent runs: Compare output against baseline
│  ├─ Match → test passes
│  └─ Mismatch → test fails, show diff
│     ├─ If expected change → approve new baseline
│     └─ If unexpected change → regression, investigate
│
└─ During refactoring: any output change is flagged
   You decide if the change is intentional or a bug

Tools

Language	Tool	Type
JavaScript	Jest snapshots	`expect(result).toMatchSnapshot()`
JavaScript	Storybook Chromatic	Visual regression
Python	pytest-snapshot	`snapshot.assert_match(result)`
Python	Approval Tests	`verify(result)`
Go	go-snaps	`snaps.MatchSnapshot(t, result)`
Rust	insta	`insta::assert_snapshot!(result)`
Any	screenshot comparison	Playwright, Cypress, Percy

Jest Snapshot Example

// Before refactoring: create baseline
test('renders user profile', () => {
  const { container } = render(<UserProfile user={mockUser} />);
  expect(container.innerHTML).toMatchSnapshot();
});

// During refactoring: any HTML change will fail this test
// If the change is intentional:
//   npx jest --updateSnapshot

Python Approval Test Example

from approvaltests import verify

def test_generate_report():
    report = generate_report(sample_data)
    verify(report)  # First run saves "approved" file
                    # Subsequent runs compare against it

Rollback Strategies

Rollback Options (fastest to slowest)
│
├─ Feature flag toggle (seconds)
│  └─ Disable the flag → old code path runs instantly
│     No deployment needed
│
├─ Git revert (minutes)
│  └─ git revert <commit-hash>
│     Creates a new commit that undoes the change
│     Deploy the revert
│
├─ Redeploy previous version (minutes-hours)
│  └─ Roll back to previous container image / release tag
│     CI/CD pipeline handles the rest
│
├─ Database rollback (hours-days)
│  └─ If schema changed: run reverse migration
│     If data changed: restore from backup
│     Most disruptive, avoid if possible
│
└─ Cannot rollback (prevention only)
   Deleted data, sent emails, external API calls
   Design for forward-fix instead

Forward-Fix vs Rollback Decision

Should you rollback or fix forward?
│
├─ Is the bug causing data loss or corruption?
│  └─ ROLLBACK immediately, fix later
│
├─ Is the bug affecting > 10% of users?
│  └─ ROLLBACK, then fix forward on a branch
│
├─ Is the fix obvious and small (< 5 lines)?
│  └─ FIX FORWARD with expedited review
│
├─ Is the bug cosmetic or low-severity?
│  └─ FIX FORWARD in next regular release
│
└─ Are you unsure of the scope?
   └─ ROLLBACK (when in doubt, be safe)

Code Review Checklist for Refactoring PRs

Reviewer Checklist
│
├─ Behavior Preservation
│  [ ] No functional changes mixed with structural changes
│  [ ] Test suite passes (check CI, not just author's word)
│  [ ] Snapshot/approval tests show no unexpected diffs
│  [ ] Public API unchanged (or deprecated properly)
│
├─ Quality of Refactoring
│  [ ] Each commit is atomic and independently valid
│  [ ] Naming improves clarity (not just different)
│  [ ] Abstraction level is appropriate (not over-engineered)
│  [ ] No new duplication introduced
│  [ ] No circular dependencies introduced
│
├─ Safety
│  [ ] Characterization tests exist for changed code
│  [ ] Feature flag or rollback plan documented (if applicable)
│  [ ] Performance-sensitive code benchmarked before/after
│  [ ] No dead code left behind (old implementations removed)
│
└─ Completeness
   [ ] All references updated (imports, configs, docs, tests)
   [ ] Deprecation warnings added for public API changes
   [ ] Migration guide for downstream consumers (if applicable)

Measuring Refactoring Success

Refactoring is an investment. Measure whether it paid off.

Quantitative Metrics

Metric	Before/After	Tool
Cyclomatic complexity	Should decrease	radon, eslint, gocyclo
Cognitive complexity	Should decrease	SonarQube
File length	Should decrease (god files → smaller modules)	tokei, wc -l
Test coverage	Should increase or stay the same	coverage.py, istanbul, tarpaulin
Build time	Should not increase significantly	CI pipeline timing
Bundle size	Should not increase (may decrease with dead code removal)	webpack-bundle-analyzer
Deployment frequency	Should increase (easier to ship)	DORA metrics
Change failure rate	Should decrease (fewer regressions)	DORA metrics

Qualitative Indicators

Signal	Meaning
Fewer merge conflicts in the area	Code is better organized, less contention
New features in the area are faster to build	Reduced coupling and clear boundaries
Fewer bug reports in the area	Cleaner code, better error handling
Team members are less afraid to change the code	Improved testability and readability
Code review comments shift from "I don't understand" to "looks good"	Better naming and structure

Before/After Comparison Template

# Capture BEFORE metrics
echo "=== BEFORE ==="
tokei src/module-to-refactor/          # Line counts
radon cc src/module-to-refactor/ -a    # Cyclomatic complexity (Python)
npx knip --reporter compact            # Unused code (JS/TS)

# ... do the refactoring ...

# Capture AFTER metrics
echo "=== AFTER ==="
tokei src/module-to-refactor/
radon cc src/module-to-refactor/ -a
npx knip --reporter compact

# Compare
# Complexity should go down
# Line count may go up slightly (more files, smaller each)
# Unused code count should go down

Anti-Patterns in Refactoring Methodology

Anti-pattern	Problem	Better Approach
Big-bang rewrite	High risk, nothing works for weeks	Strangler fig: replace incrementally
Refactoring without a goal	Endless polishing, no business value	Define success criteria before starting
Refactoring everything at once	Merge conflicts, hard to review, hard to rollback	One module at a time, one PR at a time
Skipping characterization tests	No safety net, cannot verify behavior preserved	Always capture current behavior first
Mixing refactoring with features	Cannot tell which caused a regression	Separate PRs: refactor first, then add feature
Not measuring improvement	Cannot justify the time investment	Capture before/after metrics
Stopping halfway	Half-old, half-new is worse than either	Plan for completion, or don't start
Over-designing for the future	YAGNI -- you are not going to need it	Refactor for today's needs, not hypothetical future
Refactoring shared library without coordinating consumers	Breaks downstream teams	Parallel change + deprecation period
No rollback plan	Stuck if something goes wrong in production	Always have a path back: feature flag, git revert, or previous deploy

safe-methodology.md 20 KB Permalink History Raw