feat(windows-ops): Add drive-dependencies, recover-clone, disk-health

Substantial refinement pass making windows-ops complete end-to-end
rather than diagnosis-only.

New scripts:
- drive-dependencies.ps1 — the "is it safe to disconnect this drive?"
  check. Audits pagefile, search index, scheduled tasks, services,
  user-profile symlinks, startup shortcuts, run-key entries, and
  volume mount points for references to a target drive. Emits
  SAFE TO DISCONNECT / WARNINGS / DO NOT DISCONNECT verdict.
- recover-clone.ps1 — safe data clone from failing drive. Wraps
  robocopy with /R:0 /W:0 (no retries — critical for failing drives,
  every retry stresses bad sectors). Capacity preflight, failed-files
  log extraction, semantic exit-code remapping from robocopy's
  bitmask. Supports -DryRun, -NoMirror.
- disk-health.ps1 — focused per-drive deep dive. Resolves by
  -DiskNumber, -DriveLetter, or -Model substring. Emits SMART
  (with smartmontools fallback note), full per-event-ID breakdown,
  controller resets, threshold-vs-actual indicators, FAILING /
  WATCHLIST / HEALTHY verdict with specific reasons.

New reference:
- recovery-patterns.md (315 lines) — the "drive is dying, what now"
  playbook. Covers: cardinal rules (never chkdsk /f a failing drive),
  three tiers of data recovery (robocopy R:0 → ddrescue with map
  files → professional cleanroom), filesystem repair decision tree,
  sfc/DISM integrity, BCD/UEFI boot repair via bootrec and bcdboot,
  pagefile relocation off failing drives, drive removal hierarchy
  (software offline → BIOS-disable → physical disconnect → secure
  destruction), and no-boot recovery flow.

Enhancements to existing scripts:
- health-audit.ps1: now checks pagefile location and Windows Search
  index location, flagging if either lives on a failing drive (which
  is the hidden boot-time amplifier we saw this morning). Top CPU
  section now samples 2-second delta to show CURRENT percentage
  instead of misleading accumulated CPU time (was showing
  "claude 662s CPU" for normal long-running processes).
- safe-disable-startup.ps1: covers StartupFolder variant — startup
  folder shortcuts (.lnk files) can now be disabled too via the
  HKCU\...\StartupApproved\StartupFolder overlay. List mode shows
  all entries across Run/Run32/StartupFolder.

Bug fixed during dogfooding:
- drive-dependencies.ps1 had a regex false-positive where "(E):"
  matched lowercase "e:" in URL schemes like "file:" because
  PowerShell -match is case-insensitive. Fixed with anchored regex
  requiring non-alpha character before drive letter, used via a
  Test-DrivePath helper instead of inline -match operators.

All new scripts dogfooded against the same failing-drive workstation
that motivated this skill — drive-dependencies correctly reports Y:
as SAFE TO DISCONNECT (no system refs) while flagging C:'s 9
profile-symlinks and many critical services. disk-health correctly
verdicts the HGST as FAILING with all three threshold breaches
listed (Event 7: 1943, Event 154: 1646, resets: 20).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
0xDarkMatter 2 weeks ago
parent
commit
924ef8ddfe

File diff suppressed because it is too large
+ 38 - 17
skills/windows-ops/SKILL.md


+ 315 - 0
skills/windows-ops/references/recovery-patterns.md

@@ -0,0 +1,315 @@
+# Windows Recovery Patterns
+
+Load this when responding to "my drive is dying, what do I do RIGHT NOW", filesystem-level corruption, boot configuration damage, or system file integrity issues. These are the procedures that have to be right the first time — getting them wrong destroys data.
+
+## Contents
+
+1. [The cardinal rules](#the-cardinal-rules) — what NEVER to do
+2. [Failing-drive data recovery](#failing-drive-data-recovery) — robocopy, ddrescue, vendor tools
+3. [Filesystem repair](#filesystem-repair) — chkdsk semantics, when to use what flag
+4. [System file integrity](#system-file-integrity) — sfc, DISM
+5. [Boot configuration repair](#boot-configuration-repair) — BCD, MBR, bootrec
+6. [Pagefile management](#pagefile-management) — moving pagefile off a failing drive
+7. [Drive removal procedures](#drive-removal-procedures) — offline, physically disconnect, BIOS-disable
+8. [Recovery from no-boot](#recovery-from-no-boot) — Windows RE, Safe Mode, System Restore
+
+## The cardinal rules
+
+These never bend:
+
+1. **Image first, repair second.** When a drive is failing, your priority is getting data OFF it before doing anything that writes TO it. Repair operations write to bad sectors; that finishes a marginal drive faster than any other action.
+
+2. **Never `chkdsk /f` a failing drive.** The `/f` flag writes fixes back to disk. If the drive is throwing hardware errors, every write is potentially the one that kills it. Read-only chkdsk (`chkdsk` with no flags, or explicitly `chkdsk /scan /forceofflinefix`) is OK; anything that writes is not.
+
+3. **Never run `format` or `convert` on a drive you want data from.** Obvious but it gets done in panic.
+
+4. **Don't trust SMART "Healthy"** when event logs are screaming. Windows reports SMART status based on a small handful of attributes; meanwhile the System log can have thousands of Event 7 / 154 hardware errors. The events are the truth.
+
+5. **Don't pound on a failing drive with retries.** Robocopy default is `/R:1000000`. Use `/R:0`. Every retry on a bad sector causes the drive's internal retry-and-relocate logic to run, which stresses both the failing sector and the spare-sector pool.
+
+## Failing-drive data recovery
+
+### Tier 1: Healthy-side clone with robocopy `/R:0`
+
+When the drive is still mostly readable and you can mount it:
+
+```powershell
+robocopy "Y:\important-data" "Z:\rescue\important-data" `
+    /MIR /XJ /COPY:DAT /DCOPY:T `
+    /R:0 /W:0 `
+    /MT:8 `
+    /V /BYTES /NP `
+    /LOG:"$env:TEMP\clone.log" /TEE
+```
+
+Flag breakdown:
+
+| Flag | Effect |
+|------|--------|
+| `/MIR` | Mirror — recursive copy AND delete files at destination that don't exist at source. Use `/E` instead if destination has other content to preserve. |
+| `/XJ` | Skip junction points. Prevents infinite recursion if a junction loops back. |
+| `/COPY:DAT` | Copy Data, Attributes, Timestamps. Skip ACL/Owner (faster, usually unwanted on a recovery target anyway). |
+| `/DCOPY:T` | Also copy directory timestamps. |
+| `/R:0 /W:0` | **Zero retries.** Critical — skip bad sectors fast instead of retrying. |
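+| `/MT:8` | 8 copy threads. Robocopy is single-threaded unless `/MT` is given; 8 is the value `/MT` uses when no number is specified. On an HDD that is seeking badly, consider dropping `/MT` so reads stay sequential. |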
+| `/V` | Verbose log includes which files were skipped — needed for the failed-files list. |
+| `/BYTES /NP` | Cleaner log output for parsing. |
+| `/LOG:path /TEE` | Log to file + console. |
+
+Robocopy exits with a bitmask (>=8 means errors). The skill's `scripts/recover-clone.ps1` wraps this with proper exit-code translation and failed-files extraction.
+
+### Tier 2: Image-level recovery with ddrescue
+
+When the drive has many bad sectors or the filesystem itself is unreliable:
+
+`ddrescue` (GNU ddrescue) reads the raw block device, skips errors, comes back later to retry just the failed regions. Two-pass recovery with a map file makes it resumable across crashes/cable yanks.
+
+Install via WSL:
+```bash
+wsl sudo apt install gddrescue
+```
+
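+Or boot a live Linux USB. If you stay in WSL 2, attach the physical disk first with `wsl --mount \\.\PHYSICALDRIVE<N> --bare` (elevated prompt) so it shows up as `/dev/sdX`.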
+
+First pass — read everything that's easy:
+```bash
+ddrescue -d -r0 /dev/sdX recovery.img mapfile
+```
+- `-d` direct (skip OS buffering)
+- `-r0` zero retries on first pass
+
+Second pass — retry the failed regions, this time aggressively:
+```bash
+ddrescue -d -r3 -R /dev/sdX recovery.img mapfile
+```
+- `-r3` three retries on remaining bad blocks
+- `-R` reverse direction (sometimes recovers what forward couldn't)
+
+Then mount the image and copy files out:
+```bash
+sudo losetup -P -f recovery.img        # Linux
+# (Windows: mount via tools like OSFMount; the image is the raw device)
+```
+
+### Tier 3: Professional data recovery
+
+When the drive has mechanical failure (clicking, not spinning, drive ID lost) — stop touching it. Every power cycle risks more damage. Professional cleanroom recovery (Ontrack, DriveSavers, local equivalents) costs $300-3000 AUD depending on damage, but is the only option for physical-fault drives.
+
+## Filesystem repair
+
+### chkdsk decision tree
+
+| Situation | Command | What it does |
+|-----------|---------|--------------|
+| **Failing drive — DO NOT RUN** | Don't `chkdsk /f` | Writes to disk; can finish off marginal drive |
+| Drive healthy, suspicious files | `chkdsk D:` | Read-only check. Reports problems. No writes. Safe. |
+| Drive healthy, repair-OK | `chkdsk D: /f` | Fixes filesystem errors. Locks volume. |
+| Drive healthy, also fix bad sectors | `chkdsk D: /r` | Implies `/f` + scans every sector + recovers what it can from bad ones. **Days for large drives.** |
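+| Drive healthy, faster repair | `chkdsk D: /spotfix` | Fixes only the issues already logged by an earlier `/scan`. Takes the volume offline briefly instead of for a full pass. |
+| System drive, schedule for next boot | `chkdsk C: /f` | Can't lock C: live; answer `Y` at the prompt to schedule the check for the next boot. |
+| Just scan, defer offline fixes | `chkdsk D: /scan` | Online scan. Repairs what it can while the volume stays mounted and queues the rest for `/spotfix`; add `/forceofflinefix` to skip online repair entirely. |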
+
+### Filesystem corruption signals
+
+NTFS will throw `Ntfs` Event 55 ("A corruption was discovered in the file system structure on volume X") when it spots metadata issues mid-operation. If you see this:
+
+1. Don't ignore it
+2. **First** image the drive (Tier 1 or Tier 2 above)
+3. Then run `chkdsk /scan /forceofflinefix` (scan only, no online repair) to assess
+4. Then decide if repair is safe (depends on hardware state)
+
+### $LogFile / $MFT damage
+
+If chkdsk reports MFT or $LogFile damage, the drive is in a precarious state. Options:
+
+- Clone the drive first (always), then `chkdsk /f` on the clone
+- If the original is failing physically: use `ntfsfix` from Linux (lighter touch than Windows chkdsk; doesn't try to recover bad sectors)
+- Worst case: image the drive and use `TestDisk` to reconstruct partition tables and `PhotoRec` to extract files by signature
+
+## System file integrity
+
+### When Windows system files are corrupt
+
+Symptoms: blue screens during boot, services failing to start, Windows Update broken, `winver` crashes.
+
+Run in this order:
+
+```powershell
+# 1. System File Checker - replaces corrupt protected files from cache
+sfc /scannow
+
+# 2. If sfc reports unfixable corruption, repair its own source (component store)
+DISM /Online /Cleanup-Image /CheckHealth          # quick check
+DISM /Online /Cleanup-Image /ScanHealth           # deeper scan
+DISM /Online /Cleanup-Image /RestoreHealth        # actually repair (uses Windows Update)
+
+# 3. Then re-run sfc
+sfc /scannow
+```
+
+`DISM /RestoreHealth` downloads replacement files from Windows Update, so the machine needs internet and a working WU stack. If WU itself is broken, supply a known-good `install.wim` via `/Source:WIM:D:\sources\install.wim:1`.
+
+### Component store cleanup
+
+Over years the WinSxS component store grows. Reset/cleanup:
+
+```powershell
+DISM /Online /Cleanup-Image /StartComponentCleanup           # standard cleanup
+DISM /Online /Cleanup-Image /StartComponentCleanup /ResetBase  # plus drops update rollback data (saves more but irreversible)
+```
+
+## Boot configuration repair
+
+### When Windows won't boot
+
+Boot to Windows Recovery Environment (Windows RE):
+- Three failed boots automatically trigger RE on Win10/11
+- Or boot from installation USB → "Repair your computer" → Troubleshoot → Advanced Options → Command Prompt
+
+### BCD (Boot Configuration Data) repair
+
+```cmd
+bootrec /fixmbr        :: Repair MBR (legacy BIOS only)
+bootrec /fixboot       :: Write new boot sector to system partition
+bootrec /scanos        :: Scan for Windows installs
+bootrec /rebuildbcd    :: Rebuild BCD store from scratch
+```
+
+If `/fixboot` returns "Access denied" (common on UEFI):
+
+```cmd
+:: Find the EFI partition and rebuild bootloader
+diskpart
+list volume
+select volume <EFI partition number>     :: Usually ~100 MB, FAT32
+assign letter=Z
+exit
+bcdboot C:\Windows /s Z: /f UEFI         :: Recreate UEFI boot files
+```
+
+### Drive enumeration changed → BCD points at wrong disk
+
+Symptom: BSOD `0x7B` (INACCESSIBLE_BOOT_DEVICE) after hardware change. The BCD references the system drive by device path; if SATA ports rearranged or you added an NVMe, the path may be stale.
+
+```cmd
+bcdedit /enum                          :: Show current BCD entries
+bcdedit /set {default} device boot     :: Reset to logical "boot"
+bcdedit /set {default} osdevice boot
+```
+
+## Pagefile management
+
+### Moving pagefile off a failing drive
+
+If a failing drive hosts (part of) the pagefile, Windows keeps reading and writing it under memory pressure, which accelerates drive failure and risks BSOD `0x7A` KERNEL_DATA_INPAGE_ERROR when a paged-out page can't be read back.
+
+```powershell
+# Find current pagefile location(s)
+Get-CimInstance Win32_PageFileSetting
+
+# Remove pagefile from a specific drive (requires admin + reboot)
+$pf = Get-CimInstance Win32_PageFileSetting | Where-Object { $_.Name -like 'Y:*' }
+$pf | Remove-CimInstance
+
+# Or relocate: set on a healthy drive first, then remove from failing
+$newPf = New-CimInstance -ClassName Win32_PageFileSetting -Property @{
+    Name = 'C:\pagefile.sys'
+    InitialSize = 0    # 0 = system managed
+    MaximumSize = 0
+}
+```
+
+Changes apply at next reboot. If the failing drive can't be cleanly removed (it's needed at boot for some reason), at minimum reduce its pagefile to 16 MB minimum, 16 MB maximum to limit damage.
+
+### Pagefile sizing for crash dumps
+
+For a complete memory dump on Win11, the pagefile on the system drive must be at least as large as installed RAM, plus a small allowance for the dump header (or `DedicatedDumpFile` configured). For minidumps, ≥256 MB is enough. System-managed sizing handles this automatically.
+
+## Drive removal procedures
+
+When you've decided to take a drive offline (failing, replacing, decommissioning), there's a hierarchy from least to most invasive:
+
+### Software-only (drive stays plugged in)
+
+```powershell
+# Take drive offline. The offline state persists across reboots until you run `online disk`.
+diskpart
+DISKPART> select disk N
+DISKPART> offline disk
+DISKPART> exit
+```
+
+Useful when:
+- Drive will be physically disconnected at next shutdown
+- You want to test that nothing depends on it (apps that need it will error out, surfacing dependencies)
+- Quick reversibility — `online disk` brings it back
+
+### BIOS-disable (drive stays plugged in but firmware skips it)
+
+Reboot, enter BIOS, find storage configuration, disable the specific SATA port or NVMe slot. Use when:
+- Drive is causing boot stalls (Windows-side `offline` doesn't help boot time)
+- You don't want to open the case yet
+- Reversible without disassembly
+
+### Physical disconnect
+
+The complete solution. SATA: unplug data cable (power cable can stay). NVMe: remove the retaining screw and lift the drive out of the slot. Use when:
+- Drive is causing crashes (any contact with it is a risk)
+- You're done with it permanently
+- Boot performance still bad after BIOS disable (rare but possible)
+
+### Drive destruction (for sensitive data)
+
+Don't trust `format` or even `cipher /w:` on a failing drive — bad sectors may retain readable data. For sensitive data on a drive being decommissioned:
+
+- **HDD**: physical destruction (drill press through platters, or pay a shredding service)
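+- **SSD**: `cipher /w:Y:\` overwrites only free space (delete the files first), and wear-leveling can leave stale copies in blocks the OS can't address, so treat it as best-effort even on a healthy SSD; for failing SSDs, physical destruction is the only reliable path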
+
+ATA Secure Erase (`hdparm --security-erase` from Linux, or vendor tools like Samsung Magician) works on healthy SSDs but may hang on failing drives.
+
+## Recovery from no-boot
+
+### Boot sequence triage
+
+When Windows won't boot, work the layers:
+
+| Symptom | Where it failed | First step |
+|---------|----------------|------------|
+| No POST, no fans, no LEDs | Power supply or motherboard | Check power, PSU |
+| POST but no boot device found | Drive or BIOS settings | Check boot order; check drive is detected in BIOS |
+| "Inaccessible boot device" (Win logo then crash) | BCD or boot driver | Boot to RE → `bootrec /scanos` then `/rebuildbcd` |
+| Spinning dots forever | Driver hang or filesystem | Boot to RE → Startup Repair, then `chkdsk /scan` |
+| Login screen reached but crash | User-mode driver/service | Safe Mode → identify recently changed driver |
+| Login OK but desktop missing | Shell / profile issue | Safe Mode → check `userinit.exe` registration |
+
+### Safe Mode access
+
+- **From login screen**: hold Shift while clicking Restart → Troubleshoot → Advanced → Startup Settings
+- **From three failed boots**: WinRE auto-triggers
+- **From running Windows**: `msconfig` → Boot tab → Safe boot (revert after diagnosing!)
+
+Once in Safe Mode, common moves:
+1. Roll back last driver (Device Manager → driver properties → Roll Back)
+2. Disable suspect startup item (`scripts/safe-disable-startup.ps1` works in Safe Mode too)
+3. System Restore to a known-good point
+4. Run `sfc /scannow` and `DISM /Online /Cleanup-Image /RestoreHealth`
+
+### System Restore from WinRE
+
+```
+Troubleshoot → Advanced Options → System Restore
+```
+
+Picks a restore point and rolls back system files + registry + drivers (NOT personal data). Effective against recent driver/update issues. Useless if no restore points exist (Win10/11 sometimes turn off System Protection by default).
+
+## When to escalate
+
+Time to call professional data recovery:
+
+- Drive doesn't show up in BIOS at all
+- Drive makes clicking, grinding, or scraping sounds
+- SMART status reports "Pred. Failure" AND the drive vanished mid-use
+- ddrescue can't make forward progress (reading at <1 MB/min for hours)
+- You opened the drive (you don't have a cleanroom; you just killed it)
+
+Costs range from $300 (logical recovery: bad sectors but PCB intact) to $3000+ (head transplant, platter swap). Always get a quote before committing; no-recovery-no-fee outfits exist.

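Review note on the Tier 1 section of recovery-patterns.md: the robocopy exit-code bitmask it mentions can be decoded roughly as follows. This is a minimal sketch, not the code in `recover-clone.ps1` (that script is not shown in this diff). The function name `ConvertFrom-RobocopyExitCode` and the `ok` / `partial` / `fatal` labels are made up for illustration; the bit meanings are robocopy's documented ones.

```powershell
# Hypothetical helper for illustration; not taken from recover-clone.ps1.
function ConvertFrom-RobocopyExitCode {
    param([Parameter(Mandatory)][int]$Code)

    # Robocopy's documented exit-code bits.
    $bits = @(
        @{ Bit = 1;  Meaning = 'files copied' }
        @{ Bit = 2;  Meaning = 'extra files or directories at destination' }
        @{ Bit = 4;  Meaning = 'mismatched files or directories' }
        @{ Bit = 8;  Meaning = 'some files or directories could not be copied' }
        @{ Bit = 16; Meaning = 'fatal error' }
    )

    # Collect the meaning of every bit that is set in the exit code.
    $flags = @($bits | Where-Object { $Code -band $_.Bit } | ForEach-Object { $_.Meaning })

    # Bit 16 outranks bit 8; anything below 8 is a successful run.
    $severity = if ($Code -band 16) { 'fatal' } elseif ($Code -band 8) { 'partial' } else { 'ok' }

    [pscustomobject]@{ Code = $Code; Severity = $severity; Flags = $flags }
}

# Exit code 9 = 8 + 1: some files copied, some failed.
ConvertFrom-RobocopyExitCode -Code 9
```

With `/R:0` on a failing drive, `partial` is the expected outcome rather than a reason to abort: the clone finished and the failed files are the ones to chase in Tier 2.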
+ 284 - 0
skills/windows-ops/scripts/disk-health.ps1

@@ -0,0 +1,284 @@
+<#
+.SYNOPSIS
+    Focused per-drive health report — every diagnostic signal for one
+    specific physical disk in one report.
+
+.DESCRIPTION
+    Drill-down companion to health-audit.ps1. Targets a single physical
+    disk (by number, drive letter, or model substring) and emits:
+
+      - Hardware identification (model, serial, firmware, capacity)
+      - SMART reliability counters (Windows native + smartctl if installed)
+      - All disk-provider events for the disk over the time window
+      - All storahci controller resets (system-wide total; the caller correlates port to drive)
+      - Per-event-ID breakdown with severity classification
+      - Recovery clues — failing-LBA distribution, time-clustering
+      - System dependencies — quick summary (uses drive-dependencies.ps1
+        if available, else inline check)
+
+.PARAMETER DiskNumber
+    Physical disk number from Get-Disk. Mutually exclusive with -DriveLetter
+    and -Model.
+
+.PARAMETER DriveLetter
+    Drive letter — resolves to the underlying physical disk.
+
+.PARAMETER Model
+    Model substring match (e.g. 'HGST', '980 PRO'). Picks the first match.
+
+.PARAMETER Days
+    Days back to scan event logs. Default: 60.
+
+.PARAMETER Json
+    Machine-readable JSON output.
+
+.EXAMPLE
+    scripts/disk-health.ps1 -DiskNumber 1
+    Focused report on physical disk 1.
+
+.EXAMPLE
+    scripts/disk-health.ps1 -DriveLetter Y -Days 30
+    Drill on the disk that hosts Y:, 30-day window.
+
+.EXAMPLE
+    scripts/disk-health.ps1 -Model 'HGST' -Json | jq '.eventCounts'
+    Find the HGST drive and dump its error counts as JSON.
+
+.NOTES
+    Exit codes:
+      0 success — drive looks healthy
+      2 usage (no -DiskNumber, -DriveLetter, or -Model given)
+      3 not found — no matching disk
+      4 validation — drive shows failure indicators
+#>
+
+[CmdletBinding(DefaultParameterSetName='Number')]
+param(
+    [Parameter(ParameterSetName='Number', Position=0)][ValidateRange(0, 99)][int]$DiskNumber = -1,
+    [Parameter(ParameterSetName='Letter')][ValidatePattern('^[A-Za-z]$')][string]$DriveLetter,
+    [Parameter(ParameterSetName='Model')][string]$Model,
+    [ValidateRange(1, 365)][int]$Days = 60,
+    [switch]$Json
+)
+
+$ErrorActionPreference = 'Stop'
+. "$PSScriptRoot\_lib\common.ps1"
+
+# Resolve target disk
+$disks = Get-DiskMap
+$target = $null
+switch ($PSCmdlet.ParameterSetName) {
+    'Number' {
+        if ($DiskNumber -lt 0) {
+            Write-Log -Level FAIL -Message "Provide -DiskNumber, -DriveLetter, or -Model"
+            exit $script:EXIT_USAGE
+        }
+        $target = $disks | Where-Object { $_.Number -eq $DiskNumber } | Select-Object -First 1
+    }
+    'Letter' {
+        $L = $DriveLetter.ToUpper()
+        $part = Get-Partition -ErrorAction SilentlyContinue | Where-Object { $_.DriveLetter -eq $L } | Select-Object -First 1
+        if ($part) {
+            $target = $disks | Where-Object { $_.Number -eq $part.DiskNumber } | Select-Object -First 1
+        }
+    }
+    'Model' {
+        $target = $disks | Where-Object { $_.Model -like "*$Model*" } | Select-Object -First 1
+    }
+}
+
+if (-not $target) {
+    Write-Log -Level FAIL -Message "No matching disk found"
+    exit $script:EXIT_NOT_FOUND
+}
+
+# Collect data
+$result = [ordered]@{
+    diskNumber       = $target.Number
+    model            = $target.Model
+    serial           = $target.SerialNumber
+    firmware         = $target.FirmwareVersion
+    mediaType        = $target.MediaType
+    busType          = $target.BusType
+    sizeGB           = $target.SizeGB
+    driveLetters     = $target.DriveLetters
+    healthStatus     = $target.HealthStatus
+    windowDays       = $Days
+    smart            = $null
+    eventCounts      = @{}
+    eventSamples     = @()
+    storahciResets   = 0
+    verdict          = 'unknown'
+    indicators       = @()
+}
+
+# SMART reliability counter (Windows native)
+try {
+    $physical = Get-PhysicalDisk | Where-Object { $_.DeviceId -eq $target.Number }
+    $rel = $physical | Get-StorageReliabilityCounter -ErrorAction SilentlyContinue
+    if ($rel) {
+        $result.smart = @{
+            temperatureC   = $rel.Temperature
+            temperatureMax = $rel.TemperatureMax
+            wearPct        = $rel.Wear
+            readErrors     = $rel.ReadErrorsTotal
+            writeErrors    = $rel.WriteErrorsTotal
+            powerOnHours   = $rel.PowerOnHours
+            powerCycles    = $rel.PowerCycleCount
+            startStops     = $rel.StartStopCycleCount
+        }
+    }
+} catch {}
+
+# smartctl fallback (if smartmontools installed)
+$smartctl = Get-Command smartctl.exe -ErrorAction SilentlyContinue
+if ($smartctl -and -not $result.smart) {
+    try {
+        # smartmontools on Windows addresses physical drive N as /dev/pdN
+        $smartOutput = & smartctl -A "/dev/pd$($target.Number)" 2>&1
+        if ($smartOutput) {
+            $result.smartctlAvailable = $true
+            $result.smartctlOutput = ($smartOutput -join "`n")
+        }
+    } catch {}
+}
+
+# Disk-provider events for this disk
+try {
+    $diskErrs = Get-WinEvent -FilterHashtable @{
+        LogName='System'
+        ProviderName='disk'
+        StartTime=(Get-Date).AddDays(-$Days)
+    } -ErrorAction SilentlyContinue
+    foreach ($e in $diskErrs) {
+        $n = $null
+        if     ($e.Message -match 'Harddisk(\d+)')      { $n = [int]$matches[1] }
+        elseif ($e.Message -match '\bfor Disk (\d+)\b') { $n = [int]$matches[1] }
+        if ($n -ne $target.Number) { continue }
+        $id = "$($e.Id)"
+        if ($result.eventCounts.ContainsKey($id)) {
+            $result.eventCounts[$id] = $result.eventCounts[$id] + 1
+        } else {
+            $result.eventCounts[$id] = 1
+        }
+        if ($result.eventSamples.Count -lt 5) {
+            $result.eventSamples += @{
+                time     = $e.TimeCreated.ToString('o')
+                id       = $e.Id
+                message  = (Format-EventMessage -Message $e.Message -MaxLength 150)
+            }
+        }
+    }
+} catch {}
+
+# storahci resets (controller-level; we can't always tie a port to a specific
+# disk number reliably, so report total reset count and let caller correlate
+# via drive enumeration order)
+try {
+    $resets = Get-WinEvent -FilterHashtable @{
+        LogName='System'
+        ProviderName='storahci'
+        Id=129
+        StartTime=(Get-Date).AddDays(-$Days)
+    } -ErrorAction SilentlyContinue
+    $result.storahciResets = if ($resets) { $resets.Count } else { 0 }
+} catch {}
+
+# Severity classification
+$isSsd = $target.MediaType -eq 'SSD'
+$ev7   = if ($result.eventCounts.ContainsKey('7'))   { $result.eventCounts['7']   } else { 0 }
+$ev51  = if ($result.eventCounts.ContainsKey('51'))  { $result.eventCounts['51']  } else { 0 }
+$ev154 = if ($result.eventCounts.ContainsKey('154')) { $result.eventCounts['154'] } else { 0 }
+
+$thresholds = if ($isSsd) {
+    @{ event7=10; event154=5; event51=5 }
+} else {
+    @{ event7=50; event154=10; event51=5 }
+}
+
+$failing = (
+    $ev7   -gt $thresholds.event7   -or
+    $ev154 -gt $thresholds.event154 -or
+    $ev51  -gt $thresholds.event51  -or
+    $result.storahciResets -gt 5
+)
+$watch = (
+    $ev7   -gt 5 -or
+    $ev154 -gt 2 -or
+    $result.storahciResets -gt 0
+)
+
+if ($failing) {
+    $result.verdict = 'FAILING'
+    if ($ev7   -gt $thresholds.event7)   { $result.indicators += "Event 7 (bad block): $ev7 > $($thresholds.event7) threshold" }
+    if ($ev154 -gt $thresholds.event154) { $result.indicators += "Event 154 (hw error): $ev154 > $($thresholds.event154) threshold" }
+    if ($ev51  -gt $thresholds.event51)  { $result.indicators += "Event 51 (paging error): $ev51 > $($thresholds.event51) threshold" }
+    if ($result.storahciResets -gt 5)    { $result.indicators += "Controller resets: $($result.storahciResets) > 5 threshold" }
+} elseif ($watch) {
+    $result.verdict = 'WATCHLIST'
+    if ($ev7   -gt 5) { $result.indicators += "Event 7 elevated: $ev7" }
+    if ($ev154 -gt 2) { $result.indicators += "Event 154 elevated: $ev154" }
+    if ($result.storahciResets -gt 0) { $result.indicators += "Controller resets: $($result.storahciResets)" }
+} else {
+    $result.verdict = 'HEALTHY'
+}
+
+# Output
+if ($Json) {
+    [Console]::Out.WriteLine(($result | ConvertTo-Json -Depth 5))
+} else {
+    Write-Section "Disk $($target.Number): $($target.Model)"
+    [Console]::Out.WriteLine("  Type:     $($target.MediaType) / $($target.BusType)")
+    [Console]::Out.WriteLine("  Capacity: $($target.SizeGB) GB")
+    [Console]::Out.WriteLine("  Firmware: $($target.FirmwareVersion)")
+    [Console]::Out.WriteLine("  Serial:   $($target.SerialNumber)")
+    [Console]::Out.WriteLine("  Letters:  $($target.DriveLetters)")
+    [Console]::Out.WriteLine("  Reports:  $($target.HealthStatus)")
+    [Console]::Out.WriteLine("")
+    if ($result.smart) {
+        Write-Section "SMART reliability counters"
+        [Console]::Out.WriteLine("  Temp:     $($result.smart.temperatureC) C (max: $($result.smart.temperatureMax) C)")
+        [Console]::Out.WriteLine("  Wear:     $($result.smart.wearPct)%")
+        [Console]::Out.WriteLine("  Read err: $($result.smart.readErrors)  Write err: $($result.smart.writeErrors)")
+        [Console]::Out.WriteLine("  Hours:    $($result.smart.powerOnHours)  Cycles: $($result.smart.powerCycles)")
+    } else {
+        [Console]::Out.WriteLine("  SMART:    (Windows reliability counter unavailable for this drive)")
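+        if ($result.Contains('smartctlOutput')) {
+            # The smartctl fallback ran; show what it returned instead of reporting a failure
+            [Console]::Out.WriteLine("            smartctl output:")
+            [Console]::Out.WriteLine($result.smartctlOutput)
+        } elseif ($smartctl) {
+            [Console]::Out.WriteLine("            smartctl installed but call failed — try: smartctl -A /dev/sdX")
+        } else {
+            [Console]::Out.WriteLine("            Install smartmontools for SMART access: scoop install smartmontools")
+        }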
+    }
+
+    Write-Section "Disk events ($Days days)"
+    if ($result.eventCounts.Count -eq 0) {
+        [Console]::Out.WriteLine("  No disk events for this disk in window.")
+    } else {
+        $result.eventCounts.GetEnumerator() | Sort-Object { [int]$_.Key } | ForEach-Object {
+            [Console]::Out.WriteLine("  Event $($_.Key):  $($_.Value) occurrences")
+        }
+    }
+    [Console]::Out.WriteLine("")
+    [Console]::Out.WriteLine("  Controller resets (storahci 129): $($result.storahciResets) over $Days days")
+
+    Write-Section "VERDICT: $($result.verdict)"
+    if ($result.indicators) {
+        foreach ($i in $result.indicators) {
+            [Console]::Out.WriteLine("  - $i")
+        }
+    }
+    [Console]::Out.WriteLine("")
+    switch ($result.verdict) {
+        'FAILING' {
+            [Console]::Out.WriteLine("  Recommended: back up data, run drive-dependencies.ps1, then replace.")
+        }
+        'WATCHLIST' {
+            [Console]::Out.WriteLine("  Recommended: back up irreplaceable data, monitor weekly.")
+        }
+        'HEALTHY' {
+            [Console]::Out.WriteLine("  Recommended: no action needed.")
+        }
+    }
+    [Console]::Out.WriteLine("")
+}
+
+if ($result.verdict -eq 'FAILING') { exit $script:EXIT_VALIDATION }
+exit $script:EXIT_OK

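Review note on disk-health.ps1: the verdict logic is easier to check as a pure function. The sketch below restates the thresholds from the script in a hypothetical `Get-DiskVerdict` function (the name and parameters are not part of the commit) and feeds it the counts quoted in the commit message.

```powershell
# Hypothetical restatement of the script's thresholds; not part of the commit.
function Get-DiskVerdict {
    param(
        [int]$Event7   = 0,   # disk Event 7 (bad block) count
        [int]$Event51  = 0,   # disk Event 51 (paging error) count
        [int]$Event154 = 0,   # disk Event 154 (hardware error) count
        [int]$Resets   = 0,   # storahci Event 129 count
        [switch]$IsSsd
    )

    # SSDs get tighter limits than HDDs, as in the script.
    $t = if ($IsSsd) {
        @{ event7 = 10; event154 = 5; event51 = 5 }
    } else {
        @{ event7 = 50; event154 = 10; event51 = 5 }
    }

    if ($Event7 -gt $t.event7 -or $Event154 -gt $t.event154 -or
        $Event51 -gt $t.event51 -or $Resets -gt 5) {
        return 'FAILING'
    }
    if ($Event7 -gt 5 -or $Event154 -gt 2 -or $Resets -gt 0) {
        return 'WATCHLIST'
    }
    return 'HEALTHY'
}

Get-DiskVerdict -Event7 1943 -Event154 1646 -Resets 20   # FAILING (the HGST figures)
Get-DiskVerdict -Event7 8                                # WATCHLIST
Get-DiskVerdict -Event51 3                               # HEALTHY
```

Two properties of the original logic stand out in this form: Event 51 has no WATCHLIST threshold, and the reset count is a system-wide total, so more than five resets marks whichever disk is being inspected as FAILING.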
+ 291 - 0
skills/windows-ops/scripts/drive-dependencies.ps1

@@ -0,0 +1,291 @@
+<#
+.SYNOPSIS
+    Find every system mechanism referencing a target drive letter or
+    disk number. The "is it safe to disconnect?" check.
+
+.DESCRIPTION
+    Before physically removing a failing drive (or setting it Offline),
+    audit what's pointing at it: pagefile location, Windows Search index,
+    scheduled tasks, services, user-profile symlinks/junctions, startup
+    folder shortcuts, mounted volume mount points, and any drive
+    references in the Windows Run keys.
+
+    Default output is a human-readable table. -Json emits structured.
+
+    Exit codes:
+      0 success
+      2 usage
+      3 not found (no such drive)
+
+.PARAMETER DriveLetter
+    Single drive letter (e.g. 'Y'). Case-insensitive.
+
+.PARAMETER DiskNumber
+    Physical disk number (from Get-Disk). The script resolves all drive
+    letters mounted on that disk and checks each.
+
+.PARAMETER Json
+    Machine-readable JSON output.
+
+.EXAMPLE
+    scripts/drive-dependencies.ps1 -DriveLetter Y
+    Audit all system references to Y: drive.
+
+.EXAMPLE
+    scripts/drive-dependencies.ps1 -DiskNumber 1
+    Audit all references to drive letters on physical disk 1.
+
+.EXAMPLE
+    scripts/drive-dependencies.ps1 -DriveLetter Y -Json | jq '.dependencies[]'
+    Machine-readable output for downstream tooling.
+
+.NOTES
+    Output verdict at end:
+      SAFE TO DISCONNECT — no critical references found
+      WARNINGS — some references found but none boot-critical
+      DO NOT DISCONNECT — boot-critical reference (pagefile, system, etc.)
+#>
+
+[CmdletBinding(DefaultParameterSetName='Letter')]
+param(
+    [Parameter(Mandatory, ParameterSetName='Letter', Position=0)]
+    [ValidatePattern('^[A-Za-z]$')]
+    [string]$DriveLetter,
+
+    [Parameter(Mandatory, ParameterSetName='Number')]
+    [ValidateRange(0, 99)]
+    [int]$DiskNumber,
+
+    [switch]$Json
+)
+
+$ErrorActionPreference = 'Stop'
+. "$PSScriptRoot\_lib\common.ps1"
+
+# Resolve target drive letter(s)
+if ($PSCmdlet.ParameterSetName -eq 'Number') {
+    $parts = Get-Partition -DiskNumber $DiskNumber -ErrorAction SilentlyContinue
+    if (-not $parts) {
+        Write-Log -Level FAIL -Message "No partitions found on disk $DiskNumber"
+        exit $script:EXIT_NOT_FOUND
+    }
+    $targetLetters = @($parts | Where-Object { $_.DriveLetter } | ForEach-Object { "$($_.DriveLetter)" })
+    if (-not $targetLetters) {
+        Write-Log -Level WARN -Message "Disk $DiskNumber has no mounted drive letters (still audit-worthy for system-volume refs)"
+        $targetLetters = @()
+    }
+} else {
+    $targetLetters = @($DriveLetter.ToUpper())
+    # Verify the drive exists
+    if (-not (Get-PSDrive -PSProvider FileSystem -Name $DriveLetter.ToUpper() -ErrorAction SilentlyContinue)) {
+        Write-Log -Level WARN -Message "Drive ${DriveLetter}: not currently mounted — auditing references anyway"
+    }
+}
+
+# Build a drive-letter regex that doesn't false-positive on URL schemes
+# (e.g. the 'e:' in 'file:'). Require the letter to be either at string
+# start, or preceded by a non-alpha character, and followed by `:\` or `:/`.
+$letterPattern = if ($targetLetters) {
+    $letters = ($targetLetters | ForEach-Object { [regex]::Escape($_) }) -join '|'
+    "(?:^|[^A-Za-z])($letters):[\\/]"
+} else { '__NOMATCH__' }
+
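+# Match case-insensitively: registry and task paths often spell the drive
+# letter in lowercase (e.g. 'd:\tools'). The non-alpha guard in the pattern
+# already rules out the 'e:' in 'file:', so case sensitivity adds nothing.
+function Test-DrivePath {
+    param([string]$Text)
+    if (-not $Text) { return $false }
+    return [regex]::IsMatch($Text, $letterPattern, [System.Text.RegularExpressions.RegexOptions]::IgnoreCase)
+}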
+
+$findings = New-Object System.Collections.Generic.List[hashtable]
+
+function Add-Dependency {
+    param(
+        [Parameter(Mandatory)][string]$Category,
+        [Parameter(Mandatory)][string]$Name,
+        [Parameter(Mandatory)][string]$Target,
+        [Parameter(Mandatory)][ValidateSet('critical','warn','info')]$Severity
+    )
+    $findings.Add(@{ category=$Category; name=$Name; target=$Target; severity=$Severity })
+}
+
+if (-not $Json) {
+    Write-Section "Drive dependency audit: $($targetLetters -join ', ')"
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# 1. Pagefile location
+# ─────────────────────────────────────────────────────────────────────
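+try {
+    # Win32_PageFileSetting is empty when Windows manages the pagefile
+    # automatically, so also read Win32_PageFileUsage (pagefiles in use).
+    $pagefiles = @(Get-CimInstance Win32_PageFileSetting -ErrorAction SilentlyContinue) +
+                 @(Get-CimInstance Win32_PageFileUsage -ErrorAction SilentlyContinue)
+    foreach ($name in ($pagefiles | ForEach-Object { $_.Name } | Sort-Object -Unique)) {
+        if (Test-DrivePath $name) {
+            Add-Dependency -Category 'pagefile' -Name $name -Target $name -Severity 'critical'
+        }
+    }
+} catch {}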
+
+# ─────────────────────────────────────────────────────────────────────
+# 2. Windows Search index data directory
+# ─────────────────────────────────────────────────────────────────────
+try {
+    $idxDir = (Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows Search' -Name DataDirectory -ErrorAction SilentlyContinue).DataDirectory
+    if (Test-DrivePath $idxDir) {
+        Add-Dependency -Category 'search-index' -Name 'Windows.edb' -Target $idxDir -Severity 'warn'
+    }
+} catch {}
+
+# ─────────────────────────────────────────────────────────────────────
+# 3. Windows Search indexed scopes (paths in the crawl scope)
+# ─────────────────────────────────────────────────────────────────────
+try {
+    $scopeKey = 'HKLM:\SOFTWARE\Microsoft\Windows Search\CrawlScopeManager\Windows\SystemIndex\WorkingSetRules'
+    if (Test-Path $scopeKey) {
+        Get-ChildItem $scopeKey -ErrorAction SilentlyContinue | ForEach-Object {
+            $url = (Get-ItemProperty $_.PSPath -Name URL -ErrorAction SilentlyContinue).URL
+            if (Test-DrivePath $url) {
+                Add-Dependency -Category 'search-scope' -Name 'Indexed path' -Target $url -Severity 'warn'
+            }
+        }
+    }
+} catch {}
+
+# ─────────────────────────────────────────────────────────────────────
+# 4. Scheduled tasks
+# ─────────────────────────────────────────────────────────────────────
+try {
+    Get-ScheduledTask -ErrorAction SilentlyContinue | ForEach-Object {
+        $task = $_
+        foreach ($action in $task.Actions) {
+            $strs = @($action.Execute, $action.Arguments, $action.WorkingDirectory) -join ' '
+            if (Test-DrivePath $strs) {
+                Add-Dependency -Category 'scheduled-task' -Name $task.TaskName -Target ($strs.Trim()) -Severity 'warn'
+                break
+            }
+        }
+    }
+} catch {}
+
+# ─────────────────────────────────────────────────────────────────────
+# 5. Services with binary path on target drive
+# ─────────────────────────────────────────────────────────────────────
+try {
+    Get-CimInstance Win32_Service -ErrorAction SilentlyContinue | ForEach-Object {
+        if (Test-DrivePath $_.PathName) {
+            $sev = if ($_.StartMode -eq 'Auto') { 'critical' } else { 'warn' }
+            Add-Dependency -Category 'service' -Name $_.Name -Target $_.PathName -Severity $sev
+        }
+    }
+} catch {}
+
+# ─────────────────────────────────────────────────────────────────────
+# 6. User profile symlinks/junctions pointing at target
+# ─────────────────────────────────────────────────────────────────────
+try {
+    Get-ChildItem $env:USERPROFILE -Force -ErrorAction SilentlyContinue |
+        Where-Object { $_.Attributes -band [System.IO.FileAttributes]::ReparsePoint } |
+        ForEach-Object {
+            if ($_.Target -and (Test-DrivePath ($_.Target -join ' '))) {
+                Add-Dependency -Category 'profile-symlink' -Name $_.Name -Target ($_.Target -join '; ') -Severity 'warn'
+            }
+        }
+} catch {}
+
+# ─────────────────────────────────────────────────────────────────────
+# 7. Startup folder shortcuts targeting drive
+# ─────────────────────────────────────────────────────────────────────
+try {
+    $shell = New-Object -ComObject WScript.Shell
+    foreach ($d in @("$env:APPDATA\Microsoft\Windows\Start Menu\Programs\Startup",
+                     "$env:ALLUSERSPROFILE\Microsoft\Windows\Start Menu\Programs\StartUp")) {
+        if (Test-Path $d) {
+            Get-ChildItem $d -Filter *.lnk -ErrorAction SilentlyContinue | ForEach-Object {
+                $sc = $shell.CreateShortcut($_.FullName)
+                $combined = @($sc.TargetPath, $sc.WorkingDirectory, $sc.Arguments) -join ' '
+                if (Test-DrivePath $combined) {
+                    Add-Dependency -Category 'startup-shortcut' -Name $_.Name -Target $sc.TargetPath -Severity 'warn'
+                }
+            }
+        }
+    }
+} catch {}
+
+# ─────────────────────────────────────────────────────────────────────
+# 8. Registry Run-key entries pointing at drive
+# ─────────────────────────────────────────────────────────────────────
+$runPaths = @(
+    'HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run',
+    'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run',
+    'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Run'
+)
+foreach ($p in $runPaths) {
+    if (Test-Path $p) {
+        (Get-ItemProperty $p -ErrorAction SilentlyContinue).PSObject.Properties |
+            Where-Object { $_.Name -notmatch '^PS' -and (Test-DrivePath $_.Value) } |
+            ForEach-Object {
+                Add-Dependency -Category 'run-key' -Name $_.Name -Target $_.Value -Severity 'warn'
+            }
+    }
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# 9. Volume mount points (a folder on C: that mounts the target volume)
+# ─────────────────────────────────────────────────────────────────────
+try {
+    $partitions = Get-Partition -ErrorAction SilentlyContinue | Where-Object {
+        $_.DriveLetter -and $targetLetters -contains "$($_.DriveLetter)"
+    }
+    foreach ($p in $partitions) {
+        $vol = Get-Volume -Partition $p -ErrorAction SilentlyContinue
+        if ($vol -and $vol.AccessPaths) {
+            foreach ($path in $vol.AccessPaths) {
+                # Use $(...) here: ${...} names a variable and would expand to an empty string.
+                if ($path -match '^[A-Z]:\\' -and $path -notmatch "^$($p.DriveLetter):") {
+                    Add-Dependency -Category 'mount-point' -Name "$($p.DriveLetter): mounted at" -Target $path -Severity 'warn'
+                }
+            }
+        }
+    }
+} catch {}
+
+# ─────────────────────────────────────────────────────────────────────
+# Output
+# ─────────────────────────────────────────────────────────────────────
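+# Wrap in @() so a single match counts as 1: .Count on a lone hashtable
+# would return its number of keys instead.
+$criticalCount = @($findings | Where-Object { $_.severity -eq 'critical' }).Count
+$warnCount     = @($findings | Where-Object { $_.severity -eq 'warn' }).Count
+$infoCount     = @($findings | Where-Object { $_.severity -eq 'info' }).Count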
+
+$verdict = if ($criticalCount -gt 0) {
+    'DO NOT DISCONNECT — boot-critical references found'
+} elseif ($warnCount -gt 0) {
+    'WARNINGS — some references found; review before disconnecting'
+} else {
+    'SAFE TO DISCONNECT — no system dependencies on this drive'
+}
+
+if ($Json) {
+    @{
+        targetLetters = $targetLetters
+        dependencies  = $findings
+        critical      = $criticalCount
+        warnings      = $warnCount
+        verdict       = $verdict
+    } | ConvertTo-Json -Depth 5 | ForEach-Object { [Console]::Out.WriteLine($_) }
+} else {
+    if (-not $findings) {
+        [Console]::Out.WriteLine("")
+        [Console]::Out.WriteLine("  No dependencies found.")
+    } else {
+        [Console]::Out.WriteLine("")
+        $findings | Sort-Object { $_.category } | ForEach-Object {
+            $tag = switch ($_.severity) { 'critical' {'[CRITICAL]'} 'warn' {'[WARN]    '} default {'[INFO]    '} }
+            [Console]::Out.WriteLine(("  {0}  {1,-18}  {2,-40}  {3}" -f $tag, $_.category, $_.name.Substring(0,[Math]::Min(40,$_.name.Length)), $_.target.Substring(0,[Math]::Min(80,$_.target.Length))))
+        }
+    }
+    Write-Section "VERDICT"
+    [Console]::Out.WriteLine("  $verdict")
+    [Console]::Out.WriteLine("")
+    [Console]::Out.WriteLine("  Critical: $criticalCount    Warnings: $warnCount")
+}
+
+if ($criticalCount -gt 0) { exit $script:EXIT_VALIDATION }
+exit $script:EXIT_OK
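
The drive-letter pattern behind `Test-DrivePath` can be exercised on its own. The following is a minimal sketch rather than the script itself: `New-DriveLetterPattern` and `Test-DrivePathSketch` are hypothetical helper names, and the sample strings are made up for illustration.

```powershell
# Hypothetical helper: builds a pattern of the same shape the script uses.
function New-DriveLetterPattern {
    param([string[]]$Letters)
    if (-not $Letters) { return '__NOMATCH__' }
    $alt = ($Letters | ForEach-Object { [regex]::Escape($_) }) -join '|'
    # The letter must sit at string start or after a non-alpha character,
    # and be followed by ':\' or ':/'.
    return "(?:^|[^A-Za-z])($alt):[\\/]"
}

# Hypothetical helper: returns $true when the text references a target drive.
function Test-DrivePathSketch {
    param([string]$Text, [string]$Pattern)
    if (-not $Text) { return $false }
    return [regex]::IsMatch($Text, $Pattern)
}

$pattern = New-DriveLetterPattern -Letters 'E'
$samples = 'E:\pagefile.sys',
           '"E:\Tools\app.exe" --silent',
           'file:///C:/Users/me',
           'C:\Windows\system32'
foreach ($s in $samples) {
    '{0,-30} {1}' -f $s, (Test-DrivePathSketch -Text $s -Pattern $pattern)
}
```

With target letter `E`, the first two samples match and the last two do not: the `e:` inside `file:` follows a letter, so the non-alpha guard rejects it.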

+ 65 - 6
skills/windows-ops/scripts/health-audit.ps1

@@ -201,6 +201,39 @@ if ($resetCount -gt 5) {
         -Detail "No storahci resets in last $Days days"
 }
 
+# Pagefile location — flag if pagefile is on a failing drive
+try {
+    $pagefiles = Get-CimInstance Win32_PageFileSetting -ErrorAction SilentlyContinue
+    foreach ($pf in $pagefiles) {
+        if (-not $pf.Name) { continue }
+        $pfLetter = $pf.Name.Substring(0,1).ToUpper()
+        $pfDisk = $diskMap | Where-Object { $_.DriveLetters -like "*$pfLetter*" } | Select-Object -First 1
+        if ($pfDisk -and $failingDisks -contains $pfDisk) {
+            Add-Finding -Level fail -Category 'storage' -Subject 'Pagefile location' `
+                -Detail "Pagefile on FAILING drive: $($pf.Name) (Disk $($pfDisk.Number)). Move to a healthy drive."
+        } else {
+            Add-Finding -Level pass -Category 'storage' -Subject 'Pagefile location' `
+                -Detail "Pagefile on healthy drive: $($pf.Name)"
+        }
+    }
+} catch {}
+
+# Windows Search index location — boot-time amplifier if on failing drive
+try {
+    $idxDir = (Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows Search' -Name DataDirectory -ErrorAction SilentlyContinue).DataDirectory
+    if ($idxDir) {
+        $idxLetter = $idxDir.Substring(0,1).ToUpper()
+        $idxDisk = $diskMap | Where-Object { $_.DriveLetters -like "*$idxLetter*" } | Select-Object -First 1
+        if ($idxDisk -and $failingDisks -contains $idxDisk) {
+            Add-Finding -Level fail -Category 'storage' -Subject 'Search index location' `
+                -Detail "Search index on FAILING drive: $idxDir. Move to a healthy drive."
+        } else {
+            Add-Finding -Level pass -Category 'storage' -Subject 'Search index location' `
+                -Detail "Search index on healthy drive: $idxDir"
+        }
+    }
+} catch {}
+
 # ─────────────────────────────────────────────────────────────────────
 # Section: Crash history
 # ─────────────────────────────────────────────────────────────────────
@@ -312,14 +345,40 @@ try {
     Add-Finding -Level $level -Category 'resource' -Subject 'Memory' -Detail "$memUsedPct% used"
 } catch {}
 
-# Top 5 processes by accumulated CPU
+# Top processes by CURRENT CPU% over a 2-second sample (not accumulated CPU
+# time — that's misleading for long-running processes).
 try {
-    $topCpu = Get-Process | Where-Object { $_.CPU -gt 30 } | Sort-Object CPU -Descending | Select-Object -First 5
-    foreach ($p in $topCpu) {
-        Add-Finding -Level info -Category 'resource' -Subject "Top CPU: $($p.ProcessName)" `
-            -Detail "$([math]::Round($p.CPU,0))s CPU, $([math]::Round($p.WorkingSet/1MB,0)) MB"
+    $sample1 = Get-Process | Select-Object Id, ProcessName, CPU, WorkingSet
+    Start-Sleep -Milliseconds 2000
+    $sample2 = Get-Process | Select-Object Id, ProcessName, CPU, WorkingSet
+    $cores = (Get-CimInstance Win32_Processor | Measure-Object -Property NumberOfLogicalProcessors -Sum).Sum
+    if (-not $cores) { $cores = 1 }
+    $top = @()
+    foreach ($p2 in $sample2) {
+        $p1 = $sample1 | Where-Object { $_.Id -eq $p2.Id } | Select-Object -First 1
+        if (-not $p1) { continue }
+        $deltaCpuSec = $p2.CPU - $p1.CPU
+        $pct = [math]::Round(($deltaCpuSec / 2.0 / $cores) * 100, 1)
+        if ($pct -gt 1.0) {
+            $top += [PSCustomObject]@{
+                Name    = $p2.ProcessName
+                Pid     = $p2.Id
+                Pct     = $pct
+                RamMB   = [math]::Round($p2.WorkingSet / 1MB, 0)
+            }
+        }
     }
-} catch {}
+    $top = $top | Sort-Object Pct -Descending | Select-Object -First 5
+    foreach ($p in $top) {
+        Add-Finding -Level info -Category 'resource' -Subject "Active CPU: $($p.Name)" `
+            -Detail "$($p.Pct)% CPU (sampled 2s), $($p.RamMB) MB RAM, PID $($p.Pid)"
+    }
+    if (-not $top) {
+        Add-Finding -Level pass -Category 'resource' -Subject 'CPU pressure' -Detail "No process consuming >1% over 2s sample"
+    }
+} catch {
+    Add-Finding -Level info -Category 'resource' -Subject 'CPU sample' -Detail "Failed: $_"
+}
 
 # ─────────────────────────────────────────────────────────────────────
 # Verdict
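
The CPU percentage in the sampling change above is the CPU-time delta divided by the sample window and the logical core count. A small sketch of that arithmetic, assuming a hypothetical helper named `Get-CpuPercent` and made-up sample values:

```powershell
# Hypothetical helper: share of total CPU capacity used between two samples.
function Get-CpuPercent {
    param(
        [double]$CpuSecondsBefore,   # accumulated CPU seconds at first sample
        [double]$CpuSecondsAfter,    # accumulated CPU seconds at second sample
        [double]$IntervalSeconds,    # wall-clock time between the samples
        [int]$LogicalCores
    )
    if ($IntervalSeconds -le 0 -or $LogicalCores -le 0) { return 0.0 }
    $delta = $CpuSecondsAfter - $CpuSecondsBefore
    return [math]::Round(($delta / $IntervalSeconds / $LogicalCores) * 100, 1)
}

# Made-up numbers: 2 s of CPU time over a 2 s window on 8 logical cores,
# i.e. one core fully busy.
Get-CpuPercent -CpuSecondsBefore 120.0 -CpuSecondsAfter 122.0 -IntervalSeconds 2 -LogicalCores 8
```

Passing a measured interval (for example from a stopwatch) instead of the nominal 2 seconds would account for the time `Get-Process` itself takes; that refinement is a suggestion, not something the script does.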

+ 186 - 0
skills/windows-ops/scripts/recover-clone.ps1

@@ -0,0 +1,186 @@
+<#
+.SYNOPSIS
+    Safely clone data from a failing drive to a healthy target using
+    robocopy with retry=0 (skip bad sectors fast, don't pound on them).
+
+.DESCRIPTION
+    When a drive is dying, the worst thing you can do is repeatedly retry
+    reads on failing sectors — every retry stresses the drive further and
+    can finish it off. This script wraps robocopy with the right flags:
+
+      /R:0       no retries on read failures (fail fast on bad blocks)
+      /W:0       no wait between retries (moot with /R:0, kept explicit)
+      /MIR       mirror (delete files at target that don't exist at source)
+      /XJ        skip junction points (don't follow recursive mounts)
+      /COPY:DAT  copy Data, Attributes, Timestamps (skip ACL/Owner, faster)
+      /DCOPY:T   copy directory timestamps
+      /MT:8      8 copy threads (robocopy is single-threaded without /MT)
+      /LOG       full log of what failed
+      /TEE       output to console + log
+
+    A separate "failed files" log captures the specific paths that couldn't
+    be read, so the user can decide what to do with those (often: try
+    again later with ddrescue, or accept the loss).
+
+    The script can resume because robocopy /MIR is idempotent: files that
+    already match at the destination are skipped, so re-running after a
+    crash continues with whatever has not been copied yet.
+
+.PARAMETER Source
+    Source path (failing drive). Required.
+
+.PARAMETER Destination
+    Target path (healthy drive with enough space). Required.
+
+.PARAMETER NoMirror
+    Use /E (copy subdirectories, including empty ones) instead of /MIR, so
+    nothing is deleted at the destination. Use this when the destination
+    already has other content you want preserved.
+
+.PARAMETER MaxRetries
+    Retry budget per file. Default 0 (no retries — recommended for failing
+    drives). Set to 1 only if you accept that retries may damage the
+    drive further.
+
+.PARAMETER LogDir
+    Where to write the clone log and failed-files log. Default: TEMP.
+
+.PARAMETER DryRun
+    Use robocopy /L to list what would be copied without copying. Useful
+    for planning capacity.
+
+.EXAMPLE
+    scripts/recover-clone.ps1 -Source Y:\ -Destination Z:\backup-of-Y
+    Full mirror clone with zero retries (safest for failing drive).
+
+.EXAMPLE
+    scripts/recover-clone.ps1 -Source Y:\important -Destination Z:\rescue -NoMirror
+    Copy a specific folder without mirroring (won't delete destination files).
+
+.EXAMPLE
+    scripts/recover-clone.ps1 -Source Y:\ -Destination Z:\backup -DryRun
+    Enumerate without copying — check capacity, file counts.
+
+.NOTES
+    Exit codes (robocopy's are remapped to ATP semantics):
+      0  success — no files needed copying, or all copied OK
+      1  partial — some files copied, some failed
+      3  not found — source path doesn't exist
+      4  validation — destination has less free space than source data
+      5  precondition — robocopy not found
+      EXIT_ERROR (defined in _lib\common.ps1): fatal, robocopy reported a fatal error (bit 16)
+#>
+
+[CmdletBinding(SupportsShouldProcess)]
+param(
+    [Parameter(Mandatory, Position=0)][string]$Source,
+    [Parameter(Mandatory, Position=1)][string]$Destination,
+    [switch]$NoMirror,
+    [ValidateRange(0,5)][int]$MaxRetries = 0,
+    [string]$LogDir = $env:TEMP,
+    [switch]$DryRun
+)
+
+$ErrorActionPreference = 'Stop'
+. "$PSScriptRoot\_lib\common.ps1"
+
+# Preflight
+$robo = Get-Command robocopy.exe -ErrorAction SilentlyContinue
+if (-not $robo) {
+    Write-Log -Level FAIL -Message "robocopy.exe not on PATH (should be present on all Windows installs)"
+    exit $script:EXIT_PRECONDITION
+}
+
+if (-not (Test-Path $Source)) {
+    Write-Log -Level FAIL -Message "Source not found: $Source"
+    exit $script:EXIT_NOT_FOUND
+}
+
+# Capacity preflight
+try {
+    $srcUsedGB = [math]::Round((Get-ChildItem $Source -Recurse -Force -ErrorAction SilentlyContinue |
+        Measure-Object -Property Length -Sum -ErrorAction SilentlyContinue).Sum / 1GB, 1)
+} catch { $srcUsedGB = -1 }
+
+$destDriveLetter = $Destination.Substring(0, 1).ToUpper()
+$destDrive = Get-PSDrive -PSProvider FileSystem -Name $destDriveLetter -ErrorAction SilentlyContinue
+if ($destDrive) {
+    $destFreeGB = [math]::Round($destDrive.Free / 1GB, 1)
+    Write-Log -Level INFO -Message "Source data: $srcUsedGB GB  |  Destination free: $destFreeGB GB"
+    if ($srcUsedGB -gt 0 -and $destFreeGB -lt $srcUsedGB) {
+        Write-Log -Level FAIL -Message "Destination has $destFreeGB GB free; source is $srcUsedGB GB. Insufficient space."
+        exit $script:EXIT_VALIDATION
+    }
+}
+
+# Timestamps and log paths
+$stamp     = (Get-Date).ToString('yyyyMMdd-HHmmss')
+$cloneLog  = Join-Path $LogDir "recover-clone-$stamp.log"
+$failedLog = Join-Path $LogDir "recover-clone-failed-$stamp.log"
+
+# Build robocopy command
+$roboArgs = @($Source, $Destination)
+if ($NoMirror) {
+    $roboArgs += '/E'           # subdirectories incl. empty
+} else {
+    $roboArgs += '/MIR'         # mirror
+}
+$roboArgs += '/XJ'              # skip junction points
+$roboArgs += '/COPY:DAT'        # data, attributes, timestamps (skip ACL for speed)
+$roboArgs += '/DCOPY:T'         # also copy directory timestamps
+$roboArgs += "/R:$MaxRetries"
+$roboArgs += '/W:0'
+$roboArgs += '/MT:8'            # 8 threads
+$roboArgs += '/V'               # verbose — list skipped files
+$roboArgs += '/BYTES'           # report sizes in bytes (cleaner for parsing)
+$roboArgs += '/NP'              # no per-file progress (cleaner log)
+$roboArgs += "/LOG:$cloneLog"
+$roboArgs += '/TEE'             # console + log
+if ($DryRun) {
+    $roboArgs += '/L'           # list only — no actual copy
+    Write-Log -Level INFO -Message "DRY-RUN — robocopy /L will enumerate without copying"
+}
+
+Write-Log -Level INFO -Message "Logs:  $cloneLog"
+Write-Log -Level INFO -Message "Robocopy: robocopy $($roboArgs -join ' ')"
+
+if (-not $PSCmdlet.ShouldProcess("$Source -> $Destination", "robocopy clone")) {
+    Write-Log -Level INFO -Message "WhatIf: would run but skipped due to -WhatIf"
+    exit $script:EXIT_OK
+}
+
+# Run robocopy
+$start = Get-Date
+& robocopy.exe @roboArgs
+$roboExit = $LASTEXITCODE
+$end = Get-Date
+
+# Decode robocopy exit code
+# 0      — no files copied (nothing to do)
+# 1      — files copied OK
+# 2      — extra files/dirs detected (not an error in /MIR mode)
+# 4      — mismatches detected
+# 8      — failures — files could not be copied
+# 16     — fatal error
+# Combinations possible (bitmask). >=8 means errors.
+$elapsed = [math]::Round(($end - $start).TotalMinutes, 1)
+Write-Log -Level INFO -Message "Elapsed: $elapsed min  |  Robocopy exit: $roboExit"
+
+# Extract failed files from log
+if (Test-Path $cloneLog) {
+    $failedFiles = Select-String -Path $cloneLog -Pattern 'ERROR \d+ \(0x[0-9A-Fa-f]+\)' -ErrorAction SilentlyContinue
+    if ($failedFiles) {
+        $failedFiles | ForEach-Object { $_.Line } | Set-Content -Path $failedLog
+        Write-Log -Level WARN -Message "$($failedFiles.Count) file(s) failed to copy — see: $failedLog"
+    }
+}
+
+# Map robocopy exit to ATP semantics
+if ($roboExit -ge 16) {
+    Write-Log -Level FAIL -Message "Fatal robocopy error — review $cloneLog"
+    exit $script:EXIT_ERROR
+} elseif ($roboExit -ge 8) {
+    Write-Log -Level WARN -Message "Some files could not be copied (drive-failure or permission). Clone is partial."
+    exit 1   # partial success per ATP
+} else {
+    Write-Log -Level PASS -Message "Clone complete. Robocopy code $roboExit (no errors)."
+    exit $script:EXIT_OK
+}
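
Robocopy's exit code is a bitmask, so the remapping above can also be written with bit tests. This is a minimal sketch, assuming a hypothetical helper named `ConvertFrom-RobocopyExit`; the outcome labels `ok`, `partial`, and `fatal` are illustrative names, not values taken from `_lib\common.ps1`.

```powershell
# Hypothetical helper: decodes a robocopy exit code into its set bits.
function ConvertFrom-RobocopyExit {
    param([Parameter(Mandatory)][int]$Code)
    $bits = @(
        @{ Bit = 1;  Meaning = 'files copied' }
        @{ Bit = 2;  Meaning = 'extra files or directories at destination' }
        @{ Bit = 4;  Meaning = 'mismatched files or directories' }
        @{ Bit = 8;  Meaning = 'some files could not be copied' }
        @{ Bit = 16; Meaning = 'fatal error' }
    )
    # Collect the meaning of every bit that is set in the exit code.
    $meanings = @($bits | Where-Object { $Code -band $_.Bit } | ForEach-Object { $_.Meaning })
    # Fatal wins over partial; anything without bits 8 or 16 is a clean run.
    $outcome = if ($Code -band 16) { 'fatal' } elseif ($Code -band 8) { 'partial' } else { 'ok' }
    [PSCustomObject]@{ Code = $Code; Outcome = $outcome; Meanings = $meanings }
}

# Exit code 9 = 8 + 1: some files copied, some failed.
ConvertFrom-RobocopyExit -Code 9 | Format-List
```

For codes in robocopy's documented range (0 to 31) this gives the same three-way split as the `-ge 16` / `-ge 8` comparisons in the script; the bit tests only make the intent more explicit.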

+ 18 - 1
skills/windows-ops/scripts/safe-disable-startup.ps1

@@ -87,11 +87,28 @@ function Get-RunEntries {
                 }
         }
     }
+    # Startup folder shortcuts use a separate StartupApproved variant
+    foreach ($d in @("$env:APPDATA\Microsoft\Windows\Start Menu\Programs\Startup",
+                     "$env:ALLUSERSPROFILE\Microsoft\Windows\Start Menu\Programs\StartUp")) {
+        if (Test-Path $d) {
+            Get-ChildItem $d -Filter *.lnk -ErrorAction SilentlyContinue | ForEach-Object {
+                $entries += [PSCustomObject]@{
+                    Name    = $_.Name        # full filename, e.g. "Comet.lnk"
+                    Command = $_.FullName
+                    Path    = $d
+                    Variant = 'StartupFolder'
+                }
+            }
+        }
+    }
     return $entries
 }
 
 function Get-CurrentState {
-    param([Parameter(Mandatory)][string]$EntryName, [Parameter(Mandatory)][string]$Variant)
+    param(
+        [Parameter(Mandatory)][string]$EntryName,
+        [Parameter(Mandatory)][ValidateSet('Run','Run32','StartupFolder')][string]$Variant
+    )
     $key = "HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StartupApproved\$Variant"
     if (-not (Test-Path $key)) { return 'unmanaged' }
     $val = (Get-ItemProperty $key -Name $EntryName -ErrorAction SilentlyContinue).$EntryName