
feat(skills): Add windows-ops — Windows workstation diagnostics

Comprehensive Windows operations skill covering slow boot diagnosis,
failing drive identification, BSOD/crash decoding, startup management
across all five mechanisms, and event log audit patterns. Bump to
v2.5.1 (76 skills).

Killer feature: scripts/health-audit.ps1 walks WHEA -> storage ->
crashes -> startup -> resources in one pass and emits a verdict
block (specific failing drive identified by \Device\HarddiskN to
drive-letter mapping, storahci Event 129 reset counts, Event 41
BugCheck decoding using Properties[0] not Properties[1]).

scripts/safe-disable-startup.ps1 implements the StartupApproved
registry mechanism — what Task Manager's Disable button actually
does — so non-admin users can disable HKLM startup entries for
themselves. Supports -List, -Enable (re-enable), wildcards, and
-Json output.

scripts/crash-triage.ps1 finds the most recent Event 41 (or one at
a specified time), decodes its BugCheck against a known catalog,
and walks the configurable window before the crash flagging
smoking guns (storahci resets -> storage cascade; WHEA -> hardware
fault; nvlddmkm/igdkmd -> GPU driver hang).

Three reference files: full disk/storahci/Ntfs event ID catalog
with HDD-vs-SSD failure thresholds, Windows BugCheck code catalog
with symptom-to-cause mapping, and the all-five-mechanisms startup
catalog with vendor-pattern checklists.

Skills protocol: H1=name on line 1, ## Helps with as first body H2
with 9 problem-shape entries, spec-compliant frontmatter with
metadata.author + related-skills, PowerShell-adapted ATP for
scripts (comment-based help with EXAMPLES, -Json switch, stream
separation, semantic exit codes 0/2/3/4/5).

Dogfooded against a real workstation with a dying 8TB HGST HDD —
correctly identified failing drive (1943 Event 7 + 1646 Event 154
+ 20 storahci 129 over 60 days) and decoded two recent unclean
shutdowns with correct cause discrimination (PowerButtonTimestamp
non-zero = forced shutdown of hung machine; zero = hard power
loss or hardware lockup).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
0xDarkMatter 1 month ago
parent
commit
1527c70cf6

+ 4 - 3
.claude-plugin/plugin.json

@@ -1,7 +1,7 @@
 {
   "name": "claude-mods",
-  "version": "2.5.0",
-  "description": "Custom commands, skills, and agents for Claude Code - session continuity, 23 expert agents, 75 skills, 2 commands, 6 rules, 4 hooks, 13 output styles, modern CLI tools",
+  "version": "2.5.1",
+  "description": "Custom commands, skills, and agents for Claude Code - session continuity, 23 expert agents, 76 skills, 2 commands, 6 rules, 4 hooks, 13 output styles, modern CLI tools",
   "author": "0xDarkMatter",
   "repository": "https://github.com/0xDarkMatter/claude-mods",
   "license": "MIT",
@@ -120,7 +120,8 @@
       "skills/typescript-ops",
       "skills/unfold-admin",
       "skills/vue-ops",
-      "skills/genart-ops"
+      "skills/genart-ops",
+      "skills/windows-ops"
     ],
     "rules": [
       "rules/cli-tools.md",

+ 1 - 1
AGENTS.md

@@ -5,7 +5,7 @@
 This is **claude-mods** - a collection of custom extensions for Claude Code:
 - **23 expert agents** for specialized domains (React, Python, Go, Rust, AWS, git, etc.)
 - **2 commands** for session management (/sync, /save)
-- **75 skills** for CLI tools, patterns, workflows, and development tasks (incl. `net-ops` for cross-platform network troubleshooting)
+- **76 skills** for CLI tools, patterns, workflows, and development tasks (incl. `net-ops` for network troubleshooting and `windows-ops` for Windows workstation diagnostics)
 - **13 output styles** for response personality (Vesper, Spartan, Mentor, Executive, Pair, Atlas, Coach, Harbour, Meridian, Noir, Roast, Sage, Scout)
 - **4 hooks** for pre-commit linting, post-edit formatting, dangerous command warnings, and pmail notifications
 - **Pigeon** inter-session messaging (`pigeon send/read/reply`) - SQLite-backed pmail at `~/.claude/pmail.db`

+ 5 - 2
README.md

@@ -12,16 +12,19 @@
 
 > *A comprehensive extension toolkit that transforms Claude Code into a specialized development powerhouse.*
 
-**claude-mods** is a production-ready plugin that extends Claude Code with 23 expert agents, 75 specialized skills, 13 output styles, 4 hooks, and modern CLI tools designed for real-world development workflows. Whether you're debugging React hooks, optimizing PostgreSQL queries, or building production CLI applications, this toolkit equips Claude with the domain expertise and procedural knowledge to work at expert level across multiple technology stacks.
+**claude-mods** is a production-ready plugin that extends Claude Code with 23 expert agents, 76 specialized skills, 13 output styles, 4 hooks, and modern CLI tools designed for real-world development workflows. Whether you're debugging React hooks, optimizing PostgreSQL queries, or building production CLI applications, this toolkit equips Claude with the domain expertise and procedural knowledge to work at expert level across multiple technology stacks.
 
 Built on the [Agent Skills specification](https://agentskills.io/specification) (an open standard backed by Anthropic, Vercel, Google, Microsoft, and 40+ agent platforms), claude-mods fills critical gaps in Claude Code's capabilities: persistent session state that survives across machines, on-demand expert knowledge for specialized domains, token-efficient modern CLI tools (10-100x faster than traditional alternatives), and proven workflow patterns for TDD, code review, and feature development. The toolkit implements Anthropic's [recommended patterns for long-running agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents), ensuring your development context never vanishes when sessions end.
 
 From Python async patterns to Rust ownership models, from AWS Fargate deployments to Craft CMS development - claude-mods provides the specialized knowledge and tools that transform Claude from a general-purpose assistant into a domain expert who understands your stack, remembers your workflow, and ships production code.
 
-**23 agents. 75 skills. 13 styles. 4 hooks. 6 rules. One install.**
+**23 agents. 76 skills. 13 styles. 4 hooks. 6 rules. One install.**
 
 ## Recent Updates
 
+**v2.5.1** (May 2026)
+- 🪟 **`windows-ops` skill** - Comprehensive Windows workstation operations. Diagnoses slow boot, identifies failing drives, decodes BSOD crashes, manages startup apps across all five mechanisms (registry Run keys, services, scheduled tasks, startup folders, group policy), and audits event logs. The killer feature is `scripts/health-audit.ps1`, a one-shot diagnostic ladder that walks WHEA → storage → crashes → startup → resources and emits a verdict block (specific failing drive identified by `\Device\HarddiskN` → drive letter mapping; storahci Event 129 reset counts; Event 41 BugCheck decoding with Properties[0] not Properties[1]; pre-crash timeline correlation). `scripts/safe-disable-startup.ps1` implements the StartupApproved registry mechanism (what Task Manager's "Disable" button actually does) so non-admin users can disable HKLM startup entries for themselves. `scripts/crash-triage.ps1` walks the N minutes before a crash, flagging smoking guns (storahci resets → storage failure cascade; WHEA before crash → hardware fault; GPU driver warnings → driver hang). Three reference files codify the deep content: full storage event ID catalog with severity thresholds (per-month counts that indicate failure for HDD vs SSD), Windows BugCheck code catalog mapping symptom → likely codes, and the all-five-mechanisms startup catalog with vendor-pattern checklists. Dogfooded on a real workstation with a dying 8TB HGST HDD: correctly identified the failing drive (1943 disk Event 7 + 1646 disk Event 154 + 20 storahci Event 129 resets over 60 days) and decoded two recent unclean shutdowns, discriminating by PowerButtonTimestamp (non-zero = forced shutdown of a hung machine; zero = hard power loss or hardware lockup).
+
 **v2.5.0** (May 2026)
 - 🌐 **`net-ops` skill** - Cross-platform network troubleshooting (Windows / macOS / Linux) via local or remote SSH with a layered diagnostic ladder: link → ICMP → socket → DNS infrastructure → OS resolver → app. NDP-aware IPv6 classifier (disabled / ULA-only / no-route / path-broken / healthy), MTU/PMTU test, time-skew check, browser DoH detection (Chrome / Brave / Firefox), WSL2/container awareness. Modes: `--watch`, `--json` (NDJSON), `--redact` for opsec-clean dumps, `--quick` for skip-if-healthy. Per-OS probe + dns-audit + repair scripts, reverse-mode probe, 24-test self-suite.
 - 🌐 **`portless-ops` skill** - Local-dev HTTPS proxy operations for Vercel Labs' [portless](https://github.com/vercel-labs/portless). Wraps the canonical upstream `SKILL.md` and `oauth/SKILL.md` (vendored verbatim into `references/` since the npm package only ships `dist/`) and overlays operational patterns we've validated: the static-alias pattern for pairing portless with external supervisors (Process Compose, PM2, Docker), TLD selection decision tree (`.test`/`.dev`/`.localhost`/custom-owned), Windows-specific gotchas (`openssl` PATH from Git for Windows, `certutil` quirks, curl-vs-browser cert handling, PS 5.1 vs 7+ flag differences), the clean-reset procedure when changing TLDs (because `portless alias --remove` appends the active TLD), and three runnable scripts: `install-portless.ps1` (audits the npm tarball for known supply-chain IOCs *before* installing), `reset-state.ps1` (full state wipe + re-register), `sync-aliases-from-yaml.ps1` (derives portless aliases from a supervisor's YAML). Four `portless.json` asset templates cover single-app, monorepo, custom-TLD-documented, and `package.json`-inline patterns.

+ 335 - 0
skills/windows-ops/SKILL.md

@@ -0,0 +1,335 @@
+---
+name: windows-ops
+description: "Comprehensive Windows workstation operations - diagnose slow boot, identify failing drives, decode BSOD crashes, manage startup apps, audit event logs. Use for: Windows is slow, slow bootup, won't boot, blue screen, BSOD, kernel crash, drive failing, SMART errors, disk errors, Event 41, Event 129, storahci reset, BugCheck, CRITICAL_PROCESS_DIED, crash dump, MEMORY.DMP, minidump, msconfig, services.msc, registry Run keys, StartupApproved, scheduled tasks at logon, slow login, high CPU at boot, Adobe startup, Docker startup, disable startup app."
+license: MIT
+allowed-tools: "Read Write Bash"
+metadata:
+  author: claude-mods
+  related-skills: net-ops, debug-ops, perf-ops
+---
+
+# windows-ops
+
+## Helps with
+
+Slow boot on a Windows machine that used to be fast — bloat accumulation across the five startup mechanisms (registry Run keys, services, scheduled tasks, startup folders, group policy). The same machine still boots fast once those are inventoried and trimmed.
+
+Failing drives that nobody's spotted yet. The signal lives in System log Events `7` / `51` / `153` / `154` (disk bad block, paging error, retry, hardware error) and `storahci` Event `129` ("Reset to device, \Device\RaidPortN, was issued"). Healthy drives produce zero of these; hundreds in a month means active failure even when SMART still claims "Healthy."
+
+Crashes with no obvious cause. Event 41 (Kernel-Power) carries the BugCheck code at `Properties[0]` and four parameters at `Properties[1-4]`. `0xEF` (CRITICAL_PROCESS_DIED), `0xD1` (DRIVER_IRQL_NOT_LESS_OR_EQUAL), `0x124` (WHEA_UNCORRECTABLE_ERROR), and `0x0` (no bugcheck recorded → hard power loss or hang) each imply a completely different fix.
+
+"My PC is slow" diagnosed by chasing the wrong symptom. Task Manager shows what's running NOW; the System log shows what failed at boot, what's been crashing, and what storage events preceded each crash. Always audit before treating.
+
+Unable to disable an HKLM startup entry because the user isn't an Administrator. The `StartupApproved` registry mechanism — what Task Manager's "Disable" button actually does — flips one byte in `HKCU\...\Explorer\StartupApproved\Run` and works without elevation, even for HKLM entries.
+
+BSOD analysis without a dump file. Pagefile too small, or hard power loss skipped the dump-write. `CrashDumpEnabled` registry key + pagefile size + free space on system drive determine whether the next crash gets diagnosed at all.
+
+Pre-crash timeline correlation. The events in the 10 minutes BEFORE Event 41 are where the story is. `storahci` resets before a crash → storage failure cascade. `nvlddmkm` / `igdkmd64` warnings before crash → GPU driver hang. WHEA events before crash → hardware fault.
+
+Identifying which physical drive is failing when the symptom is "Disk 1" or "\Device\Harddisk1" in an event message. Maps physical disk number ↔ drive letter ↔ controller port ↔ model + firmware, so the user knows which SATA cable to unplug.
+
+Adobe Creative Cloud / Docker Desktop / Slack / Electron app bloat eating boot time. Each ships with multiple startup entries (registry + services + scheduled tasks) that all need disabling to fully stop the auto-launch.
+
+## The Universal Insight
+
+**Windows tells you what's wrong if you ask the right log in the right way.** Most users (and most tutorials) reach for Task Manager. The actual diagnostic signal lives in the Event Log, the Registry's StartupApproved key, the storage driver's reset events, and the kernel's bugcheck records. This skill packages the queries that turn noise into a verdict.
+
+The most common diagnostic failure: treating symptoms in isolation. "Slow boot" → disable startup apps. "BSOD" → reinstall drivers. "Random crashes" → memtest. These are reasonable last resorts, but the data to identify the *actual* cause is sitting in the System log untouched. Always audit before treating.
+
+## The Diagnostic Ladder
+
+Walk down the layers in order. Each rung has a binary outcome:
+
+```
+1. Hardware errors    — WHEA-Logger events (CPU/RAM/PCIe-level faults)
+2. Storage health     — disk events 7/52/153/154, storahci 129 (controller reset)
+3. Crash record       — Event 41 (Kernel-Power) + BugCheck code + dump files
+4. Pre-crash timeline — events in N minutes before each crash
+5. Boot inventory     — all 5 startup mechanisms (registry, services, tasks, folders, group policy)
+6. Resource pressure  — top CPU/RAM/IO consumers
+7. Verdict            — what's failing, what to do
+```
+
+The most interesting failures cluster at rung 2 (storage) and rung 5 (startup bloat). The least interesting (but most-treated) is rung 6.
+
+## Workflow
+
+### 1. Run the comprehensive audit
+
+```powershell
+scripts/health-audit.ps1
+```
+
+Produces a verdict block: hardware errors, storage health per disk, recent crashes, top resource consumers, startup inventory. Scan for `[FAIL]` markers — that's where to drill.
+
+### 2. Drill into the failing layer
+
+| Symptom | Script |
+|---|---|
+| Storage errors flagged | `scripts/disk-health.ps1` — per-drive SMART + event correlation |
+| Recent crash | `scripts/crash-triage.ps1 -CrashTime <datetime>` — pre-crash timeline + BugCheck decode |
+| Slow boot / many startup items | `scripts/startup-audit.ps1` — all 5 mechanisms inventoried |
+| Need to query specific events | `scripts/event-search.ps1 -Provider <name> -Hours <N>` — flexible filter helper |
+
+### 3. Apply the minimum reversible fix
+
+| Action | Script |
+|---|---|
+| Disable startup app (no admin needed) | `scripts/safe-disable-startup.ps1 -Name <regname>` |
+| Set service to Manual (admin) | `Set-Service <name> -StartupType Manual; Stop-Service <name>` |
+| Disable scheduled task | `Disable-ScheduledTask -TaskName <name>` |
+
+All disables are reversible — the StartupApproved registry mechanism flips one byte; re-enabling is the inverse.
+
+## Storage Health & Failure Detection
+
+The single highest-yield audit. Failing drives cause slow boots (Windows times out probing them), instability (controller resets cascade into kernel hangs), and crashes (I/O failures kill critical processes). Three independent data sources to cross-reference:
+
+### Disk error events
+
+```powershell
+Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='disk'; StartTime=(Get-Date).AddDays(-30)} |
+    Group-Object Id | Select-Object Count, Name
+```
+
+Event ID catalog (full reference in `references/storage-events.md`):
+
+| ID | Meaning | Severity |
+|----|---------|----------|
+| **7** | "The device, \Device\HarddiskN\DR1, has a bad block" | **High** — sectors going bad |
+| **51** | "An error was detected on device during a paging operation" | High |
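+| **52** | "The driver has detected that device \Device\HarddiskN\DRN has predicted that it will fail" | **High** (SMART predictive-failure warning) |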
+| **153** | "IO operation at LBA X was retried" | Medium |
+| **154** | "IO operation at LBA X failed due to a hardware error" | **High** — Windows' explicit hardware verdict |
+
+Even 10 events of ID 7 or 154 in a month is a strong failure signal. Hundreds = drive replacement is urgent.
+
+### Storage controller resets
+
+```powershell
+Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='storahci'; Id=129; StartTime=(Get-Date).AddDays(-60)}
+```
+
+`storahci` Event 129 ("Reset to device, \Device\RaidPortN, was issued") means the drive stopped responding and the driver had to reset the controller. **Healthy = zero events.** Any non-zero count warrants investigation. >5 in a month = active failure.
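+
+As a rough sketch of how the thresholds above could be applied (this is not the code in `scripts/health-audit.ps1`; the variable names and the 30-day window are assumptions, and the `[PASS]`/`[WARN]`/`[FAIL]` prefixes follow the output convention used by this skill):
+
+```powershell
+# Count storahci resets over the last 30 days and map the count to a verdict.
+$since  = (Get-Date).AddDays(-30)
+$resets = @(Get-WinEvent -FilterHashtable @{
+        LogName = 'System'; ProviderName = 'storahci'; Id = 129; StartTime = $since
+    } -ErrorAction SilentlyContinue)   # "no events found" is an error, so suppress it
+
+$verdict = if ($resets.Count -eq 0) {
+    '[PASS] no controller resets'
+} elseif ($resets.Count -le 5) {
+    '[WARN] resets present, investigate'
+} else {
+    '[FAIL] more than 5 resets in 30 days, active failure'
+}
+"$verdict (count: $($resets.Count))"
+```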
+
+### Disk → drive letter mapping
+
+The error message identifies `\Device\HarddiskN` — to find the actual drive:
+
+```powershell
+Get-Disk | Select-Object Number, FriendlyName, BusType, HealthStatus, FirmwareVersion,
+    @{N='SizeGB';E={[math]::Round($_.Size/1GB,0)}}
+```
+
+`Number` matches the `N` in `\Device\HarddiskN`. Cross-reference with `Get-Partition -DiskNumber N` for drive letter.
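+
+The cross-reference can be done in one pass. The following is a minimal sketch, assuming the Storage module cmdlets (`Get-Disk`, `Get-Partition`) are available; the output property names `Device`, `Model`, and `DriveLetters` are illustrative only:
+
+```powershell
+Get-Disk | ForEach-Object {
+    $disk = $_
+    # Partitions without a letter report an empty or NUL DriveLetter, so filter both.
+    $letters = Get-Partition -DiskNumber $disk.Number -ErrorAction SilentlyContinue |
+        Where-Object { $_.DriveLetter -and $_.DriveLetter -ne [char]0 } |
+        ForEach-Object { "$($_.DriveLetter):" }
+    [pscustomobject]@{
+        Device       = "\Device\Harddisk$($disk.Number)"   # name as it appears in event text
+        Model        = $disk.FriendlyName
+        DriveLetters = $letters -join ', '
+    }
+}
+```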
+
+### SMART reliability counters
+
+```powershell
+Get-PhysicalDisk | ForEach-Object {
+    $_ | Get-StorageReliabilityCounter | Select-Object Temperature, Wear, ReadErrorsTotal, WriteErrorsTotal, PowerOnHours
+}
+```
+
+Returns blank on some NVMe drives due to Windows driver limitations — fall back to vendor tools (Samsung Magician, CrystalDiskInfo) or `smartctl` from smartmontools if installed.
+
+## Boot Performance & Startup Management
+
+Windows has **five separate startup mechanisms**, each requiring different tooling. Task Manager only shows two of them. Full inventory in `references/startup-mechanisms.md`.
+
+| Mechanism | Where | How to inspect | How to disable |
+|-----------|-------|----------------|----------------|
+| Registry Run keys | `HKCU/HKLM\...\Run` (+ WOW6432) | `Get-ItemProperty` | `StartupApproved` binary flag |
+| Services | Service Control Manager | `Get-Service` | `Set-Service -StartupType Manual` (admin) |
+| Scheduled Tasks at logon | Task Scheduler | `Get-ScheduledTask` | `Disable-ScheduledTask` |
+| Startup folder shortcuts | `%APPDATA%\...\Startup\` + AllUsers | `Get-ChildItem` | Delete or rename .lnk |
+| Group Policy startup scripts | `HKLM\...\Policies\Scripts` | Group Policy Editor / `gpresult` | (rare on workstations) |
+
+### The StartupApproved trick (disable HKLM entries without admin)
+
+Task Manager's "Disable" button writes a binary flag to:
+
+```
+HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StartupApproved\Run    (HKLM 64-bit entries)
+HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StartupApproved\Run32  (HKLM WOW6432 entries)
+HKCU\...\StartupApproved\StartupFolder                                          (startup folder shortcuts)
+```
+
+The value is 12 bytes: `[status byte] [00 00 00] [8-byte FILETIME timestamp]`. Status = `0x02` enabled, `0x03` disabled. Writing this to HKCU lets a non-admin user disable HKLM startup entries for themselves. The script `scripts/safe-disable-startup.ps1` automates this.
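+
+To make the byte layout concrete, here is a small sketch that only builds the 12-byte value in memory and prints it; it does not touch the registry. The function name `New-StartupApprovedValue` is hypothetical, and leaving the timestamp zeroed for the enabled state is an assumption rather than something taken from the script:
+
+```powershell
+function New-StartupApprovedValue {
+    param([switch]$Disabled)
+    $bytes = New-Object byte[] 12                 # all zeros initially
+    if ($Disabled) {
+        $bytes[0] = 0x03                          # status byte: disabled
+        # Bytes 4..11: current time as an 8-byte little-endian FILETIME
+        [BitConverter]::GetBytes([DateTime]::UtcNow.ToFileTimeUtc()).CopyTo($bytes, 4)
+    } else {
+        $bytes[0] = 0x02                          # status byte: enabled (timestamp left zero)
+    }
+    ,$bytes                                       # leading comma returns the array intact
+}
+
+$enabled  = New-StartupApprovedValue
+$disabled = New-StartupApprovedValue -Disabled
+($enabled  | ForEach-Object { '{0:X2}' -f $_ }) -join ' '   # 02 followed by eleven 00 bytes
+($disabled | ForEach-Object { '{0:X2}' -f $_ }) -join ' '   # 03 00 00 00 then the timestamp
+```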
+
+### Boot duration measurement
+
+Windows 11 records boot performance in the `Microsoft-Windows-Diagnostics-Performance/Operational` log (reading it requires admin). Without admin, infer boot time from the gap between Event 12 (`The operating system started at...`) and Event 6005 (`The Event log service was started`), and from there to the first user-mode event. Typical figures:
+
+- Healthy SSD system: 15–25 seconds to login screen
+- Healthy + many startup apps: 30–60 seconds to usable desktop
+- Failing storage: 60+ seconds, with stalls
+
+## Crash Analysis & Dump Triage
+
+### Event 41 (Kernel-Power) decoding
+
+This is **the** crash record. Properties array layout:
+
+| Index | Field | What it means |
+|-------|-------|---------------|
+| 0 | BugcheckCode | The stop code (0x0 = no bugcheck recorded → hard power loss or hang) |
+| 1 | BugcheckParameter1 | First parameter (often a memory address) |
+| 2-4 | BugcheckParameter2-4 | Additional parameters |
+| 5 | SleepInProgress | True if crash during sleep transition |
+| 6 | PowerButtonTimestamp | Non-zero = power button was held |
+
+Common BugCheck codes (full reference in `references/bugcheck-codes.md`):
+
+| Code | Name | Typical cause |
+|------|------|---------------|
+| `0x0` | (no bugcheck) | Hard power loss, total hang, hardware-level failure |
+| `0xEF` | CRITICAL_PROCESS_DIED | A critical system process (csrss/services/wininit) was killed |
+| `0xD1` | DRIVER_IRQL_NOT_LESS_OR_EQUAL | Bad driver accessed bad memory address |
+| `0x50` | PAGE_FAULT_IN_NONPAGED_AREA | Bad memory or storage I/O for pagefile |
+| `0x124` | WHEA_UNCORRECTABLE_ERROR | Hardware-level CPU/cache/PCIe error |
+| `0x7E` | SYSTEM_THREAD_EXCEPTION_NOT_HANDLED | Driver crashed |
+| `0x9F` | DRIVER_POWER_STATE_FAILURE | Driver hung during sleep/wake |
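+
+A hedged sketch of how the two tables combine when reading recent Event 41 records. It is not the implementation of `scripts/crash-triage.ps1`; the `$bugcheckNames` lookup and the `Cause` wording are illustrative, and it assumes the event carries all seven properties listed above:
+
+```powershell
+# Keyed by hex string so large codes never hit integer-cast problems.
+$bugcheckNames = @{
+    '0x0'   = '(no bugcheck recorded)'
+    '0xEF'  = 'CRITICAL_PROCESS_DIED'
+    '0xD1'  = 'DRIVER_IRQL_NOT_LESS_OR_EQUAL'
+    '0x50'  = 'PAGE_FAULT_IN_NONPAGED_AREA'
+    '0x124' = 'WHEA_UNCORRECTABLE_ERROR'
+    '0x7E'  = 'SYSTEM_THREAD_EXCEPTION_NOT_HANDLED'
+    '0x9F'  = 'DRIVER_POWER_STATE_FAILURE'
+}
+
+Get-WinEvent -FilterHashtable @{
+        LogName = 'System'; ProviderName = 'Microsoft-Windows-Kernel-Power'; Id = 41
+    } -MaxEvents 5 -ErrorAction SilentlyContinue |
+    ForEach-Object {
+        $code = $_.Properties[0].Value            # index 0 is the BugCheck code
+        $hex  = '0x{0:X}' -f $code
+        $name = $bugcheckNames[$hex]
+        if (-not $name) { $name = 'see references/bugcheck-codes.md' }
+
+        $cause = if ($code -ne 0) {
+            'bugcheck recorded'
+        } elseif ($_.Properties[6].Value -ne 0) {
+            'forced shutdown (power button held)'   # PowerButtonTimestamp non-zero
+        } else {
+            'hard power loss or hardware lockup'
+        }
+
+        [pscustomobject]@{
+            TimeCreated = $_.TimeCreated
+            Code        = $hex
+            Name        = $name
+            Cause       = $cause
+        }
+    }
+```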
+
+### Pre-crash timeline correlation
+
+The crash record alone rarely tells you the cause. The **events in the 10 minutes before the crash** are where the story is. Use:
+
+```powershell
+scripts/crash-triage.ps1 -CrashTime '2026-05-15 00:57:50' -WindowMinutes 10
+```
+
+Look for:
+- `storahci` Event 129 (drive reset) before crash → storage failure cascade
+- `nvlddmkm` / `igdkmd64` warnings before crash → GPU driver hang
+- `WHEA-Logger` events before crash → hardware-level fault
+- Sudden silence (no events for >30s before crash) → total system hang
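+
+When the script is not to hand, the same window can be pulled manually. This is only a sketch of the idea, not what `scripts/crash-triage.ps1` does internally; `$crashTime` and `$windowMinutes` are illustrative names, and the timestamp is the example value used above:
+
+```powershell
+$crashTime     = Get-Date '2026-05-15 00:57:50'
+$windowMinutes = 10
+
+# Critical, Error, and Warning events from the System log in the window before the crash
+Get-WinEvent -FilterHashtable @{
+        LogName   = 'System'
+        Level     = 1, 2, 3
+        StartTime = $crashTime.AddMinutes(-$windowMinutes)
+        EndTime   = $crashTime
+    } -ErrorAction SilentlyContinue |
+    Sort-Object TimeCreated |
+    Select-Object TimeCreated, ProviderName, Id, LevelDisplayName
+```
+
+Reading the output oldest-first makes a long gap before the crash timestamp (the "sudden silence" case) easy to spot.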
+
+### Dump configuration audit
+
+```powershell
+Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl' |
+    Select-Object CrashDumpEnabled, DumpFile, MinidumpDir, AutoReboot
+```
+
+`CrashDumpEnabled` values: `0` = None, `1` = Complete, `2` = Kernel, `3` = Small (minidump), `7` = Automatic.
+
+If `0` or no dumps exist after recent crashes:
+- Pagefile may be too small (needs >RAM size for complete dump, or >256MB for minidump)
+- Power loss crashes can't write dumps regardless — RAM contents are gone before disk write
+- Some BSODs in early boot also skip dump-writing
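+
+A short sketch for checking whether the configured dump targets actually contain anything. The `$modes` lookup mirrors the value list above; listing the minidump directory may need elevation, which is why errors are suppressed. The variable and property names are illustrative:
+
+```powershell
+$cc    = Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl'
+$modes = @{ 0 = 'None'; 1 = 'Complete'; 2 = 'Kernel'; 3 = 'Small (minidump)'; 7 = 'Automatic' }
+
+# Does the full/kernel dump file exist?
+$fullDump = if ($cc.DumpFile) { Test-Path $cc.DumpFile } else { $false }
+
+# How many minidumps are present (0 if the folder is missing or unreadable)?
+$miniCount = if ($cc.MinidumpDir -and (Test-Path $cc.MinidumpDir)) {
+    @(Get-ChildItem $cc.MinidumpDir -Filter *.dmp -ErrorAction SilentlyContinue).Count
+} else { 0 }
+
+[pscustomobject]@{
+    DumpMode       = $modes[[int]$cc.CrashDumpEnabled]
+    FullDumpExists = $fullDump
+    MinidumpCount  = $miniCount
+}
+```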
+
+## Event Log Query Patterns
+
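+`Get-WinEvent` with `-FilterHashtable` is dramatically faster than piping to `Where-Object`, because the filter is applied when the log is queried rather than after every event has been returned to the pipeline. Supported keys include: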
+
+| Key | Type | Example |
+|-----|------|---------|
+| `LogName` | string or array | `'System'`, `@('System','Application')` |
+| `ProviderName` | string or array | `'storahci'`, `'Microsoft-Windows-Kernel-Power'` |
+| `Id` | int or array | `41`, `@(7,153,154)` |
+| `Level` | int or array | `1`=Critical, `2`=Error, `3`=Warning, `4`=Information |
+| `StartTime` | DateTime | `(Get-Date).AddDays(-7)` |
+| `EndTime` | DateTime | `(Get-Date)` |
+
+Use `scripts/event-search.ps1` for common patterns (events in time window, by provider, correlated across logs).
+
+## Common Failure Modes
+
+| Symptom | First check | Common cause |
+|---------|-------------|--------------|
+| Slow boot, used to be fast | `startup-audit.ps1` | Bloat accumulation (Docker, Adobe CC, Electron apps) |
+| Slow boot, getting worse | `disk-health.ps1` | Failing drive — Windows waiting on probe timeouts |
+| Random freezes + hard restarts | `disk-health.ps1` + `crash-triage.ps1` | storahci resets cascading into kernel hang |
+| BSOD on wake from sleep | `crash-triage.ps1` (BugCheck `0x9F`) | Driver power state failure (often GPU, USB) |
+| BSOD with WHEA before it | `crash-triage.ps1` (BugCheck `0x124`) | Hardware fault — RAM, CPU, PCIe lane |
+| Sluggish but not crashing | `health-audit.ps1` performance section | Background process pileup |
+| Login takes minutes | `startup-audit.ps1` | Slow startup item synchronously blocking shell |
+
+## Recovery Patterns
+
+### Cloning from a failing drive
+
+**Never run `chkdsk /f` on a failing drive** — repair operations write to bad sectors and can finish the drive off. Image first, repair the image second.
+
+```powershell
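+# File-level copy with no retries: unreadable files are skipped and logged rather than retried.
+# Caution: /MIR also deletes files in the destination that no longer exist in the source.
+robocopy "Y:\important" "Z:\backup\important" /MIR /R:0 /W:0 /XJ /NDL /LOG:clone.log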
+```
+
+For bit-level recovery from a drive with many bad sectors, use `ddrescue` (via WSL or live Linux USB) with a map file so the operation is resumable. Documented in `references/storage-events.md`.
+
+### Physically removing a failing drive
+
+If a drive is causing boot stalls or crashes:
+1. Identify it via `disk-health.ps1`
+2. Verify nothing critical points at it (`scripts/disk-health.ps1 -CheckDependencies <drive-letter>`)
+3. Physically disconnect SATA cable OR disable in BIOS OR set offline in `diskpart`
+4. Reboot — boot time should drop significantly, controller resets should stop
+
+## Voice & Output Style
+
+Output follows the claude-mods diagnostic convention:
+
+- `[PASS]` / `[FAIL]` / `[WARN]` / `[INFO]` prefixes for scan rows
+- Verdict block at the bottom with specific findings + recommended actions
+- Drive identifications include physical disk number, model, capacity, drive letter
+- Crash references include UTC timestamp, BugCheck code, primary parameter, suspected cause
+- No marketing language, no emojis in scripts (reserved for SKILL.md prose where useful)
+
+## What This Skill Doesn't Cover
+
+- **Network diagnostics** → use `net-ops`
+- **Specific application performance profiling** → use `perf-ops`
+- **Source-code-level debugging** → use `debug-ops`
+- **Kernel dump file analysis with WinDbg** — too specialised for this skill; covered by reference doc pointers only
+- **Group Policy diagnostics** — relevant for enterprise but rare on workstations
+- **Linux-on-Windows (WSL) issues** — separate domain
+
+## Cross-References
+
+| When | Use |
+|------|-----|
+| Need to triage a remote Windows box | `net-ops` reverse-probe pattern adapts directly |
+| Crash is networking-related | Combine with `net-ops` for DNS / VPN driver issues |
+| Multiple machines exhibit same pattern | Run `health-audit.ps1` on each, diff the outputs |
+
+## References
+
+- `references/storage-events.md`: full event ID catalog for `disk`, `storahci`, `Ntfs`, `partmgr`, `volmgr` providers. Load when investigating disk errors, mapping `\Device\HarddiskN` references, or interpreting LBA-level I/O failures. Includes severity triage thresholds (per-month counts that indicate failure for HDD vs SSD) and the query recipes the audit script uses.
+
+- `references/bugcheck-codes.md`: Windows BSOD stop-code catalog covering the codes that actually appear on workstations. Load when decoding a non-trivial Event 41, analyzing a minidump's stop code, or matching a symptom ("crashes during sleep", "random reboot no dump", "crashes during file copy") to a likely cause. Covers `0xEF`, `0xD1`, `0x124`, `0x50`, `0x7A`, `0x9F` and the special `0x0` case.
+
+- `references/startup-mechanisms.md`: deep dive on all five Windows startup mechanisms: registry Run keys, services, scheduled tasks, startup folders, group policy. Load when doing a full startup audit, hunting vendor-installed auto-launch hooks across multiple mechanisms, or implementing the StartupApproved disable trick. Includes vendor-pattern checklists (Adobe, Docker, NVIDIA) and edge cases like WMI permanent event consumers and IFEO Debugger redirects.
+
+## Worked example
+
+A user reports "my PC takes minutes to boot and crashes sometimes." Workflow:
+
+```
+1. scripts/health-audit.ps1
+   → identifies failing drive (Disk N), counts pre-crash storage resets,
+     surfaces crash history with BugCheck codes, inventories startup load
+
+2. scripts/crash-triage.ps1
+   → most recent crash decoded; pre-crash timeline shows storahci 129
+     at T-2min → SMOKING GUNS: storage failure cascade
+
+3. scripts/safe-disable-startup.ps1 -List
+   → see current state of every Run-key entry across HKCU + HKLM (+ WOW64)
+
+4. scripts/safe-disable-startup.ps1 -Name 'Adobe Creative Cloud','Granola',...
+   → bulk disable via StartupApproved overlay (no admin needed)
+
+5. (admin) Set-Service AdobeARMservice -StartupType Manual; Stop-Service ...
+   → for the service-tier startup hooks the script doesn't touch
+
+6. Verdict to user:
+   - Disk N is dying — back up + replace (specific drive identified)
+   - N startup items disabled
+   - Crash risk eliminated by physically disconnecting failing drive
+
+7. Confirm by reboot — re-run health-audit, verify zero storahci resets
+```
+
+The data was always there in the System log — this skill just asks for it correctly.

+ 194 - 0
skills/windows-ops/references/bugcheck-codes.md

@@ -0,0 +1,194 @@
+# Windows BugCheck Code Catalog
+
+Load this when decoding Event 41 Properties[0] (the BugCheck code), analyzing minidump files, or matching a BSOD stop code to a likely cause. Codes here are the ones that actually appear on workstations; the full list is in Microsoft's documentation but most are kernel-internal or driver-specific corner cases.
+
+## Contents
+
+1. [How to read Event 41](#how-to-read-event-41)
+2. [Most common stop codes](#most-common-stop-codes) — by frequency on real workstations
+3. [Hardware-pointer codes](#hardware-pointer-codes) — when the bugcheck points at silicon
+4. [Driver-pointer codes](#driver-pointer-codes) — when a kernel-mode driver is at fault
+5. [Storage-pointer codes](#storage-pointer-codes) — bugchecks induced by failing disks
+6. [Power / sleep codes](#power--sleep-codes)
+7. [Code 0x0 (no bugcheck)](#code-0x0-no-bugcheck) — the special case
+8. [Decoding bugcheck parameters](#decoding-bugcheck-parameters)
+9. [Cross-reference: symptom → likely codes](#cross-reference-symptom--likely-codes)
+
+## How to read Event 41
+
+```powershell
+Get-WinEvent -FilterHashtable @{LogName='System'; Id=41} -MaxEvents 5 |
+    Select-Object TimeCreated,
+        @{N='BugCheckCode'; E={ '0x{0:X}' -f $_.Properties[0].Value }},
+        @{N='Param1'; E={ '0x{0:X}' -f $_.Properties[1].Value }},
+        @{N='Param2'; E={ '0x{0:X}' -f $_.Properties[2].Value }},
+        @{N='Param3'; E={ '0x{0:X}' -f $_.Properties[3].Value }},
+        @{N='Param4'; E={ '0x{0:X}' -f $_.Properties[4].Value }},
+        @{N='SleepInProgress'; E={ $_.Properties[5].Value }},
+        @{N='PowerButtonTime'; E={ $_.Properties[6].Value }}
+```
+
+`Properties[0]` is the BugCheck code (the "stop code" in BSOD blue screen). `Properties[1-4]` are the four parameters whose meaning depends on the code. `Properties[5]` flags crashes during sleep transitions. `Properties[6]` is non-zero if the power button was held (manual force-shutdown vs spontaneous crash).
+
+**Critical gotcha**: people frequently quote `Properties[1]` as "the BugCheck code" — it isn't. That's BugcheckParameter1. The actual code is at index `0`.
+
+## Most common stop codes
+
+Frequency on real workstations, descending:
+
+| Hex | Name | Typical cause | Investigation entry point |
+|-----|------|---------------|---------------------------|
+| `0x0` | (no bugcheck recorded) | Hard power loss / total hang / hardware-level failure | See [Code 0x0](#code-0x0-no-bugcheck) below |
+| `0xD1` | DRIVER_IRQL_NOT_LESS_OR_EQUAL | Driver accessed bad memory at high IRQL | Param4 = driver address; symbol lookup |
+| `0x3B` | SYSTEM_SERVICE_EXCEPTION | Exception in a kernel service call | Param2 = faulting address |
+| `0x7E` | SYSTEM_THREAD_EXCEPTION_NOT_HANDLED | Driver thread threw unhandled exception | Param1 = exception code; Param2 = address |
+| `0x50` | PAGE_FAULT_IN_NONPAGED_AREA | Bad memory OR storage I/O failed for pageable code | Param1 = referenced address; check disk Event 51 same timeframe |
+| `0xEF` | CRITICAL_PROCESS_DIED | A critical system process (csrss/services/wininit) terminated | Param1 = EPROCESS address (needs dump for process name) |
+| `0x124` | WHEA_UNCORRECTABLE_ERROR | Hardware-level CPU/cache/PCIe error | Cross-reference WHEA-Logger events same timeframe |
+| `0x1E` | KMODE_EXCEPTION_NOT_HANDLED | Kernel-mode unhandled exception | Param1 = exception; Param2 = faulting address |
+| `0x9F` | DRIVER_POWER_STATE_FAILURE | Driver hung during sleep/wake | Param1 = failure subcode (0x3 = a device object blocked a power IRP for too long, 0x4 = power transition timed out waiting on PnP) |
+| `0xA` | IRQL_NOT_LESS_OR_EQUAL | Like 0xD1 but caller usually pageable code | Param4 = caller address |
+| `0x1A` | MEMORY_MANAGEMENT | Memory manager corruption | Param1 = subcode (0x41201 = PFN/PTE inconsistency, 0x41284 = PTE or working-set list corruption) |
+| `0xC1` | SPECIAL_POOL_DETECTED_MEMORY_CORRUPTION | Driver Verifier caught buffer overrun | Param1 = pool address; only fires with Verifier enabled |
+| `0x139` | KERNEL_SECURITY_CHECK_FAILURE | Stack/pool corruption detected | Param1 = subcode (2 = stack cookie overrun, 3 = corrupt LIST_ENTRY) |
+| `0xC2` | BAD_POOL_CALLER | Driver freed bad pool / freed twice | Param1 = subcode |
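+
+The table above can be turned into a small lookup so that a decoded Event 41 row carries a name as well as a hex value. This is a minimal sketch, not part of the skill's scripts: `Resolve-StopCode` and `$StopCodeNames` are hypothetical names introduced here, and the map only covers the codes listed in this table.
+
+```powershell
+# Hypothetical helper: names for the stop codes in the table above (not exhaustive)
+$StopCodeNames = @{
+    0x0   = '(no bugcheck recorded)'
+    0xD1  = 'DRIVER_IRQL_NOT_LESS_OR_EQUAL'
+    0x3B  = 'SYSTEM_SERVICE_EXCEPTION'
+    0x7E  = 'SYSTEM_THREAD_EXCEPTION_NOT_HANDLED'
+    0x50  = 'PAGE_FAULT_IN_NONPAGED_AREA'
+    0xEF  = 'CRITICAL_PROCESS_DIED'
+    0x124 = 'WHEA_UNCORRECTABLE_ERROR'
+    0x1E  = 'KMODE_EXCEPTION_NOT_HANDLED'
+    0x9F  = 'DRIVER_POWER_STATE_FAILURE'
+    0xA   = 'IRQL_NOT_LESS_OR_EQUAL'
+    0x1A  = 'MEMORY_MANAGEMENT'
+    0xC1  = 'SPECIAL_POOL_DETECTED_MEMORY_CORRUPTION'
+    0x139 = 'KERNEL_SECURITY_CHECK_FAILURE'
+    0xC2  = 'BAD_POOL_CALLER'
+}
+
+function Resolve-StopCode {
+    param([Parameter(Mandatory)][uint32]$Code)
+    # Hashtable keys above are Int32 literals, so look up with an Int32
+    $name = $StopCodeNames[[int]$Code]
+    if (-not $name) { $name = 'UNKNOWN (check the Microsoft bug check reference)' }
+    [PSCustomObject]@{ Hex = '0x{0:X}' -f $Code; Name = $name }
+}
+
+# Usage with a value read from an Event 41 record:
+#   Resolve-StopCode $event.Properties[0].Value
+```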
+
+## Hardware-pointer codes
+
+When you see these, suspect the hardware first. Software/driver fixes are unlikely to help.
+
+| Hex | Name | What it means |
+|-----|------|---------------|
+| `0x124` | WHEA_UNCORRECTABLE_ERROR | CPU machine check, ECC failure, PCIe link fault. Run memtest, check thermals, audit recent hardware changes. |
+| `0xF4` | CRITICAL_OBJECT_TERMINATION | Critical process exited — often storage-induced when paging fails. |
+| `0x9C` | MACHINE_CHECK_EXCEPTION | CPU detected uncorrectable hardware fault. Param1 = machine check bank number. Almost always CPU/RAM. |
+| `0x18B` | SECURE_KERNEL_ERROR | VBS/Credential Guard hardware enforcement failure. CPU/TPM. |
+
+For `0x124`, the WHEA-Logger entries in the same minute give the actual MCA bank and error type. Without those it's hard to localise further than "hardware error."
+
+## Driver-pointer codes
+
+Most common on workstations. The fixable category — driver update / rollback usually resolves.
+
+| Hex | Name | Typical culprit drivers |
+|-----|------|-------------------------|
+| `0xD1` | DRIVER_IRQL_NOT_LESS_OR_EQUAL | Network drivers, antivirus, VPN drivers |
+| `0x7E` | SYSTEM_THREAD_EXCEPTION_NOT_HANDLED | GPU drivers (nvlddmkm, igdkmd64, amdkmdag), audio drivers |
+| `0x9F` | DRIVER_POWER_STATE_FAILURE | USB, GPU, network (anything that has power states) |
+| `0xC4` | DRIVER_VERIFIER_DETECTED_VIOLATION | Whatever driver Verifier was watching |
+| `0x101` | CLOCK_WATCHDOG_TIMEOUT | Usually CPU/chipset driver, or hardware. Param4 = index of the stalled CPU. |
+
+**Identifying the driver**: minidump analysis with WinDbg's `!analyze -v` is the canonical method. Without a dump, look for warnings from the driver's provider name in the System log within the same minute as the crash:
+
+```powershell
+# Provider names commonly associated with these crashes
+Get-WinEvent -FilterHashtable @{
+    LogName='System'
+    ProviderName=@('nvlddmkm','igdkmd64','amdkmdag','e1rexpress','RTKVHD','iaStorAVC','Disk','storahci')
+    StartTime=(Get-Date).AddDays(-7)
+    Level=@(1,2,3)
+}
+```
+
+## Storage-pointer codes
+
+These bugchecks are typically caused by storage failures, not the kernel/drivers per se. The actual fix is usually replacing the disk.
+
+| Hex | Name | Why storage causes it |
+|-----|------|-----------------------|
+| `0x50` | PAGE_FAULT_IN_NONPAGED_AREA | Page file I/O failed → kernel can't read paged-out memory |
+| `0x77` | KERNEL_STACK_INPAGE_ERROR | Paged-out kernel stack couldn't be read back. Param2 = I/O status. |
+| `0x7A` | KERNEL_DATA_INPAGE_ERROR | Paged-out kernel data couldn't be read back. Param2 = I/O status code. |
+| `0x133` | DPC_WATCHDOG_VIOLATION | A DPC or ISR ran past its time limit, often a disk driver waiting on a hung disk |
+| `0xF4` | CRITICAL_OBJECT_TERMINATION | A critical process couldn't read its executable pages (storage failed) |
+| `0xEF` | CRITICAL_PROCESS_DIED | Variant of above — paging failure kills csrss/services/wininit |
+
+For storage-induced bugchecks, the I/O status code at Param2 (for both `0x7A` and `0x77`) is informative:
+
+| I/O Status | Meaning |
+|------------|---------|
+| `0xC000009C` | STATUS_DEVICE_DATA_ERROR — bad sector |
+| `0xC000009D` | STATUS_DEVICE_NOT_CONNECTED — drive vanished |
+| `0xC000016A` | STATUS_DISK_OPERATION_FAILED — generic disk failure |
+| `0xC0000185` | STATUS_IO_DEVICE_ERROR — I/O device error |
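+
+Once the I/O status value has been pulled out of the relevant bugcheck parameter, the table above can be applied mechanically. The sketch below is an illustration only: `Resolve-IoStatus` and `$IoStatusNames` are hypothetical names, the map contains just the four statuses listed here, and the input is assumed to be numeric (as event property values are).
+
+```powershell
+# Hypothetical lookup for the four NTSTATUS values in the table above
+$IoStatusNames = @{
+    'C000009C' = 'STATUS_DEVICE_DATA_ERROR (bad sector)'
+    'C000009D' = 'STATUS_DEVICE_NOT_CONNECTED (drive vanished)'
+    'C000016A' = 'STATUS_DISK_OPERATION_FAILED (generic disk failure)'
+    'C0000185' = 'STATUS_IO_DEVICE_ERROR (I/O device error)'
+}
+
+function Resolve-IoStatus {
+    param([Parameter(Mandatory)]$Value)   # numeric: Int32, Int64 or UInt64
+    # Format as hex and keep the low 32 bits, so signed and unsigned inputs
+    # (and sign-extended 64-bit values) all produce the same key
+    $hex = '{0:X8}' -f $Value
+    $key = $hex.Substring($hex.Length - 8)
+    if ($IoStatusNames.ContainsKey($key)) { $IoStatusNames[$key] }
+    else { "0x$key (not in this table)" }
+}
+```
+
+Keying on a hex string sidesteps PowerShell's signed interpretation of literals such as `0xC000009C`, which would otherwise not compare equal to the unsigned value stored in the event record.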
+
+Cross-reference with disk Event 51 / 154 (`disk` provider, hardware error events) in the same timeframe. The two together definitively pin the cause to a specific drive.
+
+## Power / sleep codes
+
+| Hex | Name | When |
+|-----|------|------|
+| `0x9F` | DRIVER_POWER_STATE_FAILURE | Driver hung during sleep transition. Param1 = failure subcode. |
+| `0xA0` | INTERNAL_POWER_ERROR | Power manager internal error |
+| `0x9E` | USER_MODE_HEALTH_MONITOR | Clustering / fault-tolerance code path (rare on workstations) |
+| `0xEF` (during resume) | CRITICAL_PROCESS_DIED | Critical process didn't survive sleep — often paging-storage related |
+| `0x101` | CLOCK_WATCHDOG_TIMEOUT | CPU didn't tick — sometimes ACPI/chipset-driver-induced during sleep |
+
+For `0x9F` with Param1=3 (a device object blocked a power IRP for too long): Param2 is the physical device object of the stuck stack and Param4 is the blocked IRP. WinDbg can decode these; without dump access the device class can sometimes be inferred from the System log's last successful device-state events before the crash.
+
+## Code 0x0 (no bugcheck)
+
+When `Properties[0]` is `0x0` and all four parameters are also `0`, **Windows recorded no bugcheck**. The crash record exists only because the kernel saw an unclean shutdown on the next boot. This means one of:
+
+- **Hard power loss** — PSU dropout, power cable yanked, mains cut. Most common cause on desktops.
+- **Total hardware lockup** — CPU/chipset entered a state where the kernel couldn't even execute the bugcheck path.
+- **Manual power button hold** — user force-shutdown a hung machine.
+
+Discriminator: `Properties[6]` (PowerButtonTimestamp):
+
+- `0` → no power button input recorded → likely power loss or hardware lockup
+- Non-zero → power button was pressed → user-initiated force shutdown of a hung machine
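+
+The discriminator above reduces to a two-step check. A minimal sketch, assuming the two values have already been read from `Properties[0]` and `Properties[6]` of an Event 41 record; `Get-UncleanShutdownKind` is a hypothetical name, not one of the skill's scripts:
+
+```powershell
+function Get-UncleanShutdownKind {
+    param(
+        [Parameter(Mandatory)][uint64]$BugCheckCode,          # Properties[0]
+        [Parameter(Mandatory)][uint64]$PowerButtonTimestamp   # Properties[6]
+    )
+    # A non-zero code means Windows did bugcheck: decode it instead
+    if ($BugCheckCode -ne 0) { return 'Bugcheck' }
+    # Zero code plus a recorded button press: user forced off a hung machine
+    if ($PowerButtonTimestamp -ne 0) { return 'ForcedShutdown' }
+    # Zero code, no button input: power loss or total hardware lockup
+    return 'PowerLossOrLockup'
+}
+
+# Usage against a real record:
+#   $e = Get-WinEvent -FilterHashtable @{LogName='System'; Id=41} -MaxEvents 1
+#   Get-UncleanShutdownKind $e.Properties[0].Value $e.Properties[6].Value
+```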
+
+Critically: **no minidump will exist** for `0x0` crashes. There's no point hunting for `MEMORY.DMP` — the system was gone before it could write. Investigation has to use circumstantial evidence: System log events in the minutes before the crash, recent hardware changes, thermal logs, etc.
+
+The repeated occurrence of `0x0` crashes on the same machine is a strong signal for:
+1. Failing PSU (power transients under load)
+2. Failing/loose storage cable (storage drops → kernel hangs → user power-cycles)
+3. Thermal shutdown (CPU/GPU hits TjMax)
+4. Failing RAM (kernel hang or instant total corruption)
+
+## Decoding bugcheck parameters
+
+The four parameters' meaning depends on the BugCheck code. The Microsoft docs list each code's parameter semantics; the common patterns:
+
+| Position | Typical content for hardware-related codes | Typical content for driver-related codes |
+|----------|---------------------------------------------|------------------------------------------|
+| Param1 | Subcode / error class / referenced address | Faulting address |
+| Param2 | I/O status / IRQL level | Calling function address |
+| Param3 | Error-specific | Process/thread context |
+| Param4 | Caller address | Driver image base address |
+
+Parameter values that look like `0xFFFFxxxxxxxxxxxx` (high bits set) are kernel-mode addresses. Decoding them to a driver name requires dump analysis with proper symbols.
+
+Parameter values that look like `0x000000xxxx` (low value) are usually subcodes — look these up in the Microsoft documentation for the specific BugCheck.
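+
+The two rules of thumb above can be sketched as a quick classifier. This is a heuristic illustration only; `Get-BugCheckParamKind` is a hypothetical name, and "fits in 32 bits" is an assumed stand-in for "low value":
+
+```powershell
+function Get-BugCheckParamKind {
+    param([Parameter(Mandatory)]$Value)   # numeric, as read from Properties[1..4]
+    # 16 hex digits; negative Int64 inputs format as their two's-complement bits
+    $hex = '{0:X16}' -f $Value
+    if ($hex.StartsWith('FFFF'))         { 'KernelAddress' }   # high bits set
+    elseif ($hex.StartsWith('00000000')) { 'LikelySubcode' }   # fits in 32 bits
+    else                                 { 'Other' }           # neither pattern
+}
+```
+
+A `KernelAddress` result still needs a dump and symbols to resolve to a driver; the classifier only tells you which kind of lookup to attempt.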
+
+## Cross-reference: symptom → likely codes
+
+| User-reported symptom | Most likely BugCheck codes | First investigation step |
+|------------------------|---------------------------|--------------------------|
+| "Random crashes, no pattern" | `0x124`, `0x1A`, `0x101` | Check WHEA events, run memtest |
+| "Crashes during gaming" | `0x116`, `0x117`, `0x7E` (nvlddmkm/amdkmdag) | GPU driver, GPU thermals, PSU |
+| "Crashes on sleep/wake" | `0x9F`, `0xA0`, `0xEF` | Param1 of 0x9F = failure subcode (0x3 = a device blocked a power IRP) |
+| "Crashes with USB device plugged" | `0xD1`, `0x9F` (Param1=3) | USB driver / device driver |
+| "Crashes during file copy / heavy IO" | `0x7A`, `0x50`, `0xF4`, `0xEF` | Check storage event 7/154 + storahci 129 |
+| "Random reboot, no dump" | `0x0` | Check storahci 129 in minutes before; check WHEA |
+| "Crashes after BIOS update" | Various | Likely chipset driver mismatch; roll back BIOS or update chipset |
+| "Crashes after Windows Update" | `0xD1`, `0x7E`, `0x3B` | Identify recently-updated driver; roll back |
+| "Crashes only at boot" | `0x7B`, `0xED`, `0x74` | Boot-critical drivers or storage |
+
+## When to escalate to WinDbg
+
+The Windows-side analysis the skill performs (BugCheck code + Properties + correlating events) handles ~70% of crashes. For the remaining 30% — driver bugs, complex memory corruption, hardware quirks — WinDbg with proper symbols is essential:
+
+```powershell
+# Set the symbol path (caches Microsoft public symbols in C:\Symbols)
+$env:_NT_SYMBOL_PATH = "srv*C:\Symbols*https://msdl.microsoft.com/download/symbols"
+
+# Then open the dump in WinDbg and run:
+#   !analyze -v
+#   lm   (list modules)
+#   .bugcheck   (recap bugcheck)
+#   k   (stack trace)
+```
+
+That's a deeper investigation than this skill's scope; `windows-ops` produces the verdict that says "go look at the dump with WinDbg" for the cases that warrant it.

+ 408 - 0
skills/windows-ops/references/startup-mechanisms.md

@@ -0,0 +1,408 @@
+# Windows Startup Mechanisms
+
+Load this when auditing what auto-launches on a Windows system at boot or login. Windows has **five distinct mechanisms** plus a few edge cases. Task Manager's Startup tab shows only two of them. A proper startup audit must walk all five.
+
+## Contents
+
+1. [The five mechanisms](#the-five-mechanisms) — overview
+2. [Registry Run keys](#1-registry-run-keys) — the most common
+3. [Services](#2-services) — auto-start at boot
+4. [Scheduled Tasks](#3-scheduled-tasks-at-logon) — at logon, at startup, at event
+5. [Startup folder shortcuts](#4-startup-folder-shortcuts) — `.lnk` files
+6. [Group Policy startup scripts](#5-group-policy-startup-scripts) — domain / corp scenario
+7. [The StartupApproved mechanism](#the-startupapproved-mechanism) — how Task Manager disables things
+8. [Edge cases](#edge-cases) — WMI consumers, ActiveSetup, AppInit_DLLs, RunOnce
+9. [Full audit query patterns](#full-audit-query-patterns)
+10. [Why disabling-by-mechanism matters](#why-disabling-by-mechanism-matters) — apps register in multiple places
+
+## The five mechanisms
+
+| # | Mechanism | Scope | Trigger | Visible in Task Manager? | Disable without admin? |
+|---|-----------|-------|---------|--------------------------|------------------------|
+| 1 | Registry Run keys | User or Machine | Logon | Yes | Yes (StartupApproved trick) |
+| 2 | Services | Machine | Boot | No | No (admin required) |
+| 3 | Scheduled Tasks at logon/boot | Variable | Logon or boot or event | No (mostly) | Yes (for user tasks); No for system |
+| 4 | Startup folder shortcuts | User or All Users | Logon | Yes (user-folder ones) | Yes (user); No (all-users without admin) |
+| 5 | Group Policy startup scripts | Machine | Boot | No | No (admin/GPO required) |
+
+## 1. Registry Run keys
+
+The classic startup mechanism. Windows checks six registry locations at logon: three `Run` keys and three `RunOnce` keys.
+
+| Path | Scope | Architecture / behavior |
+|------|-------|--------------|
+| `HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Run` | Current user | Both 32-bit and 64-bit |
+| `HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run` | All users, machine-wide | 64-bit (on 64-bit Windows) |
+| `HKLM\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Run` | All users, machine-wide | 32-bit redirect |
+| `HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce` | Current user | Runs once then deletes itself |
+| `HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce` | All users | Runs once then deletes itself |
+| `HKLM\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\RunOnce` | All users | 32-bit, runs once |
+
+Each is a registry key containing a flat list of named string values. Each value's name is the entry's "friendly name"; its data is the command line to execute:
+
+```
+HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Run
+├── Slack (REG_SZ) = "C:\Users\X\AppData\Local\slack\slack.exe" --process-start-args --startup
+├── Docker Desktop (REG_SZ) = "C:\Program Files\Docker\Docker\Docker Desktop.exe"
+└── BingWallpaperApp (REG_SZ) = "C:\Users\X\AppData\Local\Microsoft\BingWallpaperApp\BingWallpaperApp.exe"
+```
+
+### Enumeration
+
+```powershell
+$paths = @(
+    'HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run',
+    'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run',
+    'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Run',
+    'HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce',
+    'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce',
+    'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\RunOnce'
+)
+foreach ($p in $paths) {
+    if (Test-Path $p) {
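+        # Drop only the provider's metadata properties, so entries whose names happen to start with "PS" survive
+        (Get-ItemProperty $p).PSObject.Properties |
+            Where-Object { $_.Name -notin 'PSPath','PSParentPath','PSChildName','PSDrive','PSProvider' } |
+            ForEach-Object { [PSCustomObject]@{ Path=$p; Name=$_.Name; Command=$_.Value } }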
+    }
+}
+```
+
+### Disable
+
+For HKCU entries: delete the registry value (user has permission to write their own hive).
+For HKLM entries: either use the StartupApproved mechanism (no admin needed, works for the current user only) or delete the value (needs admin, affects all users).
+
+## 2. Services
+
+Auto-starting services run before any user logs in. They contribute to "boot time to login screen" rather than "login to usable desktop."
+
+### Start types
+
+| Type | Meaning | Boot impact |
+|------|---------|-------------|
+| `Automatic` | Starts at boot, before logon | High — directly extends boot time |
+| `Automatic (Delayed Start)` | Starts ~2 minutes after boot, low priority | Low — runs after login |
+| `Manual` | Only starts when something requests it | None at boot |
+| `Disabled` | Can't be started at all | None |
+
+### Enumeration
+
+```powershell
+# Auto-start services currently running
+Get-Service | Where-Object { $_.StartType -eq 'Automatic' -and $_.Status -eq 'Running' } |
+    Select-Object Name, DisplayName, StartType
+
+# Get the binary path (Win32_Service) — useful for identifying bloat
+Get-CimInstance Win32_Service -Filter "StartMode='Auto' AND State='Running'" |
+    Select-Object Name, DisplayName, PathName
+```
+
+### Disable
+
+Always requires admin. For workstation tuning, prefer `Manual` over `Disabled`:
+
+```powershell
+# Set to manual (won't auto-start, but can run on demand)
+Set-Service <name> -StartupType Manual
+Stop-Service <name> -Force  # stop the currently running instance
+
+# Fully disable (NEVER runs)
+Set-Service <name> -StartupType Disabled
+Stop-Service <name> -Force
+```
+
+A `Manual` service still starts whenever a process requests it, so the change is effectively self-reversing. A `Disabled` service cannot start until another `Set-Service` call re-enables it.
+
+### Vendor patterns
+
+Common auto-start services that ship with consumer apps and rarely need to be Automatic:
+
+| Service | Application | Typical recommendation |
+|---------|-------------|------------------------|
+| `AdobeARMservice` | Adobe Acrobat | Manual — Acrobat starts it on demand for update checks |
+| `AdobeUpdateService` | Adobe Creative Cloud | Manual |
+| `ClickToRunSvc` | Microsoft Office | Usually leave as is; Click-to-Run Office apps may fail to launch or update if it is disabled |
+| `Bonjour Service` | Apple iTunes / Adobe Bridge | Manual unless using mDNS |
+| `LGHUBUpdaterService` | Logitech G Hub | Manual |
+| `DSAService` / `DSAUpdateService` | Intel Driver Support Assistant | Manual or disable |
+| `WMPNetworkSvc` | Windows Media Player | Disable (legacy) |
+
+Note: don't disable security-related services (`ekrn`, `SecurityHealthService`, `WinDefend`, `BFE`). Antivirus needing early loading is by design.
+
+## 3. Scheduled Tasks at logon
+
+Task Scheduler can trigger tasks at:
+- System boot
+- User logon (specific user or any user)
+- Specific event (e.g., user idle for N minutes)
+- Specific time / schedule
+
+The "AtLogon" and "AtStartup" triggers are the startup-relevant ones.
+
+### Enumeration
+
+```powershell
+# All tasks with logon trigger
+Get-ScheduledTask | Where-Object { $_.Triggers.CimClass.CimClassName -like '*LogonTrigger*' } |
+    Select-Object TaskName, TaskPath, State, @{N='Action';E={$_.Actions.Execute}}
+
+# All tasks with boot trigger
+Get-ScheduledTask | Where-Object { $_.Triggers.CimClass.CimClassName -like '*BootTrigger*' } |
+    Select-Object TaskName, TaskPath, State, @{N='Action';E={$_.Actions.Execute}}
+
+# Both
+Get-ScheduledTask | Where-Object {
+    $_.Triggers.CimClass.CimClassName -match 'Logon|Boot'
+} | Select-Object TaskName, State, @{N='Trigger';E={$_.Triggers.CimClass.CimClassName -join ','}},
+    @{N='Action';E={$_.Actions.Execute}}
+```
+
+### Why they're easy to miss
+
+- Don't appear in Task Manager Startup tab
+- Often installed by third-party apps without telling the user (Adobe, Google Update, Microsoft Edge, Spotify, Syncthing)
+- Frequently in the `\Microsoft\...` task path which most audit tools skip
+
+Real-world example: **Syncthing's "Start Syncthing at logon" task** was the launch mechanism. There was nothing in Run keys, nothing in the Startup folder, and nothing in services; the entry existed only in Task Scheduler.
+
+### Disable
+
+```powershell
+Disable-ScheduledTask -TaskName 'task name' -TaskPath '\optional\subpath\'
+
+# Fully remove
+Unregister-ScheduledTask -TaskName 'task name' -Confirm:$false
+```
+
+Tasks that a user registered under their own account can usually be disabled by that user without elevation. Tasks registered by SYSTEM or an administrator need admin.
+
+## 4. Startup folder shortcuts
+
+The least sophisticated mechanism: drop a `.lnk` file in a magic folder, Windows launches it at logon.
+
+| Folder | Scope |
+|--------|-------|
+| `%APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup` | Current user |
+| `%ALLUSERSPROFILE%\Microsoft\Windows\Start Menu\Programs\StartUp` | All users (note capital U) |
+
+These are file system locations, not registry entries. Items here also appear in Task Manager Startup tab.
+
+### Enumeration
+
+```powershell
+$startupDirs = @(
+    "$env:APPDATA\Microsoft\Windows\Start Menu\Programs\Startup",
+    "$env:ALLUSERSPROFILE\Microsoft\Windows\Start Menu\Programs\StartUp"
+)
+$shell = New-Object -ComObject WScript.Shell
+foreach ($d in $startupDirs) {
+    if (Test-Path $d) {
+        Get-ChildItem $d -Filter *.lnk | ForEach-Object {
+            $sc = $shell.CreateShortcut($_.FullName)
+            [PSCustomObject]@{
+                Folder = $d
+                Shortcut = $_.Name
+                Target = $sc.TargetPath
+                Arguments = $sc.Arguments
+                WorkingDir = $sc.WorkingDirectory
+            }
+        }
+    }
+}
+```
+
+### Disable
+
+For user folder: delete the `.lnk` file (user has write permission to their own folder).
+For all-users folder: needs admin to modify; OR use StartupApproved mechanism via `HKCU\...\StartupApproved\StartupFolder` to disable for current user only.
+
+## 5. Group Policy startup scripts
+
+Domain-joined or locally-configured policy scripts that run at boot (machine) or logon (user). On consumer workstations these are usually empty; on corporate machines they're frequently used for drive mappings, software deployment, registry configuration.
+
+| Path | Scope |
+|------|-------|
+| `HKLM\SOFTWARE\Policies\Microsoft\Windows\System\Scripts\Startup` | Machine boot scripts |
+| `HKLM\SOFTWARE\Policies\Microsoft\Windows\System\Scripts\Shutdown` | Machine shutdown scripts |
+| `HKCU\SOFTWARE\Policies\Microsoft\Windows\System\Scripts\Logon` | User logon scripts |
+| `HKCU\SOFTWARE\Policies\Microsoft\Windows\System\Scripts\Logoff` | User logoff scripts |
+
+Plus the filesystem locations:
+- `C:\Windows\System32\GroupPolicy\Machine\Scripts\Startup\`
+- `C:\Windows\System32\GroupPolicy\User\Scripts\Logon\`
+
+### Inspection
+
+```powershell
+# Effective policy applied to this machine
+gpresult /h gpreport.html
+Start-Process gpreport.html
+```
+
+For audit purposes the registry paths and filesystem locations are usually the fastest check. On a consumer machine, an unexpected non-empty result here is a strong "what is this and who put it here" signal.
+
+## The StartupApproved mechanism
+
+How Task Manager's "Disable" button works — and why a non-admin user can disable HKLM entries for themselves.
+
+### Locations
+
+| Path | Disables entries in |
+|------|---------------------|
+| `HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StartupApproved\Run` | HKCU\...\Run AND HKLM\...\Run (64-bit) |
+| `HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StartupApproved\Run32` | HKLM\...\WOW6432Node\Run (32-bit) |
+| `HKCU\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StartupApproved\StartupFolder` | Startup folder shortcuts |
+| `HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StartupApproved\...` | Machine-wide disables (admin only) |
+
+The value name matches the original Run key entry name. The value is 12 bytes binary:
+
+```
+Offset  Length  Meaning
+0       1       Status: 0x02 = enabled, 0x03 = disabled
+1       3       Reserved (00 00 00)
+4       8       FILETIME timestamp of last enable/disable
+```
+
+### Writing the disable marker
+
+```powershell
+$timestamp = [BitConverter]::GetBytes([DateTime]::Now.ToFileTime())
+$disabledValue = [byte[]](@(0x03, 0x00, 0x00, 0x00) + $timestamp)  # cast the joined array so the result is a true byte[]
+
+# Ensure the StartupApproved\Run key exists
+$key = 'HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StartupApproved\Run'
+if (-not (Test-Path $key)) { New-Item -Path $key -Force | Out-Null }
+
+# Write disable marker (matches the value name in the original Run key)
+Set-ItemProperty -Path $key -Name 'Slack' -Value $disabledValue -Type Binary -Force
+```
+
+Re-enable: change first byte to `0x02`:
+
+```powershell
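+# Take a fresh timestamp so this block runs on its own; $key is the StartupApproved\Run path defined above
+$timestamp = [BitConverter]::GetBytes([DateTime]::Now.ToFileTime())
+$enabledValue = [byte[]](@(0x02, 0x00, 0x00, 0x00) + $timestamp)
+Set-ItemProperty -Path $key -Name 'Slack' -Value $enabledValue -Type Binary -Force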
+```
+
+### Why this works without admin
+
+The StartupApproved key under HKCU is writable by the current user. Windows' Explorer reads both the Run keys and the StartupApproved overlay at logon — if the StartupApproved entry says `0x03` for an entry name, that entry is skipped, regardless of which Run key (HKCU or HKLM) it lives in.
+
+This means: a non-admin user can disable any HKLM startup entry for their own session, even ones an administrator installed for all users. Useful for cleaning up vendor bloat without going through "Run as Administrator."
+
+## Edge cases
+
+Less common but worth knowing about:
+
+### WMI permanent event consumers
+
+```powershell
+Get-CimInstance -Namespace root\subscription -ClassName __EventConsumer
+Get-CimInstance -Namespace root\subscription -ClassName __EventFilter
+Get-CimInstance -Namespace root\subscription -ClassName __FilterToConsumerBinding
+```
+
+Used legitimately by some monitoring tools, infamously by malware for persistence. A consumer machine typically has at most the built-in `SCM Event Log Consumer` binding. Unexpected entries warrant investigation.
+
+### ActiveSetup
+
+`HKLM\SOFTWARE\Microsoft\Active Setup\Installed Components` — designed for per-user setup-on-first-logon. Rarely used today.
+
+### AppInit_DLLs
+
+`HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Windows\AppInit_DLLs` — DLLs injected into every user32-loading process. Deprecated since Windows 8, blocked by default when Secure Boot is enabled. Empty on modern workstations; a non-empty value is suspicious.
+
+### Image File Execution Options (IFEO) "Debugger"
+
+`HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\<exe>` with a `Debugger` value will replace `<exe>` with the debugger command whenever Windows tries to launch it. Used legitimately for debugging; used by malware to redirect execution. Audit:
+
+```powershell
+Get-ChildItem 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options' |
+    ForEach-Object {
+        $debugger = (Get-ItemProperty $_.PSPath -Name Debugger -ErrorAction SilentlyContinue).Debugger
+        if ($debugger) {
+            [PSCustomObject]@{ Image = $_.PSChildName; Debugger = $debugger }
+        }
+    }
+```
+
+### Shell extensions / context menu handlers
+
+Not strictly "startup" but they load into Explorer.exe at logon and can drag boot performance. Audited via `HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\ShellIconOverlayIdentifiers` and similar paths. NirSoft's `ShellExView` is the canonical tool.
+
+### Print providers
+
+Print provider DLLs load into spoolsv at boot. A failing or slow provider can delay print spooler initialization which (because spoolsv is a delayed-start service in some configs) can ripple into other delayed-start services. Rare cause but real.
+
+## Full audit query patterns
+
+The audit script (`scripts/startup-audit.ps1`) walks all five mechanisms in parallel and produces a unified report. The patterns it uses:
+
+```powershell
+# Mechanism 1: Run keys (all 6 paths)
+$runPaths = @(
+    'HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run',
+    'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run',
+    'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Run',
+    'HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce',
+    'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce',
+    'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\RunOnce'
+)
+
+# Mechanism 2: Services (Auto, Auto-Delayed)
+Get-Service | Where-Object { $_.StartType -in @('Automatic','AutomaticDelayedStart') }
+
+# Mechanism 3: Tasks (Logon or Boot trigger)
+Get-ScheduledTask | Where-Object {
+    $_.Triggers.CimClass.CimClassName -match 'Logon|Boot'
+}
+
+# Mechanism 4: Startup folders (user + all-users)
+@(
+    "$env:APPDATA\Microsoft\Windows\Start Menu\Programs\Startup",
+    "$env:ALLUSERSPROFILE\Microsoft\Windows\Start Menu\Programs\StartUp"
+)
+
+# Mechanism 5: Group Policy scripts
+@(
+    'HKLM:\SOFTWARE\Policies\Microsoft\Windows\System\Scripts\Startup',
+    'HKCU:\SOFTWARE\Policies\Microsoft\Windows\System\Scripts\Logon'
+)
+```
+
+Cross-reference each entry with the StartupApproved overlay (mechanisms 1 and 4 only) to determine current enabled/disabled state.
+
+## Why disabling-by-mechanism matters
+
+Vendors don't ship a single auto-launch entry. The common pattern: **one app installs three or four separate startup hooks**, and disabling one leaves the others firing. Example from a real audit (Adobe ecosystem):
+
+| Mechanism | Entry |
+|-----------|-------|
+| Run (HKLM-WOW) | "Adobe Creative Cloud" |
+| Run (HKLM-WOW) | "Adobe CCXProcess" |
+| Run (HKLM) | "AdobeAAMUpdater-1.0" |
+| Run (HKCU) | "Adobe Acrobat Synchronizer" |
+| Service (Auto) | `AdobeARMservice` (Acrobat update service) |
+| Service (Auto) | `AdobeUpdateService` (Creative Cloud update service) |
+
+To fully stop Adobe auto-launching, all six need to be addressed. Disabling only the visible Task Manager startup entries leaves the two services running unattended.
+
+**Audit recipe**: search across mechanisms for the vendor name to find every hook they've installed:
+
+```powershell
+$vendor = 'Adobe'
+
+# Run keys
+foreach ($p in $runPaths) {
+    (Get-ItemProperty $p -ErrorAction SilentlyContinue).PSObject.Properties |
+        Where-Object { $_.Value -like "*$vendor*" -or $_.Name -like "*$vendor*" }
+}
+
+# Services
+Get-CimInstance Win32_Service | Where-Object { $_.PathName -like "*$vendor*" -or $_.DisplayName -like "*$vendor*" }
+
+# Tasks
+Get-ScheduledTask | Where-Object { $_.Actions.Execute -like "*$vendor*" }
+```
+
+This pattern is what `scripts/startup-audit.ps1` runs by default for vendor patterns (Adobe, Docker, Slack, NVIDIA, Microsoft Office, Intel).

+ 175 - 0
skills/windows-ops/references/storage-events.md

@@ -0,0 +1,175 @@
+# Storage Event ID Catalog
+
+Load this when investigating disk errors, storage controller resets, or correlating I/O failures to a specific drive. The `System` log carries the bulk of storage signal; the `Microsoft-Windows-Storage-*` operational logs add detail when Windows-side debug logging is enabled (rare on workstations).
+
+## Contents
+
+1. [`disk` provider events](#disk-provider-events) — the most common storage signal
+2. [`storahci` provider events](#storahci-provider-events) — AHCI (SATA) driver layer
+3. [`Disk` provider variants](#disk-provider-variants) — newer Windows 11 split provider names
+4. [`partmgr` and `volmgr`](#partmgr-and-volmgr) — partition / volume layer (less common signal)
+5. [`Ntfs` provider events](#ntfs-provider-events) — filesystem-layer errors
+6. [Query recipes](#query-recipes) — `Get-WinEvent` patterns for each scenario
+7. [Reading the message bodies](#reading-the-message-bodies) — extracting `\Device\HarddiskN` and LBA values
+8. [Severity triage](#severity-triage) — count thresholds that indicate failure
+
+## `disk` provider events
+
+Source: Windows kernel-mode disk driver. The classical signal for HDD/SSD failures. These events have been stable across Windows 7 → 11.
+
+| ID | Level | Meaning | Significance |
+|----|-------|---------|--------------|
+| **7** | Warning | The device, \Device\HarddiskN\DRn, has a bad block. | **HIGH** — sectors are going bad. Single events occur on healthy drives during normal wear; >50 in a month indicates active failure. |
+| **9** | Warning | The device, \Device\HarddiskN, did not respond within the timeout period. | **HIGH** — drive hung on an I/O request. Frequently precedes controller reset (storahci 129). |
+| **11** | Error | The driver detected a controller error on \Device\HarddiskN. | **HIGH** — driver-level error during I/O. Usually paired with disk hardware errors. |
+| **15** | Warning | The device, \Device\HarddiskN, is not ready for access yet. | Boot-time only. Drive slow to spin up / negotiate link. Common with failing/aged HDDs. |
+| **51** | Warning | An error was detected on device \Device\HarddiskN\DRn during a paging operation. | **HIGH** — failed I/O on a page file or page-mapped file. Direct cause of BSOD `0x50` (PAGE_FAULT_IN_NONPAGED_AREA) when paging fails. |
+| **52** | Informational | Write cache enabled on \Device\HarddiskN. | None — informational, posted at boot. |
+| **153** | Warning | The IO operation at logical block address 0x{LBA} for Disk N was retried. | Medium — single retry can be transient. >20/month suggests failing drive. |
+| **154** | Error | The IO operation at logical block address 0x{LBA} for Disk N failed due to a hardware error. | **HIGH** — Windows' explicit hardware verdict on a specific block. Even single events warrant investigation. |
+| **157** | Warning | Disk N has been surprise removed. | USB/eSATA drive yanked while in use. Benign if the removal was intentional; otherwise suspect cabling or power. |
+
+## `storahci` provider events
+
+Source: the inbox AHCI (SATA) storage driver; NVMe drives use `stornvme` instead. Captures controller-level issues and the lower-level "drive stopped responding" signal that precedes most storage-induced crashes.
+
+| ID | Level | Meaning | Significance |
+|----|-------|---------|--------------|
+| **129** | Warning | Reset to device, \Device\RaidPortN, was issued. | **HIGH** — controller reset because the drive on port N stopped responding. Healthy = zero events. >5/month = active failure. |
+| **131** | Error | Storage device on \Device\RaidPortN doesn't support a feature required by the driver. | Rare — usually firmware bug or unsupported drive. |
+| **132** | Warning | Storage device on \Device\RaidPortN was removed without warning. | Surprise removal at the AHCI layer. Cabling or power issue if the drive shouldn't have left. |
+| **134** | Error | Storage device on \Device\RaidPortN failed initial setup. | Boot-time. Drive not detected / not ready. Pair with disk Event 15. |
+
+### Mapping `\Device\RaidPortN` to a drive
+
+`RaidPortN` refers to the AHCI controller port number, not the drive number directly. Numbering starts at 0 in most BIOSes but Windows can renumber based on enumeration order. The most reliable mapping:
+
+```powershell
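+# Pair each disk with its bus address (bus, port, target)
+$drives = Get-CimInstance Win32_DiskDrive
+Get-PhysicalDisk | ForEach-Object {
+    $pd  = $_   # capture the outer item; $_ is rebound inside Where-Object
+    # Serial numbers can carry padding in Win32_DiskDrive, so compare trimmed strings
+    $bus = $drives | Where-Object { "$($_.SerialNumber)".Trim() -eq "$($pd.SerialNumber)".Trim() } |
+        Select-Object -First 1
+    [PSCustomObject]@{
+        Drive        = $pd.FriendlyName
+        BusType      = $pd.BusType
+        DeviceId     = $pd.DeviceId
+        SCSIBus      = $bus.SCSIBus
+        SCSIPort     = $bus.SCSIPort
+        SCSITargetId = $bus.SCSITargetId
+    }
+}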
+```
+
+In practice the count of resets is the actionable signal, not the precise port mapping — if `RaidPortN` only ever appears for one specific port and the disk error events name a specific Disk number, those two together identify the drive.
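+
+The per-port tally described above can be sketched as follows. `Get-ResetCountByPort` is a hypothetical helper name, not one of this skill's scripts, and it assumes the message text contains `\Device\RaidPortN` in the form shown in the table.
+
+```powershell
+function Get-ResetCountByPort {
+    # Hypothetical helper: tally storahci 129 messages by RaidPort number.
+    param([string[]]$Message)
+    $counts = @{}
+    foreach ($m in $Message) {
+        if ($m -match '\\Device\\RaidPort(\d+)') {
+            $port = [int]$matches[1]
+            # A missing key reads as $null, which [int] converts to 0
+            $counts[$port] = 1 + [int]$counts[$port]
+        }
+    }
+    # Busiest port first
+    $counts.GetEnumerator() | Sort-Object Value -Descending |
+        ForEach-Object { [PSCustomObject]@{ Port = $_.Key; Resets = $_.Value } }
+}
+```
+
+Feed it the `Message` property of the events returned by the storahci 129 query. If one port accounts for nearly all resets, compare it with the Disk number named in the `disk` provider errors.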
+
+## `Disk` provider variants
+
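+Windows 11 introduced a newer split provider naming for some storage events. When `disk` events are absent but a drive is suspect, also query the sources below. Entries ending in `/Operational` are log (channel) names, so pass them as `LogName` rather than `ProviderName`.
+
+| Provider or log | Notes |
+|-----------------|-------|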
+| `Microsoft-Windows-Disk` | Newer (Win11) — usually mirrors `disk` events. Some Insider builds emit here exclusively. |
+| `Microsoft-Windows-Ntfs` | Filesystem-layer; covers MFT corruption and chkdsk runs. |
+| `Microsoft-Windows-Storage-Storport/Operational` | Low-level storage port driver. Usually empty on consumer Windows; populated when storport tracing enabled. |
+| `Microsoft-Windows-Kernel-IO/Operational` | I/O subsystem; populated when kernel I/O tracing enabled (rare). |
+
+## `partmgr` and `volmgr`
+
+| Provider | ID | Meaning |
+|----------|----|---------|
+| `partmgr` | 6 | Volume guid path change — partition table modified. |
+| `partmgr` | 7 | Failed to open device — frequent on drives with persistent failures, otherwise rare. |
+| `volmgr` | 162 | Crash dump initialization failed. Important: this means the next crash won't write a dump even if `CrashDumpEnabled=7`. |
+| `volmgr` | 46 | Crash dump file could not be created (disk full or no pagefile). |
+
+`volmgr` Event 162 is high-value: pair it with the absence of `MEMORY.DMP` to explain why crash dumps aren't being captured.
+
+## `Ntfs` provider events
+
+NTFS-layer corruption usually shows up here. Most events are benign (boot-time mount logging); the meaningful ones:
+
+| ID | Meaning | Significance |
+|----|---------|--------------|
+| 55 | A corruption was discovered in the file system structure on volume X. | **HIGH** — runs of these on the same volume indicate metadata corruption. Often triggered by underlying disk errors. |
+| 98 | Volume X is not properly formatted. | Boot-time on a drive that's failing badly enough to not present a valid FS. |
+| 130 | A transaction failed because the corresponding log records have already been allocated. | Filesystem log overflow — usually under heavy load on a failing/slow drive. |
+| 137 | The default transaction resource manager on volume X failed to start. | Boot-time, on volumes with severely corrupt $LogFile. |
+
+## Query recipes
+
+### All disk error events in last 30 days, grouped by ID
+
+```powershell
+Get-WinEvent -FilterHashtable @{
+    LogName='System'
+    ProviderName='disk'
+    StartTime=(Get-Date).AddDays(-30)
+} | Group-Object Id | Select-Object Count, Name | Sort-Object Count -Descending
+```
+
+### All storage controller resets, sorted by time
+
+```powershell
+Get-WinEvent -FilterHashtable @{
+    LogName='System'
+    ProviderName='storahci'
+    Id=129
+    StartTime=(Get-Date).AddDays(-60)
+} | Select-Object TimeCreated, Message | Sort-Object TimeCreated
+```
+
+### Errors targeting a specific physical disk
+
+```powershell
+$diskNumber = 1
+Get-WinEvent -FilterHashtable @{
+    LogName='System'
+    ProviderName='disk'
+    StartTime=(Get-Date).AddDays(-60)
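+# Events 7/15/51 name the disk as "\Device\HarddiskN"; events 153/154 use "for Disk N"
+} | Where-Object { $_.Message -match "Harddisk$diskNumber\b|\bfor Disk $diskNumber\b" }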
+```
+
+### Combined storage signal in time window before a crash
+
+```powershell
+$crashTime = [datetime]'2026-05-15 00:57:50'
+$window = $crashTime.AddMinutes(-10)
+Get-WinEvent -FilterHashtable @{
+    LogName='System'
+    StartTime=$window
+    EndTime=$crashTime
+    ProviderName=@('disk','storahci','Ntfs','partmgr','volmgr','Microsoft-Windows-Kernel-Power')
+} | Sort-Object TimeCreated | Select-Object TimeCreated, ProviderName, Id,
+    @{N='Message';E={ $m = "$($_.Message)" -replace '\s+',' '; $m.Substring(0, [Math]::Min(120, $m.Length)) }}
+```
+
+## Reading the message bodies
+
+The `Message` field carries the device path and (for events 153/154) the failing LBA. Extraction patterns:
+
+```powershell
+# Harddisk number from a disk event message
+if ($event.Message -match '\\Device\\Harddisk(\d+)') { $diskNum = $matches[1] }
+
+# RaidPort number from a storahci 129 message
+if ($event.Message -match '\\Device\\RaidPort(\d+)') { $portNum = $matches[1] }
+
+# Failing LBA from a disk 153/154 message
+if ($event.Message -match 'logical block address 0x([0-9a-f]+)') { $lba = [Convert]::ToInt64($matches[1], 16) }
+```
+
+A failing LBA pattern is occasionally diagnostic: clusters of failures at sequential LBAs suggest physical head/track damage on an HDD or a single failing flash block on an SSD. Random scattered LBAs across the drive are usually controller-level issues (cable, firmware, controller chip).
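+
+A rough way to tell the two patterns apart is to check how many failing LBAs sit close to the median one. The sketch below does that; `Get-LbaPattern` is a hypothetical helper, and both cutoffs (`ClusterSpan` in sectors, `ClusterFraction`) are illustrative assumptions rather than values taken from any vendor guidance.
+
+```powershell
+function Get-LbaPattern {
+    # Hypothetical helper: label a set of failing LBAs as clustered or scattered.
+    param(
+        [Parameter(Mandatory)][long[]]$Lba,
+        [long]$ClusterSpan = 1048576,     # assumed cutoff, in sectors
+        [double]$ClusterFraction = 0.8    # assumed share that must sit near the median
+    )
+    $sorted = @($Lba | Sort-Object)
+    $median = $sorted[[int][math]::Floor($sorted.Count / 2)]
+    # Count LBAs within ClusterSpan sectors of the median
+    $near = @($sorted | Where-Object { [math]::Abs($_ - $median) -le $ClusterSpan }).Count
+    [PSCustomObject]@{
+        Count   = $sorted.Count
+        MinLba  = $sorted[0]
+        MaxLba  = $sorted[-1]
+        Pattern = if (($near / $sorted.Count) -ge $ClusterFraction) { 'clustered' } else { 'scattered' }
+    }
+}
+```
+
+Collect the `$lba` values from the extraction pattern above into an array and pass them as `-Lba`. Treat the label as a hint to pair with SMART data, not a verdict on its own.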
+
+## Severity triage
+
+Rules of thumb for what error counts mean over a 30-day window. A drive takes the most severe verdict that any single column reaches:
+
+| Drive type | Disk Event 7 (bad block) | Disk Event 154 (hw error) | storahci 129 (reset) | Verdict |
+|------------|--------------------------|---------------------------|----------------------|---------|
+| HDD (any) | 0–5 | 0–2 | 0 | Healthy |
+| HDD (any) | 6–50 | 0–10 | 0–2 | Watch — back up irreplaceable data |
+| HDD (any) | >50 OR | >10 OR | >2 | **Failing — replace** |
+| SSD (any) | 0 | 0 | 0 | Healthy |
+| SSD (any) | 1–10 | 0–5 | 0–1 | Watch — check SMART for wear |
+| SSD (any) | >10 OR | >5 OR | >1 | **Failing — replace** |
+
+SSDs have stricter thresholds because they don't develop bad blocks during normal wear the way HDDs do — any disk Event 7 on an SSD is meaningful, where on an HDD a few per month is within normal aging.
+
+A drive showing 1000+ events per category in 30 days is in late-stage failure. The skill's verdict block should call this out unambiguously.
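+
+The thresholds above can be expressed as a small lookup. This is a minimal sketch, assuming each row is read as "any one column reaching this band is enough"; `Get-DriveVerdict` is a hypothetical helper and is not part of this skill's scripts.
+
+```powershell
+function Get-DriveVerdict {
+    # Hypothetical helper mirroring the severity table; counts are per 30-day window.
+    param(
+        [Parameter(Mandatory)][ValidateSet('HDD','SSD')][string]$MediaType,
+        [int]$Event7 = 0,
+        [int]$Event154 = 0,
+        [int]$Resets129 = 0
+    )
+    # Upper bounds of the Healthy and Watch rows, in column order: Event 7, Event 154, storahci 129
+    $limits = if ($MediaType -eq 'SSD') {
+        @{ Healthy = 0, 0, 0; Watch = 10, 5, 1 }
+    } else {
+        @{ Healthy = 5, 2, 0; Watch = 50, 10, 2 }
+    }
+    $counts  = $Event7, $Event154, $Resets129
+    $verdict = 'Healthy'
+    for ($i = 0; $i -lt 3; $i++) {
+        if ($counts[$i] -gt $limits.Watch[$i])   { return 'Failing' }
+        if ($counts[$i] -gt $limits.Healthy[$i]) { $verdict = 'Watch' }
+    }
+    $verdict
+}
+```
+
+For example, `Get-DriveVerdict -MediaType SSD -Event7 1` lands in the Watch band because any Event 7 on an SSD exceeds its Healthy limit of zero.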

+ 125 - 0
skills/windows-ops/scripts/_lib/common.ps1

@@ -0,0 +1,125 @@
+# windows-ops common helpers
+# Dot-source from any script: . "$PSScriptRoot\_lib\common.ps1"
+
+# Semantic exit codes (matches ATP §7.8)
+$script:EXIT_OK           = 0
+$script:EXIT_ERROR        = 1
+$script:EXIT_USAGE        = 2
+$script:EXIT_NOT_FOUND    = 3
+$script:EXIT_VALIDATION   = 4
+$script:EXIT_PRECONDITION = 5
+$script:EXIT_TIMEOUT      = 6
+$script:EXIT_UNAVAILABLE  = 7
+
+function Write-Log {
+    # All logs to stderr — never pollute stdout
+    param(
+        [Parameter(Mandatory)][ValidateSet('INFO','WARN','ERROR','PASS','FAIL','DEBUG')]$Level,
+        [Parameter(Mandatory)][string]$Message
+    )
+    $color = switch ($Level) {
+        'PASS'  { 'Green' }
+        'FAIL'  { 'Red' }
+        'ERROR' { 'Red' }
+        'WARN'  { 'Yellow' }
+        'INFO'  { 'Cyan' }
+        'DEBUG' { 'DarkGray' }
+    }
+    # Plain text only: [Console]::Error cannot easily be colorised in PS, so $color
+    # stays unused here and is reserved for TTY-only contexts.
+    [Console]::Error.WriteLine("[$Level] $Message")
+}
+
+function Write-Section {
+    param([Parameter(Mandatory)][string]$Title)
+    $line = '=' * 60
+    [Console]::Error.WriteLine("")
+    [Console]::Error.WriteLine($line)
+    [Console]::Error.WriteLine("  $Title")
+    [Console]::Error.WriteLine($line)
+}
+
+function Write-Data {
+    # Plain data row to stdout — only thing that should go there
+    param([Parameter(Mandatory, ValueFromPipeline)][object]$Object)
+    process { $Object | Out-String -Stream | Where-Object { $_ -ne '' } | ForEach-Object { [Console]::Out.WriteLine($_) } }
+}
+
+function ConvertTo-Bytes12 {
+    # Build a 12-byte StartupApproved value: [status][3-byte pad][8-byte FILETIME]
+    param(
+        [Parameter(Mandatory)][ValidateRange(0,255)][byte]$StatusByte
+    )
+    $ts = [BitConverter]::GetBytes([DateTime]::Now.ToFileTime())
+    [byte[]](@($StatusByte, 0, 0, 0) + $ts)
+}
+
+function Get-StartupApprovedKey {
+    # Ensure the StartupApproved key exists; return its registry path
+    param(
+        [Parameter(Mandatory)][ValidateSet('Run','Run32','StartupFolder')]$Variant,
+        [ValidateSet('HKCU','HKLM')]$Hive = 'HKCU'
+    )
+    $key = "${Hive}:\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StartupApproved\$Variant"
+    if (-not (Test-Path $key)) {
+        New-Item -Path $key -Force -ErrorAction Stop | Out-Null
+    }
+    return $key
+}
+
+function Test-IsElevated {
+    # Returns true if running as Administrator
+    $id = [Security.Principal.WindowsIdentity]::GetCurrent()
+    $principal = New-Object Security.Principal.WindowsPrincipal($id)
+    return $principal.IsInRole([Security.Principal.WindowsBuiltInRole]::Administrator)
+}
+
+function Get-DiskMap {
+    # Map physical disk number -> friendly name / type / drive letters
+    # Returns array of [PSCustomObject] with Number, Model, BusType, MediaType, FirmwareVersion, SizeGB, DriveLetters
+    Get-Disk | ForEach-Object {
+        $disk = $_
+        $letters = (Get-Partition -DiskNumber $disk.Number -ErrorAction SilentlyContinue |
+            Where-Object { $_.DriveLetter } | Select-Object -ExpandProperty DriveLetter) -join ','
+        $physical = Get-PhysicalDisk | Where-Object { $_.DeviceId -eq $disk.Number } | Select-Object -First 1
+        [PSCustomObject]@{
+            Number           = $disk.Number
+            Model            = $disk.FriendlyName
+            BusType          = $disk.BusType
+            MediaType        = if ($physical) { $physical.MediaType } else { 'Unknown' }
+            FirmwareVersion  = $disk.FirmwareVersion
+            SizeGB           = [math]::Round($disk.Size / 1GB, 0)
+            DriveLetters     = $letters
+            HealthStatus     = $disk.HealthStatus
+            SerialNumber     = if ($physical) { $physical.SerialNumber } else { $null }
+        }
+    }
+}
+
+function Resolve-HarddiskRef {
+    # Resolve a "\Device\HarddiskN" reference to a disk map row
+    param([Parameter(Mandatory)][string]$Reference)
+    if ($Reference -match 'Harddisk(\d+)' -or $Reference -match '^Disk\s*(\d+)' -or $Reference -match '^(\d+)$') {
+        $num = [int]$matches[1]
+        return Get-DiskMap | Where-Object { $_.Number -eq $num } | Select-Object -First 1
+    }
+    return $null
+}
+
+function Format-EventMessage {
+    # Truncate + collapse whitespace for table display
+    param(
+        [Parameter(Mandatory, ValueFromPipeline)][string]$Message,
+        [int]$MaxLength = 120
+    )
+    process {
+        $cleaned = $Message -replace '\s+', ' '
+        if ($cleaned.Length -le $MaxLength) { return $cleaned }
+        return $cleaned.Substring(0, $MaxLength - 3) + '...'
+    }
+}
+
+# Export common state for caller scripts
+$script:CommonLoaded = $true

+ 225 - 0
skills/windows-ops/scripts/crash-triage.ps1

@@ -0,0 +1,225 @@
+<#
+.SYNOPSIS
+    Decode an Event 41 crash record and surface the events in the N
+    minutes leading up to it. The pre-crash timeline is where the
+    actual cause lives.
+
+.DESCRIPTION
+    Reads Event 41 (Kernel-Power) and properly decodes:
+      Properties[0] = BugCheckCode (the stop code; NOT Properties[1])
+      Properties[1-4] = BugcheckParameter1-4
+      Properties[6] = PowerButtonTimestamp (non-zero = forced shutdown)
+    Then walks events in the configurable window before the crash from
+    System log providers that matter for crash correlation: storage
+    drivers, GPU drivers, WHEA hardware errors, kernel-power.
+
+    BugCheck = 0x0 with no power-button = hard power loss or hardware
+    lockup. BugCheck = 0x0 with power-button = user force-shutdown of a
+    hung machine. Non-zero codes are decoded against the known catalog
+    (see references/bugcheck-codes.md).
+
+.PARAMETER CrashTime
+    Specific crash time (datetime) to triage. If omitted, the most
+    recent Event 41 within -DaysBack is used.
+
+.PARAMETER WindowMinutes
+    Minutes before the crash to scan for correlated events. Default: 10.
+
+.PARAMETER DaysBack
+    When -CrashTime is omitted, how far back to look for the most recent
+    crash. Default: 30.
+
+.PARAMETER Json
+    Emit machine-readable JSON.
+
+.EXAMPLE
+    scripts/crash-triage.ps1
+    Triage the most recent crash in the last 30 days.
+
+.EXAMPLE
+    scripts/crash-triage.ps1 -CrashTime '2026-05-15 00:57:50'
+    Triage a specific crash by timestamp.
+
+.EXAMPLE
+    scripts/crash-triage.ps1 -CrashTime '2026-05-15 00:57:50' -WindowMinutes 30
+    Widen the pre-crash window to 30 minutes (default 10).
+
+.EXAMPLE
+    scripts/crash-triage.ps1 -Json | jq '.bugcheck'
+    Pull just the BugCheck code from machine-readable output.
+
+.NOTES
+    Exit codes:
+      0 success
+      3 not found (no crashes in window)
+      4 validation
+#>
+
+[CmdletBinding()]
+param(
+    [datetime]$CrashTime,
+    [ValidateRange(1, 240)][int]$WindowMinutes = 10,
+    [ValidateRange(1, 365)][int]$DaysBack = 30,
+    [switch]$Json
+)
+
+$ErrorActionPreference = 'Stop'
+. "$PSScriptRoot\_lib\common.ps1"
+
+# BugCheck quick-lookup (most common codes; full catalog in references/bugcheck-codes.md)
+$bugCheckNames = @{
+    0x0   = '(no bugcheck recorded — hard power loss / total hang / hardware lockup)'
+    0x1E  = 'KMODE_EXCEPTION_NOT_HANDLED'
+    0x1A  = 'MEMORY_MANAGEMENT'
+    0x3B  = 'SYSTEM_SERVICE_EXCEPTION'
+    0x50  = 'PAGE_FAULT_IN_NONPAGED_AREA  (often storage I/O failure for pagefile)'
+    0x77  = 'KERNEL_STACK_INPAGE_ERROR  (storage paging failure)'
+    0x7A  = 'KERNEL_DATA_INPAGE_ERROR  (storage paging failure)'
+    0x7E  = 'SYSTEM_THREAD_EXCEPTION_NOT_HANDLED  (often GPU/network driver)'
+    0x9F  = 'DRIVER_POWER_STATE_FAILURE  (driver hung during sleep/wake)'
+    0xA   = 'IRQL_NOT_LESS_OR_EQUAL'
+    0xC1  = 'SPECIAL_POOL_DETECTED_MEMORY_CORRUPTION  (Driver Verifier)'
+    0xC2  = 'BAD_POOL_CALLER'
+    0xC4  = 'DRIVER_VERIFIER_DETECTED_VIOLATION'
+    0xD1  = 'DRIVER_IRQL_NOT_LESS_OR_EQUAL  (driver accessed bad memory at high IRQL)'
+    0xEF  = 'CRITICAL_PROCESS_DIED  (critical system process killed)'
+    0xF4  = 'CRITICAL_OBJECT_TERMINATION  (often storage-induced)'
+    0x101 = 'CLOCK_WATCHDOG_TIMEOUT  (CPU stall — chipset or hardware)'
+    0x124 = 'WHEA_UNCORRECTABLE_ERROR  (hardware-level fault)'
+    0x139 = 'KERNEL_SECURITY_CHECK_FAILURE  (stack/pool corruption)'
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# Find the target crash
+# ─────────────────────────────────────────────────────────────────────
+if (-not $CrashTime) {
+    Write-Log -Level INFO -Message "No -CrashTime given; finding most recent Event 41 in last $DaysBack days"
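+    # Pin the provider: other providers can also log an Id 41 in the System log
+    $crash = Get-WinEvent -FilterHashtable @{
+        LogName='System'
+        ProviderName='Microsoft-Windows-Kernel-Power'
+        Id=41
+        StartTime=(Get-Date).AddDays(-$DaysBack)
+    } -MaxEvents 1 -ErrorAction SilentlyContinue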
+    if (-not $crash) {
+        Write-Log -Level INFO -Message "No Event 41 crashes found in last $DaysBack days. System has been stable."
+        exit $script:EXIT_NOT_FOUND
+    }
+    $CrashTime = $crash.TimeCreated
+} else {
+    # Find the Event 41 closest to the given time (within ±60 seconds)
+    $low  = $CrashTime.AddMinutes(-1)
+    $high = $CrashTime.AddMinutes(1)
+    $crash = Get-WinEvent -FilterHashtable @{
+        LogName='System'
+        ProviderName='Microsoft-Windows-Kernel-Power'
+        Id=41
+        StartTime=$low
+        EndTime=$high
+    } -ErrorAction SilentlyContinue |
+        Sort-Object { [math]::Abs(($_.TimeCreated - $CrashTime).TotalSeconds) } |
+        Select-Object -First 1
+    if (-not $crash) {
+        Write-Log -Level FAIL -Message "No Event 41 found within ±60s of $CrashTime"
+        exit $script:EXIT_NOT_FOUND
+    }
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# Decode the crash record
+# ─────────────────────────────────────────────────────────────────────
+$bcCode  = [int64]$crash.Properties[0].Value
+$param1  = [int64]$crash.Properties[1].Value
+$param2  = [int64]$crash.Properties[2].Value
+$param3  = [int64]$crash.Properties[3].Value
+$param4  = [int64]$crash.Properties[4].Value
+$pwrBtn  = if ($crash.Properties.Count -gt 6) { [int64]$crash.Properties[6].Value } else { 0 }
+$bcHex   = '0x{0:X}' -f $bcCode
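+# Catalog keys are Int32; codes above Int32.MaxValue (e.g. 0xC000021A) would make an [int] cast throw
+$bcKey   = if ($bcCode -ge 0 -and $bcCode -le [int]::MaxValue) { [int]$bcCode } else { $null }
+$bcName  = if ($null -ne $bcKey -and $bugCheckNames.ContainsKey($bcKey)) { $bugCheckNames[$bcKey] } else { '(unknown — consult references/bugcheck-codes.md)' }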
+
+# Cause discrimination for BugCheck = 0
+$causeHint = if ($bcCode -eq 0) {
+    if ($pwrBtn -ne 0) { 'Power button was held → user force-shutdown of a hung machine' }
+    else                { 'No power button press recorded → hard power loss / hardware lockup / thermal trip' }
+} else { $null }
+
+# ─────────────────────────────────────────────────────────────────────
+# Walk the pre-crash window
+# ─────────────────────────────────────────────────────────────────────
+$windowStart = $CrashTime.AddMinutes(-$WindowMinutes)
+$preEvents = Get-WinEvent -FilterHashtable @{
+    LogName='System'
+    StartTime=$windowStart
+    EndTime=$CrashTime
+    Level=@(1,2,3)
+} -ErrorAction SilentlyContinue | Sort-Object TimeCreated
+
+# Smoking-gun detection
+$smokingGuns = @()
+foreach ($e in $preEvents) {
+    if ($e.ProviderName -eq 'storahci' -and $e.Id -eq 129) {
+        $smokingGuns += "STORAGE: storahci controller reset at $($e.TimeCreated.ToString('HH:mm:ss')) — drive stopped responding"
+    } elseif ($e.ProviderName -eq 'Microsoft-Windows-WHEA-Logger' -and $e.Level -le 2) {
+        $smokingGuns += "HARDWARE: WHEA error at $($e.TimeCreated.ToString('HH:mm:ss')) — CPU/RAM/PCIe-level fault"
+    } elseif ($e.ProviderName -match 'nvlddmkm|igdkmd|amdkmdag' -and $e.Level -le 2) {
+        $smokingGuns += "GPU: $($e.ProviderName) error at $($e.TimeCreated.ToString('HH:mm:ss')) — GPU driver issue"
+    } elseif ($e.ProviderName -eq 'disk' -and $e.Id -in @(7,51,153,154)) {
+        $smokingGuns += "STORAGE: disk Event $($e.Id) at $($e.TimeCreated.ToString('HH:mm:ss')) — bad block or hardware error"
+    }
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# Output
+# ─────────────────────────────────────────────────────────────────────
+if ($Json) {
+    @{
+        crashTime         = $CrashTime.ToString('o')
+        bugcheck          = $bcHex
+        bugcheckName      = $bcName
+        param1            = '0x{0:X}' -f $param1
+        param2            = '0x{0:X}' -f $param2
+        param3            = '0x{0:X}' -f $param3
+        param4            = '0x{0:X}' -f $param4
+        powerButtonHeld   = ($pwrBtn -ne 0)
+        causeHint         = $causeHint
+        windowMinutes     = $WindowMinutes
+        preCrashEvents    = $preEvents.Count
+        smokingGuns       = $smokingGuns
+        # @() keeps timeline a JSON array even when the window holds zero or one event
+        timeline          = @($preEvents | ForEach-Object {
+            @{
+                time     = $_.TimeCreated.ToString('o')
+                provider = $_.ProviderName
+                id       = $_.Id
+                level    = $_.LevelDisplayName
+                message  = (Format-EventMessage -Message $_.Message -MaxLength 200)
+            }
+        })
+    } | ConvertTo-Json -Depth 5 | ForEach-Object { [Console]::Out.WriteLine($_) }
+} else {
+    Write-Section "Crash record: $($CrashTime.ToString('yyyy-MM-dd HH:mm:ss'))"
+    [Console]::Out.WriteLine("  BugCheck:  $bcHex  $bcName")
+    [Console]::Out.WriteLine("  Param1:    0x{0:X}" -f $param1)
+    [Console]::Out.WriteLine("  Param2:    0x{0:X}" -f $param2)
+    [Console]::Out.WriteLine("  Param3:    0x{0:X}" -f $param3)
+    [Console]::Out.WriteLine("  Param4:    0x{0:X}" -f $param4)
+    [Console]::Out.WriteLine("  PowerBtn:  $(if ($pwrBtn -ne 0) {'held (forced shutdown)'} else {'not pressed'})")
+    if ($causeHint) {
+        [Console]::Out.WriteLine("")
+        [Console]::Out.WriteLine("  Cause hint:  $causeHint")
+    }
+
+    Write-Section "Pre-crash timeline ($WindowMinutes min before crash)"
+    if (-not $preEvents) {
+        [Console]::Out.WriteLine("  (no warning/error/critical events in window — sudden hang or instant fault)")
+    } else {
+        foreach ($e in $preEvents) {
+            $tStr = $e.TimeCreated.ToString('HH:mm:ss')
+            $msg  = Format-EventMessage -Message $e.Message -MaxLength 90
+            [Console]::Out.WriteLine(("  {0}  [{1,-3}] {2,-32} Id={3,-5} {4}" -f $tStr, $e.LevelDisplayName.Substring(0,3), $e.ProviderName.Substring(0,[Math]::Min(32,$e.ProviderName.Length)), $e.Id, $msg))
+        }
+    }
+
+    if ($smokingGuns) {
+        Write-Section "SMOKING GUNS"
+        foreach ($g in $smokingGuns) {
+            [Console]::Out.WriteLine("  - $g")
+        }
+    }
+}
+
+exit $script:EXIT_OK

+ 356 - 0
skills/windows-ops/scripts/health-audit.ps1

@@ -0,0 +1,356 @@
+<#
+.SYNOPSIS
+    Comprehensive Windows workstation health audit. Produces a verdict.
+
+.DESCRIPTION
+    Walks the diagnostic ladder: hardware errors, storage health per disk,
+    recent crashes with BugCheck codes, top resource consumers, startup
+    inventory across all five mechanisms. Emits [PASS]/[FAIL]/[WARN]
+    markers per check and a final verdict block.
+
+    Stdout is data only (a text report by default, or NDJSON when -Json).
+    Stderr carries progress and section headers.
+
+.PARAMETER Days
+    How many days back to scan event logs. Default: 30.
+
+.PARAMETER Json
+    Emit machine-readable NDJSON to stdout (one finding per line).
+
+.PARAMETER Quiet
+    Suppress section headers on stderr. Findings still emit.
+
+.EXAMPLE
+    scripts/health-audit.ps1
+    Run the full audit, scanning the last 30 days.
+
+.EXAMPLE
+    scripts/health-audit.ps1 -Days 7
+    Quick audit covering only the last week.
+
+.EXAMPLE
+    scripts/health-audit.ps1 -Json | ConvertFrom-Json
+    Pipe machine-readable output to a JSON consumer.
+
+.EXAMPLE
+    scripts/health-audit.ps1 -Json > audit.ndjson
+    Save audit findings as NDJSON for later processing.
+
+.NOTES
+    Exit codes:
+      0 success — audit completed, no critical findings
+      1 general error during audit
+      2 usage error (bad arguments)
+      4 critical finding (failing drive, recent unexplained crashes)
+      5 missing precondition (PowerShell version, required module)
+#>
+
+[CmdletBinding()]
+param(
+    [ValidateRange(1, 365)][int]$Days = 30,
+    [switch]$Json,
+    [switch]$Quiet
+)
+
+$ErrorActionPreference = 'Stop'
+. "$PSScriptRoot\_lib\common.ps1"
+
+$Findings = New-Object System.Collections.Generic.List[hashtable]
+
+function Add-Finding {
+    param(
+        [Parameter(Mandatory)][ValidateSet('pass','warn','fail','info')]$Level,
+        [Parameter(Mandatory)][string]$Category,
+        [Parameter(Mandatory)][string]$Subject,
+        [Parameter(Mandatory)][string]$Detail,
+        [hashtable]$Data = @{}
+    )
+    $f = @{
+        level    = $Level
+        category = $Category
+        subject  = $Subject
+        detail   = $Detail
+        data     = $Data
+        ts       = (Get-Date).ToString('o')
+    }
+    $Findings.Add($f)
+    if (-not $Quiet -or $Level -in @('warn','fail')) {
+        $tag = $Level.ToUpper()
+        [Console]::Error.WriteLine("[$tag] $Category :: $Subject -> $Detail")
+    }
+    if ($Json) {
+        [Console]::Out.WriteLine(($f | ConvertTo-Json -Compress -Depth 5))
+    }
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# Section: Hardware errors (WHEA)
+# ─────────────────────────────────────────────────────────────────────
+if (-not $Quiet) { Write-Section "1. Hardware errors (WHEA)" }
+
+try {
+    $whea = Get-WinEvent -FilterHashtable @{
+        LogName='System'
+        ProviderName='Microsoft-Windows-WHEA-Logger'
+        StartTime=(Get-Date).AddDays(-$Days)
+    } -ErrorAction SilentlyContinue
+    $wheaError = $whea | Where-Object { $_.Level -le 2 }   # Critical/Error
+    $wheaWarn  = $whea | Where-Object { $_.Level -eq 3 }   # Warning
+    if ($wheaError) {
+        Add-Finding -Level fail -Category 'hardware' -Subject 'WHEA errors' `
+            -Detail "$($wheaError.Count) uncorrectable hardware error(s) in last $Days days" `
+            -Data @{ count = $wheaError.Count; first = $wheaError[0].TimeCreated.ToString('o') }
+    } elseif ($wheaWarn) {
+        Add-Finding -Level warn -Category 'hardware' -Subject 'WHEA warnings' `
+            -Detail "$($wheaWarn.Count) corrected hardware event(s) — usually benign but trending"
+    } else {
+        Add-Finding -Level pass -Category 'hardware' -Subject 'WHEA' `
+            -Detail "No hardware errors logged in last $Days days"
+    }
+} catch {
+    Add-Finding -Level warn -Category 'hardware' -Subject 'WHEA query' -Detail "Failed: $_"
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# Section: Storage health per disk
+# ─────────────────────────────────────────────────────────────────────
+if (-not $Quiet) { Write-Section "2. Storage health per disk" }
+
+$diskMap = Get-DiskMap
+foreach ($d in $diskMap) {
+    if (-not $Quiet) {
+        [Console]::Error.WriteLine("  Disk $($d.Number): $($d.Model) [$($d.MediaType), $($d.BusType), $($d.SizeGB) GB, $($d.DriveLetters)]")
+    }
+}
+
+# Aggregate disk errors across the time window
+# Event messages use TWO formats for naming the affected disk:
+#   - Event 7/15/51:        "\Device\Harddisk<N>\DR..."
+#   - Event 153/154:        "...for Disk <N> (PDO name: \Device\...)"
+# Match both so per-disk counts cover the full set.
+try {
+    $diskErrs = Get-WinEvent -FilterHashtable @{
+        LogName='System'
+        ProviderName='disk'
+        StartTime=(Get-Date).AddDays(-$Days)
+    } -ErrorAction SilentlyContinue
+    $errsByDisk = @{}
+    foreach ($e in $diskErrs) {
+        $n = $null
+        if     ($e.Message -match 'Harddisk(\d+)')         { $n = $matches[1] }
+        elseif ($e.Message -match '\bfor Disk (\d+)\b')    { $n = $matches[1] }
+        if ($null -eq $n) { continue }
+        if (-not $errsByDisk.ContainsKey($n)) { $errsByDisk[$n] = @{} }
+        $id = "$($e.Id)"
+        if ($errsByDisk[$n].ContainsKey($id)) {
+            $errsByDisk[$n][$id] = $errsByDisk[$n][$id] + 1
+        } else {
+            $errsByDisk[$n][$id] = 1
+        }
+    }
+} catch { $errsByDisk = @{} }
+
+# storahci controller resets
+try {
+    $resets = Get-WinEvent -FilterHashtable @{
+        LogName='System'
+        ProviderName='storahci'
+        Id=129
+        StartTime=(Get-Date).AddDays(-$Days)
+    } -ErrorAction SilentlyContinue
+    $resetCount = if ($resets) { $resets.Count } else { 0 }
+} catch { $resetCount = 0 }
+
+# Per-disk verdict
+$failingDisks = @()
+foreach ($d in $diskMap) {
+    $n = "$($d.Number)"
+    $errs = if ($errsByDisk.ContainsKey($n)) { $errsByDisk[$n] } else { @{} }
+    $event7   = if ($errs.ContainsKey('7'))   { $errs['7']   } else { 0 }
+    $event154 = if ($errs.ContainsKey('154')) { $errs['154'] } else { 0 }
+    $event51  = if ($errs.ContainsKey('51'))  { $errs['51']  } else { 0 }
+
+    $isSsd = $d.MediaType -eq 'SSD'
+    $threshold7   = if ($isSsd) { 10 }  else { 50 }
+    $threshold154 = if ($isSsd) { 5 }   else { 10 }
+    # Watch thresholds: any Event 7/154 on an SSD is meaningful; HDDs tolerate a few
+    $watch7       = if ($isSsd) { 0 }   else { 5 }
+    $watch154     = if ($isSsd) { 0 }   else { 2 }
+
+    if ($event7 -gt $threshold7 -or $event154 -gt $threshold154 -or $event51 -gt 5) {
+        Add-Finding -Level fail -Category 'storage' -Subject "Disk $n ($($d.Model))" `
+            -Detail "Failing: Event7=$event7, Event154=$event154, Event51=$event51 over $Days days" `
+            -Data @{ diskNumber=$d.Number; model=$d.Model; driveLetters=$d.DriveLetters;
+                     event7=$event7; event154=$event154; event51=$event51 }
+        $failingDisks += $d
+    } elseif ($event7 -gt $watch7 -or $event154 -gt $watch154) {
+        Add-Finding -Level warn -Category 'storage' -Subject "Disk $n ($($d.Model))" `
+            -Detail "Watchlist: Event7=$event7, Event154=$event154 — back up important data" `
+            -Data @{ diskNumber=$d.Number; event7=$event7; event154=$event154 }
+    } else {
+        # Report the actual counts: a pass can still include a few events below the watch thresholds
+        Add-Finding -Level pass -Category 'storage' -Subject "Disk $n ($($d.Model))" `
+            -Detail "Clean: Event7=$event7, Event154=$event154, Event51=$event51 over $Days days"
+    }
+}
+
+if ($resetCount -gt 5) {
+    Add-Finding -Level fail -Category 'storage' -Subject 'Controller resets' `
+        -Detail "$resetCount storahci controller resets in last $Days days — active storage failure"
+} elseif ($resetCount -gt 0) {
+    Add-Finding -Level warn -Category 'storage' -Subject 'Controller resets' `
+        -Detail "$resetCount storahci controller resets — drive intermittently unresponsive"
+} else {
+    Add-Finding -Level pass -Category 'storage' -Subject 'Controller resets' `
+        -Detail "No storahci resets in last $Days days"
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# Section: Crash history
+# ─────────────────────────────────────────────────────────────────────
+if (-not $Quiet) { Write-Section "3. Crash history" }
+
+try {
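+    # Pin the provider: other providers can also log an Id 41 in the System log
+    $crashes = Get-WinEvent -FilterHashtable @{
+        LogName='System'
+        ProviderName='Microsoft-Windows-Kernel-Power'
+        Id=41
+        StartTime=(Get-Date).AddDays(-$Days)
+    } -ErrorAction SilentlyContinue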
+    if ($crashes) {
+        $hardShutdowns = 0
+        foreach ($c in $crashes) {
+            $bcCode  = $c.Properties[0].Value
+            $param1  = $c.Properties[1].Value
+            $pwrBtn  = if ($c.Properties.Count -gt 6) { $c.Properties[6].Value } else { 0 }
+            $bcHex   = '0x{0:X}' -f $bcCode
+
+            if ($bcCode -eq 0) {
+                $hardShutdowns++
+                $why = if ($pwrBtn -ne 0) { 'power button held (hang)' } else { 'hard power loss or total hardware lockup' }
+                Add-Finding -Level fail -Category 'crash' -Subject $c.TimeCreated.ToString('yyyy-MM-dd HH:mm') `
+                    -Detail "BugCheck=0x0 (no bugcheck recorded) — $why" `
+                    -Data @{ time=$c.TimeCreated.ToString('o'); bugcheck=$bcHex; powerButtonHeld=($pwrBtn -ne 0) }
+            } else {
+                Add-Finding -Level warn -Category 'crash' -Subject $c.TimeCreated.ToString('yyyy-MM-dd HH:mm') `
+                    -Detail "BugCheck=$bcHex Param1=0x$('{0:X}' -f $param1)" `
+                    -Data @{ time=$c.TimeCreated.ToString('o'); bugcheck=$bcHex; param1=('0x{0:X}' -f $param1) }
+            }
+        }
+        if ($hardShutdowns -ge 2) {
+            Add-Finding -Level fail -Category 'crash' -Subject 'Pattern' `
+                -Detail "$hardShutdowns unclean shutdowns with no bugcheck — investigate PSU, thermals, storage cabling"
+        }
+    } else {
+        Add-Finding -Level pass -Category 'crash' -Subject 'Crash log' -Detail "No Event 41 (Kernel-Power) crashes in last $Days days"
+    }
+} catch {
+    Add-Finding -Level warn -Category 'crash' -Subject 'Crash query' -Detail "Failed: $_"
+}
+
+# Crash dump configuration
+try {
+    $dumpCfg = Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl' -ErrorAction Stop
+    # Use the actual Windows directory rather than assuming C:\Windows
+    $hasMinidumps = (Test-Path "$env:SystemRoot\Minidump\*.dmp")
+    $hasMemoryDmp = (Test-Path "$env:SystemRoot\MEMORY.DMP")
+
+    if ($dumpCfg.CrashDumpEnabled -eq 0) {
+        Add-Finding -Level warn -Category 'crash' -Subject 'Dump config' -Detail "CrashDumpEnabled=0 — no dumps will be written on crash"
+    } elseif (-not $hasMinidumps -and -not $hasMemoryDmp -and $crashes) {
+        Add-Finding -Level warn -Category 'crash' -Subject 'Dump config' -Detail "Crashes recorded but no dump files exist — pagefile may be too small or crashes were power-loss"
+    } else {
+        $level = if ($dumpCfg.CrashDumpEnabled -eq 7) { 'pass' } else { 'info' }
+        Add-Finding -Level $level -Category 'crash' -Subject 'Dump config' -Detail "CrashDumpEnabled=$($dumpCfg.CrashDumpEnabled)"
+    }
+} catch {
+    Add-Finding -Level warn -Category 'crash' -Subject 'Dump config' -Detail "Failed to read CrashControl key: $_"
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# Section: Startup inventory
+# ─────────────────────────────────────────────────────────────────────
+if (-not $Quiet) { Write-Section "4. Startup inventory" }
+
+$runPaths = @(
+    'HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run',
+    'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run',
+    'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Run'
+)
+$runEntries = 0
+foreach ($p in $runPaths) {
+    if (Test-Path $p) {
+        $props = (Get-ItemProperty $p -ErrorAction SilentlyContinue).PSObject.Properties |
+            Where-Object { $_.Name -notmatch '^PS' }
+        $runEntries += @($props).Count
+    }
+}
+
+$autoSvcs = (Get-Service -ErrorAction SilentlyContinue | Where-Object {
+    $_.StartType -eq 'Automatic' -and $_.Status -eq 'Running'
+}).Count
+
+$logonTasks = (Get-ScheduledTask -ErrorAction SilentlyContinue | Where-Object {
+    $_.State -ne 'Disabled' -and ($_.Triggers.CimClass.CimClassName -match 'Logon|Boot')
+}).Count
+
+$startupFolderCount = 0
+foreach ($d in @("$env:APPDATA\Microsoft\Windows\Start Menu\Programs\Startup",
+                 "$env:ALLUSERSPROFILE\Microsoft\Windows\Start Menu\Programs\StartUp")) {
+    if (Test-Path $d) { $startupFolderCount += (Get-ChildItem $d -Filter *.lnk -ErrorAction SilentlyContinue).Count }
+}
+
+$totalStartup = $runEntries + $autoSvcs + $logonTasks + $startupFolderCount
+$level = if ($totalStartup -gt 100) { 'fail' } elseif ($totalStartup -gt 60) { 'warn' } else { 'pass' }
+Add-Finding -Level $level -Category 'startup' -Subject 'Total auto-launch items' `
+    -Detail "$totalStartup ($runEntries Run + $autoSvcs services + $logonTasks tasks + $startupFolderCount shortcuts)" `
+    -Data @{ runEntries=$runEntries; autoServices=$autoSvcs; logonTasks=$logonTasks; startupFolderShortcuts=$startupFolderCount }
+
+# ─────────────────────────────────────────────────────────────────────
+# Section: Resource pressure (right now)
+# ─────────────────────────────────────────────────────────────────────
+if (-not $Quiet) { Write-Section "5. Resource pressure (right now)" }
+
+try {
+    $os = Get-CimInstance Win32_OperatingSystem
+    $memUsedPct = [math]::Round((($os.TotalVisibleMemorySize - $os.FreePhysicalMemory) / $os.TotalVisibleMemorySize) * 100, 0)
+    $level = if ($memUsedPct -gt 90) { 'warn' } elseif ($memUsedPct -gt 80) { 'info' } else { 'pass' }
+    Add-Finding -Level $level -Category 'resource' -Subject 'Memory' -Detail "$memUsedPct% used"
+} catch {
+    Add-Finding -Level warn -Category 'resource' -Subject 'Memory' -Detail "Failed: $_"
+}
+
+# Top 5 processes by accumulated CPU
+try {
+    $topCpu = Get-Process | Where-Object { $_.CPU -gt 30 } | Sort-Object CPU -Descending | Select-Object -First 5
+    foreach ($p in $topCpu) {
+        Add-Finding -Level info -Category 'resource' -Subject "Top CPU: $($p.ProcessName)" `
+            -Detail "$([math]::Round($p.CPU,0))s CPU, $([math]::Round($p.WorkingSet64/1MB,0)) MB"
+    }
+} catch {
+    Add-Finding -Level warn -Category 'resource' -Subject 'Top CPU' -Detail "Failed: $_"
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# Verdict
+# ─────────────────────────────────────────────────────────────────────
+# @() keeps .Count correct when exactly one finding (or none) matches
+$failCount = @($Findings | Where-Object { $_.level -eq 'fail' }).Count
+$warnCount = @($Findings | Where-Object { $_.level -eq 'warn' }).Count
+$passCount = @($Findings | Where-Object { $_.level -eq 'pass' }).Count
+
+if (-not $Json) {
+    Write-Section "VERDICT"
+    [Console]::Out.WriteLine("")
+    [Console]::Out.WriteLine("  Findings: $failCount FAIL, $warnCount WARN, $passCount PASS")
+    [Console]::Out.WriteLine("")
+    if ($failingDisks) {
+        [Console]::Out.WriteLine("  FAILING DRIVES:")
+        foreach ($d in $failingDisks) {
+            [Console]::Out.WriteLine("    - Disk $($d.Number): $($d.Model) [$($d.DriveLetters)]")
+        }
+        [Console]::Out.WriteLine("")
+        [Console]::Out.WriteLine("  Recommended actions:")
+        [Console]::Out.WriteLine("    1. Back up data from failing drive(s) immediately")
+        [Console]::Out.WriteLine("    2. Physically disconnect or set Offline via diskpart")
+        [Console]::Out.WriteLine("    3. Replace drive before further use")
+    } elseif ($failCount -gt 0) {
+        [Console]::Out.WriteLine("  Critical findings present. See [FAIL] markers above.")
+    } else {
+        [Console]::Out.WriteLine("  No critical findings. System health within normal bounds.")
+    }
+    [Console]::Out.WriteLine("")
+}
+
+# Exit code semantics
+if ($failCount -gt 0) { exit $script:EXIT_VALIDATION }
+exit $script:EXIT_OK
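
A note on the verdict counters in the script above: when a pipeline yields exactly one object, `.Count` is answered by that object rather than by an array. The sketch below assumes findings are stored as hashtables (the shape built by `Add-Finding` is not shown in this diff, so `$sampleFindings` is hypothetical) and illustrates why wrapping the pipeline in `@()` gives a stable count.

```powershell
# Hypothetical findings list; the real Add-Finding storage format is not shown in this diff.
$sampleFindings = @(
    @{ level = 'fail'; subject = 'disk' },
    @{ level = 'pass'; subject = 'memory' },
    @{ level = 'pass'; subject = 'startup' }
)

# One match: the pipeline returns the hashtable itself, so .Count reports its key count.
$unwrappedCount = ($sampleFindings | Where-Object { $_.level -eq 'fail' }).Count

# @() forces an array, so .Count reports the number of matching findings.
$wrappedCount = @($sampleFindings | Where-Object { $_.level -eq 'fail' }).Count

"unwrapped=$unwrappedCount wrapped=$wrappedCount"
```

If findings are `[PSCustomObject]` instances instead, the unwrapped form happens to work in current PowerShell versions, but the `@()` form behaves the same for either shape.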

+ 183 - 0
skills/windows-ops/scripts/safe-disable-startup.ps1

@@ -0,0 +1,183 @@
+<#
+.SYNOPSIS
+    Disable (or re-enable) Windows startup entries via the StartupApproved
+    registry mechanism — no admin required, fully reversible.
+
+.DESCRIPTION
+    Equivalent of Task Manager's "Disable" button: writes a 12-byte binary
+    flag to HKCU\...\Explorer\StartupApproved\{Run,Run32,StartupFolder}
+    so the entry is skipped at next logon. Works on HKLM entries from a
+    non-admin context (overlay applies per-user only).
+
+    For an entry to be disable-able by this script it must exist in one of:
+      - HKCU/HKLM\...\CurrentVersion\Run                    (64-bit)
+      - HKLM\...\WOW6432Node\Microsoft\...\Run              (32-bit)
+    Startup-folder shortcuts are not enumerated, so -Name cannot match them.
+    Services and scheduled tasks are NOT touched by this script; those
+    need Set-Service / Disable-ScheduledTask respectively.
+
+.PARAMETER Name
+    The Run-key value name(s) to disable. Accepts an array of names and
+    wildcard patterns (matched with -like against actual Run-key entries).
+
+.PARAMETER Enable
+    Re-enable instead of disable (flips status byte 0x03 -> 0x02).
+
+.PARAMETER List
+    List every Run-key entry with its StartupApproved state and exit. Ignores -Name.
+
+.PARAMETER Json
+    Emit machine-readable JSON of the action taken.
+
+.EXAMPLE
+    scripts/safe-disable-startup.ps1 -Name 'Adobe Creative Cloud'
+    Disable a single entry by exact value name.
+
+.EXAMPLE
+    scripts/safe-disable-startup.ps1 -Name 'Granola','MuseHub','CometUpdaterTask*'
+    Disable multiple entries; wildcards expand against actual Run-key entries.
+
+.EXAMPLE
+    scripts/safe-disable-startup.ps1 -List
+    Show current enabled/disabled state of every known startup entry.
+
+.EXAMPLE
+    scripts/safe-disable-startup.ps1 -Name 'Adobe Creative Cloud' -Enable
+    Re-enable a previously-disabled entry.
+
+.NOTES
+    Exit codes:
+      0 success
+      2 usage (no names given and not -List)
+      3 not found (no matching Run-key entry for the given name)
+      4 validation error
+#>
+
+[CmdletBinding(SupportsShouldProcess)]
+param(
+    [Parameter(ValueFromPipeline, Position=0)][string[]]$Name,
+    [switch]$Enable,
+    [switch]$List,
+    [switch]$Json
+)
+
+$ErrorActionPreference = 'Stop'
+. "$PSScriptRoot\_lib\common.ps1"
+
+# Map: registry path -> StartupApproved variant for the overlay
+$pathVariantMap = @(
+    @{ Path = 'HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run';                       Variant = 'Run' }
+    @{ Path = 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Run';                       Variant = 'Run' }
+    @{ Path = 'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Run';           Variant = 'Run32' }
+)
+
+function Get-RunEntries {
+    $entries = @()
+    foreach ($m in $pathVariantMap) {
+        if (Test-Path $m.Path) {
+            (Get-ItemProperty $m.Path -ErrorAction SilentlyContinue).PSObject.Properties |
+                Where-Object { $_.Name -notmatch '^PS' } |
+                ForEach-Object {
+                    $entries += [PSCustomObject]@{
+                        Name    = $_.Name
+                        Command = $_.Value
+                        Path    = $m.Path
+                        Variant = $m.Variant
+                    }
+                }
+        }
+    }
+    return $entries
+}
+
+function Get-CurrentState {
+    param([Parameter(Mandatory)][string]$EntryName, [Parameter(Mandatory)][string]$Variant)
+    $key = "HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StartupApproved\$Variant"
+    if (-not (Test-Path $key)) { return 'unmanaged' }
+    $val = (Get-ItemProperty $key -Name $EntryName -ErrorAction SilentlyContinue).$EntryName
+    if (-not $val) { return 'unmanaged' }   # No overlay = uses default (enabled)
+    if ($val[0] -eq 0x03) { return 'disabled' }
+    elseif ($val[0] -eq 0x02) { return 'enabled' }
+    else { return "unknown(0x{0:X2})" -f $val[0] }
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# Mode: List
+# ─────────────────────────────────────────────────────────────────────
+if ($List) {
+    $allEntries = Get-RunEntries
+    foreach ($e in $allEntries) {
+        $state = Get-CurrentState -EntryName $e.Name -Variant $e.Variant
+        $row = [PSCustomObject]@{
+            Name    = $e.Name
+            State   = $state
+            Variant = $e.Variant
+            Source  = $e.Path   # full registry path; distinguishes HKCU, HKLM, and WOW6432Node
+            Command = $e.Command -replace '"',''
+        }
+        if ($Json) {
+            [Console]::Out.WriteLine(($row | ConvertTo-Json -Compress))
+        } else {
+            $tag = switch ($state) { 'disabled' {'[X]'} 'enabled' {'[ ]'} default {'[?]'} }
+            [Console]::Out.WriteLine(("{0} {1,-40} {2,-7} {3}" -f $tag, $e.Name.Substring(0, [Math]::Min(40, $e.Name.Length)), $state, $e.Variant))
+        }
+    }
+    exit $script:EXIT_OK
+}
+
+# ─────────────────────────────────────────────────────────────────────
+# Mode: Disable/Enable
+# ─────────────────────────────────────────────────────────────────────
+if (-not $Name) {
+    Write-Log -Level ERROR -Message "Must provide -Name or -List. See -? for help."
+    exit $script:EXIT_USAGE
+}
+
+$statusByte = if ($Enable) { [byte]0x02 } else { [byte]0x03 }
+$action     = if ($Enable) { 'enable' }   else { 'disable' }
+$valueBytes = ConvertTo-Bytes12 -StatusByte $statusByte
+
+$allEntries = Get-RunEntries
+$matched    = @()
+
+foreach ($pattern in $Name) {
+    $hits = $allEntries | Where-Object { $_.Name -like $pattern }
+    if (-not $hits) {
+        Write-Log -Level WARN -Message "No Run-key entries match pattern: $pattern"
+        continue
+    }
+    foreach ($e in $hits) {
+        if ($PSCmdlet.ShouldProcess("$($e.Name) (Variant=$($e.Variant))", "$action via StartupApproved\$($e.Variant)")) {
+            try {
+                $key = Get-StartupApprovedKey -Variant $e.Variant
+                Set-ItemProperty -Path $key -Name $e.Name -Value $valueBytes -Type Binary -Force
+                $matched += $e
+                $verified = Get-CurrentState -EntryName $e.Name -Variant $e.Variant
+                Write-Log -Level PASS -Message "${action}d: $($e.Name)  [$($e.Variant)] -> verified state: $verified"
+                if ($Json) {
+                    [Console]::Out.WriteLine((@{
+                        action   = $action
+                        name     = $e.Name
+                        variant  = $e.Variant
+                        verified = $verified
+                    } | ConvertTo-Json -Compress))
+                }
+            } catch {
+                Write-Log -Level FAIL -Message "Failed to $action $($e.Name): $_"
+            }
+        }
+    }
+}
+
+if (-not $matched) {
+    Write-Log -Level ERROR -Message "No matching entries acted on."
+    exit $script:EXIT_NOT_FOUND
+}
+
+if (-not $Json) {   # $Quiet is not a parameter of this script, so it is not tested here
+    [Console]::Error.WriteLine("")
+    [Console]::Error.WriteLine("$($matched.Count) entr$(if ($matched.Count -eq 1) {'y'} else {'ies'}) ${action}d. Effect applies at next user logon.")
+    [Console]::Error.WriteLine("Re-run with -List to verify.")
+}
+
+exit $script:EXIT_OK
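
`ConvertTo-Bytes12` is dot-sourced from `_lib\common.ps1` and does not appear in this diff. As a rough sketch of what such a helper might do, assuming the commonly observed StartupApproved layout (status byte first, three padding bytes, then an 8-byte FILETIME that is non-zero only for disabled entries), the hypothetical `New-StartupApprovedFlag` below builds the 12-byte value, and `Read-StartupApprovedFlag` mirrors the first-byte check in `Get-CurrentState`. Both function names and the timestamp handling are assumptions for illustration, not the author's implementation.

```powershell
# Hypothetical helper: builds a 12-byte StartupApproved value (layout is an assumption).
function New-StartupApprovedFlag {
    param([Parameter(Mandatory)][byte]$StatusByte)
    $bytes = New-Object byte[] 12          # all zeros by default
    $bytes[0] = $StatusByte                # 0x02 = enabled, 0x03 = disabled
    if ($StatusByte -eq 0x03) {
        # Assumed: bytes 4..11 carry the FILETIME at which the entry was disabled.
        $stamp = [BitConverter]::GetBytes([DateTime]::UtcNow.ToFileTimeUtc())
        [Array]::Copy($stamp, 0, $bytes, 4, 8)
    }
    return ,$bytes                         # leading comma keeps the array intact
}

# Hypothetical helper: decodes only the first byte, as Get-CurrentState does.
function Read-StartupApprovedFlag {
    param([Parameter(Mandatory)][byte[]]$Value)
    switch ($Value[0]) {
        0x02    { 'enabled' }
        0x03    { 'disabled' }
        default { 'unknown(0x{0:X2})' -f $Value[0] }
    }
}

$disabled = New-StartupApprovedFlag -StatusByte 0x03
$enabled  = New-StartupApprovedFlag -StatusByte 0x02
"{0} bytes -> {1}" -f $disabled.Length, (Read-StartupApprovedFlag -Value $disabled)
"{0} bytes -> {1}" -f $enabled.Length,  (Read-StartupApprovedFlag -Value $enabled)
```

Because the sketch never touches the registry, it can be run safely to inspect the byte layout before pointing `Set-ItemProperty -Type Binary` at a real StartupApproved key.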