
feat(skills): add r-ops — modern R / data-science skill (v3.4.0)

Tidyverse-first data-science skill (the set's first): import -> tidy ->
transform -> visualize -> model -> communicate across 9 reference files
(~115 KB), leading with current idioms (native |>, .by=, across, list_rbind,
tidymodels, tsibble/fable, Quarto + renv) and naming base R / data.table
where they win.

Built to the Skill Creation + Resource protocols:
- check-r-facts.py — §7 staleness verifier. --offline (PR CI) asserts every
  CRAN package in assets/r-packages.json is still named in the prose and the
  currency note carries a year; --live (weekly freshness) resolves each
  package on CRAN (exit 10 gone, exit 7 unreachable). Inline Term, --json,
  semantic exit codes.
- 43-assertion offline tests/run.sh incl. a content-currency guard that fails
  if a superseded idiom reappears as a recommendation.
- Wired into plugin.json (v3.4.0), README, AGENTS, docs/PLAN (96 skills),
  CHANGELOG, tests/check-resources.sh, and freshness.yml.

Salvaged + freshened from the stale stacked PR #6 (which also duplicated the
already-shipped supply-chain-defense); re-landed clean off current main.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
0xDarkMatter 7 hours ago
Parent
Commit
8bb0ecb6a3

+ 1 - 1
.claude-plugin/plugin.json

@@ -1,7 +1,7 @@
 {
   "$schema": "https://json.schemastore.org/claude-code-plugin-manifest.json",
   "name": "claude-mods",
-  "version": "3.3.0",
+  "version": "3.4.0",
   "description": "Custom commands, skills, agents, rules, hooks, and output styles for Claude Code - session continuity and modern CLI tooling for real-world development workflows",
   "author": {
     "name": "0xDarkMatter"

+ 9 - 0
.github/workflows/freshness.yml

@@ -79,6 +79,15 @@ jobs:
           if [ "$rc" -eq 7 ]; then echo "::warning::mapbox-ops live check unreachable — skipped"; fi
           exit 0
 
+      - name: r-ops recommended packages still on CRAN
+        run: |
+          set +e
+          python skills/r-ops/scripts/check-r-facts.py --live
+          rc=$?
+          if [ "$rc" -eq 10 ]; then echo "::error::r-ops drift — a recommended CRAN package no longer resolves (archived/removed)"; exit 1; fi
+          if [ "$rc" -eq 7 ]; then echo "::warning::r-ops live check unreachable (CRAN/crandb) — skipped"; fi
+          exit 0
+
       - name: GitHub Action refs still resolve
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
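
The r-ops step above branches on the verifier's exit code (10 fails the job, 7 only warns). A minimal sketch of how such a semantic exit-code contract could be computed, assuming a hypothetical `resolve` callable that stands in for the CRAN lookup; the real `check-r-facts.py` is not shown in this diff, so its internals and its policy when both conditions occur are assumptions here:

```python
from typing import Callable, Iterable

EXIT_OK, EXIT_UNREACHABLE, EXIT_DRIFT = 0, 7, 10


def live_exit_code(packages: Iterable[str], resolve: Callable[[str], bool]) -> int:
    """Map per-package lookups to one exit code.

    resolve(name) is a hypothetical helper: True if the package resolves,
    False if it is gone, and it raises ConnectionError if the registry
    cannot be reached.
    """
    gone = []
    unreachable = False
    for name in packages:
        try:
            if not resolve(name):
                gone.append(name)
        except ConnectionError:
            unreachable = True
    if gone:
        # Assumed policy: confirmed drift outranks a partial outage.
        return EXIT_DRIFT
    return EXIT_UNREACHABLE if unreachable else EXIT_OK
```

Injecting the resolver keeps the decision logic testable without network access, which fits the split between a PR-blocking offline check and a weekly live check.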

+ 1 - 1
AGENTS.md

@@ -5,7 +5,7 @@
 This is **claude-mods** - a collection of custom extensions for Claude Code:
 - **3 expert agents** for pure context-isolation/worker roles (git-agent, firecrawl-expert, project-organizer) - every domain-knowledge agent became an `-ops` skill (v3.0, skills-first)
 - **2 commands** for session management (/sync, /save)
-- **95 skills** for CLI tools, patterns, workflows, and development tasks (incl. `loop-ops` for outer-loop design discipline, `ffmpeg-ops` for probe-first media processing and EDL-driven editing, `supply-chain-defense` for behavioural-first dependency security, `prompt-injection-defense` for instruction-integrity scanning, `pypi-ops` for OIDC Trusted Publishing to PyPI, `net-ops` for network troubleshooting, `windows-ops` / `mac-ops` for workstation diagnostics, `fleet-worker` for cheap parallel worker delegation)
+- **96 skills** for CLI tools, patterns, workflows, and development tasks (incl. `r-ops` for tidyverse-first modern R / data analysis, `loop-ops` for outer-loop design discipline, `ffmpeg-ops` for probe-first media processing and EDL-driven editing, `supply-chain-defense` for behavioural-first dependency security, `prompt-injection-defense` for instruction-integrity scanning, `pypi-ops` for OIDC Trusted Publishing to PyPI, `net-ops` for network troubleshooting, `windows-ops` / `mac-ops` for workstation diagnostics, `fleet-worker` for cheap parallel worker delegation)
 - **13 output styles** for response personality (Vesper, Spartan, Mentor, Executive, Pair, Atlas, Coach, Harbour, Meridian, Noir, Roast, Sage, Scout)
 - **11 hooks** for pre-commit linting, post-edit formatting, dangerous command warnings, uv enforcement, dependency-install + manifest-edit supply-chain advisories, hidden-Unicode scanning (session-start + pre-commit), live config-change + worktree guards, and pmail notifications - security set auto-wired via plugin hooks.json
 - **Pigeon** inter-session messaging (`pigeon send/read/reply`) - SQLite-backed pmail at `~/.claude/pmail.db`

+ 19 - 0
CHANGELOG.md

@@ -5,6 +5,25 @@ All notable changes to claude-mods are documented here. Format follows
 [Semantic Versioning](https://semver.org/). Fuller narrative entries for
 feature releases live in the README "Recent Updates" section.
 
+## [3.4.0] - 2026-06-23
+
+### Added
+- **`r-ops` skill** - the set's first data-science skill: a tidyverse-first,
+  current-best-practice reference for modern R (2024+). `SKILL.md` routes an
+  import → tidy → transform → visualize → model → communicate workflow across
+  9 reference files (~115 KB): tidyverse-core, import-io, strings-dates-factors,
+  visualization, iteration-functional, modeling-stats, data-table, time-series,
+  workflow-tooling. Leads with current idioms (native `|>`, dplyr `.by=`, the
+  `\(x)` lambda, `across()`, `list_rbind`, `slice_*`, tidymodels, tsibble/fable,
+  Quarto + renv); names base R and `data.table` where they win. Ships a
+  43-assertion offline self-test and a `check-r-facts.py` §7 staleness verifier:
+  `--offline` (PR CI) asserts every CRAN package in `assets/r-packages.json` is
+  still named in the prose and the currency note carries a year; `--live`
+  (weekly freshness, never blocks a PR) resolves each package on CRAN, exit 10
+  if one is archived/removed, exit 7 if CRAN is unreachable. Salvaged and
+  freshened from the stale stacked PR #6 (which also duplicated the
+  already-shipped supply-chain-defense); re-landed clean off current `main`.
+
 ## [3.3.0] - 2026-06-22
 
 ### Added

+ 9 - 5
README.md

@@ -12,16 +12,19 @@
 
 > *A comprehensive extension toolkit that transforms Claude Code into a specialized development powerhouse.*
 
-**claude-mods** is a production-ready plugin that extends Claude Code with 95 specialized skills, 3 expert agents, 13 output styles, 11 hooks, and modern CLI tools designed for real-world development workflows. Whether you're debugging React hooks, optimizing PostgreSQL queries, or building production CLI applications, this toolkit equips Claude with the domain expertise and procedural knowledge to work at expert level across multiple technology stacks.
+**claude-mods** is a production-ready plugin that extends Claude Code with 96 specialized skills, 3 expert agents, 13 output styles, 11 hooks, and modern CLI tools designed for real-world development workflows. Whether you're debugging React hooks, optimizing PostgreSQL queries, or building production CLI applications, this toolkit equips Claude with the domain expertise and procedural knowledge to work at expert level across multiple technology stacks.
 
 Built on the [Agent Skills specification](https://agentskills.io/specification) (an open standard backed by Anthropic, Vercel, Google, Microsoft, and 40+ agent platforms), claude-mods fills critical gaps in Claude Code's capabilities: persistent session state that survives across machines, on-demand expert knowledge for specialized domains, token-efficient modern CLI tools (10-100x faster than traditional alternatives), and proven workflow patterns for TDD, code review, and feature development. The toolkit implements Anthropic's [recommended patterns for long-running agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents), ensuring your development context never vanishes when sessions end.
 
 From Python async patterns to Rust ownership models, from AWS Fargate deployments to Craft CMS development - claude-mods provides the specialized knowledge and tools that transform Claude from a general-purpose assistant into a domain expert who understands your stack, remembers your workflow, and ships production code.
 
-**3 agents. 95 skills. 13 styles. 11 hooks. 8 rules. One install.**
+**3 agents. 96 skills. 13 styles. 11 hooks. 8 rules. One install.**
 
 ## Recent Updates
 
+**v3.4.0** (June 2026)
+- 📊 **`r-ops` skill** — the set's first data-science skill: a tidyverse-first, current-best-practice reference for modern R (2024+). `SKILL.md` routes an import → tidy → transform → visualize → model → communicate workflow across **9 reference files (~115 KB)** — tidyverse-core, import-io, strings-dates-factors, visualization, iteration-functional, modeling-stats, data-table, time-series, workflow-tooling. Leads with current idioms (native `|>`, dplyr `.by=`, the `\(x)` lambda, `across()`, `list_rbind`, `slice_*`, tidymodels, the tidyverts `tsibble`/`fable` stack, Quarto + renv) and names base R / `data.table` where they win. Ships a 43-assertion offline self-test plus a `check-r-facts.py` §7 staleness verifier (`--offline` asserts every catalogued CRAN package is still named in the prose and the currency note carries a year; `--live` resolves each package on CRAN) so the modern-stack claim is **machine-enforced, not asserted**. Salvaged and freshened from the stale stacked PR #6 (which also duplicated the already-shipped supply-chain-defense), re-landed clean off current `main`.
+
 **v3.3.0** (June 2026)
 - 🔁 **`loop-ops` skill** — the *outer-loop* design discipline, twin to [`iterate`](skills/iterate/) (the inner loop). Where `iterate` drives one metric in one session, `loop-ops` is the orchestration layer above it: how to design, scaffold, cost, and **safely** run scheduled discover→triage→implement→verify→escalate-or-land agent loops. Its spine is the **risk-tier ladder** — L1 report → L2 assisted → L3 unattended — mapped onto Claude Code's *actual* permission model (the part a generic-agent methodology can't reach): each tier is a concrete permission mode, with the *enumerate-vs-isolate* fork and the load-bearing rule that **a scheduler invokes `claude -p`, not a session that spawns ungated children**. Ships a STATE/run-log/budget state spine, a **13-pattern catalog framed as a morphology** (trigger × posture × locus — incl. event-driven Channels and `/goal`-completion archetypes like metric-chase & backfill), multi-loop coordination + kill switch, and three Resource-Protocol scripts — `loop-scaffold` (scaffold), `loop-check` (readiness scorer that refuses a green light on an unbounded scope / missing gate / undefined escalation), `loop-estimate` (token-$ estimate by pattern × cadence × model). Composes `fleet-worker` (spawn) and `fleet-ops` (land); 109-assertion offline suite. Builds on the public *loop engineering* discipline (Steinberger, Osmani) and the [Ralph loop](https://ghuntley.com/ralph/), grounded in this repo's auto-mode-classifier reference.
 
@@ -79,7 +82,7 @@ Claude Code is powerful out of the box, but it has gaps. This toolkit fills them
 
 - **Session continuity** — Tasks vanish when sessions end. We fix that with `/save` and `/sync`, implementing Anthropic's [recommended pattern](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) for long-running agents.
 
-- **Expert-level knowledge on demand** — 95 on-demand skills covering React, TypeScript, Python, Go, Rust, PostgreSQL, and more, plus 3 specialized agents reserved for genuine context-isolation/worker roles (git operations, web scraping, project reorganization). Skills-first: knowledge loads when relevant instead of living in heavyweight agent prompts.
+- **Expert-level knowledge on demand** — 96 on-demand skills covering React, TypeScript, Python, Go, Rust, PostgreSQL, and more, plus 3 specialized agents reserved for genuine context-isolation/worker roles (git operations, web scraping, project reorganization). Skills-first: knowledge loads when relevant instead of living in heavyweight agent prompts.
 
 - **Modern CLI tools** — Stop using `grep`, `find`, and `cat`. Our rules automatically prefer `ripgrep`, `fd`, `eza`, and `bat` — 10-100x faster and token-efficient.
 
@@ -104,7 +107,7 @@ claude-mods/
 ├── .claude-plugin/     # Plugin metadata
 ├── agents/             # Expert subagents (3)
 ├── commands/           # Slash commands (2)
-├── skills/             # Custom skills (95)
+├── skills/             # Custom skills (96)
 ├── output-styles/      # Response personalities
 ├── hooks/              # Hook examples & docs
 ├── rules/              # Claude Code rules
@@ -197,6 +200,7 @@ See [skill-creator](skills/skill-creator/) for the complete guide.
 | [rust-ops](skills/rust-ops/) | Rust ownership, async/tokio, error handling, traits, serde, ecosystem |
 | [typescript-ops](skills/typescript-ops/) | TypeScript type system, generics, utility types, strict mode, Zod |
 | [javascript-ops](skills/javascript-ops/) | JavaScript/Node.js async patterns, modules, ES2024+, runtime internals |
+| [r-ops](skills/r-ops/) | Modern R - tidyverse-first data analysis, dplyr/tidyr wrangling, ggplot2, stats/modeling (broom, tidymodels), data.table, time series, renv/Quarto workflow |
 | [react-ops](skills/react-ops/) | React hooks, Server Components, state management, performance, testing |
 | [vue-ops](skills/vue-ops/) | Vue 3 Composition API, Pinia, Vue Router, Nuxt 3 |
 | [astro-ops](skills/astro-ops/) | Astro islands, content collections, rendering strategies, deployment |
@@ -567,7 +571,7 @@ When using multiple MCP servers (Chrome DevTools, Vibe Kanban, etc.), their tool
 
 ### Skill Description Budget
 
-With 90+ skills installed (this plugin alone ships 95), skill descriptions can overflow the listing budget. All skill names are always listed, but descriptions share a budget of **1% of the model context window** — on overflow, least-invoked skills lose their descriptions first and **silently stop auto-triggering** (explicit `/name` invocation still works). Each skill's combined `description` + `when_to_use` is also truncated at **1,536 chars**, so trigger phrases belong at the front.
+With 90+ skills installed (this plugin alone ships 96), skill descriptions can overflow the listing budget. All skill names are always listed, but descriptions share a budget of **1% of the model context window** — on overflow, least-invoked skills lose their descriptions first and **silently stop auto-triggering** (explicit `/name` invocation still works). Each skill's combined `description` + `when_to_use` is also truncated at **1,536 chars**, so trigger phrases belong at the front.
 
 - **Check:** run `/doctor` — it shows whether the budget is overflowing and which skills are affected.
 - **Fix:** demote or disable skills you don't use via `skillOverrides` in settings (`"on"` / `"name-only"` / `"user-invocable-only"` / `"off"` per skill, or `/skills` + `Space`). Plugin skills are managed via `/plugin` instead.

+ 1 - 1
docs/PLAN.md

@@ -16,7 +16,7 @@
 | Component | Count | Notes |
 |-----------|-------|-------|
 | Agents | 3 | Pure context-isolation/worker roles only: git-agent (background commits/PRs), firecrawl-expert (noisy scrapes), project-organizer (bulk restructure) |
-| Skills | 95 | Operational skills, CLI tools, workflows, diagnostics, security |
+| Skills | 96 | Operational skills, CLI tools, workflows, diagnostics, security |
 | Commands | 2 | Session management (sync, save) |
 | Rules | 8 | cli-tools, commit-style, naming-conventions, prompt-injection, skill-agent-updates, supply-chain, worktree-boundaries, loop-engineering |
 | Output Styles | 13 | Vesper, Spartan, Mentor, Executive, Pair, Atlas, Coach, Harbour, Meridian, Noir, Roast, Sage, Scout |

+ 128 - 0
skills/r-ops/SKILL.md

@@ -0,0 +1,128 @@
+---
+name: r-ops
+description: "Modern R operations for data analysis, statistics, and reproducible work. Use for: R, Rstats, tidyverse, dplyr, tidyr, ggplot2, the native pipe |>, tibbles, data wrangling (filter/mutate/summarise/group_by/across/joins/pivot), reading and writing data (readr, readxl, arrow/Parquet, DBI/dbplyr databases, data.table::fread, rvest scraping), strings (stringr) and regex, dates/times (lubridate), factors (forcats), iteration and functional programming (purrr map family, list-columns), statistics and modeling (t.test/lm/glm, formulas, broom, tidymodels), high-performance data.table, time series (tsibble/fable, zoo/xts), and project workflow (renv, Quarto, here, testthat, styler, RStudio/Posit Projects). Covers tidyverse-first idioms with base R and data.table as named alternatives."
+when_to_use: "Use for any R / Rstats work — tidyverse data wrangling, statistics, ggplot2 visualization, or reproducible analysis — e.g. 'rewrite this in dplyr/tidyverse', 'how do I pivot/join/group these', 'plot this with ggplot2', 'fit a model and tidy it with broom', 'speed this up with data.table', 'set up an renv + Quarto project'. Leads with modern idioms (native |>, .by=, across, purrr map); names base R / data.table where they win."
+license: MIT
+compatibility: "R >= 4.1 (native |> pipe); tidyverse 2.0; Quarto"
+allowed-tools: "Read Write Bash"
+metadata:
+  author: claude-mods
+  related-skills: "sql-ops, postgres-ops, python-database-ops"
+---
+
+# Modern R Operations
+
+A tidyverse-first, current-best-practice reference for working in R (2024+): data analysis, statistics, visualization, and reproducible workflow. Opinionated where the community has converged, with base R and `data.table` flagged as the right tool when they are.
+
+## The modern R stack at a glance
+
+| Job | Reach for | Not (anymore) |
+|-----|-----------|---------------|
+| Pipe | native `\|>` (R 4.1+) | `%>%` only when you need its placeholder/`.` features |
+| Data frame | `tibble` | `data.frame` defaults (but it's fine) |
+| Wrangle | `dplyr` + `tidyr` | hand-rolled `[`, `subset`, `aggregate` |
+| Read CSV | `readr::read_csv` (prod), `data.table::fread` (speed) | `read.csv` |
+| Excel / Parquet / DB | `readxl` / `arrow` / `DBI`+`dbplyr` | — |
+| Strings / dates / factors | `stringr` / `lubridate` / `forcats` | base `grepl`/`POSIXlt`/`factor` juggling |
+| Plot | `ggplot2` | base graphics (fine for throwaway plots) |
+| Iterate | `purrr::map_*` + `across()` | `sapply` (type-unstable); `lapply` ok in package code |
+| Big / fast | `data.table` (or `dtplyr`, `arrow`+`duckdb`) | — |
+| Model | base `lm`/`glm` + `broom`; `tidymodels` for CV/tuning | `caret` |
+| Time series | `tsibble` + `fable` | `forecast::auto.arima` (maintenance-only) |
+| Reports | Quarto (`.qmd`) | R Markdown (still works) |
+| Reproducibility | `renv` + Projects + `here()` | `setwd()`, saving `.RData` |
+
+## The analysis workflow (and where each reference lives)
+
+```
+import → tidy → transform → visualize → model → communicate
+```
+
+1. **Import** — get data in: [import-io.md](references/import-io.md)
+2. **Tidy & transform** — the dplyr/tidyr core: [tidyverse-core.md](references/tidyverse-core.md)
+3. **Clean types** — strings, dates, factors: [strings-dates-factors.md](references/strings-dates-factors.md)
+4. **Iterate** — map over many things, list-columns: [iteration-functional.md](references/iteration-functional.md)
+5. **Visualize** — ggplot2 + EDA: [visualization.md](references/visualization.md)
+6. **Model** — tests, lm/glm, broom, tidymodels: [modeling-stats.md](references/modeling-stats.md)
+7. **Scale up** — when dplyr is too slow: [data-table.md](references/data-table.md)
+8. **Time series** — tsibble/fable, xts: [time-series.md](references/time-series.md)
+9. **Ship it** — projects, renv, Quarto, testing: [workflow-tooling.md](references/workflow-tooling.md)
+
+Open the reference for the task at hand — they load on demand. For broad orientation, this file is enough.
+
+## Core idioms (internalize these)
+
+```r
+library(tidyverse)
+
+# The native pipe threads a value into the first argument.
+diamonds |>
+  filter(carat > 0.5) |>
+  mutate(price_per_carat = price / carat) |>
+  summarise(
+    mean_ppc = mean(price_per_carat),
+    n = n(),
+    .by = cut                      # per-operation grouping (dplyr 1.1+)
+  ) |>
+  arrange(desc(mean_ppc))
+
+# across() applies one op to many columns
+df |> summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
+
+# map over a list/vector, type-stable; combine results
+files |> map(read_csv) |> list_rbind(names_to = "source")
+
+# ggplot: data + aesthetic mapping + layered geoms
+ggplot(df, aes(x = displ, y = hwy, colour = class)) +
+  geom_point() +
+  geom_smooth(method = "lm")
+```
+
+## Decision shortcuts
+
+**Grouping**: prefer per-operation `.by =` over `group_by() |> ... |> ungroup()` — it avoids sticky-group bugs.
+
+**Joins**: always write `join_by(...)` explicitly. Natural joins on shared names are almost always wrong on real data.
+
+**Which CSV reader?** `read_csv` (readable, good defaults, production) · `fread` (fastest, big files) · `vroom` (many files, column subset).
+
+**dplyr or data.table?** dplyr for readability and teams; data.table (or `dtplyr`) when profiling says dplyr is the bottleneck or data is large. `arrow`+`duckdb` for larger-than-memory.
+
+**lm or tidymodels?** Base `lm`/`glm` is the right default — reach for tidymodels only when you need cross-validation, tuning, or uniform multi-model comparison.
+
+**base R or tidyverse?** Tidyverse for analysis, readability, teams. Base R (or data.table) for package development, minimal-dependency scripts, and performance-critical inner loops. The `|>` pipe is base and dependency-free — use it everywhere.
+
+## High-value gotchas
+
+These bite people repeatedly — full detail in the referenced files:
+
+- **`stringsAsFactors` is `FALSE` since R 4.0** (2020). Old advice warning about automatic factor conversion on import is stale and sometimes backwards. (import-io)
+- **`predict(glm_model, type = "response")`** for probabilities — the default returns link-scale (log-odds). (modeling-stats)
+- **`cor.test()`, not `cor()`** when you care whether a correlation is real. (modeling-stats)
+- **`sapply` is type-unstable** — never in function bodies; use a typed `map_*`. (iteration-functional)
+- **`map_dfr`/`map_dfc` are superseded** → `map() |> list_rbind()` / `list_cbind()`. (iteration-functional)
+- **ggplot mapping vs setting**: `aes(colour = class)` maps a variable; `colour = "blue"` sets a constant. Putting a constant inside `aes()` is the #1 ggplot mistake. (visualization)
+- **`coord_cartesian(ylim=)` zooms; `scale_y_continuous(limits=)` drops data** — the latter silently corrupts smooths/boxplots. (visualization)
+- **Factor order is not cosmetic** — it sets ggplot axis/legend order and regression reference levels. `fct_reorder` for plots, `fct_relevel` for models. (strings-dates-factors)
+- **lubridate periods vs durations**: `months(1)` (calendar) vs `dmonths(1)` (fixed seconds); use `%m+%` for safe month-end arithmetic. (strings-dates-factors)
+- **`data.table` `:=` mutates in place** — `DT2 <- DT` is not a copy; use `copy(DT)`. (data-table)
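+- **`lag()` direction differs by class**: on `zoo`/`ts` objects `lag(x, k = 1)` *leads* (pulls in future data), so use `k = -1`; on `xts` objects `lag(x, k = 1)` is a true lag (the value at *t* comes from *t-1*). `rollapply` on zoo objects defaults to center alignment; set `align = "right"` to avoid look-ahead bias. (time-series)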
+- **Never `setwd()` with an absolute path** — use an RStudio Project + `here::here()`. Don't save/restore `.RData`. (workflow-tooling)
+
+## Currency note
+
+Reflects the R ecosystem as of 2024–2026: R ≥ 4.3, tidyverse 2.0, native `|>`, dplyr `.by=`, the `\(x)` lambda, `list_rbind`/`list_cbind`, the tidyverts (tsibble/fable) time-series stack, and Quarto. Where a once-standard approach has been superseded (base apply → purrr, `forecast` → fable, R Markdown → Quarto, `map_dfr` → `list_rbind`), the modern form leads and the older one is noted for when you encounter it in the wild.
+
+This currency is **verified, not asserted** — [`scripts/check-r-facts.py`](scripts/check-r-facts.py) guards it against silent drift:
+
+```bash
+# Structural (PR CI, no network): every CRAN package in the catalog is still
+# named in this skill's prose, and the currency note still carries a year.
+python scripts/check-r-facts.py --offline        # exit 0 consistent, 10 drift
+
+# Live (weekly freshness job, never blocks a PR): every recommended package
+# still resolves on CRAN.
+python scripts/check-r-facts.py --live            # exit 10 a package is gone, 7 CRAN unreachable
+```
+
+The canonical package list lives in [`assets/r-packages.json`](assets/r-packages.json); when you add or drop a recommendation, update it to match or `--offline` fails CI.
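
The `--offline` contract described above (every catalogued package is still named in the prose, and the currency note carries a year) can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped `check-r-facts.py`: the function name, the whole-token matching rule, and the year pattern are all hypothetical.

```python
import json
import re


def offline_check(catalog_json: str, prose: str, currency_note: str) -> int:
    """Return 0 if catalog and prose agree, 10 on drift."""
    catalog = json.loads(catalog_json)
    names = [pkg["name"] for pkg in catalog["packages"]]

    # Assumed rule: a package counts as named only when it appears as a
    # whole token, so "dplyr" is not satisfied by "dbplyr".
    missing = [
        name
        for name in names
        if not re.search(r"(?<![\w.])" + re.escape(name) + r"(?!\w)", prose)
    ]

    # Assumed rule: any four-digit 20xx year satisfies the currency check.
    has_year = re.search(r"\b20\d{2}\b", currency_note) is not None

    return 0 if not missing and has_year else 10
```

Whole-token matching matters for this catalog because several names are near-substrings of each other (`dplyr`, `dbplyr`, `dtplyr`).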

+ 41 - 0
skills/r-ops/assets/r-packages.json

@@ -0,0 +1,41 @@
+{
+  "schema": "claude-mods.r-ops.packages/v1",
+  "generated": "2026-06-22",
+  "note": "Canonical CRAN packages r-ops recommends. check-r-facts.py --offline asserts every name here is still named in the skill prose (catalog must not drift from the references); --live (run weekly by freshness.yml, never a PR gate) asserts each still resolves on CRAN via crandb.r-pkg.org. A package that gets archived/removed upstream is the exact silent-staleness failure §7 guards against.",
+  "registry": "https://crandb.r-pkg.org/",
+  "packages": [
+    {"name": "tidyverse", "role": "meta"},
+    {"name": "dplyr", "role": "wrangle"},
+    {"name": "tidyr", "role": "wrangle"},
+    {"name": "tibble", "role": "wrangle"},
+    {"name": "ggplot2", "role": "visualize"},
+    {"name": "scales", "role": "visualize"},
+    {"name": "readr", "role": "import-io"},
+    {"name": "readxl", "role": "import-io"},
+    {"name": "arrow", "role": "import-io"},
+    {"name": "DBI", "role": "import-io"},
+    {"name": "dbplyr", "role": "import-io"},
+    {"name": "vroom", "role": "import-io"},
+    {"name": "rvest", "role": "import-io"},
+    {"name": "purrr", "role": "iterate"},
+    {"name": "stringr", "role": "strings-dates-factors"},
+    {"name": "lubridate", "role": "strings-dates-factors"},
+    {"name": "forcats", "role": "strings-dates-factors"},
+    {"name": "data.table", "role": "performance"},
+    {"name": "dtplyr", "role": "performance"},
+    {"name": "duckdb", "role": "performance"},
+    {"name": "broom", "role": "model"},
+    {"name": "tidymodels", "role": "model"},
+    {"name": "tsibble", "role": "time-series"},
+    {"name": "fable", "role": "time-series"},
+    {"name": "feasts", "role": "time-series"},
+    {"name": "zoo", "role": "time-series"},
+    {"name": "xts", "role": "time-series"},
+    {"name": "renv", "role": "workflow"},
+    {"name": "here", "role": "workflow"},
+    {"name": "usethis", "role": "workflow"},
+    {"name": "testthat", "role": "workflow"},
+    {"name": "styler", "role": "workflow"},
+    {"name": "quarto", "role": "workflow"}
+  ]
+}

+ 368 - 0
skills/r-ops/references/data-table.md

@@ -0,0 +1,368 @@
+# data.table — High-Performance Data Manipulation
+
+`data.table` is an in-memory data frame replacement optimised for large datasets.
+Use it when dplyr is too slow, memory is constrained, or reference semantics are
+wanted. Core advantage: **no copies** — mutations happen in place via `:=`.
+
+---
+
+## Core Syntax: `DT[i, j, by]`
+
+```r
+DT[i, j, by]
+# i  = row filter (WHERE)
+# j  = column expression (SELECT / mutate)
+# by = grouping variable(s) (GROUP BY)
+```
+
+Unused slots are left empty rather than filled with `NULL`: keep the leading comma
+to skip `i` (`DT[, j]`), and simply omit a trailing `j` or `by`.
+
+---
+
+## Creating data.tables
+
+```r
+library(data.table)
+
+# From scratch
+DT <- data.table(id = 1:5, val = rnorm(5), grp = c("a","a","b","b","b"))
+
+# Convert data.frame — copies
+DT <- as.data.table(df)
+
+# Convert in place — no copy (modifies df's class)
+setDT(df)        # df is now a data.table; no assignment needed
+setDF(DT)        # reverse: back to data.frame in place
+```
+
+---
+
+## Row Filtering (`i`)
+
+```r
+DT[val > 0]                    # logical
+DT[grp == "a"]
+DT[grp %in% c("a","b")]
+
+# Keyed binary search (set key first — see Keys section)
+setkey(DT, grp)
+DT["a"]                        # all rows where grp == "a"
+DT[.(c("a", "b"))]             # multi-value lookup on the single key column
+```
+
+---
+
+## Column Operations (`j`)
+
+```r
+# Select columns — use .() not c()
+DT[, .(id, val)]
+DT[, c("id", "val")]           # also works; .() is idiomatic
+
+# Compute
+DT[, .(mean_val = mean(val), n = .N)]
+
+# Add / update column BY REFERENCE — no copy made
+DT[, new_col := val * 2]
+DT[grp == "a", flag := TRUE]   # conditional assignment
+
+# Multiple columns at once
+DT[, `:=`(sq = val^2, log_val = log(val + 1))]
+
+# Delete a column
+DT[, drop_col := NULL]
+
+# .N — row count (in j or by context)
+DT[, .N]                       # total rows
+DT[, .N, by = grp]             # rows per group
+
+# .SD — Subset of Data (the current group's rows as a data.table)
+DT[, lapply(.SD, mean), by = grp]
+
+# .SDcols — restrict .SD to specific columns
+DT[, lapply(.SD, sum), by = grp, .SDcols = c("val", "sq")]
+
+# .I — row indices of the original DT
+DT[, .I[which.max(val)], by = grp]   # index of max val per group
+
+# .GRP — integer group counter (1, 2, ...)
+DT[, grp_id := .GRP, by = grp]
+```
+
+---
+
+## Grouping (`by` / `keyby`)
+
+```r
+DT[, .(total = sum(val)), by = grp]          # groups appear in order of first appearance
+DT[, .(total = sum(val)), keyby = grp]       # result sorted by grp (sets key too)
+
+# Multi-column grouping
+DT[, .(n = .N), by = .(grp, flag)]
+
+# Expression in by
+DT[, .(mean_val = mean(val)), by = .(positive = val > 0)]
+```
+
+---
+
+## Keys and Indices
+
+```r
+# Set key — sorts DT in place, enables binary search
+setkey(DT, grp)
+key(DT)                        # check current key
+
+# Composite key
+setkey(DT, grp, id)
+
+# Secondary index (doesn't reorder rows; also auto-created by == / %in% subsets)
+setindex(DT, val)
+
+# Ad-hoc join / filter without setting key
+DT[.("a"), on = "grp"]         # rows where grp == "a"
+```
+
+---
+
+## Joins
+
+```r
+# Basic join — X[Y] — right join by default (all Y rows kept)
+X[Y, on = .(id)]               # Y's rows drive output
+
+# Left join
+Y[X, on = .(id)]
+
+# Inner join
+merge(X, Y, by = "id")         # uses merge.data.table, returns data.table
+
+# Full outer join
+merge(X, Y, by = "id", all = TRUE)
+
+# Nomatch — control unmatched behaviour
+X[Y, on = .(id), nomatch = NULL]   # inner join via [
+
+# Non-equi join
+X[Y, on = .(start <= date, end >= date)]
+
+# Rolling join — last observation carried forward
+setkey(prices, date)
+prices[trades, roll = TRUE, on = .(date)]   # x[i]: each trade (i) gets the last price at or before its date
+
+# roll = "nearest" for nearest-value join
+# roll = -Inf for next observation carried backward (NOCB)
+
+# Anti-join
+X[!Y, on = .(id)]
+```
+
+---
+
+## Reshaping
+
+```r
+# Wide → long (like tidyr::pivot_longer)
+long <- melt(DT,
+  id.vars       = c("id", "grp"),
+  measure.vars  = c("val", "sq"),
+  variable.name = "metric",
+  value.name    = "number")
+
+# Long → wide (like tidyr::pivot_wider)
+wide <- dcast(long,
+  id + grp ~ metric,
+  value.var = "number",
+  fun.aggregate = sum)   # if there are duplicates
+
+# value.var also accepts several column names to cast multiple value columns at once
+dcast(long, id ~ metric, value.var = "number")
+```
+
+---
+
+## Fast I/O: fread / fwrite
+
+```r
+# fread — fastest CSV reader (multi-threaded, auto-detects sep, header, types)
+DT <- fread("large.csv")
+DT <- fread("large.csv", select = c("id", "val"), nrows = 1e6)
+DT <- fread(cmd = "zcat large.csv.gz")   # read from a shell command's output
+
+# fwrite — fastest CSV writer
+fwrite(DT, "output.csv")
+fwrite(DT, "output.csv.gz", compress = "gzip")
+
+# Common options
+fread("data.csv",
+  na.strings = c("", "NA", "NULL"),
+  colClasses = list(character = "id"),
+  skip        = 2)
+```
+
+---
+
+## Chaining
+
+```r
+# Chain [] calls — left to right
+DT[val > 0][, .(mean_val = mean(val)), by = grp][order(-mean_val)]
+
+# Equivalent pipe style (needs R 4.3+, where `_` may head an extraction call)
+DT |> _[val > 0] |> _[, .(mean_val = mean(val)), by = grp] |> _[order(-mean_val)]
+# Note: the `_` placeholder syntax for `[` is awkward; plain [] chaining is cleaner
+```
+
+---
+
+## dtplyr — dplyr Syntax, data.table Speed
+
+```r
+library(dtplyr)
+
+# Wrap once; dplyr verbs generate data.table calls lazily
+lazy <- lazy_dt(DT)
+
+result <- lazy |>
+  filter(val > 0) |>
+  group_by(grp) |>
+  summarise(mean_val = mean(val)) |>
+  as.data.table()   # or collect() / as_tibble()
+
+# See generated code
+lazy |> filter(val > 0) |> show_query()
+```
+
+Use dtplyr when the team knows dplyr and the data is large enough to need data.table speed,
+but you don't want to rewrite pipelines. Accept some overhead (roughly 10-20%, workload-dependent) versus native data.table.
+
+---
+
+## dplyr ↔ data.table Translation
+
+| dplyr | data.table |
+|---|---|
+| `filter(DT, val > 0)` | `DT[val > 0]` |
+| `select(DT, id, val)` | `DT[, .(id, val)]` |
+| `mutate(DT, sq = val^2)` | `DT[, sq := val^2]` |
+| `summarise(DT, n = n())` | `DT[, .(n = .N)]` |
+| `group_by(DT, grp) \|> summarise(m = mean(val))` | `DT[, .(m = mean(val)), by = grp]` |
+| `arrange(DT, -val)` | `DT[order(-val)]` |
+| `left_join(X, Y, by = "id")` | `Y[X, on = .(id)]` |
+| `inner_join(X, Y, by = "id")` | `X[Y, on = .(id), nomatch = NULL]` |
+| `pivot_longer(...)` | `melt(DT, id.vars = ..., measure.vars = ...)` |
+| `pivot_wider(...)` | `dcast(DT, formula, value.var = ...)` |
+| `bind_rows(A, B)` | `rbindlist(list(A, B), fill = TRUE)` |
+| `bind_cols(A, B)` | `cbind(A, B)` |
+| `distinct(DT, grp)` | `unique(DT[, .(grp)])` |
+| `slice_max(DT, val, by = grp, with_ties = FALSE)` | `DT[DT[, .I[which.max(val)], by = grp]$V1]` |
+| `rename(DT, new = old)` | `setnames(DT, "old", "new")` |
+| `relocate` | `setcolorder(DT, c("id", ...))` |
+| `n_distinct(x)` | `uniqueN(x)` |
+| `count(DT, grp)` | `DT[, .N, by = grp]` |
+
+---
+
+## Performance Intuition
+
+```
+Operation            dplyr (tibble)    data.table     Speedup
+─────────────────────────────────────────────────────────────
+groupby sum, 100M    ~4.5 s            ~0.3 s          15×
+equi join, 10M×10M   ~8 s              ~0.5 s          16×
+CSV read, 1GB        ~12 s (readr)     ~2 s (fread)     6×
+melt, 50M rows       ~3 s (tidyr)      ~0.4 s           7×
+```
+
+Speedups vary with hardware, cardinality, and data shape — treat as order-of-magnitude guidance.
+Memory: data.table avoids intermediate copies; dplyr allocates at each verb.
+
+---
+
+## Gotchas — Reference Semantics Surprises
+
+### `:=` mutates the original, always
+
+```r
+DT2 <- DT          # NOT a copy — DT2 and DT point to same memory
+DT2[, new := 1]    # DT is also changed!
+
+# Fix: explicit copy
+DT2 <- copy(DT)
+DT2[, new := 1]    # DT unchanged
+```
+
+### Functions that modify via `:=` inside a function
+
+```r
+# This silently modifies the caller's DT
+bad <- function(dt) {
+  dt[, x := 1]   # mutates caller's object
+}
+
+# If mutation is intentional, document it clearly
+# If not, copy() at the top of the function
+safe <- function(dt) {
+  dt <- copy(dt)
+  dt[, x := 1]
+  dt
+}
+```
+
+### `:=` prints nothing at the console
+
+```r
+DT <- data.table(x = 1:3)
+DT[, y := x * 2]   # silent (by design — returns DT invisibly for chaining)
+# In interactive session: no output printed after :=
+# This is expected — use print(DT) or DT[] to force display
+```
+
+### `[` on a data.table column that is itself a list
+
+```r
+DT[, list_col]      # returns the list column as a list
+DT[, .(list_col)]   # returns a one-column data.table
+```
+
+### `by=` with a column name stored in a variable
+
+```r
+grp_var <- "grp"
+DT[, .N, by = grp_var]      # ambiguous: breaks as soon as DT has a column literally named grp_var
+DT[, .N, by = c(grp_var)]   # character vector form: unambiguous, prefer this
+```
+
+### `setkey` sorts in place — existing row order is gone
+
+```r
+setkey(DT, id)    # DT is now sorted by id; original order lost
+                  # run DT[, row_id := .I] before setkey if the original order matters
+```
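+
+One way to keep the original order recoverable is to store the row index before keying. This is a minimal sketch: the `row_id` column name and the toy data are illustrative, not part of any API.
+
+```r
+library(data.table)
+
+# Hypothetical toy table, deliberately unsorted by id
+DT <- data.table(id = c(3L, 1L, 2L), val = c("c", "a", "b"))
+
+DT[, row_id := .I]      # record each row's original position
+setkey(DT, id)          # sorts by id in place
+print(DT$val)           # "a" "b" "c"
+
+setorder(DT, row_id)    # restore the original order, also by reference
+print(DT$val)           # "c" "a" "b"
+```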
+
+### Subset with single column returns vector, not data.table
+
+```r
+DT[, val]          # vector
+DT[, .(val)]       # one-column data.table — use .() to keep as DT
+DT[, "val"]        # one-column data.table (character indexing returns DT)
+```
+
+---
+
+## When to Use data.table vs dplyr
+
+| Situation | Choose |
+|---|---|
+| > 1M rows, speed matters | data.table |
+| Memory constrained | data.table (no copy on mutate) |
+| Rolling / non-equi joins | data.table |
+| Team knows dplyr, data is large | dtplyr |
+| < 100k rows, readability priority | dplyr |
+| Rapid exploration / prototyping | dplyr |
+| Production pipeline, large data | data.table |
+| Mixed team, want both | dtplyr for reads, data.table for writes |

+ 584 - 0
skills/r-ops/references/import-io.md

@@ -0,0 +1,584 @@
+# R Data Import & I/O
+
+Comprehensive operational reference for getting data into and out of R. Covers flat files, spreadsheets, databases, columnar/big data, web sources, nested JSON, and R-native serialization.
+
+---
+
+## CSV & Delimited Files
+
+### readr (tidyverse standard)
+
+```r
+library(readr)
+
+# Basic read: prints the guessed column spec; suppress with show_col_types = FALSE
+df <- read_csv("data/sales.csv")
+
+# Explicit column types — always do this in production code
+df <- read_csv(
+  "data/sales.csv",
+  col_types = cols(
+    id        = col_integer(),
+    date      = col_date(format = "%Y-%m-%d"),
+    amount    = col_double(),
+    category  = col_character(),
+    flag      = col_logical()
+  ),
+  na = c("", "NA", "N/A", "NULL", "-"),
+  locale = locale(encoding = "UTF-8", decimal_mark = ".", grouping_mark = ",")
+)
+
+# Skip metadata rows at top
+df <- read_csv("data/report.csv", skip = 3, comment = "#")
+
+# No header row
+df <- read_csv("data/raw.csv", col_names = c("x", "y", "z"))
+
+# Semicolon-delimited (European locale where comma = decimal)
+df <- read_csv2("data/european.csv")          # ; delimited, , decimal
+df <- read_delim("data/file.psv", delim = "|")
+
+# Tab-separated
+df <- read_tsv("data/data.tsv")
+
+# Write back
+write_csv(df, "out/cleaned.csv")
+write_csv2(df, "out/european.csv")  # semicolon, European decimal
+```
+
+Key `col_types` shortcuts: `"icdc_l"` — one char per column (i=integer, c=character, d=double, l=logical, _=skip, D=date, T=datetime).
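+
+As a quick check of the compact form, the sketch below parses a small inline CSV. The data is hypothetical, and it assumes readr >= 2.0, where `I()` marks a string as literal data rather than a path.
+
+```r
+library(readr)
+
+# Hypothetical inline CSV; I() tells readr this is data, not a file path
+csv <- I("id,name,amount,flag\n1,a,2.5,TRUE\n2,b,3.0,FALSE\n")
+
+# One letter per column: i = integer, c = character, d = double, l = logical
+df <- read_csv(csv, col_types = "icdl")
+
+print(vapply(df, \(col) class(col)[1], character(1)))
+```
+
+The string must have one letter per column, so it suits narrow, stable files; `cols()` reads better once a file has many columns.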
+
+### data.table::fread — fastest option
+
+```r
+library(data.table)
+
+# Fastest CSV reader; auto-detects delimiter, header, and column types
+dt <- fread("data/large.csv")
+
+# Back to tibble for dplyr workflows
+df <- tibble::as_tibble(fread("data/large.csv"))
+
+# Select columns up front (avoids loading full file)
+dt <- fread("data/large.csv", select = c("id", "amount", "date"))
+
+# Thread count (default: half of the logical cores; check with getDTthreads())
+dt <- fread("data/huge.csv", nThread = 4)
+
+# Write — also fastest
+fwrite(dt, "out/output.csv")
+```
+
+Use `fread` when: file > 500 MB, speed matters, delimiter is uncertain.
+
+### vroom — lazy/streaming alternative
+
+```r
+library(vroom)
+
+# Lazy indexing — fast open, reads only requested columns into memory
+df <- vroom("data/large.csv", col_select = c(id, amount))
+
+# Multiple files in one call
+df <- vroom(list.files("data/monthly/", full.names = TRUE))
+```
+
+`vroom` wins for multi-file ingestion and column-subset workflows on very large files.
+
+### Multiple files — readr pattern
+
+```r
+# Read and bind many CSVs (readr >= 2.0)
+df <- read_csv(list.files("data/", pattern = "\\.csv$", full.names = TRUE),
+               id = "source_file")   # adds filename column
+```
+
+---
+
+## Excel
+
+### readxl (read)
+
+```r
+library(readxl)
+
+# Auto-detect xls vs xlsx
+df <- read_excel("data/report.xlsx")
+
+# Specific sheet by name or index
+df <- read_excel("data/report.xlsx", sheet = "Q2")
+df <- read_excel("data/report.xlsx", sheet = 2)
+
+# List all sheets
+excel_sheets("data/report.xlsx")
+
+# Cell range (Excel A1 notation)
+df <- read_excel("data/report.xlsx",
+                 sheet  = "Sales",
+                 range  = "B3:G50",
+                 col_names = c("region", "q1", "q2", "q3", "q4", "total"))
+
+# Column types: "skip", "guess", "logical", "numeric", "date", "text", "list"
+df <- read_excel("data/report.xlsx",
+                 col_types = c("numeric", "text", "date", "numeric"))
+
+# NA strings
+df <- read_excel("data/report.xlsx", na = c("", "N/A", "-"))
+```
+
+### writexl (write — no Java dependency)
+
+```r
+library(writexl)
+
+# Single sheet
+write_xlsx(df, "out/results.xlsx")
+
+# Multiple sheets from named list
+write_xlsx(list(summary = summary_df, detail = detail_df), "out/report.xlsx")
+```
+
+For heavy Excel work (formatting, formulas, styled output) use `openxlsx2` instead.
+
+---
+
+## Databases
+
+### DBI + dbplyr (write dplyr, get SQL)
+
+```r
+library(DBI)
+library(dbplyr)
+library(dplyr)
+
+# --- Connect ---
+# PostgreSQL
+con <- dbConnect(RPostgres::Postgres(),
+                 host     = "db.example.com",
+                 port     = 5432,
+                 dbname   = "analytics",
+                 user     = Sys.getenv("DB_USER"),
+                 password = Sys.getenv("DB_PASS"))
+
+# SQLite
+con <- dbConnect(RSQLite::SQLite(), "local.sqlite")
+
+# DuckDB (in-process, no server needed)
+con <- dbConnect(duckdb::duckdb(), dbdir = "project.duckdb")
+# Ephemeral (disappears on session end)
+con <- dbConnect(duckdb::duckdb())
+
+# --- Inspect ---
+dbListTables(con)
+dbListFields(con, "orders")
+
+# --- Reference a table (lazy — no data fetched yet) ---
+orders_db <- tbl(con, "orders")
+
+# Write dplyr; dbplyr translates to SQL
+result <- orders_db |>
+  filter(year == 2024, status == "shipped") |>
+  group_by(region) |>
+  summarise(total = sum(amount, na.rm = TRUE)) |>
+  arrange(desc(total))
+
+# See the SQL dbplyr will run
+show_query(result)
+
+# Fetch data into R
+df <- collect(result)
+
+# --- Write to database ---
+dbWriteTable(con, "clean_orders", df, overwrite = TRUE)
+
+# --- Raw SQL when needed ---
+df <- dbGetQuery(con, "SELECT * FROM orders WHERE amount > 10000")
+
+# --- Always disconnect ---
+dbDisconnect(con)
+```
+
+Backend packages by DBMS:
+
+| DBMS | Package |
+|---|---|
+| PostgreSQL | `RPostgres` |
+| MySQL / MariaDB | `RMariaDB` |
+| SQLite | `RSQLite` |
+| SQL Server | `odbc` + ODBC driver |
+| BigQuery | `bigrquery` |
+| DuckDB | `duckdb` |
+| Snowflake | `odbc` + ODBC driver |
+
+Never hardcode passwords in source files; use `Sys.getenv()` or `keyring::key_get()`.
+
+---
+
+## Big & Columnar Data — Arrow + DuckDB
+
+### Arrow (Parquet, larger-than-memory datasets)
+
+```r
+library(arrow)
+library(dplyr)
+
+# --- Read single Parquet file ---
+df <- read_parquet("data/orders.parquet")
+
+# Read selected columns only
+df <- read_parquet("data/orders.parquet", col_select = c("id", "amount", "date"))
+
+# --- Write Parquet ---
+write_parquet(df, "out/orders.parquet")
+
+# --- Open a multi-file dataset (Hive-partitioned or flat directory) ---
+ds <- open_dataset("data/checkouts/")        # auto-detects partitioning
+ds <- open_dataset("data/checkouts/", format = "parquet")
+ds <- open_dataset("data/checkouts.csv", format = "csv")
+
+# Lazy dplyr pipeline — nothing loaded yet
+result <- ds |>
+  filter(year == 2023) |>
+  group_by(category) |>
+  summarise(n = n(), total = sum(amount)) |>
+  collect()   # <-- triggers computation, loads into memory
+
+# --- Write partitioned dataset (creates directory structure) ---
+df |>
+  group_by(year) |>
+  write_dataset("data/out/", format = "parquet")
+```
+
+Parquet vs CSV: ~2-3x smaller on disk, typed, column-oriented — always prefer it for persistent analytical data.
+
+### DuckDB for SQL-style big data
+
+```r
+library(duckdb)
+library(DBI)
+
+con <- dbConnect(duckdb::duckdb())
+
+# Query Parquet directly — no loading into R
+df <- dbGetQuery(con, "SELECT year, SUM(amount) FROM 'data/*.parquet' GROUP BY year")
+
+# Load CSV directly into DuckDB (faster than R round-trip)
+duckdb_read_csv(con, "raw", "data/large.csv")
+
+# Arrow <-> DuckDB bridge (zero-copy)
+library(arrow)
+ds <- open_dataset("data/orders/")
+result <- ds |>
+  to_duckdb() |>          # hand off to DuckDB engine
+  filter(amount > 1000) |>
+  collect()
+
+dbDisconnect(con, shutdown = TRUE)
+```
+
+Rule of thumb: Arrow for Parquet/file-based workflows; DuckDB for complex SQL, joins across multiple sources, or when you need window functions on large data.
+
+---
+
+## Web: Scraping & APIs
+
+### rvest (HTML scraping)
+
+```r
+library(rvest)
+
+page <- read_html("https://example.com/table-page")
+
+# CSS selectors
+headings <- page |> html_elements("h2") |> html_text2()
+links     <- page |> html_elements("a") |> html_attr("href")
+prices    <- page |> html_elements(".price") |> html_text2()
+title     <- page |> html_element("#main-title") |> html_text2()
+
+# HTML tables — returns a list of data frames
+tables <- page |> html_table()
+df     <- tables[[1]]   # first table
+
+# Navigate structure
+rows <- page |>
+  html_elements("table.results tr") |>
+  html_elements("td") |>
+  html_text2()
+
+# Polite scraping: respect robots.txt, cache, rate-limit
+# install.packages("polite")
+library(polite)
+session <- bow("https://example.com", force = TRUE)
+page    <- scrape(session)
+```
+
+SelectorGadget browser extension is the fastest way to find CSS selectors for a target page.
+
+### httr2 (HTTP APIs)
+
+```r
+library(httr2)
+
+resp <- request("https://api.example.com/v2/orders") |>
+  req_auth_bearer_token(Sys.getenv("API_TOKEN")) |>
+  req_url_query(limit = 100, status = "shipped") |>
+  req_retry(max_tries = 3) |>
+  req_perform()
+
+# Parse JSON response body
+data <- resp |> resp_body_json()
+
+# Pagination helper
+resps <- request("https://api.example.com/items") |>
+  req_perform_iterative(
+    iterate_with_offset("page", start = 1),
+    max_reqs = 20
+  )
+```
+
+### jsonlite (JSON <-> R)
+
+```r
+library(jsonlite)
+
+# Parse JSON string or file
+obj  <- fromJSON('{"name":"Alice","scores":[1,2,3]}')
+obj  <- fromJSON("data/payload.json")
+obj  <- fromJSON("https://api.example.com/data")  # direct URL
+
+# simplifyVector = TRUE (default) auto-converts arrays to vectors/data frames
+df   <- fromJSON("data/records.json")   # works when top-level is array of objects
+
+# Serialize R object to JSON
+json <- toJSON(df, pretty = TRUE, auto_unbox = TRUE)
+write(json, "out/result.json")
+```
+
+---
+
+## Rectangling: Nested JSON/Lists into Tibbles
+
+The tidyr trio for flattening hierarchical list-columns:
+
+```r
+library(tidyr)
+library(dplyr)
+library(jsonlite)
+
+# Source: JSON with nested structure
+raw <- fromJSON("data/api_response.json", simplifyVector = FALSE)
+df  <- tibble(record = raw)
+
+# unnest_wider: named list → one column per name (parallel expansion)
+df |> unnest_wider(record)
+
+# unnest_longer: unnamed list / array → one row per element (sequential expansion)
+df |> unnest_longer(record)
+
+# Combine for multi-level nesting
+df |>
+  unnest_wider(record) |>
+  unnest_longer(items) |>
+  unnest_wider(items, names_sep = "_")   # names_sep avoids collision
+
+# hoist: pull specific fields from deep nesting without full unnest
+df |>
+  hoist(record,
+        order_id  = "id",
+        city      = list("address", "city"),
+        zip       = list("address", "zip"))
+```
+
+`names_sep = "_"` in `unnest_wider` prefixes child column names with the parent name — avoids collisions when siblings share field names.
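+
+To make the multi-level pattern concrete, here is a minimal sketch on an in-memory list shaped like the output of `fromJSON(..., simplifyVector = FALSE)`. The records, field names, and values are all hypothetical.
+
+```r
+library(tidyr)
+library(tibble)
+
+# Hypothetical records: one nested object (address) and one array (items) each
+raw <- list(
+  list(id = 1L, address = list(city = "Oslo", zip = "0150"),
+       items = list(list(sku = "A", qty = 2L), list(sku = "B", qty = 1L))),
+  list(id = 2L, address = list(city = "Bergen", zip = "5003"),
+       items = list(list(sku = "C", qty = 5L)))
+)
+df <- tibble(record = raw)
+
+flat <- df |>
+  unnest_wider(record) |>                  # id, address, items become columns
+  hoist(address, city = "city") |>         # pull one field out of the nested object
+  unnest_longer(items) |>                  # one row per array element
+  unnest_wider(items, names_sep = "_")     # items_sku, items_qty
+
+print(flat[, c("id", "city", "items_sku", "items_qty")])
+```
+
+The result has three rows, one per item, with the order-level fields repeated on each.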
+
+---
+
+## R-Native Serialization
+
+### RDS — single object, compact
+
+```r
+# Save any R object (model, list, data frame, environment...)
+saveRDS(df, "cache/model.rds")
+df <- readRDS("cache/model.rds")
+
+# Compress: "gzip" (default), "bzip2", "xz" — xz smallest, slowest
+saveRDS(df, "cache/model.rds", compress = "xz")
+```
+
+RDS preserves all R types exactly (factors, dates, custom classes). Not portable outside R.
+
+### qs — fast RDS alternative
+
+```r
+library(qs)
+
+# 3-10x faster than saveRDS, similar compression
+qs::qsave(df, "cache/data.qs")
+df <- qs::qread("cache/data.qs")
+
+# Parallel compression (preset: "fast", "balanced", "high")
+qs::qsave(df, "cache/data.qs", preset = "balanced", nthreads = 4)
+```
+
+Use `qs` over RDS whenever object is > 50 MB and read/write speed matters.
+
+### Base R (avoid for new code)
+
+```r
+save(df1, df2, file = "workspace.RData")   # saves multiple objects by name
+load("workspace.RData")                     # restores into current env — fragile
+
+# Prefer saveRDS/readRDS: explicit, one object, no name injection
+```
+
+---
+
+## Which Reader? Decision Table
+
+| Data source | Size | Recommended | Alternative |
+|---|---|---|---|
+| CSV, known schema | Any | `readr::read_csv` + `col_types` | — |
+| CSV, unknown schema / huge | > 500 MB | `data.table::fread` | `vroom` |
+| Multiple CSVs, column subset | Any | `vroom` | `readr::read_csv(files)` |
+| Excel .xlsx/.xls | Any | `readxl::read_excel` | — |
+| Excel write | Any | `writexl::write_xlsx` | `openxlsx2` (formatting) |
+| SQL database | Any | `DBI` + `dbplyr` | `DBI::dbGetQuery` (raw SQL) |
+| Parquet / Arrow dataset | Any | `arrow::open_dataset` + `collect()` | — |
+| Very large Parquet + SQL | > RAM | `duckdb` + Arrow bridge | — |
+| HTML scraping | — | `rvest` + `polite` | — |
+| REST API | — | `httr2` | — |
+| JSON → tibble | Any | `jsonlite::fromJSON` + `tidyr::unnest_*` | — |
+| R objects (persist) | < 50 MB | `saveRDS` / `readRDS` | — |
+| R objects (persist, fast) | > 50 MB | `qs::qsave` / `qs::qread` | — |
+
+---
+
+## Gotchas
+
+### stringsAsFactors now defaults to FALSE (R >= 4.0)
+
+`base::read.csv()` used to silently convert character columns to factors (`stringsAsFactors = TRUE` was the default before R 4.0.0). **Since R 4.0, the default is `FALSE`**. Old Stack Overflow answers warning you to set `stringsAsFactors = FALSE` are stale. `readr::read_csv` never converted to factors — it always returned character columns as character.
+
+```r
+# Modern base R — no factor surprise
+df <- read.csv("data/file.csv")           # stringsAsFactors = FALSE since R 4.0
+
+# Explicit factor conversion when you actually want factors
+df$category <- factor(df$category)
+df$category <- factor(df$category, levels = c("low", "med", "high"))
+```
+
+### Encoding
+
+```r
+# readr: specify encoding explicitly for non-UTF-8 files
+df <- read_csv("data/legacy.csv", locale = locale(encoding = "latin1"))
+# or
+df <- read_csv("data/file.csv", locale = locale(encoding = "Windows-1252"))
+
+# Detect encoding first
+readr::guess_encoding("data/legacy.csv")
+```
+
+Always write new files as UTF-8. If a downstream system requires a specific encoding, convert at the write step, not the read step.
+
+### Windows path separators
+
+```r
+# Forward slashes work on Windows in R — use them
+df <- read_csv("C:/Users/me/data/file.csv")   # fine
+df <- read_csv("C:\\Users\\me\\data\\file.csv") # also fine but ugly
+
+# file.path() is OS-agnostic and preferred
+path <- file.path("data", "subdir", "file.csv")
+
+# here::here() for project-relative paths (never setwd())
+library(here)
+df <- read_csv(here("data", "file.csv"))
+```
+
+### Column type guessing pitfalls
+
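+`readr` guesses column types from at most 1000 rows by default (`guess_max`). If a column looks numeric in the sampled rows but holds text later, the non-conforming values become `NA` with only a parsing warning; inspect them with `problems(df)`.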
+
+```r
+# Increase guess range or specify types explicitly
+df <- read_csv("data/file.csv", guess_max = 10000)
+
+# Better: specify col_types for any column you care about
+df <- read_csv("data/file.csv",
+               col_types = cols(amount = col_double(),
+                                .default = col_guess()))
+```
+
+### NA handling
+
+```r
+# readr defaults: "" and "NA" are read as NA. Extend as needed.
+read_csv("f.csv", na = c("", "NA", "N/A", "NULL", "none", "-", "."))
+
+# readxl default: only blank cells are NA. Same extension pattern.
+read_excel("f.xlsx", na = c("", "N/A", "-"))
+```
+
+### DBI credentials
+
+Never hardcode passwords. Use environment variables, `.Renviron`, or `keyring`:
+
+```r
+# .Renviron (per-project or user-level)
+# DB_PASS=secret
+
+con <- dbConnect(RPostgres::Postgres(),
+                 password = Sys.getenv("DB_PASS"))
+
+# Or keyring
+con <- dbConnect(RPostgres::Postgres(),
+                 password = keyring::key_get("mydb", "username"))
+```
+
+### collect() — don't forget it
+
+`tbl()` and `open_dataset()` return lazy objects. Without `collect()` you have a query plan, not data.
+
+```r
+# This does nothing — just a lazy reference
+orders_db <- tbl(con, "orders") |> filter(year == 2024)
+
+# This fetches data
+df <- orders_db |> collect()
+```
+
+### Parquet partition columns
+
+When writing partitioned Parquet with `write_dataset()`, the partition column is stored in the directory name, not the file. Arrow re-attaches it on `open_dataset()`. If you convert to a plain data frame first and then write, the column is preserved in the file — choose based on downstream needs.
+
+---
+
+## Quick Install Reference
+
+```r
+# Core I/O stack
+install.packages(c(
+  "readr",        # CSV/delimited (tidyverse core)
+  "data.table",   # fread/fwrite — fastest CSV
+  "vroom",        # multi-file / lazy CSV
+  "readxl",       # Excel read (no Java)
+  "writexl",      # Excel write (no Java)
+  "DBI",          # database interface
+  "dbplyr",       # dplyr → SQL translation
+  "RPostgres",    # PostgreSQL backend
+  "RSQLite",      # SQLite backend
+  "duckdb",       # DuckDB backend
+  "arrow",        # Parquet + datasets
+  "rvest",        # HTML scraping
+  "httr2",        # HTTP / REST APIs
+  "jsonlite",     # JSON parsing
+  "tidyr",        # unnest_wider/longer for rectangling
+  "qs",           # fast RDS alternative
+  "polite",       # ethical scraping (rate-limit + cache)
+  "here",         # project-relative paths
+  "janitor"       # clean_names() for messy headers
+))
+```

+ 436 - 0
skills/r-ops/references/iteration-functional.md

@@ -0,0 +1,436 @@
+# Iteration & Functional Programming in R
+
+Modern R iteration is mostly implicit — vectorised ops, `across()`, and purrr's
+`map()` family replace explicit loops in nearly every data-science context. This
+reference covers the full stack: writing reusable functions, column-wise
+iteration with `across()`, list iteration with purrr, and list-columns for
+model-per-group workflows.
+
+---
+
+## Writing Functions
+
+### Rule of three
+
+Extract a function when you've written the same logic three times. Two copies
+are tolerable; three means a function.
+
+```r
+# Pattern spotted 3× → extract
+rescale01 <- function(x) {
+  rng <- range(x, na.rm = TRUE, finite = TRUE)
+  (x - rng[1]) / (rng[2] - rng[1])
+}
+```
+
+### Argument conventions
+
+| Convention | Rationale |
+|---|---|
+| Data first (`df`, `x`) | Enables pipe chaining |
+| Logical flags default `FALSE` | Opt-in behaviour is safer |
+| `na.rm = FALSE` to match base | Users expect base semantics |
+| `...` to pass through to inner calls | Avoids re-specifying every arg |
+
+```r
+# Match base semantics: na.rm defaults to FALSE and is forwarded explicitly
+cv <- function(x, na.rm = FALSE) {
+  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
+}
+
+# Using ... for flexible pass-through
+my_read <- function(path, ...) {
+  readr::read_csv(path, show_col_types = FALSE, ...)
+}
+```
+
+### Early return
+
+Prefer explicit early return over nested `if`/`else` for guard clauses.
+
+```r
+process <- function(x) {
+  if (length(x) == 0) return(NULL)
+  if (all(is.na(x))) return(NA_real_)
+  mean(x, na.rm = TRUE)
+}
+```
+
+### Data-masking and `{{ }}` embracing
+
+Functions that call dplyr verbs using column-name arguments need **embracing**.
+Without `{{ }}`, dplyr interprets the argument name literally instead of
+looking up what it contains.
+
+```r
+# WRONG — group_by sees "group_var" not what group_var holds
+grouped_mean <- function(df, group_var, mean_var) {
+  df |> group_by(group_var) |> summarize(mean(mean_var))
+}
+
+# CORRECT — {{ }} tells dplyr to look inside the argument
+grouped_mean <- function(df, group_var, mean_var) {
+  df |>
+    group_by({{ group_var }}) |>
+    summarize(mean = mean({{ mean_var }}, na.rm = TRUE), .groups = "drop")
+}
+
+diamonds |> grouped_mean(cut, carat)
+```
+
+**When to embrace:** check the docs for the two tidy-evaluation subtypes:
+
+- **Data-masking** (`arrange`, `filter`, `mutate`, `summarize`) → embrace with `{{ }}`
+- **Tidy-selection** (`select`, `relocate`, `rename`, `across`) → embrace with `{{ }}`; for
+  multi-column tidy-select args passed to data-masking verbs, use `pick({{ var }})`
+
+```r
+# pick() bridges tidy-selection into data-masking context
+count_missing <- function(df, group_vars, x_var) {
+  df |>
+    group_by(pick({{ group_vars }})) |>
+    summarize(n_miss = sum(is.na({{ x_var }})), .groups = "drop")
+}
+
+flights |> count_missing(c(year, month, day), dep_time)
+```
+
+```r
+# across() inside a function — embrace the column-selector argument
+summarize_means <- function(df, summary_vars = where(is.numeric)) {
+  df |>
+    summarize(
+      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
+      n = n(),
+      .groups = "drop"
+    )
+}
+
+diamonds |> group_by(cut) |> summarize_means()
+diamonds |> group_by(cut) |> summarize_means(c(carat, x:z))
+```
+
+---
+
+## Column-wise Iteration with `across()`
+
+### Core usage
+
+```r
+# Single function — pass without ()
+df |> summarize(across(a:d, median))
+
+# Anonymous function — use \(x) shorthand (R 4.1+)
+df |> summarize(across(a:d, \(x) median(x, na.rm = TRUE)))
+
+# Multiple functions — named list, output named {.col}_{.fn}
+df |> summarize(
+  across(a:d, list(
+    med  = \(x) median(x, na.rm = TRUE),
+    miss = \(x) sum(is.na(x))
+  ))
+)
+
+# Custom name template
+df |> summarize(
+  across(a:d, list(med = \(x) median(x, na.rm = TRUE)),
+         .names = "{.fn}_{.col}")
+)
+```
+
+### Column selectors for `.cols`
+
+```r
+across(everything())              # all non-grouping columns
+across(where(is.numeric))        # type predicate
+across(starts_with("val_"))      # name pattern
+across(c(a, b, x:z))             # explicit set
+across(!where(is.character))     # negation
+```
+
+### `mutate()` with `across()`
+
+By default output columns **replace** inputs. Use `.names` to add new cols.
+
+```r
+# Replace in place (coerce NA → 0)
+df |> mutate(across(a:d, \(x) coalesce(x, 0)))
+
+# Preserve originals, add suffixed cols
+df |> mutate(across(a:d, \(x) coalesce(x, 0), .names = "{.col}_filled"))
+```
+
+### Filtering variants
+
+`across()` is awkward in `filter()`. Use dedicated helpers instead.
+
+```r
+df |> filter(if_any(a:d, is.na))   # at least one NA
+df |> filter(if_all(a:d, is.na))   # all NA
+```
+
+### `across()` vs `pivot_longer()` for grouped column ops
+
+When you need to operate on **pairs** of columns simultaneously (e.g., a value
+column plus its weight column), `across()` cannot express this. Pivot first.
+
+```r
+df_paired |>
+  pivot_longer(everything(),
+               names_to  = c("group", ".value"),
+               names_sep = "_") |>
+  group_by(group) |>
+  summarize(mean = weighted.mean(val, wts))
+```
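+
+For a runnable version of the paired-column pattern, the sketch below assumes a wide frame whose columns are named `<group>_val` and `<group>_wts`. The data is hypothetical.
+
+```r
+library(dplyr)
+library(tidyr)
+
+# Hypothetical wide frame: each value column is paired with a weight column
+df_paired <- tibble(
+  a_val = c(1, 2, 3),    a_wts = c(1, 1, 2),
+  b_val = c(10, 20, 30), b_wts = c(3, 1, 0)
+)
+
+res <- df_paired |>
+  pivot_longer(everything(),
+               names_to  = c("group", ".value"),   # prefix -> group, suffix -> column name
+               names_sep = "_") |>
+  group_by(group) |>
+  summarize(mean = weighted.mean(val, wts), .groups = "drop")
+
+print(res)   # a: 2.25, b: 12.5
+```
+
+The `.value` sentinel is what keeps `val` and `wts` as separate columns after pivoting, so each value stays next to its weight.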
+
+---
+
+## purrr Map Family
+
+### Anonymous function syntax (R 4.1+)
+
+```r
+# Preferred: base backslash lambda
+\(x) x + 1
+
+# Old tidyverse-only shorthand (still works but avoid in new code)
+~ .x + 1
+```
+
+### `map()` and type-stable variants
+
+`map()` always returns a list. Use typed variants for atomic output — they
+fail loudly if the return type doesn't match, which catches bugs early.
+
+```r
+map(x, f)          # → list
+map_lgl(x, f)      # → logical vector
+map_int(x, f)      # → integer vector
+map_dbl(x, f)      # → double vector
+map_chr(x, f)      # → character vector
+map_vec(x, f)      # → vector of the outputs' common type (also dates, factors)
+```
+
+```r
+# Practical examples
+files <- map(paths, readr::read_csv)          # list of data frames
+medians <- map_dbl(df, \(col) median(col, na.rm = TRUE))
+col_types <- map_chr(df, \(col) class(col)[1])
+n_missing <- map_int(df, \(col) sum(is.na(col)))
+```
+
+### Multi-input variants
+
+```r
+# map2: two parallel inputs
+map2(xs, ys, f)           # f(xs[[i]], ys[[i]])
+walk2(xs, ys, f)          # same but discard output (side effects)
+
+# pmap: arbitrary number of inputs via list
+pmap(list(a = xs, b = ys, c = zs), f)
+
+# imap: index + value
+imap(x, \(val, idx) paste(idx, val))   # idx is name or position
+```
+
+```r
+# walk2 for saving multiple files
+walk2(by_clarity$data, by_clarity$path, write_csv)
+
+# walk2 for saving multiple plots
+walk2(
+  by_clarity$path,
+  by_clarity$plot,
+  \(path, plot) ggsave(path, plot, width = 6, height = 6)
+)
+```
+
+### Combining list of data frames
+
+```r
+# CURRENT — list_rbind / list_cbind
+map(paths, read_csv) |> list_rbind()
+map(paths, read_csv) |> list_cbind()
+
+# SUPERSEDED — avoid in new code
+map_dfr(paths, read_csv)   # was bind_rows(map(...))
+map_dfc(paths, read_csv)   # was bind_cols(map(...))
+```
+
+### Carrying filename metadata into the combined frame
+
+```r
+paths |>
+  set_names(basename) |>          # names carry through map()
+  map(readxl::read_excel) |>
+  list_rbind(names_to = "file") |>
+  mutate(year = readr::parse_number(file))
+```
+
+### Error handling with `possibly()`
+
+`map()` fails entirely on the first error. `possibly()` wraps a function to
+return a sentinel value instead of throwing.
+
+```r
+safe_read <- possibly(\(path) readxl::read_excel(path), otherwise = NULL)
+
+files  <- map(paths, safe_read)
+data   <- list_rbind(files)              # list_rbind silently drops NULLs
+
+failed <- map_vec(files, is.null)
+paths[failed]                            # inspect which paths failed
+```
+
+### `reduce()` and `accumulate()`
+
+```r
+# reduce: fold list into single value
+reduce(list(df1, df2, df3), dplyr::left_join, by = "id")
+reduce(1:5, `+`)                          # → 15
+
+# accumulate: keep intermediate values
+accumulate(1:5, `+`)                      # → c(1, 3, 6, 10, 15)
+```
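+
+As a worked instance of the join fold, the sketch below reduces three small tibbles with `left_join()`. The tables are hypothetical, and a lambda is used so the `by` argument is explicit.
+
+```r
+library(purrr)
+library(dplyr)
+
+# Hypothetical tables sharing an id key
+df1 <- tibble(id = 1:3, name = c("x", "y", "z"))
+df2 <- tibble(id = c(1L, 3L), active = c(TRUE, FALSE))
+df3 <- tibble(id = 2:3, score = c(0.5, 1.5))
+
+# Fold left to right: (df1 left_join df2) left_join df3
+joined <- reduce(
+  list(df1, df2, df3),
+  \(acc, nxt) left_join(acc, nxt, by = "id")
+)
+
+print(joined)   # 3 rows; unmatched ids get NA in active / score
+```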
+
+---
+
+## List-Columns and Model-per-Group
+
+Nest → model → unnest is the canonical workflow for fitting many models.
+
+```r
+library(tidyverse)
+
+nested <- mtcars |>
+  group_by(cyl) |>
+  nest()
+
+# Fit a model per group
+nested <- nested |>
+  mutate(
+    model  = map(data, \(df) lm(mpg ~ wt, data = df)),
+    tidy   = map(model, broom::tidy),
+    glance = map(model, broom::glance)
+  )
+
+# Extract tidy coefficients
+nested |>
+  select(cyl, tidy) |>
+  unnest(tidy)
+
+# Extract model-level stats
+nested |>
+  select(cyl, glance) |>
+  unnest(glance)
+```
+
+```r
+# Inspect list-column structure safely
+df_types <- function(df) {
+  tibble(
+    col_name = names(df),
+    col_type = map_chr(df, \(x) class(x)[1]),
+    n_miss   = map_int(df, \(x) sum(is.na(x)))
+  )
+}
+```
+
+---
+
+## Base R ↔ purrr Translation
+
+| Base R | purrr equivalent | Notes |
+|---|---|---|
+| `lapply(x, f)` | `map(x, f)` | Same result; purrr adds extractor shortcuts such as `map(x, "name")` |
+| `sapply(x, f)` | `map_vec(x, f)` | `sapply` silently simplifies — type unstable. Avoid. |
+| `vapply(x, f, numeric(1))` | `map_dbl(x, f)` | Both are type-stable; purrr is terser |
+| `mapply(f, x, y)` | `map2(x, y, f)` | `mapply` simplifies by default (`SIMPLIFY = TRUE`); `map2` always returns a list |
+| `Map(f, x, y)` | `map2(x, y, f)` | `Map` returns a list like `map2` |
+| `apply(m, 1, f)` | `apply(m, 1, f)` | Row-wise on matrix — no purrr equivalent; keep base |
+| `apply(m, 2, f)` | `map(as.list(df), f)` or `across()` | Column-wise on data frame → use `across()` |
+| `Reduce(f, x)` | `reduce(x, f)` | Base `init`/`right` become `.init`/`.dir`; `accumulate = TRUE` becomes `accumulate()` |
+| `Filter(pred, x)` | `keep(x, pred)` | `discard(x, pred)` for the inverse |
+| `Find(pred, x)` | `detect(x, pred)` | Returns first match |
+| `Position(pred, x)` | `detect_index(x, pred)` | Returns position of first match |
+
+### When to prefer base
+
+- **Package code with no tidyverse dependency** — `lapply`/`vapply` add zero imports
+- **Matrix row/column ops** — `apply(m, 1, f)` has no clean purrr equivalent
+- **Simple single-function map, no lambda needed** — `lapply(x, sum)` is fine
+
+---
+
+## Iterating Over Files
+
+```r
+# Pattern: list → map → combine
+paths <- list.files("data/", pattern = "[.]csv$", full.names = TRUE)
+
+data <- paths |>
+  map(readr::read_csv, show_col_types = FALSE) |>
+  list_rbind()
+
+# With per-step transformations (prefer multiple simple maps over one complex fn)
+data <- paths |>
+  map(readr::read_csv, show_col_types = FALSE) |>
+  map(\(df) filter(df, !is.na(id))) |>
+  map(\(df) mutate(df, id = tolower(id))) |>
+  list_rbind()
+
+# Even better — bind first, then dplyr on the full frame
+data <- paths |>
+  map(readr::read_csv, show_col_types = FALSE) |>
+  list_rbind() |>
+  filter(!is.na(id)) |>
+  mutate(id = tolower(id))
+```
+
+---
+
+## Gotchas
+
+**`sapply` is type-unstable.** It returns different types depending on the
+result — a vector, a matrix, or a list. In scripts this is fine; in functions
+it makes behaviour unpredictable. Use `map_dbl`/`map_chr`/`map_vec` instead.
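+
+A minimal sketch of the instability, assuming purrr is attached; the `square`
+helper is only an illustration. With empty input `sapply()` falls back to a
+list, while the typed alternatives keep returning a numeric vector:
+
+```r
+library(purrr)
+
+square <- \(i) i^2                      # illustrative function
+
+sapply(1:3, square)                     # numeric vector
+sapply(integer(0), square)              # list() for empty input: the type changed
+
+map_dbl(integer(0), square)             # numeric(0)
+vapply(integer(0), square, numeric(1))  # numeric(0)
+```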
+
+**`walk` for side effects.** Any call whose purpose is writing to disk,
+printing, or appending to a DB belongs in `walk`/`walk2`, not `map`. Using
+`map` for side effects silently accumulates a large list of return values.
+
+```r
+walk(paths, \(p) append_file(p))    # not map() — we don't need the return value
+```
+
+**`map2` vs `pmap` arg matching.** `map2(x, y, f)` passes positional args
+`f(x[[i]], y[[i]])`. With `pmap`, names in the list must match the function's
+argument names — use a named list to be explicit.
+
+```r
+args <- list(mean = c(0, 1, 2), sd = c(1, 2, 3), n = c(10, 10, 10))
+pmap(args, rnorm)    # names match rnorm's formal arguments
+```
+
+**`{{ }}` scope.** Embracing works only inside a function, on an argument that
+is forwarded to a tidy-eval (data-masking or tidy-select) verb. It has no
+effect in base R or other non-tidy functions, and it does nothing outside a
+function body.
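+
+As a sketch of where embracing does apply, the helper below forwards two bare
+column names to `summarize()`. The name `mean_by` and its arguments are
+hypothetical, and `.by` assumes dplyr 1.1.0 or later:
+
+```r
+library(dplyr)
+
+# Hypothetical helper: `group` and `value` are bare column names from the caller
+mean_by <- function(data, group, value) {
+  data |>
+    summarize(mean = mean({{ value }}, na.rm = TRUE), .by = {{ group }})
+}
+
+mean_by(mtcars, cyl, mpg)   # one row per distinct cyl value
+```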
+
+**`map_dfr`/`map_dfc` are superseded.** They still work but are no longer
+recommended. Use `map() |> list_rbind()` / `list_cbind()` — more composable
+and the name-carrying behaviour of `list_rbind(names_to=)` replaces the old
+`.id` argument.
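+
+A small illustration of the replacement idiom, assuming purrr 1.0 or later; the
+`frames` list is made up for the example:
+
+```r
+library(purrr)
+
+# Illustrative input: a named list of small data frames
+frames <- list(
+  a = data.frame(x = 1:2),
+  b = data.frame(x = 3L)
+)
+
+combined <- list_rbind(frames, names_to = "source")
+combined   # columns: source (taken from the list names) and x
+```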
+
+**`across()` replaces columns by default.** Inside `mutate()`, the output
+names match the input names unless you set `.names`. Always set `.names` when
+you want to add columns alongside originals.
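+
+For instance, a sketch on the built-in `mtcars` data; the `_scaled` suffix is
+an arbitrary choice for the illustration:
+
+```r
+library(dplyr)
+
+scaled <- mtcars |>
+  mutate(across(c(mpg, wt), \(x) x / max(x), .names = "{.col}_scaled"))
+
+names(scaled)   # the original 11 columns plus mpg_scaled and wt_scaled
+```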
+
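+**Grouped summarize message.** `summarize()` after `group_by()` on two or more
+variables emits a message about the remaining grouping unless you set
+`.groups` (`"drop"`, `"drop_last"`, `"keep"`, or `"rowwise"`). Set it explicitly
+in production code, or group per call with `.by`, which always returns
+ungrouped output.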
+
+```r
+df |>
+  group_by(cyl) |>
+  summarize(mean_mpg = mean(mpg), .groups = "drop")
+```

+ 399 - 0
skills/r-ops/references/modeling-stats.md

@@ -0,0 +1,399 @@
+# Statistics & Modeling in R
+
+From base inferential tests through tidymodels. Use base R lm/glm for
+straightforward regression; reach for tidymodels when you need CV,
+hyperparameter tuning, or multiple competing model types.
+
+---
+
+## 1. Inferential Tests (base R)
+
+### Reading a p-value
+
+`p` = P(observing data this extreme | null is true). It is **not**
+P(null is true | data). α = 0.05 is convention, not law. Report effect
+sizes and confidence intervals alongside p-values.
+
+### Normality check
+
+```r
+shapiro.test(x)          # H0: data are normal; requires 3 <= n <= 5000
+qqnorm(x); qqline(x)     # Q-Q plot: fat tails / skew visible at a glance
+```
+
+Shapiro-Wilk loses power at small n and is over-powered at large n —
+always pair it with a Q-Q plot.
+
+### One- and two-sample t-tests
+
+```r
+# One-sample
+t.test(x, mu = 0)
+
+# Two-sample unpaired (Welch by default — no equal-variance assumption)
+t.test(y ~ group, data = df)
+t.test(a, b)                    # same thing, vectors
+
+# Paired (measurements are linked row-by-row)
+t.test(before, after, paired = TRUE)
+
+# One-sided
+t.test(x, mu = 0, alternative = "greater")
+```
+
+`paired = TRUE` matters: paired reduces noise by removing between-subject
+variance. Using unpaired on paired data inflates the SE and loses power.
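+
+A quick way to see the effect, using the `sleep` dataset that ships with R
+(10 subjects, each measured under two drugs). The variable names are
+illustrative:
+
+```r
+drug1 <- sleep$extra[sleep$group == 1]
+drug2 <- sleep$extra[sleep$group == 2]
+
+unpaired <- t.test(drug1, drug2)                 # Welch: ignores the pairing
+paired   <- t.test(drug1, drug2, paired = TRUE)  # tests within-subject differences
+
+# Same data, same mean difference; the paired test has the smaller p-value
+c(unpaired = unpaired$p.value, paired = paired$p.value)
+```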
+
+### Non-parametric alternatives
+
+```r
+wilcox.test(y ~ group, data = df)          # Mann-Whitney U (two-sample)
+wilcox.test(before, after, paired = TRUE)  # Wilcoxon signed-rank
+
+# The estimate returned is the Hodges-Lehmann pseudomedian,
+# NOT the sample median. Don't report it as the median.
+wilcox.test(x, conf.int = TRUE)$estimate   # pseudomedian
+```
+
+### Proportions & distributions
+
+```r
+prop.test(c(successes_a, successes_b), c(n_a, n_b))  # two-proportion z-test
+prop.test(x = 42, n = 100, p = 0.5)                  # one-sample vs H0
+
+# p-value is conservative when mean/sd are estimated from x itself
+ks.test(x, "pnorm", mean(x), sd(x))  # K-S goodness-of-fit
+ks.test(x, y)                         # two-sample distributional equality
+```
+
+### Correlation
+
+```r
+cor(x, y)                          # point estimate only — no CI, no p
+cor.test(x, y)                     # CI + p-value; method = "pearson"|"spearman"|"kendall"
+cor.test(x, y, method = "spearman")
+```
+
+Always use `cor.test`, not bare `cor`, when you want inference.
+
+### Chi-square & Fisher
+
+```r
+tbl <- table(df$var1, df$var2)
+chisq.test(tbl)                    # assumes expected counts ≥ 5
+fisher.test(tbl)                   # exact; use when counts are small
+```
+
+### ANOVA
+
+```r
+m <- aov(score ~ group, data = df)
+summary(m)               # F stat and p-value
+TukeyHSD(m)              # post-hoc pairwise with family-wise correction
+
+# Non-parametric equivalent
+kruskal.test(score ~ group, data = df)
+```
+
+### Multiple comparisons
+
+```r
+pairwise.t.test(df$score, df$group, p.adjust.method = "holm")
+# "holm" is uniformly more powerful than Bonferroni; use it by default.
+# "BH" (Benjamini-Hochberg) for FDR control in high-throughput settings.
+p.adjust(p_vec, method = "holm")   # adjust a vector of raw p-values
+```
+
+---
+
+## 2. Linear Models
+
+### Formula operators
+
+| Operator | Meaning |
+|----------|---------|
+| `y ~ x` | regress y on x |
+| `y ~ x1 + x2` | additive main effects |
+| `y ~ x1 : x2` | interaction only |
+| `y ~ x1 * x2` | main effects + interaction (shorthand for `x1 + x2 + x1:x2`) |
+| `y ~ (x1 + x2)^2` | main effects + all two-way interactions among x1, x2 |
+| `y ~ I(x^2)` | arithmetic inside formula (raw squaring) |
+| `y ~ poly(x, 2)` | orthogonal polynomial (prefer over `I(x^2)`) |
+| `y ~ .` | all remaining columns as predictors |
+| `y ~ . - z` | all minus z |
+
+### Fitting and reading summary
+
+```r
+m <- lm(mpg ~ wt + hp, data = mtcars)
+summary(m)
+```
+
+Read `summary()` in order:
+
+1. **F-statistic & p-value** (bottom) — does the model beat a flat mean?
+2. **Adjusted R²** — variance explained, penalised for complexity
+3. **Coefficients table** — estimate, SE, t-value, p-value per term
+4. **Residual standard error** — typical prediction error in y-units
+
+```r
+confint(m)               # 95% CIs on coefficients
+coef(m)                  # named vector of estimates
+fitted(m)                # ŷ for training data
+residuals(m)             # raw residuals
+```
+
+### Diagnostics
+
+```r
+par(mfrow = c(2, 2))
+plot(m)
+# Panel 1: Residuals vs Fitted — non-linearity shows as curve
+# Panel 2: Q-Q of residuals — normality of errors
+# Panel 3: Scale-Location — heteroscedasticity (fanning)
+# Panel 4: Residuals vs Leverage — influential observations (Cook's distance contours at 0.5 and 1)
+
+# Individual Cook's distances
+cooks.distance(m) |> sort(decreasing = TRUE) |> head()
+```
+
+### GLMs
+
+```r
+# Logistic regression (binary outcome)
+m_log <- glm(survived ~ age + fare, data = df, family = binomial)
+summary(m_log)
+
+# CRITICAL: default predict() returns log-odds (link scale)
+predict(m_log, newdata = new_df)                   # log-odds — rarely what you want
+predict(m_log, newdata = new_df, type = "response") # probabilities — usually what you want
+
+# Poisson regression (count outcome)
+m_poi <- glm(count ~ x, data = df, family = poisson)
+
+# Quasi-poisson for overdispersion
+m_qpoi <- glm(count ~ x, data = df, family = quasipoisson)
+```
+
+Exponentiate logistic coefficients for odds ratios:
+
+```r
+exp(coef(m_log))
+exp(confint(m_log))
+```
+
+---
+
+## 3. broom — Model Objects → Tibbles
+
+broom is the bridge between base model objects and the tidyverse.
+Three functions cover almost everything:
+
+| Function | Returns | Use for |
+|----------|---------|---------|
+| `tidy()` | one row per coefficient | extracting estimates, CIs, p-values |
+| `glance()` | one row per model | comparing models; R², AIC, BIC |
+| `augment()` | one row per observation | residuals, fitted values, Cook's D |
+
+```r
+library(broom)
+
+m <- lm(mpg ~ wt + hp, data = mtcars)
+
+tidy(m)                         # coefficients tibble
+tidy(m, conf.int = TRUE)        # + 95% CI columns
+tidy(m, conf.int = TRUE, conf.level = 0.90)
+
+glance(m)                       # r.squared, adj.r.squared, AIC, BIC, sigma, ...
+
+augment(m)                      # .fitted, .resid, .hat, .cooksd, .std.resid
+augment(m, newdata = test_df)   # predictions on new data
+```
+
+`tidy()` and `glance()` also have methods for `glm`, `aov`, `t.test`,
+`wilcox.test`, `cor.test`, and many modelling packages; `augment()` applies
+only to models with per-observation output. Check `?tidy.<class>` for
+method-specific args.
+
+```r
+# Pattern: compare many models at once
+library(purrr)
+library(dplyr)
+models <- list(
+  simple  = lm(mpg ~ wt,       data = mtcars),
+  full    = lm(mpg ~ wt + hp,  data = mtcars),
+  poly    = lm(mpg ~ poly(wt, 2) + hp, data = mtcars)
+)
+# map() + list_rbind() instead of the superseded map_dfr()
+map(models, glance) |>
+  list_rbind(names_to = "model") |>
+  select(model, adj.r.squared, AIC, BIC) |>
+  arrange(AIC)
+```
+
+---
+
+## 4. tidymodels — Modern ML Framework
+
+tidymodels replaces `caret`. Use it when you need:
+- Train/test splits with resampling (CV)
+- Preprocessing pipelines that prevent data leakage
+- Multiple model types with a unified interface
+- Hyperparameter tuning
+
+Core packages: `rsample`, `recipes`, `parsnip`, `workflows`, `tune`, `yardstick`.
+
+### End-to-end skeleton
+
+```r
+library(tidymodels)   # loads all core packages
+
+# 1. Split ----------------------------------------------------------------
+set.seed(42)
+split  <- initial_split(df, prop = 0.8, strata = outcome)
+train  <- training(split)
+test   <- testing(split)
+
+# Cross-validation folds (on training data only)
+folds  <- vfold_cv(train, v = 10, strata = outcome)
+
+# 2. Recipe (preprocessing) -----------------------------------------------
+rec <- recipe(outcome ~ ., data = train) |>
+  step_impute_median(all_numeric_predictors()) |>
+  step_normalize(all_numeric_predictors()) |>
+  step_dummy(all_nominal_predictors()) |>
+  step_zv(all_predictors())    # remove zero-variance columns
+
+# 3. Model spec (parsnip) -------------------------------------------------
+spec_rf <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) |>
+  set_engine("ranger") |>
+  set_mode("classification")
+
+# 4. Workflow (bundle recipe + model) -------------------------------------
+wf <- workflow() |>
+  add_recipe(rec) |>
+  add_model(spec_rf)
+
+# 5. Tune -----------------------------------------------------------------
+# dials >= 1.3.0: grid_space_filling() supersedes grid_latin_hypercube()
+grid <- grid_space_filling(mtry(range = c(2, 10)), min_n(), size = 20)
+
+tune_res <- tune_grid(
+  wf,
+  resamples = folds,
+  grid      = grid,
+  metrics   = metric_set(roc_auc, accuracy),
+  control   = control_grid(save_pred = TRUE)
+)
+
+# 6. Select best & finalise -----------------------------------------------
+best_params <- select_best(tune_res, metric = "roc_auc")
+final_wf    <- finalize_workflow(wf, best_params)
+
+# 7. Last fit (train on full train, evaluate on test) --------------------
+last_fit_res <- last_fit(final_wf, split)
+
+collect_metrics(last_fit_res)   # roc_auc, accuracy on held-out test
+collect_predictions(last_fit_res) |>
+  roc_curve(truth = outcome, .pred_yes) |>
+  autoplot()
+
+# 8. Final model for production ------------------------------------------
+final_model <- fit(final_wf, data = df)  # refit on all data
+predict(final_model, new_data = new_df)
+```
+
+### Common parsnip engines
+
+```r
+# Linear regression
+linear_reg() |> set_engine("lm")
+linear_reg(penalty = tune()) |> set_engine("glmnet")   # ridge/lasso
+
+# Logistic regression
+logistic_reg() |> set_engine("glm")
+logistic_reg(penalty = tune()) |> set_engine("glmnet")
+
+# Random forest
+rand_forest() |> set_engine("ranger") |> set_mode("classification")
+rand_forest() |> set_engine("ranger") |> set_mode("regression")
+
+# Gradient boosting
+boost_tree() |> set_engine("xgboost") |> set_mode("classification")
+
+# Support vector machine
+svm_rbf() |> set_engine("kernlab") |> set_mode("classification")
+```
+
+### yardstick metrics
+
+```r
+# Regression
+metrics(results, truth = y, estimate = .pred)   # rmse, rsq, mae
+
+# Classification (binary)
+roc_auc(results, truth = outcome, .pred_yes)
+accuracy(results, truth = outcome, estimate = .pred_class)
+conf_mat(results, truth = outcome, estimate = .pred_class)
+
+# Multi-metric
+metric_set(roc_auc, accuracy, f_meas)
+```
+
+---
+
+## 5. Which Test?
+
+| Situation | Test |
+|-----------|------|
+| Compare mean to value, normal data | `t.test(x, mu=)` |
+| Compare two means, unpaired, normal | `t.test(y ~ group)` |
+| Compare two means, **paired** | `t.test(x, y, paired=TRUE)` |
+| Two means, non-normal / ordinal | `wilcox.test` |
+| Paired, non-normal | `wilcox.test(paired=TRUE)` |
+| Three+ group means | `aov` + `TukeyHSD` |
+| Three+ groups, non-normal | `kruskal.test` + `pairwise.wilcox.test` |
+| Two proportions | `prop.test` |
+| Categorical association (expected ≥ 5) | `chisq.test` |
+| Categorical association (small counts) | `fisher.test` |
+| Normality screening | `shapiro.test` + Q-Q plot |
+| Distributional equality | `ks.test` |
+| Linear association (inference) | `cor.test` |
+| Continuous outcome, 1+ predictors | `lm` |
+| Binary outcome | `glm(family=binomial)` |
+| Count outcome | `glm(family=poisson)` |
+| CV + tuning + multiple models | tidymodels |
+
+---
+
+## 6. Gotchas
+
+**`wilcox.test` pseudomedian** — `$estimate` is the Hodges-Lehmann
+estimator, not the sample median. For symmetric distributions they're
+close; for skewed data they diverge. Don't label it "median" in a report.
+
+**`predict()` link vs response** — for any GLM, the default `type` is
+`"link"` (log-odds for logistic, log for Poisson). Always pass
+`type = "response"` unless you specifically want the link scale.
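+
+A minimal sketch of how the two scales relate, fitted on the built-in `mtcars`
+data purely for illustration: for a logistic model the inverse logit
+(`plogis()`) maps the link-scale prediction onto the response scale.
+
+```r
+m <- glm(am ~ wt, data = mtcars, family = binomial)
+new_df <- data.frame(wt = c(2.5, 3.5))
+
+log_odds <- predict(m, newdata = new_df)                     # link scale
+prob     <- predict(m, newdata = new_df, type = "response")  # probability scale
+
+all.equal(plogis(log_odds), prob)   # the inverse link recovers the probabilities
+```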
+
+**`cor()` gives no inference** — it's just a scalar. Use `cor.test()`
+whenever you need a p-value or CI.
+
+**Data leakage in recipes** — `prep()` must estimate its statistics from
+training data only; `bake()` then applies them to new data. tidymodels handles
+this automatically inside `fit_resamples()` / `tune_grid()`. If you call
+`prep(rec, training = full_data)` manually, you've leaked.
+
+**`shapiro.test` limitations** — it errors for n > 5000, rejects for trivial
+departures as n approaches that limit, and has low power at n < 20 (rarely
+rejects). Use it as a screen, not a verdict. A Q-Q plot at any sample size is
+more informative.
+
+**Adjusted R² vs R²** — `summary(m)$r.squared` increases with every
+added predictor. Use `adj.r.squared` or AIC/BIC (from `glance()`) for
+model comparison.
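+
+To illustrate, the sketch below adds a pure-noise predictor (made up for the
+example) to a model on `mtcars`. R² cannot go down for nested least-squares
+fits, which is why it is a poor comparison metric:
+
+```r
+set.seed(1)
+d <- transform(mtcars, noise = rnorm(nrow(mtcars)))  # predictor unrelated to mpg
+
+m1 <- lm(mpg ~ wt, data = d)
+m2 <- lm(mpg ~ wt + noise, data = d)
+
+c(summary(m1)$r.squared, summary(m2)$r.squared)          # second is never smaller
+c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared)  # penalised: can go down
+AIC(m1, m2)                                              # lower is better
+```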
+
+**Multiple comparisons** — `pairwise.t.test()` and `p.adjust()` already default
+to `"holm"`, but state the method explicitly so readers don't have to look it
+up. For genomics-scale testing use `"BH"` (FDR control), not Bonferroni.
+
+**`aov` vs `lm`** — `aov()` is `lm()` with a different summary format.
+`model.matrix(aov(...))` is identical. You can pass an `aov` object to
+`broom::tidy()` just like `lm`.
+
+**tidymodels vs base lm** — base `lm`/`glm` is faster to write, returns
+familiar objects, and is fine for: EDA, simple inference, fixed datasets,
+one model. Reach for tidymodels when you need reproducible preprocessing,
+cross-validated performance estimates, or model comparison at scale.

+ 426 - 0
skills/r-ops/references/strings-dates-factors.md

@@ -0,0 +1,426 @@
+# Strings, Dates, and Factors — Modern R Reference
+
+The three tidyverse packages that replace the clunkiest parts of base R:
+**stringr** for strings, **lubridate** for dates/times, **forcats** for
+categoricals. All three are core tidyverse as of tidyverse 2.0.
+
+```r
+library(tidyverse)  # loads all three
+```
+
+---
+
+## 1. Strings and Regular Expressions (stringr)
+
+Every stringr function starts with `str_`. That prefix-consistency is
+intentional — in RStudio, typing `str_` triggers autocomplete over the
+full function set.
+
+### Building strings
+
+```r
+# Concatenation — tidyverse-safe paste0()
+str_c("Hello ", c("Ana", "Bob"))           # vectorises, NA propagates
+str_c("a", "b", "c", sep = "-")            # "a-b-c"
+
+# Interpolation — cleaner for templates
+str_glue("Hello {name}, you are {age}!")   # {expr} is evaluated in scope
+str_glue_data(df, "Row {seq_along(col)}: {col}")   # columns of df are in scope
+
+# Base equivalents (avoid in new code):
+# paste0() / paste() / sprintf()
+```
+
+Raw strings avoid backslash hell (R >= 4.0):
+
+```r
+r"(C:\Users\name\file.txt)"   # no escaping needed
+r"[He said "yes"]"            # switch to [] or {} delimiters if the content contains )"
+```
+
+### Detecting and counting
+
+```r
+str_detect(x, pattern)        # logical vector; pairs with filter()
+str_which(x, pattern)         # integer positions
+str_subset(x, pattern)        # returns matching strings
+str_count(x, pattern)         # count matches per element
+```
+
+```r
+# Use sum/mean with str_detect for aggregates
+sum(str_detect(words, "^[aeiou]"))          # how many start with vowel
+mean(str_detect(words, "ing$"))             # proportion ending in "ing"
+```
+
+### Extracting
+
+```r
+str_extract(x, pattern)       # first match per string (NA if none)
+str_extract_all(x, pattern)   # list of all matches per string
+str_sub(x, start, end)        # positional slice; negative = from end
+```
+
+```r
+str_extract("2024-05-01", "\\d{4}")         # "2024"
+str_extract_all("a1 b2 c3", "\\d")          # list: c("1","2","3")
+str_sub("Hello", 1, 3)                      # "Hel"
+str_sub("Hello", -3, -1)                    # "llo"
+```
+
+### Replacing and splitting
+
+```r
+str_replace(x, pattern, replacement)        # first match
+str_replace_all(x, pattern, replacement)    # all matches
+
+str_split(x, pattern)                       # returns list
+str_split_fixed(x, pattern, n)             # returns matrix (n cols)
+str_split_i(x, pattern, i)                 # extract i-th piece (vectorised)
+```
+
+```r
+str_replace_all("aabbcc", "[bc]", "X")     # "aaXXXX"
+str_replace_all(x, c("a" = "1", "b" = "2")) # named vector = multiple rules
+
+str_split_i("2024-05-01", "-", 2)          # "05"
+```
+
+### Padding, trimming, case
+
+```r
+str_pad(x, width, side = "left", pad = " ")   # pad to minimum width
+str_trim(x, side = "both")                    # strip whitespace
+str_squish(x)                                 # trim + collapse internal spaces
+
+str_to_lower(x)
+str_to_upper(x)
+str_to_title(x)                               # Title Case
+str_to_sentence(x)                            # Sentence case
+```
+
+### Regex syntax in R
+
+stringr patterns use the ICU regular-expression engine (via stringi), not PCRE; the syntax is largely Perl-like. Key elements:
+
+| Pattern | Meaning |
+|---------|---------|
+| `.` | Any character except `\n` |
+| `^` / `$` | Start / end of string |
+| `[abc]` | Character class |
+| `[^abc]` | Negated class |
+| `\d` / `\D` | Digit / non-digit |
+| `\w` / `\W` | Word char / non-word |
+| `\s` / `\S` | Whitespace / non-whitespace |
+| `a?` | 0 or 1 of `a` |
+| `a+` | 1 or more |
+| `a*` | 0 or more |
+| `a{3}` / `a{2,4}` | Exact / range count |
+| `(abc)` | Capturing group |
+| `(?:abc)` | Non-capturing group |
+| `\1` | Backreference to group 1 |
+| `a\|b` | Alternation |
+
+In R strings, `\` must be doubled: to match a literal dot write `"\\."`, and
+to use the digit class `\d` write `"\\d"`. Raw strings sidestep this:
+
+```r
+str_detect(x, r"(\d{4}-\d{2}-\d{2})")    # ISO date pattern, no doubling
+```
+
+### Modifier functions
+
+Pass these instead of a plain string to tune matching:
+
+```r
+# Case-insensitive
+str_detect(x, regex("hello", ignore_case = TRUE))
+
+# Multiline — ^ and $ match line boundaries
+str_extract(x, regex("^\\w+", multiline = TRUE))
+
+# Literal matching — disables all metacharacters
+str_detect(x, fixed("a.b.c"))            # matches the literal string
+
+# Word boundaries — locale-aware tokenising
+str_extract_all(x, boundary("word"))     # list of words per string
+```
+
+### Useful patterns
+
+```r
+# Extract email-like tokens
+str_extract_all(text, "[\\w.+-]+@[\\w-]+\\.[\\w.]+")
+
+# Strip HTML tags
+str_remove_all(html, "<[^>]+>")
+
+# Capture and reuse groups
+str_replace(x, "(\\w+) (\\w+)", "\\2 \\1")  # swap first two words
+
+# Normalise whitespace
+str_squish(str_to_lower(x))
+```
+
+### Base R equivalents (for reading legacy code)
+
+| base | stringr |
+|------|---------|
+| `grepl(pat, x)` | `str_detect(x, pat)` |
+| `grep(pat, x)` | `str_which(x, pat)` |
+| `sub(pat, rep, x)` | `str_replace(x, pat, rep)` |
+| `gsub(pat, rep, x)` | `str_replace_all(x, pat, rep)` |
+| `regmatches(x, regexpr(...))` | `str_extract(x, pat)` |
+| `substr(x, s, e)` | `str_sub(x, s, e)` |
+| `paste0(...)` | `str_c(...)` |
+
+---
+
+## 2. Dates and Times (lubridate)
+
+Base R's `as.Date()` / `as.POSIXct()` / `as.POSIXlt()` work but are
+inconsistent. lubridate wraps them with a uniform interface and handles
+the common traps automatically.
+
+### Parsing from strings
+
+Name the parser after the order of components in your data:
+
+```r
+ymd("2024-05-01")                    # "2024-05-01" <date>
+mdy("05/01/2024")                    # same result
+dmy("01-May-2024")                   # same result
+ymd_hms("2024-05-01 14:30:00")       # <dttm>
+ymd_hm("2024-05-01 14:30")
+mdy_hm("05/01/2024 2:30pm")          # no seconds, so _hm; AM/PM parsed automatically
+```
+
+These functions are flexible — they handle separators, ordinals (`31st`),
+abbreviated and full month names. They return `NA` with a warning for
+unparseable input rather than throwing an error.
+
+### From components
+
+```r
+make_date(year = 2024, month = 5, day = 1)
+make_datetime(year, month, day, hour, minute, second, tz = "UTC")
+
+# From numeric Unix epoch
+as_datetime(1714521600)          # seconds since 1970-01-01 UTC
+as_date(19843)                   # days since 1970-01-01
+```
+
+### Accessors (get and set)
+
+```r
+x <- ymd_hms("2024-05-01 14:30:45")
+year(x)      # 2024
+month(x)     # 5
+month(x, label = TRUE)   # May (ordered factor)
+day(x)       # 1
+wday(x)      # 4 (1 = Sunday by default)
+wday(x, label = TRUE, abbr = FALSE)  # "Wednesday"
+yday(x)      # day of year: 122
+hour(x)      # 14
+minute(x)    # 30
+second(x)    # 45
+
+# Setters use the same functions on the left-hand side
+year(x) <- 2025
+month(x) <- 12
+```
+
+### Rounding
+
+```r
+floor_date(x, unit = "month")     # first instant of the month
+ceiling_date(x, unit = "week")    # first instant of next week
+round_date(x, unit = "hour")      # nearest hour
+
+# Common units: "second", "minute", "hour", "day", "week", "month",
+#               "bimonth", "quarter", "halfyear", "year"
+```
+
+Useful for binning time series:
+
+```r
+df |> mutate(week = floor_date(ts, "week")) |> count(week)
+```
+
+### Time spans
+
+lubridate has three distinct span types:
+
+| Type | Constructors | Definition |
+|------|-------|------------|
+| Duration | `dseconds()` etc | Fixed number of seconds |
+| Period | `seconds()` etc | Calendar-aware (months, years) |
+| Interval | `interval()` / `%--%` | A specific span between two instants |
+
+```r
+# Durations — always exact seconds
+ddays(1)              # 86400s regardless of DST
+dhours(3) + dminutes(30)
+
+# Periods — calendar-friendly
+days(1)               # "1 day" — may be 23/24/25 hours across DST
+months(1) + years(2)
+ymd("2024-01-31") %m+% months(1)   # "2024-02-29"; %m+% rolls back to a valid date (plain + gives NA)
+
+# Intervals — for "how long between these two instants"
+start %--% end
+as.duration(start %--% end)
+as.period(start %--% end)
+int_length(start %--% end)    # seconds
+```
+
+Choose **periods** when you mean "one calendar month later". Choose
+**durations** when you mean "86400 seconds later" (physics/scheduling).
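+
+A sketch of the difference across a daylight-saving change, assuming the
+`America/New_York` zone is available in the local time-zone database:
+
+```r
+library(lubridate)
+
+# US clocks moved forward on 2024-03-10, so this local day lasted 23 hours
+start <- ymd_hms("2024-03-09 12:00:00", tz = "America/New_York")
+
+start + days(1)    # period: same clock time the next day (12:00)
+start + ddays(1)   # duration: exactly 86400 s later (13:00 local time)
+```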
+
+### Time zones
+
+```r
+now(tzone = "Australia/Sydney")
+
+# Change display without changing the instant
+with_tz(x, "America/New_York")
+
+# Change the instant, keep the clock reading (dangerous — use rarely)
+force_tz(x, "Europe/London")
+
+# List valid zone names
+OlsonNames()
+```
+
+### Base R contrast
+
+```r
+# Base: parsing is format-sensitive and error-prone
+as.Date("01/05/2024", format = "%d/%m/%Y")   # must specify format exactly
+as.POSIXct("2024-05-01 14:30", tz = "UTC")
+
+# Base POSIXlt is a list — common traps:
+lt <- as.POSIXlt("2024-05-01")
+lt$year   # 124, not 2024 — stored as years since 1900
+lt$mon    # 4, not 5 — 0-based months (0 = January)
+
+# lubridate spares you both traps:
+year(ymd("2024-05-01"))   # 2024
+month(ymd("2024-05-01"))  # 5
+```
+
+---
+
+## 3. Categorical Variables (forcats)
+
+Factors in base R encode categorical variables as integer codes with a
+`levels` attribute. The coding determines sort order in plots and the
+reference level in models — so getting it right matters.
+
+### Creating factors
+
+```r
+# Base — silently converts unknowns to NA
+factor(x, levels = c("low", "med", "high"))
+
+# forcats::fct() — errors on unknown levels (safer)
+fct(x, levels = c("low", "med", "high"))
+
+# Ordered factor for ordinal data
+factor(x, levels = c("low", "med", "high"), ordered = TRUE)
+```
+
+### Reordering levels
+
+```r
+# Reorder by another numeric variable (plots, not models)
+fct_reorder(f, x)                   # order f levels by median of x
+fct_reorder(f, x, .fun = mean)      # use mean instead
+fct_reorder2(f, x, y)               # for line plots: order by y at max x
+
+# Move specific levels to the front
+fct_relevel(f, "Other", "NA")       # put these first, rest unchanged
+fct_relevel(f, "last_level", after = Inf)  # move to end
+
+# Most frequent first
+fct_infreq(f)
+fct_inorder(f)                      # by first appearance in data
+```
+
+```r
+# Canonical ggplot pattern
+df |>
+  mutate(cat = fct_reorder(cat, value)) |>
+  ggplot(aes(x = value, y = cat)) +
+  geom_col()
+```
+
+### Recoding level values
+
+```r
+# Rename individual levels
+fct_recode(f,
+  "United States" = "US",
+  "United Kingdom" = "GB"
+)
+
+# Collapse multiple levels into one
+fct_collapse(f,
+  anglo = c("US", "GB", "AU", "CA"),
+  other = c("FR", "DE", "JP")
+)
+
+# Lump rare levels together
+fct_lump_n(f, n = 5)            # keep top 5 by frequency, rest → "Other"
+fct_lump_prop(f, prop = 0.05)   # keep levels covering >= 5% of data
+fct_lump_min(f, min = 10)       # keep levels with at least 10 obs
+fct_other(f, keep = c("A","B")) # explicit keep-list, rest → "Other"
+```
+
+### Dropping and adding levels
+
+```r
+fct_drop(f)                     # remove levels with 0 observations
+fct_expand(f, "new_level")      # add a level without adding data
+fct_na_value_to_level(f, level = "(Missing)")  # make NA a visible level (forcats >= 1.0.0; replaces fct_explicit_na())
+```
+
+### Why level order matters
+
+**Plots**: ggplot uses factor level order for axis order and legend order.
+The default (alphabetical) is almost never what you want for bar/lollipop
+charts.
+
+**Models**: `lm()`, `glm()`, etc. treat the first level as the reference
+category. Changing the level order changes the intercept and coefficient
+interpretation.
+
+```r
+# Set reference level for modelling
+f <- fct_relevel(f, "control")  # "control" becomes the baseline
+```
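+
+To see the effect on a fitted model, here is a small sketch on `mtcars` with
+`cyl` converted to a factor for illustration. Only the baseline changes, yet
+every coefficient name and value shifts with it:
+
+```r
+library(forcats)
+
+d <- transform(mtcars, cyl = factor(cyl))   # levels "4", "6", "8"
+fit_a <- lm(mpg ~ cyl, data = d)
+names(coef(fit_a))                          # "(Intercept)" "cyl6" "cyl8"
+
+d$cyl <- fct_relevel(d$cyl, "8")            # make 8 cylinders the baseline
+fit_b <- lm(mpg ~ cyl, data = d)
+names(coef(fit_b))                          # "(Intercept)" "cyl4" "cyl6"
+```
+
+In both fits the intercept is the mean `mpg` of the baseline group.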
+
+---
+
+## Gotchas
+
+**stringr**
+- `str_extract()` returns `NA` for no-match, not `""`. Check with `!is.na()`.
+- In regex strings, `\` must be doubled: `\\d`, `\\s`, `\\.`. Use raw strings `r"(...)"` to avoid this.
+- `str_replace_all()` with a named vector applies rules left-to-right; overlapping replacements may interact unexpectedly.
+- `str_split()` returns a list. Use `str_split_fixed()` or `str_split_i()` for rectangular output.
+- `str_c()` with `NA` returns `NA`; use `coalesce(x, "fallback")` before concatenating if NAs should be treated as empty.
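+
+A short sketch of the `NA` behaviour from the last point, with `paste0()` shown
+for contrast; the vector `x` is illustrative:
+
+```r
+library(stringr)
+library(dplyr)   # for coalesce()
+
+x <- c("a", NA)
+
+str_c("id-", x)                  # "id-a" NA
+str_c("id-", coalesce(x, ""))    # "id-a" "id-"
+paste0("id-", x)                 # "id-a" "id-NA"
+```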
+
+**lubridate**
+- `months(1)` (period) vs `dmonths(1)` (duration = 30.44 days average). Adding periods to dates is usually what you want; adding durations can produce fractional days.
+- `ymd("2024-01-31") + months(1)` returns `NA` because February 31 does not exist; plain `+` never clips to a valid date, leap year or not. Use `%m+%` to roll back to the last day of the month: `ymd("2024-01-31") %m+% months(1)` → `"2024-02-29"`, `ymd("2023-01-31") %m+% months(1)` → `"2023-02-28"`.
+- DST gaps: `force_tz()` on a non-existent local time (e.g. the clocks-forward hour) returns `NA`. Use `with_tz()` to shift display instead.
+- `as.numeric(date)` gives days since 1970-01-01 for `<date>`, seconds since epoch for `<dttm>`. State the unit explicitly, e.g. `as.numeric(difftime(b, a, units = "days"))`, or `int_length(a %--% b)` for seconds.
+- Base `POSIXlt$year` is years-since-1900 and `$mon` is 0-based. Never access POSIXlt slots directly in new code.
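+
+The month-arithmetic point can be checked directly. In this sketch `%m+%`
+lands on the last valid day of the target month:
+
+```r
+library(lubridate)
+
+ymd("2023-01-31") + months(1)      # NA: 2023-02-31 does not exist
+ymd("2023-01-31") %m+% months(1)   # "2023-02-28"
+ymd("2024-01-31") %m+% months(1)   # "2024-02-29" (leap year)
+```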
+
+**forcats**
+- `factor(x)` with an unexpected value silently produces `NA` in the result. Use `fct()` when you want an error instead.
+- `fct_lump_n(f, n)` keeps the top `n` by frequency and lumps the rest into `"Other"`. With the default `ties.method = "min"`, levels tied at position `n` are all kept, so you can end up with more than `n` levels; inspect with `fct_count(f)` first.
+- `fct_reorder()` is for visualisation only; it does not control model reference levels. Use `fct_relevel()` for that.
+- Dropping unused levels after filtering: `droplevels(df$col)` or `fct_drop(col)` — forgetting this leaves empty bars in ggplot.
+- `as.integer(factor_var)` gives the internal code (1-based level index), not the original value. To recover the label: `levels(f)[as.integer(f)]`.
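+
+The integer-code trap from the last point, sketched in base R with a made-up
+factor of numeric-looking labels:
+
+```r
+f <- factor(c("10", "20", "10", "30"))
+
+as.integer(f)                  # 1 2 1 3 (level codes)
+as.integer(as.character(f))    # 10 20 10 30 (original values)
+as.integer(levels(f)[f])       # same result via the levels lookup
+```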

+ 427 - 0
skills/r-ops/references/tidyverse-core.md

@@ -0,0 +1,427 @@
+# Tidyverse Core — dplyr / tidyr / Joins Reference
+
+Operational reference for data manipulation with the tidyverse. Covers the pipe, tibbles, all major dplyr verbs, tidyr reshaping, and joins. Targets R 4.3+ / tidyverse 2.0+.
+
+---
+
+## The Pipe
+
+```r
+# Native pipe — preferred for R 4.1+
+x |> f(y)          # equivalent to f(x, y)
+x |> f(y) |> g(z)  # chain reads left-to-right: "then"
+
+# Placeholder: pipe into non-first argument (R 4.2+)
+mtcars |> lm(mpg ~ cyl, data = _)
+
+# magrittr pipe — still valid, slightly more flexible
+library(magrittr)
+x %>% f(.)         # explicit dot placeholder
+x %T>% plot()      # tee: passes x through AND calls plot(x) for side effects
+x %$% cor(mpg, cyl) # expose columns directly (no $)
+```
+
+**Pipe choice decision table:**
+
+| Situation | Use |
+|---|---|
+| All new code, R ≥ 4.1 | `\|>` |
+| Need dot placeholder in non-first position, R < 4.2 | `%>%` |
+| Side-effect step mid-pipe (print/plot without breaking chain) | `%T>%` |
+| Column-access shorthand inside pipe | `%$%` |
+| Package must support R < 4.1 | `%>%` |
+
+Native `|>` has no runtime overhead; magrittr `%>%` involves a function call.
+
+---
+
+## Tibbles vs Data Frames
+
+```r
+library(tibble)
+
+# Creation
+tb <- tibble(x = 1:3, y = x^2)   # column refs work immediately
+tb <- tribble(
+  ~name, ~score,
+  "A",   91,
+  "B",   84
+)
+
+# Inspection
+glimpse(tb)          # compact: types + first values
+print(tb, n = 20)    # show more rows
+View(tb)             # RStudio interactive viewer
+```
+
+**Key behavioural differences from base data.frame:**
+
+| Behaviour | data.frame | tibble |
+|---|---|---|
+| Printing | All rows/cols | First 10 rows, fits screen |
+| Partial column name match | Yes (`df$mp` → `mpg`) | Error |
+| `df[, "x"]` (single column) | Drops to a vector | Always returns a tibble |
+| String → factor auto-coerce | Default before R 4.0 | Never |
+| `stringsAsFactors` | Needed before R 4.0 to suppress coercion | Not relevant |
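+
+A minimal sketch of the subsetting difference, using a made-up two-column
+frame:
+
+```r
+library(tibble)
+
+df <- data.frame(x = 1:3, y = c("a", "b", "c"))
+tb <- as_tibble(df)
+
+class(df[, "x"])            # "integer": the data frame dimension was dropped
+class(tb[, "x"])            # "tbl_df" "tbl" "data.frame"
+df[, "x", drop = FALSE]     # base R opt-out: stays a data frame
+```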
+
+---
+
+## dplyr — Row Verbs
+
+```r
+library(dplyr)
+
+# filter: keep rows matching conditions (& = AND, | = OR)
+df |> filter(x > 5, y %in% c("a", "b"))   # comma = &
+df |> filter(month == 1 | month == 2)
+df |> filter(month %in% c(1, 2))           # cleaner OR for same-column
+
+# arrange: sort rows
+df |> arrange(year, month, day)
+df |> arrange(desc(dep_delay))
+
+# distinct: unique rows or unique combinations
+df |> distinct()
+df |> distinct(origin, dest)               # unique pairs
+df |> distinct(origin, dest, .keep_all = TRUE)  # keep all cols, first occurrence
+
+# count: shorthand for group_by + summarise(n = n())
+df |> count(carrier)
+df |> count(carrier, sort = TRUE)
+df |> count(carrier, dest, wt = seats)    # weighted count
+
+# slice family
+df |> slice_head(n = 5)
+df |> slice_tail(n = 5)
+df |> slice_sample(n = 10)
+df |> slice_sample(prop = 0.1)
+df |> slice_min(dep_delay, n = 3)
+df |> slice_max(arr_delay, n = 1, with_ties = FALSE)  # exactly 1 row
+
+# Within groups, slice_max/min give the top-n per group:
+df |>
+  group_by(dest) |>
+  slice_max(arr_delay, n = 1)
+```
+
+---
+
+## dplyr — Column Verbs
+
+```r
+# select: pick or drop columns; use tidyselect helpers
+df |> select(year, month, day)
+df |> select(year:dep_time)
+df |> select(!year:dep_time)           # negate range
+df |> select(where(is.numeric))
+df |> select(starts_with("dep_"), ends_with("time"))
+df |> select(contains("delay"), matches("^arr"))
+
+# rename: new = old
+df |> rename(tail_num = tailnum)
+
+# rename_with: apply function to names
+df |> rename_with(toupper)
+df |> rename_with(~ str_replace(.x, "dep_", ""), starts_with("dep_"))
+
+# relocate: reorder columns
+df |> relocate(time_hour, air_time)         # moves to front by default
+df |> relocate(time_hour, .after = day)
+df |> relocate(time_hour, .before = arr_time)
+
+# mutate: add or modify columns (right side by default)
+df |>
+  mutate(
+    gain = dep_delay - arr_delay,
+    speed_mph = distance / air_time * 60,
+    .before = 1                            # put new cols at front
+  )
+df |> mutate(log_price = log(price), .keep = "used")  # keep only cols used
+
+# .keep options in mutate:
+# "all"  (default) — keep all columns
+# "used" — only cols that appear in mutate expressions
+# "unused" — opposite of "used"
+# "none" — only new cols plus grouping variables (like transmute(), now superseded)
+```
+
+---
+
+## across() — Multi-Column Operations
+
+```r
+# Apply function(s) to multiple columns inside mutate/summarise
+df |>
+  mutate(across(where(is.numeric), \(x) round(x, digits = 2)))  # extra args via ... are deprecated since dplyr 1.1.0
+
+df |>
+  mutate(across(c(x, y, z), ~ .x / max(.x, na.rm = TRUE)))
+
+# Named list of functions → generates name_fn columns
+df |>
+  summarise(across(
+    where(is.numeric),
+    list(mean = \(x) mean(x, na.rm = TRUE),
+         sd   = \(x) sd(x, na.rm = TRUE))   # wrap extra args in lambdas, not in across(...)
+  ))
+
+# Control output names with .names
+df |>
+  mutate(across(starts_with("score"), ~ .x * 100, .names = "{.col}_pct"))
+
+# c_across: for rowwise operations
+df |>
+  rowwise() |>
+  mutate(total = sum(c_across(starts_with("score"))))
+```
+
+---
+
+## dplyr — Grouping and Summaries
+
+```r
+# group_by + summarise: classic pattern
+df |>
+  group_by(carrier) |>
+  summarise(
+    n         = n(),
+    avg_delay = mean(dep_delay, na.rm = TRUE),
+    p95_delay = quantile(dep_delay, 0.95, na.rm = TRUE)
+  )
+
+# .by argument (dplyr 1.1.0+): per-operation grouping, no ungroup() needed
+df |>
+  summarise(
+    n         = n(),
+    avg_delay = mean(dep_delay, na.rm = TRUE),
+    .by = carrier
+  )
+
+df |>
+  mutate(rank = dense_rank(desc(dep_delay)), .by = c(origin, month))
+
+# .groups controls residual grouping after multi-level summarise
+df |>
+  group_by(year, month, day) |>
+  summarise(n = n(), .groups = "drop")       # fully ungrouped after
+  # .groups = "drop_last" (default), "keep", "rowwise"
+
+# Always ungroup when done with grouped work if using group_by
+df |> group_by(carrier) |> mutate(...) |> ungroup()
+
+# Useful summary functions
+n()                    # row count
+n_distinct(x)          # unique values
+sum(x, na.rm = TRUE)
+mean(x, na.rm = TRUE)
+median(x, na.rm = TRUE)
+first(x); last(x)      # first/last value in group
+nth(x, 2)              # nth value
+```
+
+---
+
+## tidyr — Pivoting
+
+```r
+library(tidyr)
+
+# Tidy data rules:
+# 1. Each variable → one column
+# 2. Each observation → one row
+# 3. Each value → one cell
+
+# pivot_longer: wide → long (most common; column names become a variable)
+df |>
+  pivot_longer(
+    cols         = starts_with("wk"),   # which cols to pivot
+    names_to     = "week",              # new col for old col names
+    values_to    = "rank",              # new col for old cell values
+    values_drop_na = TRUE               # drop implicit NAs from structure
+  )
+
+# Multiple name parts → multiple name columns
+df |>
+  pivot_longer(
+    cols      = -id,
+    names_to  = c("metric", "year"),
+    names_sep = "_"                    # or names_pattern = "(.+)_(\\d+)"
+  )
+
+# pivot_wider: long → wide (inverse; unique values in a column become column names)
+df |>
+  pivot_wider(
+    id_cols      = id,
+    names_from   = measurement,
+    values_from  = value,
+    values_fill  = 0                   # fill structural NAs
+  )
+
+# Multiple value columns
+df |>
+  pivot_wider(
+    names_from  = year,
+    values_from = c(cases, population)  # generates cases_1999, population_1999, …
+  )
+```
+
+---
+
+## tidyr — Splitting and Combining
+
+```r
+# separate_wider_delim: split on a delimiter (replaces separate())
+df |>
+  separate_wider_delim(
+    col   = code,
+    delim = "-",
+    names = c("prefix", "num")
+  )
+
+# separate_wider_position: split by fixed character widths
+df |>
+  separate_wider_position(
+    col   = code,
+    widths = c(prefix = 3, num = 4)
+  )
+
+# separate_wider_regex: split by regex capture groups
+df |>
+  separate_wider_regex(
+    col   = address,
+    patterns = c(street = "[^,]+", ", ", city = ".+")
+  )
+
+# unite: combine columns into one
+df |>
+  unite(col = "date_str", year, month, day, sep = "-")
+```
+
+---
+
+## tidyr — Nesting and Completeness
+
+```r
+# nest: list-column of data frames per group
+nested <- df |>
+  nest(data = -group_col)
+
+# unnest: explode list-columns back out
+nested |>
+  unnest(data)
+
+# unnest_wider / unnest_longer for non-df list columns
+df |> unnest_wider(json_col)    # list → columns
+df |> unnest_longer(tags_col)   # list → rows
+
+# complete: make implicit missing rows explicit
+df |>
+  complete(year, month, fill = list(sales = 0))
+
+# fill: carry values forward/backward (LOCF)
+df |>
+  fill(product, .direction = "down")   # "up", "downup", "updown"
+
+# drop_na: remove rows with NAs in specified columns
+df |> drop_na()              # any NA
+df |> drop_na(x, y)         # NA in x or y only
+```
+
+---
+
+## Joins
+
+```r
+# Mutating joins — add columns from y to x
+left_join(x, y)              # all rows of x; NAs for unmatched y
+inner_join(x, y)             # only matched rows
+right_join(x, y)             # all rows of y; NAs for unmatched x
+full_join(x, y)              # all rows from both; NAs where unmatched
+
+# Filtering joins — filter x based on y; no new columns
+semi_join(x, y)              # keep x rows that have a match in y
+anti_join(x, y)              # keep x rows that have NO match in y
+
+# Natural join (default): matches on all shared column names — usually wrong
+# Always be explicit:
+left_join(flights, planes, join_by(tailnum))
+
+# join_by: explicit key specification
+left_join(x, y, join_by(x_id == y_id))       # different column names
+left_join(x, y, join_by(id, year == yr))      # multiple keys, mixed names
+
+# Non-equi joins (dplyr 1.1.0+): inequality / range / rolling
+# Overlap join: find all y ranges that overlap x range
+left_join(x, y, join_by(overlaps(x_start, x_end, y_start, y_end)))
+
+# Inequality join
+left_join(x, y, join_by(id, x_date >= y_date))
+
+# Disambiguate shared column names in output
+left_join(flights, planes, join_by(tailnum), suffix = c("_flight", "_plane"))
+
+# Validate keys before joining
+planes |> count(tailnum) |> filter(n > 1)   # check for duplicates
+planes |> filter(is.na(tailnum))            # check for NAs in key
+```
+
+**Join choice decision table:**
+
+| Goal | Join |
+|---|---|
+| Enrich x with metadata from y | `left_join` |
+| Keep only matched rows | `inner_join` |
+| Keep all rows, both sides | `full_join` |
+| Does x have a match in y? (filter only) | `semi_join` |
+| What in x has no match in y? | `anti_join` |
+
+Default to `left_join`. Use `inner_join` only when you explicitly want to drop unmatched rows.
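+
+A small sketch of the table in action, with two made-up tables (all names hypothetical):
+
+```r
+library(dplyr)
+
+orders    <- tibble(order_id = 1:4, cust = c("a", "b", "c", "a"))
+customers <- tibble(cust = c("a", "b"), region = c("N", "S"))
+
+# Enrich: every order is kept; region is NA for the unknown customer "c"
+enriched <- left_join(orders, customers, join_by(cust))
+
+# Diagnose: which orders have no matching customer?
+unmatched <- anti_join(orders, customers, join_by(cust))
+```
+
+Running the `anti_join()` before the `left_join()` is a cheap way to see how many rows will end up with `NA` metadata.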
+
+---
+
+## Missing Values in Manipulation Context
+
+```r
+# NA is infectious: any arithmetic with NA returns NA
+mean(c(1, 2, NA))           # NA — always pass na.rm = TRUE in summaries
+mean(c(1, 2, NA), na.rm = TRUE)  # 1.5
+
+# Test for NA — never use == NA
+is.na(x)
+!is.na(x)
+filter(df, !is.na(price))
+
+# Replace / coerce
+coalesce(x, 0)              # replace NA with fallback value
+na_if(x, -99)               # treat sentinel value as NA
+replace_na(x, 0)            # tidyr: vector form; for data frames use replace_na(df, list(col = 0))
+
+# NaN behaves like NA for most purposes; distinguish with:
+is.nan(x)
+
+# Implicit missing rows → explicit
+df |> complete(year, qtr)   # add rows for every year×qtr combo
+df |> fill(price)           # LOCF (default .direction = "down"); .direction = "up" gives NOCB
+```
+
+---
+
+## Gotchas
+
+**Pipe placeholder before R 4.2.** `x |> f(y, data = _)` requires R 4.2+. In R 4.1, use `%>%` with `.`.
+
+**Natural joins match on all shared columns.** dplyr only prints a `Joining with by = ...` message, so a `year` column in both tables quietly becomes part of the key, which is probably wrong. Always name your keys.
+
+**`group_by()` is sticky.** Grouped data frames stay grouped through `mutate()`, `filter()`, `arrange()`. Unintended downstream effects are a common source of wrong counts. Prefer `.by =` for one-shot grouping, or always call `ungroup()` after a `group_by() |> mutate()` block.
+
+**Multi-group `summarise()` peels the last group.** `group_by(a, b) |> summarise(...)` leaves a group on `a`. Pass `.groups = "drop"` to be explicit.
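+
+A minimal sketch of the leftover grouping, on toy data (column names are illustrative):
+
+```r
+library(dplyr)
+
+sales <- tibble(a = c("x", "x", "y"), b = c("p", "q", "p"), v = 1:3)
+
+peeled  <- sales |> group_by(a, b) |> summarise(total = sum(v))   # result is still grouped by a
+dropped <- sales |> summarise(total = sum(v), .by = c(a, b))      # .by never leaves groups behind
+
+group_vars(peeled)    # the surviving grouping column
+group_vars(dropped)   # empty character vector
+```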
+
+**`distinct()` drops columns by default.** `distinct(origin, dest)` drops all other columns. Use `.keep_all = TRUE` to keep the first occurrence's full row.
+
+**`pivot_wider()` on non-unique id/name combos produces list-columns.** Verify uniqueness with `count()` first. Pass `values_fn = list` to keep list-columns on purpose, or a summary function such as `values_fn = sum` when aggregation is desired.
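+
+A short sketch of collapsing duplicates with a summary function, assuming summing is the right aggregation for this made-up data:
+
+```r
+library(tidyr)
+
+long <- tibble(id = c(1, 1, 2), key = c("a", "a", "b"), value = c(10, 5, 7))  # id 1 / key "a" appears twice
+
+wide <- long |>
+  pivot_wider(
+    names_from  = key,
+    values_from = value,
+    values_fn   = sum,     # collapse duplicates instead of producing list-columns
+    values_fill = 0        # fill combinations that never occurred
+  )
+```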
+
+**`separate_wider_*` supersedes `separate()`.** The old `separate()` (and `extract()`) are superseded as of tidyr 1.3.0. Prefer the typed variants: `separate_wider_delim`, `separate_wider_position`, `separate_wider_regex`.
+
+**`slice_min/max` keeps ties by default.** `n = 1` can return more than one row when values are equal. Set `with_ties = FALSE` for a guaranteed single row.
+
+**`across()` with named function list.** Output columns are named `{.col}_{.fn}`. Control this with `.names = "{.col}_pct"` etc. The `{.fn}` token uses the list name, so name your functions meaningfully.
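+
+A minimal sketch of the default naming, on a made-up two-column tibble:
+
+```r
+library(dplyr)
+
+scores <- tibble(score_a = c(1, 3), score_b = c(2, 4))
+
+out <- scores |>
+  summarise(across(everything(), list(mean = mean, max = max)))
+
+names(out)   # one column per (input column, function name) pair, in {.col}_{.fn} form
+```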

+ 379 - 0
skills/r-ops/references/time-series.md

@@ -0,0 +1,379 @@
+# Time Series Analysis in R
+
+Modern best practice centres on the **tidyverts** ecosystem (`tsibble` + `feasts` + `fable`). The older `xts`/`zoo` pair still dominates in finance; base `ts` shows up everywhere — know it but don't start new work with it.
+
+---
+
+## Object Classes
+
+### Decision Guide
+
+| Situation | Use |
+|---|---|
+| New project, general forecasting | `tsibble` (tidyverts) |
+| Finance data, irregular intervals, xts already in pipeline | `xts` |
+| Reading legacy code / CRAN examples | `ts` (regular, fixed freq) |
+| Need zoo's partial regularity | `zoo` |
+
+### Base `ts`
+
+```r
+# Monthly data starting Jan 2020
+ts_obj <- ts(values, start = c(2020, 1), frequency = 12)
+time(ts_obj)        # decimal dates
+cycle(ts_obj)       # 1..12 per year
+window(ts_obj, start = c(2022, 1))  # subset
+```
+
+Limitation: single time index, fixed frequency, no multiple series or metadata.
+
+### zoo / xts
+
+```r
+library(zoo)
+library(xts)
+
+# zoo: arbitrary index (Date, POSIXct, numeric)
+z <- zoo(values, order.by = dates)
+index(z)         # extract index
+coredata(z)      # extract matrix of values
+
+# xts: requires a time-based index (Date, POSIXct, yearmon, …); extends zoo
+x <- xts(matrix(values, ncol = 1), order.by = as.POSIXct(dates))
+```
+
+### tsibble
+
+```r
+library(tsibble)
+
+tsbl <- as_tsibble(df, key = symbol, index = date)
+# key:   grouping variable (stock ticker, store ID, …)
+# index: time variable — must be a recognised temporal class
+```
+
+`tsibble` requires unique (key, index) combinations but does not insert missing periods for you. Check with `has_gaps()` and use `fill_gaps()` to turn implicit gaps into explicit `NA` rows.
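+
+A small sketch of detecting and filling a gap, using a hypothetical monthly series with March missing:
+
+```r
+library(tsibble)
+
+ts_obs <- tsibble(
+  month = yearmonth(c("2024 Jan", "2024 Feb", "2024 Apr")),   # March is absent
+  value = c(1, 2, 4),
+  index = month
+)
+
+has_gaps(ts_obs)$.gaps    # is any period missing?
+filled <- fill_gaps(ts_obs)   # adds a March row with value = NA
+nrow(filled)
+```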
+
+---
+
+## tidyverts Modern Workflow
+
+Four packages, one pipeline:
+
+| Package | Role |
+|---|---|
+| `tsibble` | Data structure |
+| `feasts` | Features, decomposition, ACF/PACF, STL |
+| `fable` | Models: ARIMA, ETS, TSLM, NNETAR, MEAN, NAIVE |
+| `fabletools` | `model()`, `forecast()`, `accuracy()`, `autoplot()` |
+
+### End-to-End Skeleton
+
+```r
+library(tsibble)
+library(feasts)
+library(fable)
+library(fabletools)
+library(dplyr)
+
+# 1. Build tsibble
+tsbl <- df |>
+  mutate(month = yearmonth(date)) |>
+  as_tsibble(key = series_id, index = month)
+
+# 2. Diagnostics
+tsbl |> gg_season(value)          # seasonal plots
+tsbl |> ACF(value) |> autoplot()
+tsbl |> PACF(value) |> autoplot()
+
+# 3. Decomposition (STL)
+stl_dcmp <- tsbl |>
+  model(STL(value ~ trend(window = 13) + season(window = "periodic"))) |>
+  components()
+autoplot(stl_dcmp)
+
+# 4. Fit models
+fit <- tsbl |>
+  model(
+    arima  = ARIMA(value),          # stepwise search; set stepwise=FALSE for exhaustive
+    ets    = ETS(value),
+    tslm   = TSLM(value ~ trend() + season()),
+    nnetar = NNETAR(value)
+  )
+
+# 5. Forecast
+fc <- fit |> forecast(h = "2 years")
+autoplot(fc, tsbl)
+
+# 6. Accuracy (in-sample; use stretch_tsibble for CV)
+accuracy(fit)
+
+# 7. Cross-validation accuracy
+tsbl_tr <- tsbl |>
+  stretch_tsibble(.init = 36, .step = 1)
+
+fit_cv <- tsbl_tr |> model(arima = ARIMA(value))
+fc_cv  <- fit_cv |> forecast(h = 12)
+fc_cv |> accuracy(tsbl)
+```
+
+### Key fable Model Specs
+
+```r
+# ARIMA with forced order
+ARIMA(log(value) ~ 0 + pdq(1,1,1) + PDQ(1,1,0))
+
+# ETS with explicit method
+ETS(value ~ error("A") + trend("Ad") + season("A"))
+
+# TSLM with external regressors
+TSLM(value ~ trend() + season() + xreg_column)
+
+# Combination / ensemble
+(ARIMA(value) + ETS(value)) / 2
+```
+
+### Transformations
+
+```r
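+# Box-Cox inside model spec; lambda must be numeric, e.g. chosen by the guerrero feature (feasts)
+lambda <- tsbl |> features(value, guerrero) |> dplyr::pull(lambda_guerrero)  # single-series tsibble assumed
+fit <- tsbl |> model(ARIMA(box_cox(value, lambda)))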
+
+# log shorthand
+ARIMA(log(value))
+```
+
+---
+
+## Stationarity
+
+### Unit Root Tests
+
+```r
+library(urca)          # preferred; more complete than tseries
+
+# ADF — H0: unit root (non-stationary)
+ur_adf <- ur.df(tsbl$value, type = "drift", selectlags = "AIC")
+summary(ur_adf)        # reject H0 → stationary
+
+# KPSS — H0: stationary
+ur_kpss <- ur.kpss(tsbl$value, type = "tau")
+summary(ur_kpss)       # fail to reject H0 → stationary
+
+# Quick ndiffs / nsdiffs (fable-aware)
+library(feasts)
+tsbl |> features(value, list(unitroot_ndiffs, unitroot_nsdiffs))
+```
+
+### Differencing in fable
+
+`ARIMA()` auto-determines d and D. To force:
+
+```r
+ARIMA(value ~ pdq(d = 1) + PDQ(D = 1, period = 12))   # fixes d and D; p, q, P, Q are still searched
+```
+
+Manual differencing outside of model:
+
+```r
+tsbl <- tsbl |> mutate(d_value = difference(value))
+```
+
+---
+
+## Order Identification (ACF / PACF)
+
+| Pattern | Interpretation |
+|---|---|
+| ACF cuts off at lag q, PACF tails off | MA(q) |
+| PACF cuts off at lag p, ACF tails off | AR(p) |
+| Both tail off | ARMA(p, q) — mixed |
+| ACF decays slowly | Non-stationary — difference first |
+
+```r
+# feasts / fabletools
+tsbl |> ACF(value, lag_max = 48) |> autoplot()
+tsbl |> PACF(value, lag_max = 48) |> autoplot()
+
+# Or combined
+tsbl |> gg_tsdisplay(value, plot_type = "partial", lag_max = 48)
+```
+
+These give initial *estimates* for p and q — AIC/BIC comparisons across candidate models confirm.
+
+---
+
+## STL Decomposition
+
+STL (Seasonal and Trend decomposition using Loess) handles any seasonal period, including multiple seasonal terms, and can be made robust to outliers with `robust = TRUE`.
+
+```r
+stl_fit <- tsbl |>
+  model(
+    STL(value ~ trend(window = 13) + season(window = "periodic"),
+        robust = TRUE)           # robust = TRUE down-weights outliers
+  )
+
+components(stl_fit) |> autoplot()    # trend + seasonal + remainder
+components(stl_fit) |> as_tsibble()  # access programmatically
+```
+
+For seasonally-adjusted data:
+
+```r
+components(stl_fit) |>
+  mutate(sa = value - season_year) |>
+  autoplot(sa)
+```
+
+---
+
+## xts Essentials
+
+Use when data arrives as xts (from quantmod, tidyquant, Bloomberg, etc.).
+
+### Core Operations
+
+```r
+library(xts)
+
+# Subset by ISO string
+x["2023"]              # full year
+x["2023-06/2023-12"]   # range
+x["2023-06/"]          # open-ended from
+
+# Period aggregation
+apply.monthly(x, colMeans)
+apply.quarterly(x, sum)
+to.period(x, "months", OHLC = FALSE)  # open/high/low/close columns if TRUE
+
+# Rolling window
+rollapply(x, width = 20, FUN = mean, align = "right")  # see Gotchas
+rollapply(x, width = 20, FUN = sd,   align = "right", fill = NA)
+
+# Merge (outer join; fills with NA)
+merged <- merge(x1, x2)            # union of dates
+merged <- merge(x1, x2, join = "inner")
+
+# Fill missing
+na.locf(x)             # carry last observation forward
+na.locf(x, fromLast = TRUE)  # carry next observation backward
+na.approx(x)           # linear interpolation
+
+# Endpoints (index positions of period boundaries)
+ep <- endpoints(x, on = "months")
+period.apply(x, ep, mean)
+```
+
+### Convert to/from tibble
+
+```r
+library(tibble)
+library(dplyr)
+
+df  <- as_tibble(fortify.zoo(x)) |> rename(date = Index)   # fortify.zoo() stores the index in an "Index" column
+xts_out <- xts(df |> select(-date), order.by = as.POSIXct(df$date))
+```
+
+---
+
+## Package Notes
+
+| Package | Status | Notes |
+|---|---|---|
+| `fable` | **Active** | Tidyverts flagship; supersedes `forecast` for new work |
+| `feasts` | **Active** | Feature extraction, decomposition, diagnostics |
+| `forecast` | Maintenance | `auto.arima()` still works; no new features. Migrating: `auto.arima(x)` → `model(ARIMA(x))` |
+| `prophet` | Active (Meta) | Good for business series with holidays, multiple seasonalities; less statistically principled |
+| `xts` / `zoo` | Stable | Finance standard; will not disappear |
+| `timetk` | Active | Bridge: xts ↔ tibble, visualisation, anomaly detection |
+
+---
+
+## Gotchas
+
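+### `lag()` sign convention: xts lags, zoo / `ts` lead
+
+```r
+# xts: positive k is a true LAG (row t holds the value from t - 1), same direction as dplyr::lag()
+lag(x, k = 1)          # dispatches to lag.xts() for xts objects
+
+# xts LEAD (row t holds the value from t + 1):
+lag(x, k = -1)
+
+# zoo and base ts follow stats::lag(): positive k LEADS, so a lag needs k = -1
+lag(z, k = -1)
+```
+
+**Mnemonic:** `lag.xts()` deliberately breaks from the `stats::lag()` convention so that positive `k` means "value from `k` periods ago". `lag.zoo()` and `lag()` on a `ts` keep the base convention, where positive `k` shifts the series back in time and therefore acts as a lead. Check the class before trusting the sign.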
+
+### rollapply alignment: zoo defaults to "center", which uses future data
+
+```r
+# WRONG for real-time or backtesting contexts (zoo objects):
+rollapply(z, width = 20, FUN = mean)                    # zoo default: align = "center"
+
+# CORRECT: only uses past data, whatever the class:
+rollapply(x, width = 20, FUN = mean, align = "right")
+```
+
+Center alignment is valid for smoothing historical data for display, but it causes look-ahead bias in any model or backtest. The xts method defaults to `align = "right"`, while `zoo::rollapply()` defaults to `"center"`, so always set `align = "right"` explicitly rather than relying on the class.
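+
+A small sketch of the alignment difference on a plain zoo series (the `z5` name is illustrative; an integer index is assumed):
+
+```r
+library(zoo)
+
+z5 <- zoo(1:5)                                                    # index 1..5
+centered <- rollapply(z5, width = 3, FUN = mean)                  # zoo default: align = "center"
+trailing <- rollapply(z5, width = 3, FUN = mean, align = "right")
+
+index(centered)   # each window mean is stamped at the middle of its window
+index(trailing)   # the same means, stamped at the last observation of each window
+```
+
+The values are identical in both cases; only the timestamp they are attached to moves, which is exactly where look-ahead bias comes from.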
+
+### tsibble implicit gaps
+
+```r
+# Implicit NA dates are silently dropped unless you fill them
+tsbl |> has_gaps()           # TRUE/FALSE per key
+tsbl |> fill_gaps()          # inserts NA rows for missing periods
+tsbl |> fill_gaps(value = 0) # inserts 0 instead
+```
+
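+### ARIMA stepwise search is a heuristic, not exhaustive
+
+`ARIMA()` with `stepwise = TRUE` (default) runs a deterministic but greedy search over a subset of candidate orders, so it can settle on a model that is not the best one by AICc. Set `stepwise = FALSE, approximation = FALSE` for an exhaustive search (slower). The stochastic model in this list is `NNETAR()`, which starts from random weights; call `set.seed()` before fitting it.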
+
+```r
+model(ARIMA(value, stepwise = FALSE, approximation = FALSE))
+```
+
+### Forecast horizon `h` string parsing
+
+`fable` accepts human-readable strings:
+
+```r
+forecast(h = "2 years")    # resolved against tsibble index frequency
+forecast(h = 24)            # 24 periods (safer — unambiguous)
+```
+
+String form requires the tsibble index to be a `yearmonth`, `yearquarter`, or similar lubridate-aware class.
+
+---
+
+## Quick-Reference Cheatsheet
+
+```r
+# Object creation
+ts(x, start, frequency)
+as_tsibble(df, key, index)
+xts(matrix, order.by = dates)
+
+# Diagnostics
+gg_tsdisplay(value, plot_type = "partial")
+ACF() |> autoplot()
+PACF() |> autoplot()
+gg_season()
+features(value, list(unitroot_ndiffs, feat_stl))
+
+# Model + forecast
+model(ARIMA(y), ETS(y)) |> forecast(h = 24) |> autoplot()
+accuracy(fit)
+
+# STL
+model(STL(y)) |> components() |> autoplot()
+
+# xts
+apply.monthly(x, mean)
+rollapply(x, 20, mean, align = "right")
+lag(x, k = 1)       # true lag for xts (zoo / ts need k = -1)
+na.locf(x)
+x["2023-01/2023-06"]
+```

+ 456 - 0
skills/r-ops/references/visualization.md

@@ -0,0 +1,456 @@
+# ggplot2 — Data Visualization Reference
+
+ggplot2 implements the grammar of graphics: every plot is built by composing layers over a coordinate system. The payoff is a small vocabulary that covers most analysis plots without memorizing ad-hoc APIs.
+
+```r
+library(tidyverse)   # attaches ggplot2, dplyr, tidyr, forcats, etc.; scales is installed but not attached (use scales::)
+```
+
+---
+
+## The Layered Mental Model
+
+```
+ggplot(data, aes(...))   # canvas + default aesthetics
+  + geom_*()             # geometric layer (what to draw)
+  + stat_*()             # optional: transform data before drawing
+  + scale_*()            # override axis/colour/size mappings
+  + coord_*()            # coordinate system (flip, polar, fixed)
+  + facet_*()            # small multiples
+  + theme_*() / theme()  # non-data ink (fonts, grid, legend)
+```
+
+Every `+` adds a layer. Layers share the canvas-level `aes()` unless overridden locally. Build incrementally; assign the base to an object and add layers for variants.
+
+```r
+base <- ggplot(df, aes(x = weight, y = height))
+base + geom_point()
+base + geom_point(aes(colour = group)) + geom_smooth(method = "lm")
+```
+
+---
+
+## Aesthetics: Mapping vs. Setting
+
+**Mapping** — inside `aes()`, driven by data:
+
+```r
+geom_point(aes(colour = species, shape = species, size = mass))
+```
+
+**Setting** — outside `aes()`, constant:
+
+```r
+geom_point(colour = "steelblue", size = 2, alpha = 0.6)
+```
+
+The most common gotcha: `geom_point(aes(colour = "blue"))` maps the string literal `"blue"` to the colour scale — it does NOT produce blue points.
+
+### Common Aesthetics
+
+| Aesthetic | Types | Notes |
+|-----------|-------|-------|
+| `x`, `y` | all | positional |
+| `colour` / `color` | all | border/line/point colour |
+| `fill` | bars, areas, polygons | interior colour |
+| `shape` | point | 0–25; 21–25 have fill |
+| `size` | point, text | in mm |
+| `linewidth` | line, path, borders | replaces `size` for lines since ggplot2 3.4.0 |
+| `alpha` | all | 0 (transparent) – 1 (opaque) |
+| `linetype` | line | solid, dashed, dotted, etc. |
+| `group` | line, smooth | grouping without visual change |
+| `label` | text geoms | character string |
+
+---
+
+## Key Geoms
+
+### Points and Lines
+
+```r
+geom_point()               # scatterplot; add jitter via position_jitter()
+geom_jitter(width=0.2)     # convenience: jittered points
+geom_line()                # connect points in x order; needs group= for multiple series
+geom_path()                # connect in data order (trajectory plots)
+geom_smooth(method="lm")   # trend line; method: "lm", "loess", "gam"
+geom_smooth(se=FALSE)      # suppress confidence ribbon
+```
+
+### Distributions (one variable)
+
+```r
+geom_histogram(binwidth=5)        # choose binwidth, not bins
+geom_density(adjust=1)            # kernel density; adjust scales bandwidth
+geom_freqpoly(binwidth=5)         # histogram as lines; good for overlaying groups
+geom_boxplot()                    # five-number summary + outliers
+geom_violin()                     # density shape; more info than boxplot
+geom_dotplot(binaxis="y")         # individual points in bins
+```
+
+### Categorical / Counts
+
+```r
+geom_bar()                        # counts rows (stat="count" default)
+geom_col()                        # heights from data (stat="identity")
+geom_count()                      # bubble size = count; cat × cat grids
+```
+
+### Heatmaps / Tiles
+
+```r
+geom_tile(aes(fill=value))        # rectangular heatmap
+geom_raster(aes(fill=value))      # faster tile for regular grids
+```
+
+### Annotations / Text
+
+```r
+geom_text(aes(label=name))        # raw text; overlaps freely
+geom_label(aes(label=name))       # text with background box
+annotate("text", x=5, y=10, label="Peak")   # single annotation, no data frame needed
+annotate("rect", xmin=2, xmax=4, ymin=0, ymax=100, alpha=0.2)
+# For non-overlapping labels:
+library(ggrepel)
+geom_text_repel(aes(label=name))
+geom_label_repel(aes(label=name))
+```
+
+### Area / Ribbon
+
+```r
+geom_area()                       # filled area chart; stack with position_stack()
+geom_ribbon(aes(ymin=lo, ymax=hi))  # confidence band around a line
+```
+
+---
+
+## Which Geom?
+
+| Goal | Geom(s) |
+|------|---------|
+| Two continuous variables | `geom_point` + `geom_smooth` |
+| One continuous distribution | `geom_histogram` or `geom_density` |
+| Continuous by group | `geom_boxplot` or `geom_violin` |
+| Continuous over time | `geom_line` |
+| Count by category | `geom_bar` |
+| Pre-computed values | `geom_col` |
+| Two categorical, covariation | `geom_count` or `geom_tile` after `count()` |
+| Trend with uncertainty | `geom_smooth` (draws its own ribbon) or `geom_line` + `geom_ribbon` |
+| Labelled points | `geom_text_repel` (ggrepel) |
+| Many overlapping points | `geom_hex` or `geom_bin2d` |
+
+---
+
+## Position Adjustments
+
+```r
+geom_bar(position = "stack")    # default for bar: stack groups
+geom_bar(position = "fill")     # stack to 100% — shows proportions
+geom_bar(position = "dodge")    # side-by-side bars
+geom_point(position = position_jitter(width=0.1, height=0))
+geom_point(position = position_dodge(width=0.8))  # offset overlapping points by group
+```
+
+---
+
+## Stats
+
+Stats transform data before drawing. Most geoms have a paired stat; you can swap them.
+
+```r
+# Draw means ± SD without pre-summarising:
+geom_point(stat = "summary", fun = mean)
+stat_summary(fun = mean, fun.min = function(x) mean(x)-sd(x),
+             fun.max = function(x) mean(x)+sd(x),
+             geom = "pointrange")
+
+# Density from raw data:
+stat_density_2d(aes(fill = after_stat(level)), geom = "polygon")
+
+# after_stat() accesses computed variables:
+geom_histogram(aes(y = after_stat(density)))   # normalised histogram
+```
+
+---
+
+## Scales
+
+Scale functions follow `scale_<aesthetic>_<type>()`.
+
+### Axes
+
+```r
+scale_x_continuous(breaks = seq(0, 100, 25), labels = scales::label_comma())
+scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.05)))
+scale_x_log10()                          # log-transformed axis
+scale_x_date(date_breaks = "1 year", date_labels = "%Y")
+scale_x_discrete(limits = rev)           # reverse categorical axis
+```
+
+### Colour / Fill
+
+```r
+# Continuous:
+scale_colour_gradient(low="white", high="steelblue")
+scale_colour_gradient2(midpoint=0, low="blue", mid="white", high="red")
+scale_fill_viridis_c()          # perceptually uniform, colourblind-safe
+scale_fill_viridis_d()          # discrete version
+
+# Discrete:
+scale_colour_brewer(palette = "Set2")    # ColorBrewer palettes
+scale_colour_manual(values = c(A = "#E41A1C", B = "#377EB8"))
+
+# Ordinal:
+scale_colour_ordinal()          # for ordered factors
+```
+
+### Other Scales
+
+```r
+scale_size_continuous(range = c(1, 8))
+scale_alpha_continuous(range = c(0.2, 1))
+scale_shape_manual(values = c(16, 17, 15))
+```
+
+### Labels
+
+```r
+labs(
+  title    = "Main title",
+  subtitle = "Secondary line",
+  caption  = "Source: ...",
+  x        = "X axis label",
+  y        = "Y axis label",
+  colour   = "Legend title",   # match the aesthetic name
+  fill     = "Fill legend"
+)
+```
+
+---
+
+## Facets
+
+```r
+# Wrap a single variable into a grid:
+facet_wrap(~ species)
+facet_wrap(~ species, ncol = 2, scales = "free_y")
+
+# Two-way grid:
+facet_grid(rows ~ cols)
+facet_grid(cut ~ color, scales = "free")
+
+# Strip labels:
+facet_wrap(~ species, labeller = label_both)   # "species: Adelie" etc.
+```
+
+Faceting is usually cleaner than colour-coding when you have 3+ groups with overlap.
+
+---
+
+## Coordinate Systems
+
+```r
+coord_flip()                      # swap x and y; useful for long category names
+coord_fixed(ratio = 1)            # equal aspect ratio
+coord_cartesian(ylim = c(0, 50))  # zoom without dropping data (vs. scale limits which drop)
+coord_polar()                     # polar coords (pie charts, rose plots)
+coord_trans(y = "sqrt")           # transform after statistics
+```
+
+Use `coord_cartesian()` to zoom; use scale `limits` only when you want to exclude data from stats.
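+
+A minimal sketch of that difference, using a made-up four-row data frame and `layer_data()` to read back the computed boxplot median:
+
+```r
+library(ggplot2)
+
+d <- data.frame(g = "a", y = c(1, 2, 3, 100))          # hypothetical data with one outlier
+p <- ggplot(d, aes(g, y)) + geom_boxplot()
+
+zoomed  <- p + coord_cartesian(ylim = c(0, 10))        # view only; the stat sees all four values
+clipped <- p + scale_y_continuous(limits = c(0, 10))   # y = 100 is removed before the stat runs
+
+layer_data(zoomed)$middle    # median of 1, 2, 3, 100
+layer_data(clipped)$middle   # median of 1, 2, 3 (ggplot2 warns about the removed row)
+```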
+
+---
+
+## Themes
+
+```r
+# Built-in themes:
+theme_minimal()      # clean, white background, subtle grid
+theme_bw()           # white background, black border
+theme_classic()      # no grid lines — publication style
+theme_void()         # blank canvas; useful for maps
+theme_light()
+
+# Fine-tune anything:
+theme(
+  legend.position    = "bottom",         # "top","left","right","none"
+  legend.direction   = "horizontal",
+  axis.text.x        = element_text(angle = 45, hjust = 1),
+  axis.title         = element_text(size = 12, face = "bold"),
+  plot.title         = element_text(size = 14, face = "bold"),
+  panel.grid.minor   = element_blank(),
+  strip.background   = element_blank()   # cleaner facet labels
+)
+
+# Set a global default for a session:
+theme_set(theme_minimal(base_size = 12))
+```
+
+---
+
+## EDA Workflow: Question-Driven Exploration
+
+The EDA loop: plot → notice → refine question → plot again.
+
+**Step 1 — Understand each variable's distribution**
+
+```r
+# Continuous:
+ggplot(df, aes(x = price)) + geom_histogram(binwidth = 100)
+ggplot(df, aes(x = price)) + geom_density()
+
+# Categorical:
+df |> count(cut) |> ggplot(aes(x = fct_reorder(cut, n), y = n)) + geom_col()
+```
+
+**Step 2 — Examine covariation**
+
+```r
+# Continuous × Continuous:
+ggplot(df, aes(x = carat, y = price)) + geom_point(alpha = 0.1) + geom_smooth()
+
+# Continuous × Categorical — compare distributions:
+ggplot(df, aes(x = price, y = fct_reorder(cut, price, median))) + geom_boxplot()
+
+# Two categorical — count grid:
+df |> count(cut, color) |>
+  ggplot(aes(x = cut, y = color, fill = n)) +
+  geom_tile()
+```
+
+**Step 3 — Handle outliers and missing values**
+
+```r
+# Zoom without losing data from smooths:
+ggplot(df, aes(x, y)) + geom_point() + coord_cartesian(ylim = c(0, 500))
+
+# Suppress NA warnings when intentional:
+geom_point(na.rm = TRUE)
+
+# Distinguish NA from non-NA:
+df |> mutate(cancelled = is.na(dep_time)) |>
+  ggplot(aes(x = sched_dep_time, colour = cancelled)) +
+  geom_freqpoly(binwidth = 0.25)
+```
+
+---
+
+## Plot Composition with patchwork
+
+```r
+library(patchwork)
+
+p1 <- ggplot(df, aes(x, y)) + geom_point()
+p2 <- ggplot(df, aes(x)) + geom_histogram()
+p3 <- ggplot(df, aes(y = y)) + geom_boxplot()   # a bare aes(y) would map the column to x
+
+p1 + p2            # side by side
+p1 / p2            # stacked
+(p1 | p2) / p3    # 2 on top, 1 spanning bottom
+
+# Unified legend + shared title:
+(p1 + p2) +
+  plot_annotation(title = "Overview", tag_levels = "A") +
+  plot_layout(guides = "collect")
+```
+
+---
+
+## Saving Plots
+
+```r
+ggsave("output/plot.png", width = 8, height = 5, dpi = 300)
+ggsave("output/plot.pdf", width = 8, height = 5)   # vector output for print
+
+# Explicit plot argument:
+ggsave("plot.png", plot = p1, width = 6, height = 4, dpi = 150)
+```
+
+`ggsave` infers format from the extension. Use `.pdf`/`.svg` for publication; `.png` for web and presentations. Always set explicit `width`/`height` — the default proportions are rarely right.
+
+---
+
+## Gotchas
+
+### 1. Mapping vs. setting colour (the most common mistake)
+
+```r
+# WRONG — maps the string "blue" to colour scale, produces red/salmon:
+geom_point(aes(colour = "blue"))
+
+# RIGHT — sets all points to blue:
+geom_point(colour = "blue")
+```
+
+### 2. `group` aesthetic — when colour isn't set but lines need grouping
+
+```r
+# Multiple subjects measured over time: lines jump between subjects without group=
+ggplot(df, aes(x = time, y = value, group = subject)) + geom_line()
+
+# colour= implicitly sets group; explicit group= needed when colour isn't mapped:
+ggplot(df, aes(x = time, y = value)) +
+  geom_smooth(aes(group = cohort), se = FALSE)
+```
+
+### 3. Factor ordering controls bar/boxplot order
+
+```r
+# Alphabetical order is almost never the right order:
+df |> mutate(city = fct_reorder(city, sales, sum)) |>
+  ggplot(aes(x = city, y = sales)) + geom_col()
+
+# forcats helpers:
+fct_reorder(f, x)            # reorder by another variable
+fct_infreq(f)                # most frequent first
+fct_rev(f)                   # reverse current order
+fct_relevel(f, "Other", after=Inf)   # push "Other" to end
+```
+
+### 4. Scale limits vs. coord_cartesian — they are not equivalent
+
+```r
+# Drops data outside limits → changes smooths, counts, boxplot stats:
+scale_y_continuous(limits = c(0, 50))
+
+# Zooms view only, keeps all data in stats:
+coord_cartesian(ylim = c(0, 50))
+```
+
+### 5. `colour` (British) and `color` (American) are both accepted — but pick one per project.
+
+### 6. Local `data=` in a geom overrides global data — useful for annotation layers
+
+```r
+labels_df <- df |> filter(label_me)
+ggplot(df, aes(x, y)) +
+  geom_point() +
+  geom_text_repel(data = labels_df, aes(label = name))
+```
+
+### 7. `geom_bar` vs. `geom_col`
+
+- `geom_bar()` counts rows — `x` only, `y` is computed.
+- `geom_col()` uses pre-computed heights — both `x` and `y` required.
+
+### 8. Log scales drop zeros (`log10(0)` is `-Inf`): use `scale_x_log10()` only on positive data, or a `log1p` transform when zeros are present.
+
+---
+
+## Quick Reference: Useful Extension Packages
+
+| Package | Purpose |
+|---------|---------|
+| `ggrepel` | Non-overlapping text/label geoms |
+| `patchwork` | Compose multiple plots |
+| `scales` | Label formatters (`label_comma`, `label_percent`, `label_dollar`) |
+| `ggthemes` | Extra themes (including colourblind-safe palettes) |
+| `ggridges` | Ridge/joy plots (`geom_density_ridges`) |
+| `ggforce` | Advanced annotations, mark hulls, zoom |
+| `gghighlight` | Highlight subsets without pre-filtering |
+| `ggdist` | Distribution geoms for uncertainty viz |
+
+---
+
+## Base Graphics vs. ggplot2
+
+Base graphics (`plot()`, `hist()`, `barplot()`) are fine for throwaway exploration at the REPL — they need zero setup and print instantly. Use ggplot2 for anything that will be communicated, iterated on, or composed into a report. The grammar pays for itself the moment you want facets, consistent themes, or a second layer.

+ 545 - 0
skills/r-ops/references/workflow-tooling.md

@@ -0,0 +1,545 @@
+# Workflow Tooling — Project Hygiene, Reproducibility, Environment
+
+Operational reference for R project workflow: environment isolation,
+dependency management, reproducible reports, code style, pipelines,
+testing, and the base-R vs tidyverse decision.
+
+---
+
+## Project Structure
+
+### RStudio / Posit Projects (`.Rproj`)
+
+Create one project per analysis. Every `.Rproj` file sets the working
+directory to its own folder on open — no manual path wrangling needed.
+
+```r
+# Start fresh:  File > New Project > New/Existing Directory in RStudio
+# Or from R:
+usethis::create_project("my-analysis")
+```
+
+`here::here()` then resolves to this project root automatically.
+
+### Path discipline — never use `setwd()` with absolute paths
+
+Absolute paths break on any other machine, in CI, or after a folder rename.
+
+```r
+# BAD — ties code to one machine
+setwd("/Users/mack/projects/analysis")
+data <- read.csv("data/raw.csv")
+
+# GOOD — works everywhere the .Rproj exists
+library(here)
+data <- read.csv(here("data", "raw.csv"))
+```
+
+`here::here()` walks up the directory tree to find the project root
+(`.here`, `.Rproj`, `DESCRIPTION`, `.git`). Pass path components as
+separate strings; `here` handles the OS separator.
+
+### Restart R often
+
+Accumulated state in a live session hides bugs. In RStudio, restart R with
+**Ctrl+Shift+F10** (**Cmd+Shift+0** on macOS) between major steps. If your
+script doesn't run clean from a fresh session, it's broken.
+
+### Do not save or restore `.RData`
+
+The hidden `.RData` file silently reloads stale objects. Disable it globally:
+
+```r
+# Run once from the console (not from .Rprofile); it updates RStudio preferences
+usethis::use_blank_slate()            # scope = "user" by default
+
+# Equivalent manual toggle in the RStudio GUI:
+# Tools > Global Options > General > Workspace: untick "Restore .RData", set "Save workspace" to Never
+```
+
+Apply it at project scope too (`usethis::use_blank_slate("project")`) if
+sharing with collaborators who may have different global defaults.
+
+---
+
+## Dependency Management
+
+### `renv` — project-local library
+
+`renv` records exact package versions in a lockfile and restores them
+on any machine. The gold standard for reproducible R environments.
+
+```r
+# Initialise in a new or existing project
+renv::init()
+
+# After installing or updating packages, snapshot the lockfile
+renv::snapshot()
+
+# On a collaborator's machine or in CI
+renv::restore()
+
+# Check for out-of-sync state
+renv::status()
+```
+
+Key files committed to version control:
+
+- `renv.lock` — exact versions (JSON, human-readable)
+- `.Rprofile` — sources `renv/activate.R` automatically
+- `renv/activate.R` — bootstraps renv on clone
+
+`renv/library/` stays out of version control (large, platform-specific); `renv::init()` writes a `renv/.gitignore` that already excludes it.
+
+### `pak` — fast, reliable installs
+
+`pak` resolves dependencies in parallel and handles CRAN, GitHub,
+Bioconductor, and local packages uniformly. Use it inside `renv`
+workflows for faster installs.
+
+```r
+# Install pak (once)
+install.packages("pak")
+
+# Install from CRAN
+pak::pak("dplyr")
+
+# Install from GitHub (owner/repo)
+pak::pak("tidyverse/dplyr")
+
+# Install multiple at once
+pak::pak(c("dplyr", "ggplot2", "tidyr"))
+
+# Within renv: pak integrates transparently
+options(renv.config.pak.enabled = TRUE)  # in .Rprofile
+```
+
+`pak` is faster than `install.packages()` for initial setup; `renv`
+owns reproducibility; the two compose well.
+
+---
+
+## Reproducible Reports with Quarto
+
+Quarto (`.qmd`) is the current standard for reproducible documents.
+It supersedes R Markdown for new work while remaining compatible with
+the same `knitr`/`pandoc` backend. R Markdown (`.Rmd`) is still
+maintained and supported.
+
+### Document anatomy
+
+```yaml
+---
+title: "Analysis Title"
+author: "Your Name"
+date: today
+format: html          # or pdf, docx, revealjs, dashboard, …
+execute:
+  echo: true
+  warning: false
+---
+```
+
+Code chunks use `#|` YAML-style options:
+
+````r
+```{r}
+#| label: load-data
+#| message: false
+#| echo: false          # hide code, show output
+library(tidyverse)
+df <- read_csv(here::here("data", "raw.csv"))
+```
+````
+
+````r
+```{r}
+#| label: plot-dist
+#| fig-width: 8
+#| fig-height: 4
+#| fig-cap: "Distribution of values"
+ggplot(df, aes(x = value)) + geom_histogram()
+```
+````
+
+### Common chunk options
+
+| Option | Values | Effect |
+|---|---|---|
+| `echo` | `true`/`false`/`fenced` | Show source code |
+| `eval` | `true`/`false` | Run the chunk |
+| `include` | `true`/`false` | Include output (false suppresses everything) |
+| `message` | `true`/`false` | Show package messages |
+| `warning` | `true`/`false` | Show warnings |
+| `cache` | `true`/`false` | Cache results (invalidated on code change) |
+| `fig-width` / `fig-height` | numeric (inches) | Figure dimensions |
+| `label` | string (no spaces) | Chunk identifier (required for cross-refs) |
+
+Set document-wide defaults in the YAML `execute:` block; override
+per-chunk with `#|` options.
+
+### Output formats
+
+```bash
+# CLI render
+quarto render report.qmd
+quarto render report.qmd --to pdf
+quarto render report.qmd --to docx
+
+# From R
+quarto::quarto_render("report.qmd", output_format = "html")
+
+# Preview with live reload
+quarto preview report.qmd
+```
+
+| Format | YAML `format:` value | Notes |
+|---|---|---|
+| HTML (default) | `html` | Self-contained with `embed-resources: true` |
+| PDF | `pdf` | Requires LaTeX (`tinytex::install_tinytex()`) |
+| Word | `docx` | Use a reference doc (`reference-doc:`) for corporate styles |
+| Slides | `revealjs` | HTML slideshow |
+| Dashboard | `dashboard` | `shinylive` or `shiny` for interactivity |
+| Website | `html` pages, with `project: type: website` in `_quarto.yml` | Multi-page projects |
+
+### Project-level `_quarto.yml`
+
+```yaml
+project:
+  type: website
+  output-dir: _site
+
+website:
+  title: "My Analysis"
+  navbar:
+    left:
+      - href: index.qmd
+        text: Home
+      - analysis.qmd
+
+format:
+  html:
+    theme: cosmo
+    toc: true
+```
+
+---
+
+## Code Style
+
+Follow the [tidyverse style guide](https://style.tidyverse.org). Key rules:
+
+### Naming
+
+```r
+# snake_case for variables and functions
+daily_revenue <- df |> group_by(date) |> summarise(rev = sum(amount))
+compute_rate <- function(x, n) x / n
+
+# No camelCase, no dots (dots reserved for S3 methods)
+```
+
+### Assignment
+
+```r
+x <- 10          # use <-  for assignment
+mean(x = 10)     # = is fine for function arguments
+```
+
+### Spacing
+
+```r
+# Spaces around <- and binary operators (except ^ and :)
+z <- (a + b)^2 / d
+
+# Space after comma, not before
+mean(x, na.rm = TRUE)
+
+# No space before parenthesis in function calls
+mean(x)           # not  mean (x)
+```
+
+### Pipes
+
+```r
+# |> (native, R >= 4.1) — prefer over magrittr %>% for new code
+# space before pipe, pipe at end of line
+flights |>
+  filter(!is.na(arr_delay)) |>
+  group_by(carrier) |>
+  summarise(mean_delay = mean(arr_delay))
+
+# Keep pipelines vertical when > 2 steps
+# Break function args onto new lines when > ~80 chars
+flights |>
+  mutate(
+    speed    = distance / air_time * 60,
+    dep_hour = dep_time %/% 100
+  )
+```
+
+### Tooling
+
+```r
+# Auto-format a file or selection
+styler::style_file("analysis.R")
+styler::style_dir("R/")      # whole directory
+
+# Lint for style + common bugs
+lintr::lint("analysis.R")
+lintr::lint_dir("R/")
+
+# RStudio: Cmd/Ctrl+Shift+P → "styler" for palette shortcuts
+```
+
+Both tools are CI-friendly:
+
+```bash
+# In CI (GitHub Actions etc.): print lints and exit non-zero when any are found
+Rscript -e "lints <- lintr::lint_dir('R/'); print(lints); quit(status = as.integer(length(lints) > 0))"
+```
+
+---
+
+## Pipelines at Scale — `targets`
+
+For analyses where intermediate steps are slow, `targets` gives you
+Make-like dependency tracking in R: reruns only what changed.
+
+```r
+# _targets.R (project root)
+library(targets)
+
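+tar_option_set(packages = c("tidyverse", "here"))
+tar_source()   # load clean(), fit_model(), render_report() from R/ (your own functions)
+
+list(
+  # format = "file" tracks the file's contents, so editing raw.csv reruns downstream targets
+  tar_target(raw_file,   here("data", "raw.csv"), format = "file"),
+  tar_target(raw_data,   read_csv(raw_file)),
+  tar_target(clean_data, clean(raw_data)),
+  tar_target(model,      fit_model(clean_data)),
+  tar_target(report,     render_report(model),   # must return the output file path
+             format = "file")
+)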
+```
+
+```r
+# Run the pipeline
+targets::tar_make()
+
+# Visualise dependency graph
+targets::tar_visnetwork()
+
+# Check what's out of date
+targets::tar_outdated()
+```
+
+`targets` integrates with `renv` and Quarto. Reach for it when
+`source("analysis.R")` takes minutes and reruns waste your time.
+
+---
+
+## Testing
+
+### `testthat` (3rd edition)
+
+```r
+# Scaffold a package or analysis project test suite
+usethis::use_testthat()
+
+# tests/testthat/test-clean.R
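+test_that("remove_outliers drops values beyond 3 SD", {
+  # remove_outliers() is your own function under test (z-score rule assumed).
+  # Use enough points: with only 4 values none can sit more than 1.5 SD from the mean.
+  x <- c(rep(10, 20), 1000)
+  result <- remove_outliers(x, sd_threshold = 3)
+  expect_length(result, 20)
+  expect_false(1000 %in% result)
+})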
+
+# Run all tests
+devtools::test()   # inside a package
+testthat::test_dir("tests/testthat/")  # standalone
+```
+
+Use `expect_snapshot()` for output that's hard to specify precisely
+(regression tests on printed output); for ggplot objects, use
+`vdiffr::expect_doppelganger()`.
+
+### `usethis` scaffolding
+
+```r
+usethis::create_project("my-pkg")   # analysis project
+usethis::create_package("mypkg")    # R package
+usethis::use_r("helpers")           # R/helpers.R
+usethis::use_test("helpers")        # tests/testthat/test-helpers.R
+usethis::use_github_action("check-standard")  # R-CMD-check CI; "lint" adds lintr
+renv::init()                        # add renv to an existing project (usethis has no renv helper)
+```
+
+---
+
+## Getting Help
+
+### `reprex` — reproducible examples
+
+Before posting a question, produce a minimal reproducible example:
+
+```r
+# Copy failing code to clipboard, then:
+reprex::reprex()        # formats for GitHub/Stack Overflow
+reprex::reprex(venue = "so")   # Stack Overflow formatting
+reprex::reprex(venue = "slack") # Slack-friendly
+```
+
+`reprex()` runs your code in a clean session, captures output/errors,
+and copies markdown to your clipboard. If it fails inside `reprex`,
+your example is not self-contained — fix that first.
+
+Include minimal data:
+
+```r
+# Inline small data with dput()
+dput(head(my_df, 10))
+# Paste the output into your reprex as  my_df <- <pasted output>
+
+# Or use built-in data
+reprex::reprex({
+  library(dplyr)
+  mtcars |> filter(cyl == 4) |> summarise(mpg = mean(mpg))
+})
+```
+
+### Where to ask
+
+| Channel | Best for |
+|---|---|
+| [Posit Community](https://forum.posit.co) | Tidyverse, RStudio, Shiny, Quarto |
+| Stack Overflow `[r]` | General R questions with reprex |
+| GitHub Issues (package repo) | Confirmed bugs, feature requests |
+| `#rstats` on Mastodon/Twitter | Community discussion |
+
+### Reading docs efficiently
+
+```r
+?dplyr::mutate              # function docs
+vignette("dplyr")           # package vignettes
+browseVignettes("ggplot2")  # all vignettes in browser
+# pkgdown sites: https://dplyr.tidyverse.org
+```
+
+---
+
+## Base R vs Tidyverse — Decision Table
+
+Both are valid. The native pipe `|>` works in either world with no
+dependencies (R >= 4.1).
+
+| Situation | Reach for | Why |
+|---|---|---|
+| Interactive analysis, EDA | **tidyverse** | Readable pipelines, consistent API across dplyr/tidyr/ggplot2 |
+| Team projects, code review | **tidyverse** | Shared vocabulary lowers onboarding cost |
+| Package development (public) | **base R** or selective imports | Minimise user-facing `Imports`; CRAN policy discourages heavy dep trees |
+| Minimal-dep scripts / system tools | **base R** | No install requirements beyond R itself |
+| Very large in-memory data (memory pressure) | **data.table** | Often several times faster than dplyr on multi-GB data (benchmark-dependent); updates by reference (`:=`) avoid copies |
+| Performance-critical inner loops | **base R** / **data.table** | Avoid tidyverse overhead in tight iteration |
+| Subsetting / indexing gymnastics | **base R** `[` `[[` `$` | More expressive for non-rectangular access patterns |
+| Apply-family parallelism | **base R** `lapply` / `parallel` | No extra dependency; composes with `future` |
+| Everything else | **Your preference** | Mix freely — tidyverse and base R interoperate |
+
+**Native pipe `|>` notes:**
+- No `magrittr` dependency required
+- Placeholder `_` (R >= 4.2): `x |> lm(y ~ ., data = _)`
+- Does not support magrittr's `.` placeholder; `_` must be passed to a named argument and can appear only once
+- Slightly faster than `%>%` in microbenchmarks (negligible in practice)
+
+---
+
+## Gotchas
+
+### Absolute paths break portability
+
+```r
+# This crashes on every other machine
+read_csv("/Users/mack/Desktop/data.csv")
+
+# Use here::here() relative to project root
+read_csv(here::here("data", "raw.csv"))
+```
+
+### `.RData` persistence corrupts reproducibility
+
+If `save.image()` or `.RData` auto-restore is on, stale objects
+accumulate. Scripts that "work" in your session may fail for anyone
+else. Disable at project and global level (see "Do not save or restore
+`.RData`" above).
+
+### `library()` calls inside packages (vs scripts)
+
+In scripts / analysis: `library(pkg)` at the top is correct.
+In package code (`R/*.R`): NEVER call `library()` or `require()`.
+List the dependency under `Imports:` in `DESCRIPTION`, then call
+`pkg::function()` (recommended) or import it into the namespace with
+roxygen2's `@importFrom pkg function` and call it unqualified. `library()`
+in package code modifies the user's search path silently.
+
+```r
+# Package code — correct
+clean <- function(df) {
+  df |> dplyr::filter(!is.na(value))
+}
+
+# Package code — wrong (affects the user's session)
+library(dplyr)
+clean <- function(df) df |> filter(!is.na(value))
+```
+
+### `renv` + `pak` interaction
+
+`pak` is opt-in as renv's installer and can be enabled at any point. Set
+`options(renv.config.pak.enabled = TRUE)` in the project `.Rprofile`, or
+add `RENV_CONFIG_PAK_ENABLED=TRUE` to the project `.Renviron`. It is a
+user-level configuration option, not a `renv/settings.json` project setting.
+
+### Quarto caching stale results
+
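+`#| cache: true` invalidates on chunk code changes, but not when upstream
+data changes. If your source data changes, bust the cache manually:
+
+```r
+targets::tar_invalidate("affected_target")  # if the data comes from a targets pipeline
+# Otherwise delete the document's knitr cache directory (e.g. report_cache/),
+# or re-render with: quarto render report.qmd --cache-refresh
+```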
+
+Use `cache: false` (the default) unless render time is genuinely painful.
+
+### `here::here()` root detection order
+
+`here` finds the root via (in priority order): `.here` file,
+`.Rproj`, `DESCRIPTION`, `.git`, `.svn`. If your project has
+nested git repos or unusual layouts, place an explicit `.here` file
+at the true root with `here::set_here()`.
+
+---
+
+## Quick Reference — Key Packages
+
+| Package | Install | Purpose |
+|---|---|---|
+| `here` | CRAN | Portable paths from project root |
+| `renv` | CRAN | Project-local library + lockfile |
+| `pak` | CRAN | Fast, unified package installer |
+| `quarto` | CRAN (R pkg) + [quarto.org](https://quarto.org) CLI | Render `.qmd` from R |
+| `styler` | CRAN | Auto-format R code (tidyverse style) |
+| `lintr` | CRAN | Static analysis / linting |
+| `targets` | CRAN | Make-like reproducible pipelines |
+| `testthat` | CRAN | Unit testing (3rd edition) |
+| `usethis` | CRAN | Project / package scaffolding |
+| `reprex` | CRAN | Minimal reproducible examples |
+| `devtools` | CRAN | Package development workflow |
+
+```r
+# Install the whole workflow toolkit at once
+pak::pak(c(
+  "here", "renv", "pak", "quarto",
+  "styler", "lintr", "targets",
+  "testthat", "usethis", "reprex", "devtools"
+))
+```

+ 226 - 0
skills/r-ops/scripts/check-r-facts.py

@@ -0,0 +1,226 @@
+#!/usr/bin/env python3
+"""Staleness verifier for r-ops: the recommended R stack must stay real and cited.
+
+r-ops recommends ~30 CRAN packages and claims to reflect a current R ecosystem.
+That is exactly the fact that drifts silently (SKILL-RESOURCE-PROTOCOL.md §7): a
+package gets archived/removed upstream, or the prose stops mentioning a package
+the catalog still lists, and nobody notices for months. Two modes guard it:
+
+  --offline (default, safe for PR CI): structural consistency, no network.
+    * assets/r-packages.json parses and every entry has name + role
+    * every catalogued package is still named somewhere in the skill prose
+      (SKILL.md / references/*.md) — the catalog can't drift from the docs
+    * SKILL.md still carries a dated "ecosystem as of <year>" currency note
+  --live (scheduled freshness.yml, never a PR gate): does each package still
+    resolve on CRAN via crandb.r-pkg.org? A 404 = the package is gone (drift).
+
+Usage:   check-r-facts.py [--offline | --live] [--catalog FILE] [--skill DIR] [--json] [--timeout S]
+Input:   argv flags only (no stdin).
+Output:  stdout = findings (plain rows, or a --json envelope). Data only.
+Stderr:  the verdict line, notices, errors.
+Exit:    0 ok, 2 usage, 3 catalog/skill missing, 4 catalog unparseable,
+         7 CRAN unreachable (live, advisory — never a real failure),
+         10 drift found (offline: uncited/undocumented; live: package gone)
+
+Examples:
+  check-r-facts.py --offline                 # PR CI: catalog ⇆ prose consistency
+  check-r-facts.py --live                     # weekly: every package still on CRAN
+  check-r-facts.py --offline --json | jq '.data[]'
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import re
+import sys
+import urllib.error
+import urllib.parse
+import urllib.request
+from pathlib import Path
+
+EX_OK = 0
+EX_USAGE = 2
+EX_NOTFOUND = 3
+EX_UNPARSEABLE = 4
+EX_UNAVAILABLE = 7
+EX_DRIFT = 10
+
+HERE = Path(__file__).resolve().parent
+DEFAULT_CATALOG = HERE.parent / "assets" / "r-packages.json"
+DEFAULT_SKILL = HERE.parent
+FALLBACK_REGISTRY = "https://crandb.r-pkg.org/"
+CURRENCY_RE = re.compile(r"ecosystem as of\s+(\d{4})")
+
+
+class Term:
+    """Minimal ANSI helper (term.sh is bash-only; per TERMINAL-DESIGN.md §9 the
+    Python port is inline). Honors FORCE_COLOR / NO_COLOR / TERM_ASCII and the
+    bound stream's TTY + encoding so piped data stays plain ASCII."""
+
+    _C = {"green": "\033[32m", "red": "\033[31m", "cyan": "\033[36m",
+          "dim": "\033[2m", "off": "\033[0m"}
+
+    def __init__(self, stream=sys.stderr):
+        enc = (getattr(stream, "encoding", "") or "").lower()
+        self.ascii = os.environ.get("TERM_ASCII") == "1" or "utf" not in enc
+        if os.environ.get("FORCE_COLOR"):
+            self.color = True
+        elif (os.environ.get("NO_COLOR") is not None
+              or os.environ.get("TERM") == "dumb"
+              or not getattr(stream, "isatty", lambda: False)()):
+            self.color = False
+        else:
+            self.color = True
+
+    def c(self, name, text):
+        return f"{self._C.get(name,'')}{text}{self._C['off']}" if self.color else text
+
+    def mark(self, ok):
+        g = ("+" if self.ascii else "✓") if ok else ("x" if self.ascii else "✗")
+        return self.c("green" if ok else "red", g)
+
+
+def load_catalog(path: Path) -> tuple[list[dict], str]:
+    """Returns (packages, registry). Each package has name + role."""
+    if not path.is_file():
+        print(f"error: package catalog not found: {path}", file=sys.stderr)
+        raise SystemExit(EX_NOTFOUND)
+    try:
+        data = json.loads(path.read_text(encoding="utf-8"))
+        pkgs = data["packages"]
+        if not isinstance(pkgs, list) or not pkgs:
+            raise ValueError("'packages' must be a non-empty array")
+        for p in pkgs:
+            if not isinstance(p, dict) or "name" not in p or "role" not in p:
+                raise ValueError(f"package entry missing name/role: {p!r}")
+        registry = data.get("registry") or FALLBACK_REGISTRY
+        return pkgs, registry
+    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
+        print(f"error: could not parse catalog {path}: {exc}", file=sys.stderr)
+        raise SystemExit(EX_UNPARSEABLE)
+
+
+def read_corpus(skill_dir: Path) -> tuple[str, str]:
+    """Returns (skill_md_text, all_prose_text) across SKILL.md + references/*.md."""
+    doc = skill_dir / "SKILL.md"
+    if not doc.is_file():
+        print(f"error: SKILL.md not found under {skill_dir}", file=sys.stderr)
+        raise SystemExit(EX_NOTFOUND)
+    skill_md = doc.read_text(encoding="utf-8")
+    parts = [skill_md]
+    for ref in sorted((skill_dir / "references").glob("*.md")):
+        parts.append(ref.read_text(encoding="utf-8"))
+    return skill_md, "\n".join(parts)
+
+
+def check_offline(pkgs: list[dict], skill_dir: Path) -> list[dict]:
+    skill_md, corpus = read_corpus(skill_dir)
+    findings: list[dict] = []
+    for p in pkgs:
+        name = p["name"]
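+        # case-sensitive whole-name match: CRAN names are case-sensitive (DBI != dbi),
+        # and a bare substring test would let "pak" pass on the word "package"
+        if not re.search(rf"(?<![A-Za-z0-9.]){re.escape(name)}(?![A-Za-z0-9])", corpus):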
+            findings.append({"package": name, "issue": "catalogued but not named in skill prose"})
+    if not CURRENCY_RE.search(skill_md):
+        findings.append({"package": "(SKILL.md)", "issue": "no dated 'ecosystem as of <year>' currency note"})
+    return findings
+
+
+def cran_status(registry: str, name: str, timeout: float) -> tuple[str, object]:
+    url = registry.rstrip("/") + "/" + urllib.parse.quote(name)
+    req = urllib.request.Request(url, method="GET",
+                                 headers={"User-Agent": "claude-mods-r-ops-check/1"})
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            return ("ok", resp.status)
+    except urllib.error.HTTPError as exc:
+        if exc.code == 404:
+            return ("gone", 404)
+        return ("unreachable", exc.code)  # 5xx etc: transient, not a content finding
+    except (urllib.error.URLError, TimeoutError, OSError) as exc:
+        return ("unreachable", str(getattr(exc, "reason", exc)))
+
+
+def check_live(pkgs: list[dict], registry: str, timeout: float) -> tuple[list[dict], list[dict]]:
+    drift: list[dict] = []
+    unreachable: list[dict] = []
+    for p in pkgs:
+        name = p["name"]
+        status, info = cran_status(registry, name, timeout)
+        if status == "gone":
+            drift.append({"package": name, "issue": "no longer resolves on CRAN (404)"})
+        elif status != "ok":
+            unreachable.append({"package": name, "issue": f"unreachable: {info}"})
+    return drift, unreachable
+
+
+def main(argv: list[str]) -> int:
+    p = argparse.ArgumentParser(
+        prog="check-r-facts.py",
+        description="Verify r-ops' recommended R stack stays cited (offline) and live on CRAN (live).",
+    )
+    mode = p.add_mutually_exclusive_group()
+    mode.add_argument("--offline", action="store_true", help="structural consistency, no network (default)")
+    mode.add_argument("--live", action="store_true", help="check every package still resolves on CRAN")
+    p.add_argument("--catalog", default=str(DEFAULT_CATALOG), help="package catalog JSON")
+    p.add_argument("--skill", default=str(DEFAULT_SKILL), help="skill directory (SKILL.md + references/)")
+    p.add_argument("--timeout", type=float, default=10.0, help="per-request timeout seconds (live)")
+    p.add_argument("--json", action="store_true", help="emit a JSON envelope")
+    try:
+        args = p.parse_args(argv)
+    except SystemExit as exc:
+        return EX_USAGE if exc.code not in (0, None) else (exc.code or EX_OK)
+
+    pkgs, registry = load_catalog(Path(args.catalog))
+    live = args.live and not args.offline
+    t = Term(sys.stderr)
+
+    if live:
+        drift, unreachable = check_live(pkgs, registry, args.timeout)
+        findings = drift + unreachable
+        if args.json:
+            print(json.dumps({
+                "data": findings,
+                "meta": {"mode": "live", "packages_checked": len(pkgs),
+                         "drift": len(drift), "unreachable": len(unreachable),
+                         "registry": registry, "schema": "claude-mods.r-ops.r-facts/v1"},
+            }, indent=2))
+        else:
+            for f in findings:
+                print(f"{'DRIFT' if f in drift else 'UNREACH'}  {f['package']}: {f['issue']}")
+        # §7: confirmed drift -> 10; else a transient/unreachable -> 7 (advisory); else 0.
+        if drift:
+            print(f"{t.mark(False)} r-facts/live: {len(drift)} package(s) gone from CRAN "
+                  f"{t.c('dim', '(' + registry + ')')}", file=sys.stderr)
+            return EX_DRIFT
+        if unreachable:
+            print(f"{t.mark(False)} r-facts/live: CRAN unreachable for "
+                  f"{len(unreachable)}/{len(pkgs)} {t.c('dim', '(advisory - retry next run)')}",
+                  file=sys.stderr)
+            return EX_UNAVAILABLE
+        print(f"{t.mark(True)} r-facts/live: all {len(pkgs)} package(s) resolve on CRAN",
+              file=sys.stderr)
+        return EX_OK
+
+    # offline (default)
+    findings = check_offline(pkgs, Path(args.skill))
+    if args.json:
+        print(json.dumps({
+            "data": findings,
+            "meta": {"mode": "offline", "packages_checked": len(pkgs),
+                     "drift": len(findings), "consistent": not findings,
+                     "schema": "claude-mods.r-ops.r-facts/v1"},
+        }, indent=2))
+    else:
+        for f in findings:
+            print(f"DRIFT  {f['package']}: {f['issue']}")
+    ok = not findings
+    noun = "inconsistency" if len(findings) == 1 else "inconsistencies"
+    print(f"{t.mark(ok)} r-facts/offline: {len(pkgs)} package(s) checked, "
+          f"{len(findings)} {noun} "
+          f"{t.c('dim', '(catalog vs skill prose)')}", file=sys.stderr)
+    return EX_DRIFT if findings else EX_OK
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))

+ 129 - 0
skills/r-ops/tests/run.sh

@@ -0,0 +1,129 @@
+#!/usr/bin/env bash
+# Self-test for the r-ops skill (knowledge skill; the only script exercised is check-r-facts.py).
+#
+# Offline-deterministic (no network, no R install required). Asserts structural
+# integrity (frontmatter, references present + linked) and — the load-bearing
+# check — a CONTENT-CURRENCY GUARD: the skill claims to lead with modern R
+# idioms and to have shed the superseded ones, so this suite fails if a future
+# edit reintroduces a deprecated tidyverse idiom as a recommendation or strips
+# the modern stance. That turns the "reflects current R" promise into something
+# CI enforces. Resolves paths relative to itself so it works in the repo and
+# once installed to ~/.claude/skills/r-ops/.
+#
+# Usage:   bash tests/run.sh
+# Exit:    0 all pass, 1 one or more failures
+
+set -uo pipefail
+
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SKILL="$(dirname "$HERE")"
+DOC="$SKILL/SKILL.md"
+REF="$SKILL/references"
+
+PASS=0; FAIL=0
+ok() { PASS=$((PASS+1)); printf '  PASS  %s\n' "$1"; }
+no() { FAIL=$((FAIL+1)); printf '  FAIL  %s\n' "$1"; }
+has() { case "$2" in *"$1"*) ok "$3";; *) no "$3 (missing '$1')";; esac; }
+
+echo "=== r-ops self-test ==="
+
+# ── SKILL.md frontmatter ───────────────────────────────────────────────────
+echo "-- frontmatter --"
+[[ -f "$DOC" ]] && ok "SKILL.md present" || { no "SKILL.md missing"; echo "=== $PASS passed, $FAIL failed ==="; exit 1; }
+fm="$(sed -n '1,/^---$/{/^---$/!p;}' "$DOC" 2>/dev/null)"
+# first line must be the opening fence
+[[ "$(sed -n '1p' "$DOC")" == "---" ]] && ok "frontmatter fence opens at line 1" || no "no opening frontmatter fence"
+# assert against the frontmatter block only, so a match in the body cannot satisfy these
+has 'name: r-ops'   "$fm" "frontmatter declares name: r-ops"
+has 'description:'  "$fm" "frontmatter has description"
+has 'when_to_use:'  "$fm" "frontmatter has when_to_use (current bar)"
+has 'license: MIT'  "$fm" "frontmatter declares license"
+
+# ── references: the 9 documented files exist and are substantial ───────────
+echo "-- references present --"
+EXPECT=(tidyverse-core import-io strings-dates-factors visualization \
+        iteration-functional modeling-stats data-table time-series workflow-tooling)
+for r in "${EXPECT[@]}"; do
+  f="$REF/$r.md"
+  if [[ -f "$f" ]]; then
+    bytes=$(wc -c < "$f")
+    if [[ "$bytes" -ge 4000 ]]; then ok "$r.md present and substantial (${bytes}b)"
+    else no "$r.md too small (${bytes}b < 4000)"; fi
+  else
+    no "$r.md missing"
+  fi
+done
+
+# ── every references/ link in SKILL.md resolves (no ghost links) ───────────
+echo "-- internal links resolve --"
+linked=0; broken=0
+while IFS= read -r rel; do
+  [[ -z "$rel" ]] && continue
+  linked=$((linked+1))
+  [[ -f "$SKILL/$rel" ]] || { no "SKILL.md links missing file: $rel"; broken=$((broken+1)); }
+done < <(grep -oE '\]\(references/[^)#]+\)' "$DOC" | sed -E 's/^\]\(//; s/\)$//' | sort -u)
+[[ "$linked" -gt 0 ]] && ok "SKILL.md links its references ($linked unique)" || no "SKILL.md links no references"
+[[ "$broken" -eq 0 ]] && ok "all referenced files resolve" || no "$broken reference link(s) broken"
+
+# ── content-currency guard: modern idioms present ─────────────────────────
+echo "-- modern idioms present --"
+# Tokens the skill must keep recommending. Grepping the whole tree means a
+# reference rewrite that drops the modern stance is caught too.
+present_idiom() { # token label
+  if grep -rqF "$1" "$SKILL" --include='*.md'; then ok "$2"; else no "$2 (idiom '$1' vanished)"; fi
+}
+present_idiom '|>'          "native pipe |> taught"
+present_idiom '.by'         "per-op grouping .by= taught"
+present_idiom 'across('     "across() taught"
+present_idiom 'pivot_longer' "pivot_longer/pivot_wider taught"
+present_idiom 'list_rbind'  "list_rbind (map_dfr replacement) taught"
+present_idiom 'tidymodels'  "tidymodels covered"
+
+# ── content-currency guard: deprecated idioms not recommended ──────────────
+echo "-- deprecated idioms absent --"
+# These superseded calls are currently absent everywhere (verified at land
+# time). The fixed-string grep fails on any occurrence in any .md file, whether
+# recommended or merely mentioned. (map_dfr is intentionally NOT in this set:
+# the skill discusses it by name to mark it superseded.)
+DEPRECATED=('gather(' 'spread(' 'funs(' 'aes_string(' 'mutate_at(' 'mutate_if(' \
+            'summarise_at(' 'summarize_at(' 'sample_n(' 'top_n(' 'data_frame(')
+for d in "${DEPRECATED[@]}"; do
+  if grep -rqF "$d" "$SKILL" --include='*.md'; then
+    no "deprecated idiom present: $d"
+  else
+    ok "no '$d'"
+  fi
+done
+
+# ── staleness verifier: offline contract (SKILL-RESOURCE-PROTOCOL §7) ───────
+echo "-- check-r-facts.py (offline) --"
+VERIFIER="$SKILL/scripts/check-r-facts.py"
+CATALOG="$SKILL/assets/r-packages.json"
+ec() { local want="$1" lbl="$2"; shift 2; "$@" >/dev/null 2>&1; local got=$?
+       [[ "$got" == "$want" ]] && ok "$lbl (exit $got)" || no "$lbl (want $want got $got)"; }
+# Pick a python that actually executes — skips the Windows Store python3 stub.
+PY=""
+for c in python python3 py; do
+  if command -v "$c" >/dev/null 2>&1 && "$c" -c "" >/dev/null 2>&1; then PY="$c"; break; fi
+done
+[[ -f "$VERIFIER" ]] && ok "verifier present" || no "verifier missing"
+[[ -f "$CATALOG"  ]] && ok "package catalog present" || no "catalog missing"
+if [[ -n "$PY" ]]; then
+  TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
+  ec 0 "py_compile"            "$PY" -m py_compile "$VERIFIER"
+  ec 0 "--help"                "$PY" "$VERIFIER" --help
+  ec 0 "--offline consistent"  "$PY" "$VERIFIER" --offline
+  jout="$("$PY" "$VERIFIER" --offline --json 2>/dev/null)"
+  has 'claude-mods.r-ops.r-facts/v1' "$jout" "--json envelope schema"
+  ec 3 "missing catalog -> 3"  "$PY" "$VERIFIER" --offline --catalog "$TMP/nope.json"
+  printf '{"packages":"x"}' > "$TMP/bad.json"
+  ec 4 "malformed catalog -> 4" "$PY" "$VERIFIER" --offline --catalog "$TMP/bad.json"
+  printf '{"packages":[{"name":"zzznotreal","role":"x"}]}' > "$TMP/drift.json"
+  ec 10 "uncited package -> 10" "$PY" "$VERIFIER" --offline --catalog "$TMP/drift.json"
+else
+  no "no working python to exercise the verifier"
+fi
+
+# ── summary ────────────────────────────────────────────────────────────────
+echo "=== $PASS passed, $FAIL failed ==="
+[[ "$FAIL" -eq 0 ]] || exit 1

+ 9 - 2
tests/check-resources.sh

@@ -70,12 +70,17 @@ LOOP_EX="skills/loop-ops/assets/examples/pr-watch/loop.config.yaml"
 run "example audits clean"          0 bash skills/loop-ops/scripts/loop-check.sh "$LOOP_EX"
 run "example doctors clean (offline)" 0 bash skills/loop-ops/scripts/loop-doctor.sh --offline "$LOOP_EX"
 
+echo "== r-ops: R-stack staleness verifier"
+run "r-facts --offline consistent" 0 "$PY" skills/r-ops/scripts/check-r-facts.py --offline
+run "r-facts --help"               0 "$PY" skills/r-ops/scripts/check-r-facts.py --help
+
 echo "== protocol: every new verifier is executable + compiles"
 for s in skills/claude-api-ops/scripts/check-model-table.py \
          skills/claude-code-ops/scripts/validate-hooks-json.py \
          skills/playwright-ops/scripts/triage-flakes.py \
          skills/mapbox-ops/scripts/check-mapbox-facts.py \
-         skills/loop-ops/scripts/check-pricing-sync.py; do
+         skills/loop-ops/scripts/check-pricing-sync.py \
+         skills/r-ops/scripts/check-r-facts.py; do
     "$PY" -m py_compile "$s" 2>/dev/null && pass "py_compile $(basename "$s")" || bad "py_compile $(basename "$s")"
 done
 bash -n skills/terraform-ops/scripts/check-action-refs.sh 2>/dev/null \
@@ -106,6 +111,7 @@ purity "flake-triage" "$PY" skills/playwright-ops/scripts/triage-flakes.py "$__t
 rm -f "$__tf"
 purity "fleet-doctor"  bash skills/fleet-worker/scripts/fleet-doctor.sh --offline
 purity "pricing-sync"  "$PY" skills/loop-ops/scripts/check-pricing-sync.py --offline
+purity "r-facts"       "$PY" skills/r-ops/scripts/check-r-facts.py --offline
 grep -q '_lib/term.sh' skills/terraform-ops/scripts/check-action-refs.sh \
     && pass "check-action-refs sources term.sh" || bad "check-action-refs missing term.sh"
 grep -q '_lib/term.sh' skills/fleet-worker/scripts/fleet-doctor.sh \
@@ -113,7 +119,8 @@ grep -q '_lib/term.sh' skills/fleet-worker/scripts/fleet-doctor.sh \
 for s in skills/claude-api-ops/scripts/check-model-table.py \
          skills/claude-code-ops/scripts/validate-hooks-json.py \
          skills/playwright-ops/scripts/triage-flakes.py \
-         skills/loop-ops/scripts/check-pricing-sync.py; do
+         skills/loop-ops/scripts/check-pricing-sync.py \
+         skills/r-ops/scripts/check-r-facts.py; do
     grep -q 'class Term' "$s" && pass "$(basename "$s") carries inline Term" \
         || bad "$(basename "$s") missing inline Term"
 done