JSONL Patterns Reference

Comprehensive patterns for working with JSONL (JSON Lines) files -- one JSON object per line, the dominant format for structured logs, agent conversation records, and streaming data.

JSONL Basics

Format Rules

One valid JSON object per line
No trailing commas between lines
No wrapping array or outer object
Each line is independently parseable
Newlines within string values must be escaped as \n

Streaming vs Slurp

# STREAMING (default): processes one line at a time, constant memory
jq -c 'select(.level == "error")' app.jsonl

# SLURP (-s): loads ALL lines into a single array, requires memory for entire file
jq -sc 'group_by(.level)' app.jsonl

# Rule of thumb:
#   File < 100MB  --> slurp is fine
#   File 100MB-1GB --> slurp with caution, prefer streaming + sort/uniq
#   File > 1GB    --> never slurp, use streaming or split+parallel

Key jq Flags for JSONL

Flag	Purpose	Example
`-c`	Compact output (one line per object)	`jq -c '.' file.jsonl`
`-r`	Raw string output (no quotes)	`jq -r '.message' file.jsonl`
`-s`	Slurp all lines into array	`jq -s 'length' file.jsonl`
`-e`	Exit with error if output is false/null	`jq -e '.status == 200' line.json`
`-R`	Read each line as raw string	`jq -R 'fromjson? // empty' messy.jsonl`
`--stream`	SAX-style path/value pairs	`jq --stream '.' huge.json`
`--slurpfile`	Load a file as variable	`jq --slurpfile ids ids.json 'select(.id
`--arg`	Pass string variable	`jq --arg name "foo" 'select(.name == $name)' data.jsonl`
`--argjson`	Pass JSON variable	`jq --argjson min 100 'select(.count > $min)' data.jsonl`
`--unbuffered`	Flush output after each line	`tail -f app.jsonl \\| jq --unbuffered -r '.message'`

Extraction Patterns

Select by Field Value

# Exact match
jq -c 'select(.level == "error")' app.jsonl

# Numeric comparison
jq -c 'select(.status >= 400)' app.jsonl

# String contains
jq -c 'select(.message | test("timeout"))' app.jsonl

# Regex match
jq -c 'select(.path | test("^/api/v[0-9]+/users"))' app.jsonl

# Case-insensitive match
jq -c 'select(.message | test("error"; "i"))' app.jsonl

# Null check
jq -c 'select(.error != null)' app.jsonl

# Boolean field
jq -c 'select(.retry == true)' app.jsonl

Select by Nested Field

# Dot notation for nesting
jq -c 'select(.request.method == "POST")' app.jsonl

# Deep nesting
jq -c 'select(.context.user.role == "admin")' app.jsonl

# Safe navigation (no error if path missing)
jq -c 'select(.request?.headers?["authorization"] != null)' app.jsonl

Select by Array Contains

# Array contains value
jq -c 'select(.tags | index("critical"))' app.jsonl

# Any element matches condition
jq -c 'select(.events | any(.type == "error"))' app.jsonl

# All elements match condition
jq -c 'select(.checks | all(.passed == true))' app.jsonl

# Array length
jq -c 'select((.retries | length) > 3)' app.jsonl

Extract and Flatten Nested Structures

# Flatten one level of nesting
jq -c '{timestamp, level, msg: .message, user: .context.user.id}' app.jsonl

# Explode array into separate lines
jq -c '.events[]' app.jsonl

# Flatten array with parent context
jq -c '. as $parent | .events[] | {request_id: $parent.request_id, event: .type, ts: .timestamp}' app.jsonl

# Extract from array of objects
jq -c '.results[] | select(.score < 0.5) | {name, score}' results.jsonl

# Recursive descent (find all values for a key at any depth)
jq -c '.. | .error_message? // empty' app.jsonl

Handle Optional and Nullable Fields

# Default value for missing field
jq -r '.region // "unknown"' app.jsonl

# Default for nested missing field
jq -r '.response.body.error // .response.status_text // "no error info"' app.jsonl

# Skip lines where field is missing (instead of outputting null)
jq -r '.optional_field // empty' app.jsonl

# Coalesce multiple possible fields
jq -r '(.error_message // .err_msg // .error // "none")' app.jsonl

# Type check before access
jq -c 'if .data | type == "array" then .data | length else 0 end' app.jsonl

Multi-Level Nesting (Agent Conversation Logs)

# Claude Code conversation logs have deeply nested tool calls
# Structure: {role, content: [{type: "tool_use", name, input}, ...]}

# Extract all tool call names
jq -c '.content[]? | select(.type == "tool_use") | .name' conversation.jsonl

# Extract tool inputs
jq -c '.content[]? | select(.type == "tool_use") | {tool: .name, input: .input}' conversation.jsonl

# Extract text content blocks
jq -r '.content[]? | select(.type == "text") | .text' conversation.jsonl

# Extract tool results
jq -c '.content[]? | select(.type == "tool_result") | {tool_use_id, content}' conversation.jsonl

# Find tool calls that contain specific patterns in their input
jq -c '.content[]? | select(.type == "tool_use" and (.input | tostring | test("SELECT")))' conversation.jsonl

De-Escape Nested JSON Strings

# When a field contains a JSON string that needs parsing
jq -c '.payload | fromjson' app.jsonl

# Safe de-escape (skip if not valid JSON)
jq -c '.payload | fromjson? // {raw: .}' app.jsonl

# Double-escaped JSON (escaped twice)
jq -c '.data | fromjson | fromjson' app.jsonl

# Extract field from de-escaped nested JSON
jq -r '.payload | fromjson | .result.status' app.jsonl

# Handle mixed escaped/unescaped
jq -c 'if (.payload | type) == "string" then .payload | fromjson else .payload end' app.jsonl

Aggregation Patterns

All aggregation patterns use -s (slurp) which loads the entire file into memory. For large files, prefilter with rg first.

Count by Field Value

# Count per level
jq -sc 'group_by(.level) | map({level: .[0].level, count: length})' app.jsonl

# Count per status code
jq -sc 'group_by(.status) | map({status: .[0].status, count: length}) | sort_by(-.count)' app.jsonl

# Count unique values
jq -sc '[.[].user_id] | unique | length' app.jsonl

# Frequency distribution
jq -rsc 'group_by(.level) | map("\(.[0].level)\t\(length)") | .[]' app.jsonl

Sum, Average, Min, Max

# Sum
jq -sc 'map(.bytes) | add' app.jsonl

# Average
jq -sc 'map(.duration_ms) | add / length' app.jsonl

# Min and max
jq -sc 'map(.duration_ms) | {min: min, max: max, avg: (add / length)}' app.jsonl

# Percentile approximation (p50, p95, p99)
jq -sc '
  map(.duration_ms) | sort |
  length as $n |
  {
    p50: .[($n * 0.50 | floor)],
    p95: .[($n * 0.95 | floor)],
    p99: .[($n * 0.99 | floor)],
    max: .[-1]
  }
' app.jsonl

# Sum grouped by category
jq -sc '
  group_by(.service) |
  map({service: .[0].service, total_bytes: (map(.bytes) | add)})
' app.jsonl

Group By with Aggregation

# Group by service, show count and error rate
jq -sc '
  group_by(.service) |
  map({
    service: .[0].service,
    total: length,
    errors: (map(select(.level == "error")) | length),
    error_rate: ((map(select(.level == "error")) | length) / length * 100 | round)
  })
' app.jsonl

# Group by hour
jq -sc '
  group_by(.timestamp | split("T")[1] | split(":")[0]) |
  map({
    hour: .[0].timestamp | split("T")[1] | split(":")[0],
    count: length
  })
' app.jsonl

# Nested group by (service then level)
jq -sc '
  group_by(.service) |
  map({
    service: .[0].service,
    by_level: (group_by(.level) | map({level: .[0].level, n: length}))
  })
' app.jsonl

Top-N Queries

# Top 10 slowest requests
jq -sc 'sort_by(-.duration_ms) | .[:10] | .[] | {path: .path, ms: .duration_ms}' app.jsonl

# Top 5 most frequent errors
jq -sc '
  map(select(.level == "error")) |
  group_by(.message) |
  map({message: .[0].message, count: length}) |
  sort_by(-.count) | .[:5]
' app.jsonl

# Top users by request count
jq -sc '
  group_by(.user_id) |
  map({user: .[0].user_id, requests: length}) |
  sort_by(-.requests) | .[:10]
' app.jsonl

Histogram and Distribution Analysis

# Response time histogram (buckets: 0-100, 100-500, 500-1000, 1000+)
jq -sc '
  map(.duration_ms) |
  {
    "0-100ms": (map(select(. < 100)) | length),
    "100-500ms": (map(select(. >= 100 and . < 500)) | length),
    "500-1000ms": (map(select(. >= 500 and . < 1000)) | length),
    "1000ms+": (map(select(. >= 1000)) | length)
  }
' app.jsonl

# Status code distribution
jq -rsc '
  group_by(.status) |
  map("\(.[0].status)\t\(length)") |
  sort | .[]
' app.jsonl

# Log level distribution over time (by hour)
jq -rsc '
  group_by(.timestamp | split("T")[1] | split(":")[0]) |
  map(
    (.[0].timestamp | split("T")[1] | split(":")[0]) as $hour |
    {
      hour: $hour,
      info: (map(select(.level == "info")) | length),
      warn: (map(select(.level == "warn")) | length),
      error: (map(select(.level == "error")) | length)
    }
  ) | .[] | [.hour, .info, .warn, .error] | @tsv
' app.jsonl | column -t -N HOUR,INFO,WARN,ERROR

Running Totals and Cumulative Sums

# Cumulative error count over time
jq -sc '
  sort_by(.timestamp) |
  reduce .[] as $item (
    {total: 0, rows: []};
    .total += 1 |
    .rows += [{ts: $item.timestamp, cumulative: .total}]
  ) | .rows[] | [.ts, .cumulative] | @tsv
' <(jq -c 'select(.level == "error")' app.jsonl)

# Running average of response times
jq -sc '
  sort_by(.timestamp) |
  foreach .[] as $item (
    {n: 0, sum: 0};
    .n += 1 | .sum += $item.duration_ms;
    {ts: $item.timestamp, running_avg: (.sum / .n | round)}
  )
' app.jsonl

Transformation Patterns

Reshape Objects

# Flatten nested to flat
jq -c '{
  ts: .timestamp,
  level: .level,
  msg: .message,
  user: .context.user.id,
  method: .request.method,
  path: .request.path
}' app.jsonl

# Add computed fields
jq -c '. + {
  date: (.timestamp | split("T")[0]),
  hour: (.timestamp | split("T")[1] | split(":")[0] | tonumber),
  is_error: (.level == "error")
}' app.jsonl

# Rename fields
jq -c '{timestamp: .ts, message: .msg, severity: .lvl}' app.jsonl

# Remove fields
jq -c 'del(.stack_trace, .internal_debug_info)' app.jsonl

Merge Fields from Multiple Lines

# Combine start and end events by request_id
jq -sc '
  group_by(.request_id) |
  map(
    (map(select(.event == "start")) | .[0]) as $start |
    (map(select(.event == "end")) | .[0]) as $end |
    {
      request_id: .[0].request_id,
      start: $start.timestamp,
      end: $end.timestamp,
      status: $end.status,
      path: $start.path
    }
  )[]
' events.jsonl

# Merge consecutive lines (e.g., multiline log entries)
jq -sc '
  reduce .[] as $item (
    [];
    if (. | length) == 0 then [$item]
    elif $item.continuation == true then
      (.[-1].message += "\n" + $item.message) | .
    else . + [$item]
    end
  )[]
' app.jsonl

Convert Between Formats

# JSONL to CSV
jq -r '[.timestamp, .level, .message] | @csv' app.jsonl > app.csv

# JSONL to TSV
jq -r '[.timestamp, .level, .message] | @tsv' app.jsonl > app.tsv

# JSONL to CSV with header
echo "timestamp,level,message" > app.csv
jq -r '[.timestamp, .level, .message] | @csv' app.jsonl >> app.csv

# CSV to JSONL (using mlr)
mlr --c2j cat app.csv > app.jsonl

# JSONL to formatted table
jq -r '[.timestamp, .level, .message] | @tsv' app.jsonl | column -t -s$'\t'

# JSONL to markdown table
echo "| Timestamp | Level | Message |"
echo "|-----------|-------|---------|"
jq -r '"| \(.timestamp) | \(.level) | \(.message) |"' app.jsonl

Annotate Lines with Computed Fields

# Add line number
jq -c --argjson n 0 '. + {line_num: (input_line_number)}' app.jsonl

# Add duration since previous event (requires slurp)
jq -sc '
  sort_by(.timestamp) |
  . as $all |
  [range(length)] |
  map(
    $all[.] + (
      if . > 0 then {gap_from_prev: "computed"}
      else {gap_from_prev: null}
      end
    )
  )[]
' app.jsonl

# Tag lines matching criteria
jq -c '. + {
  severity_class: (
    if .level == "error" or .level == "fatal" then "critical"
    elif .level == "warn" then "warning"
    else "normal"
    end
  )
}' app.jsonl

# Enrich with filename when processing multiple files
fd -e jsonl . logs/ -x bash -c 'jq -c --arg src "$1" ". + {source: \$src}" "$1"' _ {}

Comparison Patterns

Diff Two JSONL Files by Matching Key

# Find entries in A but not in B (by id)
jq -r '.id' b.jsonl | sort > /tmp/b_ids.txt
jq -c --slurpfile bids <(jq -Rs 'split("\n") | map(select(. != ""))' /tmp/b_ids.txt) '
  select(.id | IN($bids[0][]))  | not
' a.jsonl

# Simpler approach using comm
jq -r '.id' a.jsonl | sort > /tmp/a_ids.txt
jq -r '.id' b.jsonl | sort > /tmp/b_ids.txt
comm -23 /tmp/a_ids.txt /tmp/b_ids.txt  # IDs in A but not B
comm -13 /tmp/a_ids.txt /tmp/b_ids.txt  # IDs in B but not A
comm -12 /tmp/a_ids.txt /tmp/b_ids.txt  # IDs in both

# Find records that exist in both but have different values
jq -sc '
  [., input] |
  (.[0] | map({(.id): .}) | add) as $a |
  (.[1] | map({(.id): .}) | add) as $b |
  ($a | keys) as $keys |
  [$keys[] | select($a[.] != $b[.])] |
  map({id: ., a: $a[.], b: $b[.]})
' <(jq -sc '.' a.jsonl) <(jq -sc '.' b.jsonl)

Side-by-Side Field Comparison

# Compare a specific field between two runs
paste <(jq -r '[.id, .score] | @tsv' run1.jsonl | sort) \
      <(jq -r '[.id, .score] | @tsv' run2.jsonl | sort) |
  awk -F'\t' '$2 != $4 {print $1, "run1=" $2, "run2=" $4}'

# Summary comparison of two log files
echo "=== File A ===" && jq -sc '{
  lines: length,
  errors: (map(select(.level == "error")) | length),
  unique_users: ([.[].user_id] | unique | length)
}' a.jsonl
echo "=== File B ===" && jq -sc '{
  lines: length,
  errors: (map(select(.level == "error")) | length),
  unique_users: ([.[].user_id] | unique | length)
}' b.jsonl

Find New, Missing, and Changed Records

# Comprehensive diff report
jq -r '.id' a.jsonl | sort > /tmp/a.ids
jq -r '.id' b.jsonl | sort > /tmp/b.ids

echo "--- New in B (not in A) ---"
comm -13 /tmp/a.ids /tmp/b.ids

echo "--- Removed from A (not in B) ---"
comm -23 /tmp/a.ids /tmp/b.ids

echo "--- Changed (in both, different values) ---"
comm -12 /tmp/a.ids /tmp/b.ids | while read id; do
  a_hash=$(rg "\"id\":\"$id\"" a.jsonl | md5sum | cut -d' ' -f1)
  b_hash=$(rg "\"id\":\"$id\"" b.jsonl | md5sum | cut -d' ' -f1)
  [ "$a_hash" != "$b_hash" ] && echo "$id"
done

Performance Patterns

Two-Stage rg + jq Pipeline

The single most important performance pattern. ripgrep is 10-100x faster than jq at scanning text.

# BAD: jq scans every line (slow on large files)
jq -c 'select(.level == "error" and .service == "auth")' huge.jsonl

# GOOD: rg filters text first, jq only parses matching lines
rg '"error"' huge.jsonl | rg '"auth"' | jq -c '.'

# GOOD: for precise matching after rg prefilter
rg '"error"' huge.jsonl | jq -c 'select(.level == "error" and .service == "auth")'

# Benchmarks (typical 1GB JSONL file):
#   jq alone:     45 seconds
#   rg + jq:      3 seconds
#   rg alone:     0.8 seconds

GNU parallel for Splitting Large Files

# Split a 10GB file and process in parallel
split -l 500000 huge.jsonl /tmp/chunk_

# Count errors across all chunks
ls /tmp/chunk_* | parallel "rg -c '\"error\"' {}" | awk -F: '{sum+=$2} END {print sum}'

# Extract and merge results
ls /tmp/chunk_* | parallel "jq -c 'select(.level == \"error\")' {}" > all_errors.jsonl

# Cleanup
rm /tmp/chunk_*

# One-liner with process substitution
parallel --pipe -L 100000 'jq -c "select(.level == \"error\")"' < huge.jsonl > errors.jsonl

jq --stream for SAX-Style Processing

For files too large to fit in memory, even line-by-line (e.g., a single 5GB JSON array).

# Count items in a huge JSON array without loading it
jq --stream 'select(.[0] | length == 1) | .[0][0]' huge-array.json | tail -1

# Extract specific field from each item in huge array
jq -cn --stream 'fromstream(1 | truncate_stream(inputs)) | .name' huge-array.json

# Filter items from huge array
jq -cn --stream '
  fromstream(1 | truncate_stream(inputs)) |
  select(.status == "failed")
' huge-array.json

Indexing Frequently-Queried Files

# Build an index of line offsets by key value
awk '{
  match($0, /"request_id":"([^"]+)"/, m)
  if (m[1]) print m[1], NR
}' app.jsonl | sort > app.idx

# Look up specific request by index
LINE=$(grep "req-abc-123" app.idx | awk '{print $2}')
sed -n "${LINE}p" app.jsonl | jq .

# Build a SQLite index for repeated queries
sqlite3 log_index.db "CREATE TABLE idx (request_id TEXT, line INTEGER)"
awk '{
  match($0, /"request_id":"([^"]+)"/, m)
  if (m[1]) print "INSERT INTO idx VALUES (\047" m[1] "\047, " NR ");"
}' app.jsonl | sqlite3 log_index.db

# Query by index
LINE=$(sqlite3 log_index.db "SELECT line FROM idx WHERE request_id = 'req-abc-123'")
sed -n "${LINE}p" app.jsonl | jq .

Memory-Efficient Aggregation Without Slurp

# Count by level without loading entire file
jq -r '.level' app.jsonl | sort | uniq -c | sort -rn

# Top error messages without slurp
jq -r 'select(.level == "error") | .message' app.jsonl | sort | uniq -c | sort -rn | head -20

# Unique users without slurp
jq -r '.user_id' app.jsonl | sort -u | wc -l

# Sum without slurp
jq -r '.bytes' app.jsonl | awk '{sum+=$1} END {print sum}'

# These are all O(1) memory (streaming) vs O(n) memory (slurp)

jsonl-patterns.md 17 KB History Raw