Turning sound into pixels: podcast clips for social, waveform "audiograms", debugging audio by looking at it.
# scrolling waveform over a brand background + episode title:
ffmpeg -i episode.mp3 -loop 1 -i bg_1080x1920.png -filter_complex \
"[0:a]showwaves=s=1080x300:mode=cline:colors=white:rate=30[w];
[1:v][w]overlay=0:1200:shortest=1,drawtext=text='EP 42 — Title':fontsize=56:fontcolor=white:x=(w-text_w)/2:y=320" \
-c:v libx264 -crf 21 -preset fast -pix_fmt yuv420p -c:a aac -b:a 128k -shortest audiogram.mp4
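shortest=1 on the overlay and -shortest on the output together stop the looped
image from running forever: the first ends the overlay when the waveform ends,
the second ends the file when its shortest stream ends. showwaves modes: cline
(centered vertical lines, the filled podcast look), line, p2p, point.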
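To compare the four modes before committing to one, a short loop like the sketch below may help. The synthetic 220 Hz sine source and the waves_*.mp4 output names are assumptions for illustration only; swap in a real input as needed.

```shell
# Render one second of each showwaves mode from a synthetic tone (stand-in input).
for m in cline line p2p point; do
  ffmpeg -y -f lavfi -i "sine=frequency=220:duration=2" -filter_complex \
    "[0:a]showwaves=s=640x200:mode=$m:colors=white:rate=30[v]" \
    -map "[v]" -frames:v 30 -c:v libx264 -pix_fmt yuv420p "waves_$m.mp4"
done
```

Each clip is video-only, so the differences between modes are easy to eyeball in any player.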
# frequency bars (the "visualizer" look):
"[0:a]showfreqs=s=1280x420:mode=bar:fscale=log[v]"
# scrolling spectrogram (also the debugging view — see below):
"[0:a]showspectrum=s=1280x720:mode=combined:color=intensity:scale=log:slide=scroll[v]"
# musical/CQT spectrum (notes map to columns along a note-labelled axis; lovely for music):
"[0:a]showcqt=s=1280x720[v]"
# stereo phase scope (Lissajous vectorscope); for a level meter use showvolume instead:
"[0:a]avectorscope=s=720x720:zoom=1.5[v]"
All consume [0:a] and produce a video stream — overlay/hstack them like any
other video (filtergraph.md).
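As a sketch of how these fragments slot into a full command, the example below feeds one audio stream to two scopes via asplit, stacks the results with vstack, and maps the original audio back in. The synthetic sine input, the sizes, and the scopes.mp4 name are assumptions chosen so the snippet runs without any source file.

```shell
# Waveform on top, scrolling spectrogram below; a 3 s test tone stands in for a real file.
ffmpeg -y -f lavfi -i "sine=frequency=440:duration=3" -filter_complex \
"[0:a]asplit=2[a1][a2];
[a1]showwaves=s=1280x240:mode=cline:colors=white:rate=30[w];
[a2]showspectrum=s=1280x480:mode=combined:color=intensity:scale=log:slide=scroll[sp];
[w][sp]vstack=inputs=2[v]" \
-map "[v]" -map 0:a -c:v libx264 -crf 21 -preset fast -pix_fmt yuv420p -c:a aac -b:a 128k -shortest scopes.mp4
```

Labelling the final output ([v]) and mapping it explicitly keeps the command predictable once more than one scope is involved.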
# waveform PNG (one image of the whole file — episode art, quick inspection):
ffmpeg -i in.mp3 -filter_complex "showwavespic=s=1920x480:colors=#3aa3ff" -frames:v 1 wave.png
# spectrogram PNG — the audio-debugging x-ray:
ffmpeg -i in.wav -lavfi "showspectrumpic=s=1920x1080:scale=log" -frames:v 1 spec.png
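When the feature of interest sits in a narrow band (low-frequency hum, for instance), zooming the spectrogram can make it far easier to spot. The sketch below assumes a build whose showspectrumpic supports the start/stop frequency options; the synthetic 1 kHz + 50 Hz signal and all file names are made up for illustration.

```shell
# Stand-in input: a 1 kHz tone plus a 50 Hz component playing the role of hum.
ffmpeg -y -f lavfi -i "aevalsrc='0.3*sin(2*PI*1000*t)+0.3*sin(2*PI*50*t)':d=5" hum.wav
# Limit the picture to 0-500 Hz so a low-frequency line fills the frame.
ffmpeg -y -i hum.wav -lavfi "showspectrumpic=s=1280x480:scale=log:stop=500" -frames:v 1 hum_before.png
# Attenuate everything below ~100 Hz, then render the same view again to compare.
ffmpeg -y -i hum.wav -af "highpass=f=100" clean.wav
ffmpeg -y -i clean.wav -lavfi "showspectrumpic=s=1280x480:scale=log:stop=500" -frames:v 1 hum_after.png
```

Comparing the two PNGs shows whether the filter actually touched the band you cared about.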
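Reading the spectrogram: a hard ceiling at ~16 kHz suggests the audio passed
through a lossy encode at some point (typical of a ~128k MP3), regardless of
its current extension; mains hum = a solid horizontal line at 50/60 Hz, often
with harmonics above it (kill with highpass, or a bandreject notch if the low
end matters); clicks = vertical needles. Faster than ears for
"is this 'lossless' file actually lossless".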
ffmpeg-only reactivity is limited to the built-in scopes. For brand-grade
audio-reactive motion (pulsing logos, beat-synced glow), render with a
composition tool (hyperframes' audio-reactive bindings or Remotion's useAudioData)
and use ffmpeg for the I/O around it: extract the audio (-vn), supply stems,
encode/package the rendered result (encoding.md).
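The I/O around a composition tool might look roughly like the sketch below. The names episode.mp4, render.mp4, audio.wav and final.mp4 are hypothetical, and the first two commands only fabricate stand-in files so the snippet runs on its own; in practice render.mp4 would come from the composition tool.

```shell
# Stand-ins: a source episode with audio, and a silent "render" from the composition tool.
ffmpeg -y -f lavfi -i "testsrc2=s=640x360:r=30:d=3" -f lavfi -i "sine=frequency=440:duration=3" \
  -c:v libx264 -pix_fmt yuv420p -c:a aac -shortest episode.mp4
ffmpeg -y -f lavfi -i "testsrc2=s=640x360:r=30:d=3" -c:v libx264 -pix_fmt yuv420p -an render.mp4

# 1. Extract the audio (no video) as WAV for the composition tool to analyse.
ffmpeg -y -i episode.mp4 -vn -c:a pcm_s16le audio.wav
# 2. After rendering, mux the silent render with the original audio, copying both streams.
ffmpeg -y -i render.mp4 -i episode.mp4 -map 0:v:0 -map 1:a:0 -c:v copy -c:a copy -shortest final.mp4
```

Stream copy in step 2 avoids a second generation of encoding; re-encode instead if the render's codec does not suit the target platform.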
# 2x2 labelled grid of four variants:
ffmpeg -i a.mp4 -i b.mp4 -i c.mp4 -i d.mp4 -filter_complex \
"[0:v]drawtext=text='crf20':fontsize=36:fontcolor=white:box=1:boxcolor=black@0.5:x=12:y=12[a];
[1:v]drawtext=text='crf26':fontsize=36:fontcolor=white:box=1:boxcolor=black@0.5:x=12:y=12[b];
[2:v]drawtext=text='nvenc':fontsize=36:fontcolor=white:box=1:boxcolor=black@0.5:x=12:y=12[c];
[3:v]drawtext=text='av1':fontsize=36:fontcolor=white:box=1:boxcolor=black@0.5:x=12:y=12[d];
[a][b][c][d]xstack=inputs=4:layout=0_0|w0_0|0_h0|w0_h0" -an grid.mp4
Inputs must share dimensions (scale first if not). The 2-input case (hstack +
difference blend) lives in quality-metrics.md.
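If the variants differ in size, one way to normalise them is to scale each input to a common tile before xstack, as sketched below. The four testsrc2 sources, the 960x540 tile size, and the grid_scaled.mp4 name are assumptions so the example runs without real inputs.

```shell
# Four synthetic inputs of different sizes, each scaled to a 960x540 tile, then stacked 2x2.
ffmpeg -y -f lavfi -i "testsrc2=s=1280x720:r=30:d=2" -f lavfi -i "testsrc2=s=640x360:r=30:d=2" \
  -f lavfi -i "testsrc2=s=1920x1080:r=30:d=2" -f lavfi -i "testsrc2=s=854x480:r=30:d=2" -filter_complex \
"[0:v]scale=960:540,setsar=1[a];
[1:v]scale=960:540,setsar=1[b];
[2:v]scale=960:540,setsar=1[c];
[3:v]scale=960:540,setsar=1[d];
[a][b][c][d]xstack=inputs=4:layout=0_0|w0_0|0_h0|w0_h0" -an -c:v libx264 -pix_fmt yuv420p grid_scaled.mp4
```

setsar=1 keeps the sample aspect ratio consistent across tiles, which avoids surprises when sources were encoded with non-square pixels.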