Local Whisper Transcription with Speaker Diarization: My GPU-Powered Docker Setup

Sat, 18 Apr 2026 14:00:00 +0200

I wanted a transcription tool that runs entirely on my own hardware: no audio leaves the machine, no cloud APIs, no subscriptions. It had to handle any language (including tricky ones like Flemish dialect), produce speaker-labeled output, and be tunable with domain-specific vocabulary for whatever context I’m transcribing.

What I ended up with is a Docker container powered by an NVIDIA RTX 3090 that transcribes audio with Whisper, aligns every word to a precise timestamp, and identifies who said what — all in about two minutes for a 42-minute recording.

Self-Hosted on steeman.be