whisperx — speech-to-text + diarization

WhisperX extends OpenAI Whisper with word-level timestamps and speaker diarization. It runs faster than vanilla Whisper by using faster-whisper's CTranslate2 backend.

Image tag

docker.io/manvarharsh/whisperx:cuda12

What’s in this image

  • Base: nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  • Python 3.10 (conda)
  • WhisperX, faster-whisper, pyannote (diarization)
  • ffmpeg
  • OpenSSH server

Default ports

Port    Service
22      SSH
8000    API / app port

Use cases

  • Transcribing long-form audio with word-level timestamps
  • Speaker diarization (who-said-what)
  • Subtitle / caption generation in many languages
  • Bulk transcription of audio archives
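As an illustration of the subtitle use case, here is a minimal sketch that turns timestamped segments into SRT text. It assumes each segment is a dict with "start" and "end" in seconds, a "text" string, and an optional "speaker" label, which is the general shape of WhisperX segment output; the helper names format_timestamp and segments_to_srt are hypothetical and not part of the image.

```python
from __future__ import annotations


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues.

    Assumption: each segment has "start", "end", "text" and, when
    diarization was run, a "speaker" label.
    """
    cues = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        speaker = seg.get("speaker")
        if speaker:
            # Prefix the cue with the speaker label (who-said-what).
            text = f"[{speaker}] {text}"
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by one blank line.
    return "\n".join(cues)
```

Calling segments_to_srt on the "segments" list of a transcription result and writing the returned string to a .srt file is one way to get captions; the bracketed speaker prefix is a formatting choice, not something the SRT format requires.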

Environment variables

Variable               Description
ENABLE_SSH             Enable SSH server
ENABLE_WHISPERX        Start the WhisperX service
WHISPERX_EXTRA_ARGS    Extra CLI args
HF_TOKEN               Hugging Face token (required for pyannote diarization model access)
SSH_PUBLIC_KEY         Public key for SSH
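To show how these variables might be combined with the image tag, ports, and /data mount, here is a small sketch that assembles a docker run command. Several things in it are assumptions rather than documented behavior: the "1" values used to switch features on, the host port 2222 for SSH, the placeholder token and key strings, and the helper name build_docker_run. The docker flags themselves (--rm, --gpus, -p, -v, -e) are standard.

```python
from __future__ import annotations

import shlex

IMAGE = "docker.io/manvarharsh/whisperx:cuda12"


def build_docker_run(
    env: dict[str, str],
    data_dir: str,
    ssh_host_port: int = 2222,   # assumption: any free host port works
    api_host_port: int = 8000,
) -> list[str]:
    """Build an argv list for `docker run` with ports, volume, and env vars."""
    argv = ["docker", "run", "--rm", "--gpus", "all"]
    # Publish the two container ports from the "Default ports" table.
    argv += ["-p", f"{ssh_host_port}:22", "-p", f"{api_host_port}:8000"]
    # Mount the host directory at /data for persistence.
    argv += ["-v", f"{data_dir}:/data"]
    for name, value in env.items():
        argv += ["-e", f"{name}={value}"]
    argv.append(IMAGE)
    return argv


if __name__ == "__main__":
    example_env = {
        "ENABLE_SSH": "1",             # assumption: "1" enables the feature
        "ENABLE_WHISPERX": "1",        # assumption: "1" enables the feature
        "HF_TOKEN": "<your-hf-token>",         # placeholder, not a real token
        "SSH_PUBLIC_KEY": "<your-public-key>",  # placeholder
    }
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(build_docker_run(example_env, "/srv/whisperx-data")))
```

Building the command as a list keeps values with spaces, such as a public key, as single arguments; the list can also be passed directly to subprocess.run.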

Persistence

Mount a persistent volume at /data. Input audio goes in /data/audio/, and output transcripts go in /data/output/.
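For bulk jobs over this layout, a sketch like the following can list audio files that do not yet have a transcript. The naming scheme is an assumption, not documented behavior of the image: it supposes each transcript mirrors the audio file's relative path under /data/output/ with a .json suffix. The set of audio extensions and the helper names output_path_for and pending_audio are likewise hypothetical.

```python
from __future__ import annotations

from pathlib import Path

AUDIO_DIR = Path("/data/audio")
OUTPUT_DIR = Path("/data/output")
# Assumption: extensions treated as audio; adjust to your archive.
AUDIO_SUFFIXES = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}


def output_path_for(
    audio_path: Path,
    audio_dir: Path = AUDIO_DIR,
    output_dir: Path = OUTPUT_DIR,
    suffix: str = ".json",  # assumption: one transcript file per audio file
) -> Path:
    """Map an audio file to the transcript path that mirrors it."""
    relative = Path(audio_path).relative_to(audio_dir)
    return Path(output_dir) / relative.with_suffix(suffix)


def pending_audio(
    audio_dir: Path = AUDIO_DIR,
    output_dir: Path = OUTPUT_DIR,
    suffix: str = ".json",
) -> list[Path]:
    """Return audio files under audio_dir that have no transcript yet."""
    pending = []
    for path in sorted(Path(audio_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        if not output_path_for(path, audio_dir, output_dir, suffix).exists():
            pending.append(path)
    return pending
```

Skipping files that already have an output makes an interrupted batch safe to rerun, which matters for long archives.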
