Speech to Text with OpenAI Whisper in Python
Speech to text in Python allows you to convert audio and video files into readable text automatically. This is useful for tasks like audio transcription, video subtitles, meeting notes, and voice-based applications. In this beginner-friendly tutorial, you will learn how to use OpenAI Whisper to perform accurate speech to text in Python.
OpenAI Whisper converts spoken language from audio or video files into clear, readable text with high accuracy.
In simple words 👇
Whisper listens to audio and converts spoken words into text using Python-based speech-to-text processing.
It supports multiple languages, handles background noise, and works well with real-world audio like meetings, podcasts, and videos. Because of this, OpenAI Whisper is widely used for audio transcription and speech to text tasks in Python applications.
By the end of this tutorial, you will understand how to install Whisper, load audio files, and generate text output using Python. This tutorial is designed for beginners, so no prior experience with speech recognition is required.
Whisper is widely used by developers for:
- Audio to text conversion
- Speech recognition in Python
- Multilingual transcription
- Audio translation to English
What is OpenAI Whisper?
Key Features of Whisper Model
- Supports 99+ languages
- Works offline after installation
- Open-source and free to use
- Easy integration with Python
- Supports transcription and translation
- High accuracy for noisy audio
Advantages of Using OpenAI Whisper for Audio Transcription
- Accurate speech recognition
- Supports Indian languages
- Free and open-source
- Works without internet
- Easy Python implementation
Limitations of OpenAI Whisper
- Large models need high RAM
- CPU transcription is slow
- GPU recommended for long audio
- Not real-time by default
Whisper Transcription vs Translation
1. Speech-to-Text Transcription Using Whisper
- Converts speech to text
- Output language = audio language
Example:
Hindi audio → Hindi text
2. Language Translation with OpenAI Whisper
- Converts speech to English text
- Output language = English
Example:
Hindi audio → English text
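In code, the two modes differ only in the task option passed to model.transcribe(). The sketch below is a minimal illustration of that idea. The helper names build_options and run_whisper, and the default model name "small", are assumptions made for this example and are not part of the Whisper library.

```python
VALID_TASKS = {"transcribe", "translate"}


def build_options(task, language=None):
    """Return keyword arguments for model.transcribe() (hypothetical helper)."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    options = {"task": task}
    if language is not None:
        # Leaving the language out lets Whisper detect it on its own.
        options["language"] = language
    return options


def run_whisper(path, task, language=None, model_name="small"):
    """Run Whisper on one file; imported lazily so build_options works without it."""
    import whisper

    model = whisper.load_model(model_name)
    return model.transcribe(path, **build_options(task, language))["text"]


print(build_options("transcribe", "hi"))  # Hindi audio -> Hindi text
print(build_options("translate", "hi"))   # Hindi audio -> English text
```

Calling run_whisper("hindi.mp3", "translate", "hi") would then return English text, while the same call with "transcribe" would keep the text in Hindi.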
Available Whisper Models and Supported Languages
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | 1 GB | 10x |
| base | 74 M | base.en | base | 1 GB | 7x |
| small | 244 M | small.en | small | 2 GB | 4x |
| medium | 769 M | medium.en | medium | 5 GB | 2x |
| large | 1550 M | N/A | large | 10 GB | 1x |
| turbo | 809 M | N/A | turbo | 6 GB | 8x |
The default turbo model is a good choice when you only need to transcribe speech in its original language, for example English audio to English text.
However, turbo is not suited to translation.
If you have audio in another language (such as Spanish or French) and need English text, do not use turbo. Instead, pick one of the multilingual models from the table, such as small, medium, or large.
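The table and the note about turbo can be turned into a small selection helper. This is only a sketch: the function pick_model and the preference order in MODELS (most accurate first) are assumptions for illustration, and the VRAM figures are the approximate values from the table above.

```python
# (name, approximate VRAM in GB, suitable for translation), from the table above.
# The order is an assumed preference: try the most accurate model first.
MODELS = [
    ("large", 10, True),
    ("turbo", 6, False),
    ("medium", 5, True),
    ("small", 2, True),
    ("base", 1, True),
    ("tiny", 1, True),
]


def pick_model(vram_gb, need_translation=False):
    """Return the first model that fits in vram_gb, or None if none fits."""
    for name, required_gb, can_translate in MODELS:
        if need_translation and not can_translate:
            continue  # skip turbo when English translation is required
        if required_gb <= vram_gb:
            return name
    return None


print(pick_model(6))                         # turbo
print(pick_model(6, need_translation=True))  # medium
print(pick_model(12, need_translation=True)) # large
```

The returned name can be passed straight to whisper.load_model().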
Using OpenAI Whisper in Python
Step 1: Install Whisper Library
pip install -U openai-whisper
Step 2: Install FFmpeg (Required)
Whisper uses FFmpeg to read audio files.
Windows :
- Download FFmpeg from the official website by choosing ffmpeg-release-full.7z.
- Extract the downloaded archive.
- Rename the extracted folder to ffmpeg.
- Your path should look like this: C:\ffmpeg\. Inside this folder, you should see a bin folder: C:\ffmpeg\bin\
Setting System Environment PATH for FFmpeg
This is the critical step that allows you to run ffmpeg from any terminal window.
- Press the Windows key and type env.
- Select Edit the system environment variables.
- Click the Environment Variables… button at the bottom right.
- Under System variables (the bottom box), find the variable named Path and select Edit.
- Click New and paste the path to your bin folder: C:\ffmpeg\bin
- Click OK on all open windows to save.
Linux (Ubuntu / Debian)
sudo apt install ffmpeg
macOS (using Homebrew)
brew install ffmpeg
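Before moving on, it can help to confirm from Python that FFmpeg is reachable on PATH. The sketch below uses only the standard library; the helper name tool_on_path is an assumption for this example.

```python
import shutil
import subprocess


def tool_on_path(name):
    """Return True if an executable called `name` can be found on PATH."""
    return shutil.which(name) is not None


if tool_on_path("ffmpeg"):
    # "ffmpeg -version" prints build information; the first line names the version.
    completed = subprocess.run(
        ["ffmpeg", "-version"], capture_output=True, text=True, check=False
    )
    lines = completed.stdout.splitlines()
    print(lines[0] if lines else "ffmpeg found, but it printed no version info")
else:
    print("ffmpeg was not found on PATH; revisit the installation steps above.")
```

If the second message appears on Windows, open a new terminal window first, because PATH changes only apply to terminals started after the change.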
Audio Transcription Using OpenAI Whisper
Converting Audio to Text Using OpenAI Whisper in Python (Other Language to English)
import whisper
- Imports the OpenAI Whisper library into your Python environment.
model = whisper.load_model("large-v2")
- First, the Whisper large-v2 model is loaded. The first run takes some time because the model has to be downloaded.
- The large-v2 model is the most accurate option but requires significantly more memory and computational power (GPU recommended).
- If you cannot afford that much computation, use a smaller model such as tiny, base, small, or medium.
result = model.transcribe(audio="hindi.mp3",language="hi",task="translate")
- audio="hindi.mp3": The input file you want to process.
- language="hi": You are explicitly telling the model the audio is in Hindi.
- task="translate": This is the crucial part. It tells Whisper to translate the audio into English.
The output of the model.transcribe() method looks like this:
{
    "text": "Welcome to the tutorial on Python programming.",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 2.5,
            "text": "Welcome to the tutorial on Python programming.",
            "tokens": [50364, 3450, ...],
            "temperature": 0.0,
            "avg_logprob": -0.28,
            "compression_ratio": 1.1,
            "no_speech_prob": 0.04
        }
    ],
    "language": "en"
}
print(result["text"])
- Prints the final transcription output.
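Besides result["text"], the segments list carries a start and end time for each piece of text. The sketch below shows one way to print them together. The helper name format_segments is an assumption for this example, and the data is a trimmed copy of the sample output shown earlier so that it runs without Whisper installed.

```python
def format_segments(result):
    """Return one '[start -> end] text' line per segment of a Whisper result."""
    lines = []
    for segment in result["segments"]:
        start, end = segment["start"], segment["end"]
        lines.append(f"[{start:.2f}s -> {end:.2f}s] {segment['text'].strip()}")
    return lines


# Trimmed copy of the sample output, used here instead of a real transcription.
sample_result = {
    "text": "Welcome to the tutorial on Python programming.",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.5,
         "text": "Welcome to the tutorial on Python programming."},
    ],
    "language": "en",
}

for line in format_segments(sample_result):
    print(line)
# [0.00s -> 2.50s] Welcome to the tutorial on Python programming.
```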
Full Python Code
import whisper
model = whisper.load_model("large-v2")
result = model.transcribe(audio="audio.mp3",language="hi",task="translate")
print(result["text"])
Supported Audio Formats
Whisper supports most common audio formats:
- MP3
- WAV
- M4A
- FLAC
- OGG
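When a folder contains mixed files, a simple extension check keeps unsupported files out of the transcription loop. This is a minimal sketch: the set AUDIO_EXTENSIONS mirrors the list above and the helper name is_supported_audio is an assumption. Since Whisper reads files through FFmpeg, other formats may work as well.

```python
from pathlib import Path

# Extensions taken from the list above; extend this set if FFmpeg reads more.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def is_supported_audio(path):
    """Return True if the file extension is in AUDIO_EXTENSIONS (case-insensitive)."""
    return Path(path).suffix.lower() in AUDIO_EXTENSIONS


candidates = ["talk.MP3", "notes.txt", "interview.flac", "clip.mp4"]
print([name for name in candidates if is_supported_audio(name)])
# ['talk.MP3', 'interview.flac']
```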
Using GPU with OpenAI Whisper for Faster Transcription
Why You Should Use a GPU for Speech-to-Text Tasks
If you have a dedicated NVIDIA GPU available, you should absolutely use it instead of your CPU.
- Whisper comes in sizes: tiny, base, small, medium, large, and turbo, with larger models being more accurate but slower.
- The large model is highly accurate but slow.
- When using the medium or large model, prefer a GPU over a CPU.
- A GPU handles translation much faster than a CPU.
- GPUs are designed for parallel processing, which makes them ideal for AI workloads.
- In addition, NVIDIA GPUs support CUDA, which PyTorch (the framework Whisper runs on) uses to accelerate inference.
- GPUs process large datasets and long audio files much faster.
- Consequently, response time and throughput improve greatly.
CPU vs GPU Performance for Whisper Transcription
| Feature | CPU | GPU |
|---|---|---|
| Processing Style | Serial (One by one) | Parallel (All at once) |
| Large Model Support | Very Poor / Slow | Excellent |
| Math Precision | FP32 (Slower) | FP16 (Faster) |
| Best Use Case | Short, low-quality audio | Long files & high accuracy |
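The precision row corresponds to the fp16 option that model.transcribe() accepts. On a CPU, Whisper falls back to FP32 and prints a warning, so setting the option explicitly keeps the output clean. The sketch below is illustrative only: the helper names precision_options and transcribe_on, and the default model name "small", are assumptions for this example.

```python
def precision_options(device):
    """Return transcribe() options that match the device (hypothetical helper)."""
    # Half precision (FP16) is only useful on a CUDA GPU; on a CPU use FP32.
    return {"fp16": device == "cuda"}


def transcribe_on(device, path, model_name="small"):
    """Load a model on `device` and transcribe `path` with matching precision."""
    import whisper  # imported lazily so precision_options() works without it

    model = whisper.load_model(model_name, device=device)
    return model.transcribe(path, **precision_options(device))


print(precision_options("cuda"))  # {'fp16': True}
print(precision_options("cpu"))   # {'fp16': False}
```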
Python Code with Explanation
1️⃣: Install PyTorch with CUDA (GPU Support)
Install PyTorch based on your CUDA version.
Example (CUDA 11.8):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
👉 This step is mandatory for GPU acceleration.
2️⃣: Verify GPU is Available
import torch
print(torch.cuda.is_available())
Output :
True
✅ If True, GPU is ready
❌ If False, Whisper will use CPU
3️⃣: Import Required Libraries
import whisper
- Imports the Whisper library for speech-to-text and translation.
import torch
- Used to check whether a GPU (CUDA) is available.
import json
- Helps save the transcription output in a JSON file.
4️⃣: Check Whether GPU Is Available
device = "cuda" if torch.cuda.is_available() else "cpu"
- First, torch.cuda.is_available() checks if an NVIDIA GPU is present.
- If yes, cuda is selected and Whisper runs on the GPU.
- Otherwise, Whisper falls back to the CPU.
print(f"Using device: {device}")
- As a result, this line prints which device is being used.
- This is useful for debugging and confirmation.
5️⃣: Load Whisper Model on Selected Device
model = whisper.load_model("large-v2", device=device)
- Next, the large-v2 Whisper model is loaded.
- Importantly, the model is loaded directly on the selected device:
  - GPU if available
  - CPU if GPU is not available
- Therefore, transcription speed depends on the device.
6️⃣: Transcribe and Translate Audio
result = model.transcribe(
    audio="audio.mp3",
    language="hi",
    task="translate"
)
- First, "audio.mp3" is the input audio file.
- Then, language="hi" tells Whisper the audio language is Hindi.
- After that, task="translate" instructs Whisper to convert Hindi speech into English text.
- As a result, the output text is always in English.
7️⃣: Save Full Transcription Output to JSON File
with open("transcription.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=4)
- Next, a file named transcription.json is created.
- Then, the full Whisper output is saved, including:
  - Transcribed text
  - Language info
  - Segments
  - Timestamps
- Additionally, encoding="utf-8" ensures proper support for Unicode text.
8️⃣: Print Final Translated Text
print(result["text"])
- Finally, only the translated text is printed.
- This is useful when:
- You only want readable output
- Not the full JSON data
9️⃣: Complete Code
import whisper
import torch
import json

# Use the GPU when CUDA is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the model on the selected device
model = whisper.load_model("large-v2", device=device)

# Transcribe the Hindi audio and translate it into English
result = model.transcribe(audio="audio.mp3", language="hi", task="translate")

# Save the full output (text, segments, timestamps) to a JSON file
with open("transcription.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=4)

# Print only the translated text
print(result["text"])
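Because each segment has start and end times, the same result can also be turned into subtitles, one of the use cases mentioned in the introduction. The sketch below builds text in the common SRT subtitle layout from Whisper-style segments. The function names srt_timestamp and segments_to_srt and the sample data are assumptions for this example.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT subtitle text from Whisper-style segments."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Made-up segments standing in for result["segments"].
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the tutorial."},
    {"start": 2.5, "end": 65.25, "text": " Let us get started."},
]
print(segments_to_srt(sample_segments))
```

With a real transcription, passing result["segments"] and writing the returned string to a .srt file with encoding="utf-8" gives a subtitle file that most video players can load.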
Transcribing Video to Text Using OpenAI Whisper
✅ Overall Flow (Big Picture)
- First, convert your video (.mp4) into audio (.mp3)
- Next, install and configure Whisper
- Then, load the large-v2 model
- Finally, transcribe and translate the audio into English
✅ Why Convert MP4 to MP3 Before Video Transcription?
Faster Processing
- MP3 files are much smaller than MP4 video files.
- Therefore, Whisper can load and process audio more quickly.
- As a result, transcription time is reduced.
Lower System Resource Usage
- MP4 contains both video and audio streams.
- However, Whisper only needs audio.
- So, converting to MP3 saves CPU, GPU, and RAM usage.
Better Compatibility
- Some MP4 files use unsupported or unstable audio codecs.
- In contrast, MP3 is widely supported.
- Hence, you avoid decoding errors.
Improved Transcription Accuracy
- Converting to MP3 does not remove noise by itself, but a separate audio file is easier to clean up before transcription.
- Once background noise is reduced, audio clarity improves.
- Consequently, Whisper produces more accurate text.
Easier Preprocessing
- With MP3, you can easily:
- Normalize volume
- Remove noise
- Trim silence
- Therefore, you can optimize audio before transcription.
Better Storage Management
- MP4 files are large and consume more disk space.
- On the other hand, MP3 files are compact.
- So, managing datasets becomes easier.
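The points above come down to keeping only the audio stream. The sketch below builds an FFmpeg command that drops the video stream and writes mono audio. The helper name build_extract_command and the example file names are assumptions. The 16000 Hz default is chosen because Whisper resamples audio to 16 kHz mono when it loads a file, so treat it as a reasonable starting point, not a requirement.

```python
def build_extract_command(video_path, audio_path, sample_rate=16000):
    """Return an FFmpeg argument list that extracts mono audio from a video."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # mix down to one channel (mono)
        "-ar", str(sample_rate),  # resample the audio
        audio_path,
    ]


command = build_extract_command("videos/lesson1.mp4", "audios/lesson1.mp3")
print(" ".join(command))
# ffmpeg -y -i videos/lesson1.mp4 -vn -ac 1 -ar 16000 audios/lesson1.mp3
# To run it: import subprocess; subprocess.run(command, check=True)
```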
✅ Python Program to Convert Video to Text Using Whisper
1️⃣: Folder Structure
my_project/
├── videos/
├── jsons/
├── audios/
├── video_to_audio.py
└── speech_to_text.py
The FFmpeg and Whisper setup was covered in the earlier sections.
2️⃣: Convert MP4 to MP3 Using FFmpeg
Write a program that converts video files (.mp4) to audio files (.mp3) using FFmpeg.
- Put all your video files into the videos folder.
Import Required Modules
import os
import subprocess
- First, import os is used to interact with folders and files on your system.
- Next, import subprocess allows Python to run external commands, such as FFmpeg.
List All Video Files
files = os.listdir("videos")
print(files)
- The first line reads the names of all files inside the videos folder.
- As a result, the file names are stored in the files list.
- The second line prints all video file names.
- This is helpful to confirm which files will be processed.
Loop Through All Files and Convert Them to MP3 Using FFmpeg
for file in files:
    filename = file.split(".")[0]
    subprocess.run(["ffmpeg", "-i", f"videos/{file}", f"audios/{filename}.mp3"])
- First, for file in files: starts a loop.
  - As a result, each video file inside the files list is processed one by one.
  - This makes the conversion automatic and scalable.
- Next, filename = file.split(".")[0] builds the output name.
  - First, the file name is split using the dot (.) character.
  - Then, the part before the first dot is selected.
  - As a result, the extension (.mp4, .mkv, etc.) is removed.
- Then, subprocess.run([...]) executes an external command. Specifically, it runs the FFmpeg tool from Python. Inside the command:
  - "ffmpeg": Starts the FFmpeg program.
  - "-i", f"videos/{file}": Specifies the input video file from the videos folder.
  - f"audios/{filename}.mp3": Specifies the output MP3 file in the audios folder.
- As a result, audio is extracted from the video and saved as an MP3 file.
Complete Code
video_to_audio.py
# Convert videos to MP3
import os
import subprocess

files = os.listdir("videos")
print(files)

for file in files:
    filename = file.split(".")[0]
    subprocess.run(["ffmpeg", "-i", f"videos/{file}", f"audios/{filename}.mp3"])
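One caveat with file.split(".")[0] is that it cuts the name at the first dot, so a file such as lecture.part1.mp4 would be saved as lecture.mp3. The sketch below shows an alternative based on pathlib that removes only the final extension and skips files that are not videos. The helper name plan_conversions and the VIDEO_EXTENSIONS set are assumptions for this example, and it only plans the paths without calling FFmpeg.

```python
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mkv", ".mov"}  # assumed set for this example


def plan_conversions(file_names, video_dir="videos", audio_dir="audios"):
    """Map each video file name to an (input path, output .mp3 path) pair."""
    pairs = []
    for name in sorted(file_names):
        path = Path(name)
        if path.suffix.lower() not in VIDEO_EXTENSIONS:
            continue  # skip stray files such as notes.txt
        # .stem removes only the last extension, so dots in the name survive.
        pairs.append((f"{video_dir}/{name}", f"{audio_dir}/{path.stem}.mp3"))
    return pairs


names = ["lecture.part1.mp4", "intro.mp4", "notes.txt"]
print("lecture.part1.mp4".split(".")[0])  # lecture
for source, target in plan_conversions(names):
    print(source, "->", target)
# videos/intro.mp4 -> audios/intro.mp3
# videos/lecture.part1.mp4 -> audios/lecture.part1.mp3
```

Each pair can then be passed to subprocess.run(["ffmpeg", "-i", source, target]) in place of the hand-built paths.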
Whisper Audio Transcription & Translation
speech_to_text.py
import whisper
import json
import os

model = whisper.load_model("large-v2")  # load the large-v2 model
audios = os.listdir("audios")  # list all audio files in the audios folder

for audio in audios:
    title = audio.split(".")[0]
    result = model.transcribe(audio=f"audios/{audio}", language="hi", task="translate", word_timestamps=False)
    chunks = []
    for segment in result["segments"]:
        chunks.append({
            "start": segment["start"],
            "end": segment["end"],
            "text": segment["text"].strip(),
            "title": title
        })
    chunks_with_metadata = {"chunks": chunks, "text": result["text"]}
    with open(f"jsons/{title}.json", "w") as f:
        json.dump(chunks_with_metadata, f)
- First, import whisper, json, os imports the required libraries for speech recognition, JSON output, and file handling.
- Next, model = whisper.load_model("large-v2") loads the large-v2 model.
- Then, audios = os.listdir("audios") reads all audio files from the audios folder, which is where video_to_audio.py saved the MP3 files.
- After that, for audio in audios: loops through each audio file one by one.
- Next, title = audio.split(".")[0] extracts the file name (without extension) to use as a title.
- Then, model.transcribe(...) transcribes Hindi audio and translates it into English. word_timestamps=False keeps the output lighter and faster.
- After that, the loop over result["segments"] extracts the start time, end time, and text for each segment.
- Next, each segment is stored as a chunk with metadata (title + timestamps).
- Then, all chunks and the full text are combined into one JSON object.
- Finally, the output is saved as a JSON file inside the jsons folder.
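Once the JSON files exist, the chunks can be searched to find where a topic is mentioned in a video. The sketch below reads files in the chunk layout produced by speech_to_text.py. The helper name find_chunks is an assumption, and the demo writes a made-up file to a temporary folder so that it runs without real transcripts.

```python
import json
import tempfile
from pathlib import Path


def find_chunks(json_dir, keyword):
    """Return (title, start, text) for every chunk whose text contains keyword."""
    matches = []
    for json_path in sorted(Path(json_dir).glob("*.json")):
        data = json.loads(json_path.read_text(encoding="utf-8"))
        for chunk in data["chunks"]:
            if keyword.lower() in chunk["text"].lower():
                matches.append((chunk["title"], chunk["start"], chunk["text"]))
    return matches


# Demo on a throwaway folder with made-up data in the same layout.
with tempfile.TemporaryDirectory() as demo_dir:
    sample = {
        "chunks": [
            {"start": 0.0, "end": 2.5, "text": "Welcome to Python.", "title": "lesson1"},
            {"start": 2.5, "end": 5.0, "text": "Today we cover loops.", "title": "lesson1"},
        ],
        "text": "Welcome to Python. Today we cover loops.",
    }
    Path(demo_dir, "lesson1.json").write_text(json.dumps(sample), encoding="utf-8")
    print(find_chunks(demo_dir, "loops"))
# [('lesson1', 2.5, 'Today we cover loops.')]
```

For the project above, find_chunks("jsons", "keyword") would return the video title and start time of every matching segment.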