OpenAI Whisper Python

What is OpenAI Whisper?

Key Features of Whisper Model

Supports 99+ languages
Works offline after installation
Open-source and free to use
Easy integration with Python
Supports transcription and translation
High accuracy for noisy audio

Advantages of Using OpenAI Whisper for Audio Transcription

Accurate speech recognition
Supports Indian languages
Free and open-source
Works without internet
Easy Python implementation

Limitations of OpenAI Whisper

Large models need high RAM
CPU transcription is slow
GPU recommended for long audio
Not real-time by default

Whisper Transcription vs Translation

1. Speech-to-Text Transcription Using Whisper

Converts speech to text
Output language = audio language

Example:
Hindi audio → Hindi text

2. Language Translation with OpenAI Whisper

Converts speech to English text
Output language = English

Example:
Hindi audio → English text
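
In code, the two modes differ only in the task argument passed to model.transcribe(). The following is a minimal sketch, assuming the openai-whisper package is installed; the helper names decode_options and run_whisper, the default model name, and the file name hindi.mp3 are hypothetical and chosen only for illustration.

```python
VALID_TASKS = ("transcribe", "translate")


def decode_options(task, language=None):
    """Build keyword arguments for model.transcribe().

    task="transcribe" keeps the spoken language; task="translate" outputs English.
    language is an optional code such as "hi"; None lets Whisper detect it.
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    options = {"task": task}
    if language is not None:
        options["language"] = language
    return options


def run_whisper(audio_path, task, language=None, model_name="small"):
    """Run Whisper on one file and return the recognized text."""
    import whisper  # imported lazily so decode_options() works without Whisper installed

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, **decode_options(task, language))
    return result["text"]


# Hypothetical usage with a Hindi recording:
# run_whisper("hindi.mp3", "transcribe", "hi")  # Hindi text
# run_whisper("hindi.mp3", "translate", "hi")   # English text
```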

Available Whisper Models and Supported Languages

Size	Parameters	English-only model	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny.en`	`tiny`	1 GB	10x
base	74 M	`base.en`	`base`	1 GB	7x
small	244 M	`small.en`	`small`	2 GB	4x
medium	769 M	`medium.en`	`medium`	5 GB	2x
large	1550 M	N/A	`large`	10 GB	1x
turbo	809 M	N/A	`turbo`	6 GB	8x

The ‘turbo’ model, which is the default of the whisper command-line tool, is an optimized version of the large model. It is great if you just want to write down speech in the language it was spoken in, including English.

However, ‘turbo’ is not trained for translation.

If you have audio in a different language (like Spanish or French) and need it turned into English text, do not use ‘turbo’. Instead, pick one of the other multilingual models on the list, such as ‘small’, ‘medium’, or ‘large’.
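
The selection rule above can be sketched as a small helper. This is only an illustration: the function name pick_model is hypothetical, the VRAM figures are the approximate values from the table, and the rule "take the last entry that fits" is one possible policy, not something Whisper provides.

```python
# (name, approximate VRAM in GB, usable for translation), taken from the table above
# and ordered by VRAM requirement.
MODELS = [
    ("tiny", 1, True),
    ("base", 1, True),
    ("small", 2, True),
    ("medium", 5, True),
    ("turbo", 6, False),  # turbo is not trained for translation
    ("large", 10, True),
]


def pick_model(vram_gb, task="transcribe"):
    """Return the last listed model that fits in vram_gb and supports the task."""
    candidates = [
        name
        for name, needed_gb, can_translate in MODELS
        if needed_gb <= vram_gb and (task != "translate" or can_translate)
    ]
    if not candidates:
        raise ValueError(f"no listed model fits in {vram_gb} GB of VRAM")
    return candidates[-1]


print(pick_model(6))               # turbo
print(pick_model(6, "translate"))  # medium
```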

Using OpenAI Whisper in Python

Step 1: Install Whisper Library

				
					pip install -U openai-whisper

Step 2: Install FFmpeg (Required)

Whisper uses FFmpeg to read audio files.

Windows:

Download FFmpeg from the official website and choose the ffmpeg-release-full.7z archive.
Extract the downloaded archive.
Rename the extracted folder to simply ffmpeg and move it to C:\.
Your path should look like this: C:\ffmpeg\ . Inside this folder, you should see a bin folder: C:\ffmpeg\bin\

Setting System Environment PATH for FFmpeg

This is the critical step that allows you to run ffmpeg from any terminal window.

Press the Windows Key and type env.
Select Edit the system environment variables.
Click the Environment Variables… button at the bottom right.
Under System variables (the bottom box), find the variable named Path and select Edit.
Click New and paste the path to your bin folder: C:\ffmpeg\bin
Click OK on all open windows to save.

Linux / macOS

				
					sudo apt install ffmpeg   # Linux (Debian/Ubuntu)
brew install ffmpeg       # macOS (Homebrew)
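
After installing, it helps to confirm that FFmpeg is really reachable from Python, since that is how Whisper will call it. A minimal sketch using only the standard library; the function names find_ffmpeg and ffmpeg_version_line are hypothetical.

```python
import shutil
import subprocess


def find_ffmpeg():
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


def ffmpeg_version_line():
    """Return the first line of `ffmpeg -version`, or None if FFmpeg is missing."""
    path = find_ffmpeg()
    if path is None:
        return None
    completed = subprocess.run([path, "-version"], capture_output=True, text=True)
    lines = completed.stdout.splitlines()
    return lines[0] if lines else None


print(ffmpeg_version_line() or "FFmpeg not found: check your PATH setting")
```

If the message says FFmpeg was not found on Windows, open a new terminal window first, because PATH changes only apply to terminals started after the change.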

Audio Transcription Using OpenAI Whisper

Converting Audio to Text Using OpenAI Whisper in Python (Other Language to English)

				
					import whisper

				
					model = whisper.load_model("large-v2")

				
					result = model.transcribe(audio="audio.mp3", language="hi", task="translate")

The output of the model.transcribe() method looks like this:

				
					{
    "text": "Welcome to the tutorial on Python programming.",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 2.5,
            "text": "Welcome to the tutorial on Python programming.",
            "tokens": [50364, 3450, ...],
            "temperature": 0.0,
            "avg_logprob": -0.28,
            "compression_ratio": 1.1,
            "no_speech_prob": 0.04
        }
    ],
    "language": "hi"
}

				
					print(result["text"])
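
Besides the full text, the segments list is often the most useful part of the result. The sketch below turns each segment into a timestamped line. The helper names format_timestamp and segment_lines are hypothetical, and sample_result is a trimmed copy of the sample output above rather than real model output.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an MM:SS.mmm string."""
    total_ms = round(seconds * 1000)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def segment_lines(result):
    """Return one '[start -> end] text' line per segment."""
    lines = []
    for segment in result["segments"]:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} -> {end}] {segment['text'].strip()}")
    return lines


# Trimmed copy of the sample output shown above (only the keys used here).
sample_result = {
    "text": "Welcome to the tutorial on Python programming.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": "Welcome to the tutorial on Python programming."}
    ],
}

for line in segment_lines(sample_result):
    print(line)
# [00:00.000 -> 00:02.500] Welcome to the tutorial on Python programming.
```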

Full Python Code

				
					import whisper

model = whisper.load_model("large-v2")
result = model.transcribe(audio="audio.mp3", language="hi", task="translate")
print(result["text"])

Supported Audio Formats

Whisper supports most common audio formats:

MP3
WAV
M4A
FLAC
OGG

Using GPU with OpenAI Whisper for Faster Transcription

Why You Should Use a GPU for Speech-to-Text Tasks

If you have a dedicated NVIDIA GPU available, you should absolutely use it instead of your CPU.

Whisper comes in several sizes: tiny, base, small, medium, large, and turbo. Larger models are generally more accurate but slower.
The large model is the most accurate, but also the slowest.
When you want to use the medium or large model, use a GPU instead of a CPU.
A GPU handles transcription and translation much faster than a CPU.
GPUs are designed for parallel processing, which makes them well suited to AI workloads.
In addition, NVIDIA GPUs support CUDA, which PyTorch (the framework Whisper runs on) uses for acceleration.
GPUs also process long audio files much faster.
Consequently, response time and throughput improve greatly.

CPU vs GPU Performance for Whisper Transcription

Feature	CPU	GPU
Processing Style	Serial (One by one)	Parallel (All at once)
Large Model Support	Very Poor / Slow	Excellent
Math Precision	FP32 (Slower)	FP16 (Faster)
Best Use Case	Short, low-quality audio	Long files & high accuracy
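
The precision row maps to the fp16 decoding option that model.transcribe() accepts. On a CPU, Whisper falls back to FP32 and prints a warning, so passing the option explicitly keeps the output quiet. A minimal sketch; the helper name precision_options is hypothetical.

```python
def precision_options(device):
    """Return decoding options: half precision on CUDA, full precision elsewhere."""
    return {"fp16": str(device).startswith("cuda")}


print(precision_options("cuda"))  # {'fp16': True}
print(precision_options("cpu"))   # {'fp16': False}

# Hypothetical usage with a loaded model:
# result = model.transcribe("audio.mp3", **precision_options(device))
```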

Python Code with Explanation

1️⃣: Install PyTorch with CUDA (GPU Support)

Install PyTorch based on your CUDA version.

Example (CUDA 11.8):

				
					pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

👉 This step is mandatory for GPU acceleration.

2️⃣: Verify GPU is Available

				
					import torch

print(torch.cuda.is_available())

Output :

				
					True

✅ If True, GPU is ready
❌ If False, Whisper will use CPU
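
For a slightly more informative check, you can also print the name of the GPU that PyTorch sees. This is a sketch under the assumption that the first CUDA device (index 0) is the one you want; the function name describe_device is hypothetical.

```python
def describe_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch
    except ImportError:
        return "cpu (PyTorch is not installed)"
    if torch.cuda.is_available():
        # Index 0 is the first CUDA device visible to PyTorch.
        return f"cuda ({torch.cuda.get_device_name(0)})"
    return "cpu (no CUDA device visible to PyTorch)"


print(describe_device())
```

If this reports cpu on a machine with an NVIDIA GPU, the usual cause is a CPU-only PyTorch build, which is why the CUDA-specific install in step 1 matters.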

3️⃣: Import Required Libraries

First, import whisper
→ Imports the Whisper library for speech-to-text and translation.
Next, import torch
→ Used to check whether GPU (CUDA) is available.
Then, import json
→ Helps save transcription output in a JSON file.

4️⃣: Check Whether GPU Is Available

				
					device = "cuda" if torch.cuda.is_available() else "cpu"

First, torch.cuda.is_available() checks if an NVIDIA GPU is present.
If yes, cuda is selected and Whisper runs on GPU.
Otherwise, Whisper automatically falls back to CPU.

				
					print(f"Using device: {device}")

As a result, this line prints which device is being used.
This is useful for debugging and confirmation.

5️⃣: Load Whisper Model on Selected Device

				
					model = whisper.load_model("large-v2", device=device)

Next, the large-v2 Whisper model is loaded.
Importantly, the model is loaded directly on the selected device:
- GPU if available
- CPU if GPU is not available
Therefore, transcription speed depends on the device.

6️⃣: Transcribe and Translate Audio

				
					result = model.transcribe(
    audio="audio.mp3",
    language="hi",
    task="translate"
)

First, "audio.mp3" is the input audio file.
Then, language="hi" tells Whisper the audio language is Hindi.
After that, task="translate" instructs Whisper to:
- Convert Hindi speech
- Into English text
As a result, the output text is always in English.

7️⃣: Save Full Transcription Output to JSON File

				
					with open("transcription.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=4)

Next, a file named transcription.json is created.
Then, the full Whisper output is saved, including:
- Transcribed text
- Language info
- Segments
- Timestamps
Additionally, encoding="utf-8" ensures proper support for Unicode text.
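
Once the file exists, it can be read back without running Whisper again. The sketch below saves a stand-in result and loads its segments. The helper names save_result and load_segments are hypothetical, and demo_result is made-up data in the same shape as Whisper's output. The optional ensure_ascii=False argument matters mostly when you transcribe (rather than translate) non-English audio, because it stores the text as readable characters.

```python
import json
import os
import tempfile


def save_result(result, path):
    """Write a Whisper result dictionary to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False stores non-English text as readable characters
        # instead of \uXXXX escapes.
        json.dump(result, f, indent=4, ensure_ascii=False)


def load_segments(path):
    """Read the file back and return (start, end, text) for each segment."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [(seg["start"], seg["end"], seg["text"].strip()) for seg in data["segments"]]


# Stand-in for the dictionary returned by model.transcribe() (illustrative values).
demo_result = {
    "text": "नमस्ते",
    "segments": [{"id": 0, "start": 0.0, "end": 1.2, "text": " नमस्ते"}],
    "language": "hi",
}

# A temporary folder keeps the demo from touching real files.
demo_path = os.path.join(tempfile.mkdtemp(), "transcription.json")
save_result(demo_result, demo_path)
segments = load_segments(demo_path)
```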

8️⃣: Print Final Translated Text

				
					print(result["text"])

Finally, only the translated text is printed.
This is useful when:
- You only want readable output
- Not the full JSON data

9️⃣: Complete Code

				
					import whisper
import torch
import json

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

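# Load the model on the selected device (GPU if available, otherwise CPU)
model = whisper.load_model("large-v2", device=device)

# Transcribe the Hindi audio and translate it into English
result = model.transcribe(audio="audio.mp3", language="hi", task="translate")

# Save the full output to a JSON file
with open("transcription.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=4)

# Print only the translated text
print(result["text"])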

Transcribing Video to Text Using OpenAI Whisper

✅ Overall Flow (Big Picture)

First, convert your video (.mp4) into audio (.mp3)
Next, install and configure Whisper
Then, load the large-v2 model
Finally, transcribe and translate the audio into English

✅ Why Convert MP4 to MP3 Before Video Transcription?

Faster Processing

MP3 files are much smaller than MP4 video files.
Therefore, Whisper can load and process audio more quickly.
As a result, transcription time is reduced.

Lower System Resource Usage

MP4 contains both video and audio streams.
However, Whisper only needs audio.
So, converting to MP3 avoids reading and storing video data that Whisper never uses.

Better Compatibility

Some MP4 files use unsupported or unstable audio codecs.
In contrast, MP3 is widely supported.
Hence, you avoid decoding errors.

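Improved Transcription Accuracy

Converting to MP3 does not remove noise by itself. However, once the audio is a separate file, you can clean it before transcription:
- Background noise can be reduced.
- Audio clarity can be improved.
Consequently, Whisper can produce more accurate text.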

Easier Preprocessing

With MP3, you can easily:
- Normalize volume
- Remove noise
- Trim silence
Therefore, you can optimize audio before transcription.
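
As one example of such preprocessing, FFmpeg can drop the video stream and write mono audio at 16 kHz, which is the format Whisper converts audio to internally. The sketch below only builds the command; the function name extract_audio_command and the file names are hypothetical.

```python
def extract_audio_command(video_path, audio_path, sample_rate=16000):
    """Build an FFmpeg command that writes mono audio at the given sample rate."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", video_path,         # input video
        "-vn",                    # ignore the video stream
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # audio sample rate in Hz
        audio_path,
    ]


command = extract_audio_command("videos/lesson1.mp4", "audios/lesson1.mp3")
print(" ".join(command))
# ffmpeg -y -i videos/lesson1.mp4 -vn -ac 1 -ar 16000 audios/lesson1.mp3
```

To actually run it, pass the list to subprocess.run(command, check=True). Downsampling this way mainly makes the files smaller; it is not expected to change accuracy, since Whisper resamples its input anyway.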

Better Storage Management

MP4 files are large and consume more disk space.
On the other hand, MP3 files are compact.
So, managing datasets becomes easier.

✅ Python Program to Convert Video to Text Using Whisper

1️⃣: Folder Structure

my_project/
├── videos/
│
├── jsons/
│   
├── audios/
│   
├── video_to_audio.py
└── speech_to_text.py

The setup of FFmpeg and Whisper is covered in the earlier sections.
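
The three folders must exist before the scripts run, because neither FFmpeg nor open() creates missing folders. A minimal sketch that creates them; the function name create_project_folders is hypothetical, and the demo uses a temporary folder so that running it does not touch your real project.

```python
import tempfile
from pathlib import Path


def create_project_folders(root):
    """Create the videos, audios, and jsons folders under root if they are missing."""
    root = Path(root)
    for name in ("videos", "audios", "jsons"):
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Demo in a temporary folder; in a real project, pass "my_project" or "." instead.
created = create_project_folders(tempfile.mkdtemp())
print(created)  # ['audios', 'jsons', 'videos']
```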

2️⃣: Convert MP4 to MP3 Using FFmpeg

Write a program that converts video files (.mp4) to audio files (.mp3) using FFmpeg.

Put all your video files into the videos folder.

Import Required Modules

				
					import os
import subprocess

First, import os
→ Used to interact with folders and files on your system.
Next, import subprocess
→ Allows Python to run external commands, such as FFmpeg.

List All Video Files

				
					files = os.listdir("videos")
print(files)

The first line reads the names of all entries inside the videos folder.
As a result, the file names are stored in the files list.
The second line prints all video file names.
This is helpful to confirm which files will be processed.
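
Note that os.listdir() returns every entry, including hidden files and subfolders. If the videos folder may contain anything other than videos, a filter like the following can help. It is a sketch only: the function name list_videos and the extension set are assumptions, and the demo runs in a temporary folder with empty placeholder files.

```python
import os
import tempfile

# Assumed set of video extensions; adjust it to the formats you actually use.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}


def list_videos(folder):
    """Return the sorted names of video files directly inside folder."""
    names = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        extension = os.path.splitext(name)[1].lower()
        if os.path.isfile(full_path) and extension in VIDEO_EXTENSIONS:
            names.append(name)
    return names


# Demo with empty placeholder files in a temporary folder.
demo_folder = tempfile.mkdtemp()
for name in ("lesson1.mp4", "Lesson2.MKV", "notes.txt", ".DS_Store"):
    open(os.path.join(demo_folder, name), "w").close()

found = list_videos(demo_folder)
print(found)  # ['Lesson2.MKV', 'lesson1.mp4']
```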

Loop Through All Files and Convert Them to MP3 Using FFmpeg

				
					for file in files:
    filename = file.split(".")[0]
    subprocess.run(["ffmpeg", "-i", f"videos/{file}", f"audios/{filename}.mp3"])

Loop Through All Files

First, for file in files: starts a loop.
As a result, each video file inside the files list is processed one by one.
This makes the conversion automatic and scalable.

Extract File Name Without Extension

Next, filename = file.split(".")[0]
First, the file name is split at each dot (.) character.
Then, the part before the first dot is selected.
As a result, the extension (.mp4, .mkv, etc.) is removed. Note that a name containing extra dots, such as my.video.mp4, is shortened to my.

Run FFmpeg Command Using subprocess

Then, subprocess.run([...]) executes an external command.
Specifically, it runs the FFmpeg tool from Python.
"ffmpeg": Starts the FFmpeg program.
"-i", f"videos/{file}": Specifies the input video file from the videos folder.
f"audios/{filename}.mp3": Specifies the output file. FFmpeg picks the MP3 format from the .mp3 extension.
As a result, audio is extracted from the video and saved as an MP3 file in the audios folder.
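
A somewhat more defensive variant of the same step is sketched below. It uses Path.stem, which removes only the last extension, and check=True, which raises an error when FFmpeg fails instead of continuing silently. The function names audio_path_for and convert_video are hypothetical.

```python
import subprocess
from pathlib import Path


def audio_path_for(video_name, audio_dir="audios"):
    """Return the MP3 path for a video name, removing only the last extension."""
    return (Path(audio_dir) / (Path(video_name).stem + ".mp3")).as_posix()


def convert_video(video_name, video_dir="videos", audio_dir="audios"):
    """Extract the audio of one video into audio_dir using FFmpeg."""
    Path(audio_dir).mkdir(parents=True, exist_ok=True)
    command = [
        "ffmpeg",
        "-y",  # overwrite an existing output file without asking
        "-i", (Path(video_dir) / video_name).as_posix(),
        audio_path_for(video_name, audio_dir),
    ]
    # check=True raises subprocess.CalledProcessError if FFmpeg exits with an error.
    subprocess.run(command, check=True)


print(audio_path_for("my.lecture.v2.mp4"))   # audios/my.lecture.v2.mp3
print("my.lecture.v2.mp4".split(".")[0])     # my

# Hypothetical usage: convert_video("lesson1.mp4")
```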

Complete Code

video_to_audio.py

				
					# Convert videos to MP3
import os
import subprocess

files = os.listdir("videos")
print(files)
for file in files:
    filename = file.split(".")[0]
    subprocess.run(["ffmpeg","-i",f"videos/{file}", f"audios/{filename}.mp3"])

Whisper Audio Transcription & Translation

speech_to_text.py

				
					import whisper
import json
import os

model = whisper.load_model("large-v2")  # load the large-v2 model
audios = os.listdir("audios")  # list all audio files created by video_to_audio.py


for audio in audios:
    title = audio.split(".")[0]
    result = model.transcribe(audio=f"audios/{audio}", language="hi", task="translate", word_timestamps=False)
    chunks = []
    for segment in result["segments"]:
        chunks.append({
            "start": segment["start"],
            "end": segment["end"],
            "text": segment["text"].strip(),
            "title": title
        })
    chunks_with_metadata = {"chunks": chunks, "text": result["text"]}
    with open(f"jsons/{title}.json", "w", encoding="utf-8") as f:
        json.dump(chunks_with_metadata, f)

First, import whisper, json, os
→ Imports required libraries for speech recognition, JSON output, and file handling.
Next, model = whisper.load_model("large-v2")
→ Loads the large-v2 Whisper model.
Then, audios = os.listdir("audios")
→ Reads all audio file names from the audios folder, where video_to_audio.py saved the MP3 files.
After that, for audio in audios:
→ Loops through each audio file one by one.
Next, title = audio.split(".")[0]
→ Extracts the file name (without extension) to use as a title.
Then, model.transcribe(...)
→ Transcribes Hindi audio and translates it into English.
→ word_timestamps=False (the default) skips word-level timestamps, which keeps the output lighter and the run faster.
After that, loop over result["segments"]
→ Extracts start time, end time, and text for each segment.
Next, each segment is stored as a chunk with metadata (title + timestamps).
Then, all chunks and full text are combined into one JSON object.
Finally, the output is saved as a JSON file inside the jsons folder.
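
One thing the chunk files make easy is finding where in a video something was said. The sketch below searches every JSON file in a folder for a keyword and returns the matching title, start time, and text. The function name search_chunks is hypothetical, and the demo writes one made-up file into a temporary folder instead of reading real transcriptions.

```python
import json
import tempfile
from pathlib import Path


def search_chunks(json_dir, keyword):
    """Return (title, start, text) for every chunk whose text contains keyword."""
    keyword = keyword.lower()
    matches = []
    for json_path in sorted(Path(json_dir).glob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            data = json.load(f)
        for chunk in data["chunks"]:
            if keyword in chunk["text"].lower():
                matches.append((chunk["title"], chunk["start"], chunk["text"]))
    return matches


# Made-up demo data in the same shape that speech_to_text.py writes.
demo_dir = Path(tempfile.mkdtemp())
demo_data = {
    "chunks": [
        {"start": 0.0, "end": 2.5, "text": "Welcome to the tutorial on Python programming.", "title": "lesson1"},
        {"start": 2.5, "end": 5.0, "text": "Today we install Whisper.", "title": "lesson1"},
    ],
    "text": "Welcome to the tutorial on Python programming. Today we install Whisper.",
}
with open(demo_dir / "lesson1.json", "w", encoding="utf-8") as f:
    json.dump(demo_data, f)

matches = search_chunks(demo_dir, "whisper")
print(matches)  # [('lesson1', 2.5, 'Today we install Whisper.')]
```

In a real project, call search_chunks("jsons", "your keyword") after speech_to_text.py has finished.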
