Speech to Text with OpenAI Whisper In Python


Speech to text in Python allows you to convert audio and video files into readable text automatically. This is useful for tasks like audio transcription, video subtitles, meeting notes, and voice-based applications. In this beginner-friendly tutorial, you will learn how to use OpenAI Whisper to perform accurate speech to text in Python.

OpenAI Whisper converts spoken language from audio or video files into clear, readable text with high accuracy.

In simple words 👇
Whisper listens to audio and converts the spoken words into text using Python-based speech-to-text processing.

It supports multiple languages, handles background noise, and works well with real-world audio like meetings, podcasts, and videos. Because of this, OpenAI Whisper is widely used for audio transcription and speech to text tasks in Python applications.

By the end of this tutorial, you will understand how to install Whisper, load audio files, and generate text output using Python. This tutorial is designed for beginners, so no prior experience with speech recognition is required.

Whisper is widely used by developers for:

  • Audio to text conversion
  • Speech recognition in Python
  • Multilingual transcription
  • Audio translation to English


What is OpenAI Whisper?

Key Features of Whisper Model

  • Supports 99+ languages
  • Works offline after installation
  • Open-source and free to use
  • Easy integration with Python
  • Supports transcription and translation
  • High accuracy for noisy audio

Advantages of Using OpenAI Whisper for Audio Transcription

  • Accurate speech recognition
  • Supports Indian languages
  • Free and open-source
  • Works without internet
  • Easy Python implementation

Limitations of OpenAI Whisper

  • Large models need high RAM
  • CPU transcription is slow
  • GPU recommended for long audio
  • Not real-time by default

Whisper Transcription vs Translation

1. Speech-to-Text Transcription Using Whisper

  • Converts speech to text
  • Output language = audio language

Example:
    Hindi audio → Hindi text

 

2. Language Translation with OpenAI Whisper

  • Converts speech to English text
  • Output language = English

Example:
   Hindi audio → English text
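
The difference between the two modes comes down to the task value passed to model.transcribe(). The sketch below shows one way to build those options; build_options is a hypothetical helper introduced here for illustration and is not part of the Whisper library.

```python
VALID_TASKS = {"transcribe", "translate"}


def build_options(task, language=None):
    """Return keyword arguments for model.transcribe() (hypothetical helper)."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    options = {"task": task}
    if language is not None:
        # Passing the source language skips Whisper's automatic language detection.
        options["language"] = language
    return options


# Hindi audio -> Hindi text
print(build_options("transcribe", language="hi"))
# Hindi audio -> English text
print(build_options("translate", language="hi"))

# With a loaded model, the options would be passed like this:
# result = model.transcribe("audio.mp3", **build_options("translate", language="hi"))
```

Keeping the options in one place makes it harder to mix up the two tasks when the same script handles both.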

Available Whisper Models and Supported Languages

Size    Parameters  English-only model  Multilingual model  Required VRAM  Relative speed
tiny    39 M        tiny.en             tiny                1 GB           10x
base    74 M        base.en             base                1 GB           7x
small   244 M       small.en            small               2 GB           4x
medium  769 M       medium.en           medium              5 GB           2x
large   1550 M      N/A                 large               10 GB          1x
turbo   809 M       N/A                 turbo               6 GB           8x

The default ‘turbo’ model is a good choice when you only need transcription, that is, text in the same language as the audio.

However, ‘turbo’ is not trained for translation.

If you have audio in a different language (like Spanish or French) and need it turned into English text, do not use turbo. Instead, pick one of the other multilingual models on the list, such as ‘small’, ‘medium’, or ‘large’.
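
The selection rule above can be written down as a small lookup. This is only a sketch: pick_model and the MODELS list are hypothetical names, the VRAM figures are copied from the table, and the "larger first" preference order is an assumption rather than an official ranking.

```python
# (model name, required VRAM in GB, usable for translation), larger models first.
# VRAM values come from the table above; the ordering is an assumed preference.
MODELS = [
    ("large", 10, True),
    ("turbo", 6, False),   # turbo is not meant for translation
    ("medium", 5, True),
    ("small", 2, True),
    ("base", 1, True),
    ("tiny", 1, True),
]


def pick_model(task, vram_gb):
    """Return the first model that fits in vram_gb and supports the task, else None."""
    for name, required_gb, can_translate in MODELS:
        if task == "translate" and not can_translate:
            continue
        if required_gb <= vram_gb:
            return name
    return None


print(pick_model("transcribe", 6))  # turbo fits in 6 GB
print(pick_model("translate", 6))   # turbo is skipped, so medium is chosen
```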

Using OpenAI Whisper in Python

Step 1: Install Whisper Library

				
					pip install -U openai-whisper
				
			

Step 2: Install FFmpeg (Required)

Whisper uses FFmpeg to read audio files.

Windows :

  1. Download FFmpeg: visit the official website and download ffmpeg-release-full.7z.
  2. Extract the downloaded archive.
  3. Rename the extracted folder to simply ffmpeg and move it to the root of the C: drive.
  4. Your path should look like this: C:\ffmpeg\ . Inside this folder, you should see a bin folder: C:\ffmpeg\bin\

Setting System Environment PATH for FFmpeg

This is the critical step that allows you to run ffmpeg from any terminal window.

  1. Press the Windows Key and type env.
  2. Select Edit the system environment variables.
  3. Click the Environment Variables… button at the bottom right.
  4. Under System variables (the bottom box), find the variable named Path and select Edit.
  5. Click New and paste the path to your bin folder: C:\ffmpeg\bin
  6. Click OK on all open windows to save. 

Linux / macOS

				
					sudo apt install ffmpeg   # Debian / Ubuntu
brew install ffmpeg       # macOS (Homebrew)
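
Before moving on, it can help to confirm from Python that FFmpeg is actually reachable on PATH. The following is a minimal sketch using only the standard library; find_ffmpeg and ffmpeg_version_line are hypothetical helper names.

```python
import shutil
import subprocess


def find_ffmpeg():
    """Return the full path to the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


def ffmpeg_version_line():
    """Return the first line printed by `ffmpeg -version`, or None if FFmpeg is missing."""
    path = find_ffmpeg()
    if path is None:
        return None
    completed = subprocess.run([path, "-version"], capture_output=True, text=True)
    lines = completed.stdout.splitlines()
    return lines[0] if lines else None


if find_ffmpeg() is None:
    print("FFmpeg was not found on PATH; check the installation steps above.")
else:
    print(ffmpeg_version_line())
```

On Windows, open a new terminal after editing PATH, since already-open terminals keep the old value.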

				
			

Audio Transcription Using OpenAI Whisper

Converting Audio to Text Using OpenAI Whisper in Python (Other Language to English)

				
					import whisper

				
			
				
					model = whisper.load_model("large-v2")

				
			
				
					result = model.transcribe(audio="audio.mp3", language="hi", task="translate")

				
			

The output of model.transcribe() looks like this:

				
					{
    "text": "Welcome to the tutorial on Python programming.",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 2.5,
            "text": "Welcome to the tutorial on Python programming.",
            "tokens": [50364, 3450, ...],
            "temperature": 0.0,
            "avg_logprob": -0.28,
            "compression_ratio": 1.1,
            "no_speech_prob": 0.04
        }
    ],
    "language": "hi"
}
				
			
				
					print(result["text"])
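
Beyond the full text, the segments list carries the timestamps. The sketch below prints one line per segment; format_segments is a hypothetical helper, and sample_result is hand-written data that only mirrors the shape shown above, not real model output.

```python
def format_segments(result):
    """Turn a Whisper-style result dict into '[start -> end] text' lines."""
    lines = []
    for segment in result["segments"]:
        start = segment["start"]
        end = segment["end"]
        text = segment["text"].strip()
        lines.append(f"[{start:.2f}s -> {end:.2f}s] {text}")
    return lines


# Hand-written sample with the same keys as the output above.
sample_result = {
    "text": "Welcome to the tutorial on Python programming.",
    "segments": [
        {
            "id": 0,
            "start": 0.0,
            "end": 2.5,
            "text": " Welcome to the tutorial on Python programming.",
        }
    ],
}

for line in format_segments(sample_result):
    print(line)
```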

				
			

Full Python Code

				
					import whisper

model = whisper.load_model("large-v2")
result = model.transcribe(audio="audio.mp3",language="hi",task="translate")
print(result["text"])

				
			

Supported Audio Formats

Whisper supports most common audio formats:

  • MP3
  • WAV
  • M4A
  • FLAC
  • OGG

Using GPU with OpenAI Whisper for Faster Transcription

Why You Should Use a GPU for Speech-to-Text Tasks

If you have a dedicated NVIDIA GPU available, you should absolutely use it instead of your CPU.

  • Whisper comes in several sizes: tiny, base, small, medium, large, and turbo. Larger models are generally more accurate but slower.
  • The large model is the most accurate but also the slowest.
  • For the medium and large models, use a GPU instead of a CPU.
  • A GPU handles transcription and translation much faster than a CPU.
  • GPUs are designed for parallel processing, which makes them well suited to AI workloads.
  • In addition, NVIDIA GPUs support CUDA, which PyTorch uses to accelerate Whisper.
  • GPUs process long audio files much faster.
  • Consequently, response time and throughput improve greatly.

CPU vs GPU Performance for Whisper Transcription

Feature              CPU                       GPU
Processing Style     Serial (One by one)       Parallel (All at once)
Large Model Support  Very Poor / Slow          Excellent
Math Precision       FP32 (Slower)             FP16 (Faster)
Best Use Case        Short, low-quality audio  Long files & high accuracy

Python Code with Explanation

1️⃣: Install PyTorch with CUDA (GPU Support)

Install PyTorch based on your CUDA version.

Example (CUDA 11.8):

				
					pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

				
			

👉 This step is mandatory for GPU acceleration.

2️⃣: Verify GPU is Available

				
					import torch

print(torch.cuda.is_available())

				
			

Output :

				
					True
				
			

✅ If True, GPU is ready
❌ If False, Whisper will use CPU

3️⃣: Import Required Libraries

  • First, import whisper
    → Imports the Whisper library for speech-to-text and translation.

  • Next, import torch
    → Used to check whether GPU (CUDA) is available.

  • Then, import json
    → Helps save transcription output in a JSON file.

4️⃣: Check Whether GPU Is Available

				
					device = "cuda" if torch.cuda.is_available() else "cpu"

				
			
  • First, torch.cuda.is_available() checks if an NVIDIA GPU is present.
  • If yes, cuda is selected and Whisper runs on GPU.
  • Otherwise, Whisper automatically falls back to CPU.
				
					print(f"Using device: {device}")

				
			
  • As a result, this line prints which device is being used.
  • This is useful for debugging and confirmation.

5️⃣: Load Whisper Model on Selected Device

				
					model = whisper.load_model("large-v2", device=device)

				
			
  • Next, the large-v2 Whisper model is loaded.
  • Importantly, the model is loaded directly on the selected device:
    • GPU if available
    • CPU if GPU is not available
  • Therefore, transcription speed depends on the device.
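
The device choice also affects numeric precision: Whisper decodes in FP16 by default, which is not supported on CPU. The sketch below picks a device and a matching fp16 option. pick_device and transcribe_options are hypothetical helpers, and the fallback to "cpu" when PyTorch is not installed is an assumption made so the snippet runs anywhere.

```python
def pick_device():
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch we simply report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def transcribe_options(device):
    """Use FP16 on the GPU and FP32 on the CPU."""
    return {"fp16": device == "cuda"}


device = pick_device()
print(f"Using device: {device}")
print(transcribe_options(device))

# With Whisper installed, these values would be used like this:
# model = whisper.load_model("large-v2", device=device)
# result = model.transcribe("audio.mp3", **transcribe_options(device))
```

Setting fp16 explicitly on CPU avoids the warning Whisper prints when it has to fall back to FP32.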

6️⃣: Transcribe and Translate Audio

				
					result = model.transcribe(
    audio="audio.mp3",
    language="hi",
    task="translate"
)

				
			
  • First, "audio.mp3" is the input audio file.
  • Then, language="hi" tells Whisper the audio language is Hindi.
  • After that, task="translate" instructs Whisper to:
    • Convert Hindi speech
    • Into English text
  • As a result, the output text is always in English.

7️⃣: Save Full Transcription Output to JSON File

				
					with open("transcription.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=4)

				
			
  • Next, a file named transcription.json is created.
  • Then, the full Whisper output is saved, including:
    • Transcribed text
    • Language info
    • Segments
    • Timestamps
  • Additionally, encoding="utf-8" ensures proper support for Unicode text.
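
If the result contains non-English text (for example, Hindi from task="transcribe"), json.dump escapes it as \u sequences by default. The sketch below, which uses a hand-written sample instead of real Whisper output, shows one way to keep the text readable with ensure_ascii=False and to load the file back. save_result and load_result are hypothetical helper names.

```python
import json
import os
import tempfile


def save_result(result, path):
    """Write a result dict as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=4, ensure_ascii=False)


def load_result(path):
    """Read a result dict back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hand-written sample shaped like a Whisper result.
sample = {
    "text": "नमस्ते",
    "segments": [{"start": 0.0, "end": 1.2, "text": " नमस्ते"}],
    "language": "hi",
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "transcription.json")
    save_result(sample, path)
    with open(path, "r", encoding="utf-8") as f:
        raw_text = f.read()
    restored = load_result(path)

print(restored["text"])
```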

8️⃣: Print Final Translated Text

				
					print(result["text"])

				
			
  • Finally, only the translated text is printed.
  • This is useful when:
    • You only want readable output
    • Not the full JSON data

9️⃣: Complete Code

				
					import whisper
import torch
import json

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load model on the selected device (GPU if available, otherwise CPU)
model = whisper.load_model("large-v2", device=device)

# Transcribe the Hindi audio and translate it into English
result = model.transcribe(audio="audio.mp3", language="hi", task="translate")

# Save the full output to a JSON file
with open("transcription.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=4)

# Print the translated text
print(result["text"])
				
			

Transcribing Video to Text Using OpenAI Whisper

✅ Overall Flow (Big Picture)

  • First, convert your video (.mp4) into audio (.mp3)
  • Next, install and configure Whisper
  • Then, load the large-v2 model
  • Finally, transcribe and translate the audio into English

✅ Why Convert MP4 to MP3 Before Video Transcription?

Faster Processing

  • MP3 files are much smaller than MP4 video files.
  • Therefore, Whisper can load and process audio more quickly.
  • As a result, transcription time is reduced.

Lower System Resource Usage

  • MP4 contains both video and audio streams.
  • However, Whisper only needs audio.
  • So, converting to MP3 saves CPU, GPU, and RAM usage.

Better Compatibility

  • Some MP4 files use unsupported or unstable audio codecs.
  • In contrast, MP3 is widely supported.
  • Hence, you avoid decoding errors.

Improved Transcription Accuracy

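  • Extraction alone does not change the audio quality, but a standalone audio file is easy to clean up:
    • Background noise can be reduced.
    • Audio clarity can be improved.
  • Consequently, Whisper produces more accurate text from the cleaned audio.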

Easier Preprocessing

  • With MP3, you can easily:
    • Normalize volume
    • Remove noise
    • Trim silence
  • Therefore, you can optimize audio before transcription.
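
As one example of light preprocessing, the audio can be extracted as 16 kHz mono, which matches the format Whisper's own loader resamples to. The sketch below only builds the FFmpeg argument list and does not run it. build_extract_command and the file names are hypothetical; -y, -i, -vn, -ac, and -ar are standard FFmpeg options.

```python
def build_extract_command(video_path, audio_path, sample_rate=16000):
    """Build an FFmpeg command that extracts mono audio from a video file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the given rate in Hz
        audio_path,               # output file; format follows the extension
    ]


command = build_extract_command("videos/lesson.mp4", "audios/lesson.mp3")
print(" ".join(command))

# To actually run it (requires FFmpeg on PATH and an existing audios folder):
# import subprocess
# subprocess.run(command, check=True)
```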

Better Storage Management

  • MP4 files are large and consume more disk space.
  • On the other hand, MP3 files are compact.
  • So, managing datasets becomes easier.

✅ Python Program to Convert Video to Text Using Whisper

1️⃣: Folder Structure

my_project/
├── videos/
│
├── jsons/
│   
├── audios/
│   
├── video_to_audio.py
└── speech_to_text.py

The setup of FFmpeg and Whisper is covered in the earlier sections.

2️⃣: Convert MP4 to MP3 Using FFmpeg

You need to write a program that converts video files (.mp4) into audio files (.mp3) using FFmpeg.

  • Put all your video files into the videos folder.
Import Required Modules
				
					import os
import subprocess

				
			
  • First, import os
    → Used to interact with folders and files on your system.

  • Next, import subprocess
    → Allows Python to run external commands, such as FFmpeg.

List All Video Files
				
					files = os.listdir("videos")
print(files)

				
			
  • The first line reads the names of all files inside the videos folder.
  • As a result, the file names are stored in the files list.
  • The second line prints all video file names.
  • This is helpful to confirm which files will be processed.
Loop Through All Files and Convert Them to MP3 Using FFmpeg
				
					for file in files:
    filename = file.split(".")[0]
    subprocess.run(["ffmpeg","-i",f"videos/{file}", f"audios/{filename}.mp3"])
				
			
Loop Through All Files
  • First, for file in files: starts a loop.
  • As a result, each video file inside the files list is processed one by one.
  • This makes the conversion automatic and scalable.
Extract File Name Without Extension
  • Next, filename = file.split(".")[0] builds the output name.
  • First, the file name is split at every dot (.) character.
  • Then, the part before the first dot is selected.
  • As a result, the extension (.mp4, .mkv, etc.) is removed. Note that a name containing extra dots, such as lesson.part1.mp4, is shortened to lesson.

Run FFmpeg Command Using subprocess
  • Then, subprocess.run([...]) executes an external command.
  • Specifically, it runs the FFmpeg tool from Python.
  • Inside the command:
    • "ffmpeg": Starts the FFmpeg program.
    • "-i", f"videos/{file}": Specifies the input video file from the videos folder.
    • f"audios/{filename}.mp3": Specifies the output file; FFmpeg picks the MP3 format from the .mp3 extension.
  • As a result, audio is extracted from the video and saved as an MP3 file in the audios folder, which must already exist.
Complete Code

video_to_audio.py

				
					# Convert videos to MP3
import os
import subprocess

files = os.listdir("videos")
print(files)
for file in files:
    filename = file.split(".")[0]
    subprocess.run(["ffmpeg","-i",f"videos/{file}", f"audios/{filename}.mp3"])
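
The script above assumes every entry in videos is a video with a single dot in its name. A slightly more defensive variant, sketched below with pathlib, skips non-video files and keeps names with extra dots intact. plan_conversions and the VIDEO_EXTENSIONS set are hypothetical and only illustrate the idea; the demo runs against a temporary folder with empty placeholder files and does not call FFmpeg.

```python
import tempfile
from pathlib import Path

# Assumed set of video extensions for this sketch; adjust as needed.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}


def plan_conversions(video_dir, audio_dir):
    """Return (source video, target MP3) path pairs for every video in video_dir."""
    video_dir, audio_dir = Path(video_dir), Path(audio_dir)
    jobs = []
    for src in sorted(video_dir.iterdir()):
        if src.is_file() and src.suffix.lower() in VIDEO_EXTENSIONS:
            # Path.stem removes only the final extension, so extra dots survive.
            jobs.append((src, audio_dir / f"{src.stem}.mp3"))
    return jobs


# Demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    videos = Path(tmp) / "videos"
    videos.mkdir()
    (videos / "lesson.part1.mp4").touch()
    (videos / "notes.txt").touch()
    jobs = plan_conversions(videos, Path(tmp) / "audios")
    planned = [(src.name, dst.name) for src, dst in jobs]

print(planned)

# To convert for real (requires FFmpeg on PATH):
# import subprocess
# for src, dst in plan_conversions("videos", "audios"):
#     dst.parent.mkdir(parents=True, exist_ok=True)
#     subprocess.run(["ffmpeg", "-i", str(src), str(dst)], check=True)
```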
				
			

Whisper Audio Transcription & Translation

speech_to_text.py

				
					import whisper
import json
import os

model = whisper.load_model("large-v2")  # load the large-v2 model
audios = os.listdir("audios")  # list all audio files in the audios folder

for audio in audios:
    title = audio.split(".")[0]
    result = model.transcribe(audio=f"audios/{audio}", language="hi", task="translate", word_timestamps=False)
    chunks = []
    for segment in result["segments"]:
        chunks.append({
            "start": segment["start"],
            "end": segment["end"],
            "text": segment["text"].strip(),
            "title": title
        })
    chunks_with_metadata = {"chunks": chunks, "text": result["text"]}
    with open(f"jsons/{title}.json", "w", encoding="utf-8") as f:
        json.dump(chunks_with_metadata, f)
				
			
  • First, import whisper, json, os
    → Imports required libraries for speech recognition, JSON output, and file handling.
  • Next, model = whisper.load_model("large-v2")
    → Loads the large-v2 Whisper model.
  • Then, audios = os.listdir("audios")
    → Reads the names of all audio files in the audios folder.
  • After that, for audio in audios:
    → Loops through each audio file one by one.
  • Next, title = audio.split(".")[0]
    → Extracts the file name (without extension) to use as a title.
  • Then, model.transcribe(...)
    → Transcribes Hindi audio and translates it into English.
    → word_timestamps=False skips word-level timestamps, which keeps the output lighter and faster.
  • After that, loop over result["segments"]
    → Extracts start time, end time, and text for each segment.
  • Next, each segment is stored as a chunk with metadata (title + timestamps).
  • Then, all chunks and the full text are combined into one JSON object.
  • Finally, the output is saved as a JSON file inside the jsons folder.
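
Since each chunk carries start and end times, the saved JSON can be turned into video subtitles. The sketch below converts chunks in the shape produced above into SRT text. srt_timestamp and chunks_to_srt are hypothetical helpers, and sample_chunks is hand-written data, not real Whisper output.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Build SRT subtitle text from a list of {'start', 'end', 'text'} chunks."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start = srt_timestamp(chunk["start"])
        end = srt_timestamp(chunk["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{chunk['text']}\n")
    # Subtitle blocks are separated by a blank line.
    return "\n".join(blocks)


# Hand-written sample in the same shape as the "chunks" list saved above.
sample_chunks = [
    {"start": 0.0, "end": 2.5, "text": "Welcome to the tutorial.", "title": "lesson1"},
    {"start": 2.5, "end": 65.25, "text": "Let us begin.", "title": "lesson1"},
]

print(chunks_to_srt(sample_chunks))
```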

❓ Frequently Asked Questions (FAQs)

What is OpenAI Whisper used for?

OpenAI Whisper is used for speech to text tasks such as converting audio and video files into written text. It is commonly used for audio transcription, subtitles, meeting notes, and podcast transcripts.

Can OpenAI Whisper convert audio to text in Python?

Yes, you can use OpenAI Whisper in Python to perform audio to text conversion. It supports multiple audio formats and provides accurate transcription with minimal setup.

Does OpenAI Whisper support video transcription?

Yes, OpenAI Whisper can transcribe video files. You simply extract the audio from the video and then use Whisper to convert the spoken content into text.

Is OpenAI Whisper free and open source?

Yes, OpenAI Whisper is an open-source speech to text model. You can run it locally without paying any API fees.

What audio formats are supported by OpenAI Whisper?

Whisper supports common formats such as MP3, WAV, M4A, and MP4 (video audio). For best results, audio files should be clear and properly encoded.

Does Whisper work well with noisy audio?

Yes, Whisper is trained on real-world audio, so it performs well even with background noise, accents, and different speaking styles.
