
Implementing video subtitles with Python + OpenAI Whisper

Install the correct OpenAI Whisper package:

pip install openai-whisper

Here is a complete working code example:

import whisper
import os
from moviepy.editor import VideoFileClip
from datetime import timedelta
import torch

def extract_audio(video_path, audio_path):
    """Extract the audio track from a video file"""
    try:
        video = VideoFileClip(video_path)
        video.audio.write_audiofile(audio_path)
        video.close()
    except Exception as e:
        print(f"Audio extraction error: {str(e)}")
        raise

def generate_srt(segments, output_srt):
    """Generate an SRT format subtitle file"""
    with open(output_srt, 'w', encoding='utf-8') as f:
        for i, segment in enumerate(segments, start=1):
            # Convert time format
            start = str(timedelta(seconds=int(segment['start']))) + ',000'
            end = str(timedelta(seconds=int(segment['end']))) + ',000'

            # Write in SRT format
            f.write(f"{i}\n")
            f.write(f"{start} --> {end}\n")
            f.write(f"{segment['text'].strip()}\n\n")

def main(video_path, output_srt):
    """Main function"""
    try:
        # Check if CUDA is available
        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {device}")

        # Load the model
        print("Loading the Whisper model...")
        model = whisper.load_model("base", device=device)

        # Transcribe directly from the video file
        print("Starting transcription...")
        result = model.transcribe(video_path, language="zh")

        # Generate the subtitle file
        print("Generating subtitle file...")
        generate_srt(result["segments"], output_srt)

        print(f"Subtitles have been generated: {output_srt}")

    except Exception as e:
        print(f"An error occurred during processing: {str(e)}")
        raise

if __name__ == "__main__":
    # Set the input and output paths
    video_path = "your_video.mp4"   # Replace with your video file path
    output_srt = "output.srt"       # The output subtitle file path
    main(video_path, output_srt)

This code loads OpenAI's Whisper model. Whisper is an open-source model provided by OpenAI, built specifically for speech-to-text, i.e. ASR (Automatic Speech Recognition).

Code walkthrough:

print("Loading the Whisper model...")
model = whisper.load_model("base", device=device)

1. whisper.load_model("base", device=device)

whisper.load_model: This calls the load_model function in the Whisper library to load a pretrained Whisper model.

"base": This is the name of the selected model. In Whisper, there are multiple pretrained models that vary in size and performance. "base" is one of the medium-sized models (not the smallest, nor the largest). Whisper provides models of different sizes, such as:

  • "tiny": The smallest model, the fastest, but the accuracy is relatively low.
  • "base": A medium-sized model with a balance between speed and accuracy.
  • "small": bigger, providing better accuracy, but slower.
  • "medium" and "large": These are the largest models with very high accuracy, but require more computing resources and are slower.

The "base" model is usually chosen because it has a good balance between speed and accuracy. Suitable for most everyday applications.

2. device=device

device=device: This part of the code specifies the hardware device the model runs on. Typically the device is either "cpu" (the CPU) or "cuda" (a CUDA-enabled NVIDIA GPU). Depending on your configuration, Whisper runs the model on the CPU or the GPU. Running on the GPU can greatly speed up model inference, especially with the larger models.

For example:

  • If your computer has a GPU and CUDA is installed, device="cuda" lets Whisper load and run the model on the GPU.
  • If there is no GPU, or CUDA is not configured, Whisper falls back to the CPU.
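As a quick sanity check of which device will be used (assuming torch is installed, as Whisper requires):

import torch

if torch.cuda.is_available():
    print("CUDA available:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU detected; Whisper will run on the CPU")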

Why use the "base" model?

The reasons for choosing the "base" model are usually the following:

Balance: The "base" model is more accurate than "tiny", but its computing requirements are much lower than those of "medium" or "large". It is a common choice for many users because it reaches a good compromise between accuracy and processing speed.

Performance: For most common applications (such as near-real-time speech recognition), the "base" model usually provides sufficient accuracy without consuming too many hardware resources.

Hardware requirements: The "base" model needs less memory and compute than the larger models, so it suits devices without particularly powerful hardware.

Selecting another model:

  • If your hardware supports it and you need higher recognition accuracy, choose the "small", "medium", or "large" model; these achieve higher accuracy at the cost of some speed.
  • If hardware performance is limited, or you need fast responses, choose the "tiny" model, although its recognition accuracy is lower. A sketch for making the model size configurable follows this list.
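A minimal sketch for exposing the model choice as a command-line flag (the --model flag name is an assumption for illustration, not part of the original script):

import argparse
import whisper

parser = argparse.ArgumentParser(description="Generate subtitles with Whisper")
parser.add_argument("--model", default="base",
                    choices=["tiny", "base", "small", "medium", "large"],
                    help="Whisper model size to load")
args = parser.parse_args()

# Load the requested model size
model = whisper.load_model(args.model)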

Summary:

The purpose of this code is to load the Whisper model and specify the device (CPU or GPU) to use.

The "base" model is one of the smaller models and is usually suitable for most scenarios that need a balance of accuracy and speed.

Instructions for use:

Replace video_path with your actual video file path

Make sure there is enough disk space

If an NVIDIA GPU with CUDA is present, GPU acceleration will be used automatically

Additional dependencies that may be required:

pip install torch torchvision torchaudio
pip install moviepy
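Whisper also depends on the ffmpeg command-line tool to decode audio and video. On most systems it can be installed with the system package manager, for example:

# Debian/Ubuntu
sudo apt install ffmpeg
# macOS (Homebrew)
brew install ffmpeg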

If you want a more detailed progress display, you can add a progress bar:

from tqdm import tqdm

def generate_srt_with_progress(segments, output_srt):
    """Subtitle generation with a progress display"""
    with open(output_srt, 'w', encoding='utf-8') as f:
        for i, segment in tqdm(enumerate(segments, start=1),
                               desc="Creating subtitles",
                               total=len(segments)):
            start = str(timedelta(seconds=int(segment['start']))) + ',000'
            end = str(timedelta(seconds=int(segment['end']))) + ',000'

            f.write(f"{i}\n")
            f.write(f"{start} --> {end}\n")
            f.write(f"{segment['text'].strip()}\n\n")

Add error handling and logging:

import logging
import sys

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('subtitle_generation.log'),
        logging.StreamHandler(sys.stdout)
    ]
)

def main(video_path, output_srt):
    """Main function with complete error handling and logging"""
    try:
        if not os.path.exists(video_path):
            raise FileNotFoundError(f"The video file does not exist: {video_path}")

        logging.info(f"Start processing video: {video_path}")

        # Check CUDA
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logging.info(f"Using device: {device}")

        # Load the model
        logging.info("Loading the Whisper model...")
        model = whisper.load_model("base", device=device)

        # Transcribe
        logging.info("Starting transcription...")
        result = model.transcribe(video_path, language="zh")

        # Generate subtitles
        logging.info("Generating subtitle file...")
        generate_srt_with_progress(result["segments"], output_srt)

        logging.info(f"Subtitle generation is complete: {output_srt}")

    except FileNotFoundError as e:
        logging.error(f"File error: {str(e)}")
        raise
    except Exception as e:
        logging.error(f"Processing error: {str(e)}")
        raise

If you need to process long videos, you can add segmentation processing:

def process_long_video(video_path, output_srt, segment_duration=300):
    """Process a long video in segments
    (assumes `model` was loaded earlier with whisper.load_model)"""
    from moviepy.editor import VideoFileClip

    video = VideoFileClip(video_path)
    duration = video.duration
    segments = []

    for start in range(0, int(duration), segment_duration):
        end = min(start + segment_duration, duration)

        # Extract the fragment
        clip = video.subclip(start, end)
        temp_audio = f"temp_{start}_{int(end)}.wav"
        clip.audio.write_audiofile(temp_audio)

        # Transcribe the fragment
        result = model.transcribe(temp_audio, language="zh")

        # Offset timestamps by the fragment's start time
        # so the final SRT stays continuous
        for seg in result["segments"]:
            seg['start'] += start
            seg['end'] += start
        segments.extend(result["segments"])

        # Clean up the temporary file
        os.remove(temp_audio)

    # Generate the full subtitle file
    generate_srt(segments, output_srt)
    video.close()
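A hypothetical invocation, assuming the model and the helper functions above are defined in the same script:

model = whisper.load_model("base")
process_long_video("your_video.mp4", "output.srt", segment_duration=300)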

Whisper is a multilingual speech recognition model that also supports a translation task. One important caveat: Whisper's built-in task="translate" mode translates speech from other languages into English; it does not translate English speech into Chinese (or any other non-English target language).

How to use Whisper's translation task:

Load the model and transcribe

Use whisper.load_model to load a model (for example, base or small) and then call the transcribe method on the audio. To translate, set the task="translate" parameter; the output text will be in English regardless of the source language.

Optionally specify the source language

Setting language="zh" (or another language code) tells the model what the input language is; if it is omitted, Whisper detects the language automatically. The target language of task="translate" is always English.

Sample code:

import whisper

# Load the model
model = whisper.load_model("base")

# Set the audio or video file path
audio_path = "path_to_audio_or_video"

# Translate foreign-language speech (here: Chinese) into English text
result = model.transcribe(audio_path, language="zh", task="translate")

# Print the translation result
print(result["text"])

Explanation:

language="zh": Tells Whisper that the input speech is Chinese (optional; Whisper can also detect the language automatically).

task="translate": Enables translation mode; Whisper translates the recognized speech into English.

result["text"]: The translated English text.

Notes:

Translation quality: Whisper's built-in translation is relatively strong, but its quality is limited by the model size and the training data, so results may be imperfect, especially for complex sentences or contexts.

Speech recognition and translation: Whisper combines speech recognition and translation in a single pass, which is convenient when English output is the goal, with no need to run recognition and translation as separate steps.
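If the goal is the reverse direction, Chinese subtitles from English speech, Whisper alone is not enough: transcribe the English audio first, then run the text through a separate machine translation model. A minimal sketch, assuming the Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-zh translation model (both are assumptions, not part of Whisper):

import whisper
from transformers import pipeline

# Step 1: transcribe the English speech with Whisper
model = whisper.load_model("base")
result = model.transcribe("path_to_audio_or_video", language="en")

# Step 2: translate the English transcript into Chinese
# with a separate machine translation model
# (for long transcripts, translate segment by segment instead)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
translated = translator(result["text"])[0]["translation_text"]
print(translated)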

Summary:

Whisper's task="translate" translates speech from other languages directly into English. Producing Chinese text from English speech requires pairing Whisper transcription with a separate machine translation step, as sketched above.
