AI in practice: Generating video subtitles

Build
Diego Coy
5 min read

Scaleway is a French company with an international vision, so it is imperative that we provide information to our 550+ employees in both English and French to ensure clear understanding and a smooth flow of information. We create a diverse set of training videos for internal use, some originally voiced in English and others in French. In all cases, they should include subtitles in both languages.

Creating subtitles is a time-consuming process that we quickly realized would not scale. Fortunately, we were able to harness the power of AI for this exact task. With the help of OpenAI’s Whisper, the University of Helsinki’s Opus-MT, and a bit of code, we were able not only to transcribe and, when required, translate our internal videos, but also to generate subtitles in the srt format that we can simply import into video editing software or feed to a video player.

OpenAI’s Whisper

Whisper is an Open Source model created by OpenAI. It is a general-purpose speech recognition model that is able to identify and transcribe a wide variety of spoken languages. It is one of the most popular models around today and is released under the MIT license.

OpenAI provides a Python library for interacting with the model, which comes in a variety of “flavors” that trade accuracy for resource usage: tiny, base, small, medium, and large. Larger flavors have more parameters, which makes them larger in size and more resource-hungry: the tiny version of the model requires about 1GB of VRAM (Video RAM), while the large version requires around 10GB.
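As a rough illustration of this trade-off, you could pick the largest flavor that fits in the VRAM you have available. The helper below is our own sketch, and the VRAM figures are approximations taken from the Whisper README:

```python
# Approximate VRAM needed per Whisper flavor, in GB, from smallest to largest.
# These figures are approximate; check the Whisper README for current numbers.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_flavor(available_vram_gb):
    """Return the largest flavor that fits in the available VRAM, or None."""
    fitting = [name for name, need in VRAM_GB.items() if need <= available_vram_gb]
    return fitting[-1] if fitting else None  # dict preserves smallest-to-largest order

print(pick_flavor(6))   # medium
print(pick_flavor(80))  # large
```

On an H100-1-80G instance, with 80GB of VRAM, any flavor fits comfortably.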

Helsinki-NLP’s Opus-MT

The University of Helsinki made its own Open Source text translation models available, based on the Marian-MT framework used by Microsoft Translator. Opus-MT models are provided as language pairs: translation source and translation target, meaning that the model Helsinki-NLP/opus-mt-fr-en will translate text in French (fr) to English (en), and the other way around with Helsinki-NLP/opus-mt-en-fr.
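This naming convention makes it easy to build the model identifier for a given pair programmatically. A minimal sketch (the helper name is ours, and note that not every language pair is published, so check the Hugging Face hub before relying on one):

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build the Hugging Face model id for an Opus-MT language pair."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"

print(opus_mt_model_name("fr", "en"))  # Helsinki-NLP/opus-mt-fr-en
print(opus_mt_model_name("en", "fr"))  # Helsinki-NLP/opus-mt-en-fr
```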

Opus-MT can be used via the Transformers Python library from Hugging Face or using Docker. It is an Open Source project released under the MIT License and requires you to cite the OPUS-MT paper in your implementations:

@InProceedings{TiedemannThottingal:EAMT2020,
author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
year = {2020},
address = {Lisbon, Portugal}
}

Generating subtitles

Combining these two models into a subtitle-generating service is only a matter of adding some code to “glue” them together. But before diving into the code, let’s review our requirements:

First, we need to create a Virtual Machine capable of running AI models without a hitch, and the NVIDIA H100-1-80G GPU instance is a great choice.

With the type of instance clear, we can now focus on the functional requirements. We want to pass in a video file as input to Whisper to get a transcript. The second step will be to translate that transcript using OPUS-MT from a specific source language to a target language. Finally, we want to create a subtitle file in the target language that is in sync with the audio.

Setting up Whisper

You will find the latest setup information in the project’s GitHub repository, but in general, you can install the Python library using pip:

pip install -U openai-whisper

Whisper relies heavily on the FFmpeg project for manipulating multimedia files. FFmpeg can be installed via APT:

sudo apt install ffmpeg -y

The code

1. A simple text transcription

This basic example is the most straightforward way to transcribe audio into text. After importing the Whisper library, you load a flavor of the model by passing a string with its name to the load_model function. In this case, the base model is accurate enough, but some use cases may require larger or smaller model flavors.

After loading the model, you load the audio source by passing its file path to load_audio. Notice that you can use both audio and video files and, in general, any file type with audio that is supported by FFmpeg.

Finally, you make use of the transcribe method of the model by passing it the loaded audio. As a result, you get a dictionary that, amongst other items, contains the whole transcription text.

#main.py

import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("input_file.mp4")
result = model.transcribe(audio)

print(result["text"])

This basic example gives you the main tools needed for the rest of the project: loading a model, loading an input audio file, and transcribing the audio using the model. This is already a big step forward and puts us closer to our goal of generating a subtitle file. However, you may have noticed that the resulting text doesn’t include any time references; it’s only text. Syncing this transcribed text with the audio would require a large amount of manual work, but fortunately, Whisper’s transcription process also outputs segments that are time-coded.

2. Segments

Having time-coded segments means you can pinpoint each one to its specific start and end times during the clip. For instance, if the first speech segment in the clip is “We're no strangers”, starting at 00:17:50 and ending at 00:18:30, you will get that information in the segment dictionary. That gives you all you need to create an srt subtitle file; all you have to do is format it properly to conform with the appropriate syntax.

# Getting the transcription segments
from datetime import timedelta  # For formatting each segment's time
import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("input_file.mp4")
result = model.transcribe(audio)

segments = result["segments"]  # A list of segments

for segment in segments:
    # Each segment carries its own timing and text
    print(segment["start"], segment["end"], segment["text"])

3. An srt subtitle file

Subtitle files in the srt format are divided into sequences that include the start and end timecodes, separated by the “ --> ” string, followed by the caption text and ending in a blank line. Here’s an example:

1
00:01:26,612 --> 00:01:29,376
Took you long enough!
Did you find it? Where is it?

2
00:01:39,101 --> 00:01:42,609
I did. But I wish I didn't.

3
00:02:16,339 --> 00:02:18,169
What are you talking about?

Each segment contains an id field that can be used as the sequence number. The start and end times (the moments during which the subtitle is supposed to be on screen) can be obtained by padding the timedelta of each of the corresponding fields with zeroes. We’re keeping things simple here, but note that more accurate subtitle syncing results have been achieved by projects such as stable-ts. And the caption is the segment’s text. Here is the code that generates each formatted subtitle sequence:

# Getting each segment's transcription and formatting it as an srt subtitle

# ...

for segment in segments:
    startTime = str(0) + str(timedelta(seconds=int(segment['start']))) + ',000'
    endTime = str(0) + str(timedelta(seconds=int(segment['end']))) + ',000'
    text = segment['text']

    subtitle_segment = f"{segment['id'] + 1}\n{startTime} --> {endTime}\n{text}\n\n"
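As an aside, the hard-coded ,000 above drops Whisper’s sub-second precision. If you need real millisecond accuracy, a small helper (our own sketch, not part of the original script) can format a float number of seconds as an srt timecode:

```python
def srt_timestamp(seconds):
    """Format a number of seconds as an srt timecode: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

print(srt_timestamp(86.612))  # 00:01:26,612
print(srt_timestamp(3725.5))  # 01:02:05,500
```

You could then pass segment['start'] and segment['end'] to it directly, without the int() truncation.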

All that is left is to write each subtitle_segment to a new file:

# Writing to the output subtitle file
with open("subtitle.srt", 'a', encoding='utf-8') as srtFile:
    srtFile.write(subtitle_segment)

The complete example code should look like this:

# main.py

from datetime import timedelta
import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("input_file.mp4")
result = model.transcribe(audio)

segments = result["segments"]

for segment in segments:
    startTime = str(0) + str(timedelta(seconds=int(segment['start']))) + ',000'
    endTime = str(0) + str(timedelta(seconds=int(segment['end']))) + ',000'
    text = segment['text']

    subtitle_segment = f"{segment['id'] + 1}\n{startTime} --> {endTime}\n{text}\n\n"

    # Writing to the output subtitle file
    with open("subtitle.srt", 'a', encoding='utf-8') as srtFile:
        srtFile.write(subtitle_segment)

Now, to try it out, you can download this example file (or bring your own!) with wget, for instance:

wget https://scaleway.com/ai-book/examples/1/example.mp4 -O input_file.mp4

And then simply run the script:

python3 main.py

After only a few seconds (you are, after all, using one of the fastest GPU instances on the planet), the script will finish running and you will have a new subtitle.srt file that you can use during your video editing process or load while playing the video file. Great! But… the subtitle file is in the same language as the video. It is indeed useful as it is, but you probably want to reach a wider audience by translating it into different languages. We’ll explore that next.
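Before importing the file into an editor, a quick sanity check of the generated syntax can catch formatting slips early. Here is a minimal sketch (the validation logic is our own and only checks the sequence number and timecode lines):

```python
import re

# An srt timecode line: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")

def count_valid_sequences(content):
    """Count srt sequences that have a numeric id, a timecode line, and text."""
    valid = 0
    for block in content.strip().split("\n\n"):
        lines = block.split("\n")
        if len(lines) >= 3 and lines[0].isdigit() and TIMECODE.match(lines[1]):
            valid += 1
    return valid

sample = ("1\n00:01:26,612 --> 00:01:29,376\nTook you long enough!\n\n"
          "2\n00:01:39,101 --> 00:01:42,609\nI did. But I wish I didn't.\n")
print(count_valid_sequences(sample))  # 2
```

Reading subtitle.srt and comparing the count against len(segments) gives you a cheap regression check.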

4. Translating a segment’s text

Translating each segment’s text comes down to importing MarianMTModel and MarianTokenizer from Hugging Face’s Transformers library, passing the desired model name, and generating the translation. Install the dependencies by running the following command:

pip install transformers SentencePiece

In this example, "Helsinki-NLP/opus-mt-fr-en" is used to translate from French to English. The translate function abstracts the translation process: it takes a source string and returns a translated version of it.

from transformers import MarianMTModel, MarianTokenizer
# ...

opus_mt_model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = MarianTokenizer.from_pretrained(opus_mt_model_name)
opus_mt_model = MarianMTModel.from_pretrained(opus_mt_model_name)

def translate(text):
    translated = opus_mt_model.generate(**tokenizer(text, return_tensors="pt", padding=True))
    res = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    return res[0]

There’s no need to worry about the ** syntax for now; it simply unpacks the dictionary returned by the tokenizer into keyword arguments for the generate method, and the additional parameters can be left untouched.
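The ** here is ordinary Python dictionary unpacking. A self-contained toy example (the function and dictionary below are hypothetical stand-ins, not the real tokenizer output):

```python
# f(**d) passes each key/value pair of the dictionary d as a keyword argument.
def describe(input_ids=None, attention_mask=None):
    return f"ids={input_ids}, mask={attention_mask}"

# A stand-in for what a tokenizer call might return
batch = {"input_ids": [101, 2023], "attention_mask": [1, 1]}

# Equivalent to describe(input_ids=[101, 2023], attention_mask=[1, 1])
print(describe(**batch))  # ids=[101, 2023], mask=[1, 1]
```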

The complete code example looks like this:

from datetime import timedelta
import whisper
from transformers import MarianMTModel, MarianTokenizer

model = whisper.load_model("base")
audio = whisper.load_audio("input_file.mp4")
result = model.transcribe(audio)

segments = result["segments"]

opus_mt_model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = MarianTokenizer.from_pretrained(opus_mt_model_name)
opus_mt_model = MarianMTModel.from_pretrained(opus_mt_model_name)

def translate(text):
    translated = opus_mt_model.generate(**tokenizer(text, return_tensors="pt", padding=True))
    res = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    return res[0]

for segment in segments:
    startTime = str(0) + str(timedelta(seconds=int(segment['start']))) + ',000'
    endTime = str(0) + str(timedelta(seconds=int(segment['end']))) + ',000'
    text = translate(segment['text'])

    subtitle_segment = f"{segment['id'] + 1}\n{startTime} --> {endTime}\n{text}\n\n"

    # Writing to the output subtitle file
    with open("subtitle.srt", 'a', encoding='utf-8') as srtFile:
        srtFile.write(subtitle_segment)

That’s it! Even though the results are not perfect and you may need to make a few manual adjustments here and there, considering the rate at which AI is advancing, things can only get better.

You can now extend and adapt this code to your own needs. How about making it dynamically accept a file path as an input parameter? Or what if you made it into a web service others can easily take advantage of? The choice is yours! Just don’t forget to cite the OPUS-MT paper in your implementations if you’re using the translation feature.
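As a starting point for the file-path idea, here is a sketch using Python’s argparse (the parameter names and defaults are our own, not part of the script above):

```python
import argparse

def parse_args(argv=None):
    """Parse command-line arguments for the subtitle generator."""
    parser = argparse.ArgumentParser(description="Generate srt subtitles from a video file.")
    parser.add_argument("input_file", help="Path to the audio or video file")
    parser.add_argument("--output", default="subtitle.srt", help="Path of the srt file to write")
    parser.add_argument("--model", default="base", help="Whisper model flavor to load")
    return parser.parse_args(argv)

# Passing an explicit list here instead of reading sys.argv, for illustration
args = parse_args(["input_file.mp4", "--model", "small"])
print(args.input_file, args.output, args.model)  # input_file.mp4 subtitle.srt small
```

You would then replace the hard-coded "input_file.mp4", "subtitle.srt", and "base" strings with args.input_file, args.output, and args.model, and run the script as python3 main.py my_video.mp4.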
