AI in practice: Generating video subtitles

Build
Diego Coy
5 min read

Scaleway is a French company with an international vision, so it is imperative that we provide information to our 550+ employees in both English and French to ensure clear understanding and a smooth flow of information. We create a diverse set of training videos for internal use, some originally voiced in English and others in French. In all cases, they should include subtitles in both languages.

Creating subtitles is a time-consuming process that we quickly realized would not scale. Fortunately, we were able to harness the power of AI for this exact task. With the help of OpenAI’s Whisper, the University of Helsinki’s Opus-MT, and a bit of code, we were able not only to transcribe and, when required, translate our internal videos, but also to generate subtitles in the srt format that we can simply import into video editing software or feed to a video player.

OpenAI’s Whisper

Whisper is an Open Source model created by OpenAI. It is a general-purpose speech recognition model that is able to identify and transcribe a wide variety of spoken languages. It is one of the most popular models around today and is released under the MIT license.

OpenAI provides a Python library for interacting with the model, which comes in a variety of “flavors” that trade accuracy for resource usage: tiny, base, small, medium, and large. Larger flavors have more parameters, which makes them larger in size and more resource-hungry: the tiny version of the model requires about 1GB of VRAM (Video RAM), while the large version requires around 10GB.
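As a rough illustration of this trade-off, you could pick the largest flavor that fits in the VRAM you have available. The helper below is our own sketch, and the VRAM figures are approximations taken from the Whisper README:

```python
# Approximate VRAM needed per Whisper flavor, in GB, from smallest to largest.
# These figures are approximate; check the Whisper README for current numbers.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_flavor(available_vram_gb):
    """Return the largest flavor that fits in the available VRAM, or None."""
    fitting = [name for name, need in VRAM_GB.items() if need <= available_vram_gb]
    return fitting[-1] if fitting else None  # dict preserves smallest-to-largest order

print(pick_flavor(6))   # medium
print(pick_flavor(80))  # large
```

On an H100-1-80G instance, with 80GB of VRAM, any flavor fits comfortably.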

Helsinki-NLP’s Opus-MT

The University of Helsinki made its own Open Source text translation models available, based on the Marian-MT framework used by Microsoft Translator. Opus-MT models are provided as language pairs: translation source and translation target, meaning that the model Helsinki-NLP/opus-mt-fr-en will translate text in French (fr) to English (en), and the other way around with Helsinki-NLP/opus-mt-en-fr.
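This naming convention makes it easy to build the model identifier for a given pair programmatically. A minimal sketch (the helper name is ours, and note that not every language pair is published, so check the Hugging Face hub before relying on one):

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build the Hugging Face model id for an Opus-MT language pair."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"

print(opus_mt_model_name("fr", "en"))  # Helsinki-NLP/opus-mt-fr-en
print(opus_mt_model_name("en", "fr"))  # Helsinki-NLP/opus-mt-en-fr
```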

Opus-MT can be used via the Transformers Python library from Hugging Face or using Docker. It is an Open Source project released under the MIT License and requires you to cite the OPUS-MT paper in your implementations:

@InProceedings{TiedemannThottingal:EAMT2020,
author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
year = {2020},
address = {Lisbon, Portugal}
}

Generating subtitles

Combining these two models into a subtitle-generating service is only a matter of adding some code to “glue” them together. But before diving into the code, let’s review our requirements:

First, we need to create a Virtual Machine capable of running AI models without a hitch, and the NVIDIA H100-1-80G GPU instance is a great choice.

With the type of instance clear, we can now focus on the functional requirements. We want to pass in a video file as input to Whisper to get a transcript. The second step will be to translate that transcript using OPUS-MT from a specific source language to a target language. Finally, we want to create a subtitle file in the target language that is in sync with the audio.

Setting up Whisper

You will find the latest setup information in the project’s GitHub repository, but in general, you can install the Python library using pip:

pip install -U openai-whisper

Whisper relies heavily on the FFmpeg project for manipulating multimedia files. FFmpeg can be installed via APT:

sudo apt install ffmpeg -y

The code

1. A simple text transcription

This basic example is the most straightforward way to transcribe audio into text. After importing the Whisper library, you load a flavor of the model by passing a string with its name to the load_model function. In this case, the base model is accurate enough, but some use cases may require larger or smaller model flavors.

After loading the model, you load the audio source by passing its file path to load_audio. Notice that you can use both audio and video files and, in general, any file type with audio that is supported by FFmpeg.

Finally, you make use of the transcribe method of the model by passing it the loaded audio. As a result, you get a dictionary that, amongst other items, contains the whole transcription text.

#main.py

import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("input_file.mp4")
result = model.transcribe(audio)

print(result["text"])

This basic example gives you the main tools needed for the rest of the project: loading a model, loading an input audio file, and transcribing the audio using the model. This is already a big step forward and puts us closer to our goal of generating a subtitle file. However, you may have noticed that the resulting text doesn’t include any time references; it’s only text. Syncing this transcribed text with the audio would require a large amount of manual work, but fortunately, Whisper’s transcription process also outputs segments that are time-coded.

2. Segments

Having time-coded segments means you can pinpoint each one to its specific start and end times during the clip. For instance, if the first speech segment in the clip is “We're no strangers”, starting at 00:17:50 and ending at 00:18:30, you will get that information in the segment dictionary. That gives you all you need to create an srt subtitle file; all you have to do is format it properly to conform with the appropriate syntax.

# Getting the transcription segments
from datetime import timedelta  # For formatting each segment's time
import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("input_file.mp4")
result = model.transcribe(audio)

segments = result["segments"]  # A list of segments

for segment in segments:
    # Each segment carries its own timing and text
    print(segment["start"], segment["end"], segment["text"])

3. An srt subtitle file

Subtitle files in the srt format are divided into sequences that include the start and end timecodes, separated by the “ --> ” string, followed by the caption text and ending in a blank line. Here’s an example:

1
00:01:26,612 --> 00:01:29,376
Took you long enough!
Did you find it? Where is it?

2
00:01:39,101 --> 00:01:42,609
I did. But I wish I didn't.

3
00:02:16,339 --> 00:02:18,169
What are you talking about?

Each segment contains an id field that can be used as the sequence number. The start and end times (the moments during which the subtitle is supposed to be on screen) can be obtained by padding the timedelta of each of the corresponding fields with zeroes. We’re keeping things simple here, but note that more accurate subtitle syncing results have been achieved by projects such as stable-ts. And the caption is the segment’s text. Here is the code that generates each formatted subtitle sequence:

# Getting each segment's transcription and formatting it as an srt subtitle

# ...

for segment in segments:
    startTime = str(0) + str(timedelta(seconds=int(segment['start']))) + ',000'
    endTime = str(0) + str(timedelta(seconds=int(segment['end']))) + ',000'
    text = segment['text']

    subtitle_segment = f"{segment['id'] + 1}\n{startTime} --> {endTime}\n{text}\n\n"
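As an aside, the hard-coded ,000 above drops Whisper’s sub-second precision. If you need real millisecond accuracy, a small helper (our own sketch, not part of the original script) can format a float number of seconds as an srt timecode:

```python
def srt_timestamp(seconds):
    """Format a number of seconds as an srt timecode: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

print(srt_timestamp(86.612))  # 00:01:26,612
print(srt_timestamp(3725.5))  # 01:02:05,500
```

You could then pass segment['start'] and segment['end'] to it directly, without the int() truncation.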

All that is left is to write each subtitle_segment to a new file:

# Writing to the output subtitle file
with open("subtitle.srt", 'a', encoding='utf-8') as srtFile:
    srtFile.write(subtitle_segment)

The complete example code should look like this:

# main.py

from datetime import timedelta
import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("input_file.mp4")
result = model.transcribe(audio)

segments = result["segments"]

for segment in segments:
    startTime = str(0) + str(timedelta(seconds=int(segment['start']))) + ',000'
    endTime = str(0) + str(timedelta(seconds=int(segment['end']))) + ',000'
    text = segment['text']

    subtitle_segment = f"{segment['id'] + 1}\n{startTime} --> {endTime}\n{text}\n\n"

    # Writing to the output subtitle file
    with open("subtitle.srt", 'a', encoding='utf-8') as srtFile:
        srtFile.write(subtitle_segment)

Now, to try it out, you can download this example file (or bring your own!) with wget, for instance:

wget https://scaleway.com/ai-book/examples/1/example.mp4 -O input_file.mp4

And then simply run the script:

python3 main.py

After only a few seconds (you are, after all, using one of the fastest GPU instances on the planet), the script will finish running and you will have a new subtitle.srt file that you can use during your video editing process or load while playing the video file. Great! But… the subtitle file is in the same language as the video. It is indeed useful as it is, but you probably want to reach a wider audience by translating it into different languages. We’ll explore that next.
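Before importing the file into an editor, a quick sanity check of the generated syntax can catch formatting slips early. Here is a minimal sketch (the validation logic is our own and only checks the sequence number and timecode lines):

```python
import re

# An srt timecode line: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")

def count_valid_sequences(content):
    """Count srt sequences that have a numeric id, a timecode line, and text."""
    valid = 0
    for block in content.strip().split("\n\n"):
        lines = block.split("\n")
        if len(lines) >= 3 and lines[0].isdigit() and TIMECODE.match(lines[1]):
            valid += 1
    return valid

sample = ("1\n00:01:26,612 --> 00:01:29,376\nTook you long enough!\n\n"
          "2\n00:01:39,101 --> 00:01:42,609\nI did. But I wish I didn't.\n")
print(count_valid_sequences(sample))  # 2
```

Reading subtitle.srt and comparing the count against len(segments) gives you a cheap regression check.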

4. Translating a segment’s text

Translating each segment’s text comes down to importing MarianMTModel and MarianTokenizer from Hugging Face’s Transformers library, passing the desired model name, and generating the translation. Install the dependencies by running the following command:

pip install transformers SentencePiece

In this example, "Helsinki-NLP/opus-mt-fr-en" is used to translate from French to English. The translate function abstracts the translation process: it takes a source string and returns a translated version of it.

from transformers import MarianMTModel, MarianTokenizer
# ...

opus_mt_model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = MarianTokenizer.from_pretrained(opus_mt_model_name)
opus_mt_model = MarianMTModel.from_pretrained(opus_mt_model_name)

def translate(text):
    translated = opus_mt_model.generate(**tokenizer(text, return_tensors="pt", padding=True))
    res = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    return res[0]

There’s no need to worry about the ** syntax for now; it simply unpacks the dictionary returned by the tokenizer into keyword arguments for the generate method, and the additional parameters can be left untouched.
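The ** here is ordinary Python dictionary unpacking. A self-contained toy example (the function and dictionary below are hypothetical stand-ins, not the real tokenizer output):

```python
# f(**d) passes each key/value pair of the dictionary d as a keyword argument.
def describe(input_ids=None, attention_mask=None):
    return f"ids={input_ids}, mask={attention_mask}"

# A stand-in for what a tokenizer call might return
batch = {"input_ids": [101, 2023], "attention_mask": [1, 1]}

# Equivalent to describe(input_ids=[101, 2023], attention_mask=[1, 1])
print(describe(**batch))  # ids=[101, 2023], mask=[1, 1]
```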

The complete code example looks like this:

from datetime import timedelta
import whisper
from transformers import MarianMTModel, MarianTokenizer

model = whisper.load_model("base")
audio = whisper.load_audio("input_file.mp4")
result = model.transcribe(audio)

segments = result["segments"]

opus_mt_model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = MarianTokenizer.from_pretrained(opus_mt_model_name)
opus_mt_model = MarianMTModel.from_pretrained(opus_mt_model_name)

def translate(text):
    translated = opus_mt_model.generate(**tokenizer(text, return_tensors="pt", padding=True))
    res = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    return res[0]

for segment in segments:
    startTime = str(0) + str(timedelta(seconds=int(segment['start']))) + ',000'
    endTime = str(0) + str(timedelta(seconds=int(segment['end']))) + ',000'
    text = translate(segment['text'])

    subtitle_segment = f"{segment['id'] + 1}\n{startTime} --> {endTime}\n{text}\n\n"

    # Writing to the output subtitle file
    with open("subtitle.srt", 'a', encoding='utf-8') as srtFile:
        srtFile.write(subtitle_segment)

That’s it! Even though the results are not perfect and you may need to make a few manual adjustments here and there, considering the rate at which AI is advancing, things can only get better.

You can now extend and adapt this code to your own needs. How about making it dynamically accept a file path as an input parameter? Or what if you made it into a web service others can easily take advantage of? The choice is yours! Just don’t forget to cite the OPUS-MT paper in your implementations if you’re using the translation feature.
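As a starting point for the file-path idea, here is a sketch using Python’s argparse (the parameter names and defaults are our own, not part of the script above):

```python
import argparse

def parse_args(argv=None):
    """Parse command-line arguments for the subtitle generator."""
    parser = argparse.ArgumentParser(description="Generate srt subtitles from a video file.")
    parser.add_argument("input_file", help="Path to the audio or video file")
    parser.add_argument("--output", default="subtitle.srt", help="Path of the srt file to write")
    parser.add_argument("--model", default="base", help="Whisper model flavor to load")
    return parser.parse_args(argv)

# Passing an explicit list here instead of reading sys.argv, for illustration
args = parse_args(["input_file.mp4", "--model", "small"])
print(args.input_file, args.output, args.model)  # input_file.mp4 subtitle.srt small
```

You would then replace the hard-coded "input_file.mp4", "subtitle.srt", and "base" strings with args.input_file, args.output, and args.model, and run the script as python3 main.py my_video.mp4.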
