How to use Bacalhau and OpenAI Whisper to transcribe your video and audio files

Published in Nerd For Tech, Nov 10, 2022

Captions, subtitles, and transcripts all help your audio and video content reach a wider audience and encourage more interaction from viewers and listeners.

Over the past week, I experimented with a new approach to video and audio transcription. I used OpenAI Whisper and Bacalhau to transcribe my downloaded YouTube videos and extract the text from them in different formats (.srt, .vtt, .txt).

I was eager to give this a try, and I was blown away by how well it worked and how simple it was to implement. This process is not limited to YouTube videos; you can try it out on any video or audio file of your choice.

If this is something you want to try out, this is a tutorial on how to get started.

Here’s the roadmap for this project:

Install dependencies
Create a Whisper Python script
Create a Dockerfile to containerize your Whisper script
Run Whisper on Bacalhau

But before you get started, let’s dive into what Bacalhau and OpenAI Whisper are and why I decided to use them.

What are Bacalhau and OpenAI Whisper?

Whisper is an open-source, general-purpose speech recognition model developed by OpenAI. It is a multi-task model trained on a large dataset to perform language identification, voice activity detection, transcription, and translation. It was trained on 680,000 hours of audio covering English and 96 other languages.

Bacalhau (Compute Over Data, or CoD) is a network of open compute resources made available to serve any data processing workload. It processes and transforms large-scale datasets by enabling users to run arbitrary Docker containers and WebAssembly (Wasm) images against data stored in IPFS (InterPlanetary File System). Bacalhau operates as a peer-to-peer network of nodes where each node participates in executing and computing jobs submitted to the cluster.

The advantage of using Bacalhau over managed Automatic Speech Recognition services

You can manage your own containers that can scale to batch process petabytes (quadrillion bytes) of audio and video files.
Using its sharding feature, you can carry out distributed inference easily. Distributed inference is typically used on large-scale datasets with millions of records.
If your data is already stored on IPFS, you don’t need to move it; you can compute where the data is located.
The cost of computing can be much lower than with managed services.

Install dependencies

To get started, you’ll need to install all the dependencies below. It is assumed that you already have Python and pip installed.

1. Install FFmpeg, an audio and video processing library

# Linux
sudo apt update && sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows
choco install ffmpeg

Note: The macOS installation command requires Homebrew, and the Windows installation command requires Chocolatey.

2. Install PyTorch, an open-source machine learning (ML) framework

pip install torch

3. Install Whisper, an open-source speech recognition model

pip install git+https://github.com/openai/whisper.git -q

4. Install Bacalhau, to compute the data processing workload

curl -sL https://get.bacalhau.org/install.sh | bash
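Before moving on, it can help to confirm that the four installs above are actually visible from your environment. The snippet below is a minimal sketch of such a check, not part of the original tutorial. It assumes the command-line tools are named ffmpeg and bacalhau and that the Python packages import as torch and whisper.

```python
import importlib.util
import shutil


def check_dependencies(commands=("ffmpeg", "bacalhau"), modules=("torch", "whisper")):
    """Return a dict mapping each dependency name to True if it was found."""
    found = {}
    for command in commands:
        # shutil.which returns None when the executable is not on PATH
        found[command] = shutil.which(command) is not None
    for module in modules:
        # find_spec returns None when a top-level module is not installed
        found[module] = importlib.util.find_spec(module) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Anything reported as MISSING points back to the matching install step above.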

Create Whisper Python script

For the Whisper script, you will need to create a file called openai-whisper.py. Below is the Whisper sample script code written by the Bacalhau team. Copy and paste the code below into your openai-whisper.py file.

The script accepts and sets the required parameters, such as the input file path, output file path, and temperature. It is then configured to execute on the GPU and to convert .mp4 files to .wav files. The Whisper model “large” is used. You can find more information about the different Whisper models. Finally, once the model has been loaded, the script saves the output transcript in several formats.
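To make those steps concrete, here is a minimal sketch of a script with that shape. It is my own illustration under stated assumptions, not the Bacalhau team's script. The -t and -m flag names, the default values (hello.mp3, the current directory), and the helper functions are hypothetical. The subtitle writers are hand-rolled so that the sketch relies only on whisper.load_model and model.transcribe.

```python
import argparse
import os
import subprocess


def format_timestamp(seconds, decimal_marker="."):
    """Format seconds as HH:MM:SS.mmm (pass ',' as the marker for .srt)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{millis:03d}"


def to_vtt(segments):
    """Render Whisper-style segments (dicts with start, end, text) as WebVTT."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)


def to_srt(segments):
    """Render segments as SubRip: numbered cues, comma before the milliseconds."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"], ",")
        end = format_timestamp(seg["end"], ",")
        lines += [str(i), f"{start} --> {end}", seg["text"].strip(), ""]
    return "\n".join(lines)


def convert_to_wav(path):
    """Convert a video file to a 16 kHz mono .wav using the ffmpeg CLI."""
    wav_path = os.path.splitext(path)[0] + ".wav"
    subprocess.run(["ffmpeg", "-y", "-i", path, "-ar", "16000", "-ac", "1", wav_path], check=True)
    return wav_path


def main(argv=None):
    parser = argparse.ArgumentParser(description="Transcribe audio or video with Whisper")
    parser.add_argument("-p", "--input", default="hello.mp3", help="input file path")
    parser.add_argument("-o", "--output", default=".", help="output directory")
    parser.add_argument("-t", "--temperature", type=float, default=0.0)
    parser.add_argument("-m", "--model", default="large", help="Whisper model name")
    args = parser.parse_args(argv)

    # Imported here so the helpers above can be used without the heavy dependencies
    import torch
    import whisper

    device = "cuda" if torch.cuda.is_available() else "cpu"
    audio = convert_to_wav(args.input) if args.input.lower().endswith(".mp4") else args.input
    model = whisper.load_model(args.model, device=device)
    result = model.transcribe(audio, temperature=args.temperature)

    os.makedirs(args.output, exist_ok=True)
    base = os.path.join(args.output, os.path.splitext(os.path.basename(args.input))[0])
    outputs = {
        ".txt": result["text"].strip() + "\n",
        ".srt": to_srt(result["segments"]),
        ".vtt": to_vtt(result["segments"]),
    }
    for ext, content in outputs.items():
        with open(base + ext, "w", encoding="utf-8") as f:
            f.write(content)
```

To use the sketch as a command-line script, call main() at the bottom of the file under the usual if __name__ == "__main__": guard. The torch and whisper imports sit inside main() so that the formatting helpers can be tried on a machine without a GPU or the model weights.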

Test the Whisper script

To test the script and make sure everything works as expected, run the following commands in your terminal.

To download the test audio clip

wget https://github.com/js-ts/hello/raw/main/hello.mp3

To run your whisper script

python openai-whisper.py

To view the output for the test sample audio

# View the plain-text output
cat hello.txt

# View the subtitle (.srt) output
cat hello.srt

# View the WebVTT output
cat hello.vtt

Create a Dockerfile to containerize your Whisper script

At this stage, you will need to create a Dockerfile to containerize your Python Whisper script. A Dockerfile is a text file that contains instructions that Docker uses to create a container image. You can check the docs to learn more about Docker.

To containerize the script

1. Create an empty file called Dockerfile

touch Dockerfile

2. In the Dockerfile, add the following lines of code. These commands specify how the image will be built, and what extra requirements will be included.

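3. Build the image. In an editor with Docker support (for example, VS Code with the Docker extension), you can right-click on the Dockerfile and click on Build Image; from a terminal, you can run docker build in the same directory.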

So what exactly is happening in the Dockerfile?

The pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime image is used as the base image.
The dependencies to be installed are added to the container
The test audio file and our openai-whisper.py script are also added to the container.
Finally, a test run checks that the container builds successfully.
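Putting those four points together, the Dockerfile could look roughly like the text produced by the helper below. This is a sketch reconstructed from the description, not necessarily the exact file. The base image and the Whisper install command come from this post, while the apt packages (git, ffmpeg), the WORKDIR, and the final RUN step that executes the script at build time are assumptions.

```python
def render_dockerfile(
    base_image="pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime",
    script="openai-whisper.py",
    test_audio="hello.mp3",
):
    """Return Dockerfile text that mirrors the four points listed above."""
    lines = [
        f"FROM {base_image}",
        "WORKDIR /",
        # Dependencies: git (so pip can install Whisper from GitHub) and ffmpeg
        "RUN apt-get -y update && apt-get -y install git ffmpeg",
        "RUN pip install git+https://github.com/openai/whisper.git -q",
        # The test audio file and the transcription script
        f"ADD {test_audio} {test_audio}",
        f"ADD {script} {script}",
        # Assumption: a build-time test run of the script on the sample audio
        f"RUN python {script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile(), end="")
```

Running the helper and redirecting its output into the Dockerfile gives you a starting point that you can then adjust by hand.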

Running Whisper on Bacalhau

This is the point where you get to transcribe your video. As stated earlier, I’ll be using this YouTube video (it is 8 minutes long) as an example to show how this works. I downloaded the video as an .mp4 file.

You can use any video of your choice; it doesn’t have to be a YouTube video.

Get CID number

After downloading your video, the next step is to upload it to IPFS to get the CID (content identifier) number. You can use NFTUp to upload the video by following the steps below:

Create an account on NFTUp
Get your key on your account page.
Drag and drop your downloaded video for it to be uploaded
Copy your CID number

For this example, the CID number is:

bafybeidwbzzi3hjg54tvdabiesc54lrb3qerzunu4uuahh3o6g3tfmitee

Run the container on Bacalhau

To run the container on Bacalhau, copy and paste the following command into your terminal

bacalhau docker run \
jsacex/whisper \
--gpu 1 \
-v bafybeidwbzzi3hjg54tvdabiesc54lrb3qerzunu4uuahh3o6g3tfmitee:/ytvideo.mp4 \
-- python openai-whisper.py -p ytvideo.mp4 -o outputs

From the above command:

The --gpu flag sets the number of GPUs to use
The -v flag mounts our file to a specific location
-p provides the input path of our file
-o provides the output path of the file

When you run the command, Bacalhau prints out the related job ID: f07d5a18-3c5c-4df7-8269-1695ca61ae86
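If you would rather submit the job from a script than from the terminal, the same command can be assembled and run with Python's subprocess module. This is a minimal sketch under stated assumptions: the function names are hypothetical, only the flags shown in the command above are used, and the job ID is taken to be the first UUID-shaped token in the command's output.

```python
import re
import subprocess


def build_run_command(image, cid, mount_path, script_args, gpus=1):
    """Assemble the argv list for the `bacalhau docker run` call shown above."""
    return [
        "bacalhau", "docker", "run",
        image,
        "--gpu", str(gpus),            # number of GPUs to request
        "-v", f"{cid}:{mount_path}",   # mount the IPFS CID at this path
        "--", *script_args,            # command executed inside the container
    ]


def extract_job_id(output):
    """Return the first UUID-shaped token in the CLI output, or None."""
    match = re.search(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}", output)
    return match.group(0) if match else None


def submit(command):
    """Run the command and return the job ID parsed from its standard output."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return extract_job_id(completed.stdout)
```

Passing the result of build_run_command(...) to submit(...) returns a job ID that you can feed into the follow-up Bacalhau commands.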

At this point, you can run a series of Bacalhau commands to find out more about the submitted job.

To find out the state of your job, run the following command

bacalhau list --id-filter f07d5a18-3c5c-4df7-8269-1695ca61ae86

When it says Completed, this means the job is done, and you can get the results.

To find out more information about your job, run the following command:

bacalhau describe f07d5a18-3c5c-4df7-8269-1695ca61ae86

Once your job is complete, you will see output like this:

Job successfully submitted. Job ID: f07d5a18-3c5c-4df7-8269-1695ca61ae86
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

Creating job for submission ... done ✅
               Finding node(s) for the job ... done ✅
                     Node accepted the job ... done ✅
           Job finished, verifying results ... done ✅
              Results accepted, publishing ... Results CID: QmWPpwPiBtkJtk5tg7FZnEHzWMEZhFUdbz5vWd1dHsTJ6Q
Job Results By Node:
Node QmUDAXvv:
  Shard 0:
    Status: Completed
    Container Exit Code: 0
    Stdout (truncated: last 2000 characters):
      ]  one day this will go back it's not in that comfort zone it's in the discomfort zone
[06:37.840 --> 06:41.520]  is where my confidence is getting good that's what's getting good the people
[06:41.520 --> 06:46.160]  they want an easier answer there has to be an easier way it's not I'm sorry I
[06:46.160 --> 06:54.000]  searched for my entire life we're built for struggle us human beings you know

Note: For the sake of brevity, I removed some parts of the result
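The timestamped lines in that stdout excerpt are easy to turn into structured data if you want to post-process them. The parser below is a minimal sketch that assumes the MM:SS.mmm --> MM:SS.mmm layout seen above. Recordings longer than an hour may print an extra hours field, which this sketch does not handle, and the function name is hypothetical.

```python
import re

# Matches lines such as "[06:37.840 --> 06:41.520]  some text"
SEGMENT_RE = re.compile(r"^\[(\d+):(\d{2}\.\d{3}) --> (\d+):(\d{2}\.\d{3})\]\s*(.*)$")


def parse_segments(stdout_text):
    """Return (start_seconds, end_seconds, text) for each timestamped line."""
    segments = []
    for line in stdout_text.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if not match:
            continue  # skip truncated or non-transcript lines
        start_min, start_sec, end_min, end_sec, text = match.groups()
        start = int(start_min) * 60 + float(start_sec)
        end = int(end_min) * 60 + float(end_sec)
        segments.append((start, end, text.strip()))
    return segments
```

Lines that do not match, such as the truncated first line of the excerpt, are skipped rather than guessed at.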

The job’s outputs are saved on IPFS after processing is complete. To download your results locally, first create an output directory to save them in.

mkdir results

Use the Bacalhau command below to download the results into your output directory

bacalhau get f07d5a18-3c5c-4df7-8269-1695ca61ae86 --output-dir results

After the download has finished, you will find the contents in the results directory.

View the Output

In your results folder, you have three sub-folders that contain the output formats.

You can view your result in .srt, .txt, or .vtt format.

Below is a screenshot from the input.vtt file

And that is it! The results are very accurate and very clean. You can do more with this: you can transcribe your movie footage, podcasts, lecture recordings, etc.

Interesting Applications for Bacalhau

Most likely, this post has piqued your interest in Bacalhau, and if so, I should tell you that I’ve only scratched the surface of the many applications it can be put to. It has a number of interesting applications beyond speech recognition with Whisper. It can be used for image processing, data conversion, generating realistic images using StyleGAN3, and much more.

If you want to learn more about Bacalhau, you can check out the official documentation. You can also check out the project on GitHub and the Slack channel.
