Automating Video Captioning with Whisper AI: Improving Accuracy and Accessibility on YouTube

YouTube is a widely popular platform for sharing videos and information. However, one common challenge faced by content creators and viewers is the accuracy of automatic captions generated by YouTube. In scenarios where there is distortion or noise in the audio, the auto-generated captions may not accurately reflect the content, affecting accessibility and user experience. To address this issue, we can leverage the power of OpenAI's open-source projects to automate the process of creating accurate captions for videos, thereby enhancing accessibility and improving user satisfaction.

Automating Video Captioning with Whisper AI:

OpenAI's Whisper is an advanced automatic speech recognition system that has been trained on a vast amount of multilingual and multitask supervised data. Leveraging state-of-the-art deep learning models, Whisper offers exceptional accuracy in transcribing audio and video content. This capability opens up various opportunities for extracting valuable insights from large volumes of spoken data. In this article, we will focus on utilizing Whisper specifically for transcribing audio from YouTube videos, thereby enhancing accessibility and facilitating analysis of video content.

The Code:

We'll be using Python 3 in Google Colab for this project, since Whisper is distributed as a Python package and can take advantage of Colab's free GPU.

Installing Dependencies:

!pip install pytube
!pip install -U openai-whisper
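
Whisper relies on ffmpeg to decode audio. Colab runtimes generally ship with it preinstalled, but if you are running this somewhere else you may need to install it first (shown here for Debian/Ubuntu systems):

!apt-get install -y ffmpeg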

Reading the YouTube video and downloading its audio as an MP4 file to transcribe:

import pytube

# Paste the link of the YouTube video you want to transcribe
video = '<YouTube video link>'
data = pytube.YouTube(video)

# Grab the audio-only stream and download it as an MP4 file;
# download() returns the path of the saved file
audio = data.streams.get_audio_only()
audio_file = audio.download()
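
If you would rather not depend on the video's title for the file name, pytube's download() also accepts a filename argument, so you can pin the output to a fixed path (the audio.mp4 name here is just an illustrative choice):

# Download to a predictable filename instead of the video's title
audio_file = audio.download(filename="audio.mp4")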

After downloading the audio file, we'll use Whisper to transcribe the audio to text.

You can adjust the model used here. Model choice is typically a tradeoff between accuracy and speed.

All available models are located at https://github.com/openai/whisper/#available-models-and-languages.
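
As a quick sanity check, the whisper package can also list every model name it knows how to load, so you can confirm the options directly in the notebook:

import whisper

# Print the names of all models this Whisper installation can load
print(whisper.available_models())
# e.g. ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', ...]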

import whisper

# Using the "base" model here: it's fast, but with some compromise in accuracy
model = whisper.load_model("base")

Transcribing the audio to text:

# Transcribe the MP4 downloaded above; in Colab the file lands under /content/ by default
result = model.transcribe(audio_file)
print(result["text"])

After running the cell, the transcribed text is printed at the bottom of the cell output.
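
Since the goal is captioning rather than a plain transcript, note that result["segments"] also carries start and end timestamps for every chunk Whisper recognized. That is enough to build an .srt caption file you can upload to YouTube alongside the video. Below is a minimal sketch; the captions.srt filename and the format_timestamp helper are illustrative choices, not part of Whisper's API.

def format_timestamp(seconds):
    # SRT timestamps use the form HH:MM:SS,mmm
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Write each transcribed segment as a numbered SRT entry
with open("captions.srt", "w", encoding="utf-8") as srt:
    for i, segment in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n")
        srt.write(f"{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n")
        srt.write(f"{segment['text'].strip()}\n\n")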