This browser-based project lets the Whisper model transcribe incoming audio in real time.
Translation is not supported yet (it may be added later).
[Blog] [Paper] [Model card] [Colab example]
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
- Use the WebRTC API to transmit audio data to the backend in real time.
- Extend the model so it caches previous outputs, mitigating duplicate computation.
- Make real-time transcription happen.
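The caching idea above can be sketched as a rolling audio buffer that remembers how much of the stream already has a stable transcript, so only the new tail is decoded again. This is an illustrative sketch, not the project's actual implementation; all names are hypothetical.

```python
class RollingAudioBuffer:
    """Hypothetical buffer: accumulates incoming PCM samples and tracks
    how much audio has already produced a final transcript."""

    def __init__(self, sample_rate: int = 16000):
        self.sample_rate = sample_rate
        self.samples: list[float] = []
        self.committed = 0  # count of samples whose transcript is final

    def append(self, chunk: list[float]) -> None:
        # Called for each audio packet arriving over WebRTC.
        self.samples.extend(chunk)

    def pending(self) -> list[float]:
        # Only audio past the committed point needs to be decoded again.
        return self.samples[self.committed:]

    def commit(self, n_samples: int) -> None:
        # Mark the oldest n_samples of the pending audio as finalized.
        self.committed = min(self.committed + n_samples, len(self.samples))


buf = RollingAudioBuffer()
buf.append([0.0] * 16000)  # one second of silence at 16 kHz
buf.commit(8000)           # pretend the first half transcribed cleanly
print(len(buf.pending()))  # 8000 samples still need decoding
```

Each decode pass then re-runs the model only on `pending()`, instead of the whole stream from the start.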
Install the dependencies from requirements.txt with pip; ffmpeg is not needed.
```shell
# in backend
pip3 install -r requirements.txt
```
To start:
```shell
cd frontend && npm run dev
cd ../backend && python3 main.py
```
TODO:
- Dockerfile for the frontend and backend
- Timestamps for individual words can be extracted from the timestamp token emitted after each word.
- The timestamp after each token doesn't produce nice results, so I don't know how this will fare with scriptio continua languages (those written without spaces between words).
- I need ideas for continuous transcription. Please send any methodology to twitter: @gslaller.
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
---|---|---|---|---|---|
tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
base | 74 M | base.en | base | ~1 GB | ~16x |
small | 244 M | small.en | small | ~2 GB | ~6x |
medium | 769 M | medium.en | medium | ~5 GB | ~2x |
large | 1550 M | N/A | large | ~10 GB | 1x |