This browser-based project lets the Whisper model transcribe incoming audio in real time.
Translation is not supported yet (it may be added later).
[Blog] [Paper] [Model card] [Colab example]
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
- Use the WebRTC API to transmit audio data to the backend in real time.
- Extend the model so it caches previous outputs, mitigating duplicate computation.
- Make real-time transcription happen.
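The caching idea above can be sketched as a rolling audio buffer that remembers how much of the stream already has a stable transcript, so only the new tail is decoded again. This is an illustrative sketch, not the project's actual implementation; all names are hypothetical.

```python
class RollingAudioBuffer:
    """Hypothetical buffer: accumulates incoming PCM samples and tracks
    how much audio has already produced a final transcript."""

    def __init__(self, sample_rate: int = 16000):
        self.sample_rate = sample_rate
        self.samples: list[float] = []
        self.committed = 0  # count of samples whose transcript is final

    def append(self, chunk: list[float]) -> None:
        # Called for each audio packet arriving over WebRTC.
        self.samples.extend(chunk)

    def pending(self) -> list[float]:
        # Only audio past the committed point needs to be decoded again.
        return self.samples[self.committed:]

    def commit(self, n_samples: int) -> None:
        # Mark the oldest n_samples of the pending audio as finalized.
        self.committed = min(self.committed + n_samples, len(self.samples))


buf = RollingAudioBuffer()
buf.append([0.0] * 16000)  # one second of silence at 16 kHz
buf.commit(8000)           # pretend the first half transcribed cleanly
print(len(buf.pending()))  # 8000 samples still need decoding
```

Each decode pass then re-runs the model only on `pending()`, instead of the whole stream from the start.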
Install the dependencies from requirements.txt with pip; ffmpeg is not needed.
```shell
# in backend
pip3 install -r requirements.txt
```
To start:
```shell
cd frontend && npm run dev
cd ../backend && python3 main.py
```
TODO:
- Dockerfile for the frontend and backend
- Timestamps for individual words can be extracted from the timestamp token emitted after each word.
- The timestamp after each token doesn't produce nice results, so I don't know how this will fare with scriptio continua languages (those written without spaces between words).
- I need ideas for continuous transcription. Please send any methodology to twitter: @gslaller.
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
---|---|---|---|---|---|
tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
base | 74 M | base.en | base | ~1 GB | ~16x |
small | 244 M | small.en | small | ~2 GB | ~6x |
medium | 769 M | medium.en | medium | ~5 GB | ~2x |
large | 1550 M | N/A | large | ~10 GB | 1x |