localwhisperx's Introduction

localwhisperx.py

About

Performs local speech to text transcription and speaker identification of audio and video files.

Especially useful for transcribing user interviews where confidentiality might be an issue or where data privacy is needed. No data leaves your computer; everything runs locally.

Transcription will take considerably longer than you might be used to, since this script does not use any GPU optimization.

On a 2023 Mac M3 Pro this script runs roughly in real time, meaning one minute of audio takes about one minute to process. Identifying speakers takes most of this time.

For many or large files it might be best to run this tool overnight.

Getting Started

This script has been tested on macOS 14.4.1 (23E224). Your mileage on Windows or another OS might vary.

Prerequisites

These instructions are for usage on macOS.

Installation

  • Download this repository and extract it to a folder of your choice. You should see these files:
    • config.yml
    • localwhisperx.py
  • Install these necessary Python modules:
    pip3 install pyyaml
    pip3 install ffmpeg-python
    pip3 install git+https://github.com/m-bain/whisperx.git
  • Open config.yml with your favourite editor, e.g. TextEdit, and add your Hugging Face User Access Token in the appropriate line.
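
To confirm that the three installs above succeeded before starting a long transcription, a small check like the following can help. This is a minimal sketch and not part of the repository. It assumes only that the packages are importable under the names yaml, ffmpeg and whisperx; the function name missing_packages is made up for this example.

```python
import importlib.util

# Import name -> what was passed to pip3 above.
# Note: "pyyaml" is imported as "yaml", "ffmpeg-python" as "ffmpeg".
REQUIRED_MODULES = {
    "yaml": "pyyaml",
    "ffmpeg": "ffmpeg-python",
    "whisperx": "git+https://github.com/m-bain/whisperx.git",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of all packages whose module cannot be found."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        for pip_name in missing:
            print(f"Missing, install with: pip3 install {pip_name}")
    else:
        print("All required modules were found.")
```

The check only looks the modules up and does not import them, so it finishes quickly even though whisperx itself is slow to load.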

Usage

  • Put your audio and/or video files into a directory and open a Terminal.
  • In the open Terminal, change to the directory where you extracted this repository:
    cd path/to/unzipped/localwhisperx
  • Transcribe your files with the following command:
    python3 localwhisperx.py your/audio/or/video/file/or/folder
    This will run the script and create a .txt file for the provided file or all files within the given directory.

On the first run, the program downloads a local copy of the speech recognition model, which requires a few gigabytes of free disk space.
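
The command above accepts either a single file or a folder. How such an argument could be expanded into a list of inputs and matching .txt outputs is sketched below. This is an illustration, not the code of localwhisperx.py: the extension list, the non-recursive folder scan, the placement of the transcript next to the media file, and both function names are assumptions.

```python
from pathlib import Path

# Assumed set of extensions; the real script may accept others.
MEDIA_EXTENSIONS = {".mp3", ".wav", ".m4a", ".mp4", ".mov"}


def collect_media_files(target):
    """Return the media files for a single file or a folder (non-recursive)."""
    path = Path(target)
    if path.is_file():
        # A file given explicitly is used as-is, whatever its extension.
        return [path]
    if path.is_dir():
        return sorted(
            entry
            for entry in path.iterdir()
            if entry.is_file() and entry.suffix.lower() in MEDIA_EXTENSIONS
        )
    raise FileNotFoundError(f"No such file or directory: {target}")


def transcript_path(media_file):
    """Place the .txt transcript next to the media file (an assumption)."""
    return Path(media_file).with_suffix(".txt")
```

With these helpers, a folder containing interview1.mp3 and interview2.mp4 would map to interview1.txt and interview2.txt in the same folder.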

Command line arguments

The script can take a couple of command line arguments to refine your transcription:

  • --language Language spoken in all files, e.g. en; default is de
  • --minspeaker Minimum number of speakers; default is 1
  • --maxspeaker Maximum number of speakers; default is 2

Here's an example that transcribes an English-language audio file with exactly 4 speakers:

python3 localwhisperx.py test.mp3 --language en --minspeaker 4 --maxspeaker 4
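
For readers who want to see how these three options and their defaults fit together, here is a minimal argparse sketch. It mirrors the flag names and defaults listed above, but it is not taken from localwhisperx.py: the positional argument name path, the helper names, and the check that --minspeaker must not exceed --maxspeaker are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="localwhisperx.py",
        description="Local transcription and speaker identification.",
    )
    # Hypothetical name for the file-or-folder argument.
    parser.add_argument("path", help="audio/video file or folder")
    parser.add_argument("--language", default="de",
                        help="language spoken in all files, e.g. en")
    parser.add_argument("--minspeaker", type=int, default=1,
                        help="minimum number of speakers")
    parser.add_argument("--maxspeaker", type=int, default=2,
                        help="maximum number of speakers")
    return parser


def parse_args(argv=None):
    """Parse arguments and reject an inconsistent speaker range."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed sanity check; the real script may behave differently.
    if args.minspeaker < 1 or args.minspeaker > args.maxspeaker:
        parser.error("--minspeaker must be >= 1 and <= --maxspeaker")
    return args
```

Setting --minspeaker and --maxspeaker to the same value, as in the example command, pins the speaker count to exactly that number.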

Selecting a model size

The file config.yml contains a line to change the model size. A larger model will generally result in a more accurate transcription, but processing time will increase as well.

You can select one of the following options:

  • tiny
  • base
  • small
  • medium (default, recommended compromise between accuracy and processing time, takes about 1.5 GB of space on your hard disk)
  • large
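
A typo in the model size would otherwise only surface once the model is loaded, so validating the value early can be useful. The sketch below shows one way to do that on an already loaded configuration mapping. The key name model_size and the function name are hypothetical; check config.yml for the actual key.

```python
VALID_MODEL_SIZES = ("tiny", "base", "small", "medium", "large")
DEFAULT_MODEL_SIZE = "medium"


def resolve_model_size(config):
    """Return a validated model size from a loaded config mapping."""
    # "model_size" is a hypothetical key name used for illustration only.
    value = config.get("model_size", DEFAULT_MODEL_SIZE)
    size = str(value).strip().lower()
    if size not in VALID_MODEL_SIZES:
        options = ", ".join(VALID_MODEL_SIZES)
        raise ValueError(f"Unknown model size {value!r}; choose one of: {options}")
    return size
```

If the key is missing, the function falls back to medium, matching the default named above.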

Troubleshooting

The tool might show a few warnings. It is safe to ignore these:

UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.0. Bad things might happen unless you revert torch to 1.x.
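
If the output gets too noisy, the first warning in the list can be hidden with Python's standard warnings module. This is a hedged sketch, not something the script does itself: it only covers the torchaudio UserWarning, the function name is made up, and the remaining lines may be printed through other channels that this filter does not affect.

```python
import warnings


def silence_torchaudio_backend_warning():
    """Ignore the deprecation notice for torchaudio's set_audio_backend."""
    warnings.filterwarnings(
        "ignore",
        # The pattern is matched against the start of the warning text.
        message=r"torchaudio\._backend\.set_audio_backend has been deprecated",
        category=UserWarning,
    )
```

The filter would need to be installed before the libraries that trigger the warning are imported.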

Next steps

If you're transcribing interviews, you can paste the resulting transcriptions into ChatGPT. For example, try the Interview Analyst GPT to get a quick analysis.

Acknowledgments

localwhisperx's Issues

No language detected

Check and correct why the message "No language specified, language will be first be detected for each audio file (increases inference time)." is displayed although everything seems fine.
