
multivent_asr

Introduction

This project aims to automatically segment and transcribe the MultiVENT dataset, reducing the time and effort required for human annotators to make further corrections.

Description of the recipe

The whole recipe can be run with:

./run.sh

Here is the detailed explanation of the recipe:

Preparation

This step generates Lhotse CutSets of the MultiVENT dataset for fine-tuning an ASR model. It first converts each video to WAV audio with a sampling rate of 16 kHz. It then decodes and segments the audio with whisper-timestamped to produce a CTM file. Segments that are too long (e.g., over 20 seconds) are resegmented into shorter ones using an LLM (GPT). Finally, fbank features are extracted from the resulting data, which is stored as Lhotse CutSets.

./prepare.sh \
  --corpus-dir "${corpus_dir}" \
  --lang-dir "${pretrained_lang_dir}"
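The first stage of the preparation (video to 16 kHz WAV) can be sketched as follows. This is a minimal illustration, not the code used by prepare.sh: it assumes ffmpeg is installed and on the PATH, and the mono downmix, the 16-bit PCM codec, and the helper names build_ffmpeg_cmd and video_to_wav are assumptions made for the example.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(video: Path, wav: Path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that extracts the audio track as WAV."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(video),          # input video file
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to mono (assumption)
        "-ar", str(sample_rate),   # resample to 16 kHz by default
        "-c:a", "pcm_s16le",       # 16-bit PCM samples (assumption)
        str(wav),
    ]


def video_to_wav(video: Path, wav: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    wav.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video, wav), check=True)
```

Calling video_to_wav(Path("clip.mp4"), Path("wav/clip.wav")) would write the converted audio to wav/clip.wav; both paths are placeholders.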

Note: To use GPT for resegmentation, set the OPENAI_API_KEY environment variable first:

export OPENAI_API_KEY=YOUR_OPENAI_API_KEY
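Before any resegmentation, the over-long segments have to be identified. The sketch below shows one way to do that. The Segment class and split_by_length are hypothetical names for this example, and the 20-second threshold is taken from the example in the text rather than from the recipe's configuration.

```python
from dataclasses import dataclass

MAX_SEGMENT_SECONDS = 20.0  # example threshold from the text (assumption)


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcript of the segment

    @property
    def duration(self) -> float:
        return self.end - self.start


def split_by_length(segments, max_seconds=MAX_SEGMENT_SECONDS):
    """Separate segments that fit the limit from those needing resegmentation."""
    keep, too_long = [], []
    for seg in segments:
        (too_long if seg.duration > max_seconds else keep).append(seg)
    return keep, too_long
```

Only the segments in too_long would then be passed to the LLM, which keeps the number of API calls small.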

Fine-tuning the Icefall ASR model

We fine-tune a pre-trained Icefall Zipformer Stateless Transducer model using the data prepared in the previous step.

./pruned_transducer_stateless7/finetune.py \
  --world-size ${gpus} \
  --num-epochs 20 \
  --start-epoch 1 \
  --exp-dir "${finetune_out_exp_dir}" \
  --base-lr 0.005 \
  --lr-epochs 100 \
  --lr-batches 100000 \
  --bpe-model "${pretrained_model_dir}/data/lang_bpe_500/bpe.model" \
  --do-finetune True \
  --finetune-ckpt "${pretrained_model_dir}/exp/pretrained.pt" \
  --max-duration 500 \
  --language "en"
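To get a feel for the learning-rate options above, the sketch below evaluates the decay factor of the Eden scheduler used in upstream Icefall Zipformer recipes. That finetune.py uses this scheduler unchanged is an assumption, and the warm-up phase is omitted, so treat the numbers as indicative only. The function name eden_factor is made up for this example.

```python
def eden_factor(batch: int, epoch: int, lr_batches: float, lr_epochs: float) -> float:
    """Eden-style decay factor (warm-up omitted); the learning rate is base_lr * factor."""
    batch_term = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_term = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return batch_term * epoch_term


if __name__ == "__main__":
    base_lr = 0.005
    for epoch in (0, 10, 20):
        # batch is held at 0 to isolate the effect of the epoch term
        factor = eden_factor(batch=0, epoch=epoch, lr_batches=100000, lr_epochs=100)
        print(f"epoch {epoch:2d}: lr = {base_lr * factor:.6f}")
```

Under these assumptions, with --lr-epochs 100 the epoch term lowers the learning rate by only about 1% over 20 epochs, so the rate stays close to --base-lr for the whole fine-tuning run.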

We tried two pretrained models: one trained on LibriSpeech (1,000 hours) and the other on GigaSpeech (10,000 hours). Here are the fine-tuning results on the test set of MultiVENT created by Dongji Gao:

          LibriSpeech   GigaSpeech
WER (%)   25.06         36.94
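For reference, WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a minimal implementation of that definition. It is not the scoring script used to produce the numbers above, which may also normalize the text before scoring.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(ref_text: str, hyp_text: str) -> float:
    """Word error rate in percent."""
    ref, hyp = ref_text.split(), hyp_text.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.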

OTC flexible alignment

This step performs OTC flexible alignment using an external ASR model. The flexible alignment modifies the result from Whisper by replacing suspicious or incorrectly inserted tokens with *, and by placing a * where a word is missing. This can provide detailed information to human annotators about which parts are likely to be wrong.

for event in "${events[@]}"; do
  for language in "${languages[@]}"; do
    ./conformer_ctc/otc_alignment.py \
      --event "${event}" \
      --language "${language}" \
      --method "${decoding_method}" \
      --exp-dir "${exp_dir}" \
      --lang-dir "${lang_dir}"
  done
done
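To show how the * markers could guide an annotator, the sketch below lists each flagged position together with its neighboring words. It assumes the aligned output is whitespace-separated text with a literal * token, which is an assumption about the file format. The helper star_contexts is hypothetical and not part of the recipe.

```python
STAR = "*"  # marker for a suspicious, inserted, or missing token


def star_contexts(aligned_text: str, width: int = 2) -> list[tuple[int, str]]:
    """Return (token index, surrounding words) for every star token."""
    tokens = aligned_text.split()
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == STAR:
            lo = max(0, i - width)
            contexts.append((i, " ".join(tokens[lo:i + width + 1])))
    return contexts
```

An annotator tool could then jump directly to the listed positions instead of re-listening to the whole segment.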

Post-processing

This step converts the aligned text back to Whisper style for readability.

for event in "${events[@]}"; do
  for language in "${languages[@]}"; do
    local/post_process_text.py \
      --text "${exp_dir}/otc-alignment-${event}_${language}.txt" \
      --whisper-text "data/${event}_${language}/text_raw" \
      --output-dir "data/${event}_${language}"
  done
done
