errors made by a speech recognition system
overall idea: by "finetuning" an n-gram language model, one should be able to counteract errors that are specific to the acoustic model
- take pretrained NeMo QuartzNet (QuartzNet5x5LS-En)
- train KenLM on librispeech-lm-data; I suppose that's how the LM models at openslr were created (see the sketch below)
-> receive librispeech-KenLM
- use QuartzNet + librispeech-KenLM to predict on the TEDLIUMv2 trainset
-> receive ~90k correction tuples (hypothesis, reference)
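A minimal sketch of the KenLM training step above, assuming the KenLM binaries (`lmplz`, `build_binary`) are compiled and on PATH and that `librispeech-lm-norm.txt` is the decompressed librispeech-lm-data corpus; the order 3 is a guess:

```python
import subprocess

# lmplz reads the corpus from stdin and writes the arpa to stdout
with open("librispeech-lm-norm.txt") as src, open("librispeech-kenlm.arpa", "w") as arpa:
    subprocess.run(["lmplz", "-o", "3"], stdin=src, stdout=arpa, check=True)

# binarize for faster loading at decode time
subprocess.run(["build_binary", "librispeech-kenlm.arpa", "librispeech-kenlm.binary"], check=True)
```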
- use the Smith-Waterman algorithm (stolen from Kaldi) to align hypotheses (predictions) and references
-> looks like:

    hyp = "hee cad i blac"
    ref = "I think the cat is black"

    alignment:
    ref: I think th|e cat is black
    hyp: |||||||||hee cad i| blac|
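The Kaldi-derived aligner isn't reproduced here; as a rough stand-in, a plain character-level edit-distance alignment with "|" padding (Needleman-Wunsch-style unit costs, not Kaldi's exact Smith-Waterman scoring) could look like this:

```python
def align(ref, hyp, gap="|"):
    """Globally align two strings, padding insertions/deletions with `gap`.
    Tie-breaking may differ slightly from the alignment shown above."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # backtrace from the bottom-right corner
    a_ref, a_hyp = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            a_ref.append(ref[i - 1]); a_hyp.append(hyp[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            a_ref.append(ref[i - 1]); a_hyp.append(gap); i -= 1
        else:
            a_ref.append(gap); a_hyp.append(hyp[j - 1]); j -= 1
    return "".join(reversed(a_ref)), "".join(reversed(a_hyp))

print(align("I think the cat is black", "hee cad i blac"))
```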
- TED-talks contain "missus": 98 times QuartzNet + librispeech-KenLM understood "this" instead of "missus"

    ["missus", {"this": 98, "is": 98, "this is": 76, "a": 9, "and": 7}]
- TED-talks prefer the two-word "per cent"

    ["per cent", {"percent": 1083, "percent of": 17, "percent oil": 3, "percent that": 2, "forty percent": 2}]
- 104 times QuartzNet + librispeech-KenLM inserts an "and" between "hundred" and "fifty" -> not really an error?

    ["hundred fifty", {"hundred and fifty": 104, "one hundred and": 1, "a hundred and": 1}]
- see ngram_counts.jsonl for the 1000 most frequently misunderstood ngrams
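A sketch of how such counts could be produced once (reference ngram, hypothesis ngram) confusion pairs have been cut out of the alignments; the span-extraction logic itself isn't shown in these notes, so the pairs below are just the examples from above:

```python
import json
from collections import Counter, defaultdict

# hypothetical input: one (ref_ngram, hyp_ngram) pair per observed confusion
confusions = [("missus", "this"), ("missus", "is"), ("per cent", "percent")]

counts = defaultdict(Counter)
for ref_ngram, hyp_ngram in confusions:
    counts[ref_ngram][hyp_ngram] += 1

# one JSON line per reference ngram, most frequently confused first
with open("ngram_counts.jsonl", "w") as f:
    for ref_ngram, hyp_counts in sorted(counts.items(), key=lambda kv: -sum(kv[1].values())):
        f.write(json.dumps([ref_ngram, dict(hyp_counts.most_common())]) + "\n")
```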
- train an enhanced KenLM on an enhanced train-corpus = librispeech-lm-data + correction ngrams of erroneous phrases from the TEDLIUMv2 trainset -> should give "hundred fifty" a slightly higher probability
- got WER of 0.289 vs. 0.284 (see) -> does not really make a difference!
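The notes don't say how WER was computed; one common option is the jiwer package, roughly:

```python
from jiwer import wer

# hypothetical: TEDLIUM reference transcripts and the corresponding
# beam-search outputs of the vanilla vs. enhanced decoder
refs = ["we're learning about cellular mechanics once we figure out"]
hyps = ["were learning about celoron armchairs and once we figure out"]
print(wer(refs, hyps))  # corpus-level word error rate
```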
- decoding is done with CTC prefix beam search + ngram LM, as implemented by NeMo and OpenSeq2Seq (sketch below)
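A rough sketch of that decoding setup, following NeMo 1.x's Offline_ASR tutorial; class and argument names may differ between NeMo versions, the OpenSeq2Seq-derived ctc_decoders package must be installed, and the alpha/beta values are guesses:

```python
import os
import numpy as np
import nemo.collections.asr as nemo_asr

# pretrained acoustic model from above
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet5x5LS-En")

# per-frame logits for one file, then softmax to probabilities
logits = asr_model.transcribe(["sample.wav"], logprobs=True)[0]
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)

# CTC prefix beam search fused with the KenLM
beam_search_lm = nemo_asr.modules.BeamSearchDecoderWithLM(
    vocab=list(asr_model.decoder.vocabulary),
    beam_width=128,
    alpha=2.0,   # LM weight
    beta=1.5,    # word insertion bonus
    lm_path="librispeech-kenlm.binary",
    num_cpus=max(os.cpu_count() or 1, 1),
    input_tensor=False,
)
candidates = beam_search_lm.forward(log_probs=np.expand_dims(probs, 0), log_probs_length=None)[0]
print(candidates[0])  # (score, transcript) of the best beam
```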
- correction ngrams:
    - only 3-grams were taken -> ~1 million (800k unique)
    - added 10x to librispeech-lm-data -> see (sketch below)
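A trivial sketch of that 10x augmentation; the file names are hypothetical since the notes don't give the exact paths:

```python
import gzip
import shutil

with open("enhanced-train-corpus.txt", "w", encoding="utf-8") as out:
    # base corpus: librispeech-lm-data, streamed to avoid loading ~GBs into memory
    with gzip.open("librispeech-lm-norm.txt.gz", "rt", encoding="utf-8") as f:
        shutil.copyfileobj(f, out)
    # correction 3-grams, one per line, repeated 10x as described above
    with open("correction_3grams.txt", encoding="utf-8") as f:
        ngrams = f.read()
    for _ in range(10):
        out.write(ngrams)
```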
- impact on kenlm-arpa-file

    hyp: were learning about celoron armchairs and once we figure out
    ref: we're learning about cellular mechanics once we figure out

    zcat ngrams.txt.gz | rg --color=always "cellular" | wc -l   -> 134
    zcat kenlm_cache_vanilla/lower.txt.gz | rg cellular | wc -l -> 324
    zcat kenlm_cache_tedlium/lower.txt.gz | rg cellular | wc -l -> 1664

    cat kenlm_vanilla/lm_filtered.arpa | rg "\tcellular\t" -> -5.7082434
    cat kenlm_tedlium/lm_filtered.arpa | rg "\tcellular\t" -> -5.644795
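The same unigram scores can be pulled out with the kenlm Python bindings instead of grepping the arpa files; paths are the ones from the greps above:

```python
import kenlm  # pip install https://github.com/kpu/kenlm/archive/master.zip

# log10 probability of the bare unigram "cellular"; bos/eos disabled so
# this matches the \1-grams entries grepped above
for name in ["kenlm_vanilla", "kenlm_tedlium"]:
    model = kenlm.Model(f"{name}/lm_filtered.arpa")
    print(name, model.score("cellular", bos=False, eos=False))
```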
- processing TEDLIUMv2
- notebook for downloading + mp3-converting librispeech data on google colab -> takes ages but is for free
    found: /mydrive/data/asr_data/ENGLISH/train-clean-100.tar.gz no need to download
    wrote /mydrive/data/asr_data/ENGLISH/train-clean-100_processed_mp3.tar.gz
    downloading: http://www.openslr.org/resources/12/train-clean-360.tar.gz
    wrote /mydrive/data/asr_data/ENGLISH/train-clean-360_processed_mp3.tar.gz
    28539it [33:54, 14.03it/s]
    104014it [2:13:54, 12.95it/s]
    CPU times: user 5.85 s, sys: 1.07 s, total: 6.92 s
    Wall time: 5h 31min 3s
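The notebook itself isn't shown; per file, the conversion presumably boils down to something like this (pydub with ffmpeg on PATH assumed; the example path and bitrate are guesses):

```python
from pathlib import Path
from pydub import AudioSegment  # requires ffmpeg installed

# hypothetical single file out of the ~104k LibriSpeech flacs converted above
flac = Path("LibriSpeech/train-clean-360/61/70968/61-70968-0000.flac")
audio = AudioSegment.from_file(flac, format="flac")
audio.export(flac.with_suffix(".mp3"), format="mp3", bitrate="64k")
```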