errors made by a speech recognition system
overall idea: by "finetuning" an n-gram language model, one should be able to counteract errors that are specific to the acoustic model
- take pretrained NeMo QuartzNet (QuartzNet5x5LS-En)
- train KenLM on librispeech-lm-data; I suppose that's how the LM models at openslr were created (see the sketch below)
-> receive librispeech-KenLM
- use QuartzNet + librispeech-KenLM to predict on the TEDLIUMv2 trainset
-> receive ~90k correction tuples (hypothesis, reference)
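A minimal sketch of the KenLM training step above, assuming the KenLM binaries (`lmplz`, `build_binary`) are compiled and on PATH and that `librispeech-lm-norm.txt` is the decompressed librispeech-lm-data corpus; the order 3 is a guess:

```python
import subprocess

# lmplz reads the corpus from stdin and writes the arpa to stdout
with open("librispeech-lm-norm.txt") as src, open("librispeech-kenlm.arpa", "w") as arpa:
    subprocess.run(["lmplz", "-o", "3"], stdin=src, stdout=arpa, check=True)

# binarize for faster loading at decode time
subprocess.run(["build_binary", "librispeech-kenlm.arpa", "librispeech-kenlm.binary"], check=True)
```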
- use the Smith-Waterman algorithm (stolen from Kaldi) to align hypotheses (predictions) and references
-> looks like:

    hyp = "hee cad i blac"
    ref = "I think the cat is black"

    alignment:
    ref: I think th|e cat is black
    hyp: |||||||||hee cad i| blac|
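The Kaldi-derived aligner isn't reproduced here; as a rough stand-in, a plain character-level edit-distance alignment with "|" padding (Needleman-Wunsch-style unit costs, not Kaldi's exact Smith-Waterman scoring) could look like this:

```python
def align(ref, hyp, gap="|"):
    """Globally align two strings, padding insertions/deletions with `gap`.
    Tie-breaking may differ slightly from the alignment shown above."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # backtrace from the bottom-right corner
    a_ref, a_hyp = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            a_ref.append(ref[i - 1]); a_hyp.append(hyp[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            a_ref.append(ref[i - 1]); a_hyp.append(gap); i -= 1
        else:
            a_ref.append(gap); a_hyp.append(hyp[j - 1]); j -= 1
    return "".join(reversed(a_ref)), "".join(reversed(a_hyp))

print(align("I think the cat is black", "hee cad i blac"))
```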
- TED-talks contain "missus": 98 times QuartzNet + librispeech-KenLM understood "this" instead of "missus"

    ["missus", {"this": 98, "is": 98, "this is": 76, "a": 9, "and": 7}]
- TED-talks prefer the two-word "per cent"

    ["per cent", {"percent": 1083, "percent of": 17, "percent oil": 3, "percent that": 2, "forty percent": 2}]
- 104 times QuartzNet + librispeech-KenLM inserts an "and" between "hundred" and "fifty" -> not really an error?

    ["hundred fifty", {"hundred and fifty": 104, "one hundred and": 1, "a hundred and": 1}]
- see ngram_counts.jsonl for the 1000 most frequently misunderstood ngrams
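A sketch of how such counts could be produced once (reference ngram, hypothesis ngram) confusion pairs have been cut out of the alignments; the span-extraction logic itself isn't shown in these notes, so the pairs below are just the examples from above:

```python
import json
from collections import Counter, defaultdict

# hypothetical input: one (ref_ngram, hyp_ngram) pair per observed confusion
confusions = [("missus", "this"), ("missus", "is"), ("per cent", "percent")]

counts = defaultdict(Counter)
for ref_ngram, hyp_ngram in confusions:
    counts[ref_ngram][hyp_ngram] += 1

# one JSON line per reference ngram, most frequently confused first
with open("ngram_counts.jsonl", "w") as f:
    for ref_ngram, hyp_counts in sorted(counts.items(), key=lambda kv: -sum(kv[1].values())):
        f.write(json.dumps([ref_ngram, dict(hyp_counts.most_common())]) + "\n")
```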
- train an enhanced KenLM on an enhanced train-corpus = librispeech-lm-data + correction ngrams of erroneous phrases from the TEDLIUMv2 trainset -> should give "hundred fifty" a slightly higher probability
- got WER of 0.289 vs. 0.284 (see) -> does not really make a difference!
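The notes don't say how WER was computed; one common option is the jiwer package, roughly:

```python
from jiwer import wer

# hypothetical: TEDLIUM reference transcripts and the corresponding
# beam-search outputs of the vanilla vs. enhanced decoder
refs = ["we're learning about cellular mechanics once we figure out"]
hyps = ["were learning about celoron armchairs and once we figure out"]
print(wer(refs, hyps))  # corpus-level word error rate
```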
- decoding is done with CTC prefix beam search + ngram LM, as implemented by NeMo and OpenSeq2Seq (sketch below)
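A rough sketch of that decoding setup, following NeMo 1.x's Offline_ASR tutorial; class and argument names may differ between NeMo versions, the OpenSeq2Seq-derived ctc_decoders package must be installed, and the alpha/beta values are guesses:

```python
import os
import numpy as np
import nemo.collections.asr as nemo_asr

# pretrained acoustic model from above
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet5x5LS-En")

# per-frame logits for one file, then softmax to probabilities
logits = asr_model.transcribe(["sample.wav"], logprobs=True)[0]
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)

# CTC prefix beam search fused with the KenLM
beam_search_lm = nemo_asr.modules.BeamSearchDecoderWithLM(
    vocab=list(asr_model.decoder.vocabulary),
    beam_width=128,
    alpha=2.0,   # LM weight
    beta=1.5,    # word insertion bonus
    lm_path="librispeech-kenlm.binary",
    num_cpus=max(os.cpu_count() or 1, 1),
    input_tensor=False,
)
candidates = beam_search_lm.forward(log_probs=np.expand_dims(probs, 0), log_probs_length=None)[0]
print(candidates[0])  # (score, transcript) of the best beam
```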
- correction ngrams:
    - only 3-grams were taken -> ~1 million (800k unique)
    - added 10x to librispeech-lm-data -> see (sketch below)
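A trivial sketch of that 10x augmentation; the file names are hypothetical since the notes don't give the exact paths:

```python
import gzip
import shutil

with open("enhanced-train-corpus.txt", "w", encoding="utf-8") as out:
    # base corpus: librispeech-lm-data, streamed to avoid loading ~GBs into memory
    with gzip.open("librispeech-lm-norm.txt.gz", "rt", encoding="utf-8") as f:
        shutil.copyfileobj(f, out)
    # correction 3-grams, one per line, repeated 10x as described above
    with open("correction_3grams.txt", encoding="utf-8") as f:
        ngrams = f.read()
    for _ in range(10):
        out.write(ngrams)
```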
- impact on kenlm-arpa-file

    hyp: were learning about celoron armchairs and once we figure out
    ref: we're learning about cellular mechanics once we figure out

    zcat ngrams.txt.gz | rg --color=always "cellular" | wc -l   -> 134
    zcat kenlm_cache_vanilla/lower.txt.gz | rg cellular | wc -l -> 324
    zcat kenlm_cache_tedlium/lower.txt.gz | rg cellular | wc -l -> 1664

    cat kenlm_vanilla/lm_filtered.arpa | rg "\tcellular\t" -> -5.7082434
    cat kenlm_tedlium/lm_filtered.arpa | rg "\tcellular\t" -> -5.644795
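The same unigram scores can be pulled out with the kenlm Python bindings instead of grepping the arpa files; paths are the ones from the greps above:

```python
import kenlm  # pip install https://github.com/kpu/kenlm/archive/master.zip

# log10 probability of the bare unigram "cellular"; bos/eos disabled so
# this matches the \1-grams entries grepped above
for name in ["kenlm_vanilla", "kenlm_tedlium"]:
    model = kenlm.Model(f"{name}/lm_filtered.arpa")
    print(name, model.score("cellular", bos=False, eos=False))
```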
- processing TEDLIUMv2
- notebook for downloading + mp3-converting librispeech data on google colab -> takes ages but is for free
    found: /mydrive/data/asr_data/ENGLISH/train-clean-100.tar.gz no need to download
    wrote /mydrive/data/asr_data/ENGLISH/train-clean-100_processed_mp3.tar.gz
    downloading: http://www.openslr.org/resources/12/train-clean-360.tar.gz
    wrote /mydrive/data/asr_data/ENGLISH/train-clean-360_processed_mp3.tar.gz
    28539it [33:54, 14.03it/s]
    104014it [2:13:54, 12.95it/s]
    CPU times: user 5.85 s, sys: 1.07 s, total: 6.92 s
    Wall time: 5h 31min 3s
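The notebook itself isn't shown; per file, the conversion presumably boils down to something like this (pydub with ffmpeg on PATH assumed; the example path and bitrate are guesses):

```python
from pathlib import Path
from pydub import AudioSegment  # requires ffmpeg installed

# hypothetical single file out of the ~104k LibriSpeech flacs converted above
flac = Path("LibriSpeech/train-clean-360/61/70968/61-70968-0000.flac")
audio = AudioSegment.from_file(flac, format="flac")
audio.export(flac.with_suffix(".mp3"), format="mp3", bitrate="64k")
```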