
icefall's Introduction

Introduction

The icefall project contains speech-related recipes for various datasets using k2-fsa and lhotse.

You can use sherpa, sherpa-ncnn, or sherpa-onnx to deploy models trained with icefall. These frameworks also support models not included in icefall; please refer to their respective documentation for more details.

You can try pre-trained models directly in your browser, without downloading or installing anything, by visiting this huggingface space. Please refer to the documentation for more details.

Installation

Please refer to the documentation for installation instructions.

Recipes

Please refer to the documentation for more details.

ASR: Automatic Speech Recognition

Supported Datasets

More datasets will be added in the future.

Supported Models

The LibriSpeech recipe supports the most comprehensive set of models; you are welcome to try them out.

CTC

  • TDNN LSTM CTC
  • Conformer CTC
  • Zipformer CTC

MMI

  • Conformer MMI
  • Zipformer MMI

Transducer

  • Conformer-based Encoder
  • LSTM-based Encoder
  • Zipformer-based Encoder
  • LSTM-based Predictor
  • Stateless Predictor

Whisper

If you would like to contribute to icefall, please refer to the contributing guide for more details.

We would like to highlight the performance of some of the recipes here.

This is the simplest ASR recipe in icefall (the yesno recipe) and can be run on a CPU. Training takes less than 30 seconds and gives you the following WER:

[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]

We provide a Colab notebook for this recipe: Open In Colab

Please see RESULTS.md for the latest results.

test-clean test-other
WER 2.42 5.73

We provide a Colab notebook to test the pre-trained model: Open In Colab

test-clean test-other
WER 6.59 17.69

We provide a Colab notebook to test the pre-trained model: Open In Colab

test-clean test-other
greedy_search 3.07 7.51

We provide a Colab notebook to test the pre-trained model: Open In Colab

test-clean test-other
modified_beam_search (beam_size=4) 2.56 6.27

We provide a Colab notebook to test the pre-trained model: Open In Colab

WER (modified_beam_search beam_size=4 unless further stated)

  1. LibriSpeech-960hr

     Encoder          Params  test-clean  test-other  epochs  devices
     Zipformer        65.5M   2.21        4.79        50      4 32G-V100
     Zipformer-small  23.2M   2.42        5.73        50      2 32G-V100
     Zipformer-large  148.4M  2.06        4.63        50      4 32G-V100
     Zipformer-large  148.4M  2.00        4.38        174     8 80G-A100

  2. LibriSpeech-960hr + GigaSpeech

     Encoder          Params  test-clean  test-other
     Zipformer        65.5M   1.78        4.08

  3. LibriSpeech-960hr + GigaSpeech + CommonVoice

     Encoder          Params  test-clean  test-other
     Zipformer        65.5M   1.90        3.98

      Dev    Test
WER   10.47  10.58

Conformer Encoder + Stateless Predictor + k2 Pruned RNN-T Loss

                      Dev    Test
greedy_search         10.51  10.73
fast_beam_search      10.50  10.69
modified_beam_search  10.40  10.51

                      Dev    Test
greedy_search         10.31  10.50
fast_beam_search      10.26  10.48
modified_beam_search  10.25  10.38

      test
CER   10.16

We provide a Colab notebook to test the pre-trained model: Open In Colab

test
CER 4.38

We provide a Colab notebook to test the pre-trained model: Open In Colab

WER (modified_beam_search beam_size=4)

Encoder Params dev test epochs
Zipformer 73.4M 4.13 4.40 55
Zipformer-small 30.2M 4.40 4.67 55
Zipformer-large 157.3M 4.03 4.28 56

1 Trained with all subsets:

test
CER 29.08

We provide a Colab notebook to test the pre-trained model: Open In Colab

TEST
PER 19.71%

We provide a Colab notebook to test the pre-trained model: Open In Colab

TEST
PER 17.66%

We provide a Colab notebook to test the pre-trained model: Open In Colab

dev test
modified_beam_search (beam_size=4) 6.91 6.33

We provide a Colab notebook to test the pre-trained model: Open In Colab

dev test
modified_beam_search (beam_size=4) 6.77 6.14

We provide a Colab notebook to test the pre-trained model: Open In Colab

Dev Test
greedy_search 5.53 6.59
fast_beam_search 5.30 6.34
modified_beam_search 5.27 6.33

We provide a Colab notebook to test the pre-trained model: Open In Colab

Dev Test-Net Test-Meeting
greedy_search 7.80 8.75 13.49
fast_beam_search 7.94 8.74 13.80
modified_beam_search 7.76 8.71 13.41

We provide a Colab notebook to test the pre-trained model: Open In Colab

                      Dev    Test-Net  Test-Meeting
greedy_search         8.78   10.12     16.16
fast_beam_search      9.01   10.47     16.28
modified_beam_search  8.53   9.95      15.81

                      Eval   Test-Net
greedy_search         31.77  34.66
fast_beam_search      31.39  33.02
modified_beam_search  30.38  34.25

We provide a Colab notebook to test the pre-trained model: Open In Colab

The best results for Chinese CER (%) and English WER (%), respectively (zh: Chinese, en: English):

decoding-method dev dev_zh dev_en test test_zh test_en
greedy_search 7.30 6.48 19.19 7.39 6.66 19.13
fast_beam_search 7.18 6.39 18.90 7.27 6.55 18.77
modified_beam_search 7.15 6.35 18.95 7.22 6.50 18.70

We provide a Colab notebook to test the pre-trained model: Open In Colab

TTS: Text-to-Speech

Supported Datasets

Supported Models

Deployment with C++

Once you have trained a model in icefall, you may want to deploy it with C++ without Python dependencies.

Please refer to the documentation for how to do this.

We also provide a Colab notebook showing how to run a torch-scripted model in k2 with C++. Please see: Open In Colab
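For illustration only (this is not icefall's actual export script; the model and file names are placeholders), the underlying mechanism is plain TorchScript: the trained PyTorch model is scripted and saved, after which a C++ program linked against libtorch/k2 can load it with torch::jit::load().

import torch

def export_to_torchscript(model: torch.nn.Module, filename: str = "cpu_jit.pt") -> None:
    # Save `model` so a C++ program can load it with torch::jit::load(filename).
    model.eval()  # freeze dropout / batch-norm behavior for inference
    scripted = torch.jit.script(model)  # or torch.jit.trace(model, example_inputs)
    scripted.save(filename)

The recipes in icefall ship their own export scripts with recipe-specific options; the sketch above only shows the general idea.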

icefall's People

Contributors

csukuangfj, danpovey, desh2608, emreozkose, ezerhouni, glynpu, huangruizhe, jinzr, karelvesely84, kobenaxie, luomingshuang, marcoyang1998, pingfengluo, pkufool, pzelasko, rickychanhoyin, rouseabout, shanguanma, shcxlee, teapoly, teowenshen, videodanchik, wangtiance, waynewiser, wgb14, yaguanghu, yaozengwei, yfyeung, yuekaizhang, zhuangweiji

icefall's Issues

Official Dockerfile?

Should we create an official Dockerfile that shows how to prepare the env for Icefall experiments?

CUDA error of device

Hi, I ran the TDNN-LSTM-CTC training for librispeech with the command line ./tdnn_lstm_ctc/train.py --world-size 4, but I did not use export CUDA_VISIBLE_DEVICES="0,1,2,3" because our computing cluster does not allow setting CUDA_VISIBLE_DEVICES, so I got the following errors:

Traceback (most recent call last):
File "./tdnn_lstm_ctc/train.py", line 616, in
main()
File "./tdnn_lstm_ctc/train.py", line 610, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/data/nfs13/nfs/aisearch/asr/cdxie/icefall/egs/librispeech/ASR/tdnn_lstm_ctc/train.py", line 512, in run
setup_dist(rank, world_size, params.master_port)
File "/workspace/icefall/icefall/dist.py", line 30, in setup_dist
torch.cuda.set_device(rank)
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 261, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal

1) How can I solve this error?
2) Does icefall not support multi-GPU, multi-machine DDP training, and is that the cause of the above error? If so, when will it be added?
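Not an official answer, but a quick sanity check that follows from the traceback: torch.cuda.set_device(rank) raises "invalid device ordinal" whenever rank is not smaller than the number of GPUs actually visible to the process. A minimal check (world_size mirrors the --world-size flag; the assertion message is illustrative):

import torch

world_size = 4  # mirrors ./tdnn_lstm_ctc/train.py --world-size 4
visible = torch.cuda.device_count()
print(f"GPUs visible to this process: {visible}")
# torch.cuda.set_device(rank) fails as soon as rank >= visible,
# i.e. when the scheduler exposes fewer GPUs than --world-size requests.
assert world_size <= visible, (
    f"--world-size {world_size} requires {world_size} visible GPUs, "
    f"but only {visible} are visible"
)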

RuntimeError: Specified device cuda:0 does not match device of data cuda:-2

Hello,

I am training a TDNN-LSTM model with the librispeech recipe on 100 hours of 16 kHz data. After training, I run decode.py and sometimes observe a CUDA issue (given below). Have you ever observed something like this? I think it is related to something during training, because after some training runs decode.py works well, while after others it gives this error. I googled "RuntimeError: Specified device cuda:0 does not match device of data cuda:-2" but found nothing. I have a Tesla P100 with 16 GB. I should also mention that 1best decoding works fine; the problem occurs during nbest decoding and rescoring.

(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python tdnn_lstm_ctc/decode.py --avg 1 --epoch 9
2021-09-02 14:24:46,677 INFO [decode.py:324] Decoding started
2021-09-02 14:24:46,678 INFO [decode.py:325] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 1, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest-rescoring', 'num_paths': 10, 'epoch': 9, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-02 14:24:47,880 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-02 14:24:48,469 INFO [decode.py:334] device: cuda:0
2021-09-02 14:25:02,211 INFO [decode.py:362] Loading pre-compiled G_4_gram.pt
2021-09-02 14:25:02,846 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-9.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
  warnings.warn(
2021-09-02 14:25:07,886 INFO [decode.py:271] batch 0, cuts processed until now is 1/171 (0.584795%)
Traceback (most recent call last):
  File "tdnn_lstm_ctc/decode.py", line 432, in <module>
    main()
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "tdnn_lstm_ctc/decode.py", line 415, in main
    results_dict = decode_dataset(
  File "tdnn_lstm_ctc/decode.py", line 250, in decode_dataset
    hyps_dict = decode_one_batch(
  File "tdnn_lstm_ctc/decode.py", line 190, in decode_one_batch
    best_path_dict = rescore_with_n_best_list(
  File "/path/to/k2/icefall/icefall/decode.py", line 405, in rescore_with_n_best_list
    am_scores, _ = compute_am_and_lm_scores(
  File "/path/to/k2/icefall/icefall/decode.py", line 297, in compute_am_and_lm_scores
    path_lattice = _intersect_device(
  File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
    return k2.intersect_device(
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
    out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
    value = index_select(a_value, a_arc_map, default_value=filler) \
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 160, in index_select
    ans = _IndexSelectFunction.apply(src, index, default_value)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 66, in forward
    return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f41692162f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f416921367b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7f40c8316200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7f40c83fc0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7f40c8372bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7f40c837658f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7f40c838d876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7f40c830bfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7f41c016d41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #54: __libc_start_main + 0xe7 (0x7f41f24cbb97 in /lib/x86_64-linux-gnu/libc.so.6)

Multiple tokenization alternatives in lexicon from the sub-word tokenizer

@csukuangfj @danpovey I wonder if you considered using all possible BPE tokenizations as "pronunciation alternatives" in the lexicon. It'd look something like:

ALTERNATIVES 1.0 ALT _ER _NA _TI _VES
ALTERNATIVES 0.7 ALTER _NA _TI _VES
ALTERNATIVES 0.11 ALTER _NA _TI _V _E _S
...
ALTERNATIVES 0.001 A _L _T _E _R _N _A _T _I _V _E _S

I recall that in machine translation, different tokenizations are sometimes sampled on the fly for the training examples, so in expectation the model sees all possible tokenizations. I thought that with k2 the model could be optimized against all possible tokenizations at the same time, but I'm also concerned about the resulting graph sizes. WDYT?
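For reference, this kind of sampling / n-best enumeration of tokenizations is already exposed by sentencepiece; the snippet below is only an illustration of how the alternatives could be generated (it is not icefall code, and the model path is hypothetical):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")  # hypothetical model path

# Sample one tokenization per call, as in MT-style subword regularization.
sampled = sp.encode("ALTERNATIVES", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1)

# Or enumerate the 5 most likely tokenizations, e.g. to build a lexicon
# with weighted "pronunciation alternatives" as sketched above.
nbest = sp.nbest_encode_as_pieces("ALTERNATIVES", 5)
print(sampled)
print(nbest)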

Recipes for release

I'm thinking about which recipes we can show at the tutorial, and whether we need to add more. AFAIK we have the following ready (and can use roughly any combination of them):

Corpora

  • LibriSpeech
  • Aishell

Architectures

  • tdnn-lstm
  • conformer
  • contextnet

Topologies

  • CTC

Criterions

  • CTC
  • MMI with 2-gram denominators

AM targets

  • phones + blank
  • BPEs + blank

tdnn_lstm_ctc training error of librispeech

Hi, when I ran the tdnn_lstm_ctc training of librispeech, I got an error at epoch 5. Please take a look, thanks.
Error log:

2021-09-17 13:28:10,440 INFO [train.py:450] Epoch 5, batch 8620, batch avg loss 1.0633, total avg loss: 1.1221, batch size: 39
2021-09-17 13:28:22,302 INFO [train.py:450] Epoch 5, batch 8630, batch avg loss 1.1049, total avg loss: 1.1507, batch size: 41
2021-09-17 13:28:25,554 WARNING [cut.py:1694] To perform mix, energy must be non-zero and non-negative (got 0.0). MonoCut with id "845a0a69-f758-7b6a-90d8-ba99fa1795c4" will not be mixed in.
2021-09-17 13:28:36,682 INFO [train.py:450] Epoch 5, batch 8640, batch avg loss 1.1305, total avg loss: 1.1622, batch size: 40
2021-09-17 13:28:49,311 INFO [train.py:450] Epoch 5, batch 8650, batch avg loss 1.1228, total avg loss: 1.1774, batch size: 37
[F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed 
int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void block:[0,0,0], thread: [32,0,0] block:[0,0,0], thread: [33,0,0] block:[0,0,0], thread: [34,0,0] block:[0,0,0], thread: [35,0,0] block:[0,0,0], thread: [36,0,0] block:[0,0,0], thread: [37,0,0] block:[0,0,0], thread: [38,0,0] block:[0,0,0], thread: [39,0,0] block:[0,0,0], thread: [0,0,0] block:[0,0,0], thread: [1,0,0] block:[0,0,0], thread: [2,0,0] block:[0,0,0], thread: [3,0,0] block:[0,0,0], thread: [4,0,0] block:[0,0,0], thread: [5,0,0] block:[0,0,0], thread: [6,0,0] block:[0,0,0], thread: [7,0,0] block:[0,0,0], thread: [8,0,0] block:[0,0,0], thread: [9,0,0] block:[0,0,0], thread: [10,0,0] block:[0,0,0], thread: [11,0,0] block:[0,0,0], thread: [12,0,0] block:[0,0,0], thread: [13,0,0] block:[0,0,0], thread: [14,0,0] block:[0,0,0], thread: [15,0,0] block:[0,0,0], thread: [16,0,0] block:[0,0,0], thread: [17,0,0] block:[0,0,0], thread: [18,0,0] block:[0,0,0], thread: [19,0,0] block:[0,0,0], thread: [20,0,0] block:[0,0,0], thread: [21,0,0] block:[0,0,0], thread: [22,0,0] block:[0,0,0], thread: [23,0,0] block:[0,0,0], thread: [24,0,0] block:[0,0,0], thread: [25,0,0] block:[0,0,0], thread: [26,0,0] block:[0,0,0], thread: [27,0,0] block:[0,0,0], thread: [28,0,0] block:[0,0,0], thread: [29,0,0] block:[0,0,0], thread: [30,0,0] block:[0,0,0], thread: [31,0,0] Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || 
fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_s/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [32,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [33,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [34,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [35,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [36,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [37,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [38,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [39,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [0,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [1,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [2,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [3,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [4,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [5,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [6,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [7,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [8,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [9,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [10,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [11,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [12,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [13,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [14,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [15,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [16,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [17,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [18,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [19,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [20,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [21,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [22,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [23,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [24,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [25,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [26,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [27,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [28,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [29,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [30,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [31,0,0] Assertion Some bad things happened failed.
tart || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0 nannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannan vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs nannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannan
[F] /workspace/k2/k2/csrc/array.h:341:T k2::Array1::operator const [with T = int; int32_t = int] Check failed: ret == cudaSuccess (710 vs. 0) Error: device-side assert triggered.
[ Stack-Trace: ]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2_log.so(k2::internal::GetStackTrace()+0x47) [0x7f8c456419f7]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2context.so(k2::Array1::operator const+0xeb9) [0x7f8c4593c8e9]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2context.so(k2::Renumbering::ComputeOld2New()+0x14e) [0x7f8c459377ee]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2context.so(k2::Renumbering::ComputeNew2Old()+0x7f8) [0x7f8c45938f68]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2context.so(k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1, k2::Array1)+0x7ec) [0x7f8c45a9e47c]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2context.so(k2::IntersectDense(k2::Raggedk2::Arc&, k2::DenseFsaVec&, k2::Array1 const*, float, k2::Raggedk2::Arc, k2::Array1, k2::Array1)+0x420) [0x7f8c45a8e900]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x65390) [0x7f8c4bad4390]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x1ee9e) [0x7f8c4ba8de9e]
python3(PyCFunction_Call+0x58) [0x55e5b8fb72d8]
python3(_PyObject_MakeTpCall+0x23c) [0x55e5b8fa6edc]
python3(_PyEval_EvalFrameDefault+0x11dd) [0x55e5b902f4ad]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(PyObject_CallObject+0x52) [0x55e5b9002982]
/opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object
, _object*)+0x8fd) [0x7f8d427dc39d]
python3(PyCFunction_Call+0xe0) [0x55e5b8fb7360]
python3(_PyObject_MakeTpCall+0x23c) [0x55e5b8fa6edc]
python3(_PyEval_EvalFrameDefault+0x45a9) [0x55e5b9032879]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x10425f) [0x55e5b8f6725f]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x19aac9) [0x55e5b8ffdac9]
python3(PyObject_Call+0x414) [0x55e5b8fa7874]
python3(_PyEval_EvalFrameDefault+0x2088) [0x55e5b9030358]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyObject_Call_Prepend+0x181) [0x55e5b8ffe051]
python3(+0x19b3fa) [0x55e5b8ffe3fa]
python3(_PyObject_MakeTpCall+0x23c) [0x55e5b8fa6edc]
python3(_PyEval_EvalFrameDefault+0x475) [0x55e5b902e745]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x103562) [0x55e5b8f66562]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x103562) [0x55e5b8f66562]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x103562) [0x55e5b8f66562]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x103562) [0x55e5b8f66562]
python3(_PyFunction_Vectorcall+0x10b) [0x55e5b8ffd4bb]
python3(+0x10425f) [0x55e5b8f6725f]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(PyEval_EvalCode+0x23) [0x55e5b90914e3]
python3(+0x22e584) [0x55e5b9091584]
python3(+0x2547c4) [0x55e5b90b77c4]
python3(+0x115620) [0x55e5b8f78620]

Traceback (most recent call last):
File "./tdnn_lstm_ctc/train.py", line 616, in
main()
File "./tdnn_lstm_ctc/train.py", line 612, in main
run(rank=0, world_size=1, args=args)
File "./tdnn_lstm_ctc/train.py", line 575, in run
train_one_epoch(
File "./tdnn_lstm_ctc/train.py", line 424, in train_one_epoch
loss = compute_loss(
File "./tdnn_lstm_ctc/train.py", line 317, in compute_loss
loss = k2.ctc_loss(
File "/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/k2/ctc_loss.py", line 136, in ctc_loss
return m(decoding_graph, dense_fsa_vec, target_lengths)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/k2/ctc_loss.py", line 80, in forward
lattice = intersect_dense(decoding_graph, dense_fsa_vec,
File "/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/k2/autograd.py", line 810, in intersect_dense
_IntersectDenseFunction.apply(a_fsas, b_fsas, out_fsa, output_beam,
File "/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/k2/autograd.py", line 550, in forward
ragged_arc, arc_map_a, arc_map_b = _k2.intersect_dense(
RuntimeError: Some bad things happed.

Aborted error when attempting to use ctc-decoding

I receive an error trying to decode using ctc-decoding. I am using the master branch without code changes but using a model trained for several epochs.

command

gdb --args python ./conformer_ctc/decode.py --avg 1 --epoch 5 --method ctc-decoding --exp-dir exp_dir/ --lang-dir data/lang_bpe_5000 --max-duration 300

result

[...]

2021-11-02 10:06:55,671 INFO [decode.py:476] batch 0/?, cuts processed until now is 6
[F] /usr/share/miniconda/envs/k2/conda-bld/k2_1631619831677/work/k2/csrc/array.h:501:void k2::Array1<T>::Init(k2::ContextPtr, int32_t, k2::Dtype) [with T = char; k2::ContextPtr = std::shared_ptr<k2::Context>; int32_t = int] Check failed: size >= 0 (-1383015021 vs. 0) Array size MUST be greater than or equal to 0, given :-1383015021


[ Stack-Trace: ]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x4c) [0x7fff5b01021c]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::internal::Logger::~Logger()+0x5a) [0x7fff5b33535a]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::Array1<char>::Init(std::shared_ptr<k2::Context>, int, k2::Dtype)+0x1cd) [0x7fff5b3630dd]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::Renumbering::Init(std::shared_ptr<k2::Context>, int, bool)+0xa7) [0x7fff5b3697a7]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::Renumbering::Renumbering(std::shared_ptr<k2::Context>, int, bool)+0xbc) [0x7fff5b36ac5c]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int)+0x37c) [0x7fff5b4c945c]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect()::{lambda()#1}>::_M_invoke(std::_Any_data const&)+0x203) [0x7fff5b4cc763]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::ThreadPool::ProcessTasks()+0x164) [0x7fff5b610b94]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xc8421) [0x7fffd25b5421]
/lib64/libpthread.so.0(+0x7ea5) [0x7ffff7bc6ea5]
/lib64/libc.so.6(clone+0x6d) [0x7ffff78ef9fd]

terminate called after throwing an instance of 'std::runtime_error'
  what():
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new


Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff3f7ee700 (LWP 24121)]
0x00007ffff7827387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-324.el7_9.x86_64

As part of debugging (this is sort of a separate issue), I tried using the pretrained model with the decode script to make sure the problem is not model-related, but I received the error

_pickle.UnpicklingError: invalid load key, 'v'

when running the line:

checkpoint = torch.load(filename, map_location="cpu")
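One common cause of "invalid load key, 'v'" (not necessarily the cause here) is that the downloaded file is a Git LFS pointer rather than the real checkpoint, e.g. when a pre-trained model was cloned without git-lfs installed. A quick check, with a hypothetical path:

# Hypothetical path; an LFS pointer is a tiny text file starting with
# "version https://git-lfs...", which makes torch.load() fail this way.
filename = "exp_dir/pretrained.pt"
with open(filename, "rb") as f:
    head = f.read(64)
if head.startswith(b"version https://git-lfs"):
    print("This is a Git LFS pointer, not the checkpoint; install git-lfs and run `git lfs pull`.")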

Any plan to implement a dataloader for CE models, like Kaldi?

I tried to build a VAD model based on a two-class CE loss, but found that data preparation was complicated, and conventional data preparation would consume a lot of storage space. Are there any plans to provide a dataloader for CE training?

CUDA out of memory in decoding

Hi, I am new to icefall. I finished the training of tdnn_lstm_ctc, but when I run the decoding step I get the following error. I changed --max-duration, but the errors persist:

2021-10-04 00:42:07,942 INFO [decode.py:383] Decoding started
2021-10-04 00:42:07,942 INFO [decode.py:384] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 19, 'avg': 5, 'method': 'whole-lattice-rescoring', 'num_paths': 100, 'lattice_score_scale': 0.5, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 50, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-04 00:42:08,361 INFO [lexicon.py:113] Loading pre-compiled data/lang_phone/Linv.pt
2021-10-04 00:42:08,614 INFO [decode.py:393] device: cuda:0
2021-10-04 00:42:23,560 INFO [decode.py:406] Loading G_4_gram.fst.txt
2021-10-04 00:42:23,560 WARNING [decode.py:407] It may take 8 minutes.
Traceback (most recent call last):
File "./tdnn_lstm_ctc/decode.py", line 492, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "./tdnn_lstm_ctc/decode.py", line 420, in main
G = k2.arc_sort(G)
File "/opt/conda/lib/python3.8/site-packages/k2-1.8.dev20210918+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/k2/fsa_algo.py", line 441, in arc_sort
ragged_arc, arc_map = _k2.arc_sort(fsa.arcs, need_arc_map=need_arc_map)
RuntimeError: CUDA out of memory. Tried to allocate 884.00 MiB (GPU 0; 15.78 GiB total capacity; 14.28 GiB already allocated; 461.19 MiB free; 14.29 GiB reserved in total by PyTorch)

the device used:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02 Driver Version: 440.118.02 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 27C P0 25W / 250W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:D8:00.0 Off | 0 |
| N/A 28C P0 25W / 250W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

Could you give me some advice? Thanks.

pretrained.py shows worse CER than decode.py?

I have trained my own model and tested it on my datasets.
In the first step, I decode with many parameter combinations, just as in fine-tuning (using decode.py), and save the best parameters. I use --max-duration=20. I then save the decoding results obtained with the best parameters on all datasets (not just one).

Then I use these best parameters to decode the waves (using pretrained.py), one by one on these datasets.

All my datasets show a slightly worse CER with pretrained.py. CER comparison below.
decode.py: 3.190 12.802 17.995 9.569 14.478 10.299 16.242 7.329 20.695
pretrained.py: 3.203 13.029 18.177 9.662 14.610 10.447 16.463 7.333 20.911

Is this normal? I see that the feature extraction is not the same; could that be the reason?

Preparing features takes too long

Hi,

Preparing filter banks for large datasets (~6000 h) takes too long. Even preparing the manifests takes approximately 10 hours. Do you have any suggestions to make this process faster?

I am trying to modify the Dataset part to compute features on the fly (if a feature has not been computed yet, compute and write it; otherwise read it), so that I can observe early batches sooner.
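For reference, lhotse already supports computing features on the fly inside the dataset, so the offline fbank stage can be skipped. A minimal sketch using standard lhotse APIs (not icefall's exact recipe code; the manifest path and max_duration are placeholders):

from torch.utils.data import DataLoader
from lhotse import Fbank, load_manifest
from lhotse.dataset import K2SpeechRecognitionDataset, OnTheFlyFeatures, SingleCutSampler

cuts = load_manifest("data/manifests/cuts_train.jsonl.gz")  # hypothetical path
# Compute fbank features per batch instead of reading precomputed ones.
dataset = K2SpeechRecognitionDataset(input_strategy=OnTheFlyFeatures(Fbank()))
sampler = SingleCutSampler(cuts, max_duration=200.0)
# lhotse samplers yield whole mini-batches of cuts, hence batch_size=None.
dataloader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)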

Problem with long wavs that have multiple supervisions

When I train with data that has multiple supervisions, e.g. one wav with two supervisions [(0, 5, "I'm good"), (10, 15, "ok")], an error occurs.

I have debugged it and found the reason: in the dataloader, lhotse.load_manifest returns one MonoCut with two supervisions, but only one feature matrix is generated.

I printed some variables in compute_loss() in conformer_ctc/train.py. The first dimension of feature is 1, but the length of token_ids = graph_compiler.texts_to_ids(texts) is 2, so decode_forward() fails.
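One possible workaround (a sketch using standard lhotse APIs with hypothetical paths, not a confirmed fix from the maintainers) is to split each long cut into per-supervision cuts, so that every cut carries exactly one transcript and one feature matrix:

from lhotse import load_manifest

cuts = load_manifest("data/manifests/cuts_train.jsonl.gz")
# One supervision per resulting cut: [(0, 5, "I'm good"), (10, 15, "ok")]
# becomes two cuts of 5 s each, each with a single transcript.
cuts = cuts.trim_to_supervisions()
cuts.to_file("data/manifests/cuts_train_trimmed.jsonl.gz")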

Error in train conformer

Following the instructions for training the conformer ctc model, I get the following error:

transformer.py", line 706, in forward x = x * self.xscale + self.pe[:, : x.size(1), :] IndexError: too many indices for tensor of dimension 2

I am currently running the simplest command:
./conformer_ctc/train.py --world-size 1

Any idea of what the problem could be?

TypeError: object of type 'SingleCutSampler' has no len()

Hi, when I run the decoding step of tdnn-lstm-ctc, I meet the following error:
-> def decode_dataset(
(Pdb) n

icefall/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py(300)decode_dataset()
-> results = []
(Pdb)
icefall/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py(302)decode_dataset()
-> num_cuts = 0
(Pdb)
icefall/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py(304)decode_dataset()
-> try:
(Pdb)
icefall/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py(305)decode_dataset()
-> num_batches = len(dl)
(Pdb)
TypeError: object of type 'SingleCutSampler' has no len()

Is this a problem with the version I used?

k2 version: 1.8

logging not working for ddp training

It seems that the logging module is not working when training with DDP for some particular PyTorch versions (I tried three versions; it works for torch 1.7.2, but not for 1.8.0 or 1.8.1).

I searched for a while and did not figure out how to fix it.

FYI.
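One workaround that is sometimes suggested for this class of problem (not a confirmed fix for this particular issue): if another library has already attached a handler to the root logger, a later plain logging.basicConfig() silently does nothing, so re-configuring logging inside each DDP worker with force=True (Python 3.8+) replaces those handlers:

import logging

def setup_worker_logging(rank: int) -> None:
    # force=True removes handlers installed earlier (e.g. by other libraries)
    # before installing this configuration, so log messages show up again.
    logging.basicConfig(
        level=logging.INFO,
        format=f"%(asctime)s rank{rank} %(levelname)s %(message)s",
        force=True,
    )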

lhotse download librispeech command gives an error

I installed k2-fsa and lhotse via the commands below:

1. install k2-fsa
$ conda create -n k2-fsa20210823 python=3.8
$ conda activate k2-fsa20210823
$ conda install -c k2-fsa -c pytorch -c conda-forge k2 python=3.8 cudatoolkit=11.1 pytorch=1.8.1

2. install lhotse
$ pip install git+https://github.com/lhotse-speech/lhotse
3. install icefall
$ git clone https://github.com/k2-fsa/icefall.git
$ cd icefall
$ pip install -r requirements.txt
$ export PYTHONPATH=/home/maduo/w2021/k2-fsa_20210823/icefall:$PYTHONPATH
                                                                               

When I run ./prepare.sh --stage 0 --stop_stage 0, the error is as follows:

(k2-fsa20210823) maduo@pd:~/w2021/k2-fsa_20210823/icefall/egs/librispeech/ASR$ ./prepare.sh --stage 0 --stop_stage 0
2021-08-23 14:32:41 (prepare.sh:57:main) dl_dir: /mnt/4T/md/icefall_recipes/librispeech/download
2021-08-23 14:32:41 (prepare.sh:66:main) stage 0: Download data
./prepare.sh: /home/maduo/miniconda3/envs/k2-fsa20210823/bin/lhotse: python: bad interpreter: No such file or directory

I checked https://github.com/lhotse-speech/lhotse/blob/2a1410bfd08bc5117d67d09f470fde14b8231521/lhotse/bin/lhotse#L1
and the Python interpreter line there looks fine. I don't know where it is going wrong.

AttributeError: Can't get attribute 'RaggedInt' on <module '_k2'

Hi, I finished the training of TDNN-LSTM-CTC. When I run the decoding step, I get an error:

2021-09-28 14:56:23,027 INFO [decode.py:349] Decoding started
2021-09-28 14:56:23,048 INFO [decode.py:350] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'whole-lattice-rescoring', 'num_paths': 30, 'epoch': 19, 'avg': 5, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-09-28 14:56:23,854 INFO [lexicon.py:113] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-28 14:56:25,177 INFO [decode.py:359] device: cuda:0
Traceback (most recent call last):
File "./tdnn_lstm_ctc/decode.py", line 457, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "./tdnn_lstm_ctc/decode.py", line 362, in main
torch.load(f"{params.lang_dir}/HLG.pt", map_location="cpu")
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
AttributeError: Can't get attribute 'RaggedInt' on <module '_k2' from '/opt/conda/lib/python3.8/site-packages/k2-1.8.dev20210918+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so'>

PS:

python3 -m k2.version
Collecting environment information...

k2 version: 1.8
Build type: Release
Git SHA1: 8030001c9a002aa17e090a41de3f1146bdfe1e78
Git date: Fri Sep 17 05:42:56 2021
Cuda used to build k2: 11.0
cuDNN used to build k2: 8.0.4
Python version used to build k2: 3.8
OS used to build k2:
CMake version: 3.18.0
GCC version: 7.5.0
CMAKE_CUDA_FLAGS: --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.7.1
PyTorch is using Cuda: 11.0
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False

I need some help, thanks.

Exception when building HLG with large language model

It seems that an overflow occurs when building HLG with a large language model.
The ARPA language model info:

\data\
ngram 1=136920
ngram 2=30620323
ngram 3=58191282
ngram 4=63623739

After successfully running determinize(compose(L, G)), an exception occurs in remove_epsilon on LG.
From the exception traceback, it looks like something overflows when casting impl_->byte_offset into int64_t?

2021-11-25 16:23:09,208 INFO [compile_hlg.py:173] Processing data/lang_phone
2021-11-25 16:23:09,479 INFO [lexicon.py:177] Loading pre-compiled data/lang_phone/Linv.pt
2021-11-25 16:23:09,900 INFO [compile_hlg.py:79] Building ctc_topo. max_token_id: 218
2021-11-25 16:23:09,911 INFO [compile_hlg.py:86] Loading L_disambig.fst.txt
2021-11-25 16:23:12,274 INFO [compile_hlg.py:91] Loading G.fst.txt
2021-11-25 16:34:12,174 INFO [compile_hlg.py:110] Intersecting L and G
2021-11-25 17:04:55,417 INFO [compile_hlg.py:112] LG shape: (552371407, None)
2021-11-25 17:04:55,417 INFO [compile_hlg.py:114] Connecting LG
2021-11-25 17:04:55,417 INFO [compile_hlg.py:116] LG shape after k2.connect: (552371407, None)
2021-11-25 17:04:55,418 INFO [compile_hlg.py:118] <class 'torch.Tensor'>
2021-11-25 17:04:55,418 INFO [compile_hlg.py:119] Determinizing LG
2021-11-25 17:51:51,788 INFO [compile_hlg.py:122] <class '_k2.ragged.RaggedTensor'>
2021-11-25 17:51:51,788 INFO [compile_hlg.py:124] Connecting LG after k2.determinize
2021-11-25 17:51:51,788 INFO [compile_hlg.py:127] Removing disambiguation symbols on LG
[F]/k2/k2/k2/csrc/tensor.cu:159:k2::Tensor::Tensor(k2::Dtype, const k2::Shape&, k2::RegionPtr, int32_t) Check failed: int64_t(impl_->byte_offset) + begin_elem * element_size >= 0 (-1563030948 vs. 0) impl_->byte_offset: -1563030948, begin_elem: 0, element_size: 4


[ Stack-Trace: ]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2_log.so(k2::internal::GetStackTrace()+0x4f) [0x7f57d74b67af]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::Tensor::Tensor(k2::Dtype, k2::Shape const&, std::share
d_ptr<k2::Region>, int)+0x91a) [0x7f57d7a2f50a]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::Array2<int>::Col(int)+0x13a) [0x7f57d79d9dea]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(+0x29f869) [0x7f57d79cd869]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::Index(k2::RaggedShape&, int, k2::Array1<int> const&, k
2::Array1<int>*)+0x1da) [0x7f57d79cfc2a]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(+0x2e6655) [0x7f57d7a14655]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::ComputeEpsilonClosureOneIter(k2::Ragged<k2::Arc>&, k2:
:Ragged<k2::Arc>*, k2::Ragged<int>*)+0xdc8) [0x7f57d7a19a38]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::ComputeEpsilonClosure(k2::Ragged<k2::Arc>&, k2::Ragged
<k2::Arc>*, k2::Ragged<int>*)+0x109) [0x7f57d7a1aac9]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::RemoveEpsilonDevice(k2::Ragged<k2::Arc>&, k2::Ragged<k
2::Arc>*, k2::Ragged<int>*)+0x269) [0x7f57d7a1c3f9]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::RemoveEpsilonDevice(k2::Ragged<k2::Arc>&, k2::Ragged<k
2::Arc>*, k2::Ragged<int>*)+0x1949) [0x7f57d7a1dad9]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x76141) [0x7f57d8bb1141]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x25273) [0x7f57d8b60273]
python3(PyCFunction_Call+0x54) [0x5585214a2c64]
python3(_PyObject_MakeTpCall+0x31e) [0x5585214ac94e]
python3(_PyEval_EvalFrameDefault+0x540f) [0x55852153cf0f]
python3(_PyFunction_Vectorcall+0x1a6) [0x55852151cc46]
python3(_PyEval_EvalFrameDefault+0x4dd3) [0x55852153c8d3]
python3(_PyEval_EvalCodeWithName+0x2c3) [0x55852151ba33]
python3(_PyFunction_Vectorcall+0x378) [0x55852151ce18]
python3(_PyEval_EvalFrameDefault+0x947) [0x558521538447]
python3(_PyFunction_Vectorcall+0x1a6) [0x55852151cc46]
python3(_PyEval_EvalFrameDefault+0x947) [0x558521538447]
python3(_PyEval_EvalCodeWithName+0x2c3) [0x55852151ba33]
python3(PyEval_EvalCodeEx+0x39) [0x55852151ca99]
python3(PyEval_EvalCode+0x1b) [0x5585215c51db]
python3(+0x24f273) [0x5585215c5273]
python3(+0x26cea3) [0x5585215e2ea3]
python3(+0x272582) [0x5585215e8582]
python3(PyRun_SimpleFileExFlags+0x1b2) [0x5585215e8762]
python3(Py_RunMain+0x36d) [0x5585215e8cdd]
python3(Py_BytesMain+0x39) [0x5585215e8e99]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f58b1835840]
python3(+0x1dcd1d) [0x558521552d1d]

Traceback (most recent call last):
  File "./local/compile_hlg.py", line 187, in <module>
    main()
  File "./local/compile_hlg.py", line 175, in main
    HLG = compile_HLG(lang_dir, lm_dir, args.oov)
  File "./local/compile_hlg.py", line 137, in compile_HLG
    LG = k2.remove_epsilon(LG)
  File "/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/k2/fsa_algo.py", line 554, in remove_epsilon
    ragged_arc, arc_map = _k2.remove_epsilon(fsa.arcs, fsa.properties)
RuntimeError:
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new

The code for compiling HLG:

def compile_HLG(lang_dir: str,
                lm_dir: str,
                oov: str = "<UNK>") -> k2.Fsa:
    lexicon = Lexicon(lang_dir)
    max_token_id = max(lexicon.tokens)
    logging.info(f"Building ctc_topo. max_token_id: {max_token_id}")
    H = k2.ctc_topo(max_token_id, modified=False)
    
    if Path(lang_dir / "L_disambig.pt").is_file():
        logging.info("Loading L_disambig")
        L = k2.Fsa.from_dict(torch.load(f"{lang_dir}/L_disambig.pt"))
    else:
        logging.info("Loading L_disambig.fst.txt")
        with open(lang_dir / "L_disambig.fst.txt") as f:
            L = k2.Fsa.from_openfst(f.read(), acceptor=False)
            torch.save(L.as_dict(), f"{lang_dir}/L_disambig.pt")

    logging.info("Loading G.fst.txt")
    with open(lm_dir / "G_4gram.kndiscount.fst") as f:
        G = k2.Fsa.from_openfst(f.read(), acceptor=False)

    first_token_disambig_id = lexicon.token_table["#0"]
    first_word_disambig_id = lexicon.word_table["#0"]

    # remove oov word symbol.
    #if isinstance(G.aux_labels, k2.RaggedTensor):
    #    G.aux_labels.values[G.aux_labels.values == lexicon.word_table[oov]] = 0
    #else:
    #    G.aux_labels[G.aux_labels == lexicon.word_table[oov]] = 0
    #G.__dict__["_properties"] = None

    L = k2.arc_sort(L)
    G = k2.arc_sort(G)

    #G = k2.determinize(G)

    logging.info("Intersecting L and G")
    LG = k2.compose(L, G)
    logging.info(f"LG shape: {LG.shape}")

    logging.info("Connecting LG")
    LG = k2.connect(LG)
    logging.info(f"LG shape after k2.connect: {LG.shape}")

    logging.info(type(LG.aux_labels))
    logging.info("Determinizing LG")

    LG = k2.determinize(LG)
    logging.info(type(LG.aux_labels))

    logging.info("Connecting LG after k2.determinize")
    LG = k2.connect(LG)

    logging.info("Removing disambiguation symbols on LG")

    LG.labels[LG.labels >= first_token_disambig_id] = 0
    # See https://github.com/k2-fsa/k2/issues/874
    # for why we need to set LG.properties to None
    LG.__dict__["_properties"] = None

    assert isinstance(LG.aux_labels, k2.RaggedTensor)
    LG.aux_labels.values[LG.aux_labels.values >= first_word_disambig_id] = 0

    LG = k2.remove_epsilon(LG)
    logging.info(f"LG shape after k2.remove_epsilon: {LG.shape}")

    LG = k2.connect(LG)
    LG.aux_labels = LG.aux_labels.remove_values_eq(0)

    logging.info("Arc sorting LG")
    LG = k2.arc_sort(LG)

    logging.info("Composing H and LG")
    # CAUTION: The name of the inner_labels is fixed
    # to `tokens`. If you want to change it, please
    # also change other places in icefall that are using
    # it.

    HLG = k2.compose(H, LG, inner_labels="tokens")

    logging.info("Connecting HLG")
    HLG = k2.connect(HLG)

    logging.info("Arc sorting HLG")
    HLG = k2.arc_sort(HLG)
    logging.info(f"HLG.shape: {HLG.shape}")

    return HLG

Design thoughts

Guys (mostly Piotr but also anyone who's listening),

Firstly, on the timeline: we need something working well in time for September 1st-ish, when we give the tutorial.
So we can't be too ambitious: think a cleaned-up and reorganized version of Snowfall, working by early-to-mid August. Sorry I have delayed this for so long. Liyong is working on replicating ESPNet results with k2 mechanisms; he is making good progress, and we may want to incorporate parts of that.

I want to avoid big centralized APIs at the moment.

I also want to avoid the phenomenon in SpeechBrain and ESPNet where there is a kind of "configuration layer" where
you pass in configs, and these get parsed into actual python code by some other code. I would rather keep it all
Python code. Suppose we have a directory (this doesn't have to be the real name):
egs/librispeech/ASR/
then I am thinking we can have subdirectories of that where the scripts for different versions of experiments live.
We might have some data-prep scripts:
egs/librispeech/ASR/{prepare.sh,local/blahblah,...}
and these would write to some subdirectory, e.g. egs/librispeech/ASR/data/...
Then for different experiments we'd have the scripts in subdirectories, like:
egs/librispeech/ASR/tdnn_lstm_ctc/{model.py,train.py,decode.py,README.md}
and we might have
egs/librispeech/ASR/conformer_mmi/{model.py,train.py,decode.py,README.md}
that would refer to the alignment model in e.g. ../tdnn_lstm_ctc/8.pt, and to the data in ../data/blah...

The basic idea here is that if you want to change the experiment locally, you would copy-and-modify the scripts in conformer_mmi to e.g. conformer_mmi_1a/, and add them to your git repo if wanted. We would avoid overloading the scripts in these experiment directories with command-line options. Any back-compatibility would be at the level of the icefall Python libraries themselves. We could perhaps introduce versions of the data directories as well, e.g. data/, data2/ and so on (not sure whether it would make sense to have multiple versions of the data-prep scripts or use options).

In order to avoid overloading the model code, and utility code, with excessive back-compatibility, I suggest that we have versions of the model code and maybe even parts of the other libraries: e.g. snowfall/models1/. Then we can add options etc., but when it becomes oppressive we can just copy-and-modify to models2/ and strip out most of the options. This will tend to reduce "cyclomatic complexity" by keeping any given version of the code simple. At this point, let's think of this to some extent as a demo tool for k2 and lhotse, we don't have to think of it as some vast toolkit with zillions of features.

Error in Yes No recipe probably due to installation

Hi, I installed k2 from source and lhotse via pip. To check whether my k2 and lhotse installation is OK, I am trying to run the yesno recipe. I did not change anything in the scripts; however, while running the yesno recipe, I get an error (RuntimeError: invalid device function). I get the same error in the librispeech recipe and in a recipe that I wrote. It seems to be due to the installation, probably the nvcc version; can anybody help me with this error? My log with environment information is as follows:

# Running on r7n04
# Started at Sun Oct 31 20:00:38 EDT 2021
# /home/hltcoe/aarora/miniconda3/envs/k2_scratch2/bin/python3 ./tdnn/train.py
2021-10-31 20:00:40,299 INFO [train.py:481] Training started
2021-10-31 20:00:40,299 INFO [train.py:482] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lr': 0.01, 'feature_dim': 23, 'weight_decay': 1e-06, 'start_epoch': 0, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 20, 'valid_interval': 10, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30.0, 'bucketing_sampler': False, 'num_buckets': 10, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'env_info': {'k2-version': '1.9', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7178d67e594bc7fa89c2b331ad7bd1c62a6a9eb4', 'k2-git-date': 'Tue Oct 26 10:12:54 2021', 'lhotse-version': '0.11.0.dev+git.7f56dd1.clean', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'coe_asr2', 'icefall-git-sha1': 'e06baf3-clean', 'icefall-git-date': 'Sun Oct 31 19:53:21 2021', 'icefall-path': '/exp/aarora/icefall_work_env/icefall', 'k2-path': '/exp/aarora/icefall_work_env/k2_me/k2/python/k2/__init__.py', 'lhotse-path': '/exp/aarora/icefall_work_env/lhotse/lhotse/__init__.py'}}
2021-10-31 20:00:40,326 INFO [lexicon.py:176] Loading pre-compiled data/lang_phone/Linv.pt
2021-10-31 20:00:43,731 INFO [asr_datamodule.py:145] About to get train cuts
2021-10-31 20:00:43,732 INFO [asr_datamodule.py:242] About to get train cuts
2021-10-31 20:00:43,758 INFO [asr_datamodule.py:148] About to create train dataset
2021-10-31 20:00:43,758 INFO [asr_datamodule.py:199] Using SingleCutSampler.
2021-10-31 20:00:43,760 INFO [asr_datamodule.py:205] About to create train dataloader
2021-10-31 20:00:43,760 INFO [asr_datamodule.py:218] About to get test cuts
2021-10-31 20:00:43,761 INFO [asr_datamodule.py:248] About to get test cuts
Traceback (most recent call last):
  File "./tdnn/train.py", line 573, in <module>
    main()
  File "./tdnn/train.py", line 569, in main
    run(rank=0, world_size=1, args=args)
  File "./tdnn/train.py", line 534, in run
    train_one_epoch(
  File "./tdnn/train.py", line 404, in train_one_epoch
    loss, loss_info = compute_loss(
  File "./tdnn/train.py", line 300, in compute_loss
    decoding_graph = graph_compiler.compile(texts)
  File "/exp/aarora/icefall_work_env/icefall/icefall/graph_compiler.py", line 74, in compile
    transcript_fsa = self.convert_transcript_to_fsa(texts)
  File "/exp/aarora/icefall_work_env/icefall/icefall/graph_compiler.py", line 116, in convert_transcript_to_fsa
    word_fsa = k2.linear_fsa(word_ids_list, self.device)
  File "/exp/aarora/icefall_work_env/k2_me/k2/python/k2/fsa_algo.py", line 66, in linear_fsa
    ragged_arc = _k2.linear_fsa(labels, device)
**RuntimeError: invalid device function**
# Accounting: time=7 threads=1
# Finished at Sun Oct 31 20:00:45 EDT 2021 with status 1
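
"invalid device function" is typically a GPU-architecture mismatch, i.e. a CUDA extension such as k2 was built without the compute capability of the card it runs on, or against a different CUDA/PyTorch combination than the one installed. One way to check for this kind of mismatch is a small diagnostic along these lines (a sketch, assuming a CUDA-enabled PyTorch install):

# Diagnostic sketch: compare the installed versions and GPU architectures.
import torch
import k2

print("torch version      :", torch.__version__)
print("torch built w/ CUDA:", torch.version.cuda)
print("k2 version         :", k2.__version__)
print("GPU capability     :", torch.cuda.get_device_capability(0))  # e.g. (7, 0) for V100
print("torch arch list    :", torch.cuda.get_arch_list())           # sm_* targets baked into torch

If the GPU's compute capability is not covered by the architectures that k2 was compiled for, rebuilding k2 from source on the target machine usually resolves this error.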

using TPUs in Colab ??

I'm trying to train on the Zeroth Korean dataset with tdnn_lstm_ctc and conformer_ctc. Unfortunately, I think I will suffer from GPU memory shortage for a while until I update the GPUs in my data center.

Do you think I can try these icefall recipes (training) on Google Colab with a multi-TPU environment? It looks like it would need slight code modification.

By the way, I really appreciate this icefall open-source work. It is much better to use than Snowfall was. Thank you all!

Duplicated token seqs are used for rescoring

While implementing rescoring with a conformer LM, I found that there are duplicated token sequences.

The reason is that the following code

icefall/icefall/decode.py

Lines 222 to 237 in 810b193

if isinstance(lattice.aux_labels, torch.Tensor):
    word_seq = k2.ragged.index(lattice.aux_labels, path)
else:
    word_seq = lattice.aux_labels.index(path)
    word_seq = word_seq.remove_axis(word_seq.num_axes - 2)

# Each utterance has `num_paths` paths but some of them transduces
# to the same word sequence, so we need to remove repeated word
# sequences within an utterance. After removing repeats, each utterance
# contains different number of paths
#
# `new2old` is a 1-D torch.Tensor mapping from the output path index
# to the input path index.
_, _, new2old = word_seq.unique(
    need_num_repeats=False, need_new2old_indexes=True
)

does not remove 0s from word_seq.

Previous versions remove 0s from word_seq, see

icefall/icefall/decode.py

Lines 218 to 227 in abadc71

# word_seq is a k2.RaggedTensor sharing the same shape as `path`
# but it contains word IDs. Note that it also contains 0s and -1s.
# The last entry in each sublist is -1.
if isinstance(lattice.aux_labels, torch.Tensor):
    word_seq = k2.ragged.index(lattice.aux_labels, path)
else:
    word_seq = lattice.aux_labels.index(path, remove_axis=True)

# Remove 0 (epsilon) and -1 from word_seq
word_seq = word_seq.remove_values_leq(0)

It does not affect the final WER, but it incurs extra unnecessary computation.
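
A minimal sketch of a possible fix, following the earlier version of the code above: drop the 0s (epsilons) and -1s from word_seq before calling unique(), so that paths differing only in epsilon placement collapse to a single entry (the variable names are the ones from the snippets above):

# Sketch only: remove 0 (epsilon) and -1 from word_seq before deduplication,
# as the earlier version of decode.py did, then deduplicate as before.
word_seq = word_seq.remove_values_leq(0)

_, _, new2old = word_seq.unique(
    need_num_repeats=False, need_new2old_indexes=True
)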

Version conflicts for k2 & torchaudio & torch

Hi,

When I want to try rnn-t, I have to upgrade torchaudio to at least 0.10.0 for the rnn-t loss. However, k2 requires torch 1.8.1. When I upgrade torch to 1.10.0 and torchaudio to 0.10.0, I get this error:

>>> import torch
>>> import torchaudio
>>> import k2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/miniconda3/lib/python3.8/site-packages/k2/__init__.py", line 2, in <module>
    from _k2 import DeterminizeWeightPushingType
ImportError: /path/to/miniconda3/lib/python3.8/site-packages/libk2context.so: undefined symbol: _ZNK2at6Tensor7is_cudaEv

Can you share versions of packages in the environment?

Note: I also checked conda search and couldn't find a corresponding k2 version for torch 1.10.0.

...
k2                   1.11.dev20211209 cuda11.1_py3.7_torch1.8.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.7_torch1.8.1  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.7_torch1.9.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.7_torch1.9.1  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.8_torch1.8.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.8_torch1.8.1  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.8_torch1.9.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.8_torch1.9.1  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.9_torch1.8.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.9_torch1.8.1  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.9_torch1.9.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.9_torch1.9.1  k2-fsa
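
For what it's worth, a minimal sketch for collecting the versions that have to agree with each other (only the standard __version__ attributes are used):

# Print the versions of torch, torchaudio, k2 and lhotse in the current env.
import torch
import torchaudio
import k2
import lhotse

print("torch      :", torch.__version__)
print("torchaudio :", torchaudio.__version__)
print("k2         :", k2.__version__)
print("lhotse     :", lhotse.__version__)

Note that importing k2 is exactly what fails above when the torch ABI does not match, so this only works once a consistent set of wheels is installed.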

Decoding error 'Fsa' object doesn't support assignment.

Hi, I'm experiencing this error while decoding LibriSpeech with the Conformer model.

./conformer_ctc/decode.py --exp-dir conformer_ctc/exp_500_att0.8 \
                          --lang-dir data/lang_bpe_500 \
                          --max-duration 30 \
                          --concatenate-cuts 0 \
                          --bucketing-sampler true \
                          --num-paths 1000 \
                          --epoch 5 \
                          --avg 1 \
                          --method attention-decoder \
                          --nbest-scale 0.5
2021-11-26 19:08:00,025 INFO [decode.py:549] Decoding started
2021-11-26 19:08:00,026 INFO [decode.py:550] {
    'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 
    'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 
    'max_active_states': 10000, 'use_double_scores': True, 
    'env_info': {'k2-version': '1.10', 'k2-build-type': 'Release', 'k2-with-cuda': True,
                 'k2-git-sha1': 'fd5565d32ffa8274ff9700453b1e543f34343ed1', 'k2-git-date': 'Wed Nov 10 08:31:12 2021',
                 'lhotse-version': '0.12.0.dev+git.d5e7815.dirty', 'torch-cuda-available': True, 'torch-cuda-version': '11.2',
                 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'b945223-dirty',
                 'icefall-git-date': 'Thu Nov 11 03:26:02 2021', 'icefall-path': '/home/audiodan/asr/icefall',
                 'k2-path': '/home/audiodan/anaconda3/lib/python3.8/site-packages/k2-1.10.dev20211112+cuda11.2.torch1.10.0a0-py3.8-linux-x86_64.egg/k2/__init__.py',
                 'lhotse-path': '/home/audiodan/asr/lhotse/lhotse/init.py'},
    'epoch': 5, 'avg': 1, 'method': 'attention-decoder', 'num_paths': 1000, 'nbest_scale': 0.5, 'export': False,
    'exp_dir': PosixPath('conformer_ctc/exp_500_att0.8'), 'lang_dir': PosixPath('data/lang_bpe_500'), 
    'lm_dir': PosixPath('data/lm'), 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30, 
    'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0,
    'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 10
}
2021-11-26 19:08:00,218 INFO [lexicon.py:176] Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-11-26 19:08:00,253 INFO [decode.py:560] device: cpu
2021-11-26 19:08:01,984 INFO [decode.py:597] Loading G_4_gram.fst.txt
2021-11-26 19:08:01,984 WARNING [decode.py:598] It may take 8 minutes.
Traceback (most recent call last):
  File "./conformer_ctc/decode.py", line 704, in <module>
    main()
  File "/home/audiodan/anaconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "./conformer_ctc/decode.py", line 615, in main
    G["dummy"] = 1
TypeError: 'Fsa' object does not support item assignment

dataloader.dataset.cuts()

Hello,

I trained a tdnn-lstm model successfully. However when I want to continue with decoding, this error occurs:

$ python tdnn_lstm_ctc/decode.py
2021-08-19 09:24:07,705 INFO [decode.py:319] Decoding started
2021-08-19 09:24:07,705 INFO [decode.py:320] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': '1best', 'num_paths': 30, 'epoch': 9, 'avg': 5, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-08-19 09:24:07,974 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-08-19 09:24:08,077 INFO [decode.py:329] device: cuda:0
2021-08-19 09:24:20,422 INFO [decode.py:387] averaging ['tdnn_lstm_ctc/exp2/epoch-5.pt', 'tdnn_lstm_ctc/exp2/epoch-6.pt', 'tdnn_lstm_ctc/exp2/epoch-7.pt', 'tdnn_lstm_ctc/exp2/epoch-8.pt', 'tdnn_lstm_ctc/exp2/epoch-9.pt']
Traceback (most recent call last):
  File "tdnn_lstm_ctc/decode.py", line 419, in <module>
    main()
  File "path/to/env/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "tdnn_lstm_ctc/decode.py", line 402, in main
    results_dict = decode_dataset(
  File "tdnn_lstm_ctc/decode.py", line 239, in decode_dataset
    tot_num_cuts = len(dl.dataset.cuts)
AttributeError: 'K2SpeechRecognitionDataset' object has no attribute 'cuts'

The offending line is: tot_num_cuts = len(dl.dataset.cuts)

I checked the source code and there is no .cuts attribute/function. I think the number of cuts should equal the number of samples when each segment is the whole audio file. So, can we remove .cuts, so that tot_num_cuts = len(dl.dataset) would be OK for LibriSpeech?
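
A minimal sketch of the suggested change (hypothetical; it just illustrates the idea from the question above):

# K2SpeechRecognitionDataset has no `.cuts` attribute, so count cuts via the
# dataset length instead; fall back to -1 if no length is available.
try:
    tot_num_cuts = len(dl.dataset)
except TypeError:
    tot_num_cuts = -1  # unknown; progress logging can simply omit the total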

A question about the format of the loaded data

Hi, I have been learning icefall and lhotse recently, and I am mainly interested in how the training and testing data are processed. Lhotse provides standard data preparation recipes for commonly used corpora, which is very good. I see that lhotse stores the data in JSON format (*.json or *.jsonl.gz), and that it also supports converting Kaldi files to lhotse manifests. The questions I want to ask are:
1. Why was JSON (*.json or *.jsonl.gz) chosen to represent the data?
2. Why not use the Kaldi file format (wav.scp, text, utt2dur, ...)? I think its meaning is very easy to understand.
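
For context, a small illustration of what working with the JSON/JSONL manifests looks like (a sketch; the path is an example, not a fixed name):

# Load a lhotse cut manifest produced by the data-prep recipes and inspect one
# cut; a single cut bundles the recording reference, duration and supervision
# text that Kaldi keeps in separate files (wav.scp, text, utt2dur, ...).
from lhotse import load_manifest

cuts = load_manifest("data/fbank/cuts_train.jsonl.gz")  # example path
cut = next(iter(cuts))
print(cut.id, cut.duration, cut.supervisions[0].text)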

How can I install `k2` with version `1.9.dev20211101`.

Thanks for the great work, before I ask my question.
Here is my problem: I want to use icefall for the stateless transducer experiment; it requires torchaudio 0.10.0, which means we need PyTorch 1.10.0.
But I can't find a matching wheel package on conda or pip, so I guess we need to install from source, but I don't know how to check out k2 version 1.9.dev20211101.

Hope someone can help me. Thanks.

How to decode fast on cpu?

Hi,

Recently, I ran the librispeech experiment in icefall.
I found that decoding on CPU takes a long time. I know decoding on GPU is fast,
but I need to decode on CPU. So the questions I want to ask are:

  • Is there a way to make it run faster on CPU?
  • How long did it take you to decode on CPU?

Here is the screenshot of the decoding log. It took about 10 hours.
[screenshot of the decoding log]

thanks!
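
Two knobs that often matter for CPU decoding speed are the number of PyTorch threads and the size of the decoding search space. A hedged sketch (the values are examples, not recommendations from this thread):

# Align PyTorch threading with the machine before decoding.
import torch

torch.set_num_threads(8)           # intra-op threads, e.g. number of physical cores
torch.set_num_interop_threads(1)   # avoid oversubscription between ops

# In addition, smaller search_beam / output_beam / max_active_states values and a
# larger --max-duration typically trade a little accuracy for a large speedup.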

On the fly decoding support

Hi, I'm new here. First, thanks for this project. Though early, it looks awesome.
I went through the current recipes and found that they are all based on the CTC topology. I wonder whether the current version supports on-the-fly decoding with an AED model? Specifically, I mean an AED model outputting phoneme-based scores and a WFST-based LG model dynamically computing the corresponding word-based scores.

Exception when rescoring with attention decoder

After successfully decoding the test dataset with the 1best decoder, I decided to try the attention decoder to get a better result.
But it seems that something unexpected occurred, and I have no idea how to debug the cause of this exception.
Can anyone make suggestions?
Exception information:

Traceback (most recent call last):
  File "conformer/decode_mmi.py", line 485, in <module>
    main()
  File "/cfs/sge/miniconda3/envs/k2-debug/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "conformer/decode_mmi.py", line 469, in main
    results_dict = decode_dataset(
  File "conformer/decode_mmi.py", line 320, in decode_dataset
    hyps_dict, texts = decode_one_batch(
  File "conformer/decode_mmi.py", line 256, in decode_one_batch
    best_path_dict = rescore_with_attention_decoder(
  File "/cfs/sge/k2/icefall/icefall/decode.py", line 828, in rescore_with_attention_decoder
    nbest = nbest.intersect(lattice)
  File "/cfs/sge/k2/icefall/icefall/decode.py", line 333, in intersect
    path_lattice = _intersect_device(
  File "/cfs/sge/k2/icefall/icefall/decode.py", line 64, in _intersect_device
    return k2.cat(ans)
  File "/cfs/sge/k2/k2/k2/python/k2/ops.py", line 219, in cat
    out_fsa = Fsa(ans_ragged_arcs)
  File "/cfs/sge/k2/k2/k2/python/k2/fsa.py", line 229, in __init__
    _ = self.properties
  File "/cfs/sge/k2/k2/k2/python/k2/fsa.py", line 455, in properties
    raise ValueError(
ValueError: Fsa is not valid, properties are: 2 = "Nonempty", arcs are: [ [ [ 0 1 0 -2.23462 0 2 0 -7.25351 ] [ 1 3 0 -2.4075 1 4 0 -7.06948 ] [ 2 5 0 -2.4075 ] [ 3 6 0 -2.97694 3 7 0 -5.45353 ] ...

Mismatch between custom LabelSmoothing and PyTorch's Label Smoothing

The current LabelSmoothingLoss in icefall (see below)

class LabelSmoothingLoss(nn.Module):

is based on ESPnet, which seems to be based on The Annotated Transformer.

As @janvainer pointed out in #106, there is a built-in label smoothing loss in torch >= 1.10; see https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss

I just compared the implementation from PyTorch with the one in icefall and identified the following differences.

PyTorch's implementation follows the one described in the paper Attention Is All You Need

LabelSmoothing is proposed by the paper Rethinking the Inception Architecture for Computer Vision, which has the following formula:

    q'(k) = (1 - epsilon) * delta(k, y) + epsilon / K

where y is the target label, delta(k, y) is 1 if k == y and 0 otherwise, epsilon is the smoothing weight, and K is the number of classes.

First difference

icefall is using K - 1, not K. See the code below from icefall

true_dist.fill_(self.smoothing / (self.size - 1))

Second difference

Rethinking the Inception Architecture for Computer Vision uses cross-entropy to compute the loss. The formula from the paper is:

    H(q', p) = - sum_k q'(k) * log p(k) = (1 - epsilon) * H(q, p) + epsilon * H(u, p)

where p is the predicted distribution, q is the one-hot ground-truth distribution, and u is the uniform distribution over the K classes.

but icefall uses the following formula, i.e., KL-divergence:

    KL(q' || p) = sum_k q'(k) * log(q'(k) / p(k))


To match PyTorch's implementation (also the one used in the original transformer paper), we have to make the following changes:

(1) Change

true_dist.fill_(self.smoothing / (self.size - 1))

to

true_dist.fill_(self.smoothing / self.size)

Also, we need to add self.smoothing / self.size to the target positions in true_dist.

That is, change

true_dist.scatter_(1, target.unsqueeze(1), self.confidence)

to use scatter_add_

            true_dist.scatter_add_(
                1,
                target.unsqueeze(1),
                torch.full(true_dist.size(), fill_value=self.confidence).to(true_dist),
            )

(2) Change

kl = self.criterion(torch.log_softmax(x, dim=1), true_dist)

to

label_smoothing_loss = -1 * (torch.log_softmax(x, dim=1) * true_dist).sum(dim=1)

@danpovey Do you think we should make the above changes?
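
For reference, a small numerical check (a sketch, assuming torch >= 1.10) showing that the smoothing/size fill plus scatter_add_ described above reproduces PyTorch's built-in label smoothing:

# Compare F.cross_entropy(label_smoothing=...) with the manually smoothed
# distribution: smoothing/K everywhere plus (1 - smoothing) on the target class,
# combined with log_softmax.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, smoothing = 5, 0.1
x = torch.randn(4, num_classes)          # logits
target = torch.tensor([0, 2, 1, 4])

builtin = F.cross_entropy(x, target, label_smoothing=smoothing)

true_dist = torch.full_like(x, smoothing / num_classes)
true_dist.scatter_add_(
    1, target.unsqueeze(1), torch.full((x.size(0), 1), 1.0 - smoothing)
)
manual = -(torch.log_softmax(x, dim=1) * true_dist).sum(dim=1).mean()

print(builtin.item(), manual.item())  # should agree up to floating-point error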

Problem during preparing G

Hi, I am trying to train a tdnn-lstm model on Turkish data. I ran an experiment successfully once (including the decoding part) with a smaller language model. Then I tried a new corpus for language modeling. During the prepare-G step, I got this error:

2021-08-23 11:35:45 (prepare.sh:118:main) Stage 6: Compile HLG
2021-08-23 11:35:46,389 INFO [compile_hlg.py:126] Processing data/lang_phone
2021-08-23 11:35:46,859 INFO [lexicon.py:99] Converting L.pt to Linv.pt
2021-08-23 11:35:47,640 INFO [compile_hlg.py:48] Building ctc_topo. max_token_id: 52
2021-08-23 11:35:47,826 INFO [compile_hlg.py:57] Loading G_3_gram.fst.txt
2021-08-23 11:38:15,955 INFO [compile_hlg.py:68] Intersecting L and G
2021-08-23 11:51:37,889 INFO [compile_hlg.py:70] LG shape: (301909252, None)
2021-08-23 11:51:37,889 INFO [compile_hlg.py:72] Connecting LG
2021-08-23 11:51:37,889 INFO [compile_hlg.py:74] LG shape after k2.connect: (301909252, None)
2021-08-23 11:51:37,889 INFO [compile_hlg.py:76] <class 'torch.Tensor'>
2021-08-23 11:51:37,889 INFO [compile_hlg.py:77] Determinizing LG
2021-08-23 12:09:11,585 INFO [compile_hlg.py:80] <class '_k2.RaggedInt'>
2021-08-23 12:09:11,585 INFO [compile_hlg.py:82] Connecting LG after k2.determinize
2021-08-23 12:09:11,585 INFO [compile_hlg.py:85] Removing disambiguation symbols on LG
[F] /usr/share/miniconda/envs/k2/conda-bld/k2_1628135473078/work/k2/csrc/tensor.cu:159:k2::Tensor::Tensor(k2::Dtype, const k2::Shape&, k2::RegionPtr, int32_t) Check failed: int64_t(impl_->byte_offset) + begin_elem * element_size >= 0 (-1246502780 vs. 0) 


[ Stack-Trace: ]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x4c) [0x7fb12e7c76bc]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/libk2context.so(k2::Tensor::Tensor(k2::Dtype, k2::Shape const&, std::shared_ptr<k2::Region>, int)+0x6da) [0x7fb12ed24aca]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/libk2context.so(k2::Array2<int>::Col(int)+0x13a) [0x7fb12ecd03ba]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/libk2context.so(+0x27e0d9) [0x7fb12ecc40d9]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/libk2context.so(k2::Index(k2::RaggedShape&, int, k2::Array1<int> const&, k2::Array1<int>*)+0x1da) [0x7fb12ecc649a]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xc5395) [0x7fb134ced395]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xabc90) [0x7fb134cd3c90]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x1dfcf) [0x7fb134c45fcf]
python3(PyCFunction_Call+0x54) [0x555e1b3b3914]
python3(_PyObject_MakeTpCall+0x31e) [0x555e1b3b6ebe]
python3(_PyEval_EvalFrameDefault+0x52f6) [0x555e1b458986]
python3(_PyFunction_Vectorcall+0x1a6) [0x555e1b43b646]
python3(_PyEval_EvalFrameDefault+0x947) [0x555e1b453fd7]
python3(_PyFunction_Vectorcall+0x1a6) [0x555e1b43b646]
python3(_PyEval_EvalFrameDefault+0x947) [0x555e1b453fd7]
python3(_PyEval_EvalCodeWithName+0x2c3) [0x555e1b43a433]
python3(_PyFunction_Vectorcall+0x378) [0x555e1b43b818]
python3(_PyEval_EvalFrameDefault+0x1822) [0x555e1b454eb2]
python3(_PyFunction_Vectorcall+0x1a6) [0x555e1b43b646]
python3(_PyEval_EvalFrameDefault+0x4d33) [0x555e1b4583c3]
python3(_PyFunction_Vectorcall+0x1a6) [0x555e1b43b646]
python3(_PyEval_EvalFrameDefault+0x947) [0x555e1b453fd7]
python3(_PyFunction_Vectorcall+0x1a6) [0x555e1b43b646]
python3(_PyEval_EvalFrameDefault+0x947) [0x555e1b453fd7]
python3(_PyEval_EvalCodeWithName+0x2c3) [0x555e1b43a433]
python3(PyEval_EvalCodeEx+0x39) [0x555e1b43b499]
python3(PyEval_EvalCode+0x1b) [0x555e1b4d6ecb]
python3(+0x252f63) [0x555e1b4d6f63]
python3(+0x26f033) [0x555e1b4f3033]
python3(+0x274022) [0x555e1b4f8022]
python3(PyRun_SimpleFileExFlags+0x1b2) [0x555e1b4f8202]
python3(Py_RunMain+0x36d) [0x555e1b4f877d]
python3(Py_BytesMain+0x39) [0x555e1b4f8939]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fb25ee04b97]
python3(+0x1e8f39) [0x555e1b46cf39]

Traceback (most recent call last):
  File "./local/compile_hlg.py", line 140, in <module>
    main()
  File "./local/compile_hlg.py", line 128, in main
    HLG = compile_HLG(lang_dir)
  File "./local/compile_hlg.py", line 92, in compile_HLG
    LG = k2.remove_epsilon(LG)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 562, in remove_epsilon
    out_fsa = k2.utils.fsa_from_unary_function_ragged(fsa, ragged_arc, arc_map,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 515, in fsa_from_unary_function_ragged
    new_value = index(value, arc_map)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 335, in index
    return index_ragged(src, indexes)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 283, in index_ragged
    return _k2.index(src, indexes)
RuntimeError: Some bad things happed.

How can I solve this? My ARPA files are 2.1 GB and 6.2 GB for the 3-gram and 4-gram models respectively. Could it be related to a size issue? My language models are prepared with KenLM.

Is this relevant to icefall, or should I ask in the k2 repository?

conda installation of k2 doesn't work

Hi, I am trying to install k2 within a conda env with the command you provide:

conda install -c k2-fsa -c pytorch -c conda-forge k2 python=3.8 cudatoolkit=11.1 pytorch=1.8.1
Unfortunately conda complains with:

PackagesNotFoundError: The following packages are not available from current channels:

  - k2

I am running on a Windows 10 machine

CUDA out of memory in decoding

Hi, I am new to icefall. I finished training tdnn_lstm_ctc, but when I run the decoding step I get the following error; I changed --max-duration, but there are still errors:

[screenshot of the CUDA out-of-memory error]

We set --max-duration=100 and use a Tesla V100-SXM; the GPU info follows:

[screenshot of the GPU info]

Would you give me some advice? Thanks!

Why Nbest.intersect takes the shortest path within rescore_with_attention_decoder

Hi, I'm new to the icefall tool. In the latest commit, I notice that the "intersect" function of the "Nbest" class calls k2.shortest_path, which returns the best path of the lattice. However, this seems to contradict "rescore_with_attention_decoder" in decode.py, which rescores multiple paths in the lattice. Can someone please explain why we're doing this?

one_best = k2.shortest_path(

nbest = nbest.intersect(lattice)

Some results about training TDNN_LSTM_CTC based on Single One GPU

I did some experiments with tdnn_lstm_ctc trained on just a single GPU.

Note: these results are just for reference, not final conclusions!

Case 1: When setting bucketing_sampler to False and using lr=0.001
[screenshot of the training loss curve]
The training log:
log-train-2021-10-09-20-17-49.txt
From the above curve and the training log, we can see that the model converges. If you want the model to converge better, you can train for more epochs.

Case 2: When setting bucketing_sampler to True and using lr=0.001
[screenshot of the training loss curve]
The training log:
log-train-2021-10-12-19-34-52.txt
From the above curve and the training log, we can see that the model does NOT converge. It seems that the large lr leads to this.

Case 3: When setting bucketing_sampler to True and using lr=0.00025 (a small lr)
[screenshot of the training loss curve]
The training log:
log-train-2021-10-13-17-16-27.txt
From the above curve and the training log, we can see that the model converges better. You could also try other small lr values.

So, according to all of the above results, if you run this script on a single GPU, I suggest training for more epochs when setting bucketing_sampler=False, and using a small lr (such as 0.00025) when setting bucketing_sampler=True.

Problem with valid loss functions

Since we merged a change to asr_datamodule.py (sorry, I don't have time to find the PR), our valid loss function for attention is very bad, which affects diagnostics (but not decoding).

The issue seems to be related to the reordering ("indices") that is done in encode_supervisions(); the supervision for the attention decoder is taken from there, but it looks like we are not properly taking the reordering into account. I have verified that the "indices" variable in encode_supervisions() always seems to be in order for train data, but for some reason not for valid. I won't be fixing this tonight, as it's late right now; we'll fix it tomorrow.

Making an issue in case anyone else notices the problem.
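
A toy illustration of the reordering in question (not icefall code; the names are just for the example): if the frame segments are sorted by an "indices" permutation, the texts used as attention-decoder targets have to be permuted the same way, otherwise targets and encoder outputs are misaligned.

import torch

# Three utterances with different frame counts and their transcripts.
num_frames = torch.tensor([80, 120, 95])
texts = ["first utt", "second utt", "third utt"]

# encode_supervisions()-style sort by decreasing number of frames.
indices = torch.argsort(num_frames, descending=True)

# The texts must follow the same permutation as the frame segments.
sorted_texts = [texts[i] for i in indices.tolist()]
print(sorted_texts)  # ['second utt', 'third utt', 'first utt']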

Get inf ctc loss during training

I'm using my own labeled data for training and get an inf CTC loss.

I debugged it and found that the batch that makes the CTC loss inf contains very short data: the minimum number of frames (before subsampling) is 11.

I tried generating a train manifest that filters out segments shorter than 1 s (similar to Kaldi), and that seems to be OK.

So maybe this very short data causes the inf loss?
If so, maybe this could be noted in some code or a tutorial? And what is the recommended minimum threshold for this value?
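
A minimal sketch of the filtering mentioned above, assuming lhotse manifests (the 1 s threshold is the value from this report, not an official recommendation):

# Drop cuts shorter than 1 second before feature extraction / training,
# similar to Kaldi's minimum-duration filtering. The paths are examples.
from lhotse import load_manifest

cuts = load_manifest("data/fbank/cuts_train.jsonl.gz")
cuts = cuts.filter(lambda c: c.duration >= 1.0)
cuts.to_file("data/fbank/cuts_train_filtered.jsonl.gz")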

Problem with max-duration during training

I'm using V100 GPUs with 16 GB memory and training with world-size = 4.

If I use max-duration = 50, an OOM error occurs after some batches of training.
If I use max-duration = 30, training finishes, but GPU usage is usually below 60%, which may mean longer training time.

What is the main contributor to GPU memory usage? Any advice?
