
icefall's Introduction

Introduction

The icefall project contains speech-related recipes for various datasets using k2-fsa and lhotse.

You can use sherpa, sherpa-ncnn, or sherpa-onnx to deploy models trained with icefall. These frameworks also support models not included in icefall; please refer to their respective documentation for more details.

You can try pre-trained models directly in your browser, without downloading or installing anything, by visiting this huggingface space. Please refer to the documentation for more details.

Installation

Please refer to the documentation for installation instructions.

Recipes

Please refer to the documentation for more details.

ASR: Automatic Speech Recognition

Supported Datasets

More datasets will be added in the future.

Supported Models

The LibriSpeech recipe supports the most comprehensive set of models; you are welcome to try them out.

CTC

  • TDNN LSTM CTC
  • Conformer CTC
  • Zipformer CTC

MMI

  • Conformer MMI
  • Zipformer MMI

Transducer

  • Conformer-based Encoder
  • LSTM-based Encoder
  • Zipformer-based Encoder
  • LSTM-based Predictor
  • Stateless Predictor

Whisper

If you would like to contribute to icefall, please refer to the contributing guide for more details.

We would like to highlight the performance of some of the recipes here.

This is the simplest ASR recipe in icefall (the yesno recipe) and can be run on a CPU. Training takes less than 30 seconds and gives you the following WER:

[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]

We provide a Colab notebook for this recipe: Open In Colab

Please see RESULTS.md for the latest results.

test-clean test-other
WER 2.42 5.73

We provide a Colab notebook to test the pre-trained model: Open In Colab

test-clean test-other
WER 6.59 17.69

We provide a Colab notebook to test the pre-trained model: Open In Colab

test-clean test-other
greedy_search 3.07 7.51

We provide a Colab notebook to test the pre-trained model: Open In Colab

test-clean test-other
modified_beam_search (beam_size=4) 2.56 6.27

We provide a Colab notebook to test the pre-trained model: Open In Colab

WER (modified_beam_search beam_size=4 unless further stated)

  1. LibriSpeech-960hr

     Encoder          Params  test-clean  test-other  epochs  devices
     Zipformer        65.5M   2.21        4.79        50      4 32G-V100
     Zipformer-small  23.2M   2.42        5.73        50      2 32G-V100
     Zipformer-large  148.4M  2.06        4.63        50      4 32G-V100
     Zipformer-large  148.4M  2.00        4.38        174     8 80G-A100

  2. LibriSpeech-960hr + GigaSpeech

     Encoder          Params  test-clean  test-other
     Zipformer        65.5M   1.78        4.08

  3. LibriSpeech-960hr + GigaSpeech + CommonVoice

     Encoder          Params  test-clean  test-other
     Zipformer        65.5M   1.90        3.98

      Dev    Test
WER   10.47  10.58

Conformer Encoder + Stateless Predictor + k2 Pruned RNN-T Loss

                      Dev    Test
greedy_search         10.51  10.73
fast_beam_search      10.50  10.69
modified_beam_search  10.40  10.51

                      Dev    Test
greedy_search         10.31  10.50
fast_beam_search      10.26  10.48
modified_beam_search  10.25  10.38

      test
CER   10.16

We provide a Colab notebook to test the pre-trained model: Open In Colab

test
CER 4.38

We provide a Colab notebook to test the pre-trained model: Open In Colab

WER (modified_beam_search beam_size=4)

Encoder Params dev test epochs
Zipformer 73.4M 4.13 4.40 55
Zipformer-small 30.2M 4.40 4.67 55
Zipformer-large 157.3M 4.03 4.28 56

1 Trained with all subsets:

test
CER 29.08

We provide a Colab notebook to test the pre-trained model: Open In Colab

TEST
PER 19.71%

We provide a Colab notebook to test the pre-trained model: Open In Colab

TEST
PER 17.66%

We provide a Colab notebook to test the pre-trained model: Open In Colab

dev test
modified_beam_search (beam_size=4) 6.91 6.33

We provide a Colab notebook to test the pre-trained model: Open In Colab

dev test
modified_beam_search (beam_size=4) 6.77 6.14

We provide a Colab notebook to test the pre-trained model: Open In Colab

Dev Test
greedy_search 5.53 6.59
fast_beam_search 5.30 6.34
modified_beam_search 5.27 6.33

We provide a Colab notebook to test the pre-trained model: Open In Colab

Dev Test-Net Test-Meeting
greedy_search 7.80 8.75 13.49
fast_beam_search 7.94 8.74 13.80
modified_beam_search 7.76 8.71 13.41

We provide a Colab notebook to test the pre-trained model: Open In Colab

                      Dev    Test-Net  Test-Meeting
greedy_search         8.78   10.12     16.16
fast_beam_search      9.01   10.47     16.28
modified_beam_search  8.53   9.95      15.81

                      Eval   Test-Net
greedy_search         31.77  34.66
fast_beam_search      31.39  33.02
modified_beam_search  30.38  34.25

We provide a Colab notebook to test the pre-trained model: Open In Colab

The best results for Chinese CER (%) and English WER (%), respectively (zh: Chinese, en: English):

decoding-method dev dev_zh dev_en test test_zh test_en
greedy_search 7.30 6.48 19.19 7.39 6.66 19.13
fast_beam_search 7.18 6.39 18.90 7.27 6.55 18.77
modified_beam_search 7.15 6.35 18.95 7.22 6.50 18.70

We provide a Colab notebook to test the pre-trained model: Open In Colab

TTS: Text-to-Speech

Supported Datasets

Supported Models

Deployment with C++

Once you have trained a model in icefall, you may want to deploy it with C++ without Python dependencies.

Please refer to the documentation for how to do this.

We also provide a Colab notebook showing how to run a torch-scripted model in k2 with C++. Please see: Open In Colab
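For illustration only (this is not icefall's actual export script; the model and file names are placeholders), the underlying mechanism is plain TorchScript: the trained PyTorch model is scripted and saved, after which a C++ program linked against libtorch/k2 can load it with torch::jit::load().

import torch

def export_to_torchscript(model: torch.nn.Module, filename: str = "cpu_jit.pt") -> None:
    # Save `model` so a C++ program can load it with torch::jit::load(filename).
    model.eval()  # freeze dropout / batch-norm behavior for inference
    scripted = torch.jit.script(model)  # or torch.jit.trace(model, example_inputs)
    scripted.save(filename)

The recipes in icefall ship their own export scripts with recipe-specific options; the sketch above only shows the general idea.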

icefall's People

Contributors

csukuangfj, danpovey, desh2608, emreozkose, ezerhouni, glynpu, huangruizhe, jinzr, karelvesely84, kobenaxie, luomingshuang, marcoyang1998, pingfengluo, pkufool, pzelasko, rickychanhoyin, rouseabout, shanguanma, shcxlee, teapoly, teowenshen, videodanchik, wangtiance, waynewiser, wgb14, yaguanghu, yaozengwei, yfyeung, yuekaizhang, zhuangweiji

icefall's Issues

Official Dockerfile?

Should we create an official Dockerfile that shows how to prepare the env for Icefall experiments?

CUDA error of device

Hi, I ran the TDNN-LSTM-CTC training for librispeech with the command line ./tdnn_lstm_ctc/train.py --world-size 4, but I did not use export CUDA_VISIBLE_DEVICES="0,1,2,3" because our computing cluster does not allow setting CUDA_VISIBLE_DEVICES, so I got the following errors:

Traceback (most recent call last):
File "./tdnn_lstm_ctc/train.py", line 616, in
main()
File "./tdnn_lstm_ctc/train.py", line 610, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/data/nfs13/nfs/aisearch/asr/cdxie/icefall/egs/librispeech/ASR/tdnn_lstm_ctc/train.py", line 512, in run
setup_dist(rank, world_size, params.master_port)
File "/workspace/icefall/icefall/dist.py", line 30, in setup_dist
torch.cuda.set_device(rank)
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 261, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal

1) How can I solve this error?
2) Does icefall not support multi-GPU, multi-machine DDP training, and is that the cause of the above error? If so, when will it be added?
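Not an official answer, but a quick sanity check that follows from the traceback: torch.cuda.set_device(rank) raises "invalid device ordinal" whenever rank is not smaller than the number of GPUs actually visible to the process. A minimal check (world_size mirrors the --world-size flag; the assertion message is illustrative):

import torch

world_size = 4  # mirrors ./tdnn_lstm_ctc/train.py --world-size 4
visible = torch.cuda.device_count()
print(f"GPUs visible to this process: {visible}")
# torch.cuda.set_device(rank) fails as soon as rank >= visible,
# i.e. when the scheduler exposes fewer GPUs than --world-size requests.
assert world_size <= visible, (
    f"--world-size {world_size} requires {world_size} visible GPUs, "
    f"but only {visible} are visible"
)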

RuntimeError: Specified device cuda:0 does not match device of data cuda:-2

Hello,

I am training a TDNN-LSTM model with the librispeech recipe on 100 hours of 16 kHz data. After training, I run decode.py and sometimes observe a CUDA issue (given below). Have you ever observed something like this? I think it is related to something during training, because after some training runs decode.py works well, while after others it gives this error. I googled "RuntimeError: Specified device cuda:0 does not match device of data cuda:-2" but found nothing. I have a Tesla P100 with 16 GB. I should also mention that 1best decoding works fine; the problem occurs during nbest decoding and rescoring.

(k2) yunusemre.ozkose@boxx-3:/path/to/k2/icefall/egs/from_wav_scp/ASR$ python tdnn_lstm_ctc/decode.py --avg 1 --epoch 9
2021-09-02 14:24:46,677 INFO [decode.py:324] Decoding started
2021-09-02 14:24:46,678 INFO [decode.py:325] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp9_w2v2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 1024, 'subsampling_factor': 1, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'nbest-rescoring', 'num_paths': 10, 'epoch': 9, 'avg': 1, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-09-02 14:24:47,880 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-02 14:24:48,469 INFO [decode.py:334] device: cuda:0
2021-09-02 14:25:02,211 INFO [decode.py:362] Loading pre-compiled G_4_gram.pt
2021-09-02 14:25:02,846 INFO [checkpoint.py:75] Loading checkpoint from tdnn_lstm_ctc/exp9_w2v2/epoch-9.pt
/path/to/k2/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames or max_cuts constraints - we'll return it anyway. Consider increasing max_frames/max_cuts.
  warnings.warn(
2021-09-02 14:25:07,886 INFO [decode.py:271] batch 0, cuts processed until now is 1/171 (0.584795%)
Traceback (most recent call last):
  File "tdnn_lstm_ctc/decode.py", line 432, in <module>
    main()
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "tdnn_lstm_ctc/decode.py", line 415, in main
    results_dict = decode_dataset(
  File "tdnn_lstm_ctc/decode.py", line 250, in decode_dataset
    hyps_dict = decode_one_batch(
  File "tdnn_lstm_ctc/decode.py", line 190, in decode_one_batch
    best_path_dict = rescore_with_n_best_list(
  File "/path/to/k2/icefall/icefall/decode.py", line 405, in rescore_with_n_best_list
    am_scores, _ = compute_am_and_lm_scores(
  File "/path/to/k2/icefall/icefall/decode.py", line 297, in compute_am_and_lm_scores
    path_lattice = _intersect_device(
  File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
    return k2.intersect_device(
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
    out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
    value = index_select(a_value, a_arc_map, default_value=filler) \
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 160, in index_select
    ans = _IndexSelectFunction.apply(src, index, default_value)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 66, in forward
    return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f41692162f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f416921367b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7f40c8316200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7f40c83fc0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7f40c8372bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7f40c837658f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7f40c838d876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7f40c830bfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7f41c016d41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #54: __libc_start_main + 0xe7 (0x7f41f24cbb97 in /lib/x86_64-linux-gnu/libc.so.6)

Multiple tokenization alternatives in lexicon from the sub-word tokenizer

@csukuangfj @danpovey I wonder if you considered using all possible BPE tokenizations as "pronunciation alternatives" in the lexicon. It'd look something like:

ALTERNATIVES 1.0 ALT _ER _NA _TI _VES
ALTERNATIVES 0.7 ALTER _NA _TI _VES
ALTERNATIVES 0.11 ALTER _NA _TI _V _E _S
...
ALTERNATIVES 0.001 A _L _T _E _R _N _A _T _I _V _E _S

I recall that in machine translation, different tokenizations are sometimes sampled on the fly for the training examples, so in expectation the model sees all possible tokenizations. I thought that with k2 the model could be optimized against all possible tokenizations at the same time, but I'm also concerned about the resulting graph sizes. WDYT?
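For reference, this kind of sampling / n-best enumeration of tokenizations is already exposed by sentencepiece; the snippet below is only an illustration of how the alternatives could be generated (it is not icefall code, and the model path is hypothetical):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")  # hypothetical model path

# Sample one tokenization per call, as in MT-style subword regularization.
sampled = sp.encode("ALTERNATIVES", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1)

# Or enumerate the 5 most likely tokenizations, e.g. to build a lexicon
# with weighted "pronunciation alternatives" as sketched above.
nbest = sp.nbest_encode_as_pieces("ALTERNATIVES", 5)
print(sampled)
print(nbest)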

Recipes for release

I'm thinking about which recipes we can show at the tutorial, and whether we need to add more. AFAIK we have the following ready (and can use roughly any combination of them):

Corpora

  • LibriSpeech
  • Aishell

Architectures

  • tdnn-lstm
  • conformer
  • contextnet

Topologies

  • CTC

Criterions

  • CTC
  • MMI with 2-gram denominators

AM targets

  • phones + blank
  • BPEs + blank

tdnn_lstm_ctc training error of librispeech

Hi, when I ran the tdnn_lstm_ctc training of librispeech, I got an error at epoch 5. Please take a look, thanks.
Error log:

2021-09-17 13:28:10,440 INFO [train.py:450] Epoch 5, batch 8620, batch avg loss 1.0633, total avg loss: 1.1221, batch size: 39
2021-09-17 13:28:22,302 INFO [train.py:450] Epoch 5, batch 8630, batch avg loss 1.1049, total avg loss: 1.1507, batch size: 41
2021-09-17 13:28:25,554 WARNING [cut.py:1694] To perform mix, energy must be non-zero and non-negative (got 0.0). MonoCut with id "845a0a69-f758-7b6a-90d8-ba99fa1795c4" will not be mixed in.
2021-09-17 13:28:36,682 INFO [train.py:450] Epoch 5, batch 8640, batch avg loss 1.1305, total avg loss: 1.1622, batch size: 40
2021-09-17 13:28:49,311 INFO [train.py:450] Epoch 5, batch 8650, batch avg loss 1.1228, total avg loss: 1.1774, batch size: 37
[F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] [F] /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed 
int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void /workspace/k2/k2/csrc/intersect_dense.cu:863:lambda [](signed int)->void::operator()(signed int)->void block:[0,0,0], thread: [32,0,0] block:[0,0,0], thread: [33,0,0] block:[0,0,0], thread: [34,0,0] block:[0,0,0], thread: [35,0,0] block:[0,0,0], thread: [36,0,0] block:[0,0,0], thread: [37,0,0] block:[0,0,0], thread: [38,0,0] block:[0,0,0], thread: [39,0,0] block:[0,0,0], thread: [0,0,0] block:[0,0,0], thread: [1,0,0] block:[0,0,0], thread: [2,0,0] block:[0,0,0], thread: [3,0,0] block:[0,0,0], thread: [4,0,0] block:[0,0,0], thread: [5,0,0] block:[0,0,0], thread: [6,0,0] block:[0,0,0], thread: [7,0,0] block:[0,0,0], thread: [8,0,0] block:[0,0,0], thread: [9,0,0] block:[0,0,0], thread: [10,0,0] block:[0,0,0], thread: [11,0,0] block:[0,0,0], thread: [12,0,0] block:[0,0,0], thread: [13,0,0] block:[0,0,0], thread: [14,0,0] block:[0,0,0], thread: [15,0,0] block:[0,0,0], thread: [16,0,0] block:[0,0,0], thread: [17,0,0] block:[0,0,0], thread: [18,0,0] block:[0,0,0], thread: [19,0,0] block:[0,0,0], thread: [20,0,0] block:[0,0,0], thread: [21,0,0] block:[0,0,0], thread: [22,0,0] block:[0,0,0], thread: [23,0,0] block:[0,0,0], thread: [24,0,0] block:[0,0,0], thread: [25,0,0] block:[0,0,0], thread: [26,0,0] block:[0,0,0], thread: [27,0,0] block:[0,0,0], thread: [28,0,0] block:[0,0,0], thread: [29,0,0] block:[0,0,0], thread: [30,0,0] block:[0,0,0], thread: [31,0,0] Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: Check failed: tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || 
fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_s/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [32,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [33,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [34,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [35,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [36,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [37,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [38,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [39,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [0,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [1,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [2,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [3,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [4,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [5,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [6,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [7,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [8,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [9,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [10,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [11,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [12,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [13,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [14,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [15,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [16,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [17,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [18,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [19,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [20,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [21,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [22,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [23,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [24,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [25,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [26,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [27,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [28,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [29,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [30,0,0] Assertion Some bad things happened failed.
/workspace/k2/k2/csrc/intersect_dense.cu:863: lambda [](signed int)->void::operator()(signed int)->void: block: [0,0,0], thread: [31,0,0] Assertion Some bad things happened failed.
tart || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0tot_score_end == tot_score_start || fabs(tot_score_end - tot_score_start) < 1.0 nannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannan vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs vs nannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannannan
[F] /workspace/k2/k2/csrc/array.h:341:T k2::Array1::operator const [with T = int; int32_t = int] Check failed: ret == cudaSuccess (710 vs. 0) Error: device-side assert triggered.
[ Stack-Trace: ]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2_log.so(k2::internal::GetStackTrace()+0x47) [0x7f8c456419f7]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2context.so(k2::Array1::operator const+0xeb9) [0x7f8c4593c8e9]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2context.so(k2::Renumbering::ComputeOld2New()+0x14e) [0x7f8c459377ee]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2context.so(k2::Renumbering::ComputeNew2Old()+0x7f8) [0x7f8c45938f68]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2context.so(k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1, k2::Array1)+0x7ec) [0x7f8c45a9e47c]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/libk2context.so(k2::IntersectDense(k2::Raggedk2::Arc&, k2::DenseFsaVec&, k2::Array1 const*, float, k2::Raggedk2::Arc, k2::Array1, k2::Array1)+0x420) [0x7f8c45a8e900]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x65390) [0x7f8c4bad4390]
/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x1ee9e) [0x7f8c4ba8de9e]
python3(PyCFunction_Call+0x58) [0x55e5b8fb72d8]
python3(_PyObject_MakeTpCall+0x23c) [0x55e5b8fa6edc]
python3(_PyEval_EvalFrameDefault+0x11dd) [0x55e5b902f4ad]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(PyObject_CallObject+0x52) [0x55e5b9002982]
/opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so(THPFunction_apply(_object
, _object*)+0x8fd) [0x7f8d427dc39d]
python3(PyCFunction_Call+0xe0) [0x55e5b8fb7360]
python3(_PyObject_MakeTpCall+0x23c) [0x55e5b8fa6edc]
python3(_PyEval_EvalFrameDefault+0x45a9) [0x55e5b9032879]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x10425f) [0x55e5b8f6725f]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x19aac9) [0x55e5b8ffdac9]
python3(PyObject_Call+0x414) [0x55e5b8fa7874]
python3(_PyEval_EvalFrameDefault+0x2088) [0x55e5b9030358]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyObject_Call_Prepend+0x181) [0x55e5b8ffe051]
python3(+0x19b3fa) [0x55e5b8ffe3fa]
python3(_PyObject_MakeTpCall+0x23c) [0x55e5b8fa6edc]
python3(_PyEval_EvalFrameDefault+0x475) [0x55e5b902e745]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x103562) [0x55e5b8f66562]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x103562) [0x55e5b8f66562]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x103562) [0x55e5b8f66562]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(_PyFunction_Vectorcall+0x1e3) [0x55e5b8ffd593]
python3(+0x103562) [0x55e5b8f66562]
python3(_PyFunction_Vectorcall+0x10b) [0x55e5b8ffd4bb]
python3(+0x10425f) [0x55e5b8f6725f]
python3(_PyEval_EvalCodeWithName+0x300) [0x55e5b8ffc760]
python3(PyEval_EvalCode+0x23) [0x55e5b90914e3]
python3(+0x22e584) [0x55e5b9091584]
python3(+0x2547c4) [0x55e5b90b77c4]
python3(+0x115620) [0x55e5b8f78620]

Traceback (most recent call last):
File "./tdnn_lstm_ctc/train.py", line 616, in
main()
File "./tdnn_lstm_ctc/train.py", line 612, in main
run(rank=0, world_size=1, args=args)
File "./tdnn_lstm_ctc/train.py", line 575, in run
train_one_epoch(
File "./tdnn_lstm_ctc/train.py", line 424, in train_one_epoch
loss = compute_loss(
File "./tdnn_lstm_ctc/train.py", line 317, in compute_loss
loss = k2.ctc_loss(
File "/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/k2/ctc_loss.py", line 136, in ctc_loss
return m(decoding_graph, dense_fsa_vec, target_lengths)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/k2/ctc_loss.py", line 80, in forward
lattice = intersect_dense(decoding_graph, dense_fsa_vec,
File "/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/k2/autograd.py", line 810, in intersect_dense
_IntersectDenseFunction.apply(a_fsas, b_fsas, out_fsa, output_beam,
File "/opt/conda/lib/python3.8/site-packages/k2-1.6.dev20210906+cuda11.1.torch1.8.0-py3.8-linux-x86_64.egg/k2/autograd.py", line 550, in forward
ragged_arc, arc_map_a, arc_map_b = _k2.intersect_dense(
RuntimeError: Some bad things happed.

Aborted error when attempting to use ctc-decoding

I receive an error trying to decode using ctc-decoding. I am using the master branch without code changes but using a model trained for several epochs.

command

gdb --args python ./conformer_ctc/decode.py --avg 1 --epoch 5 --method ctc-decoding --exp-dir exp_dir/ --lang-dir data/lang_bpe_5000 --max-duration 300

result

[...]

2021-11-02 10:06:55,671 INFO [decode.py:476] batch 0/?, cuts processed until now is 6
[F] /usr/share/miniconda/envs/k2/conda-bld/k2_1631619831677/work/k2/csrc/array.h:501:void k2::Array1<T>::Init(k2::ContextPtr, int32_t, k2::Dtype) [with T = char; k2::ContextPtr = std::shared_ptr<k2::Context>; int32_t = int] Check failed: size >= 0 (-1383015021 vs. 0) Array size MUST be greater than or equal to 0, given :-1383015021


[ Stack-Trace: ]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x4c) [0x7fff5b01021c]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::internal::Logger::~Logger()+0x5a) [0x7fff5b33535a]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::Array1<char>::Init(std::shared_ptr<k2::Context>, int, k2::Dtype)+0x1cd) [0x7fff5b3630dd]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::Renumbering::Init(std::shared_ptr<k2::Context>, int, bool)+0xa7) [0x7fff5b3697a7]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::Renumbering::Renumbering(std::shared_ptr<k2::Context>, int, bool)+0xbc) [0x7fff5b36ac5c]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int)+0x37c) [0x7fff5b4c945c]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect()::{lambda()#1}>::_M_invoke(std::_Any_data const&)+0x203) [0x7fff5b4cc763]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/libk2context.so(k2::ThreadPool::ProcessTasks()+0x164) [0x7fff5b610b94]
/anaconda3/envs/selfsl/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xc8421) [0x7fffd25b5421]
/lib64/libpthread.so.0(+0x7ea5) [0x7ffff7bc6ea5]
/lib64/libc.so.6(clone+0x6d) [0x7ffff78ef9fd]

terminate called after throwing an instance of 'std::runtime_error'
  what():
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new


Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff3f7ee700 (LWP 24121)]
0x00007ffff7827387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-324.el7_9.x86_64

As part of debugging (this is sort of a separate issue), I tried using the pretrained model with the decode script to make sure the problem is not model-related, but I received the error

_pickle.UnpicklingError: invalid load key, 'v'

when running the line:

checkpoint = torch.load(filename, map_location="cpu")
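One common cause of "invalid load key, 'v'" (not necessarily the cause here) is that the downloaded file is a Git LFS pointer rather than the real checkpoint, e.g. when a pre-trained model was cloned without git-lfs installed. A quick check, with a hypothetical path:

# Hypothetical path; an LFS pointer is a tiny text file starting with
# "version https://git-lfs...", which makes torch.load() fail this way.
filename = "exp_dir/pretrained.pt"
with open(filename, "rb") as f:
    head = f.read(64)
if head.startswith(b"version https://git-lfs"):
    print("This is a Git LFS pointer, not the checkpoint; install git-lfs and run `git lfs pull`.")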

Any plan to implement a dataloader for CE models, like Kaldi?

I tried to build a VAD model based on a two-class CE loss, but found that data preparation was complicated, and conventional data preparation would consume a lot of storage space. Are there any plans to provide a dataloader for CE training?

CUDA out of memory in decoding

Hi, I am new to icefall. I finished the training of tdnn_lstm_ctc, but when I run the decoding step I get the following error. I changed --max-duration, but the errors persist:

2021-10-04 00:42:07,942 INFO [decode.py:383] Decoding started
2021-10-04 00:42:07,942 INFO [decode.py:384] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 19, 'avg': 5, 'method': 'whole-lattice-rescoring', 'num_paths': 100, 'lattice_score_scale': 0.5, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 50, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-04 00:42:08,361 INFO [lexicon.py:113] Loading pre-compiled data/lang_phone/Linv.pt
2021-10-04 00:42:08,614 INFO [decode.py:393] device: cuda:0
2021-10-04 00:42:23,560 INFO [decode.py:406] Loading G_4_gram.fst.txt
2021-10-04 00:42:23,560 WARNING [decode.py:407] It may take 8 minutes.
Traceback (most recent call last):
File "./tdnn_lstm_ctc/decode.py", line 492, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "./tdnn_lstm_ctc/decode.py", line 420, in main
G = k2.arc_sort(G)
File "/opt/conda/lib/python3.8/site-packages/k2-1.8.dev20210918+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/k2/fsa_algo.py", line 441, in arc_sort
ragged_arc, arc_map = _k2.arc_sort(fsa.arcs, need_arc_map=need_arc_map)
RuntimeError: CUDA out of memory. Tried to allocate 884.00 MiB (GPU 0; 15.78 GiB total capacity; 14.28 GiB already allocated; 461.19 MiB free; 14.29 GiB reserved in total by PyTorch)

the device used:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02 Driver Version: 440.118.02 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 27C P0 25W / 250W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:D8:00.0 Off | 0 |
| N/A 28C P0 25W / 250W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

Could you give me some advice? Thanks.

pretrained.py shows worse CER than decode.py?

I have trained my own model and tested it on my datasets.
In the first step, I decode with many parameter combinations, just as in fine-tuning (using decode.py), and save the best parameters. I use --max-duration=20. I then save the decoding results obtained with the best parameters on all datasets (not just one).

Then I use these best parameters to decode the waves (using pretrained.py), one by one on these datasets.

All my datasets show a slightly worse CER with pretrained.py. CER comparison below.
decode.py: 3.190 12.802 17.995 9.569 14.478 10.299 16.242 7.329 20.695
pretrained.py: 3.203 13.029 18.177 9.662 14.610 10.447 16.463 7.333 20.911

Is this normal? I see that the feature extraction is not the same; could that be the reason?

Preparing features takes too long

Hi,

Preparing filter banks for large datasets (~6000 h) takes too long. Even preparing the manifests takes approximately 10 hours. Do you have any suggestions to make this process faster?

I am trying to modify the Dataset part to compute features on the fly (if a feature has not been computed yet, compute and write it; otherwise read it), so that I can observe early batches sooner.
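For reference, lhotse already supports computing features on the fly inside the dataset, so the offline fbank stage can be skipped. A minimal sketch using standard lhotse APIs (not icefall's exact recipe code; the manifest path and max_duration are placeholders):

from torch.utils.data import DataLoader
from lhotse import Fbank, load_manifest
from lhotse.dataset import K2SpeechRecognitionDataset, OnTheFlyFeatures, SingleCutSampler

cuts = load_manifest("data/manifests/cuts_train.jsonl.gz")  # hypothetical path
# Compute fbank features per batch instead of reading precomputed ones.
dataset = K2SpeechRecognitionDataset(input_strategy=OnTheFlyFeatures(Fbank()))
sampler = SingleCutSampler(cuts, max_duration=200.0)
# lhotse samplers yield whole mini-batches of cuts, hence batch_size=None.
dataloader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)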

Problem with long wavs that have multiple supervisions

When I train with data that has multiple supervisions, e.g. one wav with two supervisions [(0, 5, "I'm good"), (10, 15, "ok")], an error occurs.

I have debugged it and found the reason: in the dataloader, lhotse.load_manifest returns one MonoCut with two supervisions, but only one feature matrix is generated.

I printed some variables in compute_loss() in conformer_ctc/train.py. The first dimension of feature is 1, but the length of token_ids = graph_compiler.texts_to_ids(texts) is 2, so decode_forward() fails.
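One possible workaround (a sketch using standard lhotse APIs with hypothetical paths, not a confirmed fix from the maintainers) is to split each long cut into per-supervision cuts, so that every cut carries exactly one transcript and one feature matrix:

from lhotse import load_manifest

cuts = load_manifest("data/manifests/cuts_train.jsonl.gz")
# One supervision per resulting cut: [(0, 5, "I'm good"), (10, 15, "ok")]
# becomes two cuts of 5 s each, each with a single transcript.
cuts = cuts.trim_to_supervisions()
cuts.to_file("data/manifests/cuts_train_trimmed.jsonl.gz")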

Error in train conformer

Following the instructions for training the conformer ctc model, I get the following error:

transformer.py", line 706, in forward x = x * self.xscale + self.pe[:, : x.size(1), :] IndexError: too many indices for tensor of dimension 2

I am currently running the simplest command:
./conformer_ctc/train.py --world-size 1

Any idea of what the problem could be?

TypeError: object of type 'SingleCutSampler' has no len()

Hi, when I run the decoding step of tdnn-lstm-ctc, I meet the following error:
-> def decode_dataset(
(Pdb) n

icefall/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py(300)decode_dataset()
-> results = []
(Pdb)
icefall/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py(302)decode_dataset()
-> num_cuts = 0
(Pdb)
icefall/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py(304)decode_dataset()
-> try:
(Pdb)
icefall/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py(305)decode_dataset()
-> num_batches = len(dl)
(Pdb)
TypeError: object of type 'SingleCutSampler' has no len()

Is this a problem with the version I used?

k2 version: 1.8

logging not working for ddp training

It seems that the logging module is not working when training with DDP for some particular PyTorch versions (I tried three versions; it works for torch 1.7.2, but not for 1.8.0 or 1.8.1).

I searched for a while and did not figure out how to fix it.

FYI.
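One workaround that is sometimes suggested for this class of problem (not a confirmed fix for this particular issue): if another library has already attached a handler to the root logger, a later plain logging.basicConfig() silently does nothing, so re-configuring logging inside each DDP worker with force=True (Python 3.8+) replaces those handlers:

import logging

def setup_worker_logging(rank: int) -> None:
    # force=True removes handlers installed earlier (e.g. by other libraries)
    # before installing this configuration, so log messages show up again.
    logging.basicConfig(
        level=logging.INFO,
        format=f"%(asctime)s rank{rank} %(levelname)s %(message)s",
        force=True,
    )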

lhotse download librispeech command gives an error

I installed k2-fsa and lhotse via the commands below:

1. install k2-fsa
$ conda create -n k2-fsa20210823 python=3.8
$ conda activate k2-fsa20210823
$ conda install -c k2-fsa -c pytorch -c conda-forge k2 python=3.8 cudatoolkit=11.1 pytorch=1.8.1

2. install lhotse
$ pip install git+https://github.com/lhotse-speech/lhotse
3. install icefall
$ git clone https://github.com/k2-fsa/icefall.git
$ cd icefall
$ pip install -r requirements.txt
$ export PYTHONPATH=/home/maduo/w2021/k2-fsa_20210823/icefall:$PYTHONPATH
                                                                               

When I run ./prepare.sh --stage 0 --stop_stage 0, the error is as follows:

(k2-fsa20210823) maduo@pd:~/w2021/k2-fsa_20210823/icefall/egs/librispeech/ASR$ ./prepare.sh --stage 0 --stop_stage 0
2021-08-23 14:32:41 (prepare.sh:57:main) dl_dir: /mnt/4T/md/icefall_recipes/librispeech/download
2021-08-23 14:32:41 (prepare.sh:66:main) stage 0: Download data
./prepare.sh: /home/maduo/miniconda3/envs/k2-fsa20210823/bin/lhotse: python: bad interpreter: No such file or directory

I checked https://github.com/lhotse-speech/lhotse/blob/2a1410bfd08bc5117d67d09f470fde14b8231521/lhotse/bin/lhotse#L1
and the Python interpreter line there looks fine. I don't know where it is going wrong.

AttributeError: Can't get attribute 'RaggedInt' on <module '_k2'

Hi, I finished the training of TDNN-LSTM-CTC. When I run the decoding step, I get an error:

2021-09-28 14:56:23,027 INFO [decode.py:349] Decoding started
2021-09-28 14:56:23,048 INFO [decode.py:350] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': 'whole-lattice-rescoring', 'num_paths': 30, 'epoch': 19, 'avg': 5, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-09-28 14:56:23,854 INFO [lexicon.py:113] Loading pre-compiled data/lang_phone/Linv.pt
2021-09-28 14:56:25,177 INFO [decode.py:359] device: cuda:0
Traceback (most recent call last):
File "./tdnn_lstm_ctc/decode.py", line 457, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "./tdnn_lstm_ctc/decode.py", line 362, in main
torch.load(f"{params.lang_dir}/HLG.pt", map_location="cpu")
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
AttributeError: Can't get attribute 'RaggedInt' on <module '_k2' from '/opt/conda/lib/python3.8/site-packages/k2-1.8.dev20210918+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so'>

PS:

python3 -m k2.version
Collecting environment information...

k2 version: 1.8
Build type: Release
Git SHA1: 8030001c9a002aa17e090a41de3f1146bdfe1e78
Git date: Fri Sep 17 05:42:56 2021
Cuda used to build k2: 11.0
cuDNN used to build k2: 8.0.4
Python version used to build k2: 3.8
OS used to build k2:
CMake version: 3.18.0
GCC version: 7.5.0
CMAKE_CUDA_FLAGS: --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.7.1
PyTorch is using Cuda: 11.0
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False

I need some help, thanks.

Exception when building HLG with large language model

It seems that an overflow occurs when building HLG with a large language model.
The ARPA language model info:

\data\
ngram 1=136920
ngram 2=30620323
ngram 3=58191282
ngram 4=63623739

After successfully running determinize(compose(L, G)), an exception occurs in remove_epsilon on LG.
From the exception traceback, it looks like something overflows when casting impl_->byte_offset into int64_t?

2021-11-25 16:23:09,208 INFO [compile_hlg.py:173] Processing data/lang_phone
2021-11-25 16:23:09,479 INFO [lexicon.py:177] Loading pre-compiled data/lang_phone/Linv.pt
2021-11-25 16:23:09,900 INFO [compile_hlg.py:79] Building ctc_topo. max_token_id: 218
2021-11-25 16:23:09,911 INFO [compile_hlg.py:86] Loading L_disambig.fst.txt
2021-11-25 16:23:12,274 INFO [compile_hlg.py:91] Loading G.fst.txt
2021-11-25 16:34:12,174 INFO [compile_hlg.py:110] Intersecting L and G
2021-11-25 17:04:55,417 INFO [compile_hlg.py:112] LG shape: (552371407, None)
2021-11-25 17:04:55,417 INFO [compile_hlg.py:114] Connecting LG
2021-11-25 17:04:55,417 INFO [compile_hlg.py:116] LG shape after k2.connect: (552371407, None)
2021-11-25 17:04:55,418 INFO [compile_hlg.py:118] <class 'torch.Tensor'>
2021-11-25 17:04:55,418 INFO [compile_hlg.py:119] Determinizing LG
2021-11-25 17:51:51,788 INFO [compile_hlg.py:122] <class '_k2.ragged.RaggedTensor'>
2021-11-25 17:51:51,788 INFO [compile_hlg.py:124] Connecting LG after k2.determinize
2021-11-25 17:51:51,788 INFO [compile_hlg.py:127] Removing disambiguation symbols on LG
[F]/k2/k2/k2/csrc/tensor.cu:159:k2::Tensor::Tensor(k2::Dtype, const k2::Shape&, k2::RegionPtr, int32_t) Check failed: int64_t(impl_->byte_offset) + begin_elem * element_size >= 0 (-1563030948 vs. 0) impl_->byte_offset: -1563030948, begin_elem: 0, element_size: 4


[ Stack-Trace: ]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2_log.so(k2::internal::GetStackTrace()+0x4f) [0x7f57d74b67af]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::Tensor::Tensor(k2::Dtype, k2::Shape const&, std::share
d_ptr<k2::Region>, int)+0x91a) [0x7f57d7a2f50a]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::Array2<int>::Col(int)+0x13a) [0x7f57d79d9dea]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(+0x29f869) [0x7f57d79cd869]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::Index(k2::RaggedShape&, int, k2::Array1<int> const&, k
2::Array1<int>*)+0x1da) [0x7f57d79cfc2a]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(+0x2e6655) [0x7f57d7a14655]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::ComputeEpsilonClosureOneIter(k2::Ragged<k2::Arc>&, k2:
:Ragged<k2::Arc>*, k2::Ragged<int>*)+0xdc8) [0x7f57d7a19a38]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::ComputeEpsilonClosure(k2::Ragged<k2::Arc>&, k2::Ragged
<k2::Arc>*, k2::Ragged<int>*)+0x109) [0x7f57d7a1aac9]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::RemoveEpsilonDevice(k2::Ragged<k2::Arc>&, k2::Ragged<k
2::Arc>*, k2::Ragged<int>*)+0x269) [0x7f57d7a1c3f9]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so(k2::RemoveEpsilonDevice(k2::Ragged<k2::Arc>&, k2::Ragged<k
2::Arc>*, k2::Ragged<int>*)+0x1949) [0x7f57d7a1dad9]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x76141) [0x7f57d8bb1141]
/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x25273) [0x7f57d8b60273]
python3(PyCFunction_Call+0x54) [0x5585214a2c64]
python3(_PyObject_MakeTpCall+0x31e) [0x5585214ac94e]
python3(_PyEval_EvalFrameDefault+0x540f) [0x55852153cf0f]
python3(_PyFunction_Vectorcall+0x1a6) [0x55852151cc46]
python3(_PyEval_EvalFrameDefault+0x4dd3) [0x55852153c8d3]
python3(_PyEval_EvalCodeWithName+0x2c3) [0x55852151ba33]
python3(_PyFunction_Vectorcall+0x378) [0x55852151ce18]
python3(_PyEval_EvalFrameDefault+0x947) [0x558521538447]
python3(_PyFunction_Vectorcall+0x1a6) [0x55852151cc46]
python3(_PyEval_EvalFrameDefault+0x947) [0x558521538447]
python3(_PyEval_EvalCodeWithName+0x2c3) [0x55852151ba33]
python3(PyEval_EvalCodeEx+0x39) [0x55852151ca99]
python3(PyEval_EvalCode+0x1b) [0x5585215c51db]
python3(+0x24f273) [0x5585215c5273]
python3(+0x26cea3) [0x5585215e2ea3]
python3(+0x272582) [0x5585215e8582]
python3(PyRun_SimpleFileExFlags+0x1b2) [0x5585215e8762]
python3(Py_RunMain+0x36d) [0x5585215e8cdd]
python3(Py_BytesMain+0x39) [0x5585215e8e99]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f58b1835840]
python3(+0x1dcd1d) [0x558521552d1d]

Traceback (most recent call last):
  File "./local/compile_hlg.py", line 187, in <module>
    main()
  File "./local/compile_hlg.py", line 175, in main
    HLG = compile_HLG(lang_dir, lm_dir, args.oov)
  File "./local/compile_hlg.py", line 137, in compile_HLG
    LG = k2.remove_epsilon(LG)
  File "/miniconda3/envs/k2/lib/python3.8/site-packages/k2-1.10.dev20211111+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/k2/fsa_algo.py", line 554, in remove_epsilon
    ragged_arc, arc_map = _k2.remove_epsilon(fsa.arcs, fsa.properties)
RuntimeError:
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new

The code for compiling HLG:

def compile_HLG(lang_dir: str,
                lm_dir: str,
                oov: str = "<UNK>") -> k2.Fsa:
    lexicon = Lexicon(lang_dir)
    max_token_id = max(lexicon.tokens)
    logging.info(f"Building ctc_topo. max_token_id: {max_token_id}")
    H = k2.ctc_topo(max_token_id, modified=False)
    
    if Path(lang_dir / "L_disambig.pt").is_file():
        logging.info("Loading L_disambig")
        L = k2.Fsa.from_dict(torch.load(f"{lang_dir}/L_disambig.pt"))
    else:
        logging.info("Loading L_disambig.fst.txt")
        with open(lang_dir / "L_disambig.fst.txt") as f:
            L = k2.Fsa.from_openfst(f.read(), acceptor=False)
            torch.save(L.as_dict(), f"{lang_dir}/L_disambig.pt")

    logging.info("Loading G.fst.txt")
    with open(lm_dir / "G_4gram.kndiscount.fst") as f:
        G = k2.Fsa.from_openfst(f.read(), acceptor=False)

    first_token_disambig_id = lexicon.token_table["#0"]
    first_word_disambig_id = lexicon.word_table["#0"]

    # remove oov word symbol.
    #if isinstance(G.aux_labels, k2.RaggedTensor):
    #    G.aux_labels.values[G.aux_labels.values == lexicon.word_table[oov]] = 0
    #else:
    #    G.aux_labels[G.aux_labels == lexicon.word_table[oov]] = 0
    #G.__dict__["_properties"] = None

    L = k2.arc_sort(L)
    G = k2.arc_sort(G)

    #G = k2.determinize(G)

    logging.info("Intersecting L and G")
    LG = k2.compose(L, G)
    logging.info(f"LG shape: {LG.shape}")

    logging.info("Connecting LG")
    LG = k2.connect(LG)
    logging.info(f"LG shape after k2.connect: {LG.shape}")

    logging.info(type(LG.aux_labels))
    logging.info("Determinizing LG")

    LG = k2.determinize(LG)
    logging.info(type(LG.aux_labels))

    logging.info("Connecting LG after k2.determinize")
    LG = k2.connect(LG)

    logging.info("Removing disambiguation symbols on LG")

    LG.labels[LG.labels >= first_token_disambig_id] = 0
    # See https://github.com/k2-fsa/k2/issues/874
    # for why we need to set LG.properties to None
    LG.__dict__["_properties"] = None

    assert isinstance(LG.aux_labels, k2.RaggedTensor)
    LG.aux_labels.values[LG.aux_labels.values >= first_word_disambig_id] = 0

    LG = k2.remove_epsilon(LG)
    logging.info(f"LG shape after k2.remove_epsilon: {LG.shape}")

    LG = k2.connect(LG)
    LG.aux_labels = LG.aux_labels.remove_values_eq(0)

    logging.info("Arc sorting LG")
    LG = k2.arc_sort(LG)

    logging.info("Composing H and LG")
    # CAUTION: The name of the inner_labels is fixed
    # to `tokens`. If you want to change it, please
    # also change other places in icefall that are using
    # it.

    HLG = k2.compose(H, LG, inner_labels="tokens")

    logging.info("Connecting HLG")
    HLG = k2.connect(HLG)

    logging.info("Arc sorting HLG")
    HLG = k2.arc_sort(HLG)
    logging.info(f"HLG.shape: {HLG.shape}")

    return HLG

Design thoughts

Guys (mostly Piotr but also anyone who's listening),

Firstly, on the timeline: we need something working well in time for September 1st-ish, when we give the tutorial.
So we can't be too ambitious: think a cleaned-up and reorganized version of Snowfall, working by early-to-mid August. Sorry I have delayed this for so long. Liyong is working on replicating ESPNet results with k2 mechanisms; he is making good progress, and we may want to incorporate parts of that.

I want to avoid big centralized APIs at the moment.

I also want to avoid the phenomenon in SpeechBrain and ESPNet where there is a kind of "configuration layer" where
you pass in configs, and these get parsed into actual python code by some other code. I would rather keep it all
Python code. Suppose we have a directory (this doesn't have to be the real name):
egs/librispeech/ASR/
then I am thinking we can have subdirectories of that where the scripts for different versions of experiments live.
We might have some data-prep scripts:
egs/librispeech/ASR/{prepare.sh,local/blahblah,...}
and these would write to some subdirectory, e.g. egs/librispeech/ASR/data/...
Then for different experiments we'd have the scripts in subdirectories, like:
egs/librispeech/ASR/tdnn_lstm_ctc/{model.py,train.py,decode.py,README.md}
and we might have
egs/librispeech/ASR/conformer_mmi/{model.py,train.py,decode.py,README.md}
that would refer to the alignment model in e.g. ../tdnn_lstm_ctc/8.pt, and to the data in ../data/blah...

The basic idea here is that if you want to change the experiment locally, you would copy-and-modify the scripts in conformer_mmi to e.g. conformer_mmi_1a/, and add them to your git repo if wanted. We would avoid overloading the scripts in these experiment directories with command-line options. Any back-compatibility would be at the level of the icefall Python libraries themselves. We could perhaps introduce versions of the data directories as well, e.g. data/, data2/ and so on (not sure whether it would make sense to have multiple versions of the data-prep scripts or use options).

In order to avoid overloading the model code, and utility code, with excessive back-compatibility, I suggest that we have versions of the model code and maybe even parts of the other libraries: e.g. snowfall/models1/. Then we can add options etc., but when it becomes oppressive we can just copy-and-modify to models2/ and strip out most of the options. This will tend to reduce "cyclomatic complexity" by keeping any given version of the code simple. At this point, let's think of this to some extent as a demo tool for k2 and lhotse, we don't have to think of it as some vast toolkit with zillions of features.

Error in Yes No recipe probably due to installation

Hi, I installed k2 from source and lhotse via pip. To check whether my k2 and lhotse installation is OK, I am trying to run the yesno recipe. I did not change anything in the scripts; however, while running the yesno recipe, I get an error (RuntimeError: invalid device function). I get the same error in the librispeech recipe and in a recipe that I wrote. It seems to be due to the installation, probably the nvcc version; can anybody help me with this error? My log with environment information is as follows:

# Running on r7n04
# Started at Sun Oct 31 20:00:38 EDT 2021
# /home/hltcoe/aarora/miniconda3/envs/k2_scratch2/bin/python3 ./tdnn/train.py
2021-10-31 20:00:40,299 INFO [train.py:481] Training started
2021-10-31 20:00:40,299 INFO [train.py:482] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lr': 0.01, 'feature_dim': 23, 'weight_decay': 1e-06, 'start_epoch': 0, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 20, 'valid_interval': 10, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30.0, 'bucketing_sampler': False, 'num_buckets': 10, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'env_info': {'k2-version': '1.9', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7178d67e594bc7fa89c2b331ad7bd1c62a6a9eb4', 'k2-git-date': 'Tue Oct 26 10:12:54 2021', 'lhotse-version': '0.11.0.dev+git.7f56dd1.clean', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'coe_asr2', 'icefall-git-sha1': 'e06baf3-clean', 'icefall-git-date': 'Sun Oct 31 19:53:21 2021', 'icefall-path': '/exp/aarora/icefall_work_env/icefall', 'k2-path': '/exp/aarora/icefall_work_env/k2_me/k2/python/k2/__init__.py', 'lhotse-path': '/exp/aarora/icefall_work_env/lhotse/lhotse/__init__.py'}}
2021-10-31 20:00:40,326 INFO [lexicon.py:176] Loading pre-compiled data/lang_phone/Linv.pt
2021-10-31 20:00:43,731 INFO [asr_datamodule.py:145] About to get train cuts
2021-10-31 20:00:43,732 INFO [asr_datamodule.py:242] About to get train cuts
2021-10-31 20:00:43,758 INFO [asr_datamodule.py:148] About to create train dataset
2021-10-31 20:00:43,758 INFO [asr_datamodule.py:199] Using SingleCutSampler.
2021-10-31 20:00:43,760 INFO [asr_datamodule.py:205] About to create train dataloader
2021-10-31 20:00:43,760 INFO [asr_datamodule.py:218] About to get test cuts
2021-10-31 20:00:43,761 INFO [asr_datamodule.py:248] About to get test cuts
Traceback (most recent call last):
  File "./tdnn/train.py", line 573, in <module>
    main()
  File "./tdnn/train.py", line 569, in main
    run(rank=0, world_size=1, args=args)
  File "./tdnn/train.py", line 534, in run
    train_one_epoch(
  File "./tdnn/train.py", line 404, in train_one_epoch
    loss, loss_info = compute_loss(
  File "./tdnn/train.py", line 300, in compute_loss
    decoding_graph = graph_compiler.compile(texts)
  File "/exp/aarora/icefall_work_env/icefall/icefall/graph_compiler.py", line 74, in compile
    transcript_fsa = self.convert_transcript_to_fsa(texts)
  File "/exp/aarora/icefall_work_env/icefall/icefall/graph_compiler.py", line 116, in convert_transcript_to_fsa
    word_fsa = k2.linear_fsa(word_ids_list, self.device)
  File "/exp/aarora/icefall_work_env/k2_me/k2/python/k2/fsa_algo.py", line 66, in linear_fsa
    ragged_arc = _k2.linear_fsa(labels, device)
**RuntimeError: invalid device function**
# Accounting: time=7 threads=1
# Finished at Sun Oct 31 20:00:45 EDT 2021 with status 1
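
"invalid device function" is typically a GPU-architecture mismatch, i.e. a CUDA extension such as k2 was built without the compute capability of the card it runs on, or against a different CUDA/PyTorch combination than the one installed. One way to check for this kind of mismatch is a small diagnostic along these lines (a sketch, assuming a CUDA-enabled PyTorch install):

# Diagnostic sketch: compare the installed versions and GPU architectures.
import torch
import k2

print("torch version      :", torch.__version__)
print("torch built w/ CUDA:", torch.version.cuda)
print("k2 version         :", k2.__version__)
print("GPU capability     :", torch.cuda.get_device_capability(0))  # e.g. (7, 0) for V100
print("torch arch list    :", torch.cuda.get_arch_list())           # sm_* targets baked into torch

If the GPU's compute capability is not covered by the architectures that k2 was compiled for, rebuilding k2 from source on the target machine usually resolves this error.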

using TPUs in Colab ??

I'm trying to train on the Zeroth Korean dataset with tdnn_lstm_ctc and conformer_ctc. Unfortunately, I think I will suffer from GPU memory shortage for a while until I update the GPUs in my data center.

Do you think I can try these icefall recipes (training) on Google Colab with a multi-TPU environment? It looks like it would need slight code modification.

By the way, I really appreciate this icefall open-source work. It is much better to use than Snowfall was. Thank you all!

Duplicated token seqs are used for rescoring

While implementing rescoring with a conformer LM, I found that there are duplicated token sequences.

The reason is that the following code

icefall/icefall/decode.py

Lines 222 to 237 in 810b193

if isinstance(lattice.aux_labels, torch.Tensor):
    word_seq = k2.ragged.index(lattice.aux_labels, path)
else:
    word_seq = lattice.aux_labels.index(path)
    word_seq = word_seq.remove_axis(word_seq.num_axes - 2)

# Each utterance has `num_paths` paths but some of them transduces
# to the same word sequence, so we need to remove repeated word
# sequences within an utterance. After removing repeats, each utterance
# contains different number of paths
#
# `new2old` is a 1-D torch.Tensor mapping from the output path index
# to the input path index.
_, _, new2old = word_seq.unique(
    need_num_repeats=False, need_new2old_indexes=True
)

does not remove 0s from word_seq.

Previous versions remove 0s from word_seq, see

icefall/icefall/decode.py

Lines 218 to 227 in abadc71

# word_seq is a k2.RaggedTensor sharing the same shape as `path`
# but it contains word IDs. Note that it also contains 0s and -1s.
# The last entry in each sublist is -1.
if isinstance(lattice.aux_labels, torch.Tensor):
    word_seq = k2.ragged.index(lattice.aux_labels, path)
else:
    word_seq = lattice.aux_labels.index(path, remove_axis=True)

# Remove 0 (epsilon) and -1 from word_seq
word_seq = word_seq.remove_values_leq(0)

It does not affect the final WER, but it incurs extra unnecessary computation.
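
A minimal sketch of a possible fix, following the earlier version of the code above: drop the 0s (epsilons) and -1s from word_seq before calling unique(), so that paths differing only in epsilon placement collapse to a single entry (the variable names are the ones from the snippets above):

# Sketch only: remove 0 (epsilon) and -1 from word_seq before deduplication,
# as the earlier version of decode.py did, then deduplicate as before.
word_seq = word_seq.remove_values_leq(0)

_, _, new2old = word_seq.unique(
    need_num_repeats=False, need_new2old_indexes=True
)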

Version conflicts for k2 & torchaudio & torch

Hi,

When I want to try rnn-t, I have to upgrade torchaudio to at least 0.10.0 for the rnn-t loss. However, k2 requires torch 1.8.1. When I upgrade torch to 1.10.0 and torchaudio to 0.10.0, I get this error:

>>> import torch
>>> import torchaudio
>>> import k2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/miniconda3/lib/python3.8/site-packages/k2/__init__.py", line 2, in <module>
    from _k2 import DeterminizeWeightPushingType
ImportError: /path/to/miniconda3/lib/python3.8/site-packages/libk2context.so: undefined symbol: _ZNK2at6Tensor7is_cudaEv

Can you share versions of packages in the environment?

Note: I also checked conda search and couldn't find a corresponding k2 version for torch 1.10.0.

...
k2                   1.11.dev20211209 cuda11.1_py3.7_torch1.8.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.7_torch1.8.1  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.7_torch1.9.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.7_torch1.9.1  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.8_torch1.8.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.8_torch1.8.1  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.8_torch1.9.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.8_torch1.9.1  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.9_torch1.8.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.9_torch1.8.1  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.9_torch1.9.0  k2-fsa              
k2                   1.11.dev20211209 cuda11.1_py3.9_torch1.9.1  k2-fsa
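
For what it's worth, a minimal sketch for collecting the versions that have to agree with each other (only the standard __version__ attributes are used):

# Print the versions of torch, torchaudio, k2 and lhotse in the current env.
import torch
import torchaudio
import k2
import lhotse

print("torch      :", torch.__version__)
print("torchaudio :", torchaudio.__version__)
print("k2         :", k2.__version__)
print("lhotse     :", lhotse.__version__)

Note that importing k2 is exactly what fails above when the torch ABI does not match, so this only works once a consistent set of wheels is installed.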

Decoding error 'Fsa' object doesn't support assignment.

Hi, I'm experiencing this error while decoding LibriSpeech with the Conformer model.

./conformer_ctc/decode.py --exp-dir conformer_ctc/exp_500_att0.8 \
                          --lang-dir data/lang_bpe_500 \
                          --max-duration 30 \
                          --concatenate-cuts 0 \
                          --bucketing-sampler true \
                          --num-paths 1000 \
                          --epoch 5 \
                          --avg 1 \
                          --method attention-decoder \
                          --nbest-scale 0.5
2021-11-26 19:08:00,025 INFO [decode.py:549] Decoding started
2021-11-26 19:08:00,026 INFO [decode.py:550] {
    'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 
    'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 
    'max_active_states': 10000, 'use_double_scores': True, 
    'env_info': {'k2-version': '1.10', 'k2-build-type': 'Release', 'k2-with-cuda': True,
                 'k2-git-sha1': 'fd5565d32ffa8274ff9700453b1e543f34343ed1', 'k2-git-date': 'Wed Nov 10 08:31:12 2021',
                 'lhotse-version': '0.12.0.dev+git.d5e7815.dirty', 'torch-cuda-available': True, 'torch-cuda-version': '11.2',
                 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'b945223-dirty',
                 'icefall-git-date': 'Thu Nov 11 03:26:02 2021', 'icefall-path': '/home/audiodan/asr/icefall',
                 'k2-path': '/home/audiodan/anaconda3/lib/python3.8/site-packages/k2-1.10.dev20211112+cuda11.2.torch1.10.0a0-py3.8-linux-x86_64.egg/k2/__init__.py',
                 'lhotse-path': '/home/audiodan/asr/lhotse/lhotse/init.py'},
    'epoch': 5, 'avg': 1, 'method': 'attention-decoder', 'num_paths': 1000, 'nbest_scale': 0.5, 'export': False,
    'exp_dir': PosixPath('conformer_ctc/exp_500_att0.8'), 'lang_dir': PosixPath('data/lang_bpe_500'), 
    'lm_dir': PosixPath('data/lm'), 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30, 
    'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0,
    'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 10
}
2021-11-26 19:08:00,218 INFO [lexicon.py:176] Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-11-26 19:08:00,253 INFO [decode.py:560] device: cpu
2021-11-26 19:08:01,984 INFO [decode.py:597] Loading G_4_gram.fst.txt
2021-11-26 19:08:01,984 WARNING [decode.py:598] It may take 8 minutes.
Traceback (most recent call last):
  File "./conformer_ctc/decode.py", line 704, in <module>
    main()
  File "/home/audiodan/anaconda3/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "./conformer_ctc/decode.py", line 615, in main
    G["dummy"] = 1
TypeError: 'Fsa' object does not support item assignment

dataloader.dataset.cuts()

Hello,

I trained a tdnn-lstm model successfully. However when I want to continue with decoding, this error occurs:

$ python tdnn_lstm_ctc/decode.py
2021-08-19 09:24:07,705 INFO [decode.py:319] Decoding started
2021-08-19 09:24:07,705 INFO [decode.py:320] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp2'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 20, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'method': '1best', 'num_paths': 30, 'epoch': 9, 'avg': 5, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 500.0, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': True, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'full_libri': False}
2021-08-19 09:24:07,974 INFO [lexicon.py:96] Loading pre-compiled data/lang_phone/Linv.pt
2021-08-19 09:24:08,077 INFO [decode.py:329] device: cuda:0
2021-08-19 09:24:20,422 INFO [decode.py:387] averaging ['tdnn_lstm_ctc/exp2/epoch-5.pt', 'tdnn_lstm_ctc/exp2/epoch-6.pt', 'tdnn_lstm_ctc/exp2/epoch-7.pt', 'tdnn_lstm_ctc/exp2/epoch-8.pt', 'tdnn_lstm_ctc/exp2/epoch-9.pt']
Traceback (most recent call last):
  File "tdnn_lstm_ctc/decode.py", line 419, in <module>
    main()
  File "path/to/env/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "tdnn_lstm_ctc/decode.py", line 402, in main
    results_dict = decode_dataset(
  File "tdnn_lstm_ctc/decode.py", line 239, in decode_dataset
    tot_num_cuts = len(dl.dataset.cuts)
AttributeError: 'K2SpeechRecognitionDataset' object has no attribute 'cuts'

The offending line is: tot_num_cuts = len(dl.dataset.cuts)

I checked the source code and there is no .cuts attribute/function. I think the number of cuts should equal the number of samples when each segment is the whole audio file. So, can we remove .cuts, so that tot_num_cuts = len(dl.dataset) would be OK for LibriSpeech?
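
A minimal sketch of the suggested change (hypothetical; it just illustrates the idea from the question above):

# K2SpeechRecognitionDataset has no `.cuts` attribute, so count cuts via the
# dataset length instead; fall back to -1 if no length is available.
try:
    tot_num_cuts = len(dl.dataset)
except TypeError:
    tot_num_cuts = -1  # unknown; progress logging can simply omit the total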

A question about the format of the loaded data

Hi, I have been learning icefall and lhotse recently, and I am mainly interested in how the training and testing data are processed. Lhotse provides standard data preparation recipes for commonly used corpora, which is very good. I see that lhotse stores the data in JSON format (*.json or *.jsonl.gz), and that it also supports converting Kaldi files to lhotse manifests. The questions I want to ask are:
1. Why was JSON (*.json or *.jsonl.gz) chosen to represent the data?
2. Why not use the Kaldi file format (wav.scp, text, utt2dur, ...)? I think its meaning is very easy to understand.
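
For context, a small illustration of what working with the JSON/JSONL manifests looks like (a sketch; the path is an example, not a fixed name):

# Load a lhotse cut manifest produced by the data-prep recipes and inspect one
# cut; a single cut bundles the recording reference, duration and supervision
# text that Kaldi keeps in separate files (wav.scp, text, utt2dur, ...).
from lhotse import load_manifest

cuts = load_manifest("data/fbank/cuts_train.jsonl.gz")  # example path
cut = next(iter(cuts))
print(cut.id, cut.duration, cut.supervisions[0].text)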

How can I install `k2` with version `1.9.dev20211101`.

Thanks for the great work, before I ask my question.
Here is my problem: I want to use icefall for the stateless transducer experiment; it requires torchaudio 0.10.0, which means we need PyTorch 1.10.0.
But I can't find a matching wheel package on conda or pip, so I guess we need to install from source, but I don't know how to check out k2 version 1.9.dev20211101.

Hope someone can help me. Thanks.

How to decode fast on cpu?

Hi,

Recently, I ran the librispeech experiment in icefall.
I found that decoding on CPU takes a long time. I know decoding on GPU is fast,
but I need to decode on CPU. So the questions I want to ask are:

  • Is there a way to make it run faster on CPU?
  • How long did it take you to decode on CPU?

Here is the screenshot of the decoding log. It took about 10 hours.
[screenshot of the decoding log]

thanks!
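
Two knobs that often matter for CPU decoding speed are the number of PyTorch threads and the size of the decoding search space. A hedged sketch (the values are examples, not recommendations from this thread):

# Align PyTorch threading with the machine before decoding.
import torch

torch.set_num_threads(8)           # intra-op threads, e.g. number of physical cores
torch.set_num_interop_threads(1)   # avoid oversubscription between ops

# In addition, smaller search_beam / output_beam / max_active_states values and a
# larger --max-duration typically trade a little accuracy for a large speedup.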

On the fly decoding support

Hi, I'm new here. First, thanks for this project. Though early, it looks awesome.
I went through the current recipes and found that they are all based on the CTC topology. I wonder whether the current version supports on-the-fly decoding with an AED model? Specifically, I mean an AED model outputting phoneme-based scores and a WFST-based LG model dynamically computing the corresponding word-based scores.

Exception when rescoring with attention decoder

After successfully decoding the test dataset with the 1best decoder, I decided to try the attention decoder to get a better result.
But it seems that something unexpected occurred, and I have no idea how to debug the cause of this exception.
Can anyone make suggestions?
Exception information:

Traceback (most recent call last):
  File "conformer/decode_mmi.py", line 485, in <module>
    main()
  File "/cfs/sge/miniconda3/envs/k2-debug/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "conformer/decode_mmi.py", line 469, in main
    results_dict = decode_dataset(
  File "conformer/decode_mmi.py", line 320, in decode_dataset
    hyps_dict, texts = decode_one_batch(
  File "conformer/decode_mmi.py", line 256, in decode_one_batch
    best_path_dict = rescore_with_attention_decoder(
  File "/cfs/sge/k2/icefall/icefall/decode.py", line 828, in rescore_with_attention_decoder
    nbest = nbest.intersect(lattice)
  File "/cfs/sge/k2/icefall/icefall/decode.py", line 333, in intersect
    path_lattice = _intersect_device(
  File "/cfs/sge/k2/icefall/icefall/decode.py", line 64, in _intersect_device
    return k2.cat(ans)
  File "/cfs/sge/k2/k2/k2/python/k2/ops.py", line 219, in cat
    out_fsa = Fsa(ans_ragged_arcs)
  File "/cfs/sge/k2/k2/k2/python/k2/fsa.py", line 229, in __init__
    _ = self.properties
  File "/cfs/sge/k2/k2/k2/python/k2/fsa.py", line 455, in properties
    raise ValueError(
ValueError: Fsa is not valid, properties are: 2 = "Nonempty", arcs are: [ [ [ 0 1 0 -2.23462 0 2 0 -7.25351 ] [ 1 3 0 -2.4075 1 4 0 -7.06948 ] [ 2 5 0 -2.4075 ] [ 3 6 0 -2.97694 3 7 0 -5.45353 ] ...

Mismatch between custom LabelSmoothing and PyTorch's Label Smoothing

The current LabelSmoothingLoss in icefall (see below)

class LabelSmoothingLoss(nn.Module):

is based on ESPnet, which seems to be based on The Annotated Transformer.

As @janvainer pointed out in #106, there is a built-in label smoothing loss in torch >= 1.10; see https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss

I just compared the implementation from PyTorch with the one in icefall and identified the following differences.

PyTorch's implementation follows the one described in the paper Attention Is All You Need

LabelSmoothing is proposed by the paper Rethinking the Inception Architecture for Computer Vision, which has the following formula:

    q'(k) = (1 - epsilon) * delta(k, y) + epsilon / K

where y is the target label, delta(k, y) is 1 if k == y and 0 otherwise, epsilon is the smoothing weight, and K is the number of classes.

First difference

icefall is using K - 1, not K. See the code below from icefall

true_dist.fill_(self.smoothing / (self.size - 1))

Second difference

Rethinking the Inception Architecture for Computer Vision uses cross-entropy to compute the loss. The formula from the paper is:

    H(q', p) = - sum_k q'(k) * log p(k) = (1 - epsilon) * H(q, p) + epsilon * H(u, p)

where p is the predicted distribution, q is the one-hot ground-truth distribution, and u is the uniform distribution over the K classes.

but icefall uses the following formula, i.e., KL-divergence:

    KL(q' || p) = sum_k q'(k) * log(q'(k) / p(k))


To match PyTorch's implementation (also the one used in the original transformer paper), we have to make the following changes:

(1) Change

true_dist.fill_(self.smoothing / (self.size - 1))

to

true_dist.fill_(self.smoothing / self.size)

Also, we need to add self.smoothing / self.size to the target positions in true_dist.

That is, change

true_dist.scatter_(1, target.unsqueeze(1), self.confidence)

to use scatter_add_

            true_dist.scatter_add_(
                1,
                target.unsqueeze(1),
                torch.full(true_dist.size(), fill_value=self.confidence).to(true_dist),
            )

(2) Change

kl = self.criterion(torch.log_softmax(x, dim=1), true_dist)

to

label_smoothing_loss = -1 * (torch.log_softmax(x, dim=1) * true_dist).sum(dim=1)

@danpovey Do you think we should make the above changes?
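
For reference, a small numerical check (a sketch, assuming torch >= 1.10) showing that the smoothing/size fill plus scatter_add_ described above reproduces PyTorch's built-in label smoothing:

# Compare F.cross_entropy(label_smoothing=...) with the manually smoothed
# distribution: smoothing/K everywhere plus (1 - smoothing) on the target class,
# combined with log_softmax.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, smoothing = 5, 0.1
x = torch.randn(4, num_classes)          # logits
target = torch.tensor([0, 2, 1, 4])

builtin = F.cross_entropy(x, target, label_smoothing=smoothing)

true_dist = torch.full_like(x, smoothing / num_classes)
true_dist.scatter_add_(
    1, target.unsqueeze(1), torch.full((x.size(0), 1), 1.0 - smoothing)
)
manual = -(torch.log_softmax(x, dim=1) * true_dist).sum(dim=1).mean()

print(builtin.item(), manual.item())  # should agree up to floating-point error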

Problem during preparing G

Hi, I am trying to train a tdnn-lstm model on Turkish data. I ran an experiment successfully once (including the decoding part) with a smaller language model. Then I tried a new corpus for language modeling. During the prepare-G step, I got this error:

2021-08-23 11:35:45 (prepare.sh:118:main) Stage 6: Compile HLG
2021-08-23 11:35:46,389 INFO [compile_hlg.py:126] Processing data/lang_phone
2021-08-23 11:35:46,859 INFO [lexicon.py:99] Converting L.pt to Linv.pt
2021-08-23 11:35:47,640 INFO [compile_hlg.py:48] Building ctc_topo. max_token_id: 52
2021-08-23 11:35:47,826 INFO [compile_hlg.py:57] Loading G_3_gram.fst.txt
2021-08-23 11:38:15,955 INFO [compile_hlg.py:68] Intersecting L and G
2021-08-23 11:51:37,889 INFO [compile_hlg.py:70] LG shape: (301909252, None)
2021-08-23 11:51:37,889 INFO [compile_hlg.py:72] Connecting LG
2021-08-23 11:51:37,889 INFO [compile_hlg.py:74] LG shape after k2.connect: (301909252, None)
2021-08-23 11:51:37,889 INFO [compile_hlg.py:76] <class 'torch.Tensor'>
2021-08-23 11:51:37,889 INFO [compile_hlg.py:77] Determinizing LG
2021-08-23 12:09:11,585 INFO [compile_hlg.py:80] <class '_k2.RaggedInt'>
2021-08-23 12:09:11,585 INFO [compile_hlg.py:82] Connecting LG after k2.determinize
2021-08-23 12:09:11,585 INFO [compile_hlg.py:85] Removing disambiguation symbols on LG
[F] /usr/share/miniconda/envs/k2/conda-bld/k2_1628135473078/work/k2/csrc/tensor.cu:159:k2::Tensor::Tensor(k2::Dtype, const k2::Shape&, k2::RegionPtr, int32_t) Check failed: int64_t(impl_->byte_offset) + begin_elem * element_size >= 0 (-1246502780 vs. 0) 


[ Stack-Trace: ]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/libk2_log.so(k2::internal::GetStackTrace()+0x4c) [0x7fb12e7c76bc]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/libk2context.so(k2::Tensor::Tensor(k2::Dtype, k2::Shape const&, std::shared_ptr<k2::Region>, int)+0x6da) [0x7fb12ed24aca]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/libk2context.so(k2::Array2<int>::Col(int)+0x13a) [0x7fb12ecd03ba]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/libk2context.so(+0x27e0d9) [0x7fb12ecc40d9]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/libk2context.so(k2::Index(k2::RaggedShape&, int, k2::Array1<int> const&, k2::Array1<int>*)+0x1da) [0x7fb12ecc649a]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xc5395) [0x7fb134ced395]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xabc90) [0x7fb134cd3c90]
/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x1dfcf) [0x7fb134c45fcf]
python3(PyCFunction_Call+0x54) [0x555e1b3b3914]
python3(_PyObject_MakeTpCall+0x31e) [0x555e1b3b6ebe]
python3(_PyEval_EvalFrameDefault+0x52f6) [0x555e1b458986]
python3(_PyFunction_Vectorcall+0x1a6) [0x555e1b43b646]
python3(_PyEval_EvalFrameDefault+0x947) [0x555e1b453fd7]
python3(_PyFunction_Vectorcall+0x1a6) [0x555e1b43b646]
python3(_PyEval_EvalFrameDefault+0x947) [0x555e1b453fd7]
python3(_PyEval_EvalCodeWithName+0x2c3) [0x555e1b43a433]
python3(_PyFunction_Vectorcall+0x378) [0x555e1b43b818]
python3(_PyEval_EvalFrameDefault+0x1822) [0x555e1b454eb2]
python3(_PyFunction_Vectorcall+0x1a6) [0x555e1b43b646]
python3(_PyEval_EvalFrameDefault+0x4d33) [0x555e1b4583c3]
python3(_PyFunction_Vectorcall+0x1a6) [0x555e1b43b646]
python3(_PyEval_EvalFrameDefault+0x947) [0x555e1b453fd7]
python3(_PyFunction_Vectorcall+0x1a6) [0x555e1b43b646]
python3(_PyEval_EvalFrameDefault+0x947) [0x555e1b453fd7]
python3(_PyEval_EvalCodeWithName+0x2c3) [0x555e1b43a433]
python3(PyEval_EvalCodeEx+0x39) [0x555e1b43b499]
python3(PyEval_EvalCode+0x1b) [0x555e1b4d6ecb]
python3(+0x252f63) [0x555e1b4d6f63]
python3(+0x26f033) [0x555e1b4f3033]
python3(+0x274022) [0x555e1b4f8022]
python3(PyRun_SimpleFileExFlags+0x1b2) [0x555e1b4f8202]
python3(Py_RunMain+0x36d) [0x555e1b4f877d]
python3(Py_BytesMain+0x39) [0x555e1b4f8939]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fb25ee04b97]
python3(+0x1e8f39) [0x555e1b46cf39]

Traceback (most recent call last):
  File "./local/compile_hlg.py", line 140, in <module>
    main()
  File "./local/compile_hlg.py", line 128, in main
    HLG = compile_HLG(lang_dir)
  File "./local/compile_hlg.py", line 92, in compile_HLG
    LG = k2.remove_epsilon(LG)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 562, in remove_epsilon
    out_fsa = k2.utils.fsa_from_unary_function_ragged(fsa, ragged_arc, arc_map,
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 515, in fsa_from_unary_function_ragged
    new_value = index(value, arc_map)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 335, in index
    return index_ragged(src, indexes)
  File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 283, in index_ragged
    return _k2.index(src, indexes)
RuntimeError: Some bad things happed.

How can I solve this? My ARPA files are 2.1 GB and 6.2 GB for the 3-gram and 4-gram models respectively. Could it be related to a size issue? My language models are prepared with KenLM.

Is this relevant to icefall, or should I ask in the k2 repository?

conda installation of k2 doesn't work

Hi, I am trying to install k2 within a conda env with the command you provide:

conda install -c k2-fsa -c pytorch -c conda-forge k2 python=3.8 cudatoolkit=11.1 pytorch=1.8.1
Unfortunately conda complains with:

PackagesNotFoundError: The following packages are not available from current channels:

  - k2

I am running on a Windows 10 machine

CUDA out of memory in decoding

Hi, I am new to icefall. I finished training tdnn_lstm_ctc, but when I run the decoding step I get the following error; I changed --max-duration, but there are still errors:

[screenshot of the CUDA out-of-memory error]

We set --max-duration=100 and use a Tesla V100-SXM; the GPU info follows:

[screenshot of the GPU info]

Would you give me some advice? Thanks!

Why Nbest.intersect takes the shortest path within rescore_with_attention_decoder

Hi, I'm new to the icefall tool. In the latest commit, I notice that the "intersect" function of the "Nbest" class calls k2.shortest_path, which returns the best path of the lattice. However, this seems to contradict "rescore_with_attention_decoder" in decode.py, which rescores multiple paths in the lattice. Can someone please explain why we're doing this?

one_best = k2.shortest_path(

nbest = nbest.intersect(lattice)

Some results about training TDNN_LSTM_CTC based on Single One GPU

I did some experiments with tdnn_lstm_ctc trained on just a single GPU.

Note: these results are just for reference, not final conclusions!

Case 1: When setting bucketing_sampler to False and using lr=0.001
[screenshot of the training loss curve]
The training log:
log-train-2021-10-09-20-17-49.txt
From the above curve and the training log, we can see that the model converges. If you want the model to converge better, you can train for more epochs.

Case 2: When setting bucketing_sampler to True and using lr=0.001
[screenshot of the training loss curve]
The training log:
log-train-2021-10-12-19-34-52.txt
From the above curve and the training log, we can see that the model does NOT converge. It seems that the large lr leads to this.

Case 3: When setting bucketing_sampler to True and using lr=0.00025 (a small lr)
[screenshot of the training loss curve]
The training log:
log-train-2021-10-13-17-16-27.txt
From the above curve and the training log, we can see that the model converges better. You could also try other small lr values.

So, according to all of the above results, if you run this script on a single GPU, I suggest training for more epochs when setting bucketing_sampler=False, and using a small lr (such as 0.00025) when setting bucketing_sampler=True.

Problem with valid loss functions

Since we merged a change to asr_datamodule.py (sorry, I don't have time to find the PR), our valid loss function for attention is very bad, which affects diagnostics (but not decoding).

The issue seems to be related to the reordering ("indices") that is done in encode_supervisions(); the supervision for the attention decoder is taken from there, but it looks like we are not properly taking the reordering into account. I have verified that the "indices" variable in encode_supervisions() always seems to be in order for train data, but for some reason not for valid. I won't be fixing this tonight, as it's late right now; we'll fix it tomorrow.

Making an issue in case anyone else notices the problem.
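
A toy illustration of the reordering in question (not icefall code; the names are just for the example): if the frame segments are sorted by an "indices" permutation, the texts used as attention-decoder targets have to be permuted the same way, otherwise targets and encoder outputs are misaligned.

import torch

# Three utterances with different frame counts and their transcripts.
num_frames = torch.tensor([80, 120, 95])
texts = ["first utt", "second utt", "third utt"]

# encode_supervisions()-style sort by decreasing number of frames.
indices = torch.argsort(num_frames, descending=True)

# The texts must follow the same permutation as the frame segments.
sorted_texts = [texts[i] for i in indices.tolist()]
print(sorted_texts)  # ['second utt', 'third utt', 'first utt']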

Get inf ctc loss during training

I'm using my own labeled data for training and get an inf CTC loss.

I debugged it and found that the batch that makes the CTC loss inf contains very short data: the minimum number of frames (before subsampling) is 11.

I tried generating a train manifest that filters out segments shorter than 1 s (similar to Kaldi), and that seems to be OK.

So maybe this very short data causes the inf loss?
If so, maybe this could be noted in some code or a tutorial? And what is the recommended minimum threshold for this value?
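
A minimal sketch of the filtering mentioned above, assuming lhotse manifests (the 1 s threshold is the value from this report, not an official recommendation):

# Drop cuts shorter than 1 second before feature extraction / training,
# similar to Kaldi's minimum-duration filtering. The paths are examples.
from lhotse import load_manifest

cuts = load_manifest("data/fbank/cuts_train.jsonl.gz")
cuts = cuts.filter(lambda c: c.duration >= 1.0)
cuts.to_file("data/fbank/cuts_train_filtered.jsonl.gz")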

Problem with max-duration during training

I'm using V100 GPUs with 16 GB memory and training with world-size = 4.

If I use max-duration = 50, an OOM error occurs after some batches of training.
If I use max-duration = 30, training finishes, but GPU usage is usually below 60%, which may mean longer training time.

What is the main contributor to GPU memory usage? Any advice?
