It turns out that it takes more time and memory to decode test-other. I switched to a new machine with 100 GB of CPU RAM, though the maximum RAM used in decoding test-other is about 58.16 GB.
Attached is the complete decoding log for test-clean and test-other.
log-decode-2021-10-26-10-33-20.txt
You can find the decoding times in the attached log:
- test-clean: 1 hour and 46 minutes
- test-other: 3 hours and 21 minutes
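These durations can be checked directly against the log timestamps, e.g. from the batch-0 line to the WER-summary line of each test set:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"

# test-clean: first batch at 10:37:36, WER summary at 12:23:47
clean = datetime.strptime("2021-10-26 12:23:47", fmt) - datetime.strptime(
    "2021-10-26 10:37:36", fmt
)
# test-other: first batch at 12:24:10, WER summary at 15:45:15
other = datetime.strptime("2021-10-26 15:45:15", fmt) - datetime.strptime(
    "2021-10-26 12:24:10", fmt
)
print(clean)  # 1:46:11
print(other)  # 3:21:05
```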
The memory usage for decoding test-clean is given below:
There is a peak (20.56 GB) at 10:36:00, which is caused by model averaging. Apart from that peak, decoding test-clean on CPU uses less than 18 GB of RAM on average.
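If you want to check the peak RAM usage on your own machine without cluster monitoring tools, the standard-library resource module reports the peak resident set size of the current process. This is a generic sketch, not part of the decoding script:

```python
import resource

# Peak resident set size of the current process so far.
# Note the units differ by platform: KiB on Linux, bytes on macOS.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024 / 1024:.2f} GB")
```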
The memory usage for test-other is shown below:
The memory consumption goes up between 14:25 and 14:39, which happens while decoding the batches between 400 and 500. See the log below:
2021-10-26 14:03:56,823 INFO [decode-500-vgg-att0.8.py:403] batch 400/757, cuts processed until now is 1549
2021-10-26 14:52:47,464 INFO [decode-500-vgg-att0.8.py:403] batch 500/757, cuts processed until now is 1946
Here is the screenshot of the decoding log. It took about 10 hours.
I cannot reproduce your results. By the way, I am using --max-duration 30, not 80.
Here is part of the decoding log extracted from the attached file for easier reference:
CUDA_VISIBLE_DEVICES= ./conformer_ctc/decode-500-vgg-att0.8.py \
--max-duration 30 --concatenate-cuts 0 --bucketing-sampler 1 \
--method attention-decoder --epoch 34 --avg 20
2021-10-26 10:33:20,093 INFO [decode-500-vgg-att0.8.py:465] Decoding started
2021-10-26 10:33:20,093 INFO [decode-500-vgg-att0.8.py:466] {'exp_dir': PosixPath('conformer_ctc/exp_500_att_0.8_vgg'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'subsampling_factor': 4, 'num_decoder_layers': 6, 'vgg_frontend': True, 'is_espnet_structure': True, 'mmi_loss': False, 'use_feat_batchnorm': True, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 34, 'avg': 20, 'method': 'attention-decoder', 'num_paths': 100, 'lattice_score_scale': 1.0, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-26 10:33:20,468 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-10-26 10:33:20,740 INFO [decode-500-vgg-att0.8.py:476] device: cpu
2021-10-26 10:33:28,599 INFO [decode-500-vgg-att0.8.py:519] Loading pre-compiled G_4_gram.pt
2021-10-26 10:35:48,692 INFO [decode-500-vgg-att0.8.py:558] averaging ['conformer_ctc/exp_500_att_0.8_vgg/epoch-15.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-16.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-17.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-18.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-19.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-20.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-21.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-22.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-23.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-24.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-25.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-26.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-27.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-28.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-29.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-30.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-31.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-32.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-33.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-34.pt']
2021-10-26 10:37:26,867 INFO [decode-500-vgg-att0.8.py:571] Number of model parameters: 102568040
2021-10-26 10:37:36,221 INFO [decode-500-vgg-att0.8.py:403] batch 0/787, cuts processed until now is 3
2021-10-26 10:51:13,182 INFO [decode-500-vgg-att0.8.py:403] batch 100/787, cuts processed until now is 348
2021-10-26 11:04:48,234 INFO [decode-500-vgg-att0.8.py:403] batch 200/787, cuts processed until now is 690
2021-10-26 11:18:20,976 INFO [decode-500-vgg-att0.8.py:403] batch 300/787, cuts processed until now is 1017
2021-10-26 11:31:43,197 INFO [decode-500-vgg-att0.8.py:403] batch 400/787, cuts processed until now is 1381
2021-10-26 11:44:47,576 INFO [decode-500-vgg-att0.8.py:403] batch 500/787, cuts processed until now is 1694
2021-10-26 11:57:59,318 INFO [decode-500-vgg-att0.8.py:403] batch 600/787, cuts processed until now is 2024
2021-10-26 12:11:09,544 INFO [decode-500-vgg-att0.8.py:403] batch 700/787, cuts processed until now is 2352
2021-10-26 12:23:47,024 INFO [decode-500-vgg-att0.8.py:452]
For test-clean, WER of different settings are:
ngram_lm_scale_1.0_attention_scale_1.1 2.6 best for test-clean
ngram_lm_scale_0.7_attention_scale_1.0 2.61
ngram_lm_scale_0.9_attention_scale_0.9 2.61
ngram_lm_scale_0.9_attention_scale_1.0 2.61
ngram_lm_scale_1.0_attention_scale_1.2 2.61
ngram_lm_scale_1.1_attention_scale_1.1 2.61
... ... ...
2021-10-26 12:24:10,000 INFO [decode-500-vgg-att0.8.py:403] batch 0/757, cuts processed until now is 5
2021-10-26 12:42:13,595 INFO [decode-500-vgg-att0.8.py:403] batch 100/757, cuts processed until now is 377
2021-10-26 13:22:32,513 INFO [decode-500-vgg-att0.8.py:403] batch 200/757, cuts processed until now is 792
2021-10-26 13:42:37,275 INFO [decode-500-vgg-att0.8.py:403] batch 300/757, cuts processed until now is 1180
2021-10-26 14:03:56,823 INFO [decode-500-vgg-att0.8.py:403] batch 400/757, cuts processed until now is 1549
2021-10-26 14:52:47,464 INFO [decode-500-vgg-att0.8.py:403] batch 500/757, cuts processed until now is 1946
2021-10-26 15:11:51,084 INFO [decode-500-vgg-att0.8.py:403] batch 600/757, cuts processed until now is 2342
2021-10-26 15:29:53,848 INFO [decode-500-vgg-att0.8.py:403] batch 700/757, cuts processed until now is 2734
2021-10-26 15:45:15,609 INFO [decode-500-vgg-att0.8.py:452]
For test-other, WER of different settings are:
ngram_lm_scale_1.2_attention_scale_1.2 5.71 best for test-other
ngram_lm_scale_1.3_attention_scale_1.5 5.71
ngram_lm_scale_1.5_attention_scale_1.7 5.71
ngram_lm_scale_1.5_attention_scale_2.0 5.71
ngram_lm_scale_1.2_attention_scale_1.1 5.72
ngram_lm_scale_1.5_attention_scale_1.9 5.72
from icefall.
For comparison, attached is the log for GPU decoding that I just obtained on the same machine.
log-decode-2021-10-26-16-05-54.txt
The following is the CPU RAM usage for decoding test-clean on GPU:
Note: I am using #84 to average the checkpoints on GPU, so averaging does not take much CPU RAM.
The time taken to average 20 checkpoints on CPU is 1 minute 38 seconds (see the log from #69 (comment)):
2021-10-26 10:35:48,692 INFO [decode-500-vgg-att0.8.py:558] averaging ['conformer_ctc/exp_500_att_0.8_vgg/epoch-15.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-16.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-17.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-18.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-19.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-20.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-21.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-22.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-23.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-24.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-25.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-26.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-27.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-28.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-29.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-30.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-31.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-32.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-33.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-34.pt']
2021-10-26 10:37:26,867 INFO [decode-500-vgg-att0.8.py:571] Number of model parameters: 102568040
while averaging on GPU takes 1 minute 20 seconds:
2021-10-26 16:06:44,528 INFO [decode-500-vgg-att0.8.py:558] averaging ['conformer_ctc/exp_500_att_0.8_vgg/epoch-15.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-16.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-17.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-18.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-19.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-20.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-21.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-22.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-23.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-24.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-25.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-26.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-27.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-28.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-29.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-30.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-31.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-32.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-33.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-34.pt']
2021-10-26 16:08:04,477 INFO [decode-500-vgg-att0.8.py:571] Number of model parameters: 102568040
As the number of checkpoints to be averaged grows, the advantage of using the GPU for averaging becomes more obvious.
The GPU memory usage reported by nvidia-smi while decoding test-clean is:
For test-other, the CPU RAM usage is:
We can see that the maximum RAM usage is about 10.19 GB.
Note that at 16:29:48 there is an OOM error, which frees some cached memory; that is why the GPU RAM drops from 275357 MB to 25205 MB.
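The pattern behind that drop is roughly: catch the CUDA OOM, free cached memory, prune the lattice, and retry with tighter limits. A schematic sketch follows; the function names and the retry parameter here are illustrative, not the actual decode.py code:

```python
def intersect(max_active_states):
    # Stand-in for the real intersection call; pretend large beams OOM.
    if max_active_states > 5000:
        raise RuntimeError("CUDA out of memory. Tried to allocate 8.00 GiB")
    return f"lattice(max_active_states={max_active_states})"

def intersect_with_fallback():
    try:
        return intersect(max_active_states=10000)
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        # The real code would call torch.cuda.empty_cache() here (hence the
        # memory drop in nvidia-smi) and prune arcs before retrying.
        return intersect(max_active_states=5000)

print(intersect_with_fallback())  # lattice(max_active_states=5000)
```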
The following is part of the decoding log extracted from the attached file for ease of reference:
CUDA_VISIBLE_DEVICES=0 ./conformer_ctc/decode-500-vgg-att0.8.py \
--max-duration 30 --concatenate-cuts 0 --bucketing-sampler 1 \
--method attention-decoder --epoch 34 --avg 20
2021-10-26 16:05:54,451 INFO [decode-500-vgg-att0.8.py:465] Decoding started
2021-10-26 16:05:54,452 INFO [decode-500-vgg-att0.8.py:466] {'exp_dir': PosixPath('conformer_ctc/exp_500_att_0.8_vgg'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'subsampling_factor': 4, 'num_decoder_layers': 6, 'vgg_frontend': True, 'is_espnet_structure': True, 'mmi_loss': False, 'use_feat_batchnorm': True, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 34, 'avg': 20, 'method': 'attention-decoder', 'num_paths': 100, 'lattice_score_scale': 1.0, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-26 16:05:54,819 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-10-26 16:05:55,204 INFO [decode-500-vgg-att0.8.py:476] device: cuda:0
2021-10-26 16:06:08,230 INFO [decode-500-vgg-att0.8.py:519] Loading pre-compiled G_4_gram.pt
2021-10-26 16:06:44,528 INFO [decode-500-vgg-att0.8.py:558] averaging ['conformer_ctc/exp_500_att_0.8_vgg/epoch-15.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-16.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-17.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-18.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-19.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-20.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-21.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-22.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-23.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-24.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-25.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-26.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-27.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-28.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-29.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-30.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-31.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-32.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-33.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-34.pt']
2021-10-26 16:08:04,477 INFO [decode-500-vgg-att0.8.py:571] Number of model parameters: 102568040
2021-10-26 16:08:06,127 INFO [decode-500-vgg-att0.8.py:403] batch 0/787, cuts processed until now is 3
2021-10-26 16:09:45,247 INFO [decode-500-vgg-att0.8.py:403] batch 100/787, cuts processed until now is 348
2021-10-26 16:11:27,598 INFO [decode-500-vgg-att0.8.py:403] batch 200/787, cuts processed until now is 690
2021-10-26 16:13:10,609 INFO [decode-500-vgg-att0.8.py:403] batch 300/787, cuts processed until now is 1017
2021-10-26 16:14:50,366 INFO [decode-500-vgg-att0.8.py:403] batch 400/787, cuts processed until now is 1381
2021-10-26 16:16:34,659 INFO [decode-500-vgg-att0.8.py:403] batch 500/787, cuts processed until now is 1694
2021-10-26 16:18:16,964 INFO [decode-500-vgg-att0.8.py:403] batch 600/787, cuts processed until now is 2024
2021-10-26 16:19:58,504 INFO [decode-500-vgg-att0.8.py:403] batch 700/787, cuts processed until now is 2352
2021-10-26 16:22:38,703 INFO [decode-500-vgg-att0.8.py:452]
For test-clean, WER of different settings are:
ngram_lm_scale_1.0_attention_scale_1.1 2.6 best for test-clean
ngram_lm_scale_1.3_attention_scale_1.3 2.6
ngram_lm_scale_1.3_attention_scale_1.5 2.6
ngram_lm_scale_1.5_attention_scale_1.3 2.6
ngram_lm_scale_1.9_attention_scale_1.9 2.6
ngram_lm_scale_2.0_attention_scale_2.0 2.6
... ...
2021-10-26 16:22:39,845 INFO [decode-500-vgg-att0.8.py:403] batch 0/757, cuts processed until now is 5
2021-10-26 16:24:21,415 INFO [decode-500-vgg-att0.8.py:403] batch 100/757, cuts processed until now is 377
2021-10-26 16:25:59,809 INFO [decode-500-vgg-att0.8.py:403] batch 200/757, cuts processed until now is 792
2021-10-26 16:27:37,661 INFO [decode-500-vgg-att0.8.py:403] batch 300/757, cuts processed until now is 1180
2021-10-26 16:29:13,849 INFO [decode-500-vgg-att0.8.py:403] batch 400/757, cuts processed until now is 1549
2021-10-26 16:29:48,348 INFO [decode.py:588] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 20.32 GiB already allocated; 7.15 GiB free; 23.38 GiB reserved in total by PyTorch)
2021-10-26 16:29:48,349 INFO [decode.py:589] num_arcs before pruning: 254683
2021-10-26 16:29:48,378 INFO [decode.py:596] num_arcs after pruning: 7851
2021-10-26 16:30:52,598 INFO [decode-500-vgg-att0.8.py:403] batch 500/757, cuts processed until now is 1946
2021-10-26 16:32:29,012 INFO [decode-500-vgg-att0.8.py:403] batch 600/757, cuts processed until now is 2342
2021-10-26 16:34:05,310 INFO [decode-500-vgg-att0.8.py:403] batch 700/757, cuts processed until now is 2734
2021-10-26 16:36:15,618 INFO [decode-500-vgg-att0.8.py:452]
For test-other, WER of different settings are:
ngram_lm_scale_1.3_attention_scale_1.7 5.72 best for test-other
ngram_lm_scale_1.5_attention_scale_2.0 5.72
ngram_lm_scale_1.2_attention_scale_1.2 5.73
ngram_lm_scale_1.3_attention_scale_1.5 5.73
ngram_lm_scale_1.3_attention_scale_1.9 5.73
ngram_lm_scale_1.5_attention_scale_1.7 5.73
ngram_lm_scale_1.7_attention_scale_2.0 5.73
The following table compares the decoding time between CPU and GPU:

|     | test-clean        | test-other         |
|-----|-------------------|--------------------|
| CPU | 1 hour 46 minutes | 3 hours 21 minutes |
| GPU | 14 minutes        | 14 minutes         |
(Note: As you can see, the GPU RAM is not fully used. If you increase --max-duration, decoding may take less time.)
I found that converting the model to TorchScript and running model = torch.utils.mobile_optimizer.optimize_for_mobile(model) reduces the CPU recognition time by about 20%. Further improvements could probably be achieved with quantization, but I didn't try it.
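For completeness, here is a minimal sketch of that optimization on a toy module (the real conformer model would first have to be TorchScript-compatible; TinyEncoder is purely illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

class TinyEncoder(nn.Module):
    """Toy stand-in for the acoustic model; 80-dim fbank input."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(80, 32)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyEncoder().eval()
scripted = torch.jit.script(model)         # TorchScript first
optimized = optimize_for_mobile(scripted)  # then the mobile/CPU optimization

out = optimized(torch.randn(1, 80))
print(out.shape)  # torch.Size([1, 32])
```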
Also, if you're willing to sacrifice some WER, you can change the decoding method argument to 1best.
Using the decoding method ctc-decoding is faster, as it requires no LMs and fewer FSA operations, though its WER is not as good as that of the other methods.
#58 shows that the WERs using CTC decoding on the LibriSpeech test sets are:
ctc-decoding 3.26 best for test-clean
ctc-decoding 8.21 best for test-other
These methods may not work for me; I don't want to sacrifice performance. My expectation is to spend 30 to 60 minutes decoding on the CPU. Thanks for your advice.
Attention models are never going to be that fast to decode. I believe the ESPnet setup takes something like 24 hours to decode with attention decoding, and that is on GPU. (But I may be mistaken.)
Hi,
Recently I ran the LibriSpeech experiment in icefall. I found that decoding takes a long time on the CPU. I know that decoding on the GPU is fast, but I need to decode on the CPU. So the questions I want to ask are:
- Is there a way to make it run faster on the CPU?
- How long did it take you to decode on the CPU?
Here is the screenshot of the decoding log. It took about 10 hours.
Thanks!
Hi, I also run decoding on the CPU. Can you tell me your CPU configuration?
@CSerV @csukuangfj When I run the decoding in the following steps:
#########
2021-10-09 15:33:49,785 INFO [decode.py:538] Decoding started
2021-10-09 15:33:49,786 INFO [decode.py:539] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 34, 'avg': 20, 'method': 'attention-decoder', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe'), 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 50, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-09 15:33:50,955 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe/Linv.pt
2021-10-09 15:33:51,271 INFO [decode.py:549] device: cpu
2021-10-09 15:35:21,894 INFO [decode.py:604] Loading pre-compiled G_4_gram.pt
2021-10-09 15:35:54,747 INFO [decode.py:640] averaging ['conformer_ctc/exp/epoch-15.pt', 'conformer_ctc/exp/epoch-16.pt', 'conformer_ctc/exp/epoch-17.pt', 'conformer_ctc/exp/epoch-18.pt', 'conformer_ctc/exp/epoch-19.pt', 'conformer_ctc/exp/epoch-20.pt', 'conformer_ctc/exp/epoch-21.pt', 'conformer_ctc/exp/epoch-22.pt', 'conformer_ctc/exp/epoch-23.pt', 'conformer_ctc/exp/epoch-24.pt', 'conformer_ctc/exp/epoch-25.pt', 'conformer_ctc/exp/epoch-26.pt', 'conformer_ctc/exp/epoch-27.pt', 'conformer_ctc/exp/epoch-28.pt', 'conformer_ctc/exp/epoch-29.pt', 'conformer_ctc/exp/epoch-30.pt', 'conformer_ctc/exp/epoch-31.pt', 'conformer_ctc/exp/epoch-32.pt', 'conformer_ctc/exp/epoch-33.pt', 'conformer_ctc/exp/epoch-34.pt']
2021-10-09 16:06:26,042 INFO [decode.py:653] Number of model parameters: 116147120
#######
It takes a long time before "Number of model parameters" appears, while in your screenshot it takes only 20 seconds. At the same time, the CPU memory usage can go up to 78 GB, which is very high. Did you make any optimizations?
Here are some screenshots of the CPU configuration. I didn't make any optimizations.
cat /proc/cpuinfo
processor : 63
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
stepping : 7
microcode : 0x5003006
cpu MHz : 2300.000
cache size : 22528 KB
physical id : 1
siblings : 32
core id : 12
cpu cores : 16
apicid : 57
initial apicid : 57
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni spec_ctrl intel_stibp flush_l1d arch_capabilities
bogomips : 4604.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Hi, I also run decoding on the CPU. Can you tell me your CPU configuration?
I am using a virtual machine with 10 CPUs and 50 GB of CPU RAM to test decoding with attention-decoder on CPU.
The decoding command is
CUDA_VISIBLE_DEVICES= ./conformer_ctc/decode-500-vgg-att0.8.py \
--max-duration 30 \
--concatenate-cuts 0 \
--bucketing-sampler 1 \
--method attention-decoder \
--epoch 34 \
--avg 20
The decoding logs are:
2021-10-25 16:12:01,444 INFO [decode-500-vgg-att0.8.py:465] Decoding started
2021-10-25 16:12:01,445 INFO [decode-500-vgg-att0.8.py:466] {'exp_dir': PosixPath('conformer_ctc/exp_500_att_0.8_vgg'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'subsampling_factor': 4, 'num_decoder_layers': 6, 'vgg_frontend': True, 'is_espnet_structure': True, 'mmi_loss': False, 'use_feat_batchnorm': True, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 34, 'avg': 20, 'method': 'attention-decoder', 'num_paths': 100, 'lattice_score_scale': 1.0, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-25 16:12:01,798 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-10-25 16:12:02,052 INFO [decode-500-vgg-att0.8.py:476] device: cpu
2021-10-25 16:12:09,762 INFO [decode-500-vgg-att0.8.py:519] Loading pre-compiled G_4_gram.pt
2021-10-25 16:14:50,914 INFO [decode-500-vgg-att0.8.py:558] averaging ['conformer_ctc/exp_500_att_0.8_vgg/epoch-15.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-16.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-17.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-18.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-19.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-20.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-21.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-22.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-23.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-24.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-25.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-26.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-27.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-28.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-29.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-30.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-31.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-32.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-33.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-34.pt']
2021-10-25 16:15:34,641 INFO [decode-500-vgg-att0.8.py:571] Number of model parameters: 102568040
/ceph-fj/fangjun/open-source/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames, max_cuts, or max_duration constraints - we'll return it anyway. Consider increasing max_frames/max_cuts/max_duration.
warnings.warn(
2021-10-25 16:15:43,896 INFO [decode-500-vgg-att0.8.py:403] batch 0/787, cuts processed until now is 3
At the same time, the CPU memory usage can go up to 78 GB, which is very high. Did you make any optimizations?
I don't think it should use that much CPU memory. Here is the memory usage reported by our cluster management tools:
The maximum CPU RAM used in the decoding process is about 25.73 GB. We don't use any optimizations.
Note that the code uses only two CPU threads during the pruned intersection. If you have more CPU RAM, you can increase --max-duration, which can decrease the total decoding time.
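Relatedly, the PyTorch side of the computation (the conformer forward pass) honors the intra-op thread setting, so on a many-core machine it may be worth checking. This does not change the two threads used by the pruned intersection; the value 8 below is just an example:

```python
import torch

# Intra-op parallelism for PyTorch CPU ops (matmuls, convolutions, ...).
# This does NOT affect the threads used by the pruned intersection.
torch.set_num_threads(8)
print(torch.get_num_threads())  # 8
```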
How long did it take you to decode on the CPU?
We have not performed such a test yet. I am decoding on CPU right now.
I ran decoding on CPU using a model from speechbrain some time ago. It takes 45.12778 hours to decode test-clean; see speechbrain/speechbrain#928 (comment). (I was not able to decode test-other on CPU with speechbrain, as it consumes lots of memory; my virtual machine was killed due to OOM.)
How long did it take you to decode on the CPU?
Here is the decoding log on CPU for test-clean that I just obtained. You can see that it takes about 1 hour and 46 minutes to decode test-clean. (Note: I use a model with a vocab size of 500, not 5000.)
$ CUDA_VISIBLE_DEVICES= ./conformer_ctc/decode-500-vgg-att0.8.py --max-duration 30 --concatenate-cuts 0 --bucketing-sampler 1 --method attention-decoder --epoch 34 --avg 20
2021-10-25 16:12:01,444 INFO [decode-500-vgg-att0.8.py:465] Decoding started
2021-10-25 16:12:01,445 INFO [decode-500-vgg-att0.8.py:466] {'exp_dir': PosixPath('conformer_ctc/exp_500_att_0.8_vgg'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'subsampling_factor': 4, 'num_decoder_layers': 6, 'vgg_frontend': True, 'is_espnet_structure': True, 'mmi_loss': False, 'use_feat_batchnorm': True, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 34, 'avg': 20, 'method': 'attention-decoder', 'num_paths': 100, 'lattice_score_scale': 1.0, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-25 16:12:01,798 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe_500/Linv.pt
2021-10-25 16:12:02,052 INFO [decode-500-vgg-att0.8.py:476] device: cpu
2021-10-25 16:12:09,762 INFO [decode-500-vgg-att0.8.py:519] Loading pre-compiled G_4_gram.pt
2021-10-25 16:14:50,914 INFO [decode-500-vgg-att0.8.py:558] averaging ['conformer_ctc/exp_500_att_0.8_vgg/epoch-15.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-16.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-17.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-18.pt','conformer_ctc/exp_500_att_0.8_vgg/epoch-19.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-20.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-21.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-22.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-23.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-24.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-25.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-26.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-27.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-28.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-29.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-30.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-31.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-32.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-33.pt', 'conformer_ctc/exp_500_att_0.8_vgg/epoch-34.pt']
2021-10-25 16:15:34,641 INFO [decode-500-vgg-att0.8.py:571] Number of model parameters: 102568040
/ceph-fj/fangjun/open-source/lhotse/lhotse/dataset/sampling/single_cut.py:170: UserWarning: The first cut drawn in batch collection violates the max_frames, max_cuts, or max_duration constraints - we'll return it anyway. Consider increasing max_frames/max_cuts/max_duration.
warnings.warn(
2021-10-25 16:15:43,896 INFO [decode-500-vgg-att0.8.py:403] batch 0/787, cuts processed until now is 3
2021-10-25 16:29:17,429 INFO [decode-500-vgg-att0.8.py:403] batch 100/787, cuts processed until now is 348
2021-10-25 16:42:46,474 INFO [decode-500-vgg-att0.8.py:403] batch 200/787, cuts processed until now is 690
2021-10-25 16:56:16,433 INFO [decode-500-vgg-att0.8.py:403] batch 300/787, cuts processed until now is 1017
2021-10-25 17:09:35,900 INFO [decode-500-vgg-att0.8.py:403] batch 400/787, cuts processed until now is 1381
2021-10-25 17:22:36,109 INFO [decode-500-vgg-att0.8.py:403] batch 500/787, cuts processed until now is 1694
2021-10-25 17:35:43,517 INFO [decode-500-vgg-att0.8.py:403] batch 600/787, cuts processed until now is 2024
2021-10-25 17:48:50,643 INFO [decode-500-vgg-att0.8.py:403] batch 700/787, cuts processed until now is 2352
2021-10-25 18:01:22,994 INFO [decode-500-vgg-att0.8.py:452]
For test-clean, WER of different settings are:
ngram_lm_scale_1.0_attention_scale_1.1 2.6 best for test-clean
ngram_lm_scale_0.7_attention_scale_1.0 2.61
ngram_lm_scale_0.9_attention_scale_0.9 2.61
ngram_lm_scale_0.9_attention_scale_1.0 2.61
The CPU RAM usage is given below. You can see that the maximum RAM usage is less than 26 GB.
It takes a long time before "Number of model parameters" appears, while in your screenshot it takes only 20 seconds. At the same time, the CPU memory usage can go up to 78 GB, which is very high. Did you make any optimizations?
This does not look right to me. Model averaging should not take 78 GB of CPU RAM. Can you check that no other memory-consuming processes are running while you are decoding?