Comments (10)
It seems that you have nan values in the forward pass of your model. How many GPUs and what max_duration are you using?
from icefall.
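For context, `--max-duration` in these recipes caps the total audio duration (in seconds) per batch on each GPU, so the effective batch size per optimizer step depends on both flags. A trivial sketch of the relationship:

```python
def effective_batch_seconds(world_size: int, max_duration: float) -> float:
    """Total seconds of audio per optimizer step across all GPUs.

    max_duration is the per-GPU cap on total audio seconds in one batch.
    """
    return world_size * max_duration

print(effective_batch_seconds(3, 350))  # 1050
```

This is why the GPU count and max-duration together determine whether a given base-lr is safe.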
Thanks for your kind reply. We used the following params to train the Zipformer (CTC/AED) model:
./zipformer/train.py \
  --world-size 3 \
  --num-epochs 360 \
  --start-epoch 1 \
  --use-fp16 0 \
  --exp-dir zipformer/exp \
  --base-lr 0.045 \
  --lr-epochs 1.5 \
  --max-duration 350 \
  --enable-musan 0 \
  --lang-dir data/lang_char \
  --manifest-dir data/fbank \
  --on-the-fly-feats 0 \
  --save-every-n 2000 \
  --keep-last-k 20 \
  --inf-check 1 \
  --use-transducer 0 \
  --use-ctc 1 \
  --use-attention-decoder 1 \
  --ctc-loss-scale 0.1 \
  --attention-decoder-loss-scale 0.9 \
  --num-encoder-layers 2,2,4,6,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192
We used the following code to remove such utterances in zipformer/train.py:
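In icefall recipes this kind of filtering is typically a duration predicate applied to each lhotse cut before building the dataloader. A hedged sketch (the bounds here are illustrative, not necessarily the values used in this run):

```python
def remove_short_and_long_utt(cut) -> bool:
    # Keep utterances between 1 s and 20 s; the bounds are illustrative.
    # Very short or very long cuts can destabilize training or exhaust memory.
    return 1.0 <= cut.duration <= 20.0

# Applied to a lhotse CutSet before sampling:
# train_cuts = train_cuts.filter(remove_short_and_long_utt)
```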
He ran that with --inf-check True. The log shows "module.encoder_embed.conv.0.output is not finite.". The total batch size he was using (--world-size 3 --max-duration 350) might be too small for a base-lr of 0.045.
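The --inf-check option works by registering PyTorch forward hooks that test each module's output for non-finite values, which is how it can name the exact module where NaNs first appear. A minimal sketch of the idea (the real implementation in icefall/hooks.py attaches more hooks and reports more context):

```python
import torch
import torch.nn as nn

def register_inf_check_hooks(model: nn.Module) -> None:
    """Raise as soon as any module produces a non-finite output.

    Minimal sketch of the idea behind icefall's --inf-check; not the
    actual icefall implementation.
    """
    def make_hook(name: str):
        def hook(module, inputs, output):
            # Mirrors the style of the error in this thread: the sum of a
            # tensor containing nan/inf is itself non-finite.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output.sum()):
                raise ValueError(f"The sum of {name}.output is not finite")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```

After registering the hooks, the first module whose output contains nan/inf raises during the forward pass, pinpointing where trouble starts.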
Thanks for the advice. We used the following model config, but the same problem still occurs.
export CUDA_VISIBLE_DEVICES="0,1,2"
./zipformer/train.py \
  --world-size 3 \
  --num-epochs 360 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --base-lr 0.045 \
  --lr-epochs 1.5 \
  --max-duration 600 \
  --enable-musan 0 \
  --use-fp16 0 \
  --lang-dir data/lang_char \
  --manifest-dir data/fbank \
  --on-the-fly-feats 0 \
  --save-every-n 2000 \
  --keep-last-k 20 \
  --inf-check 1 \
  --use-transducer 1 \
  --use-ctc 1 \
  --use-attention-decoder 1 \
  --ctc-loss-scale 0.3 \
  --attention-decoder-loss-scale 0.7 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192
The detailed error logs are as follows:
Traceback (most recent call last):
  File "./zipformer/train.py", line 1520, in <module>
    main()
  File "./zipformer/train.py", line 1511, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "ASR/zipformer/train.py", line 1318, in run
    train_one_epoch(
  File "ASR/zipformer/train.py", line 1009, in train_one_epoch
    loss, loss_info = compute_loss(
  File "ASR/zipformer/train.py", line 843, in compute_loss
    simple_loss, pruned_loss, ctc_loss, attention_decoder_loss = model(
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "icefall/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "ASR/zipformer/model.py", line 338, in forward
    encoder_out, encoder_out_lens = self.forward_encoder(x, x_lens)
  File "ASR/zipformer/model.py", line 140, in forward_encoder
    x, x_lens = self.encoder_embed(x, x_lens)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "ASR/zipformer/subsampling.py", line 309, in forward
    x = self.conv(x)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    hook_result = hook(self, input, result)
  File "icefall/icefall/hooks.py", line 43, in forward_hook
    raise ValueError(
ValueError: The sum of module.encoder_embed.conv.0.output is not finite: tensor([[[[nan, nan, nan, ..., nan, nan, nan],
Thanks Dan, my idol. Zengwei also gave me the same suggestion, and we have now changed the base-lr to 0.035 with the configs above.
We will share the training results as soon as possible.
I meant the normal training log output leading up to that point, not the error traceback. (But yeah, for max-duration of 600 and world-size 3, perhaps that base-lr is too large, could try 0.03 for instance.)
It works!!!
After I reduced the base-lr from 0.045 to 0.030, the loss is now dropping normally.
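A common rule of thumb (not an icefall-specific formula) is to scale the learning rate with roughly the square root of the effective batch size. Assuming, purely for illustration, that the default base-lr of 0.045 was tuned for a larger reference setup of 8 GPUs at max-duration 600:

```python
import math

def suggest_base_lr(reference_lr: float,
                    reference_batch_seconds: float,
                    world_size: int,
                    max_duration: float) -> float:
    """Square-root LR scaling heuristic (a rule of thumb, not icefall's rule).

    reference_batch_seconds is the effective batch (world_size * max_duration)
    that reference_lr was tuned on; both reference values are assumptions here.
    """
    effective = world_size * max_duration
    return reference_lr * math.sqrt(effective / reference_batch_seconds)

# With the assumed reference of 8 GPUs x 600 s, a 3-GPU run at
# max-duration 600 lands near 0.028, close to the 0.03 that worked here.
print(round(suggest_base_lr(0.045, 8 * 600, 3, 600), 3))  # 0.028
```

The heuristic only gives a starting point; as the thread shows, a short trial run with --inf-check enabled is the real test.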