Comments (10)

danpovey commented on July 21, 2024

I meant the normal training log output leading up to that point, not the error traceback.
(But yeah, for max-duration of 600 and world-size 3, perhaps that base-lr is too large, could
try 0.03 for instance.)
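
To put rough numbers on that: the effective batch size here is roughly world-size × max-duration seconds of audio per optimizer step (--max-duration is per GPU, in seconds, following the lhotse/icefall convention). A quick sketch of that arithmetic with the values discussed in this thread:

# Rough effective batch size in seconds of audio per optimizer step.
# Treating --max-duration as per-GPU seconds is the lhotse/icefall
# convention; the numbers are the ones from this thread.
def effective_batch_seconds(world_size: int, max_duration: float) -> float:
    return world_size * max_duration

print(effective_batch_seconds(3, 350))  # 1050 s per step
print(effective_batch_seconds(3, 600))  # 1800 s per step

With only three GPUs the effective batch stays on the small side, which is why a smaller base-lr (e.g. 0.03) tends to be more stable here.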

marcoyang1998 commented on July 21, 2024

It seems that you have nan values in the forward pass of your model. How many GPUs and what max_duration are you using?

zw76859420 commented on July 21, 2024

Thanks for your kind reply. We used the following params to train the zipformer (CTC/AED) model:

./zipformer/train.py \
  --world-size 3 \
  --num-epochs 360 \
  --start-epoch 1 \
  --use-fp16 0 \
  --exp-dir zipformer/exp \
  --base-lr 0.045 \
  --lr-epochs 1.5 \
  --max-duration 350 \
  --enable-musan 0 \
  --use-fp16 0 \
  --lang-dir data/lang_char \
  --manifest-dir data/fbank \
  --on-the-fly-feats 0 \
  --save-every-n 2000 \
  --keep-last-k 20 \
  --inf-check 1 \
  --use-transducer 0 \
  --use-ctc 1 \
  --use-attention-decoder 1 \
  --ctc-loss-scale 0.1 \
  --attention-decoder-loss-scale 0.9 \
  --num-encoder-layers 2,2,4,6,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192

zw76859420 commented on July 21, 2024

We used the following code to remove those utterances in zipformer/train.py:

[two screenshots of the utterance-filtering code in zipformer/train.py]
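
In icefall recipes this kind of utterance removal is typically a duration-based filter applied to the lhotse CutSet before the sampler is built. A minimal sketch of that pattern, assuming lhotse's CutSet API; the bounds and helper names below are illustrative, not the exact code in the screenshots above:

from lhotse import CutSet

def remove_short_and_long_utt(c) -> bool:
    # Keep only cuts whose duration falls inside a chosen range.
    # 1.0 s and 20.0 s are placeholder bounds, not values from this issue.
    return 1.0 <= c.duration <= 20.0

def filter_train_cuts(train_cuts: CutSet) -> CutSet:
    # Lazily drop out-of-range utterances before they reach the sampler.
    return train_cuts.filter(remove_short_and_long_utt)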

yaozengwei commented on July 21, 2024

He ran that with --inf-check True. The log shows "module.encoder_embed.conv.0.output is not finite." The total batch size he was using (--world-size 3 --max-duration 350) might be too small for a base-lr of 0.045.
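
For context on what --inf-check does: judging from the traceback later in this thread (icefall/icefall/hooks.py raising because a module output's sum is not finite), it attaches forward hooks that check each module's output and raise as soon as a non-finite value appears, which is how the failing layer (module.encoder_embed.conv.0) gets named. A minimal sketch of that idea, not icefall's actual implementation:

import torch

def attach_inf_check(model: torch.nn.Module) -> None:
    # Register a forward hook on every submodule; the hook raises as soon
    # as a module produces a non-finite output, so the offending layer is
    # named in the traceback. Sketch only; icefall's hooks.py differs.
    def check(module, inputs, output, name=""):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output.sum()):
            raise ValueError(f"The sum of {name}.output is not finite")

    for name, module in model.named_modules():
        module.register_forward_hook(
            lambda m, i, o, name=name: check(m, i, o, name=name)
        )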

zw76859420 commented on July 21, 2024

Thanks for the advice. We used the following model config, but the same problem still occurs.

export CUDA_VISIBLE_DEVICES="0,1,2"

./zipformer/train.py \
  --world-size 3 \
  --num-epochs 360 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --base-lr 0.045 \
  --lr-epochs 1.5 \
  --max-duration 600 \
  --enable-musan 0 \
  --use-fp16 0 \
  --lang-dir data/lang_char \
  --manifest-dir data/fbank \
  --on-the-fly-feats 0 \
  --save-every-n 2000 \
  --keep-last-k 20 \
  --inf-check 1 \
  --use-transducer 1 \
  --use-ctc 1 \
  --use-attention-decoder 1 \
  --ctc-loss-scale 0.3 \
  --attention-decoder-loss-scale 0.7 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192

zw76859420 commented on July 21, 2024

The detailed error logs are as follows:

Traceback (most recent call last):
  File "./zipformer/train.py", line 1520, in <module>
    main()
  File "./zipformer/train.py", line 1511, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "ASR/zipformer/train.py", line 1318, in run
    train_one_epoch(
  File "ASR/zipformer/train.py", line 1009, in train_one_epoch
    loss, loss_info = compute_loss(
  File "ASR/zipformer/train.py", line 843, in compute_loss
    simple_loss, pruned_loss, ctc_loss, attention_decoder_loss = model(
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "icefall/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "ASR/zipformer/model.py", line 338, in forward
    encoder_out, encoder_out_lens = self.forward_encoder(x, x_lens)
  File "ASR/zipformer/model.py", line 140, in forward_encoder
    x, x_lens = self.encoder_embed(x, x_lens)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "ASR/zipformer/subsampling.py", line 309, in forward
    x = self.conv(x)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    hook_result = hook(self, input, result)
  File "icefall/icefall/hooks.py", line 43, in forward_hook
    raise ValueError(
ValueError: The sum of module.encoder_embed.conv.0.output is not finite: tensor([[[[nan, nan, nan, ..., nan, nan, nan],

zw76859420 commented on July 21, 2024

Thanks Dan, my idol.
Zengwei also gave me the same suggestion, so we have now changed the base-lr to 0.035 on top of the above configs.
We will share the training results as soon as possible.

zw76859420 commented on July 21, 2024

> I meant the normal training log output leading up to that point, not the error traceback. (But yeah, for max-duration of 600 and world-size 3, perhaps that base-lr is too large, could try 0.03 for instance.)

It works!!!
When I reduced the base-lr from 0.045 to 0.030, the loss started dropping normally.
[screenshot of the training loss curve]
