Comments (10)
It seems that you have nan values in the forward pass of your model. How many GPUs and what max_duration are you using?
from icefall.
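For context, `--max-duration` in these recipes caps the total audio duration (in seconds) per batch on each GPU, so the effective batch size per optimizer step depends on both flags. A trivial sketch of the relationship:

```python
def effective_batch_seconds(world_size: int, max_duration: float) -> float:
    """Total seconds of audio per optimizer step across all GPUs.

    max_duration is the per-GPU cap on total audio seconds in one batch.
    """
    return world_size * max_duration

print(effective_batch_seconds(3, 350))  # 1050
```

This is why the GPU count and max-duration together determine whether a given base-lr is safe.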
Thanks for your kind reply. We used the following params to train the Zipformer (CTC/AED) model:
./zipformer/train.py \
  --world-size 3 \
  --num-epochs 360 \
  --start-epoch 1 \
  --use-fp16 0 \
  --exp-dir zipformer/exp \
  --base-lr 0.045 \
  --lr-epochs 1.5 \
  --max-duration 350 \
  --enable-musan 0 \
  --lang-dir data/lang_char \
  --manifest-dir data/fbank \
  --on-the-fly-feats 0 \
  --save-every-n 2000 \
  --keep-last-k 20 \
  --inf-check 1 \
  --use-transducer 0 \
  --use-ctc 1 \
  --use-attention-decoder 1 \
  --ctc-loss-scale 0.1 \
  --attention-decoder-loss-scale 0.9 \
  --num-encoder-layers 2,2,4,6,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192
We used the following code to remove such utterances in zipformer/train.py:
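In icefall recipes this kind of filtering is typically a duration predicate applied to each lhotse cut before building the dataloader. A hedged sketch (the bounds here are illustrative, not necessarily the values used in this run):

```python
def remove_short_and_long_utt(cut) -> bool:
    # Keep utterances between 1 s and 20 s; the bounds are illustrative.
    # Very short or very long cuts can destabilize training or exhaust memory.
    return 1.0 <= cut.duration <= 20.0

# Applied to a lhotse CutSet before sampling:
# train_cuts = train_cuts.filter(remove_short_and_long_utt)
```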
He ran that with --inf-check True. The log shows "module.encoder_embed.conv.0.output is not finite.". The total batch size he was using (--world-size 3 --max-duration 350) might be too small for a base-lr of 0.045.
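The --inf-check option works by registering PyTorch forward hooks that test each module's output for non-finite values, which is how it can name the exact module where NaNs first appear. A minimal sketch of the idea (the real implementation in icefall/hooks.py attaches more hooks and reports more context):

```python
import torch
import torch.nn as nn

def register_inf_check_hooks(model: nn.Module) -> None:
    """Raise as soon as any module produces a non-finite output.

    Minimal sketch of the idea behind icefall's --inf-check; not the
    actual icefall implementation.
    """
    def make_hook(name: str):
        def hook(module, inputs, output):
            # Mirrors the style of the error in this thread: the sum of a
            # tensor containing nan/inf is itself non-finite.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output.sum()):
                raise ValueError(f"The sum of {name}.output is not finite")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```

After registering the hooks, the first module whose output contains nan/inf raises during the forward pass, pinpointing where trouble starts.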
Thanks for the advice. We used the following model config, but the same problem still occurs.
export CUDA_VISIBLE_DEVICES="0,1,2"
./zipformer/train.py \
  --world-size 3 \
  --num-epochs 360 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --base-lr 0.045 \
  --lr-epochs 1.5 \
  --max-duration 600 \
  --enable-musan 0 \
  --use-fp16 0 \
  --lang-dir data/lang_char \
  --manifest-dir data/fbank \
  --on-the-fly-feats 0 \
  --save-every-n 2000 \
  --keep-last-k 20 \
  --inf-check 1 \
  --use-transducer 1 \
  --use-ctc 1 \
  --use-attention-decoder 1 \
  --ctc-loss-scale 0.3 \
  --attention-decoder-loss-scale 0.7 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192
The detailed error logs are as follows:
Traceback (most recent call last):
  File "./zipformer/train.py", line 1520, in <module>
    main()
  File "./zipformer/train.py", line 1511, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "ASR/zipformer/train.py", line 1318, in run
    train_one_epoch(
  File "ASR/zipformer/train.py", line 1009, in train_one_epoch
    loss, loss_info = compute_loss(
  File "ASR/zipformer/train.py", line 843, in compute_loss
    simple_loss, pruned_loss, ctc_loss, attention_decoder_loss = model(
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "icefall/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "ASR/zipformer/model.py", line 338, in forward
    encoder_out, encoder_out_lens = self.forward_encoder(x, x_lens)
  File "ASR/zipformer/model.py", line 140, in forward_encoder
    x, x_lens = self.encoder_embed(x, x_lens)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "ASR/zipformer/subsampling.py", line 309, in forward
    x = self.conv(x)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    hook_result = hook(self, input, result)
  File "icefall/icefall/hooks.py", line 43, in forward_hook
    raise ValueError(
ValueError: The sum of module.encoder_embed.conv.0.output is not finite: tensor([[[[nan, nan, nan, ..., nan, nan, nan],
Thanks Dan, my idol. Zengwei also gave me the same suggestion, and we have now changed the base-lr to 0.035 with the configs above.
We will share the training results as soon as possible.
I meant the normal training log output leading up to that point, not the error traceback. (But yeah, for max-duration of 600 and world-size 3, perhaps that base-lr is too large, could try 0.03 for instance.)
It works!!!
After I reduced the base-lr from 0.045 to 0.030, the loss is now dropping normally.
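A common rule of thumb (not an icefall-specific formula) is to scale the learning rate with roughly the square root of the effective batch size. Assuming, purely for illustration, that the default base-lr of 0.045 was tuned for a larger reference setup of 8 GPUs at max-duration 600:

```python
import math

def suggest_base_lr(reference_lr: float,
                    reference_batch_seconds: float,
                    world_size: int,
                    max_duration: float) -> float:
    """Square-root LR scaling heuristic (a rule of thumb, not icefall's rule).

    reference_batch_seconds is the effective batch (world_size * max_duration)
    that reference_lr was tuned on; both reference values are assumptions here.
    """
    effective = world_size * max_duration
    return reference_lr * math.sqrt(effective / reference_batch_seconds)

# With the assumed reference of 8 GPUs x 600 s, a 3-GPU run at
# max-duration 600 lands near 0.028, close to the 0.03 that worked here.
print(round(suggest_base_lr(0.045, 8 * 600, 3, 600), 3))  # 0.028
```

The heuristic only gives a starting point; as the thread shows, a short trial run with --inf-check enabled is the real test.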