
Comments (5)

OKORKO commented on September 27, 2024

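After repeated testing, I found that this problem occurs whenever train.jsonl contains more than 7 entries. With fewer than 7 entries training does run, but the log keeps printing "Update best acc: 0.0000, outputs/model.pt.best" and the accuracy stays at 0.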

from funasr.

R1ckShi commented on September 27, 2024

This has been fixed. Please pull the latest code and try again.


OKORKO commented on September 27, 2024

I still get the same error:
[2024-04-17 15:23:00,349][root][INFO] - Train epoch: 0, rank: 0

/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 512], strides() = [1, 1]
bucket_view.sizes() = [1, 512], strides() = [512, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 512], strides() = [1, 1]
bucket_view.sizes() = [1, 512], strides() = [512, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[2024-04-17 15:23:02,723][root][INFO] - train, rank: 1, epoch: 0/50, step: 1/1, total step: 1, (loss_avg_rank: 0.013), (loss_avg_epoch: 0.008), (ppl_avg_epoch: 1.008e+00), (acc_avg_epoch: 0.000), (lr: 1.333e-08), [('loss_seaco', 0.008), ('loss', 0.008)], {'data_load': '0.994', 'forward_time': '0.692', 'backward_time': '0.505', 'optim_time': '0.174', 'total_time': '2.364'}, GPU, memory: usage: 3.784 GB, peak: 6.489 GB, cache: 6.828 GB, cache_peak: 6.828 GB
[2024-04-17 15:23:02,750][root][INFO] - train, rank: 0, epoch: 0/50, step: 1/1, total step: 1, (loss_avg_rank: 0.003), (loss_avg_epoch: 0.008), (ppl_avg_epoch: 1.008e+00), (acc_avg_epoch: 0.000), (lr: 1.333e-08), [('loss_seaco', 0.007), ('loss', 0.007)], {'data_load': '0.987', 'forward_time': '0.541', 'backward_time': '0.663', 'optim_time': '0.200', 'total_time': '2.391'}, GPU, memory: usage: 3.684 GB, peak: 6.373 GB, cache: 6.521 GB, cache_peak: 6.521 GB
[2024-04-17 15:23:02,849][root][INFO] - Validate epoch: 0, rank: 1

[2024-04-17 15:23:02,849][root][INFO] - Validate epoch: 0, rank: 0

Error executing job with overrides: ['++model=iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', '++train_data_set_list=../../../data/list/train.jsonl', '++valid_data_set_list=../../../data/list/val.jsonl', '++dataset_conf.batch_size=2000', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=4', '++train_conf.max_epoch=50', '++train_conf.log_interval=1', '++train_conf.resume=false', '++train_conf.validate_interval=2000', '++train_conf.save_checkpoint_interval=2000', '++train_conf.keep_nbest_models=20', '++optim_conf.lr=0.0002', '++output_dir=./outputs']
Traceback (most recent call last):
File "../../../funasr/bin/train.py", line 225, in <module>
main_hydra()
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "../../../funasr/bin/train.py", line 48, in main_hydra
main(**kwargs)
File "../../../funasr/bin/train.py", line 196, in main
trainer.validate_epoch(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/funasr/train_utils/trainer.py", line 432, in validate_epoch
retval = model(**batch)
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/funasr/models/seaco_paraformer/model.py", line 122, in forward
assert text_lengths.dim() == 1, text_lengths.shape
AssertionError: torch.Size([])
Error executing job with overrides: ['++model=iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', '++train_data_set_list=../../../data/list/train.jsonl', '++valid_data_set_list=../../../data/list/val.jsonl', '++dataset_conf.batch_size=2000', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=4', '++train_conf.max_epoch=50', '++train_conf.log_interval=1', '++train_conf.resume=false', '++train_conf.validate_interval=2000', '++train_conf.save_checkpoint_interval=2000', '++train_conf.keep_nbest_models=20', '++optim_conf.lr=0.0002', '++output_dir=./outputs']
Traceback (most recent call last):
File "../../../funasr/bin/train.py", line 225, in <module>
main_hydra()
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "../../../funasr/bin/train.py", line 48, in main_hydra
main(**kwargs)
File "../../../funasr/bin/train.py", line 196, in main
trainer.validate_epoch(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/funasr/train_utils/trainer.py", line 432, in validate_epoch
retval = model(**batch)
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/funasr/models/seaco_paraformer/model.py", line 122, in forward
assert text_lengths.dim() == 1, text_lengths.shape
AssertionError: torch.Size([])
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 197344) of binary: /home/chushaobo/anaconda3/envs/funasrv2/bin/python
Traceback (most recent call last):
File "/home/chushaobo/anaconda3/envs/funasrv2/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/chushaobo/anaconda3/envs/funasrv2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

../../../funasr/bin/train.py FAILED

Failures:
[1]:
time : 2024-04-17_15:23:05
host : chushaobo
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 197345)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-04-17_15:23:05
host : chushaobo
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 197344)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
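
For readers unfamiliar with the assertion in the log: `text_lengths.dim() == 1` requires a 1-D tensor with one length per utterance, and `torch.Size([])` means a 0-dim (scalar) tensor arrived instead. The sketch below only illustrates that shape difference in plain PyTorch. How the scalar arises inside FunASR is not stated in this thread, so the `squeeze()` call and the helper name `passes_check` are assumptions made for illustration, not the project's code or its fix.

```python
import torch

# A batch of two target lengths: the 1-D shape the assertion expects.
batch_lengths = torch.tensor([7, 12])

# A single length that has lost its batch dimension. squeeze() on a
# batch of one is used here only as an illustrative way to get a 0-dim tensor.
single_length = torch.tensor([7]).squeeze()


def passes_check(text_lengths: torch.Tensor) -> bool:
    """Same condition as the assertion shown in the traceback."""
    return text_lengths.dim() == 1


# reshape(1) turns the 0-dim tensor back into a 1-D tensor with one element.
restored = single_length.reshape(1)

if __name__ == "__main__":
    print(batch_lengths.shape, passes_check(batch_lengths))
    print(single_length.shape, passes_check(single_length))
    print(restored.shape, passes_check(restored))
```

This reproduces only the symptom. The resolution in this thread was to update the installed FunASR code, not to reshape tensors by hand.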


R1ckShi commented on September 27, 2024

Update the source with "git pull" followed by "pip install -e ./". Your code has not been updated.
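
The traceback above is consistent with this: the entry script runs from the checkout (../../../funasr/bin/train.py), while trainer.py and model.py are loaded from site-packages/funasr, so a "git pull" alone would not change the code that actually executes. A minimal sketch for checking which copy of a package Python will import, using only the standard library; the helper names `module_origin` and `is_installed_copy` are hypothetical and not part of FunASR.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_origin(name: str) -> Optional[Path]:
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None  # not installed, or a namespace package without a file
    if spec.origin in ("built-in", "frozen"):
        return None  # compiled into the interpreter, no file on disk
    return Path(spec.origin).resolve()


def is_installed_copy(path: Path) -> bool:
    """True if the path lies under a site-packages or dist-packages directory."""
    return any(part in ("site-packages", "dist-packages") for part in path.parts)


if __name__ == "__main__":
    origin = module_origin("funasr")
    if origin is None:
        print("funasr is not importable in this environment")
    elif is_installed_copy(origin):
        print(f"funasr resolves to an installed copy: {origin}")
    else:
        print(f"funasr resolves to a source tree: {origin}")
```

After an editable install, the reported path would normally point into the git checkout instead of site-packages, so later pulls take effect without reinstalling.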


OKORKO commented on September 27, 2024

Thank you very much. I tested it and training works now. I'll train for a few days and see how it performs.
