e2e_lfmmi's People

Contributors: jctian98

e2e_lfmmi's Issues

kick off train failed: AttributeError: can't set attribute

When I kick off training, I get the following error.

File "/home/Data/jing.lu/project/e2e_lfmmi/egs/aishell1.bk/espnet/asr/pytorch_backend/asr.py", line 694, in train
updater = CustomUpdater(
File "/home/Data/jing.lu/project/e2e_lfmmi/egs/aishell1.bk/espnet/asr/pytorch_backend/asr.py", line 186, in init
self.device = device
AttributeError: can't set attribute

Traceback (most recent call last):
File "/home/Data/jing.lu/project/e2e_lfmmi/egs/aishell1.bk/../..//bin/asr_train.py", line 699, in
main(sys.argv[1:])
File "/home/Data/jing.lu/project/e2e_lfmmi/egs/aishell1.bk/../..//bin/asr_train.py", line 685, in main
train(args)
File "/home/Data/jing.lu/project/e2e_lfmmi/egs/aishell1.bk/espnet/asr/pytorch_backend/asr.py", line 694, in train
updater = CustomUpdater(
File "/home/Data/jing.lu/project/e2e_lfmmi/egs/aishell1.bk/espnet/asr/pytorch_backend/asr.py", line 186, in init
self.device = device
AttributeError: can't set attribute
/home/Data/jing.lu/tools/wenet/wenet/miniconda3/envs/K2/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 91009) of binary: /home/Data/jing.lu/tools/wenet/wenet/miniconda3/envs/K2/bin/python3
Traceback (most recent call last):
File "/home/Data/jing.lu/tools/wenet/wenet/miniconda3/envs/K2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/Data/jing.lu/tools/wenet/wenet/miniconda3/envs/K2/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/Data/jing.lu/tools/wenet/wenet/miniconda3/envs/K2/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/Data/jing.lu/tools/wenet/wenet/miniconda3/envs/K2/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/Data/jing.lu/tools/wenet/wenet/miniconda3/envs/K2/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/Data/jing.lu/tools/wenet/wenet/miniconda3/envs/K2/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/Data/jing.lu/tools/wenet/wenet/miniconda3/envs/K2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/Data/jing.lu/tools/wenet/wenet/miniconda3/envs/K2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/Data/jing.lu/project/e2e_lfmmi/egs/aishell1.bk/../..//bin/asr_train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-06-26_18:59:57
host : scq03-802A13U0811-ai-app-13-2-msxf.host
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 91009)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
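
For context on the error itself: "AttributeError: can't set attribute" on self.device = device means device is a read-only property somewhere up the class hierarchy, plausibly an updater base class from a newer dependency version than the repo was developed against (that part is an assumption). A minimal sketch of the mechanism with hypothetical classes, not the repo's actual CustomUpdater:

    class BaseUpdater:
        @property
        def device(self):
            # Read-only property: no setter is defined.
            return self._device


    class CustomUpdater(BaseUpdater):
        def __init__(self, device):
            # Plain attribute assignment collides with the read-only property.
            self.device = device


    try:
        CustomUpdater("cuda:0")
    except AttributeError as exc:
        print(exc)  # "can't set attribute" (wording varies by Python version)

If that is the cause, pinning the dependency that defines the base class to the version listed in the repo's requirements should make the assignment legal again; this is a guess at the root cause, not a confirmed fix.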

aishell1 is incompletely executed

Hello, I would like to ask whether this repository contains all of the code related to e2e LF-MMI; some parts (e.g., the LF-MMI criterion) seem to be missing when executing.

test word ngram error

Hi, when I ran step 5 of prepare.sh, I got the following error. Could you give me some guidance?

[error screenshot omitted]

loss for lf_mmi is high.

Hi, sorry for interrupting.
I was running the demo egs for aishell and noticed an abnormal phenomenon.
My setting is the same as aed.sh. While checking the loss trend, I noticed that loss_ctc (ctc_type==k2_mmi) goes up after 3 epochs.
[loss curve screenshot omitted]
Am I doing something wrong?

Librispeech

Hi,

Thanks for publishing your work; this is very interesting.

Is there any chance you could upload the librispeech egs folder so that others can reproduce it?

timeout when training

Hi, when I try to train a model on aishell1, I hit a connect() timeout. Can you help me?
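
A connect() timeout at start-up is usually a rendezvous problem: one rank cannot reach the master address/port, or a slow rank has not joined yet when the others give up. A minimal sketch of the knobs involved, assuming the script calls torch.distributed.init_process_group itself; the environment variables are the standard PyTorch ones, not anything repo-specific:

    import datetime
    import os

    import torch.distributed as dist

    # The launcher normally sets these; defaults here only make the sketch
    # self-contained as a single local process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # must be reachable from every rank
    os.environ.setdefault("MASTER_PORT", "29500")      # must be a free, unblocked port
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # A longer timeout gives slow ranks more time to join before connect() fails.
    dist.init_process_group(
        backend="nccl",  # use "gloo" to test the rendezvous without GPUs
        timeout=datetime.timedelta(minutes=60),
    )
    print("rendezvous ok")
    dist.destroy_process_group()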

OSError: [Errno 38] Function not implemented

Hi!
I met this error while running LM-MMI decoding. If I set nj=188, 10 of the jobs don't have this issue and produce a normal decoding result. But if I set nj=50, all of my jobs crash.

number of phones 218
Found parameter lm_scores with shape torch.Size([47960])
Found parameter lo.1.weight with shape torch.Size([219, 512])
Found parameter lo.1.bias with shape torch.Size([219])
Using MMI scorer type: frame
MMI Scorer Module: <class 'espnet.nets.scorers.mmi_rnnt_scorer.MMIRNNTScorer'>
Traceback (most recent call last):
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/../..//bin/asr_recog.py", line 456, in
main(sys.argv[1:])
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/../..//bin/asr_recog.py", line 433, in main
recog(args)
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/espnet/asr/pytorch_backend/asr.py", line 1197, in recog
word_ngram_scorer = word_ngram_scorer(
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/espnet/nets/scorers/word_ngram.py", line 179, in init
self.WordNgram = WordNgram(lang, device)
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/espnet/nets/scorers/word_ngram.py", line 51, in init
self.load_G()
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/espnet/nets/scorers/word_ngram.py", line 66, in load_G
fcntl.flock(f, fcntl.LOCK_EX) # lock
OSError: [Errno 38] Function not implemented

Any idea how to solve this issue?
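
For reference, OSError 38 is ENOSYS: fcntl.flock is reporting that the filesystem holding the file does not implement flock (some NFS and overlay mounts behave this way), which would explain why the failure depends on where each decode job runs rather than on nj itself; that reading is an assumption. A small sketch of a guarded lock, as an illustration of the failure mode rather than a patch to word_ngram.py:

    import errno
    import fcntl

    def try_flock(f):
        """Take an exclusive lock, tolerating filesystems without flock support."""
        try:
            fcntl.flock(f, fcntl.LOCK_EX)
            return True
        except OSError as exc:
            if exc.errno == errno.ENOSYS:  # Errno 38: flock not implemented here
                return False
            raise

    # "some_shared_file" is a placeholder created just for the demo; in the
    # traceback the locked file comes from the lang directory.
    with open("some_shared_file", "w") as f:
        if not try_flock(f):
            print("flock unsupported on this filesystem; fall back to another lock")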

word_ngram result doesn't seem right

Hi, I tried running nets/scorers/word_ngram.py, but the result doesn't seem to be as expected

if __name__ == "__main__":
    device = torch.device("cuda:0")
    lang = sys.argv[1]
    word_ngram = WordNgram(lang, device)

    texts = ["甚至出现交易停滞的情况", "甚至出现交易停滞的情形", "欲擒故纵", "欲擒故放"]
    for i in range(1):
        scores = word_ngram.score_texts(texts, log_semiring=True)
        print(scores)

result

tensor([-inf, -inf, -inf, -inf], device='cuda:0', dtype=torch.float64)

G.fst.txt:

ngram-count -order 3 -lm aishell.arpa -kndiscount -interpolate -text text

python3 -m kaldilm \
  --read-symbol-table="words.txt" \
  --disambig-symbol='#0' \
  --max-order=3 \
  aishell.arpa > G.fst.txt

The words.txt file has 137079 lines, and nothing abnormal was found.
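
For what it's worth, an all -inf score vector is what you would get if none of the test sentences can be mapped onto G's symbol table (for example a word-level words.txt being queried with unsegmented character strings, or the other way around); that is only a guess. A small check along those lines, with a hypothetical character-level lookup that may not match how score_texts actually tokenizes:

    # Hypothetical sanity check: are the symbols of the test sentences in words.txt?
    with open("words.txt", encoding="utf-8") as f:
        vocab = {line.split()[0] for line in f if line.strip()}

    texts = ["甚至出现交易停滞的情况", "甚至出现交易停滞的情形", "欲擒故纵", "欲擒故放"]
    for text in texts:
        # Assumption: checking per character; if score_texts segments into words,
        # the lookup should be done on the segmented words instead.
        missing = [ch for ch in text if ch not in vocab]
        print(text, "symbols not in words.txt:", missing or "none")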

thanks.

Bigram LM receives supervision only from numerator FSA?

Thanks for this great framework for unified e2e lfmmi training and inference!
I noticed that snowfall has updated the MmiTrainingGraphCompiler in snowfall/training/mmi_graph.py:135 from

ctc_topo_P_vec = k2.create_fsa_vec([ctc_topo_P.detach()])

to a non-detach version

ctc_topo_P_vec = k2.create_fsa_vec([self.ctc_topo_P])

Since in this repo the denominator FSA vec is detached, it seems that the bigram LM FSA parameters can only get supervision (gradient) from the numerator FSA.

I'm not sure if I missed something or the detach operation could be a problem.

Any help or explanation would be appreciated. Thanks!
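
The gradient-flow observation can be reproduced with a toy objective: if the denominator term is built from a detached copy of a parameter, only the numerator term contributes to that parameter's gradient. A minimal sketch with a scalar stand-in for the bigram LM scores (not the actual FSA code):

    import torch

    lm_scores = torch.tensor(0.5, requires_grad=True)  # stand-in for the bigram LM scores

    num = 3.0 * lm_scores           # numerator path keeps the autograd graph
    den = 2.0 * lm_scores.detach()  # denominator path sees a detached copy

    loss = -(num - den)             # MMI-style objective: -(numerator - denominator)
    loss.backward()

    # Only the numerator contributes: grad is -3.0; the denominator's +2.0 is missing.
    print(lm_scores.grad)           # tensor(-3.)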

version issue

Could you please tell me the version of your lhotse?
I have a problem running the decoding benchmark because librosa needs numpy<1.22, while lhotse is compiled against numpy==1.22. The errors are as follows:

2022-02-22 17:47:07,153 (asr_init:162) WARNING: reading model parameters from exp/train_sp_pytorch_8v100_ddp_rnnt_mmi/results_0/model.last91_100.avg.best
RuntimeError: module compiled against API version 0xf but this version of numpy is 0xe
Traceback (most recent call last):
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/../..//bin/asr_recog.py", line 456, in
main(sys.argv[1:])
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/../..//bin/asr_recog.py", line 433, in main
recog(args)
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/espnet/asr/pytorch_backend/asr.py", line 1063, in recog
model, train_args = load_trained_model(args.model, training=False)
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/espnet/asr/pytorch_backend/asr_init.py", line 172, in load_trained_model
model_class = dynamic_import(model_module)
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/espnet/utils/dynamic_import.py", line 22, in dynamic_import
m = importlib.import_module(module_name)
File "/mypath/install_dir/anaconda3/envs/k2/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 843, in exec_module
File "", line 219, in _call_with_frames_removed
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/espnet/nets/pytorch_backend/e2e_asr_transducer.py", line 52, in
from espnet.snowfall.warpper.warpper_mmi import K2MMI
File "/mypath/work/k2/E2E-ASR-Framework/egs/aishell1/espnet/snowfall/warpper/warpper_mmi.py", line 19, in
from lhotse.utils import nullcontext
File "/mypath/install_dir/anaconda3/envs/k2/lib/python3.8/site-packages/lhotse/init.py", line 4, in
from .cut import CutSet, MonoCut
File "/mypath/install_dir/anaconda3/envs/k2/lib/python3.8/site-packages/lhotse/cut.py", line 33, in
from lhotse.features import (
File "/mypath/install_dir/anaconda3/envs/k2/lib/python3.8/site-packages/lhotse/features/init.py", line 1, in
from .base import (
File "/mypath/install_dir/anaconda3/envs/k2/lib/python3.8/site-packages/lhotse/features/base.py", line 19, in
from lhotse.features.io import FeaturesWriter, get_reader
File "/mypath/install_dir/anaconda3/envs/k2/lib/python3.8/site-packages/lhotse/features/io.py", line 7, in
import lilcom
File "/mypath/install_dir/anaconda3/envs/k2/lib/python3.8/site-packages/lilcom/init.py", line 2, in
from .lilcom_interface import compress, decompress, get_shape
File "/mypath/install_dir/anaconda3/envs/k2/lib/python3.8/site-packages/lilcom/lilcom_interface.py", line 3, in
from . import lilcom_extension

Thank you very much.
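
The "API version 0xf ... numpy is 0xe" message means some compiled extension in the import chain (here lilcom, pulled in by lhotse) was built against a newer numpy C API than the numpy that ends up installed. A generic sketch for localizing which package has to be rebuilt against the pinned numpy:

    import numpy

    print("numpy", numpy.__version__)

    # Import the compiled extensions one by one; whichever import raises the
    # "module compiled against API version ..." error is the package that needs
    # to be reinstalled/rebuilt against the numpy version that stays installed.
    for name in ("lilcom", "lhotse"):
        try:
            __import__(name)
            print(name, "imports cleanly")
        except Exception as exc:  # RuntimeError or ImportError depending on versions
            print(name, "failed:", exc)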

about SOTA

Hello jctian98,
Impressive work!
Very happy to see that aishell's SOTA has been refreshed again.
I have some doubts and hope you can help me figure them out.
Why is the aishell-1 result on the Papers with Code leaderboard 4.18%
https://paperswithcode.com/sota/speech-recognition-on-aishell-1
while the result on aishell-1 in your paper is 4.10%?
Is there a difference between the two results?
Is the dev set used during training?
Sincerely hope to get your reply.

How to set up a single machine with multiple GPUs?

Hello, I am trying to train locally with multiple GPUs and I am getting some errors. There are always four GPUs on the local machine, and I want to use two of them for training. I have set the DDP parameters as follows, but some errors occurred. How should I set them?

export HOST_GPU_NUM=2
export HOST_NUM=1
export NODE_NUM=1
export INDEX=0

One GPU can be used for normal training, but multiple GPUs cannot run normally.

error log:
2022-08-30 17:59:56,822 (ctc:138) INFO: CTC input lengths: tensor([140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140], device='cuda:0')
2022-08-30 17:59:56,823 (ctc:143) INFO: CTC output lengths: tensor([23, 18, 22, 23, 20, 22, 22, 22, 22, 21, 21, 24, 22, 21, 17, 23], device='cuda:0')
2022-08-30 17:59:56,823 (ctc:154) INFO: ctc loss:1071.3641357421875
2022-08-30 17:59:57,021 (e2e_asr_transducer:92) INFO: loss:1988.887451171875
2022-08-30 17:59:57,528 (asr:250) INFO: on device cuda:0 grad norm=7541.00927734375
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801883 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
2022-08-30 18:30:01,813 (ctc:138) INFO: CTC input lengths: tensor([123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123], device='cuda:1')
2022-08-30 18:30:01,813 (ctc:143) INFO: CTC output lengths: tensor([22, 21, 22, 23, 19, 21, 20, 20, 21, 26, 20, 21, 22, 24, 14, 24], device='cuda:1')
2022-08-30 18:30:01,814 (ctc:154) INFO: ctc loss:936.232421875
2022-08-30 18:30:02,099 (e2e_asr_transducer:92) INFO: loss:1777.2901611328125
2022-08-30 18:30:02,719 (asr:250) INFO: on device cuda:1 grad norm=6572.220703125
Tue Aug 30 18:30:02 2022 | rank: 1 | | iteration: 0 | gradient applied
2022-08-30 18:30:02,909 (ctc:138) INFO: CTC input lengths: tensor([34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34], device='cuda:1')
2022-08-30 18:30:02,910 (ctc:143) INFO: CTC output lengths: tensor([4, 9, 9, 6, 6, 9, 7, 6, 4, 4, 6, 8, 6, 9, 6, 5], device='cuda:1')
2022-08-30 18:30:02,910 (ctc:154) INFO: ctc loss:262.8006286621094
2022-08-30 18:30:02,972 (e2e_asr_transducer:92) INFO: loss:503.36956787109375
2022-08-30 18:30:03,115 (asr:250) INFO: on device cuda:1 grad norm=1751.3741455078125
/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 147095 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 147080) of binary: /home/miniconda3/envs/lfmmi/bin/python
Traceback (most recent call last):
File "/home/miniconda3/envs/lfmmi/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/miniconda3/envs/lfmmi/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/miniconda3/envs/lfmmi/lib/python3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/miniconda3/envs/lfmmi/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
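
As an aside, the FutureWarning in the log is the deprecated torch.distributed.launch path: with torchrun (or --use_env) the script should read its local rank from the environment instead of a --local_rank argument. A generic sketch of that pattern, not the repo's actual asr_train.py; with two processes on one machine, local ranks 0 and 1 map to cuda:0 and cuda:1:

    import os

    import torch

    # torchrun (and torch.distributed.launch --use_env) export LOCAL_RANK per process.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
    print(f"local_rank {local_rank} -> {device}")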

env question

I followed the instructions, but there is always a version conflict. How can I fix it?

[root@b394acbc6baf tools]# make TH_VERSION=1.7.1 CUDA_VERSION=10.1
CUDA_VERSION=10.1
PYTHON=/anaconda/envs/lfmmi/bin/python3
PYTHON_VERSION=Python 3.9.15
USE_CONDA=1
TH_VERSION=1.7.1
WITH_OMP=ON
. ./activate_python.sh && ./installers/install_torch.sh "true" "1.7.1" "10.1"
2022-12-07T05:51:42 (install_torch.sh:149:main) [INFO] python_version=3.9.15
2022-12-07T05:51:42 (install_torch.sh:150:main) [INFO] torch_version=1.7.1
2022-12-07T05:51:42 (install_torch.sh:151:main) [INFO] cuda_version=10.1
2022-12-07T05:51:43 (install_torch.sh:97:install_torch) conda install -y pytorch=1.7.1 torchaudio=0.7.2 cudatoolkit=10.1 -c pytorch
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: |
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package pytorch conflicts for:
torchaudio=0.7.2 -> pytorch==1.7.1
pytorch=1.7.1

Package _openmp_mutex conflicts for:
pytorch=1.7.1 -> libgcc-ng[version='>=7.3.0'] -> _openmp_mutex[version='>=4.5']
python=3.9 -> libgcc-ng[version='>=11.2.0'] -> _openmp_mutex[version='>=4.5']

Package cudatoolkit conflicts for:
torchaudio=0.7.2 -> pytorch==1.7.1 -> cudatoolkit[version='>=10.1,<10.2|>=11.0,<11.1|>=10.2,<10.3|>=9.2,<9.3']
cudatoolkit=10.1

The following specifications were found to be incompatible with your system:

  • feature:/linux-64::__glibc==2.17=0
  • feature:|@/linux-64::__glibc==2.17=0
  • python=3.9 -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']

Your installed version is: 2.17

make: *** [Makefile:102: pytorch.done] Error 1
