
Comments (14)

pkufool avatar pkufool commented on July 21, 2024

I guess for someone with root access it is relatively easy to install

I think none of our dependencies needs root access.

Is there any plan on making the installation process easier?

You can use docker, see https://github.com/k2-fsa/icefall/tree/master/docker

We normally install dependencies with pip; conda is not recommended.

As for items 5 and 6, we will have a look and fix them.

from icefall.

pehonnet avatar pehonnet commented on July 21, 2024

Using python3.10, the installation appears to complete (following the steps of the installation guide). However, when trying to run the yesno training, I get the following:

2024-03-20 10:31:29,424 INFO [asr_datamodule.py:255] About to get test cuts
Could not load library libcudnn_cnn_train.so.8. Error: /home/pe.honnet/Projects/tl_icefall/cuda/cuda-12.1.0/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
Could not load library libcudnn_cnn_train.so.8. Error: /home/pe.honnet/Projects/tl_icefall/cuda/cuda-12.1.0/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
...
Traceback (most recent call last):
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 575, in <module>
    main()
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 571, in main
    run(rank=0, world_size=1, args=args)
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 536, in run
    train_one_epoch(
  File "/home/pe.honnet/Projects/tl_icefall/egs/yesno/ASR/./tdnn/train.py", line 417, in train_one_epoch
    loss.backward()
  File "/home/pe.honnet/Projects/tl_icefall/venv/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/pe.honnet/Projects/tl_icefall/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: GET was unable to find an engine to execute this computation

I also checked that the cuda version I installed locally should be compatible with the installed drivers:

# got this installer cuda_12.1.0_530.30.02_linux.run
nvidia-smi 
Wed Mar 20 10:44:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
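The driver check described above can be sketched in a few lines. This is a minimal sketch, not part of the icefall or k2 tooling: the function names are hypothetical, and it assumes that the "CUDA Version" field in the nvidia-smi header is the highest CUDA version the installed driver supports. It compares only major.minor, so it is a conservative check.

```python
import re


def parse_smi_header(text):
    """Return (driver_version, max_cuda_version) from nvidia-smi header text, or None."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


def major_minor(version):
    """Turn '12.1.0' into (12, 1) so that only major.minor is compared."""
    return tuple(int(part) for part in version.split(".")[:2])


def toolkit_fits_driver(toolkit_version, max_cuda_version):
    """True if the toolkit is not newer than the CUDA version the driver reports."""
    return major_minor(toolkit_version) <= major_minor(max_cuda_version)


# Header line copied from the nvidia-smi output above.
header = "| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |"
driver, max_cuda = parse_smi_header(header)
print(driver, max_cuda, toolkit_fits_driver("12.1.0", max_cuda))
```

A passing check only says the driver is new enough for the toolkit; it says nothing about whether the cuDNN libraries on the library path match each other.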


JinZr avatar JinZr commented on July 21, 2024


pehonnet avatar pehonnet commented on July 21, 2024

Another attempt, using cuda 11.8 instead and what looks like a successful installation:

$ ./tdnn/train.py 
2024-03-20 11:07:31,740 INFO [train.py:481] Training started
2024-03-20 11:07:31,740 INFO [train.py:482] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lr': 0.01, 'feature_dim': 23, 'weight_decay': 1e-06, 'start_epoch': 0, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 20, 'valid_interval': 10, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'seed': 42, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30.0, 'bucketing_sampler': False, 'num_buckets': 10, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': False, 'return_cuts': True, 'num_workers': 2, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.23.0.dev+git.d3106cf.clean', 'torch-version': '2.2.1+cu118', 'torch-cuda-available': True, 'torch-cuda-version': '11.8', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'ea92fc3-clean', 'icefall-git-date': 'Tue Mar 19 14:57:16 2024', 'icefall-path': '/home/pe.honnet/Projects/tl_icefall', 'k2-path': '/home/pe.honnet/Projects/tl_icefall/venv2/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/home/pe.honnet/Projects/tl_icefall/venv2/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'tlzhsrv010', 'IP address': '127.0.1.1'}}
2024-03-20 11:07:31,741 INFO [lexicon.py:168] Loading pre-compiled data/lang_phone/Linv.pt
2024-03-20 11:07:31,742 INFO [train.py:495] device: cuda:0
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:146] About to get train cuts
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:247] About to get train cuts
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:149] About to create train dataset
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:201] Using SimpleCutSampler.
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:207] About to create train dataloader
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:220] About to get test cuts
2024-03-20 11:07:33,017 INFO [asr_datamodule.py:255] About to get test cuts
Segmentation fault (core dumped)


csukuangfj avatar csukuangfj commented on July 21, 2024

For the segmentation fault, please see
#674


By the way, you are the first one in the past 6 months with so many issues setting up the icefall environment.

It would be great if you could tell us the exact commands you have run and whether you have strictly followed
the installation doc for both k2 and icefall.


pehonnet avatar pehonnet commented on July 21, 2024

For the installation based on cuda 11.8 here is the full history:

python3.10 -m venv venv2
source venv2/bin/activate
cd cuda/
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
chmod +x cuda_11.8.0_520.61.05_linux.run
./cuda_11.8.0_520.61.05_linux.run --silent --toolkit --installpath=$PWD/cuda-11.8.0 --no-opengl-libs --no-drm --no-man-page
wget https://huggingface.co/csukuangfj/cudnn/resolve/main/cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz
tar xvf cudnn-linux-x86_64-8.9.1.23_cuda11-archive.tar.xz --strip-components=1 -C $PWD/cuda-11.8.0
cd ..
cp activate-cuda-12.1.sh activate-cuda-11.8.sh
nano activate-cuda-11.8.sh
source activate-cuda-11.8.sh
which nvcc
nvcc --version
pip3 install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
cd k2_wheel/
wget https://huggingface.co/csukuangfj/k2/resolve/main/ubuntu-cuda/k2-1.24.4.dev20240223+cuda11.8.torch2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
ls
pip install k2-1.24.4.dev20240223+cuda11.8.torch2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
cd ..
pip install git+https://github.com/lhotse-speech/lhotse
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
cd egs/yesno/ASR/
./prepare.sh
./tdnn/train.py
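Since the k2 wheel has to match the installed torch, CUDA, and Python versions, a quick consistency check on the wheel file name can rule out one class of mistakes. The following is a minimal sketch: the helper names are hypothetical, and the file-name pattern is inferred only from the two wheel names that appear in this thread, so it may not cover every k2 wheel.

```python
import re

# Pattern inferred from names such as
# k2-1.24.4.dev20240223+cuda11.8.torch2.2.1-cp310-cp310-manylinux_...whl
WHEEL_RE = re.compile(
    r"\+cuda(?P<cuda>\d+\.\d+)\.torch(?P<torch>\d+(?:\.\d+)*)-(?P<py>cp\d+)-"
)


def parse_k2_wheel(filename):
    """Extract the CUDA, torch and CPython tags from a k2 wheel file name."""
    match = WHEEL_RE.search(filename)
    if match is None:
        raise ValueError(f"unrecognised k2 wheel name: {filename}")
    return match.groupdict()


def wheel_mismatches(filename, torch_version, python_tag):
    """Return a list of human-readable mismatches (empty if everything agrees).

    torch_version is the string reported by torch, e.g. '2.2.1+cu118';
    python_tag is e.g. 'cp310'.
    """
    tags = parse_k2_wheel(filename)
    base, _, local = torch_version.partition("+")
    problems = []
    if tags["torch"] != base:
        problems.append(f"torch {base} installed, wheel built for {tags['torch']}")
    expected_local = "cu" + tags["cuda"].replace(".", "")
    if local and local != expected_local:
        problems.append(f"torch build {local}, wheel built for {expected_local}")
    if tags["py"] != python_tag:
        problems.append(f"python {python_tag}, wheel built for {tags['py']}")
    return problems


wheel = (
    "k2-1.24.4.dev20240223+cuda11.8.torch2.2.1-cp310-cp310-"
    "manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(wheel_mismatches(wheel, "2.2.1+cu118", "cp310"))
```

In a live environment the second argument would come from `torch.__version__` and the third from `sys.version_info`. For the commands above, the wheel and the torch build agree, which is consistent with the reply that the commands look good.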

For the other case (cuda 12.1), I used the same approach, following https://k2-fsa.github.io/k2/installation/cuda-cudnn.html#cuda-12-1 and using the corresponding k2 wheel. Here is a new attempt (with the same error as reported in my previous comment):

python3.10 -m venv venv3
source venv3/bin/activate
cd cuda/
./cuda_12.1.0_530.30.02_linux.run --silent --toolkit --installpath=$PWD/cuda-12.1.0 --no-opengl-libs --no-drm --no-man-page
tar xvf cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz --strip-components=1 -C $PWD/cuda-12.1.0
cd ..
source activate-cuda-12.1.sh
pip install torch torchaudio
cd k2_wheel
pip install k2-1.24.4.dev20240301+cuda12.1.torch2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
pip install git+https://github.com/lhotse-speech/lhotse
cd ..
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
cd egs/yesno/ASR/
./prepare.sh
./tdnn/train.py


csukuangfj avatar csukuangfj commented on July 21, 2024

Thanks! Your commands look good.

tried and failed to install k2 from source (first because of the issue I mentioned in 1. but then after adding that it still fails with Could NOT find CUDAToolkit (missing: CUDAToolkit_INCLUDE_DIR) (found version "12.1.66")

Could you give the complete error logs for this?


installing lhotse which I had not had issues before. Here I get another error related to lilcom

Could you give the complete error logs for this?


pehonnet avatar pehonnet commented on July 21, 2024

OK, so I retried the same thing (i.e. installing k2 from source). I followed the instructions in https://k2-fsa.github.io/k2/installation/cuda-cudnn.html#set-environment-variables-for-cuda-12-1
and then ran:

git clone https://github.com/k2-fsa/k2.git
cd k2
export K2_MAKE_ARGS="-j6"
python3 setup.py install

I first get the error in the log file error_k2_from_source_1.log.

I fixed it by adding this line to the activate-cuda script:

export CUDAToolkit_INCLUDE_DIR=$CUDA_HOME/targets/x86_64-linux/include

Then, trying to install k2 from source again, I get the error in the log file error_k2_from_source_2.log:

-- Unable to find cuda_runtime.h in "/home/pe.honnet/Projects/tl_icefall/cuda/cuda-12.1.0/include" for CUDAToolkit_INCLUDE_DIR.
-- Unable to find cublas_v2.h in either "" or "/home/pe.honnet/Projects/tl_icefall/math_libs/include"

although in $CUDAToolkit_INCLUDE_DIR there is cuda_runtime.h and cublas_v2.h.
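To see which include directory actually holds the headers that CMake complains about, one can list the candidates directly. This is a minimal sketch with a hypothetical helper name; the two candidate directories are the ones named in this comment (`include` from the error message and `targets/x86_64-linux/include` from the export line), not an exhaustive list of where CMake looks.

```python
import os


def find_header_dirs(cuda_home, headers=("cuda_runtime.h", "cublas_v2.h")):
    """Map each header name to the candidate include directories that contain it."""
    candidates = [
        os.path.join(cuda_home, "include"),
        os.path.join(cuda_home, "targets", "x86_64-linux", "include"),
    ]
    found = {}
    for header in headers:
        found[header] = [
            directory
            for directory in candidates
            if os.path.isfile(os.path.join(directory, header))
        ]
    return found


# CUDA_HOME is expected to be set by the activate-cuda script; fall back to an
# empty result if it is not.
for name, dirs in find_header_dirs(os.environ.get("CUDA_HOME", "")).items():
    print(name, "->", dirs or "not found")
```

If a header shows up only under `targets/x86_64-linux/include` and not under `include`, that would match the message in error_k2_from_source_2.log, which reports searching the `include` directory.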

error_k2_from_source_1.log
error_k2_from_source_2.log


pehonnet avatar pehonnet commented on July 21, 2024

Regarding the lhotse issue, it was the lilcom issue solved by danpovey/lilcom#50 (comment)
The reason was that I was first creating a conda environment and then creating a virtualenv inside it, because python3.10-venv was not installed on the system. I have since asked the admin to add it and got rid of the conda workaround.


pehonnet avatar pehonnet commented on July 21, 2024

Regarding the seg fault, I am not able to solve it with the solution you shared - the data preparation works, but it fails in the training.


csukuangfj avatar csukuangfj commented on July 21, 2024

Regarding the seg fault, I am not able to solve it with the solution you shared - the data preparation works, but it fails in the training.

Are you able to use the command in the above posted link to find out more?
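Independently of the command in the linked issue, Python's standard `faulthandler` module can show which Python line was executing when the process crashed. This is a minimal sketch; the helper name and the log path are hypothetical, and it only reports the Python-level stack, not the native frames a debugger would show.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Write the Python traceback of all threads to log_path on a fatal signal.

    The returned file object must stay open for as long as the handler is
    enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this at the top of the training script, or running the script as `python3 -X faulthandler ./tdnn/train.py`, prints the Python stack at the moment of the segmentation fault.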


pehonnet avatar pehonnet commented on July 21, 2024

Sure, this is the output log (with the cuda 11.8 environment; with cuda 12.1 I get the error I reported before):
gdb_output.log


pehonnet avatar pehonnet commented on July 21, 2024

Here is a finding from this comment: https://discuss.pytorch.org/t/could-not-load-library-libcudnn-cnn-train-so-8-but-im-sure-that-i-have-set-the-right-ld-library-path/190277/3
I simply removed the link libcudnn_cnn_train.so.8 in the folder .../cuda-12.1.0/lib and now seem to be able to run the tdnn/train.py script.
It is surprising that no one else got that error before (I had the same error on two different servers with old and recent GPUs).
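One way to check for this situation before deleting anything is to look for cuDNN libraries that exist in more than one directory. This is a minimal sketch with hypothetical helper names. It assumes, without confirmation from this thread, that the error comes from two different cuDNN copies being visible at once, for example the locally extracted one on `LD_LIBRARY_PATH` and one that pip may have installed next to torch.

```python
import glob
import os


def ld_library_dirs(env=os.environ):
    """Return the non-empty entries of LD_LIBRARY_PATH, in search order."""
    return [d for d in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]


def find_cudnn_copies(directories):
    """Map each libcudnn*.so* file name to the directories that contain it."""
    copies = {}
    for directory in directories:
        pattern = os.path.join(directory, "libcudnn*.so*")
        for path in sorted(glob.glob(pattern)):
            copies.setdefault(os.path.basename(path), []).append(directory)
    return copies


def duplicated(copies):
    """Keep only the libraries that were found in more than one directory."""
    return {name: dirs for name, dirs in copies.items() if len(dirs) > 1}


print(duplicated(find_cudnn_copies(ld_library_dirs())))
```

To include a pip-installed copy in the scan, append its directory to the list. A typical location would be `nvidia/cudnn/lib` under the virtualenv's `site-packages`, but that path is an assumption and should be verified on the machine in question.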


pehonnet avatar pehonnet commented on July 21, 2024

@csukuangfj I am closing this issue as in the end I was able to make it work, but I think that my last comment (about removing libcudnn_cnn_train.so.8) may be something to keep in mind as other people will probably face it too.

