
Comments (14)

yuekaizhang commented on August 22, 2024

> Which torch/CUDA version did you use for the test? With the training settings above (wenetspeech L), one step takes around 0.5 s on both; the H100 is slightly faster, but 0.36 s is never reached.

I am using torch 2.3.1 (host driver version 550.54.15, CUDA version 12.4) with this Dockerfile: https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/Dockerfile/Dockerfile.sensevoice

You could use the pre-built image here:

docker pull soar97/triton-sensevoice:24.05
pip install k2==1.24.4.dev20240606+cuda12.1.torch2.3.1 -f https://k2-fsa.github.io/k2/cuda.html
pip install -r icefall/requirements.txt
pip install lhotse
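One gotcha with the pip line above: the k2 wheel is built against a specific CUDA/torch pair (here cuda12.1.torch2.3.1), and it has to match the installed torch. As a quick sanity check (a sketch, not part of the recipe; the helper name is mine), the versions can be parsed out of the wheel tag and compared against the running torch:

```python
import re

def k2_wheel_versions(wheel_version: str):
    """Extract the (cuda, torch) versions encoded in a k2 wheel tag,
    e.g. '1.24.4.dev20240606+cuda12.1.torch2.3.1' -> ('12.1', '2.3.1')."""
    m = re.search(r"cuda(\d+(?:\.\d+)*)\.torch(\d+(?:\.\d+)*)", wheel_version)
    if m is None:
        raise ValueError(f"not a k2 wheel tag: {wheel_version!r}")
    return m.group(1), m.group(2)

cuda_v, torch_v = k2_wheel_versions("1.24.4.dev20240606+cuda12.1.torch2.3.1")
print(cuda_v, torch_v)  # 12.1 2.3.1
# To compare against the running environment (requires torch installed):
#   import torch
#   assert torch.__version__.startswith(torch_v)
#   assert torch.version.cuda == cuda_v
```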

huggingface-cli download  --repo-type dataset --local-dir /your_icefall/egs/aishell/ASR/data yuekai/aishell_icefall_fbank
./zipformer/train.py \
  --world-size 1 \
  --num-epochs 30 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --num-workers 16

If you are willing to follow the steps above to try it on aishell 1, it would be very helpful. This way, we can use almost identical environments and datasets. For aishell, you just need to follow the command to download the pre-extracted features I prepared, and you can start training.

Since the wenetspeech dataset is relatively large, reproducing it directly would be time-consuming for me. If you can obtain similar conclusions to mine on aishell 1 and then find that the H100 is slower on wenetspeech, I can try using wenetspeech to test it.

However, don't worry. Even if you achieve the same acceleration ratio as I did, I will still check the performance to see if there are any areas in the overall pipeline that can be further accelerated.

from icefall.

yuekaizhang commented on August 22, 2024

Hi @SongLi89, thank you for raising this issue. I will help check if there are any performance bottlenecks. Will reply here with any updates.


XhrLeokk commented on August 22, 2024

Nice plot; surprising to see that the gap is that close.
Seems weird. 🤔


Ziyi6 commented on August 22, 2024

Met the same problem. We're also using A100 and H100 servers, and surprisingly the H100 isn't as fast as we expected, which seems clearly abnormal. At least the price we paid didn't bring a significant speed improvement. I suspect something must be set in the training code to use the H100 more efficiently?

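A general PyTorch note worth ruling out here (not a claim that icefall misses this): on Ampere/Hopper, float32 matmuls only use the fast tensor cores when TF32 is allowed, and cuDNN autotuning also affects throughput. A minimal sketch of the knobs to check; the helper name is mine, and it simply returns None when torch is not installed:

```python
def h100_friendly_settings():
    """Apply PyTorch precision/autotune knobs that affect tensor-core use
    on Ampere/Hopper GPUs. Returns the dict of settings applied, or None
    when torch is not installed."""
    settings = {
        "torch.backends.cuda.matmul.allow_tf32": True,  # TF32 for fp32 matmuls
        "torch.backends.cudnn.allow_tf32": True,        # TF32 inside cuDNN
        "torch.backends.cudnn.benchmark": True,         # autotune conv algorithms
    }
    try:
        import torch
    except ImportError:
        return None
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    return settings

print(h100_friendly_settings())
```

Note that the recipe above already trains with --use-fp16, so these flags mainly affect whatever parts of the pipeline still run in float32.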

rambowu11 commented on August 22, 2024

Marking this; we have a plan to buy H100 GPUs.


SongLi89 commented on August 22, 2024

> (Of course, the H100 has more memory than the A100, and we can use a larger "maximum duration". This test is just to compare the training performance of the two GPUs.)
>
> Also, could you tell me the specific specifications of your GPUs? The A100 80GB and H100 80GB have the same memory size.

Hi yuekai, thanks for the rapid reply.
The A100 has 40 GB of memory, while the H100 has 80 GB. Below are the two screenshots.
[screenshots: nvidia-smi output for the A100 40 GB and the H100 80 GB]

Which torch/CUDA version did you use for the test? With the training settings above (wenetspeech L), one step takes around 0.5 s on both; the H100 is slightly faster, but 0.36 s is never reached.

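Since the two cards differ in memory (40 GB vs 80 GB), details like this are easier to record from nvidia-smi's CSV query mode than from screenshots. A small sketch (the parsing helper and the sample strings are mine, for illustration) that reads the output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`:

```python
import csv
import io

def parse_gpu_query(csv_text: str):
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`
    output into a list of (name, total_memory) tuples."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [(name, mem) for name, mem in rows]

# Illustrative sample output; run nvidia-smi yourself for real numbers.
sample = "NVIDIA A100-PCIE-40GB, 40960 MiB\nNVIDIA H100 80GB HBM3, 81559 MiB\n"
print(parse_gpu_query(sample))
```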

yuekaizhang commented on August 22, 2024

> However, we found that training speeds were not significantly improved with the more expensive one (H100).

@SongLi89
Could you tell me the specific comparison results of the training speed in your tests?

I am trying to reproduce your issue.

On the A100 it takes me about 0.6 seconds per step, and on the H100 about 0.36 seconds per step (by checking the log files).

I am not sure if this speed ratio is similar to yours? (I used the aishell1 dataset, where the sentence lengths are slightly shorter, but the max-duration setting is the same as yours.)

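The per-step numbers above come from the training logs: train.py emits timestamped lines with a batch counter, so average step time can be estimated by differencing the timestamps of batch-log lines. A sketch under that assumption (the exact log format may differ between icefall versions; adjust the regex accordingly):

```python
import re
from datetime import datetime

# Assumed log-line shape: a '%Y-%m-%d %H:%M:%S,%f' prefix plus "batch N" later on.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*batch (\d+)")

def seconds_per_step(lines):
    """Estimate average seconds per training step from timestamped batch logs."""
    points = []
    for line in lines:
        m = TS_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            points.append((ts, int(m.group(2))))
    if len(points) < 2:
        raise ValueError("need at least two batch log lines")
    (t0, b0), (t1, b1) = points[0], points[-1]
    return (t1 - t0).total_seconds() / (b1 - b0)

sample = [
    "2024-08-22 10:00:00,000 INFO [train.py:900] Epoch 1, batch 0, loss ...",
    "2024-08-22 10:00:30,000 INFO [train.py:900] Epoch 1, batch 50, loss ...",
]
print(seconds_per_step(sample))  # 0.6
```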

yuekaizhang commented on August 22, 2024

(Of course, the H100 has more memory than the A100, and we can use a larger "maximum duration". This test is just to compare the training performance of the two GPUs.)

Also, could you tell me the specific specifications of your GPUs? The A100 80GB and H100 80GB have the same memory size.


SongLi89 commented on August 22, 2024

> I am using torch 2.3.1 […] You could use the pre-built image here […] If you are willing to follow the steps above to try it on aishell 1, it would be very helpful. […] I will still check the performance to see if there are any areas in the overall pipeline that can be further accelerated.

Thanks a lot, I will try it.


SongLi89 commented on August 22, 2024

> I am using torch 2.3.1 […] You could use the pre-built image here […] If you are willing to follow the steps above to try it on aishell 1, it would be very helpful. […] I will still check the performance to see if there are any areas in the overall pipeline that can be further accelerated.

Hi yuekai, I tried with your environment and got a similar acceleration ratio. Thanks a lot! It would still be great if the performance could be improved further; if you have any ideas to speed up the training, please let me know.

