
Comments (14)

yuekaizhang commented on August 22, 2024

> Which torch/CUDA version did you use for the test? With the training settings above (wenetspeech L), one step takes around 0.5 s on both; the H100 is slightly faster, but 0.36 s is never reached.

I am using torch 2.3.1 (host driver version 550.54.15, CUDA version 12.4) with this Dockerfile: https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/Dockerfile/Dockerfile.sensevoice

You could use the pre-built image here:

docker pull soar97/triton-sensevoice:24.05
pip install k2==1.24.4.dev20240606+cuda12.1.torch2.3.1 -f https://k2-fsa.github.io/k2/cuda.html
pip install -r icefall/requirements.txt
pip install lhotse
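One gotcha with the pip line above: the k2 wheel is built against a specific CUDA/torch pair (here cuda12.1.torch2.3.1), and it has to match the installed torch. As a quick sanity check (a sketch, not part of the recipe; the helper name is mine), the versions can be parsed out of the wheel tag and compared against the running torch:

```python
import re

def k2_wheel_versions(wheel_version: str):
    """Extract the (cuda, torch) versions encoded in a k2 wheel tag,
    e.g. '1.24.4.dev20240606+cuda12.1.torch2.3.1' -> ('12.1', '2.3.1')."""
    m = re.search(r"cuda(\d+(?:\.\d+)*)\.torch(\d+(?:\.\d+)*)", wheel_version)
    if m is None:
        raise ValueError(f"not a k2 wheel tag: {wheel_version!r}")
    return m.group(1), m.group(2)

cuda_v, torch_v = k2_wheel_versions("1.24.4.dev20240606+cuda12.1.torch2.3.1")
print(cuda_v, torch_v)  # 12.1 2.3.1
# To compare against the running environment (requires torch installed):
#   import torch
#   assert torch.__version__.startswith(torch_v)
#   assert torch.version.cuda == cuda_v
```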

huggingface-cli download  --repo-type dataset --local-dir /your_icefall/egs/aishell/ASR/data yuekai/aishell_icefall_fbank
./zipformer/train.py \
  --world-size 1 \
  --num-epochs 30 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --num-workers 16

If you are willing to follow the steps above to try it on aishell 1, it would be very helpful. This way, we can use almost identical environments and datasets. For aishell, you just need to follow the command to download the pre-extracted features I prepared, and you can start training.

Since the wenetspeech dataset is relatively large, reproducing it directly would be time-consuming for me. If you can obtain similar conclusions to mine on aishell 1 and then find that the H100 is slower on wenetspeech, I can try using wenetspeech to test it.

However, don't worry. Even if you achieve the same acceleration ratio as I did, I will still check the performance to see if there are any areas in the overall pipeline that can be further accelerated.

from icefall.

yuekaizhang commented on August 22, 2024

Hi @SongLi89, thank you for raising this issue. I will help check if there are any performance bottlenecks. Will reply here with any updates.


XhrLeokk commented on August 22, 2024

Nice plot; surprising to see that the gap is that close.
Seems weird. 🤔


Ziyi6 commented on August 22, 2024

Met the same problem. We're also using A100 and H100 servers, and surprisingly the H100 isn't as fast as we expected, which seems clearly abnormal. At least the price we paid didn't bring a significant speed improvement. I suspect something must be set in the training code to use the H100 more efficiently?

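A general PyTorch note worth ruling out here (not a claim that icefall misses this): on Ampere/Hopper, float32 matmuls only use the fast tensor cores when TF32 is allowed, and cuDNN autotuning also affects throughput. A minimal sketch of the knobs to check; the helper name is mine, and it simply returns None when torch is not installed:

```python
def h100_friendly_settings():
    """Apply PyTorch precision/autotune knobs that affect tensor-core use
    on Ampere/Hopper GPUs. Returns the dict of settings applied, or None
    when torch is not installed."""
    settings = {
        "torch.backends.cuda.matmul.allow_tf32": True,  # TF32 for fp32 matmuls
        "torch.backends.cudnn.allow_tf32": True,        # TF32 inside cuDNN
        "torch.backends.cudnn.benchmark": True,         # autotune conv algorithms
    }
    try:
        import torch
    except ImportError:
        return None
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    return settings

print(h100_friendly_settings())
```

Note that the recipe above already trains with --use-fp16, so these flags mainly affect whatever parts of the pipeline still run in float32.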

rambowu11 commented on August 22, 2024

Marking this; we have a plan to buy H100 GPUs.


SongLi89 commented on August 22, 2024

> (Of course, the H100 has more memory than the A100, and we can use a larger "maximum duration". This test is just to compare the training performance of the two GPUs.)
>
> Also, could you tell me the specific specifications of your GPUs? The A100 80GB and H100 80GB have the same memory size.

Hi yuekai, thanks for the rapid reply.
The A100 has 40 GB of memory, while the H100 has 80 GB. Below are the two screenshots.
[screenshots: nvidia-smi output for the A100 40 GB and the H100 80 GB]

Which torch/CUDA version did you use for the test? With the training settings above (wenetspeech L), one step takes around 0.5 s on both; the H100 is slightly faster, but 0.36 s is never reached.

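Since the two cards differ in memory (40 GB vs 80 GB), details like this are easier to record from nvidia-smi's CSV query mode than from screenshots. A small sketch (the parsing helper and the sample strings are mine, for illustration) that reads the output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`:

```python
import csv
import io

def parse_gpu_query(csv_text: str):
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`
    output into a list of (name, total_memory) tuples."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [(name, mem) for name, mem in rows]

# Illustrative sample output; run nvidia-smi yourself for real numbers.
sample = "NVIDIA A100-PCIE-40GB, 40960 MiB\nNVIDIA H100 80GB HBM3, 81559 MiB\n"
print(parse_gpu_query(sample))
```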

yuekaizhang commented on August 22, 2024

> However, we found that training speeds were not significantly improved with the more expensive one (H100).

@SongLi89
Could you tell me the specific comparison results of the training speed in your tests?

I am trying to reproduce your issue.

On the A100 it takes me about 0.6 seconds per step, and on the H100 about 0.36 seconds per step (by checking the log files).

I am not sure if this speed ratio is similar to yours? (I used the aishell1 dataset, where the sentence lengths are slightly shorter, but the max-duration setting is the same as yours.)

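The per-step numbers above come from the training logs: train.py emits timestamped lines with a batch counter, so average step time can be estimated by differencing the timestamps of batch-log lines. A sketch under that assumption (the exact log format may differ between icefall versions; adjust the regex accordingly):

```python
import re
from datetime import datetime

# Assumed log-line shape: a '%Y-%m-%d %H:%M:%S,%f' prefix plus "batch N" later on.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*batch (\d+)")

def seconds_per_step(lines):
    """Estimate average seconds per training step from timestamped batch logs."""
    points = []
    for line in lines:
        m = TS_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            points.append((ts, int(m.group(2))))
    if len(points) < 2:
        raise ValueError("need at least two batch log lines")
    (t0, b0), (t1, b1) = points[0], points[-1]
    return (t1 - t0).total_seconds() / (b1 - b0)

sample = [
    "2024-08-22 10:00:00,000 INFO [train.py:900] Epoch 1, batch 0, loss ...",
    "2024-08-22 10:00:30,000 INFO [train.py:900] Epoch 1, batch 50, loss ...",
]
print(seconds_per_step(sample))  # 0.6
```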

yuekaizhang commented on August 22, 2024

(Of course, the H100 has more memory than the A100, and we can use a larger "maximum duration". This test is just to compare the training performance of the two GPUs.)

Also, could you tell me the specific specifications of your GPUs? The A100 80GB and H100 80GB have the same memory size.


SongLi89 commented on August 22, 2024

> I am using torch 2.3.1 […] You could use the pre-built image here […] If you are willing to follow the steps above to try it on aishell 1, it would be very helpful. […] I will still check the performance to see if there are any areas in the overall pipeline that can be further accelerated.

Thanks a lot, I will try it.


SongLi89 commented on August 22, 2024

> I am using torch 2.3.1 […] You could use the pre-built image here […] If you are willing to follow the steps above to try it on aishell 1, it would be very helpful. […] I will still check the performance to see if there are any areas in the overall pipeline that can be further accelerated.

Hi yuekai, I tried with your environment and got a similar acceleration ratio. Thanks a lot! It would still be great if the performance could be improved further; if you have any ideas to speed up the training, please let me know.

