Comments (17)
@candlewill Are you running it with Python 3, or which Python version are you using?
I had a problem where I couldn't use the GPUs, and my guess is that it was because I was using Python 2.7; see closed issue #5 .
See if you can find anything useful there. Otherwise, if you manage to solve it, please let us know how you did.
from tacotron.
@basuam I am using Python 3.6.0 with Anaconda 4.3.1 (64-bit) and the GPU version of TensorFlow (1.1).
During training, two GPUs are allocated, but only one is used for computation.
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 19759 C /home/train01/heyunchao/anaconda3/bin/python 21912MiB |
| 3 19759 C /home/train01/heyunchao/anaconda3/bin/python 21794MiB |
+-----------------------------------------------------------------------------+
Only GPU 2 is used for computation, and its GPU-Util stays at 0% for long stretches.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M40 24GB On | 0000:02:00.0 Off | 0 |
| N/A 19C P8 18W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M40 24GB On | 0000:03:00.0 Off | 0 |
| N/A 21C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M40 24GB On | 0000:83:00.0 Off | 0 |
| N/A 37C P0 65W / 250W | 21916MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M40 24GB On | 0000:84:00.0 Off | 0 |
| N/A 32C P0 57W / 250W | 21798MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
@candlewill I believe the code needs a slight modification to use both GPUs, because that isn't TensorFlow's default. I can't confirm this, since I've never trained it with more than one GPU, but I believe you must assign both devices manually first; otherwise tensorflow-gpu simply computes on one GPU. Have you trained other networks before without explicitly declaring which GPUs to use, and checked whether it used them all?
To my understanding, this may be the reason: if you don't explicitly declare how to allocate multiple GPUs, TensorFlow picks the first visible GPU for computation by default, but still reserves memory on all GPUs.
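One hedged way to work around that default (a minimal sketch, not the repo's code) is to hide all but one card from TensorFlow before it is imported; the GPU index "2" below just mirrors the nvidia-smi output above:

```python
import os

# Make only GPU 2 visible to TensorFlow. This must be set *before*
# tensorflow is imported, or the process will still map memory on
# every card it can see.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# With TF 1.x you could additionally keep the allocator from grabbing
# all memory up front (left commented here, since it needs tensorflow
# installed):
# import tensorflow as tf
# config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
# sess = tf.Session(config=config)
```

Inside that process, the one visible card then shows up as device 0.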
candlewill's explanation is correct. I added train_multi_gpus.py
for using multiple GPUs.
@basuam @candlewill Would you run and check the file? In my environment (3 × GTX 1080), the time per epoch has decreased to almost 1/3. But I'm not sure it's error-free, since this is the first time I've written code for multiple GPUs.
@Kyubyong In the current train.py
code, training runs entirely on the CPU (see here ). I commented out this line to allow the use of one GPU.
Then I compared the per-epoch time of train_multi_gpus.py
against train.py
. I found that the multi-GPU version takes longer, about 220 seconds per epoch, while the single-GPU version takes about 110 seconds.
My experimental environment has four Tesla K40m GPUs.
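The per-epoch numbers quoted in this thread can be obtained with a trivial wall-clock timer; a minimal sketch, where run_epoch is a hypothetical stand-in for one full pass over the training data:

```python
import time

def time_epoch(run_epoch):
    """Return the wall-clock seconds one training epoch takes."""
    start = time.time()
    run_epoch()           # e.g. a lambda wrapping the training loop body
    return time.time() - start

# Usage (hypothetical): duration = time_epoch(lambda: train_one_epoch(sess))
```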
Did you run train_multi_gpus.py?
Yes. It takes longer, about 220 seconds per epoch.
You changed the value of num_gpus in hyperparams.py, didn't you?
Yes, I changed the value to 4.
One possibility is the batch size. If you have 4 GPUs, you have to multiply hp.batch_size by 4 for a fair comparison. If you look at the code, the mini-batch is split into 4 parts, each of which is fed to its own GPU tower.
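That tower split can be sketched in plain Python (the names num_gpus and batch_size mirror the assumed hyperparams.py fields; the dummy batch stands in for real training samples):

```python
# Each step's mini-batch is divided evenly across the GPU towers, so
# batch_size must grow by a factor of num_gpus for a fair single- vs
# multi-GPU comparison.
num_gpus = 4
single_gpu_batch_size = 32
batch_size = single_gpu_batch_size * num_gpus   # 128 samples per step

batch = list(range(batch_size))                 # dummy mini-batch
per_tower = batch_size // num_gpus
shards = [batch[i * per_tower:(i + 1) * per_tower] for i in range(num_gpus)]

assert len(shards) == num_gpus                    # one shard per tower
assert all(len(s) == single_gpu_batch_size for s in shards)
```

With the unscaled batch size, each tower sees only a quarter of the work per step, so the inter-GPU synchronization overhead dominates and the multi-GPU run looks slower.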
@candlewill Oh, and I removed the tf.device('/cpu:0')
line. I had forgotten to remove it. Thanks.
@candlewill Did you find out why the multi-GPU version is slower than the single-GPU one? For me, the former is definitely much faster than the latter.
I forgot to increase batch_size
by multiplying it by num_gpus
when training.
I am running into a similar issue: GPU usage (AWS p2.xlarge, Deep Learning CUDA 9 Ubuntu AMI) is low on average, while CPU usage is always at peak.
Using Python 2.7 or 3 doesn't make any difference.
It's as if the GPU is used only for some particular task that is seldom invoked.
ubuntu@ip-172-31-13-191:~$ nvidia-smi
Sat Nov 4 08:20:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 75C P0 93W / 149W | 10984MiB / 11439MiB | 50% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 10238 C python3 10971MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-13-191:~$ nvidia-smi
Sat Nov 4 08:20:24 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 75C P0 73W / 149W | 10984MiB / 11439MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 10238 C python3 10971MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-13-191:~$
@candlewill - I am facing the same low-GPU-usage issue as @aijanai . Could you please indicate which line you commented out to get full utilization on a single GPU? The line you mentioned in your comment is already commented out in the code as it stands now, and yet performance is slow, hence the confusion.
candlewill's explanation is exact. I added
train_multi_gpus.py
for using multiple gpus.
@basuam @candlewill Would you run and check the file? In my environment (3 * gtx 1080), the time for an epoch has decreased to almost 1/3. But I'm not sure if it's error-free because this is the first time I've written a code for multiple gpus.
Can you please share train_multi_gpus.py? It is not available now. And with train.py, I can't train on the GPUs.