
Comments (17)

basuam commented on August 16, 2024

@candlewill Are you running it with Python 3 or which Python version?
I had a problem where I couldn't use the GPUs, and I'm guessing that was because I was using Python 2.7; it is described in the closed issue #5.

Try to see if you can find something useful there. Otherwise, if you manage to solve it, please let us know how you did.


candlewill commented on August 16, 2024

@basuam The Python I used is Python 3.6.0 with Anaconda 4.3.1 (64-bit), and the GPU version of TensorFlow (1.1).

When training, memory is allocated on two GPUs, but only one is used for computation.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    2     19759    C   /home/train01/heyunchao/anaconda3/bin/python 21912MiB |
|    3     19759    C   /home/train01/heyunchao/anaconda3/bin/python 21794MiB |
+-----------------------------------------------------------------------------+

Only GPU 2 is used for computation, and its GPU-Util stays at 0% for long stretches.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 0000:02:00.0     Off |                    0 |
| N/A   19C    P8    18W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      On   | 0000:03:00.0     Off |                    0 |
| N/A   21C    P8    17W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40 24GB      On   | 0000:83:00.0     Off |                    0 |
| N/A   37C    P0    65W / 250W |  21916MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40 24GB      On   | 0000:84:00.0     Off |                    0 |
| N/A   32C    P0    57W / 250W |  21798MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+


marcossilva commented on August 16, 2024

@candlewill I believe a slight modification to the code is needed to use both GPUs, because multi-GPU training is not the default in TensorFlow. I cannot confirm this, since I've never trained it with more than one GPU, but I believe you must manually place the model on both GPUs first; otherwise tensorflow-gpu simply computes on one GPU. Have you trained other networks before without explicitly declaring which GPUs to use, and checked whether all of them were used?


candlewill commented on August 16, 2024

To my understanding, this may be the reason: if you don't explicitly declare how to place operations across multiple GPUs, TensorFlow chooses the first GPU for computation by default, but reserves the memory of all visible GPUs.
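
For what it's worth, here is a minimal sketch of two common workarounds in TensorFlow 1.x (the device index and the toy session below are assumptions for illustration, not code from this repo): hide the unused GPUs before TensorFlow initializes, and allocate memory on demand instead of reserving it all up front.

import os

# Must be set before TensorFlow initializes CUDA; only GPU 2 stays visible.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import tensorflow as tf

# Grow GPU memory on demand instead of reserving almost all memory
# on every visible GPU at session creation.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)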


Kyubyong commented on August 16, 2024

candlewill's explanation is correct. I added train_multi_gpus.py for using multiple GPUs.
@basuam @candlewill Would you run and check the file? In my environment (3 × GTX 1080), the time per epoch has decreased to almost 1/3. But I'm not sure it's error-free, because this is the first time I've written code for multiple GPUs.


candlewill commented on August 16, 2024

@Kyubyong In the current train.py code, the training runs entirely on the CPU (see here). I commented out this line to allow the use of one GPU.
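
For readers hitting the same line, the pattern in question looks roughly like this toy sketch (the placeholder and dense layer are stand-ins for the real model, not the repo's actual code):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 10])

# Wrapping graph construction like this pins every op to the CPU, which is
# why the GPUs reserved memory but showed 0% utilization:
#
# with tf.device('/cpu:0'):
#     y = tf.layers.dense(x, 1)

# Without the wrapper, TensorFlow's default placer puts GPU-supported ops
# on /gpu:0 automatically.
y = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(y))
train_op = tf.train.AdamOptimizer().minimize(loss)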

Then, I compared the time per epoch of train_multi_gpus.py and train.py. I found that the multi-GPU version takes longer, about 220 seconds per epoch, while the single-GPU version takes about 110 seconds.

My experimental environment is four Tesla K40m GPUs.


Kyubyong commented on August 16, 2024

Did you run train_multi_gpus.py?


candlewill commented on August 16, 2024

Yes. It takes longer, about 220 seconds per epoch.


Kyubyong commented on August 16, 2024

You changed the value of num_gpus in hyperparams.py, didn't you?


candlewill commented on August 16, 2024

Yes, I changed the value to 4.


Kyubyong commented on August 16, 2024

One possibility is the batch size. If you have 4 GPUs, you have to multiply hp.batch_size by 4 for a fair comparison. If you look at the code, each mini-batch is split into 4 parts, and each part is fed to its own GPU tower.
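
For anyone following along, the tower pattern being described looks roughly like this generic TensorFlow 1.x sketch (the placeholder and dense layer are stand-ins; this is not the actual train_multi_gpus.py code):

import tensorflow as tf

num_gpus = 4
batch = tf.placeholder(tf.float32, [None, 10])  # stand-in for the real inputs
optimizer = tf.train.AdamOptimizer()

# Split the mini-batch so each tower gets batch_size / num_gpus samples.
shards = tf.split(batch, num_gpus, axis=0)

tower_grads = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i), tf.variable_scope('net', reuse=(i > 0)):
        y = tf.layers.dense(shards[i], 1)  # stand-in for the real model
        loss = tf.reduce_mean(tf.square(y))
        tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-variable gradients across towers and apply them once.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grad = tf.reduce_mean(tf.stack([g for g, _ in grads_and_vars]), axis=0)
    avg_grads.append((grad, grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(avg_grads)

This also shows why the per-GPU work shrinks as num_gpus grows: with the original hp.batch_size, each tower sees only a quarter of the samples, so per-step overhead dominates unless the batch size is scaled up.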


Kyubyong commented on August 16, 2024

@candlewill Oh, and I removed the tf.device('/cpu:0') line. I had forgotten to remove it. Thanks.


Kyubyong commented on August 16, 2024

@candlewill Did you find out why the multi-GPU version is slower than the single-GPU one? For me, the former is definitely much faster than the latter.


candlewill commented on August 16, 2024

I forgot to multiply the value of batch_size by num_gpus when training.


aijanai commented on August 16, 2024

I am running into a similar issue: GPU usage (AWS p2.xlarge, Deep Learning CUDA 9 Ubuntu AMI) is low on average, while CPU usage is always at its peak.
Using Python 2.7 or 3 doesn't make any difference.
It's as if the GPU is used only for some particular task that is seldom invoked.

ubuntu@ip-172-31-13-191:~$ nvidia-smi
Sat Nov  4 08:20:19 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   75C    P0    93W / 149W |  10984MiB / 11439MiB |     50%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     10238      C   python3                                    10971MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-13-191:~$ nvidia-smi
Sat Nov  4 08:20:24 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   75C    P0    73W / 149W |  10984MiB / 11439MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     10238      C   python3                                    10971MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-13-191:~$ 



learningneo commented on August 16, 2024

@candlewill I am facing the same issue of low GPU usage indicated by @aijanai. Could you please point out which line you commented out to get full utilization on a single GPU? The line mentioned in your comment is already commented out in the code as it stands now, and yet performance is slow, hence the confusion.


giridhar-pamisetty commented on August 16, 2024

candlewill's explanation is correct. I added train_multi_gpus.py for using multiple GPUs.
@basuam @candlewill Would you run and check the file? In my environment (3 × GTX 1080), the time per epoch has decreased to almost 1/3. But I'm not sure it's error-free, because this is the first time I've written code for multiple GPUs.

Can you please share train_multi_gpus.py? It is not available now. And with train.py, I can't train on the GPUs.

