Comments (17)
@candlewill Are you running it with Python 3, or which Python version are you using?
I had a problem where I couldn't use the GPUs, and my guess is that it was because I was using Python 2.7; see closed issue #5 .
See if you can find anything useful there. Otherwise, if you manage to solve it, please let us know how you did.
from tacotron.
@basuam I am using Python 3.6.0 with Anaconda 4.3.1 (64-bit) and the GPU version of TensorFlow (1.1).
During training, two GPUs are allocated, but only one is used for computation.
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 19759 C /home/train01/heyunchao/anaconda3/bin/python 21912MiB |
| 3 19759 C /home/train01/heyunchao/anaconda3/bin/python 21794MiB |
+-----------------------------------------------------------------------------+
Only GPU 2 is used for computation, and its GPU-Util stays at 0% for long stretches.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M40 24GB On | 0000:02:00.0 Off | 0 |
| N/A 19C P8 18W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M40 24GB On | 0000:03:00.0 Off | 0 |
| N/A 21C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M40 24GB On | 0000:83:00.0 Off | 0 |
| N/A 37C P0 65W / 250W | 21916MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M40 24GB On | 0000:84:00.0 Off | 0 |
| N/A 32C P0 57W / 250W | 21798MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
@candlewill I believe the code needs a slight modification to use both GPUs, because that isn't TensorFlow's default. I can't confirm this, since I've never trained it with more than one GPU, but I believe you must assign both devices manually first; otherwise tensorflow-gpu simply computes on one GPU. Have you trained other networks before without explicitly declaring which GPUs to use, and checked whether it used them all?
To my understanding, this may be the reason: if you don't explicitly declare how to allocate multiple GPUs, TensorFlow picks the first visible GPU for computation by default, but still reserves memory on all GPUs.
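One hedged way to work around that default (a minimal sketch, not the repo's code) is to hide all but one card from TensorFlow before it is imported; the GPU index "2" below just mirrors the nvidia-smi output above:

```python
import os

# Make only GPU 2 visible to TensorFlow. This must be set *before*
# tensorflow is imported, or the process will still map memory on
# every card it can see.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# With TF 1.x you could additionally keep the allocator from grabbing
# all memory up front (left commented here, since it needs tensorflow
# installed):
# import tensorflow as tf
# config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
# sess = tf.Session(config=config)
```

Inside that process, the one visible card then shows up as device 0.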
candlewill's explanation is correct. I added train_multi_gpus.py
for using multiple GPUs.
@basuam @candlewill Would you run and check the file? In my environment (3 × GTX 1080), the time per epoch has decreased to almost 1/3. But I'm not sure it's error-free, since this is the first time I've written code for multiple GPUs.
@Kyubyong In the current train.py
code, training runs entirely on the CPU (see here ). I commented out this line to allow the use of one GPU.
Then I compared the per-epoch time of train_multi_gpus.py
against train.py
. I found that the multi-GPU version takes longer, about 220 seconds per epoch, while the single-GPU version takes about 110 seconds.
My experimental environment has four Tesla K40m GPUs.
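The per-epoch numbers quoted in this thread can be obtained with a trivial wall-clock timer; a minimal sketch, where run_epoch is a hypothetical stand-in for one full pass over the training data:

```python
import time

def time_epoch(run_epoch):
    """Return the wall-clock seconds one training epoch takes."""
    start = time.time()
    run_epoch()           # e.g. a lambda wrapping the training loop body
    return time.time() - start

# Usage (hypothetical): duration = time_epoch(lambda: train_one_epoch(sess))
```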
Did you run train_multi_gpus.py?
Yes. It takes longer, about 220 seconds per epoch.
You changed the value of num_gpus in hyperparams.py, didn't you?
Yes, I changed the value to 4.
One possibility is the batch size. If you have 4 GPUs, you have to multiply hp.batch_size by 4 for a fair comparison. If you look at the code, the mini-batch is split into 4 parts, each of which is fed to its own GPU tower.
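That tower split can be sketched in plain Python (the names num_gpus and batch_size mirror the assumed hyperparams.py fields; the dummy batch stands in for real training samples):

```python
# Each step's mini-batch is divided evenly across the GPU towers, so
# batch_size must grow by a factor of num_gpus for a fair single- vs
# multi-GPU comparison.
num_gpus = 4
single_gpu_batch_size = 32
batch_size = single_gpu_batch_size * num_gpus   # 128 samples per step

batch = list(range(batch_size))                 # dummy mini-batch
per_tower = batch_size // num_gpus
shards = [batch[i * per_tower:(i + 1) * per_tower] for i in range(num_gpus)]

assert len(shards) == num_gpus                    # one shard per tower
assert all(len(s) == single_gpu_batch_size for s in shards)
```

With the unscaled batch size, each tower sees only a quarter of the work per step, so the inter-GPU synchronization overhead dominates and the multi-GPU run looks slower.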
@candlewill Oh, and I removed the tf.device('/cpu:0')
line. I had forgotten to remove it. Thanks.
@candlewill Did you find out why the multi-GPU version is slower than the single-GPU one? For me, the former is definitely much faster than the latter.
I forgot to increase batch_size
by multiplying it by num_gpus
when training.
I am running into a similar issue: GPU usage (AWS p2.xlarge, Deep Learning CUDA 9 Ubuntu AMI) is low on average, while CPU usage is always at peak.
Using Python 2.7 or 3 doesn't make any difference.
It's as if the GPU is used only for some particular task that is seldom invoked.
ubuntu@ip-172-31-13-191:~$ nvidia-smi
Sat Nov 4 08:20:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 75C P0 93W / 149W | 10984MiB / 11439MiB | 50% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 10238 C python3 10971MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-13-191:~$ nvidia-smi
Sat Nov 4 08:20:24 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 75C P0 73W / 149W | 10984MiB / 11439MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 10238 C python3 10971MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-13-191:~$
@candlewill - I am facing the same low-GPU-usage issue as @aijanai . Could you please indicate which line you commented out to get full utilization on a single GPU? The line you mentioned in your comment is already commented out in the code as it stands now, and yet performance is slow, hence the confusion.
candlewill's explanation is exact. I added
train_multi_gpus.py
for using multiple gpus.
@basuam @candlewill Would you run and check the file? In my environment (3 * gtx 1080), the time for an epoch has decreased to almost 1/3. But I'm not sure if it's error-free because this is the first time I've written a code for multiple gpus.
Can you please share train_multi_gpus.py? It is not available now. And with train.py, I can't train on the GPUs.