Comments (6)
Hello, @sonots san.
Thank you for the experiments and your effort.
There could be several reasons for the poor parallel efficiency, and I have a few questions about it.
Possible reasons:
- For MNIST, the model is too small to benefit from distributed processing.
- For DCGAN, the model is a GAN, which is complicated enough that parallel efficiency is not simple to discuss.
- MPI busy-waits during blocking communication, so the CPU busy rate does not help much in identifying the bottleneck.
Questions:
- What did iperf actually measure? I guess it measures the average bandwidth over 1 or 2 seconds by default. What are the values listed in the table? If they are averages over a single experiment, they should be much lower than the actual bandwidth, because ChainerMN adopts synchronous parallelization.
- Did you specify the number of processes in the mpiexec command?
- Which MPI implementation did you use?
- What happens if you run 1 process/machine on x8 instances?
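To illustrate the averaging concern in the first question: in a synchronous loop the network is only busy during the gradient exchange, so a bandwidth figure averaged over a whole iteration understates the rate reached while communicating. The following is only a sketch with made-up numbers; the gradient size and the timings are hypothetical and are not measurements from this issue, and `average_vs_burst_bandwidth` is an illustrative helper name.

```python
def average_vs_burst_bandwidth(bytes_per_iter, compute_sec, comm_sec):
    """Return (burst, average) bandwidth in Gbit/s for one training iteration.

    Assumes a strictly synchronous loop: compute first, then exchange
    gradients, with no overlap between the two phases.
    """
    bits = bytes_per_iter * 8
    burst = bits / comm_sec / 1e9                    # rate while the link is in use
    average = bits / (compute_sec + comm_sec) / 1e9  # rate over the whole iteration
    return burst, average


# Hypothetical values: 100 MB exchanged per iteration,
# 0.4 s of computation and 0.1 s of communication.
burst, average = average_vs_burst_bandwidth(100e6, compute_sec=0.4, comm_sec=0.1)
print(f"burst: {burst:.1f} Gbit/s, average: {average:.1f} Gbit/s")
```

With these assumed numbers the averaged figure is one fifth of the burst rate, so a low average alone does not show that the network is underused.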
Thanks!
from chainermn.
What did iperf actually measure?
The iperf result was as follows:
------------------------------------------------------------
Client connecting to sonots-p2-8xlarge-2, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 7] local 10.0.4.58 port 47890 connected with 10.0.4.102 port 5001
[ 3] local 10.0.4.58 port 47882 connected with 10.0.4.102 port 5001
[ 6] local 10.0.4.58 port 47888 connected with 10.0.4.102 port 5001
[ 4] local 10.0.4.58 port 47884 connected with 10.0.4.102 port 5001
[ 5] local 10.0.4.58 port 47886 connected with 10.0.4.102 port 5001
[ ID] Interval Transfer Bandwidth
[ 7] 0.0-10.0 sec 1.50 GBytes 1.28 Gbits/sec
[ 3] 0.0-10.0 sec 3.92 GBytes 3.37 Gbits/sec
[ 6] 0.0-10.0 sec 3.92 GBytes 3.36 Gbits/sec
[ 4] 0.0-10.0 sec 1.22 GBytes 1.05 Gbits/sec
[ 5] 0.0-10.0 sec 1.20 GBytes 1.03 Gbits/sec
[SUM] 0.0-10.0 sec 11.8 GBytes 10.1 Gbits/sec
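As a quick sanity check on the output above, the per-stream figures can be summed and compared with the `[SUM]` line. This is a minimal sketch that parses only the summary lines copied from the output; the regular expression and the `parse_bandwidths` helper are illustrative and assume the Gbits/sec unit shown here.

```python
import re

IPERF_OUTPUT = """\
[ 7] 0.0-10.0 sec 1.50 GBytes 1.28 Gbits/sec
[ 3] 0.0-10.0 sec 3.92 GBytes 3.37 Gbits/sec
[ 6] 0.0-10.0 sec 3.92 GBytes 3.36 Gbits/sec
[ 4] 0.0-10.0 sec 1.22 GBytes 1.05 Gbits/sec
[ 5] 0.0-10.0 sec 1.20 GBytes 1.03 Gbits/sec
[SUM] 0.0-10.0 sec 11.8 GBytes 10.1 Gbits/sec
"""

# Matches "[ 7] ... 1.28 Gbits/sec" as well as "[SUM] ... 10.1 Gbits/sec".
LINE_RE = re.compile(r"^\[\s*(\w+)\]\s.*?([\d.]+) Gbits/sec$")


def parse_bandwidths(text):
    """Return ({stream_id: Gbit/s}, reported_sum) from iperf summary lines."""
    streams, reported_sum = {}, None
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        label, gbits = match.group(1), float(match.group(2))
        if label == "SUM":
            reported_sum = gbits
        else:
            streams[label] = gbits
    return streams, reported_sum


streams, reported_sum = parse_bandwidths(IPERF_OUTPUT)
total = sum(streams.values())
print(f"{len(streams)} streams, total {total:.2f} Gbit/s, reported {reported_sum}")
```

The five streams add up to about 10.09 Gbit/s, consistent with the reported 10.1 Gbit/s, although the individual streams are far from evenly balanced.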
Did you specify the number of processes in the mpiexec command?
I specified the number of processes in the hostfile. Of course, I also checked the number of running processes with the ps or top commands.
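Besides ps or top, the process count can also be checked from inside the training script by reading the environment variables that Open MPI's launcher exports to each process. The sketch below assumes Open MPI's mpiexec is the launcher; `mpi_process_info` is a hypothetical helper name and is not part of ChainerMN.

```python
import os


def mpi_process_info(env=None):
    """Read the rank/size variables that Open MPI's launcher exports.

    Values are None when a variable is missing, e.g. when the script
    is started without mpiexec.
    """
    env = os.environ if env is None else env

    def read_int(name):
        value = env.get(name)
        return int(value) if value is not None else None

    return {
        "world_size": read_int("OMPI_COMM_WORLD_SIZE"),
        "world_rank": read_int("OMPI_COMM_WORLD_RANK"),
        "local_size": read_int("OMPI_COMM_WORLD_LOCAL_SIZE"),
        "local_rank": read_int("OMPI_COMM_WORLD_LOCAL_RANK"),
    }


if __name__ == "__main__":
    print(mpi_process_info())
```

Printing this from every rank makes it easy to see whether the world size and the per-node local size match what the hostfile was meant to request.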
Which MPI implementation did you use?
OpenMPI v3.0.0
What happens if you run 1 process/machine on x8 instances?
The result was the same as with p2.xlarge.
from chainermn.
I will re-evaluate with ImageNet anyway.
from chainermn.
Thanks!
(And after you finish a new experiment, please reopen this issue or create a new one.)
from chainermn.
I've re-evaluated, and the results now look reasonable. https://qiita.com/sonots/items/22384bbc61284f2fdf94#%E3%81%BE%E3%81%A8%E3%82%81
from chainermn.
Thanks!
from chainermn.