
Comments (6)

keisukefukuda commented on June 17, 2024

Hello, @sonots san.
Thank you for the experiments and your effort.

There could be several reasons for the poor parallel efficiency, and I have a few questions about it.

Possible reasons:

  • In MNIST, the model is too small for distributed processing to be beneficial.
  • In DCGAN, the model is a GAN, which is complicated enough that parallel efficiency is not simple to reason about.
  • MPI busy-waits during blocking communication, so the CPU busy rate does not help much in identifying the bottleneck.

Questions:

  • What did iperf actually measure? I guess it reports bandwidth averaged over 1 or 2 seconds by default. What are the values listed in the table? If they are averages over a whole experiment, they should be much lower than the actual bandwidth, because ChainerMN uses synchronous parallelization.
  • Did you specify the number of processes in the mpiexec command?
  • Which MPI implementation did you use?
  • What happens if you run 1 process/machine on x8 instances?
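
To illustrate the first question, here is a rough sketch of why a time-averaged bandwidth figure can sit far below the link capacity when communication happens in short synchronous bursts. The link speed and phase durations are hypothetical numbers chosen for illustration, not measurements from this issue, and the model assumes the link is saturated only during the communication phase.

```python
def average_bandwidth_gbps(link_gbps, comm_seconds, compute_seconds):
    """Bandwidth averaged over one training iteration.

    Assumes the link runs at full speed only while gradients are being
    exchanged and is idle during the compute phase.
    """
    iteration_seconds = comm_seconds + compute_seconds
    return link_gbps * comm_seconds / iteration_seconds


# Hypothetical values: 10 Gbit/s link, 0.05 s of communication
# for every 0.45 s of computation.
avg = average_bandwidth_gbps(10.0, 0.05, 0.45)
print(f"{avg:.1f} Gbit/s averaged over the iteration")
```

Under these assumed numbers the average is one tenth of the link speed, even though the link is fully used whenever the processes communicate.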

Thanks!

from chainermn.

sonots commented on June 17, 2024

What did iperf actually measure?

The result of iperf was like:

------------------------------------------------------------
Client connecting to sonots-p2-8xlarge-2, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  7] local 10.0.4.58 port 47890 connected with 10.0.4.102 port 5001
[  3] local 10.0.4.58 port 47882 connected with 10.0.4.102 port 5001
[  6] local 10.0.4.58 port 47888 connected with 10.0.4.102 port 5001
[  4] local 10.0.4.58 port 47884 connected with 10.0.4.102 port 5001
[  5] local 10.0.4.58 port 47886 connected with 10.0.4.102 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  1.50 GBytes  1.28 Gbits/sec
[  3]  0.0-10.0 sec  3.92 GBytes  3.37 Gbits/sec
[  6]  0.0-10.0 sec  3.92 GBytes  3.36 Gbits/sec
[  4]  0.0-10.0 sec  1.22 GBytes  1.05 Gbits/sec
[  5]  0.0-10.0 sec  1.20 GBytes  1.03 Gbits/sec
[SUM]  0.0-10.0 sec  11.8 GBytes  10.1 Gbits/sec
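
As a quick sanity check on the output above, the following sketch parses the five per-stream rows and adds them up. The regular expression is written only for the row layout shown here and is an assumption; it is not a general iperf parser.

```python
import re

# Per-stream rows copied from the iperf output above.
IPERF_ROWS = """\
[  7]  0.0-10.0 sec  1.50 GBytes  1.28 Gbits/sec
[  3]  0.0-10.0 sec  3.92 GBytes  3.37 Gbits/sec
[  6]  0.0-10.0 sec  3.92 GBytes  3.36 Gbits/sec
[  4]  0.0-10.0 sec  1.22 GBytes  1.05 Gbits/sec
[  5]  0.0-10.0 sec  1.20 GBytes  1.03 Gbits/sec
"""

# Matches "[ ID]  start-end sec  X GBytes  Y Gbits/sec" for numeric IDs only.
ROW = re.compile(
    r"\[\s*(\d+)\]\s+[\d.]+-\s*[\d.]+ sec\s+([\d.]+) GBytes\s+([\d.]+) Gbits/sec"
)


def sum_streams(text):
    """Return (total GBytes transferred, total Gbits/sec) over all streams."""
    transfer = 0.0
    bandwidth = 0.0
    for match in ROW.finditer(text):
        transfer += float(match.group(2))
        bandwidth += float(match.group(3))
    return transfer, bandwidth


transfer, bandwidth = sum_streams(IPERF_ROWS)
print(f"{transfer:.2f} GBytes, {bandwidth:.2f} Gbits/sec")
```

The per-stream values add up to about 11.76 GBytes and 10.09 Gbits/sec, which agrees with the rounded [SUM] row (11.8 GBytes, 10.1 Gbits/sec).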

Did you specify the number of processes in the mpiexec command?

I specified the number of processes in the hostfile, and I confirmed the number of running processes with the ps and top commands.
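
For readers unfamiliar with this setup, here is a minimal sketch of what setting process counts in an Open MPI hostfile looks like, using the `hostname slots=N` line format. The host names and slot counts are hypothetical and are not the ones used in this experiment; the helper `render_hostfile` is only for illustration.

```python
def render_hostfile(slots_per_host):
    """Render an Open MPI style hostfile: one 'host slots=N' line per host."""
    return "".join(
        f"{host} slots={slots}\n" for host, slots in slots_per_host.items()
    )


# Hypothetical hosts: two machines, eight processes each.
hosts = {"node-1": 8, "node-2": 8}
print(render_hostfile(hosts), end="")
```

With a hostfile like this, my understanding is that Open MPI launches one process per slot when no explicit process count is given on the command line, so checking the running processes afterwards, as described above, is still a sensible verification step.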

Which MPI implementation did you use?

OpenMPI v3.0.0

What happens if you run 1 process/machine on x8 instances?

The result was the same as with p2.xlarge.

sonots commented on June 17, 2024

I will re-evaluate with ImageNet anyway.

keisukefukuda commented on June 17, 2024

Thanks!
(After you finish the new experiment, please reopen this issue or create a new one.)

sonots commented on June 17, 2024

I've re-evaluated, and the results now look reasonable: https://qiita.com/sonots/items/22384bbc61284f2fdf94#%E3%81%BE%E3%81%A8%E3%82%81

keisukefukuda commented on June 17, 2024

Thanks!
