Comments (6)
Hi, this happens when MPI is not configured correctly.
Please follow the MPI configuration steps in README.md.
Before running my program, it is a good idea to run a simple MPI program to check that the send() and broadcast() MPI operations work. Alternatively, you can try the following.
Change this code:
def init_config(self):
    self.__broadcast_initial_config_to_client()
    """
    comm.bcast (tree structure) is faster than a loop send/receive operation:
    https://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/
    """
    # for process_id in range(1, self.size):
    #     self.__send_initial_config_to_client(process_id)
to
def init_config(self):
    # self.__broadcast_initial_config_to_client()
    """
    comm.bcast (tree structure) is faster than a loop send/receive operation:
    https://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/
    """
    for process_id in range(1, self.size):
        self.__send_initial_config_to_client(process_id)
This way, you can test whether your MPI configuration is correct.
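For the standalone check mentioned earlier, a minimal sketch could look like the one below. It assumes the mpi4py package (the `comm.bcast` call in the snippet above suggests it, but that is an assumption), and the file name `mpi_check.py` and the function `check_mpi` are made up for illustration. It broadcasts one object from rank 0, then repeats the loop-send pattern, and reports whether every rank received what it expected.

```python
"""Minimal MPI smoke test. Run with e.g.: mpirun -np 4 python mpi_check.py
(mpi_check.py is a hypothetical file name used only for this example.)"""


def check_mpi(comm):
    """Exercise bcast and send/recv on an mpi4py-style communicator."""
    rank, size = comm.Get_rank(), comm.Get_size()

    # 1) Collective broadcast: every rank must end up with rank 0's payload.
    payload = {"msg": "hello", "size": size} if rank == 0 else None
    payload = comm.bcast(payload, root=0)
    bcast_ok = payload == {"msg": "hello", "size": size}

    # 2) Point-to-point: rank 0 sends each worker its own id in a loop.
    if rank == 0:
        for process_id in range(1, size):
            comm.send(process_id, dest=process_id, tag=0)
        send_ok = True
    else:
        send_ok = comm.recv(source=0, tag=0) == rank

    return bcast_ok and send_ok


def main():
    try:
        from mpi4py import MPI  # imported lazily so the file loads without MPI
    except ImportError:
        print("mpi4py is not installed; cannot run the MPI check")
        return
    comm = MPI.COMM_WORLD
    ok = check_mpi(comm)
    print(f"rank {comm.Get_rank()}/{comm.Get_size()}: {'OK' if ok else 'FAILED'}")


if __name__ == "__main__":
    main()
```

If this small script hangs or fails on some rank, the problem is most likely in the MPI setup rather than in the training code.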
When the number of workers is large, the send() loop is somewhat slower than the broadcast() operation. But in FL research the number of workers is typically below 1000, so this won't noticeably increase the communication time.
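To get a rough feel for the difference, here is a back-of-the-envelope sketch. It assumes an idealized model in which the root sends to workers one at a time, while a binomial-tree broadcast doubles the number of informed processes every round. Real MPI implementations pick their algorithm based on message size and topology, so treat the numbers as an illustration only. The function names are made up for this example.

```python
def sequential_send_steps(num_processes):
    """Root sends to each of the other processes, one after another."""
    return max(num_processes - 1, 0)


def tree_broadcast_rounds(num_processes):
    """Rounds for an idealized binomial tree: ceil(log2(num_processes))."""
    if num_processes <= 1:
        return 0
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, without float error.
    return (num_processes - 1).bit_length()


if __name__ == "__main__":
    for n in (8, 64, 1000):
        print(f"{n:5d} processes: {sequential_send_steps(n):4d} sequential sends, "
              f"{tree_broadcast_rounds(n):2d} tree rounds")
```

Under this model, 1000 processes need 999 sequential sends but only 10 tree rounds. The initial configuration is sent only once at start-up, which is why the loop variant is acceptable here.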
from fednas.
Thanks very much! As you suggested, I tried the send() MPI operation instead of the broadcast operation, but it still didn't work. I found the root cause: I run the code in Docker, which brings some other problems. "shm" was set too small, so there was not enough shared memory. Finally, thank you for responding to my problem. Your code is very nice~
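For anyone hitting the same thing: a quick way to see how much shared memory a container actually has is to look at the size of /dev/shm. The sketch below uses only the Python standard library, and the function name `shm_usage` is made up for this example. Docker gives a container 64 MB of /dev/shm by default, and `docker run --shm-size=...` raises it (the right value depends on your workload).

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total_bytes, free_bytes) for the shared-memory mount, or None if absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total, usage.free


if __name__ == "__main__":
    result = shm_usage()
    if result is None:
        print("/dev/shm not found on this system")
    else:
        total, free = result
        print(f"/dev/shm: total {total / 2**20:.0f} MiB, free {free / 2**20:.0f} MiB")
```

If the total is only a few tens of MiB, multi-process data loading or inter-process tensor sharing can easily run out of space.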
I am glad to hear that you like my implementation.
Can you run my program now?
Yes, you also need to check your physical configuration to make sure MPI communication works. If the program is stuck without logging anything for more than 3 minutes, the multiprocessing program has hit a bug. It could be 1) not enough GPU memory: try reducing your batch size or worker number; 2) the MPI configuration: make sure the bandwidth is sufficient for the model size.
For 1), you have to retune your hyper-parameters.
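To find out where a process is stuck, one option is a small watchdog built on Python's standard `faulthandler` module. This is only a sketch and not part of the FedNAS code. The helper names are made up, and the 180-second default simply mirrors the "3 minutes" rule of thumb above.

```python
import faulthandler
import sys

HANG_TIMEOUT_SECONDS = 180  # the "3 minutes" rule of thumb from this thread


def arm_hang_watchdog(timeout=HANG_TIMEOUT_SECONDS, file=None):
    """Dump all thread stacks to `file` if not re-armed or disarmed within `timeout` seconds.

    Calling this again replaces the previous timer, so re-arming after each
    progress log makes the dump fire only when the process has gone quiet.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout, repeat=False, file=target)


def disarm_hang_watchdog():
    """Cancel a pending stack dump, e.g. on clean shutdown."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` at start-up and again after each training round or log message would print a traceback per thread once a rank stalls. That usually shows whether it is blocked in an MPI call or in a CUDA operation.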
Thank you for the reminder. My experiments run on a K8s cluster, and my account has been allocated 8 V100 (32G) GPUs, so memory should not be a big problem. I will ask the cluster administrator to start a new Docker container later. If I manage to run your code, I will reply as soon as possible.
Hi, I ran the code on a single GPU server successfully. Thank you~
Great!
We plan to release a distributed learning library soon. If you use our code for research or a project, please cite the FedNAS paper and our framework/library paper. Thanks.