
Comments (6)

chaoyanghe commented on July 17, 2024

Hi, this happens when MPI is not configured correctly.
Please follow the MPI configuration steps in README.md.
Before running my program, it is better to run a simple MPI program to check that the send() and broadcast() MPI operations work. Alternatively, you can try the following.
Change this code:

    def init_config(self):
        self.__broadcast_initial_config_to_client()
        """
        comm.bcast (tree structure) is faster than a loop send/receive operation:
        https://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/
        """
        # for process_id in range(1, self.size):
        #     self.__send_initial_config_to_client(process_id)

to

    def init_config(self):
        # self.__broadcast_initial_config_to_client()
        """
        comm.bcast (tree structure) is faster than a loop send/receive operation:
        https://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/
        """
        for process_id in range(1, self.size):
            self.__send_initial_config_to_client(process_id)

This way, you can test whether your MPI configuration is correct.
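A standalone check along these lines might help separate MPI setup problems from FedNAS itself. This is only a minimal sketch, assuming mpi4py is the MPI binding in use; `check_mpi`, `main`, the tag value, and the payload are hypothetical names and values chosen for illustration, not code from the repository.

```python
def check_mpi(comm):
    """Return True if a broadcast and a loop of send/recv calls both deliver."""
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Collective: every rank should end up with the object created on rank 0.
    expected = {"msg": "hello", "root": 0}
    payload = expected if rank == 0 else None
    payload = comm.bcast(payload, root=0)
    bcast_ok = payload == expected

    # Point-to-point: rank 0 sends each worker its own rank number.
    send_ok = True
    if rank == 0:
        for dest in range(1, size):
            comm.send(dest, dest=dest, tag=7)
    else:
        send_ok = comm.recv(source=0, tag=7) == rank

    return bcast_ok and send_ok


def main():
    # Assumption: mpi4py is installed. It is imported lazily so that
    # check_mpi() itself does not depend on it.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    status = "OK" if check_mpi(comm) else "FAILED"
    print(f"rank {comm.Get_rank()} of {comm.Get_size()}: {status}")
```

Saved as a hypothetical `check_mpi.py` with a final `main()` call, this could be launched with something like `mpirun -np 4 python check_mpi.py`. If it hangs or prints FAILED on any rank, the problem is probably in the MPI setup rather than in the training code.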
The send() MPI operation is slightly slower than the broadcast() MPI operation when the number of workers is large. But in FL research, the number of workers is smaller than 1000, so this won't noticeably increase the communication time.
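To make the trade-off concrete, here is a rough count of communication steps. It is a sketch that assumes a binomial-tree broadcast, in which the set of ranks holding the data doubles every round. Real MPI libraries choose their broadcast algorithm from message size and process count, so these numbers are illustrative only, and the function names are made up for this example.

```python
def loop_send_steps(num_processes: int) -> int:
    """Sequential sends rank 0 must issue to reach every other rank."""
    return max(num_processes - 1, 0)


def tree_bcast_rounds(num_processes: int) -> int:
    """Rounds needed when the informed ranks double each round.

    This is ceil(log2(num_processes)), computed with integer arithmetic.
    """
    if num_processes <= 1:
        return 0
    return (num_processes - 1).bit_length()


for n in (4, 16, 1000):
    print(n, loop_send_steps(n), tree_bcast_rounds(n))
# prints:
# 4 3 2
# 16 15 4
# 1000 999 10
```

The loop grows linearly with the number of workers while the tree grows logarithmically. For the small worker counts typical in these experiments the absolute difference stays modest.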

from fednas.

renyan1998 commented on July 17, 2024

Thanks very much! As you suggested, I tried the send() MPI operation instead of the broadcast operation, but it still doesn't work. I found the root problem: when I run the code in Docker, other problems appear. "shm" was set too small, so there is not enough shared memory. Finally, thank you for responding to my questions, and your code is very nice~
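For anyone hitting the same thing, a quick way to see how much shared memory the container actually has is to inspect `/dev/shm` from inside it. This is a sketch under assumptions: the helper names and any threshold you pass are hypothetical, and the amount a given run needs depends on the model and the number of workers. Docker gives a container 64 MB of `/dev/shm` by default unless `--shm-size` is passed to `docker run`.

```python
import shutil

MIB = 1024 * 1024


def shm_report(path: str = "/dev/shm") -> dict:
    """Return total and free space (in MiB) of the filesystem at `path`."""
    usage = shutil.disk_usage(path)
    return {"total_mib": usage.total / MIB, "free_mib": usage.free / MIB}


def shm_is_sufficient(required_mib: float, path: str = "/dev/shm") -> bool:
    """True if at least `required_mib` MiB are free at `path`."""
    return shm_report(path)["free_mib"] >= required_mib
```

Calling `shm_report()` inside the container before launching the job shows whether the limit is the small default. If it is, restarting the container with a larger `--shm-size` is one thing to try.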


chaoyanghe commented on July 17, 2024

I am glad to hear that you like my implementation.

Can you run my program now?

Yeah, you also need to check your physical configuration to make sure MPI communication works. When the program is stuck without logging for more than 3 minutes, it means the multiprocessing program has hit a bug. It could be 1) not enough GPU memory: try reducing your batch size or worker number; 2) the MPI configuration: make sure the bandwidth is enough for the model size.

For 1), you have to retune your hyper-parameters.
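When a run goes silent like this, it can help to find out where each process is stuck. The sketch below uses Python's standard `faulthandler` module to dump the stack of every thread if no progress is reported for 180 seconds, matching the 3-minute rule of thumb above. `heartbeat` and `stop_watchdog` are hypothetical helper names, and where to call them in the training loop is left to the reader.

```python
import faulthandler
import sys

STALL_TIMEOUT_S = 180.0  # "more than 3 minutes" without logging


def heartbeat(timeout_s: float = STALL_TIMEOUT_S) -> float:
    """Re-arm the stall timer; call this wherever the program logs progress.

    Each call replaces the previous timer, so tracebacks are dumped only
    when no heartbeat arrives within `timeout_s` seconds.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=sys.__stderr__)
    return timeout_s


def stop_watchdog() -> None:
    """Cancel the pending dump, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()
```

If the dump shows a process waiting inside an MPI receive, the communication setup is the likely cause. A process stuck in a CUDA call points more toward the GPU memory case.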


renyan1998 commented on July 17, 2024

Thank you for the reminder. My experiments run on a K8s cluster, and the account allocated to me has 8 V100 (32G) GPUs, so memory should not be a big problem. I will contact the cluster administrator to start a new Docker container later. If I succeed in running your code, I will reply as soon as possible.


renyan1998 commented on July 17, 2024

Hi, I ran the code successfully on a single-GPU server. Thank you~


chaoyanghe commented on July 17, 2024

Great!

We plan to release a distributed learning library soon. If you use our code for research or a project, please cite the FedNAS paper and our framework/library paper. Thanks.
