
distribin: Distributed metagenomic binning

As part of the PaCER Hackathon 2023, I set out to learn about and refresh my understanding of distributed computing and see whether I could apply those concepts to distribute and bin multiple metagenomes. I also wanted to learn about the Message Passing Interface (MPI). Here is the basic idea of my project.

I used two approaches.

  1. Array jobs
  2. MPI

Array jobs

Using SLURM's --array option and passing a list of files, one per job, I was able to distribute the binning jobs to different nodes. Check out the metacoag.slurm file for further details.
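
For illustration, here is a minimal sketch of what such an array job could look like. The dataset list file, resource requests, and input file layout are assumptions, and the MetaCoAG options should be checked against its documentation for your assembler; the actual script used is metacoag.slurm in this repository.

#!/bin/bash
#SBATCH --job-name=metacoag-array
#SBATCH --array=0-1                  # one array task per dataset
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

# Pick the dataset folder for this array task from a plain-text list (one path per line)
DATASET=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" dataset_list.txt)

# Bin the selected dataset with MetaCoAG
metacoag --assembler spades \
         --graph "${DATASET}/assembly_graph_with_scaffolds.gfa" \
         --contigs "${DATASET}/contigs.fasta" \
         --paths "${DATASET}/contigs.paths" \
         --abundance "${DATASET}/abundance.tsv" \
         --output "${DATASET}/bins"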

MPI

This was the challenging part. Rather than rewriting my code to distribute chunks of one dataset, I decided to distribute one dataset per node using MPI, do the binning, gather the results and do some statistics.

Setting up MPI

Make sure you have a working MPI implementation like MPICH or OpenMPI.
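
On an HPC system this usually just means loading the site's MPI module; a quick sanity check could look like this (the module name is an assumption for your cluster):

module load openmpi      # or mpich, depending on what your cluster provides
mpicc --version          # the compiler wrapper should be found
mpirun --version         # the launcher should report the same MPI implementation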

Setting up mpi4py

As my code was in Python, I used the MPI Python interface mpi4py, which you can install as explained in the mpi4py documentation. Make sure the mpi4py version you install is compatible with the MPI implementation you have. I highly recommend staying away from prebuilt versions and building from source instead.
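
As an example, a source build against the currently loaded MPI could look like the following sketch (setting MPICC and the exact pip invocation may need adjusting for your environment):

# Force pip to build mpi4py from source against the mpicc on your PATH
env MPICC=$(which mpicc) python -m pip install --no-binary=mpi4py mpi4py

# Sanity check: the reported library should match the MPI module you loaded
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"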

Setting up metacoag

I downloaded the source code of MetaCoAG from GitHub and installed it using pip locally.

git clone https://github.com/Vini2/MetaCoAG.git
cd MetaCoAG/
pip install -e .
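
Once installed, a quick check that the metacoag entry point is available on your PATH:

metacoag --help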

The basic idea

from mpi4py import MPI

def run_metacoag(folder_name, rank):
    # Run MetaCoAG on the dataset in folder_name and return the number of
    # bins it produces (shown here as a stub; the actual MetaCoAG call and
    # bin counting go in this function).
    nbins = 0
    return nbins

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank == 0:
        # List of input folders; this example assumes one folder per MPI rank
        input_folders = ["/path/to/test_dataset_1",
                         "/path/to/test_dataset_2"]

        # Scatter the folders: send one folder to each of the other ranks
        for i in range(1, size):
            comm.send(input_folders[i], dest=i)

        # Rank 0 keeps the first folder for itself
        received_data = input_folders[0]
    else:
        # Receive the folder assigned to this rank
        received_data = comm.recv(source=0)

    # Process the received folder
    results = run_metacoag(received_data, rank)

    # Gather the number of bins from all processes
    all_results = comm.gather(results, root=0)

    if rank == 0:
        # Process the gathered results
        final_result = sum(all_results)
        print("Total number of bins produced:", final_result)
Basically, we have a set of folder paths in input_folders and we scatter them to the nodes (done by comm.send). Each node then receives its folder (done by comm.recv) and processes it. Finally, the results are gathered (done by comm.gather) and processed; here, the total number of bins is calculated.

How to run?

Change input_folders to your list of folders, load the relevant modules (e.g., MPI and mpi4py), and launch the SLURM script with:

sbatch mpi_metacoag.slurm
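
For reference, a minimal version of such a script could look like the sketch below; the module names, resource requests, and the Python file name (mpi_metacoag.py) are assumptions, so adapt them to your cluster and to the files in this repository.

#!/bin/bash
#SBATCH --job-name=mpi-metacoag
#SBATCH --nodes=2                    # one dataset per MPI rank in this example
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

# Load the MPI stack and the matching mpi4py build (module names are site-specific)
module load openmpi mpi4py

# Launch one Python process per MPI rank
srun python mpi_metacoag.py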

Disclaimer

This is a crazy idea I came up with and it might turn out to be totally useless, but I hope it helps someone who wants to distribute files and run a command on multiple nodes using MPI.

Future work

I want to actually use MPI to distribute large graphs and do clustering on chunks. Fingers crossed! This was a great learning experience.
