distribin: Distributed metagenomic binning

As part of the PaCER Hackathon 2023 I set out to learn and refresh my understanding of distributed computing and see if I can apply those concepts to distribute and bin multiple metagenomes. I also wanted to learn about the message passing interface (MPI). Here is the basic idea of my project.

I used two approaches.

Array jobs
MPI

Array jobs

Using the SLURM --array and passing a list of files to use for each job, I was able to distribute the binning jobs to different nodes. Check out the metacoag.slurm file for further details.

MPI

This was the challenging part. Rather than rewriting my code to distribute chunks of one dataset, I decided to distribute one dataset per node using MPI, do the binning, gather the results and do some statistics.

Setting up MPI

Make sure you have a working MPI implementation like MPICH or OpenMPI.

Setting up `mpi4py`

As my code was in python, I used the MPI python interface mpi4py which you can install as explained here. Make sure to match the compatible mpi4py version to the MPI version you have. I highly recommend staying away from prebuilt version and building from source.

Setting up `metacoag`

I downloaded the source code of MetaCoAG from GitHub and installed it using pip locally.

git clone https://github.com/Vini2/MetaCoAG.git
cd MetaCoAG/
pip install -e .

The basic idea

from mpi4py import MPI

def run_metacoag(folder_name, rank):
  # bin dataset
  # get nbins
  return nbins

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank == 0:
        # List of input folders
        input_folders = ["/path/to/test_dataset_1", 
                         "/path/to/test_dataset_2"]

        # Scatter the folders
        for i in range(1, size):
            comm.send(input_folders[i], dest=i)

    # Receive the folder
    if rank == 0:
        received_data = input_folders[0]
    else:
        received_data = comm.recv(source=0)

    # Process the received folder
    results = run_metacoag(received_data, rank)

    # Gather the number of bins from all processes
    all_results = comm.gather(results, root=0)


    if rank == 0:

        # Process the gathered results
        final_result = sum(all_results)
        print("Total number of bins produced:", final_result)

Basically, we have a set of folder paths in input_folders and we scatter them to nodes (done by comm.send). Then the nodes receive the data (done by comm.recv) and process it. Finally, the results are gathered (done by comm.gather and processed (here the total is calculated).

How to run?

Change input_folders to your list of folders, load the relevant modules (e.g., MPI and mpi4py) and launch the SLURM script using,

sbatch mpi_metacoag.slurm

Disclaimer

This is a crazy idea I came up with and might be totally useless. But I hope it might be useful for someone who wants to distribute files and run a command on multiple nodes using MPI.

Future work

I want to actually use MPI to distribute large graphs and do clustering on chunks. Fingers crossed! This was a great learning experience.

vini2 / distribin Goto Github PK

distribin's Introduction

distribin: Distributed metagenomic binning

Array jobs

MPI

Setting up MPI

Setting up `mpi4py`

Setting up `metacoag`

The basic idea

How to run?

Disclaimer

Future work

distribin's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

vini2 / distribin Goto Github PK

distribin's Introduction

distribin: Distributed metagenomic binning

Array jobs

MPI

Setting up MPI

Setting up mpi4py

Setting up metacoag

The basic idea

How to run?

Disclaimer

Future work

distribin's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org

Setting up `mpi4py`

Setting up `metacoag`