genomes-mags-database's Introduction

Creating a Database of Reference Genomes and Metagenome-Assembled Genomes (MAGs)

This repository mainly describes how to build a database of reference genomes from RefSeq and metagenome-assembled genomes (MAGs) from multiple large-scale metagenomic projects across various environments. The database does not include human or host-associated MAGs, and is intended mostly for exploring genomes and marker genes in environmental metagenomes. The included scripts cover downloading and reformatting sets of genomes, and then calling genes or performing functional annotation on a specific subset of the downloaded genomes for further analyses.

The RefSeq database was built in July 2019. As of that date, downloading all complete RefSeq genomes and the MAGs from the datasets below amounts to approximately 30,000 genomes.

To include all genomes from NCBI regardless of completion status, download the genomes listed in accessions/2019-08-01-incomplete-genbank-genomes-accessions.txt. This list covers all genomes with an assembly level of chromosome, contig, or scaffold deposited in GenBank as of 2019-08-01. The full metadata file is too large to store on GitHub, so it is kept in an OSF repository, with dated folders for updated database files.
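As a rough sketch of the bulk download described above, the snippet below parses an accession list and assembles an ncbi-genome-download command line without running it. It assumes the accession file holds one assembly accession per line, which is not confirmed here; the function names and the output folder name are hypothetical and are not part of this repository's scripts.

```python
def parse_accessions(text):
    """Return accessions from list text, skipping blank lines and '#' comments.

    Assumption: one assembly accession per line.
    """
    accessions = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions


def build_download_command(accession_file, output_dir="genbank-genomes"):
    """Build (but do not run) an ncbi-genome-download command as an argument list.

    output_dir is a hypothetical folder name chosen for illustration.
    """
    return [
        "ncbi-genome-download",
        "--section", "genbank",                  # GenBank rather than RefSeq
        "--assembly-accessions", accession_file,  # file of accessions to fetch
        "--formats", "fasta",
        "--output-folder", output_dir,
        "all",                                    # taxonomic group: no restriction
    ]


if __name__ == "__main__":
    cmd = build_download_command(
        "accessions/2019-08-01-incomplete-genbank-genomes-accessions.txt"
    )
    print(" ".join(cmd))
```

The argument list can be passed to `subprocess.run` once the command has been checked by eye against the installed version of ncbi-genome-download.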

Requirements

Metagenomic Datasets

Data

  • ncbi-bioproject-files/ contains individual bioproject accession information for all datasets, from which genomes were downloaded through ncbi-genome-download and used to merge metadata
  • bioproject-accession-lists/ contains accession lists for each bioproject, and the combined list for bulk download
  • metadata/ contains more detailed metadata on specific metagenomic projects and the genomes downloaded from NCBI

Massively Parallel Search of GenBank Assemblies for Specific Markers

Previously, I would download the entire GenBank database (~200,000 genomes) and then loop over the genomes one by one to reformat, annotate, and search for specific markers of interest. This was extremely tedious, took up a lot of space on a server, and was slow because every step ran serially. Using the resources available through HTCondor and the UW-Madison Center for High-Throughput Computing, I've restructured these steps so that jobs are split by genome assembly, with the reformatting, annotation, and marker searches performed within each job. This way the jobs are highly parallel and can flock out to other resources such as the Open Science Grid. The only things that need to change periodically are the list of GenBank assemblies/FTP paths in the metadata/ folder, if there are major updates to the database, and the marker you want to search for, which is specified in the submit file.
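To make the per-assembly unit of work concrete, here is a minimal sketch of what one job might do for one assembly: derive the genomic FASTA URL from the assembly's FTP directory, then call genes with Prodigal and search the proteins with hmmsearch. It is not the repository's actual executable. The URL rule follows the standard NCBI assembly directory layout, while the function names, the local file names, and the marker.hmm profile are hypothetical.

```python
def genomic_fasta_url(ftp_path):
    """Return the URL of the genomic FASTA inside an NCBI assembly FTP directory.

    NCBI names the file after the directory itself, e.g.
    .../GCA_000005845.2_ASM584v2/GCA_000005845.2_ASM584v2_genomic.fna.gz
    """
    ftp_path = ftp_path.rstrip("/")
    assembly_name = ftp_path.rsplit("/", 1)[-1]
    return f"{ftp_path}/{assembly_name}_genomic.fna.gz"


def job_commands(assembly_name, marker_hmm="marker.hmm"):
    """Build the gene-calling and marker-search commands for one assembly.

    Assumes the FASTA has already been downloaded and decompressed to
    <assembly_name>.fna; marker_hmm is a hypothetical profile file.
    """
    genome = f"{assembly_name}.fna"
    proteins = f"{assembly_name}.faa"
    return [
        # Predict genes; write protein translations (-a) and gene coordinates (-o)
        ["prodigal", "-i", genome, "-a", proteins, "-o", f"{assembly_name}.gbk"],
        # Search predicted proteins against the marker profile, tabular output
        ["hmmsearch", "--tblout", f"{assembly_name}.tbl", marker_hmm, proteins],
    ]


if __name__ == "__main__":
    path = ("ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/"
            "GCA_000005845.2_ASM584v2")
    print(genomic_fasta_url(path))
    for cmd in job_commands("GCA_000005845.2_ASM584v2"):
        print(" ".join(cmd))
```

Because each assembly maps to its own input and output files, the jobs do not depend on one another and can be distributed freely across the pool.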

This pipeline overlaps with, but also differs from, the steps described above. With the steps above, you can search specific large-scale metagenomic projects for a marker, or simply build an environmental MAG database. This pipeline instead searches through all GenBank genomes in one go, including those from metagenomic projects.

The steps for running the pipeline are highly specific to UW-Madison's HTCondor system, and in particular to the Center for High Throughput Computing cluster. Setup is a bit convoluted, but once it is done you can perform searches for any marker you choose without downloading all of GenBank locally, which at this point you might consider more worthwhile.

  1. Clone this repository to get all the executables and scripts.
  2. Follow the directions to install an Anaconda Python distribution, with Prodigal, HMMER, and Biopython installed through conda.
  3. Follow the prepare-chtc-wrapper.md instructions, which are based on the ChtcRun package. The metadata files have already been split up in the metadata/splits folder; you only need to follow the directions to configure the ChtcRun package correctly with the shared folder and corresponding queue directories.

genomes-mags-database's People

Contributors

elizabethmcd

Forkers

agronomist raufs

genomes-mags-database's Issues

Create DAG to submit in clusters

The total number of jobs exceeds the maximum number of jobs that can be submitted in a session.
The submissions will probably have to be split up with a DAG, or split manually by lines in the metadata file.
The first option is probably best for reproducibility, but somewhat more painful.
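One way the DAG option could look, as a sketch only: generate a DAGMan input file with one node per metadata split, all sharing a single submit file. The `JOB` and `VARS` keywords are standard DAGMan syntax; the submit file name search.sub, the node naming scheme, and the `splitfile` variable are hypothetical and would need to match whatever the real submit file expects.

```python
def build_dag(split_files, submit_file="search.sub"):
    """Return DAGMan input text with one independent node per metadata split.

    Each node reuses the same submit file and receives its split through a
    VARS macro, which the submit file can reference as $(splitfile).
    """
    lines = []
    for index, split in enumerate(split_files):
        node = f"split{index:04d}"
        lines.append(f"JOB {node} {submit_file}")
        lines.append(f'VARS {node} splitfile="{split}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    splits = ["metadata/splits/part-0000.txt", "metadata/splits/part-0001.txt"]
    print(build_dag(splits), end="")
```

Since the nodes have no parent/child relationships, DAGMan is used here only as a throttle: submitting with `condor_submit_dag -maxjobs N` limits how many node jobs sit in the queue at once, which may be enough to stay under the per-session submission limit.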

FTP downloads

NCBI wget downloads seem to fail after downloading around 100 genomes, and each job currently handles 500 genomes. Each metadata file could be split so that it holds 50 genomes instead, which would create 4000 jobs. Technically, though, the DAG system submits 1 large job and then goes downstream, so this might be fine.
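The arithmetic behind the 4000-job figure can be checked with a small sketch that splits metadata lines into fixed-size chunks. The function name is hypothetical, and the 200,000-line total is taken from the approximate GenBank size quoted earlier in this README rather than from the actual metadata file.

```python
def chunk_lines(lines, size=50):
    """Yield consecutive slices of `lines`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(lines), size):
        yield lines[start:start + size]


if __name__ == "__main__":
    # Stand-in for ~200,000 metadata lines, one per genome assembly
    metadata = [f"assembly_{i}" for i in range(200_000)]
    print(len(list(chunk_lines(metadata, size=500))))  # current split size
    print(len(list(chunk_lines(metadata, size=50))))   # proposed split size
```

Keeping each chunk at 50 genomes stays below the roughly 100 downloads after which wget was seen to fail, at the cost of ten times as many jobs.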
