Creating a Database of Reference Genomes and Metagenome-Assembled Genomes (MAGs)

This repository describes how to build a database of reference genomes from RefSeq and metagenome-assembled genomes (MAGs) from multiple large-scale metagenomic projects spanning various environments. The database does not include human- or host-associated MAGs and is intended mostly for exploring genomes and marker genes of environmental metagenomes. The included scripts cover downloading and reformatting sets of genomes, then calling genes or performing functional annotation on a specific subset of the downloaded genomes for further analysis.

The RefSeq database was built in July 2019. As of that date, all complete RefSeq genomes plus the MAGs from the datasets below amount to approximately 30,000 genomes.

To include all genomes from NCBI regardless of completion status, download the genomes listed in accessions/2019-08-01-incomplete-genbank-genomes-accessions.txt. This list covers all genomes of assembly level chromosome, contig, or scaffold deposited in GenBank as of 2019-08-01. The full metadata file is too large to store on GitHub, so it is kept in an OSF repository, with dated folders for updated database files.
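As a rough illustration of the bulk download, the sketch below reads an accession list and assembles an ncbi-genome-download command line without running it. It assumes the list holds one assembly accession per line, and the function names, the output folder name, and the "all" taxonomic group are illustrative choices rather than anything fixed by this repository. The flags are the ones I understand ncbi-genome-download to accept; confirm them against `ncbi-genome-download --help` for your installed version.

```python
from pathlib import Path


def read_accessions(path):
    """Return accessions from a text file, skipping blank lines and '#' comments.

    Assumes one assembly accession per line (an assumption about the file layout).
    """
    accessions = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions


def build_download_command(accession_file, output_dir="genbank-genomes", group="all"):
    """Assemble (but do not run) an ncbi-genome-download command as an argument list.

    output_dir and group are hypothetical defaults chosen for this sketch.
    """
    return [
        "ncbi-genome-download",
        "--section", "genbank",                        # search GenBank rather than RefSeq
        "--assembly-accessions", str(accession_file),  # restrict to the listed accessions
        "--formats", "fasta",                          # genomic FASTA only
        "--output-folder", output_dir,
        group,
    ]


if __name__ == "__main__":
    accession_file = "accessions/2019-08-01-incomplete-genbank-genomes-accessions.txt"
    print(" ".join(build_download_command(accession_file)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the printed command looks right.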

Requirements

Metagenomic Datasets

Anantharaman et al. 2016 "Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system". Bioproject: PRJNA288027
Parks et al. 2017 "Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life". Bioproject: PRJNA348753
Woodcroft et al. 2018 "Genome-centric view of carbon processing in thawing permafrost". Bioproject: PRJNA386568
Crits-Christoph et al. 2018 "Novel soil bacteria possess diverse genes for secondary metabolite biosynthesis". Bioproject: PRJNA449266
Tully et al. 2018 "The reconstruction of 2,361 draft metagenome-assembled genomes from the global oceans". Bioproject: PRJNA391943
Dombrowski et al. 2018 "Expansive microbial metabolic versatility and biodiversity in dynamic Guaymas Basin hydrothermal sediments". Bioproject: PRJNA362212

Data

ncbi-bioproject-files/ contains the bioproject accession information for each dataset; genomes were downloaded from these with ncbi-genome-download, and the files were also used to merge metadata
bioproject-accession-lists/ contains accession lists for each bioproject, and the combined list for bulk download
metadata/ contains more detailed metadata on specific metagenomic projects and on the genomes downloaded from NCBI
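One way to produce a combined list for bulk download from the per-bioproject lists is sketched below. It assumes each list is a plain-text file with one accession per line and a .txt extension, which may not match the actual files in bioproject-accession-lists/; the function name and output filename are hypothetical.

```python
from pathlib import Path


def combine_accession_lists(list_dir, pattern="*.txt"):
    """Merge every accession list in list_dir into one ordered, de-duplicated list.

    The '*.txt' pattern and one-accession-per-line layout are assumptions.
    """
    seen = set()
    combined = []
    # Sort the files so the combined order is reproducible between runs.
    for list_file in sorted(Path(list_dir).glob(pattern)):
        for line in list_file.read_text().splitlines():
            accession = line.strip()
            if accession and accession not in seen:
                seen.add(accession)
                combined.append(accession)
    return combined


if __name__ == "__main__":
    accessions = combine_accession_lists("bioproject-accession-lists")
    # "combined-accessions.txt" is a hypothetical output name.
    Path("combined-accessions.txt").write_text("\n".join(accessions) + "\n")
    print(f"{len(accessions)} unique accessions written")
```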

Massively Parallel Search of GenBank Assemblies for Specific Markers

Previously, I would download the entire GenBank database (~200,000 genomes) and loop over the genomes one at a time to reformat, annotate, and search for specific markers of interest. This was tedious, took up a lot of server space, and was slow. Using the resources available through HTCondor and the UW-Madison Center for High-Throughput Computing, I've restructured these steps so that each job handles one genome assembly and performs the reformatting, annotation, and marker search for it. The jobs can therefore run highly parallel and can flock out to other resources such as the Open Science Grid. The only things that need to change periodically are the list of GenBank assemblies/FTP paths in the metadata/ folder, if there are major updates to the database, and the marker you want to search for, which is specified in the submit file.
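To make the per-assembly idea concrete, here is a minimal sketch of two helpers: one derives the genomic FASTA URL from an assembly's FTP directory path, following NCBI's convention that the file is named after the directory with a `_genomic.fna.gz` suffix, and one splits a list of assemblies into fixed-size chunks for separate jobs. The function names and chunk size are hypothetical and are not taken from the repository's scripts.

```python
def genomic_fasta_url(ftp_path):
    """Build the genomic FASTA URL for an assembly from its FTP directory path.

    NCBI names the file <directory name>_genomic.fna.gz inside that directory.
    """
    ftp_path = ftp_path.rstrip("/")
    basename = ftp_path.rsplit("/", 1)[-1]
    return f"{ftp_path}/{basename}_genomic.fna.gz"


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


if __name__ == "__main__":
    path = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2"
    print(genomic_fasta_url(path))
    # A chunk size of 2 is only for demonstration.
    print(split_into_chunks(["a", "b", "c", "d", "e"], 2))
```

Because each job only needs its own FTP path, nothing has to be downloaded ahead of time on the submit node.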

This pipeline overlaps with, but differs from, the steps described above. With the steps above, you can search specific large-scale metagenomic projects for a marker, or simply build an environmental MAG database. This pipeline can search through all GenBank genomes in one go, including those from metagenomic projects.

The steps for running the pipeline are highly specific to UW-Madison's HTCondor system, specifically the Center for High Throughput Computing cluster. Setup is a bit convoluted, but once it is done you can search for any marker you choose without downloading all of GenBank locally, which at this scale you might consider more worthwhile.

Clone this repository to get all the executables and scripts.
Follow the directions to install an Anaconda Python distribution, then use conda to install Prodigal, HMMER, and Biopython.
Follow the prepare-chtc-wrapper.md instructions, which are based on the ChtcRun package. The metadata files have already been split up in the metadata/splits folder, so you only need to follow the directions to configure the ChtcRun package correctly with the shared folder and corresponding queue directories.
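For orientation, the work done inside a single job might look roughly like the sketch below: decompress a downloaded genome, call genes with Prodigal, and search the predicted proteins against a marker HMM with hmmsearch. This is not the repository's actual job script. The function names and file names are hypothetical, and only basic, widely documented options of each tool are used (`-i`, `-a`, `-o`, `-q` for Prodigal; `--tblout` for hmmsearch).

```python
import gzip
import shutil
import subprocess


def decompress_fasta(gz_path, out_path):
    """Write an uncompressed copy of a gzipped FASTA file and return its path."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def prodigal_command(genome_fasta, proteins_out, genes_out):
    """Argument list for gene calling: proteins go to -a, gene coordinates to -o."""
    return ["prodigal", "-i", genome_fasta, "-a", proteins_out, "-o", genes_out, "-q"]


def hmmsearch_command(marker_hmm, proteins_fasta, table_out):
    """Argument list for searching predicted proteins against one marker HMM."""
    return ["hmmsearch", "--tblout", table_out, marker_hmm, proteins_fasta]


def run_job(genome_gz, marker_hmm, prefix):
    """Run decompression, gene calling, and the marker search for one assembly.

    Requires prodigal and hmmsearch on PATH; output names are hypothetical.
    """
    fasta = decompress_fasta(genome_gz, f"{prefix}.fna")
    subprocess.run(prodigal_command(fasta, f"{prefix}.faa", f"{prefix}.gbk"), check=True)
    subprocess.run(hmmsearch_command(marker_hmm, f"{prefix}.faa", f"{prefix}.tbl"), check=True)
    return f"{prefix}.tbl"
```

Keeping command construction separate from execution makes it easy to print and inspect the commands before submitting thousands of jobs.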
