spacegraphcats / 2018-paper-spacegraphcats

Paper text and pipeline for "Exploring neighborhoods in large metagenome assembly graphs..."

Home Page: https://biorxiv.org/content/early/2018/11/05/462788

spacegraphcats

Explore large, annoying graphs using hierarchies of dominating sets - because in space, no one can hear you miao!

This is a collaboration between the Theory In Practice lab at University of Utah, the Lab for Data Intensive Biology at UC Davis, and Dr. Felix Reidl at Birkbeck University of London. Initial development of spacegraphcats was generously supported by the Moore Foundation's Data Driven Discovery Initiative.

Documentation

This README file contains quickstart information. For use cases and other information, please see the spacegraphcats documentation at https://spacegraphcats.github.io/spacegraphcats.

Installation and execution quickstart

See installation instructions and the run guide.

For help or support with this software, please file an issue on GitHub. Thank you!

Quickstart

There are two quickstart examples available! Please see dory-example and twofoo-example. The latter example includes a snakemake Snakefile.

Notable dependencies

spacegraphcats uses code from BBHash, a C++ library for building minimal perfect hash functions (Guillaume Rizk, Antoine Limasset, Rayan Chikhi; see Limasset et al., 2017, arXiv), as wrapped by pybbhash.

spacegraphcats also uses functionality from khmer and sourmash.

Citation information

See the Genome Biology publication Exploring neighborhoods in large metagenome assembly graphs using spacegraphcats reveals hidden sequence diversity, Brown et al., 2020, doi: https://doi.org/10.1186/s13059-020-02066-4.

Pointers to interesting code

Interesting algorithms

The rdomset code for efficiently calculating a dominating set of a graph at a given radius R is in spacegraphcats/catlas/rdomset.py.

The graph denoising code for removing low-abundance pendants from BCALM cDBGs is in function contract_degree_two in cdbg/bcalm_to_gxt.py.

Part of the indexPieces code for indexing cDBG nodes by dominating nodes is in cdbg/index_cdbg_by_kmer.py. The remainder is implemented in search, below.

The search code for extracting query neighborhoods is in search/query_by_sequence.py; see especially the call to kmer_idx.count_cdbg_matches(...).
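
To make the idea of a dominating set at radius R concrete, here is a minimal sketch. It is a simple greedy heuristic written for illustration, not the algorithm implemented in spacegraphcats/catlas/rdomset.py; the function names and the adjacency-dict graph representation are assumptions, and the graph is assumed to be undirected.

```python
from collections import deque


def nodes_within_radius(adj, source, radius):
    """Return all nodes at distance <= radius from source, using BFS."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == radius:
            continue  # do not expand past the requested radius
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


def greedy_rdomset(adj, radius):
    """Greedily pick nodes until every node is within `radius` of a pick.

    Hypothetical helper: the result is a valid radius-R dominating set,
    but it is not guaranteed to be small.
    """
    dominated = set()
    domset = []
    for node in sorted(adj):  # fixed order keeps the result deterministic
        if node not in dominated:
            domset.append(node)
            dominated |= nodes_within_radius(adj, node, radius)
    return domset


# A path graph 0-1-2-3-4 as an undirected adjacency dict.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_rdomset(path, 1))
print(greedy_rdomset(path, 2))
```

Larger radii yield fewer dominating nodes, which is what makes a hierarchy of dominating sets useful for summarizing a large graph.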

Interesting library functionality

Code for indexing large FASTQ/FASTA read files by cDBG unitig is available in cdbg/label_cdbg.py. Code for extracting the reads corresponding to individual unitigs from BGZF files is available in search/search_utils.py (see get_reads_by_cdbg).

2018-paper-spacegraphcats's Issues

format headers of plass output to contain nbhd name and to be unique

The following code is how I format the headers in the plass output to be unique.

The last step, which concatenates all of the contigs into one file, is optional :)

# download data
curl -L -o hu-s1-plass-hardtrim-jan08.2019.tar.gz https://osf.io/uvb27/download
tar xvf hu-s1-plass-hardtrim-jan08.2019.tar.gz
cd hu-s1_k31_r1_search_oh0/

# format amino acid headers:
# prepend nbhd name, cut read names, and deduplicate amino acid contig names

# prepend file name (minus the pipeline suffix) to each header
for infile in *fa
do
  awk '/^>/{sub(">","&"FILENAME"_");sub(/\.fa\.cdbg_ids\.reads\.hardtrim\.fa\.gz\.plass\.cdhit\.fa/,"")}1' "$infile" > "${infile}.clean"
done

# cut headers after first space
for infile in *clean
do
  cut -d ' ' -f1 "$infile" > "${infile}.cut"
done

# deduplicate amino acid names by appending an occurrence count to repeats
for infile in *cut
do
  awk '(/^>/ && s[$0]++){$0=$0"_"s[$0]}1' "$infile" > "${infile}.dup"
done

# create one nbhd sequence file (run from the parent directory)
cd ..
cat hu-s1_k31_r1_search_oh0/*.dup > hu-s1_k31_r1_search_oh0_all.fa
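
For anyone who prefers to do the same header cleanup in Python, the three steps above (prepend the nbhd name, cut at the first space, deduplicate) can be sketched as follows. This is an illustrative equivalent, not code from the repository; the function names are hypothetical, and the suffix string is copied from the awk command above.

```python
from collections import Counter

# Suffix stripped from the file name to get the nbhd name (from the awk step).
SUFFIX = ".fa.cdbg_ids.reads.hardtrim.fa.gz.plass.cdhit.fa"


def nbhd_name(filename):
    """Strip the pipeline suffix from a file name, if present."""
    if filename.endswith(SUFFIX):
        return filename[: -len(SUFFIX)]
    return filename


def clean_headers(lines, nbhd):
    """Prefix, truncate, and deduplicate FASTA headers.

    The first occurrence of a header is kept as is; later repeats get
    _2, _3, ... appended, matching the awk deduplication step.
    """
    seen = Counter()
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # keep only the text before the first space, prefixed by nbhd
            header = ">" + nbhd + "_" + line[1:].split(" ", 1)[0]
            seen[header] += 1
            if seen[header] > 1:
                header = f"{header}_{seen[header]}"
            line = header
        out.append(line)
    return out


# Made-up records, for illustration only.
example = [">c1 len=5", "MKV", ">c1 len=7", "MKL"]
cleaned = clean_headers(example, nbhd_name("nbhdA" + SUFFIX))
print("\n".join(cleaned))
```

Doing all three steps in one pass avoids the intermediate .clean and .cut files.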

Was SB1 adapter trimmed?

In README.md, I see instructions to k-mer trim the SB1 sample. Was this sample also adapter trimmed prior to k-mer trimming? If so, what commands were used to do this?
@ctb

update checkm rule to output tab separated table

The files in paper/figures/files_checkm/ are annoyingly difficult to parse. As per Ecogenomics/CheckM#29, there is an option --tab_table that will produce a tab separated file without the fancy formatting for easy parsing. I think this can be added in these lines of code:

https://github.com/spacegraphcats/2018-paper-spacegraphcats/blob/master/pipeline-base/Snakefile#L572
https://github.com/spacegraphcats/2018-paper-spacegraphcats/blob/master/pipeline-base/Snakefile#L584
https://github.com/spacegraphcats/2018-paper-spacegraphcats/blob/master/pipeline-base/Snakefile#L596
https://github.com/spacegraphcats/2018-paper-spacegraphcats/blob/master/pipeline-base/Snakefile#L607

Each updated command would then look like this (plass rule shown):

rm -fr checkm.plass.bins && mkdir checkm.plass.bins && ln {input} checkm.plass.bins && checkm lineage_wf -x fa checkm.plass.bins checkm.plass.out -t {threads} --genes --pplacer_threads={threads} --tab_table -f checkm-plass.txt
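
Once the output is tab separated, parsing it takes only a few lines with the standard library. The sketch below is an illustration under assumptions: the helper name is hypothetical, and the column names and values in the sample are made up for the example rather than taken from a real CheckM run.

```python
import csv
import io


def read_tab_table(handle):
    """Read a tab-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up sample; real column names depend on the CheckM output.
sample = "Bin Id\tCompleteness\tContamination\nbin_1\t95.5\t1.2\n"
rows = read_tab_table(io.StringIO(sample))
for row in rows:
    print(row["Bin Id"], float(row["Completeness"]))
```

For a file on disk, pass an open file handle instead, for example `open("checkm-plass.txt", newline="")`.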
