RecoverY

RecoverY is a tool for shortlisting enriched reads from a sequencing dataset, based on k-mer abundance. Specifically, it can be used for isolating Y-specific reads from a Y flow-sorted dataset.

Usage

python recoverY.py

Important parameters for the user to choose are :

kmer-size :

  • the size of k used while iterating through every read
  • this must be the same as DSK's kmer-size
  • usually optimal in the range [25, 31] for Illumina 150x150 bp reads

strictness :

  • the # of successful matches to the Ymer table required per read, before classifying a read as Y-specific
  • usually optimal in the range [20, 50] for Illumina 150x150 bp reads

num_processors :

  • the # of processors available to RecoverY
  • currently set to 8
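
How kmer-size and strictness interact can be illustrated with a minimal sketch. This is not RecoverY's actual code: the function names `iter_kmers` and `is_y_read` and the toy Ymer set are hypothetical, and the sketch assumes a read is kept as soon as at least `strictness` of its k-mers are found in the Ymer table.

```python
def iter_kmers(seq, k):
    """Yield every k-mer of length k in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def is_y_read(seq, ymers, k, strictness):
    """Return True once `strictness` k-mers of the read are found in ymers.

    ASSUMPTION: a simple match count against a set of k-mers; the real
    classifier may differ in its details.
    """
    matches = 0
    for kmer in iter_kmers(seq, k):
        if kmer in ymers:
            matches += 1
            if matches >= strictness:
                return True
    return False


# Toy example (hypothetical data): k = 4, two matches required.
ymers = {"ACGT", "CGTA", "GTAC"}
print(is_y_read("ACGTACGG", ymers, k=4, strictness=2))
print(is_y_read("TTTTTTTT", ymers, k=4, strictness=2))
```

A 150 bp read with k = 25 contains 150 - 25 + 1 = 126 k-mers, so a strictness of 20 to 50 asks for roughly 16% to 40% of them to match. DSK counts canonical k-mers, so a full implementation would presumably also have to consider the reverse complement of each k-mer; the sketch omits this.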

Installation

git clone https://github.com/makovalab-psu/RecoverY
cd RecoverY

Dependencies

The latest DSK binary (v2.2.0 for Linux) is provided in the dependency folder. See https://gatb.inria.fr/software/dsk/ for alternate versions. You do not need to install DSK separately.

Numpy and Biopython can be installed as follows :

pip install numpy
pip install biopython

Input

The following input files are required in the ./data folder. Note that RecoverY currently expects the folder to be named "data".

r1.fastq : Enriched raw reads (first in pair) 
r2.fastq : Enriched raw reads (second in pair) 
kmers_from_reads : kmer counts from DSK for r1.fastq
trusted_kmers : kmer counts from DSK for human Y single copy genes

The input folder and file names can be changed by the user within the program.

Output

The ./output folder contains :

op_r1.fastq
op_r2.fastq

These are the Y-reads files produced by RecoverY.

Example

The data folder contains an example reads dataset and kmer tables. It can be used to test if RecoverY runs to completion.

Before running recoverY.py, please navigate to the data folder and un-compress the tar.xz file :

cd data/
tar xf kmers_from_reads.tar.xz

Subsequently, RecoverY can be run as :

cd ../
python recoverY.py

Results :

The data/r1.fastq and data/r2.fastq files were generated from hg38 using wgsim. Thus, each FASTQ record header names the chromosome of origin of the read.

Using grep and wc commands, one can check if RecoverY has correctly retrieved most of the Y-reads.

grep "@chrY" data/r1.fastq | wc -l
grep "@chrY" output/op_r1.fastq | wc -l
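
The same check can be sketched in Python, which also makes it easy to report the recovered fraction. The function name `count_chr_reads` is hypothetical, and the sketch assumes 4-line FASTQ records whose headers start with "@chrY", as in the example dataset. Unlike the grep commands, it only inspects header lines, so a quality string that happens to contain "@chrY" is not counted.

```python
import os


def count_chr_reads(fastq_path, chrom="chrY"):
    """Count FASTQ records whose header line starts with '@' + chrom."""
    prefix = "@" + chrom
    count = 0
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            # ASSUMPTION: 4 lines per record, so headers sit on lines 0, 4, 8, ...
            if line_number % 4 == 0 and line.startswith(prefix):
                count += 1
    return count


if __name__ == "__main__":
    in_path, out_path = "data/r1.fastq", "output/op_r1.fastq"
    if os.path.exists(in_path) and os.path.exists(out_path):
        total_y = count_chr_reads(in_path)
        recovered_y = count_chr_reads(out_path)
        print("Y reads in input  :", total_y)
        print("Y reads in output :", recovered_y)
        if total_y:
            print("fraction recovered: %.3f" % (recovered_y / float(total_y)))
```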

Generating k-mer counts with DSK

The ./dependency folder contains a DSK binary and a script that help generate k-mer counts required for RecoverY. Usage is as follows :

cd dependency
./run_dsk.sh <FASTQ_FILE>

In this case, FASTQ_FILE is r1.fastq. The kmer_counts table will be generated in

./dependency/dsk_output/kmer_counts_from_dsk
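
For custom analyses, the k-mer count table can be loaded into a dictionary. The sketch below assumes the table is plain text with one whitespace-separated "KMER COUNT" pair per line; this layout is an assumption about the file produced by run_dsk.sh, and `load_kmer_counts` is a hypothetical helper, not part of RecoverY.

```python
import os


def load_kmer_counts(path):
    """Read a 'KMER COUNT' text table into a {kmer: count} dictionary."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            kmer, count = fields
            counts[kmer] = int(count)
    return counts


if __name__ == "__main__":
    table = "dependency/dsk_output/kmer_counts_from_dsk"
    if os.path.exists(table):
        counts = load_kmer_counts(table)
        print("distinct k-mers:", len(counts))
```

A dictionary keeps lookups fast, but for a full sequencing run the table may hold a very large number of distinct k-mers, so memory use should be checked first.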

Generating k-mer plots

Matplotlib and Seaborn are required to generate k-mer plots.

pip install matplotlib
pip install seaborn

After installation, please un-comment the following lines in recoverY.py :

print "Generating kmer plot"
plot_kmers.plot_kmers()

Scripts

The following scripts are included with this distribution of RecoverY, and are run automatically by recoverY.py as part of the pipeline. Users may also run them separately for custom needs.

kmers.py

a set of general purpose functions to work with kmers

kmerPaint.py

input : trusted_kmers and kmers_from_reads
output : Ymer_table with new abundance threshold
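
This README does not spell out how the new abundance threshold is chosen. The sketch below shows one plausible scheme and should not be read as the kmerPaint.py algorithm: it assumes the threshold is a low quantile of the read abundances of the trusted k-mers, and that every read k-mer at or above the threshold enters the Ymer table. The names `abundance_threshold` and `build_ymer_table`, the 0.05 quantile, and the toy counts are all hypothetical.

```python
def abundance_threshold(trusted_kmers, read_counts, quantile=0.05):
    """Pick a threshold from the read abundances of the trusted k-mers.

    ASSUMPTION: a low quantile is used; the real rule may differ.
    """
    observed = sorted(read_counts[k] for k in trusted_kmers if k in read_counts)
    if not observed:
        raise ValueError("no trusted k-mer was found in the read k-mer table")
    index = int(quantile * (len(observed) - 1))
    return observed[index]


def build_ymer_table(read_counts, threshold):
    """Keep every read k-mer whose abundance is at least the threshold."""
    return {k: c for k, c in read_counts.items() if c >= threshold}


# Toy counts (hypothetical): enriched Y k-mers are abundant, the rest are rare.
read_counts = {"AAAA": 2, "ACGT": 40, "CGTA": 55, "GTAC": 38, "TTTT": 3}
trusted = {"ACGT", "CGTA", "GGGG"}  # GGGG never occurs in the reads

threshold = abundance_threshold(trusted, read_counts)
ymers = build_ymer_table(read_counts, threshold)
print(threshold, sorted(ymers))
```

The idea behind such a scheme is that, in a Y flow-sorted dataset, k-mers from the Y chromosome are expected to be more abundant than k-mers from contaminating chromosomes, so the trusted k-mers indicate what abundance an enriched k-mer should reach.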

classify_as_Y_chr.py

input : all raw reads (first in pair) and Ymer table
output : Y-specific reads according to RecoverY algorithm (first in pair)

find_mates.py

input : all raw reads (second in pair) and Y-specific reads according to RecoverY algorithm (first in pair)
output : Y-specific reads according to RecoverY algorithm (second in pair)
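
Mate retrieval of this kind can be sketched as a lookup by read name. The code below is an illustration under stated assumptions, not the contents of find_mates.py: it assumes 4-line FASTQ records and that mates share a read name once an optional trailing /1 or /2 is removed. The helpers `read_fastq`, `read_id` and `find_mates` are hypothetical.

```python
def read_fastq(path):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ."""
    with open(path) as handle:
        while True:
            record = [handle.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:
                break  # end of file
            yield tuple(record)


def read_id(header):
    """Drop '@', any text after whitespace, and a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def find_mates(y_r1_path, all_r2_path, out_r2_path):
    """Write the second-in-pair records whose mate is in y_r1_path."""
    wanted = {read_id(rec[0]) for rec in read_fastq(y_r1_path)}
    kept = 0
    with open(out_r2_path, "w") as out:
        for rec in read_fastq(all_r2_path):
            if read_id(rec[0]) in wanted:
                out.write("\n".join(rec) + "\n")
                kept += 1
    return kept
```

Only the read names of the Y-specific first-in-pair reads are held in memory, and the second-in-pair file is streamed once, so the output follows the order of that file.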

License

This program is released under the MIT License. Please see LICENSE.md for details.

Citation

Please cite this GitHub repository if you use this tool in your research. Thanks ! https://github.com/makovalab-psu/RecoverY

Contributors

md5sam
