README SimWreck V1.0

Mark Timothy Rabanus-Wallace
November 2015
Australian Centre for Ancient DNA (ACAD)


______________________________
An aDNA read simulator for testing analysis pipelines.
Give it a genome and describe to it what condition the reads will be in.
Face the STDOUT and prepare to catch your sequences.

SimWreck requires BioPerl (http://www.bioperl.org/wiki/Installing_BioPerl)
______________________________


SYNOPSIS:

SimWreck -i genome.fasta -n50 > wreckedreads.fasta #fifty damaged short reads with default parameters
SimWreck -i genome.fasta -X -d.5 -D.15 #apply deamination damage to each sequence
nohup parallel 'SimWreck -i {} -n1000000 -d.3 -D.4 -s4 -S12 -m20 -M180 -p2 -b.02 > {.}wrecked.fasta' ::: *genome.fasta & #one million reads each from all *genome.fasta, produced in parallel, with customised parameters
SimWreck -P -s4 -S12 -m20 -M180 -a120 -u25 #look at the read length distribution
        
ARGUMENTS:  [interval] (default)    Description
    
General
    
-i				Genome file to make reads from, in fasta format.
-n      [1,+inf) (10000) 	Number of reads desired.
-r      [0,1] (.5)      	Proportion of minus strand reads.    
-X                  		Damage-only mode. Returns each input sequence whole, with damage added as per any damage parameters specified.

Size distribution
    
-s      (0,+inf) (4)        	First shape parameter (alpha) of the beta distribution.
-S      (0,+inf) (4)        	Second shape parameter (beta) of the beta distribution.
-m      [0,+inf) (80)       	Nucleotide length at left end of beta distribution.
-M      (0,+inf) (280)      	Nucleotide length at right end of beta distribution.
    
Damage
    
-p      [0,+inf) (3)       	Depurination weight parameter. Ratio of nicks that occur 3' of a purine to nicks that occur 3' of a pyrimidine (e.g. at 2, it is twice as likely that a strand will begin or end with a purine).
-d      [0,+inf) (.3)      	Deamination weight parameter, influencing how frequently deamination results in pseudomutations.
-D      [0,+inf) (.5)      	Deamination decay parameter, influencing how the rate of deamination decays further from the ends of the sequence.
-b      [0,-d] (.01)       	Baseline rate of deamination.
    
Plot Function
    
-P                  		Plot mode. Plots the shape of the beta distribution requested by parameters -s, -S, -m, and -M.
-a      [0,+inf) (120)     	Plot width in characters.
-u      [0,+inf) (45)      	Plot height in characters.

TUTORIAL:

Welcome to SimWreck. If you suspect your analysis is being biased by the effects of DNA damage, this program will produce data that can help explore this suspicion.

SimWreck can produce NGS reads that have a similar damage profile to your library. You just have to tell it what that damage profile is like.
First up, you can adjust the length distribution of reads using -m, -M, -s, and -S. To explore what shapes are possible, try visualising a few curves …
SimWreck -P -a80 -u30 -s4 -S4
SimWreck -P -a80 -u30 -s4 -S8
SimWreck -P -a80 -u30 -s8 -S4
SimWreck -P -a80 -u30 -s2 -S2
SimWreck -P -a80 -u30 -s1 -S1
SimWreck -P -m20 -u25 -s1 -S4

You can change the size of the plot with -a (characters across) and -u (characters up).


… when satisfied with the curve, set the length range to your satisfaction by adjusting -m and -M.
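
To get a feel for what -s, -S, -m, and -M control together, the sketch below draws read lengths from a beta distribution stretched over the interval from -m to -M. It assumes length = m + (M - m) * Beta(s, S), which is one natural reading of the parameter descriptions and is not taken from the SimWreck source. The function name and its arguments are hypothetical.

```python
import random

def sample_read_lengths(n, shape_a=4.0, shape_b=4.0, min_len=80, max_len=280, seed=None):
    """Draw n read lengths from a beta distribution rescaled to [min_len, max_len].

    ASSUMPTION: this rescaling is an illustration only, not SimWreck's own code.
    """
    rng = random.Random(seed)
    span = max_len - min_len
    # betavariate returns a value in [0, 1]; stretch it over the length range.
    return [min_len + round(span * rng.betavariate(shape_a, shape_b)) for _ in range(n)]

# Mirrors the flags -s4 -S12 -m20 -M180 used elsewhere in this README.
lengths = sample_read_lengths(10000, shape_a=4, shape_b=12, min_len=20, max_len=180, seed=1)
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

With alpha smaller than beta the bulk of the lengths sits nearer -m, which matches the short, skewed length distributions typical of aDNA libraries.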


Once done, have a look at the MapDamage profile of a library you wish to emulate.
To set the deamination parameters, look at the deamination frequency curves.
The intercept with the y-axis can be entered as -d.
If the curve “bottoms out” at a particular value, enter it with -b.
The rate at which the curve decays from -d to -b is influenced by -D (higher=decaying faster). Default is usually pretty good, but values from .1 to 2 cover the range seen in most empirical damage profiles.
To set the depurination, compare the frequency of C/T to that of A/G at the first position before/after the reads. Work out the purine:pyrimidine ratio (i.e. if it’s 70% A/G and 30% C/T, the ratio is .7/.3 ~= 2.3) and enter it as -p.
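
The arithmetic above can be sketched in a few lines. The purine:pyrimidine ratio is exactly the calculation described in the text. The deamination curve, however, is only an assumption: the README does not state SimWreck's formula, so the sketch uses a plain exponential decay from -d towards -b as a stand-in for checking whether chosen values look like your damage plot. Both function names are hypothetical.

```python
import math

def purine_pyrimidine_ratio(purine_freq, pyrimidine_freq):
    """Value to pass as -p: frequency of A/G divided by frequency of C/T."""
    if pyrimidine_freq <= 0:
        raise ValueError("pyrimidine frequency must be positive")
    return purine_freq / pyrimidine_freq

def assumed_deamination_rate(distance, d=0.3, D=0.5, b=0.01):
    """ASSUMPTION: exponential decay from d at the read end towards baseline b.

    This is not SimWreck's documented formula; it only illustrates the roles
    of the y-intercept (d), the decay rate (D) and the floor (b).
    """
    return b + (d - b) * math.exp(-D * distance)

# 70% A/G and 30% C/T at the position flanking the reads.
print(round(purine_pyrimidine_ratio(0.7, 0.3), 2))

# Rate at increasing distance from the read end, for -d.3 -D.4 -b.02.
for pos in (0, 1, 5, 20):
    print(pos, round(assumed_deamination_rate(pos, d=0.3, D=0.4, b=0.02), 4))
```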

Now set the number of reads (-n) and specify the input genome (-i). Redirect stdout to a file and run the program.

SimWreck produces reads that overlap the ends of the reference as well as those that fall entirely within it. When a read overlaps an end, the unknown nucleotides are assigned “N” - so if you’ve requested high coverage (or given a very short reference sequence), a portion of the reads will be N-heavy.
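
The padding behaviour described above can be pictured with a short sketch. It is an illustration only, with a hypothetical function name and a toy reference: any position of a read that falls outside the reference is written as "N". How SimWreck chooses read start positions is not described here, so the start values below are picked by hand.

```python
def extract_read(reference, start, length):
    """Return `length` bases beginning at `start`.

    Positions before the first base or after the last base of the
    reference are filled with 'N'.
    """
    return "".join(
        reference[i] if 0 <= i < len(reference) else "N"
        for i in range(start, start + length)
    )

reference = "ACGTACGTAC"
print(extract_read(reference, -3, 6))  # overlaps the left end
print(extract_read(reference, 2, 4))   # lies entirely within the reference
print(extract_read(reference, 7, 6))   # overlaps the right end
```

The shorter the reference relative to the read length, the larger the share of start positions that produce such N-padded reads.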

Perhaps you’d like to see whether your analysis results would change if your existing reads actually had more deamination damage than they already do. In this case, use the -X flag. You may set the deamination parameters -d, -D, and -b, and give the program your reads with -i. Deamination will be applied according to the described profile.

SimWreck will recognise .fasta, .fa, .fastq, .fastq.gz, .fq.gz extensions, and will return fasta/q format depending on the input. The quality scores of returned reads are identical to those given in the reference.
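
If you are scripting around SimWreck, it can help to predict which output format to expect from a given input name. The sketch below maps only the extensions listed above to fasta or fastq. The function name is hypothetical, and the exact matching rules SimWreck uses (case sensitivity, other extensions) are not documented here, so treat this as an assumption.

```python
# Extensions listed in this README, mapped to the format of the returned reads.
RECOGNISED = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".fastq": "fastq",
    ".fastq.gz": "fastq",
    ".fq.gz": "fastq",
}

def expected_output_format(filename):
    """Guess the output format from the input file name (illustrative only)."""
    # Try the longest extensions first so that .fastq.gz is matched as a whole.
    for ext in sorted(RECOGNISED, key=len, reverse=True):
        if filename.endswith(ext):
            return RECOGNISED[ext]
    raise ValueError("extension not in the list given in this README: " + filename)

print(expected_output_format("testseq.fasta"))
print(expected_output_format("library.fq.gz"))
```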

Try, using the test file testseq.fasta:
SimWreck -i testseq.fasta
SimWreck -i testseq.fasta -n1000 -d.3 -D.4 -s4 -S12 -m20 -M180 -p2 -b.02
SimWreck -i testseq.fasta -n1000 -d.3 -D.4 -b.02 -X

