Git Product home page Git Product logo

mcoutbryk's Introduction

mcOutbryk

SNP calling pipeline using mccortex

Inspired by/follow up of https://github.com/iqbal-lab/outbryk.git

requirements:

vcftools

vcflib

bwa

python3

mccortex

Installation:

Clone this repository:

git clone https://github.com/hcdenbakker/mcOutbryk

Add the mcOutbryk folder to your path

Clone mccortex:

git clone --recursive https://github.com/mcveanlab/mccortex.git

Compile the mccortex63 binary:

cd mccortex

make MAXK=63 all
make MAXK=31 all

Add the mccortex63 binary and the mccortex/scripts folder to your path

Execution

Execute the pipeline in the folder with illumina fastq.gz files of the isolates you are interested in. Create a 1 column list with the sample names and decide on an appropriate reference sequence.

For help run: MCOutbryk. py -h

MCOutbryk - de novo SNP and ambiguous site Caller

optional arguments:
  -h, --help            show this help message and exit
  --reference REFERENCE [REFERENCE ...]
                        Reference genome in fasta format, include path to file
                        if file is not in same directory
  --sample_list SAMPLE_LIST [SAMPLE_LIST ...]
                        Text file, one isolate/sample per line
  --results_dir RESULTS_DIR [RESULTS_DIR ...]
                        Directory which will contain results after the
                        analysis
  --processes PROCESSES [PROCESSES ...]
                        number of processes to run in parallel
  --delete_highly_divergent {yes,no}
                        delete highly divergent isolates automatically
                        (default: yes)
  --div_cutoff [DIV_CUTOFF]
                        number of SNPs used as cutoff to define highly
                        divergent isolates (default: 5000)
  --K [K]               K-mer size (default: 33)
  --SNP_vcf SNP_VCF [SNP_VCF ...]
                        VCF with sites to be called; if this is given, the
                        steps to create a SNP sites vcf de novo from the
                        isolates in the sample list will be skipped.

To run mcOutbryk:

MCOutbryk.py --reference reference.fasta --sample_list your_sample_list.txt --results_dir directory_for_results --processes parallel_processes

Warning: while this pipeline is generally heavy on the RAM usage, buidling the uncleaned graphs takes about 10 Gb of memory per process, it is recommended to perform 1 process per 10 Gb of memory (I run 10 parallel processes on a 128 Gb machine)

The pipeline does the following:

  1. Checks if raw data for samples in the list are available in the execution directory

  2. Create raw and cleaned graphs

  3. Bubble call each individual isolate vs reference (and delete cleaned graphs and some intermediary files)

  4. Create vcfs with trusted SNPs and create a 'pseudo-vcf' of all trusted SNPs

  5. Use 'pseudo-vcf' to genotype (= call SNPs) of all isolates using their raw graph

  6. Create SNP sites multi-fasta file (stored in resultsdir/MC_consensus.fasta; ambiguous calls: N, gaps: -, impossible to call: ?)); this file can be used as direct input for FastTree

mcoutbryk's People

Contributors

hcdenbakker avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.