autoadapt

Overview

As of November 2013, the NCBI Sequence Read Archive contains over three million gigabytes of publicly available DNA and RNA sequencing files.

However, each file may be contaminated by any of a wide variety of sequencing adaptors and primers, and these sequences normally need to be removed before any further analysis.

We developed a tool that automatically detects which adaptors and primers are present in a FASTQ file and removes those sequences from the file, as well as detecting the quality score encoding type used.

We currently make heavy use of FastQC and cutadapt, both of which are included in the tools folder.

License

# autoadapt - Automatic quality control for FASTQ sequencing files
# Copyright (C) 2013  Rupert Shuttleworth
# [email protected]

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Install

autoadapt needs special versions of FastQC and cutadapt. They are installed locally (inside the autoadapt/tools folder). To install them, type:

make install

Usage

autoadapt 0.2

Usage: ./autoadapt.pl [ <options> ] { <unpaired-in> <unpaired-out> | <paired-in-1> <paired-out-1> <paired-in-2> <paired-out-2> }

Options:
    --threads=N               number of threads to use (default: 1)
    --quality-cutoff=N        quality cutoff for BWA trimming algorithm (default: 20)
    --minimum-length=N        minimum length of sequences (default: 18)

Technical details

First we run FastQC to determine the quality score encoding type (e.g. phred33, phred64) and to look for any over-represented sequences that match against known adaptors and primers in the FastQC contaminants_list.txt file.
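To illustrate what detecting the encoding type involves, here is a minimal Python sketch of a common heuristic: look at the lowest character code found in the quality lines. This is not FastQC's implementation. The function name guess_quality_encoding is hypothetical, the input is assumed to use four-line FASTQ records, and the threshold of 64 is a simplification that can misclassify a phred33 file containing only very high quality scores.

```python
def guess_quality_encoding(fastq_lines):
    """Guess 'phred33' or 'phred64' from an iterable of FASTQ lines.

    Assumes four-line records (header, sequence, '+', quality).
    Returns None if no quality characters were seen.
    """
    lowest = None
    for index, line in enumerate(fastq_lines):
        if index % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for char in line.rstrip("\r\n"):
            code = ord(char)
            if lowest is None or code < lowest:
                lowest = code
    if lowest is None:
        return None
    # Characters below '@' (code 64) cannot occur in a phred64 file.
    return "phred33" if lowest < 64 else "phred64"


if __name__ == "__main__":
    example = ["@read1", "ACGT", "+", "II5#"]
    print(guess_quality_encoding(example))
```

Scanning more reads makes the guess more reliable, since a low-quality base is then more likely to appear.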

Then, the sequences of any detected contaminants (primers, adaptors, etc.) are removed using cutadapt. cutadapt also removes low-quality sequences and sequences that are shorter than a minimum length.
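The --quality-cutoff option refers to the BWA trimming algorithm. The Python sketch below shows that style of 3'-end quality trimming as it is commonly described: walk backwards from the end of the read, accumulate (cutoff - quality), and cut where the running sum is largest. It is an illustration only, not the code cutadapt runs. The function names bwa_trim_index and trim_read are hypothetical, phred33 encoding is assumed by default, and the defaults of 20 and 18 are taken from the options listed under Usage.

```python
def bwa_trim_index(quality, cutoff=20, offset=33):
    """Return the position at which to cut the 3' end of a read."""
    running = 0
    best = 0
    cut_at = len(quality)  # by default, keep the whole read
    for i in range(len(quality) - 1, -1, -1):
        running += cutoff - (ord(quality[i]) - offset)
        if running < 0:
            break  # the remaining bases are good enough, stop scanning
        if running > best:
            best = running
            cut_at = i
    return cut_at


def trim_read(sequence, quality, cutoff=20, offset=33, minimum_length=18):
    """Trim the 3' end; return None if the result is too short to keep."""
    cut_at = bwa_trim_index(quality, cutoff, offset)
    if cut_at < minimum_length:
        return None
    return sequence[:cut_at], quality[:cut_at]


if __name__ == "__main__":
    # 'I' is Q40 and '+' is Q10 in phred33, so the last two bases are trimmed.
    print(trim_read("ACGTA", "III++", minimum_length=3))
```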

To speed up the trimming process, cutadapt can also be run in parallel on small chunks of the original FASTQ file. The file splitting and merging are handled by our script. When choosing the number of threads to use, consider how many CPUs are available and how fast your hard drive can read and write data.
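The split, process, and merge idea can be sketched in Python as follows. This is a simplified in-memory illustration, not the logic of autoadapt.pl: the names split_into_chunks and process_chunks are hypothetical, four-line FASTQ records are assumed, and worker stands in for whatever trims one chunk (in practice, a call to an external trimmer on a chunk file).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, records_per_chunk):
    """Split FASTQ lines into chunks that never cut a 4-line record in half."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    step = 4 * records_per_chunk
    return [lines[i:i + step] for i in range(0, len(lines), step)]


def process_chunks(lines, records_per_chunk, worker, threads=1):
    """Run worker on every chunk and merge the results in the original order."""
    chunks = split_into_chunks(lines, records_per_chunk)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(worker, chunks))  # map preserves chunk order
    merged = []
    for chunk_result in results:
        merged.extend(chunk_result)
    return merged


if __name__ == "__main__":
    reads = ["@r1", "ACGT", "+", "IIII", "@r2", "GGCC", "+", "IIII"]
    # The identity worker returns each chunk unchanged.
    print(process_chunks(reads, 1, lambda chunk: chunk, threads=2) == reads)
```

Merging in the original chunk order keeps the output reads in the same order as the input, which matters for paired files where the two outputs must stay in step.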

The exact details of how we run FastQC and cutadapt are printed to the console during execution. For further explanation of what each FastQC or cutadapt program argument means, please see their respective documentation:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

http://code.google.com/p/cutadapt/
