
adapterremoval's Introduction

AdapterRemoval

AdapterRemoval searches for and removes adapter sequences from High-Throughput Sequencing (HTS) data and (optionally) trims low quality bases from the 3' end of reads following adapter removal. AdapterRemoval can analyze both single end and paired end data, and can be used to merge overlapping paired-ended reads into (longer) consensus sequences. Additionally, AdapterRemoval can construct a consensus adapter sequence for paired-ended reads when this information is not available.

For questions, bug reports, and/or suggestions, please use the GitHub tracker.

If you use AdapterRemoval v2, then please cite the paper:

Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid adapter
trimming, identification, and read merging. BMC Research Notes, 9:88
http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2

AdapterRemoval was originally published in Lindgreen 2012:

Lindgreen (2012): AdapterRemoval: Easy Cleaning of Next Generation
Sequencing Reads, BMC Research Notes, 5:337
http://www.biomedcentral.com/1756-0500/5/337/

Overview of major features

  • Trimming of adapter sequences from single-end and paired-end FASTQ reads.
  • Trimming of multiple, different adapters or adapter pairs.
  • Demultiplexing of single or double indexed reads, with or without trimming of adapter sequences.
  • Reconstruction of adapter sequences from paired-end reads, by the pairwise alignment of reads in the absence of a known adapter sequence.
  • Merging of overlapping read-pairs into higher-quality consensus sequences.
  • Multi-threading of all operations for increased throughput.
  • Reading and writing of gzip and bzip2 compressed files.
  • Reading and writing of interleaved FASTQ files.

Documentation

For a detailed description of program installation and usage, please refer to the online documentation. A summary of command-line options may also be found in the manual page, accessible via the command "man AdapterRemoval" once AdapterRemoval has been installed.

Installation

Installation with Conda

If you have Conda installed on your system:

conda install -c bioconda adapterremoval

Installing from sources

Installing AdapterRemoval from sources requires libz and libbz2.

To compile AdapterRemoval, download the latest release, unpack the archive and then simply run "make" in the resulting folder:

wget -O adapterremoval-2.3.1.tar.gz https://github.com/MikkelSchubert/adapterremoval/archive/v2.3.1.tar.gz
tar xvzf adapterremoval-2.3.1.tar.gz
cd adapterremoval-2.3.1
make

The resulting 'AdapterRemoval' executable is located in the 'build' subdirectory and may be installed by running "make install":

sudo make install

Getting started

To run AdapterRemoval, specify the location of the mate 1 and (optionally) mate 2 FASTQ files using the --file1 and --file2 command-line options:

AdapterRemoval --file1 myreads_1.fastq.gz --file2 myreads_2.fastq.gz

By default, AdapterRemoval will save the trimmed reads in the current working directory, using filenames starting with 'your_output'.

More examples of common usage may be found in the Examples section of the online documentation.


adapterremoval's Issues

Possible to uninstall/reinstall?

Hello,

I am running Adapter Removal 2.2.2 through Qiime 1.9 on a Virtual Box setup (running most recent version VB 6.0)

Recently, when running AdapterRemoval at the beginning of my workflow, I get blank output_paired.collapsed.truncated files (sometimes blank output_paired.collapsed files as well). Previously, I was getting populated files. My raw data files have information in them, so I'm struggling to figure out where my issue is that would lead to blank files.

I'm wondering if there is a way to uninstall AdapterRemoval and reinstall to be sure I have all the necessary components?

Thank you for your help!

Sarah

Can you remove large number of probe sequences from just one end?

I am working with a targeted resequencing approach that uses probes to capture targets. These probe sequences range from ~20-40bp and are located after the RD1 sequencing primer from Illumina. The list can be very large (e.g. 15K), but can be provided in a file. The adapter on the other end is universal. Can AdapterRemoval go through a list of paired adapters, where one is each probe and the second is constant, in reasonable time?

Thanks!

Internal barcodes

Hi @MikkelSchubert !

hope you're doing good - we're currently discussing a bit on how / whether AR2 is able to remove internal barcodes - is this currently supported ( I think not ?) and would it be something that could be added in a new release at some point?

x-ref / issue where we started discussing a little: nf-core/eager#632

Several spelling errors

Hi,
when packaging the recent version for Debian our policy checker spotted some spelling errors. Find a patch here to fix these typos.
Hope this helps, Andreas.

remove adapters sequences from BGI/MGI platform

Hi Mikkel,

Can AdapterRemoval identify and remove adapters in single/paired-end data from the BGI/MGI platform (T7 et al.)? Or should I feed the adapter information to it?

MGI sequencing adapter left = "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"
MGI sequencing adapter right = "AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"

Or could these 2 (or maybe all listed) adapter sequences be added to the set of built-in adapters?

Thanks
Bin

collapse with SE reads

Not sure I understand what the collapse option does with SE reads.

The manpage says that collapse, in SE mode, attempts [...] to identify templates for which the entire sequence is available,
and on the collapsed output it says:
[...] reads for which the adapter was identified by a minimum overlap, indicating that the entire template molecule is present.

But what do templates refer to? Is it reads that contain adapters?
If so, has the adapter been trimmed? And why are these reads not included in the normal output?
If not, what are these SE reads that are "collapsed"?

stdout and stderr reversed?

I have piped the output of AdapterRemoval to two separate log files for stdout and stderr using the > and 2> operators and noticed that the stdout output is found in stderr.

So running, for example, AdapterRemoval --file1 myreads_1.fastq.gz --file2 myreads_2.fastq.gz 1> stdout.log 2> stderr.log will result in an empty stdout.log and the stderr.log file containing the report on which files it reads and how long it took (which I believe should be in stdout).

combine with other QC tools

Hi Mikkel,

Do I need to run other QC tools (Trimmomatic, fastp, et al.) before or after AdapterRemoval if I also want to trim poor-quality reads and bases at both ends?

Best
Bin

Test suite fails in i386 architecture with strange replacement of "(" by "'"

Hi,
there is a Debian bug report about test failures which, strangely, happen only on i386, and the error itself is pretty strange:

tests/unit/alignment_test.cpp:841: FAILED:
  REQUIRE( collapsed_result == collapsed_expected )
with expansion:
  '@Rec1\nGCATGATATATACAAC\n+\n012345'FBcEFGHIJ\n'
  ==
  '@Rec1\nGCATGATATATACAAC\n+\n012345(FBcEFGHIJ\n'

(just look at the separator character after 012345 - the difference is not easy to spot). Out of the many Debian architectures on which the package otherwise builds nicely, this occurs only on i386 (I cannot exclude that the specific autobuilder might have different language settings or the like). The full build log is available as well.
Kind regards, Andreas.

Enable multiple input file for forward/reverse

It would be great if support for enabling multiple forward/reverse files could be added, e.g. instead of
--file1 fileX_R1 --file2 fileX_R2
One could have
--file1 fileX_R1 fileY_R1 fileZ_R1 --file2 fileX_R2 fileY_R2 fileZ_R2
That would be great in various types of analysis procedures and shouldn't require much work to implement, I guess. I already e-mailed you about that; I just wanted to have something to follow up on / make this trackable for you as well.

Thanks!

AdapterRemoval v3 Feedback

Hi @MikkelSchubert I decided to make a dedicated issue for general feedback of AdapterRemoval v3 testing, as I may find other points to discuss:

Version v3.0.0pre 344591c

  • Leaving in single reads with Ns

        --combined-output  
            If set, all reads are written to the same file(s), specified by
            --output1 and --output2 (--output1 only if --interleaved-output is
            not set). Discarded reads are replaced with a single 'N' with Phred
            score 0 [default: off].
    

    While I used to do this, @ashildv was recently informed by the ENA that including 'discarded reads' with a single 'N' will not be accepted by their pipeline (it breaks, and the data gets rejected). Maybe it would be worth having e.g. 5 Ns or something (or removing them entirely)?
    <- I realise I could just use the custom output instead and make sure discarded reads go in a separate file

  • --singleton flag: would it make sense for consistency to have --outputsingleton as the other output flags (1,2,merged) start with --output?

  • --settings FILE: could maybe be renamed, as the bulk of the contents of the JSON is stats rather than the settings itself

  • json output:

    • it would be nice for this to also include the physical number of entries that are in the resulting output files when also merged (as a separate value), sort of equivalent to retained reads in v2.3.2. Currently the JSON only reports the number of output (passed) reads as it would be if everything was unmerged. So something like in addition to the passed, discarded and unidentified sections of the output JSON, having something like in_files or output_file would be nice to have as it helps match the expectation of a (unfamiliar) end-user between the file itself and the JSON report. However I recognise that this could be complicated given the very flexible output system now.
    • It would be nice to have some documentation for what each value means. I've tried playing around but I still can't work out how the various read entries in the JSON relate to each other and to what ends up in the final output FASTQ files (see the small record-counting sketch below).

Initial tests completed; most of the above are more quality-of-life issues, otherwise everything is working as expected 👍
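
As a stop-gap while the JSON report evolves, one way to match the report against the files themselves is simply to count the records in each output FASTQ. The sketch below is a minimal, hypothetical helper (not part of AdapterRemoval) that assumes standard four-line FASTQ records and optional gzip compression; the filename in the example is made up.

import gzip

def count_fastq_records(path):
    """Count records in a (possibly gzipped) FASTQ file, assuming the
    standard four-line-per-record layout with no wrapped sequences."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        lines = sum(1 for _ in handle)
    if lines % 4:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4

# Hypothetical example; substitute one of your actual output files:
# print(count_fastq_records("output.collapsed.gz"))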

AdapterRemoval doesn't work with paired end reads >1GB

Hi. I'm working on aDNA and I have launched AdapterRemoval on raw reads (paired end reads).

This is the type of pipeline I used:
nohup directory/adapterremoval-2.2.4/build/AdapterRemoval --threads 8 --identify-adapters --file1 namefile1_r1.txt.gz --file2 namefile2_r2.txt.gz &

It does work with files <1GB, but I noticed that it stops running with files >1GB without any error messages. Does anyone know why? Is it really a problem related to file size? It seems weird to me.

Thanks

Adapter Removal running error

I have 5 fastq files. For four of the files, the program runs perfectly. But for the last one, it gives the error message shown below:

Trimming single ended reads ...
Opening FASTQ file '/hpcfs/users/a1763076/project_ovary/scripts/cromwell-executions/RNApipeline/ac610329-d5a5-48e1-998f-489ce102fc96/call-trimming/inputs/1555881210/SRR8925551.fastq.gz'
Processed 31,028,568 reads in 13:29.5s; 38,000 reads per second ...
Error reading FASTQ record at line 32063091; aborting:
partial FASTQ record; cut off after sequence
Aborting thread due to error.
ERROR: AdapterRemoval did not run to completion;
do NOT make use of resulting trimmed reads!

I think the program runs well at first but hits some problem in the middle.

I am wondering what kind of problem would cause this error.

Thanks and I am looking forward to hearing from you.

homebrew/science was deprecated

Hey, it seems that homebrew/science was deprecated and all its contents either deleted or migrated. brew search adapterremoval returns no results. Are there any plans to get AdapterRemoval back on Homebrew?

Support a single output file

Hi Mikkel,

Thanks for an awesome tool. One small issue we have at the moment is being able to output all reads that pass filtering to a single file (or even STDOUT).

How difficult would it be to implement an option --output-everything that accepts a single filename where reads shall be written? This file would be interleaved, and maintain correct pairing with filtered reads using the "single N" convention that many QC tools use. This entails replacing empty (due to collapsing, minimum length filtering or trimming) read sequences with a single 'N', with minimum quality score (example below).

@read1/1 PASSED
ACGTACTACTG
+
IIIIIIIIIII
@read1/2 PASSED
GATACTACTGT
+
IIIIIIIIIII
@read2/1 PASSED
CTGACGTACTA
+
IIIIIIIIIII
@read2/2 FAILED
N
+
#

Cheers,
Kevin

Cannot demultiplex large number of barcode combinations

Hi Mikkel,

Thanks for the nice project!
I have almost 5000 barcode combinations (a system with fwd and rev barcodes). AdapterRemoval is unfortunately not able to handle such a high number of files open in parallel.
Could this be improved?

Greetings,
Martin

Input file is overwritten and cut off

Hi @MikkelSchubert !

I am looking for a sensible way to separate the adapter clipping functionality of AR from the collapsing functionality, and have run into an odd behaviour.

I am using some public data from the ENA, downloadable here: https://www.ebi.ac.uk/ena/browser/view/PRJEB30331
The md5sums match those of the ENA. I am using version 2.3.2 off bioconda.

I started out by removing the adapters from the fastqs without any filtering or trimming.

AdapterRemoval --file1 ../ERR3003613_1.fastq.gz --file2 ../ERR3003613_2.fastq.gz --basename CS01.pe  \
--adapter1 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' \
--adapter2 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA' --minadapteroverlap 1 

The resulting files look fine.

$ wc -l CS01.pe.pair*
      37892680 CS01.pe.pair1.truncated
      37892680 CS01.pe.pair2.truncated

I then try to collapse, trim and filter the adapter clipped files:

$ AdapterRemoval --file1 CS01.pe.pair1.truncated --file2 CS01.pe.pair2.truncated --basename CS01.pe \
--qualitymax 41 --trimns --trimqualities --minlength 30 --minquality 20 --collapse

Trimming paired end reads ...
Opening FASTQ file 'CS01.pe.pair1.truncated', line numbers start at 1
Opening FASTQ file 'CS01.pe.pair2.truncated', line numbers start at 1
Error reading FASTQ record at line 24661; aborting:
    partial FASTQ record; cut off after sequence
Aborting thread due to error.
ERROR: AdapterRemoval did not run to completion;
       do NOT make use of resulting trimmed reads!

I then checked the input files again:

$ wc -l CS01.pe.pair*
       600 CS01.pe.pair1.truncated
       600 CS01.pe.pair2.truncated

After multiple tries, it seems that the line at which the error is thrown changes, but it is always 600 lines that remain in the input files.
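
For what it's worth, the symptoms above appear consistent with the second run writing its default outputs (which, with --basename CS01.pe, are named CS01.pe.pair1.truncated etc., as the first run shows) over its own input files. Below is a minimal, hypothetical pre-flight check along these lines (not an AdapterRemoval feature; the suffix list is only illustrative) that would flag the collision before anything is truncated.

def planned_outputs(basename):
    """Output names a run would produce under the default v2 naming
    scheme seen earlier in this report (illustrative, not exhaustive)."""
    return {basename + suffix for suffix in (".pair1.truncated", ".pair2.truncated")}

def overwritten_inputs(basename, input_files):
    """Return the input files that the planned outputs would clobber."""
    return planned_outputs(basename) & set(input_files)

# The second command reuses --basename CS01.pe while reading the outputs
# of the first command, so the sets overlap:
print(sorted(overwritten_inputs("CS01.pe",
                                ["CS01.pe.pair1.truncated",
                                 "CS01.pe.pair2.truncated"])))
# ['CS01.pe.pair1.truncated', 'CS01.pe.pair2.truncated']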

Fatal error running v2

I ran the following:

$ AdapterRemoval --file1 SNJW007_S2_L002_R1_001.fastq --file2 SNJW007_S2_L002_R2_001.fastq --gzip --basename AR_demux --barcode-list CCS6_barcodes_forAR.txt

And got this error:

Read 15 barcodes / barcode pairs from 'CCS6_barcodes_forAR.txt'...
Trimming paired end reads ...

FATAL ERROR:
Debug assertion failed in 'src/demultiplex.cc', line 105: child != -1

This should not happen! Please file a bug-report at
https://github.com/MikkelSchubert/adapterremoval/issues/new
Abort trap: 6

I had some trouble adding C++11, could that be the issue?

--identify-adapters doesn't respect --mate-separator

It seems that '--identify-adapters' doesn't respect --mate-separator the same way the remove code does (it might even ignore it altogether)

[nash]$ /opt/adapterremoval/AdapterRemoval --basename test --file1 r1.100.fastq --file2 r2.100.fastq --mate-separator .
Trimming paired end reads ...
Opening FASTQ file 'r1.100.fastq'
Opening FASTQ file 'r2.100.fastq'
Processed a total of 50 reads in 0.0s; 58,000 reads per second on average ...
[nash]$ /opt/adapterremoval/AdapterRemoval --identify-adapters --file1 r1.100.fastq --file2 r2.100.fastq --mate-separator .
Attempting to identify adapter sequences ...
Opening FASTQ file 'r1.100.fastq'
Opening FASTQ file 'r2.100.fastq'
ERROR: Unhandled exception in thread:
Pair contains reads with mismatching names:
- 'SRR5275456.1.1'
- 'SRR5275456.1.2'

Note that AdapterRemoval by default determines the mate numbers as the digit found at the end of the read name, if this is preceded by the character '/'; if your data uses a different character to separate the mate number from the read name, then you will need to set the --mate-separator command-line option to the appropriate character.

Attached are the 2 files in the example above.

r1.100.fastq.gz
r2.100.fastq.gz

on documentation of trimwindows

Dear Mikkel,
It would be very helpful if the information below from the documentation were included, at least in brief, for --trimwindows in the help itself. In the existing help, the part regarding the 5' end is not mentioned. The rule that the "average quality and the quality of the first base in the window is greater than --minquality" will have a profound effect, especially in cases where --trimwindows is set to 0 to obtain read-length-average-based quality filtering. (A small sketch of these rules follows the quoted excerpt below.)

(https://adapterremoval.readthedocs.io/en/latest/misc.html#window-based-quality-trimming)
Reads are trimmed as follows for a given window size:

    The new 5’ is determined by locating the first window where both the average quality and the quality of the first base in the window is greater than --minquality.
    The new 3’ is located by sliding the first window right, until the average quality becomes less than or equal to --minquality. The new 3’ is placed at the last base in that window where the quality is greater than or equal to --minquality.
    **If no 5’ position could be determined, the read is discarded.**
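
To make the quoted rules concrete, here is a minimal Python sketch of the described window-based trimming, written directly from the three rules above; it is an illustration of the documented behaviour, not the actual AdapterRemoval implementation, and edge cases (e.g. fractional or zero --trimwindows values) are not handled.

def window_trim(qualities, window_size, minquality):
    """Return (start, end) slice indices for the retained part of a read,
    or None if the read should be discarded, following the quoted rules."""
    n = len(qualities)
    if window_size < 1 or window_size > n:
        return None

    def average(i):
        window = qualities[i:i + window_size]
        return sum(window) / len(window)

    # Rule 1: the new 5' is the first window where both the average quality
    # and the quality of the first base exceed --minquality.
    start = None
    for i in range(n - window_size + 1):
        if average(i) > minquality and qualities[i] > minquality:
            start = i
            break
    if start is None:
        return None  # Rule 3: no 5' position found, so the read is discarded.

    # Rule 2: slide the window right until its average drops to <= --minquality,
    # then place the new 3' at the last base in that window with quality
    # >= --minquality.
    end = n
    for i in range(start, n - window_size + 1):
        if average(i) <= minquality:
            window = qualities[i:i + window_size]
            good = [j for j, q in enumerate(window) if q >= minquality]
            end = i + good[-1] + 1 if good else i
            break
    return (start, end)

# Example: Phred scores for a short read, window of 3, --minquality 20.
print(window_trim([2, 30, 35, 36, 30, 5, 3, 2], 3, 20))  # -> (1, 5)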

identify-adapters error

Hi,
I used AdapterRemoval (ver. 2.1.7) to identify adapter sequences from paired-ended reads, but it returns an error message.
AdapterRemoval --identify-adapters --file1 file_R1.fastq --file2 file_R2.fastq
Attemping to identify adapter sequences ...
Error reading FASTQ record at line 1; aborting:
invalid character in FASTQ sequence; only A, C, G, T and N are expected!
My fastq files have been used with other programs without problems (like FastQC or bowtie2). The quality score encoding is Sanger / Illumina 1.9.
I attach here the first ten sequences from each fastq file. Could you tell me what the problem is?
Thank you.

file_R2.zip
file_R1.zip

Add option for singular 'combined' output FASTQ files

Currently, AdapterRemoval2 offers many different output 'streams' for each way a set of input files can be processed. In particular, quality trimmed vs non-quality trimmed get their own output flags.

However, this makes it difficult for downstream pipeline developers to define exactly which output files should be expected/used for subsequent analysis. I.e. there are so many useful options in AR2, but each combination produces different combinations of output files, which makes it hard to work out which to use (and leads to lots of code duplication in pipeline processes, e.g. in nf-core/eager, 9 separate command statements rather than just using dynamic variable input: https://github.com/nf-core/eager/blob/de38b07149d3dabdfa38b0014c4126b2fe17ca12/main.nf#L855-L971).

It would be helpful to have an option that produces a 'single' FASTQ file with all valid output (i.e. not discarded), based on the parameters set by the user.

For example:

  • if collapsing is turned on, produce a single FASTQ file that contains:
    all collapsed+untrimmed, collapsed+trimmed + singleton reads.
  • If single end data, both trimmed and untrimmed singletons.

(etc.)

One addition: this would greatly simplify a lot of the manual processing that currently has to be done by pipelines/users.

installation Adapterremoval with Conda

Hi,
I am trying to install adapterremoval with Miniconda and I am getting this error, and I am not able to run it properly.

(C:\Users\marco507\Miniconda3) C:\Users\marco507>conda install -c bioconda adapterremoval
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  • adapterremoval

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

Do you know what is the problem?
Thank you!
Maria//

Support JSON log output

To carry over from: MultiQC/MultiQC#1005, just wanted to have this open to make the conversation 'cleaned up' in terms of requests.

As mentioned on the PR, @MikkelSchubert had stated an interest in improving the output log (.settings), and also making a JSON version of the log (with better documentation).

A few requests from @apeltzer and I in how we use AR2 in nf-core/eager, what would be great would be:

  1. A JSON format as close to the JSON accepted by MultiQC as possible (or easily modifiable to make it work), with accompanying documentation.
  2. An entry for the % of input reads that have had adapters removed
  • This is a useful pre-screening indicator of whether your DNA fragments are short, as expected for aDNA, meaning most reads have adapters.
  3. An entry for the % of read pairs that were collapsed
  • This is another useful pre-screening indicator, as aDNA reads are so short that you expect them to merge.

Note for 2/3 both are only meant to be rough indicators, so they don't have to be 100% exact - given some of the complexity in the way AR2 works (see MultiQC thread).

No sudo permission

I am very interested in the functionality of AdapterRemoval, but I don't have sudo permission, so it's hard to install this software. Is there any plan to offer pip or conda installation?

Combining --identify-adapters with trimming

Hi,
I've really enjoyed using your tool, very well written and well documented. I was wondering if there is any way, or any plan to add a way, to combine --identify-adapters with the adapters used in trimming. I'm working on a large dataset (>3,000 samples) which spans several years and different labs, and there are several different sets of adapters used. I've written a wrapper that runs --identify-adapters first and then uses the identified adapters, if they differ from the defaults, when running the trimming step. I didn't see an option for this so I wanted to check.

Thanks!
Nick

Bioconda package is there

Hi!
just wanted to mention that there is a bioconda package available now for your tool (I sent a pull request to bioconda-recipes):

conda install -c bioconda adapterremoval

Similar to that, there will be a corresponding Docker image automatically available via the BioContainers project. You could mention this in the README, just so people can use it :-)

Cheers,
Alex

Incorrect trimming of adapters when alignment contains gaps

All alignments were processed with the following parameters:

AdapterRemoval --file1 test.fastq --basename out.fastq --gzip --threads 4 --trimns --trimqualities --preserve5p --adapter1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --adapter2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA --minlength 30 --minquality 20 --minadapteroverlap 1

The reads afterwards are taken from the discards file and appear to be shorter than they should be, by the gap length. It is unknown whether the gaps in the alignment of the adapter make discarding these reads beneficial, but perhaps that should be an explicit option and choice for the user. The following alignments illustrate the problem:

One:

@K00233:58:HJGVTBBXX:5:2121:29680:48526 1:N:0:ATGGTATA+CATATTGA
CGGCGCCTCGGGCTTCGGCAACTTCGGCGGAGATCGGAAGAGACACGTCTGAACTCCAGTCACATGGTATATCTCG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJ

CGGCGCCTCGGGCTTCGGCAACTTCGGCGGAGATCGGAAGAG-ACACGTCTGAACTCCAGTCACATGGTATATCTCG
                              ||||||||||||||||||||||||||||||||||
------------------------------AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC-------------

@K00233:58:HJGVTBBXX:5:2121:29680:48526 1:N:0:ATGGTATA+CATATTGA
CGGCGCCTCGGGCTTCGGCAACTTCGGCG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJ

Two:

@K00233:58:HJGVTBBXX:6:1125:23581:11548 1:N:0:ATGGTATA+CATATTGA
GGCCTACCGCGCGCCGTTTCCGACGTTCAGCAGATCGGAAGCACACGTCTGAACTCCAGTCACATGGTATATCTCG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-FJJJ

GGCCTACCGCGCGCCGTTTCCGACGTTCAGCAGATCGGAAG--CACACGTCTGAACTCCAGTCACATGGTATATCTCG
                               ||||||||||||||||||||||||||||||||||
-------------------------------AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC-------------

@K00233:58:HJGVTBBXX:6:1125:23581:11548 1:N:0:ATGGTATA+CATATTGA
GGCCTACCGCGCGCCGTTTCCGACGTTCA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJ

Three:

@K00233:58:HJGVTBBXX:5:2109:30036:5112 1:N:0:ATGGTATA+CATATTGA
GCGCCAGGGCCGGTGGTGGGCTCAGTGGTGAGATCGGAAGACACGTCTGAACTCCAGTCACATGGTATATCTCGTA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

GCGCCAGGGCCGGTGGTGGGCTCAGTGGTGAGATCGGAAGA---CACGTCTGAACTCCAGTCACATGGTATATCTCGTA
                              ||||||||||||||||||||||||||||||||||
------------------------------AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC---------------

@K00233:58:HJGVTBBXX:5:2109:30036:5112 1:N:0:ATGGTATA+CATATTGA
GCGCCAGGGCCGGTGGTGGGCTCAGTG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJ

In all cases the reads are discarded as the length of the gap is also trimmed from the read.

What does Length 0 mean in the length distribution field of the settings file report?

Hello,

I would like to understand this minor detail of the report:

[Length distribution]
Length Mate1 Discarded All
0 0 92 92
1 0 0 0

I've just pasted 4 lines of the report. My sequencing read length was 50. To me a scale from 1 to 50 would make sense, but what are reads of length 0?

I couldn't find this answer searching the web, so I think you could make this clear, please. Apologies if it is a very stupid question.

Best,

Victor

How to demultiplex paired-end reads in mixed orientation?

Dear Dr. Schubert,

this tool has helped me a lot, thanks a lot.

For my protocol, the barcode location depends on the orientation in which the DNA was sequenced.

for example, the read layout could be either this

r1:barcode1-forwardprimer-sequence; r2: barcode2-reverseprimer-sequence

or this

r1: barcode2-reverseprimer-sequence; r2: barcode1-forwardprimer-sequence.

So how can I demultiplex paired-end reads in mixed orientation in a single run?

If there's no such function, could you consider adding it in a later version?

wish everything goes well with you. :)

Benchmarking takes ages to generate files

Hi,

I'm a member of the Debian Med team, which is a group inside Debian that packages free software in life sciences and medicine for official Debian. I'd like to inform you that I have packaged adapterremoval and it is now available in Debian unstable. I also tried to experiment with the benchmarking, which is really interesting since we could on the one hand test and on the other hand compare several of the tools we have packaged (not all tools in the suite are packaged yet, but a fair amount, which makes a comparison interesting).

I noticed that you are using pIRS for creating artificial sequences. When taking a packaging-centric look into pIRS, I noticed that some files are copyrighted by Illumina under a non-free / non-distributable license (see debian/copyright). I might try to contact the pIRS authors or even Illumina about a clarification, but I doubt that the chances are very good. Maybe it's easier to replace the sequence creation program altogether. For instance, in Debian we have packaged art-nextgen-simulation-tools and artfastgenerator. I personally cannot speak to their quality and how these compare to pIRS - I just know that there might be potential replacements for pIRS (maybe even more?).

Another issue I have with the benchmarking suite is that it takes ages to finish the generation process, filling up the hard disk with gigabytes of files. I admit I have interrupted the process several times - even after trying

--- a/benchmark/benchmark.sh
+++ b/benchmark/benchmark.sh
@@ -44,7 +44,7 @@ ADAPTER_ID_INSERT_SIZES=($(seq 250 5 350
 MAX_THREADS=4

 # Number of read (pairs) to simulate using pIRS for each replicate
-SIMULATED_NREADS=1000000
+SIMULATED_NREADS=100

 results=$(mktemp -d /tmp/adapter-benchmark.XXXXXX)
 echo "*** Results will be found in $results ***"
to reduce the size.

Maybe that's due to the fact that I used the unpatched pIRS. Could you please clarify the reason for the pIRS patch? I could add it to my (unofficial / non-distributable) packaging of pIRS to enable sensible benchmarking.

Kind regards

    Andreas.

AdapterRemoval returning junk reads in paired1 and 2.truncated files

Hi,

I am having an issue with AdapterRemoval returning junk sequences in the paired1 and paired2.truncated files.

AdapterRemoval v2.1.7.
code used -

AdapterRemoval --threads 10 --file1 <(zcat SI325271_S1_R1.fastq.gz) --file2 <(zcat SI325271_S1_R2.fastq.gz) --minalignmentlength 10 --trimns --trimqualities --minquality 30 --collapse --minlength 25 --basename SI325271.30 --adapter1 AGATCGGAAGAGCACACGTGTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG --adapter2 AGATCGGAAGAGCGTCGGTAGGGAAAGAGTGTTCTCTAGGGTGTAGATCTCGGTGGTCGCCGGATCATT

The junk reads are basically 150-bp reads containing long stretches of As or Ts. They are present in the output.paired1.truncated and output.paired2.truncated files, but do not make it past merging, i.e. they are not in output.collapsed or output.collapsed.truncated. We see them regardless of whether we use the --minquality 20 or 30 thresholds.

Is there any insight you can provide concerning this issue?

Missing include statement

Hi,
please consider applying the following patch which was needed to build version 2.2.1 under Debian when using gcc 6.
Kind regards, Andreas.

Mate mismatch on multi-file mode

Hi, I ran into an issue using Adapterremoval on a dataset with a very weird situation:

  • AdapterRemoval was run in multi-file mode.
    I was led to believe that the paired-end mode would automatically assume that --file1 and --file2 were forward and reverse reads, like it did in all other cases, but it seems to be choking.
    What would cause something like this?

the data in question is a part of the NCBI SRA:
project: PRJNA389280
SRR5947944_1.fastq
SRR5947944_2.fastq

The read IDs are the same for each file, so the only way to check for forward-or-reverse was the filename, which has been fine up to now for all other data, except with this set.

version number not up-to-date

Hi,
Just a tiny issue here, but the version number in the source code (2.3.0) is out of sync with the version number in the github release and the changelog (2.3.1).
best regards,
Robin

Do not cut the base at the read 5' head

when use "-trimqualities" , adapterremoval will remove the base at 5' head of reads.
This will affect the "MarkDuplicates" step in the GATK4, and I need to align reads by UMI, now have to use trimmoticate to trim low quality base.

Does anyway donot trim base at 5' head , when use "-trimqualities" ?

Thanks, adapterremoval realy usefule.

Invalid fastq files produced

Hi,
I used AdapterRemoval recently and found out that sometimes it produced invalid fastq files like this:

@ST-E00114:418:HHKV7CCXX:8:1109:26484:40020 2:N:0:GCCAAT
TTCAAGAGGTCTTAAGGAGGCTAACTCATATATAGGTTAGGTCTAGATCTAGAGGATGAGAGCATGGGAGACATGGATGTACATAGACTCGATAAGTACTT
+
AA<F<FAFAA<,FFKKK<FFFKKKKKKKKKKKFKKKAFFF79:40020 2:N:0:GCCAAT

As shown, the quality line is truncated and merged with a segment of the header line. There are four copies of those lines which may be due to the 4 threads that I specified to the program.

This was flagged by bowtie2, which complained about seeing "a space" in the quality line.
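
For anyone hitting the same thing, here is a minimal sketch (assuming standard, unwrapped four-line FASTQ records; not part of AdapterRemoval) for locating a record that has been mangled like the one above; the filename in the usage line is hypothetical.

def check_fastq(path):
    """Report the first obviously malformed record: a non-'@' header, a
    non-'+' separator, or a quality string whose length differs from the
    sequence, as in the truncated record shown above."""
    with open(path) as handle:
        lines = [line.rstrip("\n") for line in handle]
    if len(lines) % 4:
        return f"truncated file: {len(lines)} lines is not a multiple of 4"
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+") or len(seq) != len(qual):
            return f"malformed record starting at line {i + 1}"
    return "no obviously malformed records found"

print(check_fastq("trimmed_output.fastq"))  # hypothetical filename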

Trimming reads on 3' and 5' ends

Hi Mikkel!

do you think it would be possible to add an option to trim [n] bases off after clipping adapters from the reads? I mean, in the case of DNA damage, lots of people are doing this manually (removing a single base from the read ends for example, or clipping off internal barcodes).

It shouldn't be too difficult to add this or what do you think?

How to construct the adapter list/specify adapters

Hi,

I need to remove adapters from paired-end samples that were prepared with "TruSeq DNA PCR-Free HT" (it can be seen at page 15 here: https://dnatech.genomecenter.ucdavis.edu/wp-content/uploads/2013/06/illumina-adapter-sequences_1000000002694-00.pdf).

If I understand correctly, the first adapter (--adapter1 / first column in the file) is the one I expect to find in one of the files, and the second adapter (--adapter2 / second column in the file) in the other. However, I have no way to know which is which without consulting the provider?

Also, do I need to provide the adapters as-is (as defined in the documentation) or to reverse complement one of them? (I couldn't really tell what was fixed by issue #31.)

Thanks,
Yoni

Adapter list is interpreted differently compared to command-line input

Hi,

It seems that AdapterRemoval interprets adapter sequences provided via the command line and via a file differently. In the first case, adapter 1 and adapter 2 are parsed as the same sequences the user provides. In the second case, with the --adapter-list option, adapter 2 is parsed as its reverse complement. Is this expected behaviour? From a user perspective I would say that it is not.

Karolis
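
To make the difference concrete, this is what "parsed as its reverse complement" means for an adapter string. The snippet below is a minimal, standalone sketch (plain Python, unrelated to how AdapterRemoval itself parses the list), using an adapter sequence that appears in other issues above.

def reverse_complement(sequence):
    """Reverse-complement a DNA sequence (A<->T, C<->G; N maps to N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(sequence.upper()))

adapter2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA"
print(reverse_complement(adapter2))  # TACACTCTTTCCCTACACGACGCTCTTCCGATCT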

How to interpret the outputs of the multiple inputs after trimming?

Hello AdapterRemoval,
Thanks for developing this amazing package!

I have 4 fastq files. WT_1 and WT_2 are paired-end sequencing files from sample WT. KO_1 and KO_2 are paired-end sequencing files from sample KO.
WT_1.fastq.gz
WT_2.fastq.gz
KO_1.fastq.gz
KO_2.fastq.gz

I ran AdapterRemoval --file1 WT_1.fastq.gz KO_1.fastq.gz --file2 WT_2.fastq.gz KO_2.fastq.gz --adapter1 CTGTCTCTTATACACATCT --adapter2 CTGTCTCTTATACACATCT --thread 16, and only get 2 files: your_output.pair1.truncated and your_output.pair2.truncated. Is your_output.pair1.truncated a merged file that combines WT_1 and KO_1 after trimming? If so, it doesn't make any sense to combine mate1s of two different samples.

Shall I only input the fastq files from the same experiment group into this run? I mean, if I have
WT1_1.fastq.gz
WT1_2.fastq.gz
WT2_1.fastq.gz
WT2_2.fastq.gz
KO1_1.fastq.gz
KO1_2.fastq.gz
KO2_1.fastq.gz
KO2_2.fastq.gz
I should do trimming for WT and KO respectively, like AdapterRemoval --file1 WT1_1.fastq.gz WT2_1.fastq.gz --file2 WT1_2.fastq.gz WT2_2.fastq.gz --adapter1 CTGTCTCTTATACACATCT --adapter2 CTGTCTCTTATACACATCT --thread 16 first?

Could you please explain a little more about the meaning of pair1 and pair2 files when we input multiple inputs into this run?
Thanks!
Best
YJ

Is Phred encoding automatically detected?

After reading the docs I am still a bit confused as to whether you auto-detect Phred+33 or Phred+64?

If so, will it handle Q=42 (eg K) in Sanger Phred+33 files?
Many parsers fail when Q > 40 but it can exist in Moleculo and other data.

Will things still work if R1 is Phred+33 and R2 is Phred+64 ?
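
For reference, the arithmetic behind the two encodings, as a worked example independent of whatever auto-detection AdapterRemoval performs: a quality Q is stored as the character chr(Q + offset), with offset 33 for Phred+33 and 64 for Phred+64.

def encode(q, offset):
    """Encode a Phred quality score as a single character."""
    return chr(q + offset)

def decode(char, offset):
    """Decode a quality character back to a Phred score."""
    return ord(char) - offset

print(encode(42, 33))   # K  -> Q=42 is representable in Phred+33 (Sanger)
print(decode("K", 33))  # 42
print(decode("K", 64))  # 11 -> the same character read as Phred+64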

Adapters Not Being Trimmed (Apparently)?

Dear Mikkel,

I might have noticed what could be some sort of misbehaviour. I have the attached test files (T14_Test_{Pair}.fastq), and I have tried to process them using the following command (v2.3.1):

AdapterRemoval --file1 T14_Test_1.fastq --file2 T14_Test_2.fastq --trimns --trimqualities --collapse --minlength 5 --minquality 20 --maxns 30 --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCAC --adapter2 ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --basename Test_AR

My understanding is that the output should look something like this (Pair 1):

@HISEQ:247:C87NTANXX:7:1101:2220:2855 1:N:0:TAATGC
ATCGTTAATCGATTTTCCTCG
+
BBBBBFFFBFFFBFFBFFFFF
@HISEQ:247:C87NTANXX:7:1101:2220:2856 1:N:0:TAATGC
ATCGTTAATCGATTTTCCTCGTAATGCGCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACAA
+
BBBBBFFFBFFFBFFBFFFFFFFFFFFFFFFFFFFFFF/<FF//<BFFFFB<FBBBFF/FFFB/<FB

But that is not what I have been getting. On the contrary, all the reads survive almost untouched. Could you please let me know if I am missing something here? I am really sorry if I am (probably I am), but I have tried different options and I cannot see what I could do differently.

Please let me know should you need any extra information from my end.

Thanks in advance, George.
T14_Test_1-2.zip

--combined-output generates strange GC% distribution in fastqc reports

Dear Mikkel,

I have been making much use of AdapterRemoval over the years, but it is only recently that I picked up the --combined-output option, out of convenience (as most downstream analyses want just one R1 and one R2 file).

Something that has been unsettling me about this option is that whenever I create a FastQC report, the GC-content distribution tends to show an increase in low-GC-content reads (compared to the raw data). Since --combined-output generates a lot of reads with just one N, it wouldn't surprise me to see a peak at GC% = 0, but this is not the case. Instead, I suddenly see an above-expected number of reads with 0-20% GC content in the R1 file, and a very abrupt increase in 0-50% GC content in the R2 file. I witnessed this independently with both RNA-seq data and target capture data.

When I run the same samples without --combined-output, none of the output files display an increase in low GC content reads, which is why I'm suspecting this has something to do with the --combined-output option. Could you help me understand why I see such a pattern and let me know if this is or isn't something to worry about?

The Adapterremoval command I used was:
AdapterRemoval --file1 "${R1[i]}" --file2 "${R2[i]}" --output1 "${R1[i]%.fq}_trimmed.fq" --output2 "${R2[i]%.fq}_trimmed.fq" --trimns --trimqualities --minquality 20 --collapse --trimwindows 12 --minlength 36 --combined-output

This is the GC content of untrimmed files:
image

Here of the trimmed R1 file:
image

And here of the trimmed R2 file:
image

Additionally, I had another question regarding the --collapse function, namely how AdapterRemoval would deal with repetitive sequences, where there's a danger of collapsing fragments which are not actually overlapping?

Thanks a lot,
Clara

basic_ios::clear: iostream error

I have tried to run AdapterRemoval on two operating systems, and have received the following error on both:

Trimming paired end reads ...
Opening FASTQ file 'PCS_S21_L001_R1_001.fastq', line numbers start at 1
Opening FASTQ file 'PCS_S21_L001_R2_001.fastq', line numbers start at 1
ERROR: Unhandled exception in thread:
basic_ios::clear: iostream error
ERROR: AdapterRemoval did not run to completion;
do NOT make use of resulting trimmed reads!

The first OS was macOS Catalina v10.15.1, while the second was a server running CentOS Linux 7. I'm not sure how to handle the error?
