aseyboldt / fastq-rs

License: MIT License

Rust 100.00%

fastq-rs's Introduction

A fast parser for fastq.

This library can process fastq files at about the speed of the coreutils wc -l (about 2GB/s on my laptop, seqan runs at about 150MB/s). It also makes it easy to distribute the processing of fastq records to many cores, without losing much of the performance.

See the documentation for details and examples.

Benchmarks

We compare this library with the fastq parser in rust-bio, the C++ library seqan 2.2.0, with kseq.h and with wc -l.

We test 4 scenarios:

1. A 2GB test file is uncompressed on a ramdisk. The program counts the number of records in the file.
2. The test file is lz4 compressed on disk, with an empty page cache. Again, the program should just count the number of records.
3. The test file is lz4 compressed on disk with an empty page cache, but the program sends records to a different thread. This thread counts the number of records.
4. The same as scenario 3, but with gzip compression.

All measurements are taken with a 2GB test file (TODO describe!) on a Haswell i7-4510U @ 2GHz. Each program is executed three times (clearing the OS page cache where appropriate) and the best time is used. Libraries without native support for a compression algorithm get the input via a pipe from zcat or lz4 -d. The C and C++ programs are compiled with gcc 6.2.1 with the flags -O3 -march=native. All programs can be found in the examples directory of this repository.
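The record-counting task used in these scenarios can be sketched with the standard library alone. This is not the implementation in this crate (which scans a buffer with memchr() instead of reading line by line), and the function name `count_records` is made up for illustration. It assumes the common four-line record layout without wrapped sequence lines.

```rust
use std::io::{self, BufRead};

/// Count FASTQ records, assuming four lines per record
/// (header, sequence, separator, quality) and no line wrapping.
fn count_records<R: BufRead>(mut reader: R) -> io::Result<u64> {
    let mut line = Vec::new();
    let mut n_lines: u64 = 0;
    loop {
        line.clear();
        // read_until returns Ok(0) once the input is exhausted.
        if reader.read_until(b'\n', &mut line)? == 0 {
            break;
        }
        // Line 0 of each record is the header, line 2 the separator.
        let expected = match n_lines % 4 {
            0 => Some(b'@'),
            2 => Some(b'+'),
            _ => None,
        };
        if let Some(byte) = expected {
            if line.first() != Some(&byte) {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "malformed FASTQ record",
                ));
            }
        }
        n_lines += 1;
    }
    if n_lines % 4 != 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "truncated FASTQ record",
        ));
    }
    Ok(n_lines / 4)
}
```

Any `BufRead` works as input, for example a `BufReader` around a file or a byte slice in a test.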

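| Program | ramdisk | lz4 | lz4 + thread | gzip | gzip + thread |
|---|---|---|---|---|---|
| `wc -l` | 2.3GB/s | 1.2GB/s | NA | 300MB/s | NA |
| `fastq` | 1.9GB/s | 1.9GB/s | 1.6GB/s | 650MB/s | 620MB/s |
| `rust-bio` | 730MB/s | NA | 250MB/s | NA | NA |
| `seqan` | 150MB/s | NA | NA | NA | NA |
| `kseq.h` | 980MB/s | 680MB/s | NA | NA | NA |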

Some notes from checking perf record:

wc -l and fastq spend most of the time in memchr(), but in contrast to wc, fastq has to check that headers begin with @ and separator lines with +, and do some more bookkeeping. lz4 -d uses a large buffer size (default 4MB), which seems to prevent the operating system from running lz4 and wc concurrently when connected by a pipe. fastq avoids this problem with an internal queue.
rust-bio loses some time copying data and validating UTF-8. The large slowdown in the threaded version stems from the fact that it sends each record to the other thread individually. Each send (I use a sync_channel from the Rust stdlib) requires the use of synchronisation primitives, and three allocations for header, sequence and quality. Collecting records in a Vec and sending only after a large number of them is available speeds this up from 150MB/s to 250MB/s.
seqan is busy allocating stuff, and uses (I think) a naive implementation of memchr() to find line breaks.
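The batching idea from the rust-bio note (collect records in a Vec, send only full batches over a sync_channel) might look roughly like this. It is a sketch, not the benchmark program from the examples directory: the function name, the `Vec<u8>` record type and the channel depth of 4 are assumptions chosen for illustration.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Send `records` to a consumer thread in batches of `batch_size` and
/// return the number of records the consumer counted.
fn count_on_other_thread(records: Vec<Vec<u8>>, batch_size: usize) -> usize {
    assert!(batch_size > 0);
    // Bounded channel: the producer blocks once 4 batches are queued.
    let (tx, rx) = sync_channel::<Vec<Vec<u8>>>(4);

    let consumer = thread::spawn(move || {
        let mut n = 0;
        // The loop ends when the sender is dropped.
        for batch in rx {
            n += batch.len();
        }
        n
    });

    let mut batch = Vec::with_capacity(batch_size);
    for record in records {
        batch.push(record);
        if batch.len() == batch_size {
            // One channel operation per batch instead of one per record.
            let full = std::mem::replace(&mut batch, Vec::with_capacity(batch_size));
            tx.send(full).expect("consumer hung up");
        }
    }
    // Flush the last, possibly partial, batch.
    if !batch.is_empty() {
        tx.send(batch).expect("consumer hung up");
    }
    drop(tx);
    consumer.join().expect("consumer panicked")
}
```

With `batch_size` set to 1 this degenerates to the one-send-per-record pattern described above, which makes it easy to compare the two.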


fastq-rs's Issues

crates.io release?

Hi @aseyboldt, Any chance we could get an updated crates.io (maybe 0.5.1), with the latest change? Thanks!

tests don't run on Mac OS: parasailors build

The build of parasailors fails inside its build.rs. I'd recommend dropping parasailors as a dev dependency because it's not critical for testing fastq or demonstrating its functionality, and projects with native code are less likely to build successfully than pure-Rust crates.

error: failed to run custom build command for `parasail-sys v0.2.5`

Caused by:
  process didn't exit successfully: `/Users/patrick/code/fastq-rs/target/debug/build/parasail-sys-7eb8e8d2fd2591d9/build-script-build` (exit code: 101)
--- stderr
thread 'main' panicked at 'Problem copying library to target directoy.: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/libcore/result.rs:1084:5
stack backtrace:
   0:        0x1009c4b62 - backtrace::backtrace::libunwind::trace::hce12a9913e4eeca6
                               at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/libunwind.rs:88
   1:        0x1009c4b62 - backtrace::backtrace::trace_unsynchronized::h56a939a6ba5a4791
                               at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/mod.rs:66
   2:        0x1009c4b62 - std::sys_common::backtrace::_print::h587c601f87837d17
                               at src/libstd/sys_common/backtrace.rs:47
   3:        0x1009c4b62 - std::sys_common::backtrace::print::hded6a7e1e62f7308
                               at src/libstd/sys_common/backtrace.rs:36
   4:        0x1009c4b62 - std::panicking::default_hook::{{closure}}::h3f994bbc901f9889
                               at src/libstd/panicking.rs:200
   5:        0x1009c482d - std::panicking::default_hook::h6c261b7dad1af707
                               at src/libstd/panicking.rs:214
   6:        0x1009c5280 - std::panicking::rust_panic_with_hook::hd3c20890ac648923
                               at src/libstd/panicking.rs:477
   7:        0x1009c4dbd - std::panicking::continue_panic_fmt::hf444d349a369432b

each_zipped and parse_path

Hi,

fastq looks good - but I'm wondering if there's a way to iterate multiple fastqs together when each of them may or may not be compressed? My current attempt basically involves munging the source of each of these methods together.

Also, while I'm talking feature requests, any chance of fasta support? With that it would be a nice kseq.h replacement.

Thanks,
ben
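For the "may or may not be compressed" part, one common approach is to peek at the first bytes of each input and pick a decoder from the magic number. The sketch below only does the detection, using the standard library. `Compression` and `sniff_compression` are hypothetical names, not part of this crate's API. The gzip magic bytes are 1f 8b, and the LZ4 frame magic number 0x184D2204 is stored little-endian.

```rust
use std::io::{self, BufRead};

#[derive(Debug, PartialEq, Eq)]
enum Compression {
    Gzip,
    Lz4,
    Plain,
}

/// Guess the compression format from the leading bytes without consuming them.
fn sniff_compression<R: BufRead>(reader: &mut R) -> io::Result<Compression> {
    // fill_buf exposes buffered bytes but leaves them in the reader.
    // Caveat: a reader may hand back fewer than 4 bytes on the first call,
    // in which case this sketch falls back to `Plain`.
    let head = reader.fill_buf()?;
    Ok(if head.starts_with(&[0x1f, 0x8b]) {
        Compression::Gzip
    } else if head.starts_with(&[0x04, 0x22, 0x4d, 0x18]) {
        Compression::Lz4
    } else {
        Compression::Plain
    })
}
```

Because nothing is consumed, the same reader can afterwards be wrapped in whichever decoder matches and then handed to the parser.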

Encounter `error: "Fastq record is too long"` when parsing Nanopore sequence data

What is the maximum sequence length within a record? I am imagining a workflow that should regularly accommodate reads over 100kb in length (and, with recent ultra-long updates, should occasionally expect multi-Mb sequence reads).

What would be the most sustainable approach to working through this hurdle? Updating the buffer size (and forking the project), or reverting to bio::io::fastq? As a developer new to Rust, I'd welcome any comments on, e.g., how performance is going to suffer.

Would welcome some thoughts here - thanks!

buffer clean panics

The Buffer clean function doesn't handle the case where buffer.start < 16: the alignment can then make new_end larger than old_end. This causes an underflow in https://github.com/aseyboldt/fastq-rs/blob/master/src/buffer.rs#L72 which happened to me when I was parsing a gzipped fastq file.

If you want to align you could change the result into an isize or not move the data when buffer.start is less than 16. Or since the result is never used you could skip it entirely.
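To make the arithmetic concrete, here is a small model of the "don't move the data" option. It is not the code in buffer.rs: `Window`, `ALIGN` and `compacted` are hypothetical, and the sketch assumes that compaction moves the unread bytes so that their end lands on a multiple of 16.

```rust
/// Hypothetical model of the buffer bookkeeping: `start..end` is the
/// unread region, with `start <= end`.
#[derive(Debug, PartialEq)]
struct Window {
    start: usize,
    end: usize,
}

const ALIGN: usize = 16;

/// Where the unread bytes would land after compaction, or `None` if the
/// data should stay where it is.
fn compacted(w: &Window) -> Option<Window> {
    debug_assert!(w.start <= w.end);
    let len = w.end - w.start;
    // Round the new end up to the next multiple of ALIGN.
    let new_end = (len + ALIGN - 1) / ALIGN * ALIGN;
    // A plain `w.end - new_end` underflows when the aligned end lies past
    // the old end, which can only happen for start < ALIGN. checked_sub
    // turns that case into `None` instead.
    let shift = w.end.checked_sub(new_end)?;
    if shift == 0 {
        return None; // already in place
    }
    Some(Window { start: new_end - len, end: new_end })
}
```

For example, `start = 2, end = 19` holds 17 bytes, so the aligned end would be 32, which is past 19. The sketch returns `None` there rather than subtracting.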

Iterate through two fastqs together

Hello

Thank you for this library. I am trying to learn Rust by re-implementing some code. My code iterates through two fastqs, takes part of the sequence from one, and appends it to the header of the other.

I can iterate through both at once using something like this:

    for (idx_records, read_records) in idx_parser.record_sets().zip(read_parser.record_sets()) {
        for (idx_record, read_record) in idx_records.iter().zip(read_records.iter()) {
            for (idx_read, read_read) in idx_record.iter().zip(read_record.iter()) {
                println!("Do something");
            }
        }
    }

But I was wondering if this was possible with parser.each()?

Sorry I know this is not an issue with your code, but any help would be appreciated
Thanks
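For reference, the task itself (walk two files in lockstep, append part of the index sequence to the read header) can be written against the standard library alone. This does not answer the parser.each() question and does not use this crate. All names below are made up, the `:` separator in the header is an arbitrary choice, and four-line records are assumed.

```rust
use std::io::{self, BufRead, Write};

/// One four-line FASTQ record with line endings stripped.
struct Rec {
    head: String,
    seq: String,
    qual: String,
}

fn next_line<R: BufRead>(r: &mut R) -> io::Result<Option<String>> {
    let mut line = String::new();
    if r.read_line(&mut line)? == 0 {
        return Ok(None);
    }
    let len = line.trim_end_matches(&['\n', '\r'][..]).len();
    line.truncate(len);
    Ok(Some(line))
}

fn truncated() -> io::Error {
    io::Error::new(io::ErrorKind::UnexpectedEof, "truncated FASTQ record")
}

fn read_record<R: BufRead>(r: &mut R) -> io::Result<Option<Rec>> {
    let head = match next_line(r)? {
        Some(h) => h,
        None => return Ok(None), // clean end of input
    };
    let seq = next_line(r)?.ok_or_else(truncated)?;
    let _separator = next_line(r)?.ok_or_else(truncated)?;
    let qual = next_line(r)?.ok_or_else(truncated)?;
    Ok(Some(Rec { head, seq, qual }))
}

/// Append the first `n` bases of each index read to the matching read header.
/// Returns the number of records written.
fn tag_reads<A: BufRead, B: BufRead, W: Write>(
    mut idx: A,
    mut reads: B,
    n: usize,
    mut out: W,
) -> io::Result<u64> {
    let mut count = 0;
    loop {
        let (i, r) = match (read_record(&mut idx)?, read_record(&mut reads)?) {
            (Some(i), Some(r)) => (i, r),
            (None, None) => return Ok(count),
            _ => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "inputs have different numbers of records",
                ))
            }
        };
        // `get` avoids a panic if the sequence is shorter than `n`.
        let tag = i.seq.get(..n).unwrap_or(&i.seq);
        writeln!(out, "{}:{}\n{}\n+\n{}", r.head, tag, r.seq, r.qual)?;
        count += 1;
    }
}
```

Unlike a plain `zip`, this reports an error when one input ends before the other instead of silently stopping.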

Interleaved fastq file

I think this is related to #3 (?): how would you parse an interleaved fastq file, where read 1 is the first entry of the file, read 2 is the second entry, and so on? Thank you!
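Independent of any particular parser, interleaved input comes down to grouping consecutive records into pairs. A generic sketch over any iterator of records follows. `pairs` is a hypothetical helper, not something this crate provides, and it reports a trailing unpaired record as `Err`.

```rust
/// Group consecutive items into (read 1, read 2) pairs.
/// A leftover item without a mate is yielded as `Err(item)`.
fn pairs<I: Iterator>(
    mut records: I,
) -> impl Iterator<Item = Result<(I::Item, I::Item), I::Item>> {
    std::iter::from_fn(move || {
        // Stop cleanly when there is no first read left.
        let first = records.next()?;
        Some(match records.next() {
            Some(second) => Ok((first, second)),
            None => Err(first),
        })
    })
}
```

The same adapter works on owned records, borrowed records, or plain strings in a test, since it never inspects the items.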
