
fastq-rs's Introduction


A fast parser for fastq.

This library can process fastq files at about the speed of the coreutils wc -l (about 2GB/s on my laptop, seqan runs at about 150MB/s). It also makes it easy to distribute the processing of fastq records to many cores, without losing much of the performance.

See the documentation for details and examples.

Benchmarks

We compare this library with the fastq parser in rust-bio, the C++ library seqan 2.2.0, with kseq.h and with wc -l.

We test 4 scenarios:

  • A 2GB test file is uncompressed on a ramdisk. The program counts the number of records in the file.
  • The test file is lz4-compressed on disk, with an empty page cache. Again, the program should just count the number of records.
  • The test file is lz4 compressed on disk with empty page cache, but the program sends records to a different thread. This thread counts the number of records.
  • The same as scenario 3, but with gzip compression.
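As a rough illustration of what "counting the records" means in these scenarios, here is a minimal sketch that uses only the Rust standard library. It is not the parser from this crate: the function name `count_records` is hypothetical, and it assumes the common four-line layout (header, sequence, separator, quality) without wrapped lines.

```rust
use std::io::{self, BufRead};

/// Builds an `InvalidData` error with the given message.
fn invalid(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg.to_string())
}

/// Counts FASTQ records in `reader`, assuming four lines per record.
/// Only the first byte of the header and separator lines is checked.
/// Blank lines (including a trailing one) are reported as errors.
fn count_records<R: BufRead>(reader: R) -> io::Result<u64> {
    let mut lines = reader.lines();
    let mut count = 0u64;
    while let Some(header) = lines.next() {
        if !header?.starts_with('@') {
            return Err(invalid("header line does not start with '@'"));
        }
        // Sequence line: its content is not inspected here.
        lines.next().ok_or_else(|| invalid("missing sequence line"))??;
        let separator = lines
            .next()
            .ok_or_else(|| invalid("missing separator line"))??;
        if !separator.starts_with('+') {
            return Err(invalid("separator line does not start with '+'"));
        }
        // Quality line: may itself start with '@', so it is skipped unchecked.
        lines.next().ok_or_else(|| invalid("missing quality line"))??;
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let data = "@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n";
    let n = count_records(data.as_bytes())?;
    println!("{} records", n);
    Ok(())
}
```

Reading line by line into a fresh `String` is far slower than scanning a large buffer for newlines, so this sketch only shows the logic, not how to reach the throughput reported here.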

All measurements are taken with a 2GB test file (TODO describe!) on a Haswell i7-4510U @ 2GHz. Each program is executed three times (clearing the OS page cache where appropriate) and the best time is used. Libraries without native support for a compression algorithm get the input via a pipe from zcat or lz4 -d. The C and C++ programs are compiled with gcc 6.2.1 with the flags -O3 -march=native. All programs can be found in the examples directory of this repository.

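| program  | ramdisk | lz4     | lz4 + thread | gzip    | gzip + thread |
|----------|---------|---------|--------------|---------|---------------|
| wc -l    | 2.3GB/s | 1.2GB/s | NA           | 300MB/s | NA            |
| fastq    | 1.9GB/s | 1.9GB/s | 1.6GB/s      | 650MB/s | 620MB/s       |
| rust-bio | 730MB/s | NA      | 250MB/s      | NA      | NA            |
| seqan    | 150MB/s | NA      | NA           | NA      | NA            |
| kseq.h   | 980MB/s | 680MB/s | NA           | NA      | NA            |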

Some notes from checking perf record:

  • wc -l and fastq spend most of the time in memchr(), but in contrast to wc, fastq has to check that headers begin with @ and separator lines with +, and do some more bookkeeping. lz4 -d uses a large buffer size (default 4MB), which seems to prevent the operating system from running lz4 and wc concurrently when connected by a pipe. fastq avoids this problem with an internal queue.
  • rust-bio loses some time copying data and validating UTF-8. The large slowdown in the threaded version stems from the fact that it sends each record to the other thread individually. Each send (I use a sync_channel from the Rust stdlib) requires the use of synchronisation primitives, and three allocations for header, sequence and quality. Collecting records in a Vec and sending only after a large number of them is available speeds this up from 150MB/s to 250MB/s.
  • seqan is busy allocating stuff, and uses (I think) a naive implementation of memchr() to find line breaks.
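The batching idea from the rust-bio note can be sketched with the standard library alone. This is not the benchmark program from the examples directory; `OwnedRecord`, `count_in_thread`, the channel bound of 4 and the batch size are all assumptions made for illustration.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical owned record. The three Strings mirror the three
/// allocations (header, sequence, quality) mentioned in the notes.
#[allow(dead_code)]
struct OwnedRecord {
    head: String,
    seq: String,
    qual: String,
}

/// Sends `records` to a counting thread in batches of `batch_size`
/// and returns how many records that thread received.
fn count_in_thread<I>(records: I, batch_size: usize) -> usize
where
    I: IntoIterator<Item = OwnedRecord>,
{
    assert!(batch_size > 0, "batch_size must be positive");
    // The bound (4 batches, an arbitrary choice) limits memory use while
    // still letting the producer run ahead of the consumer.
    let (tx, rx) = sync_channel::<Vec<OwnedRecord>>(4);

    let consumer = thread::spawn(move || {
        let mut total = 0;
        for batch in rx {
            total += batch.len();
        }
        total
    });

    let mut batch = Vec::with_capacity(batch_size);
    for record in records {
        batch.push(record);
        if batch.len() == batch_size {
            // One channel operation per batch instead of one per record.
            let full = std::mem::replace(&mut batch, Vec::with_capacity(batch_size));
            tx.send(full).expect("consumer thread hung up");
        }
    }
    if !batch.is_empty() {
        tx.send(batch).expect("consumer thread hung up");
    }
    drop(tx); // closing the channel ends the consumer's loop
    consumer.join().expect("consumer thread panicked")
}

fn main() {
    let records = (0..10).map(|i| OwnedRecord {
        head: format!("read{}", i),
        seq: "ACGT".to_string(),
        qual: "IIII".to_string(),
    });
    println!("counted {} records", count_in_thread(records, 4));
}
```

Batching reduces the number of synchronised sends, but each record still costs three allocations in this sketch, which is consistent with the note that the batched version remains well below the single-threaded speed.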

fastq-rs's People

Contributors

aseyboldt, natir, pmarks


fastq-rs's Issues

tests don't run on Mac OS: parasailors build

The build of parasailors fails inside its build.rs. I'd recommend dropping parasailors as a dev dependency because it's not critical for testing fastq or demonstrating its functionality, and projects with native code are less likely to build successfully than pure-Rust crates.

error: failed to run custom build command for `parasail-sys v0.2.5`

Caused by:
  process didn't exit successfully: `/Users/patrick/code/fastq-rs/target/debug/build/parasail-sys-7eb8e8d2fd2591d9/build-script-build` (exit code: 101)
--- stderr
thread 'main' panicked at 'Problem copying library to target directoy.: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/libcore/result.rs:1084:5
stack backtrace:
   0:        0x1009c4b62 - backtrace::backtrace::libunwind::trace::hce12a9913e4eeca6
                               at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/libunwind.rs:88
   1:        0x1009c4b62 - backtrace::backtrace::trace_unsynchronized::h56a939a6ba5a4791
                               at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/mod.rs:66
   2:        0x1009c4b62 - std::sys_common::backtrace::_print::h587c601f87837d17
                               at src/libstd/sys_common/backtrace.rs:47
   3:        0x1009c4b62 - std::sys_common::backtrace::print::hded6a7e1e62f7308
                               at src/libstd/sys_common/backtrace.rs:36
   4:        0x1009c4b62 - std::panicking::default_hook::{{closure}}::h3f994bbc901f9889
                               at src/libstd/panicking.rs:200
   5:        0x1009c482d - std::panicking::default_hook::h6c261b7dad1af707
                               at src/libstd/panicking.rs:214
   6:        0x1009c5280 - std::panicking::rust_panic_with_hook::hd3c20890ac648923
                               at src/libstd/panicking.rs:477
   7:        0x1009c4dbd - std::panicking::continue_panic_fmt::hf444d349a369432b

each_zipped and parse_path

Hi,

fastq looks good - but I'm wondering if there's a way to iterate multiple fastqs together when each of them may or may not be compressed? My current attempt basically involves munging the source of each of these methods together.

Also, while I'm talking feature requests, any chance of fasta support? With that it would be a nice kseq.h replacement.

Thanks,
ben

Encounter `error: "Fastq record is too long"` when parsing Nanopore sequence data

What are the limits of the maximum sequence length within a record? I am imagining a workflow that should regularly accommodate reads over 100kb in length (and, with recent ultra-long updates, should occasionally expect multi-Mb sequence reads).

What would be the most sustainable approach to working through this hurdle? Updating the buffer usize (and forking the project), or reverting to bio::io::fastq? As a developer new to Rust, I'd welcome any comments, e.g. on how performance is going to suffer.

Would welcome some thoughts here - thanks!

buffer clean panics

The Buffer clean function doesn't consider the case that when buffer.start < 16 the alignment causes new_end to be larger than old_end. This causes an underflow in https://github.com/aseyboldt/fastq-rs/blob/master/src/buffer.rs#L72 which happened to me when I was parsing a gzipped fastq file.

If you want to align you could change the result into an isize or not move the data when buffer.start is less than 16. Or since the result is never used you could skip it entirely.
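To make the failure mode concrete, here is a small sketch of a variant of the second suggestion. It is not the code from src/buffer.rs: the `Buffer` struct, its fields and the `clean` body below are hypothetical stand-ins, and the only assumption taken from the report is that the unread bytes are moved so that the new end is a multiple of 16.

```rust
/// Hypothetical stand-in for the buffer's bookkeeping: the unread bytes
/// live in `data[start..end]`, with `start <= end <= data.len()`.
struct Buffer {
    data: Vec<u8>,
    start: usize,
    end: usize,
}

impl Buffer {
    /// Moves the unread bytes toward the front so that `end` lands on a
    /// multiple of 16. The buffer is left untouched when the aligned
    /// position would not be further left than the current one.
    fn clean(&mut self) {
        let len = self.end - self.start;
        let new_end = (len + 15) & !15; // round up to a multiple of 16
        let new_start = new_end - len; // always in 0..=15, cannot underflow
        if new_start >= self.start {
            // Only possible when start < 16. Moving here would shift data
            // to the right and can push new_end past the old end.
            return;
        }
        // copy_within handles overlapping ranges like memmove.
        self.data.copy_within(self.start..self.end, new_start);
        self.start = new_start;
        self.end = new_end;
    }
}

fn main() {
    // The reported situation: start < 16 and an aligned end beyond the old one.
    let (start, end) = (3usize, 20usize);
    let new_end = (end - start + 15) & !15;
    // An unsigned `end - new_end` would underflow here.
    assert_eq!(end.checked_sub(new_end), None);

    let mut buf = Buffer { data: (0..64).collect(), start, end };
    buf.clean();
    println!("after clean: start={} end={}", buf.start, buf.end);
}
```

Whenever the guard lets the move happen, `new_start < start` implies `new_end < end`, so any later subtraction of the new end from the old end stays non-negative.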

Iterate through two fastqs together

Hello

Thank you for this library. I am trying to learn rust by re-implementing some code. My code iterates through two fastqs and takes part of the sequence from one and appends it to the header of the other.

I can iterate through both at once using something like this:

    for (idx_records, read_records) in idx_parser.record_sets().zip(read_parser.record_sets()) {
        for (idx_record, read_record) in idx_records.iter().zip(read_records.iter()) {
            for (idx_read, read_read) in idx_record.iter().zip(read_record.iter()) {
                println!("Do something");
            }
        }
    }

But I was wondering if this was possible with parser.each()?

Sorry I know this is not an issue with your code, but any help would be appreciated
Thanks
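For reference, the task described in this issue (take part of the sequence from one file and append it to the header of the matching record in the other) can be sketched without the fastq crate. Everything here is an assumption for illustration: the `Rec` type, the `records` and `tag_reads` helpers, the `:` separator in the new header, and the four-line record layout.

```rust
/// Hypothetical borrowed view of one four-line FASTQ record.
#[derive(Debug, PartialEq)]
struct Rec<'a> {
    head: &'a str,
    seq: &'a str,
    qual: &'a str,
}

/// Yields records from four-line FASTQ text. Iteration simply stops at
/// the first incomplete or malformed record; real code should report it.
fn records<'a>(text: &'a str) -> impl Iterator<Item = Rec<'a>> + 'a {
    let mut lines = text.lines();
    std::iter::from_fn(move || {
        let head = lines.next()?.strip_prefix('@')?;
        let seq = lines.next()?;
        lines.next()?.strip_prefix('+')?;
        let qual = lines.next()?;
        Some(Rec { head, seq, qual })
    })
}

/// Appends the first `n` bases of each index read to the header of the
/// read at the same position. `zip` stops at the shorter input.
fn tag_reads(idx_text: &str, read_text: &str, n: usize) -> String {
    let mut out = String::new();
    for (idx, read) in records(idx_text).zip(records(read_text)) {
        // Fall back to the whole sequence if it is shorter than `n`.
        let tag = idx.seq.get(..n).unwrap_or(idx.seq);
        out.push_str(&format!(
            "@{}:{}\n{}\n+\n{}\n",
            read.head, tag, read.seq, read.qual
        ));
    }
    out
}

fn main() {
    let idx = "@i1\nAACCGG\n+\nIIIIII\n@i2\nTTGGCC\n+\nIIIIII\n";
    let reads = "@r1\nACGT\n+\nFFFF\n@r2\nTGCA\n+\nFFFF\n";
    print!("{}", tag_reads(idx, reads, 3));
}
```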

Interleaved fastq file

I think related to #3 (?), how would you parse an interleaved fastq file? Where read 1 is the first entry of the file and read 2 is the second entry, etc. Thank you!
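One way to think about an interleaved file is as an ordinary record stream whose consecutive records are grouped into pairs. The sketch below shows only that grouping step with the standard library; `pairs` and `parse` are hypothetical helpers, the records are reduced to (header, sequence) tuples, and the read names in the sample data are made up.

```rust
/// Groups consecutive items into (first, second) pairs. A trailing item
/// without a mate is returned as `Err(item)` so the caller can report it.
fn pairs<I: Iterator>(
    mut it: I,
) -> impl Iterator<Item = Result<(I::Item, I::Item), I::Item>> {
    std::iter::from_fn(move || {
        let first = it.next()?;
        Some(match it.next() {
            Some(second) => Ok((first, second)),
            None => Err(first),
        })
    })
}

/// Minimal stand-in parser: splits four-line FASTQ text into
/// (header, sequence) tuples without validating anything.
fn parse(text: &str) -> Vec<(&str, &str)> {
    let lines: Vec<&str> = text.lines().collect();
    lines.chunks_exact(4).map(|c| (c[0], c[1])).collect()
}

fn main() {
    let text = "@a/1\nAC\n+\nII\n@a/2\nGT\n+\nII\n@b/1\nTT\n+\nII\n";
    for pair in pairs(parse(text).into_iter()) {
        match pair {
            Ok((r1, r2)) => println!("pair: {} / {}", r1.0, r2.0),
            Err(r1) => println!("unpaired record: {}", r1.0),
        }
    }
}
```

The same adapter works over any record iterator, so it could sit on top of whichever parser produces the records.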
