faimm

Random access to indexed fasta using a memory mapped file.

Usage

This crate provides indexed fasta access by using a memory mapped file to read the sequence data. It is intended for accessing sequence data in genome-sized fasta files and provides random access based on base coordinates. Because an indexed fasta file stores a fixed number of bases per line, separated by (sometimes platform-specific) newlines, the bytes available from the mmap cannot be used directly.
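To illustrate why the raw mmap bytes cannot be indexed directly, here is a minimal sketch of the coordinate arithmetic implied by the standard .fai columns (offset, bases per line, bytes per line). The `FaiRecord` struct and `byte_offset` function are hypothetical names for this illustration and are not taken from faimm's source.

```rust
/// Per-sequence fields of a `.fai` index line (hypothetical struct for illustration).
struct FaiRecord {
    offset: u64,     // byte offset of the first base of the sequence
    line_bases: u64, // number of bases on each full line
    line_width: u64, // number of bytes on each full line, newline included
}

/// Translate a zero-based base coordinate into a byte offset in the file.
fn byte_offset(rec: &FaiRecord, pos: u64) -> u64 {
    let full_lines = pos / rec.line_bases; // complete lines before `pos`
    let within_line = pos % rec.line_bases; // position inside its own line
    rec.offset + full_lines * rec.line_width + within_line
}

fn main() {
    // 60 bases per line with a one-byte "\n" newline gives 61 bytes per line.
    let rec = FaiRecord { offset: 6, line_bases: 60, line_width: 61 };
    assert_eq!(byte_offset(&rec, 0), 6);
    assert_eq!(byte_offset(&rec, 59), 65); // last base of the first line
    assert_eq!(byte_offset(&rec, 60), 67); // byte 66 is the newline
    println!("base 60 lives at byte {}", byte_offset(&rec, 60));
}
```

With "\r\n" newlines the line width would be 62 instead of 61, which is why the newline bytes have to be skipped rather than assumed.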

Access is provided through a view of the mmap that is addressed with zero-based base coordinates. This view can then be iterated over base by base (each base represented as a u8) or converted to a string. Naive GC counting is also available.
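As a rough idea of what naive GC counting over an iterator of u8 bases can look like, consider the sketch below. It is not faimm's implementation: `gc_fraction` is a hypothetical helper, a plain byte slice stands in for the view, and the choice to ignore anything other than A, C, G and T is an assumption made for the example.

```rust
/// Fraction of G/C among the A, C, G, T bases of the input (case-insensitive).
/// Returns `None` when no A/C/G/T base is present. Hypothetical helper.
fn gc_fraction<I: IntoIterator<Item = u8>>(bases: I) -> Option<f64> {
    let (mut gc, mut total) = (0u64, 0u64);
    for b in bases {
        match b.to_ascii_uppercase() {
            b'G' | b'C' => {
                gc += 1;
                total += 1;
            }
            b'A' | b'T' => total += 1,
            _ => {} // N and other codes are ignored in this sketch
        }
    }
    if total == 0 {
        None
    } else {
        Some(gc as f64 / total as f64)
    }
}

fn main() {
    // A byte slice stands in for the bases yielded by a view.
    let seq = b"ACGTNNgc";
    match gc_fraction(seq.iter().copied()) {
        Some(f) => println!("GC fraction: {:.3}", f),
        None => println!("no countable bases"),
    }
}
```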

Access to the sequence data doesn't require the IndexedFasta to be mutable. This makes it easy to share.
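The practical benefit of read-only access is that several threads can work from one shared reference. The sketch below shows that pattern with scoped threads. `Genome`, `count_gc` and `parallel_gc` are stand-ins invented for this example; whether IndexedFasta itself can be sent to or shared between threads is not stated here and should be checked against the crate documentation.

```rust
use std::thread;

/// Stand-in for a read-only sequence store (hypothetical, not faimm's type).
struct Genome {
    data: Vec<u8>,
}

impl Genome {
    /// Reading needs only `&self`, so no exclusive borrow is required.
    fn count_gc(&self, start: usize, end: usize) -> usize {
        self.data[start..end]
            .iter()
            .filter(|&&b| matches!(b, b'G' | b'C'))
            .count()
    }
}

/// Count GC in each region on its own thread, all borrowing the same `Genome`.
fn parallel_gc(genome: &Genome, regions: &[(usize, usize)]) -> Vec<usize> {
    thread::scope(|s| {
        let handles: Vec<_> = regions
            .iter()
            .map(|&(start, end)| s.spawn(move || genome.count_gc(start, end)))
            .collect();
        // Join in spawn order so results line up with `regions`.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let genome = Genome { data: b"ACGTGGCCAATT".to_vec() };
    let counts = parallel_gc(&genome, &[(0, 4), (4, 8), (8, 12)]);
    assert_eq!(counts, vec![2, 4, 0]);
    println!("{:?}", counts);
}
```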

Example

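use faimm::IndexedFasta;

fn main() {
    // Open the fasta file (an accompanying .fai index is expected)
    let fa = IndexedFasta::from_file("test/genome.fa").expect("Error opening fa");
    // Look up the index of the sequence by name
    let chr_index = fa.fai().tid("ACGT-25").expect("Cannot find chr in index");
    // Get a view using zero-based base coordinates
    let v = fa.view(chr_index, 0, 50).expect("Cannot get .fa view");
    // Count the bases
    let counts = v.count_bases();
    // Or print the sequence
    println!("{}", v.to_string());
}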

Limitations

The parser uses a simple ASCII mask for allowable characters (64..128) and does not apply any IUPAC conversion or validation. Anything outside this range is silently skipped, which means that invalid fasta will also be parsed: the mere presence of an accompanying .fai is taken as evidence that the fasta is valid. Requires Rust >=1.64.
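The effect of the character mask can be sketched as follows. This is an illustration rather than faimm's actual code: `is_allowed` and `clean` are hypothetical names, and the range 64..128 is read as half-open (byte values 64 through 127), which happens to be the set of bytes with bit 6 set and bit 7 clear.

```rust
/// True for bytes in 64..128: bit 6 set and bit 7 clear (hypothetical helper).
fn is_allowed(b: u8) -> bool {
    (b & 0b1100_0000) == 0b0100_0000
}

/// Drop every byte outside the mask, which removes newlines among other things.
fn clean(raw: &[u8]) -> Vec<u8> {
    raw.iter().copied().filter(|&b| is_allowed(b)).collect()
}

fn main() {
    // Newlines, digits and '*' fall below 64 and are skipped silently;
    // 'X' is not checked against any alphabet and passes through.
    let raw = b"ACGT\r\nacgt\n12*NX";
    let cleaned = clean(raw);
    assert_eq!(cleaned, b"ACGTacgtNX".to_vec());
    println!("{}", String::from_utf8_lossy(&cleaned));
}
```

Note that such a mask also accepts characters like '@' or '_', so it filters line endings but does not check that the remaining bytes are valid bases.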

Alternatives

Rust-bio provides a competent indexed fasta reader. The major difference is that it has an internal buffer and therefore needs to be mutable when performing read operations. faimm is also faster. If you want record-based access (without an .fai index file), rust-bio or seq_io provide this.

Performance

Calculating the GC content of the target regions of an exome (231_410 regions) on the human reference (GRCh38) takes about 0.7 seconds (warm cache). This is slightly faster than bedtools nuc (0.9s, probably a more sound implementation) and rust-bio (1.3s, same implementation as the example). Some tests show that counting can be improved further using SIMD, but nothing has been released.
