vccri / atlantool
Command line tool to index and search BAM files by QNAME
License: MIT License
Writing this up in an issue so that it's not lost: as of writing this, the index size for a 120 GB file is 11 GB. To reduce that further, I looked into storing only block-level pointers.
Currently, in the .data.bgz file we store the virtual offset (the location of the record in the BAM file) with each QNAME. That means each offset is a different 8-byte number, which is hard to compress.
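For context, here is a minimal sketch of how a BAM virtual offset is laid out. The layout (upper 48 bits are the file offset of the BGZF block, lower 16 bits are the offset inside the uncompressed block) comes from the SAM/BAM specification, not from this issue, and the function names are hypothetical. It shows why records in the same block share most of their bits, and what is dropped when only a block-level pointer is kept.

```python
# Sketch only: function names are hypothetical, layout is per the SAM/BAM spec.

def make_virtual_offset(block_offset, within_block):
    """Pack a BGZF block file offset and an in-block offset into one 64-bit value."""
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within_block must fit in 16 bits")
    if not 0 <= block_offset < (1 << 48):
        raise ValueError("block_offset must fit in 48 bits")
    return (block_offset << 16) | within_block


def split_virtual_offset(voffset):
    """Return (block_offset, within_block) for a 64-bit virtual offset."""
    block_offset = voffset >> 16      # where the compressed block starts in the .bam
    within_block = voffset & 0xFFFF   # position inside the uncompressed block
    return block_offset, within_block


# Two records in the same block differ only in the low 16 bits.
a = make_virtual_offset(123456, 10)
b = make_virtual_offset(123456, 4000)
same_block = split_virtual_offset(a)[0] == split_virtual_offset(b)[0]
```

A block-level pointer keeps only the first element of that pair, so the record has to be found by scanning the block.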
Instead of storing individual offsets, we can:
- Store the block pointers in a separate .blocks file, 8 bytes per pointer, one after the other. Can be compressed or uncompressed.
- In .data.bgz, instead of storing 8 bytes, just store a block index (block number). In the 120 GB file, there are ~9 million blocks, so a fixed-width block index takes up 3 bytes at worst. Because we don't know how many blocks we have, using a variable-length encoding like Varint makes sense. So for earlier indexes, we only use 1, 2, or 3 bytes, while not putting a (potentially too low) limit on the number of blocks.

Now to look up a record in the BAM, we need to:
- Look up the block pointer in the blocks index. If it's uncompressed, all you need is to seek to byte position block_number * 8.
- Scan that block in the BAM for the record with the matching QNAME.
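The two pieces above can be sketched roughly as follows. This is an illustration under assumptions, not the tool's actual code: the Varint here is the common 7-bits-per-byte, least-significant-group-first scheme, the .blocks pointers are assumed to be little-endian unsigned 64-bit integers, and all function names are hypothetical.

```python
import io
import struct


def encode_varint(n):
    """Encode a non-negative int using 7 payload bits per byte (assumed scheme)."""
    if n < 0:
        raise ValueError("varint is only defined here for non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode one varint from buf starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_block_pointer(blocks_file, block_number):
    """Fetch the 8-byte pointer for block_number from an uncompressed .blocks file."""
    blocks_file.seek(block_number * 8)
    raw = blocks_file.read(8)
    if len(raw) != 8:
        raise IndexError("block_number is past the end of the .blocks file")
    return struct.unpack("<Q", raw)[0]  # little-endian is an assumption


# Usage with an in-memory stand-in for a .blocks file holding three pointers.
fake_blocks = io.BytesIO(struct.pack("<3Q", 0, 18000, 36500))
pointer = read_block_pointer(fake_blocks, 2)
```

With this particular Varint scheme, block numbers below 128 take 1 byte, below 16,384 take 2 bytes, and below 2,097,152 take 3 bytes; the highest block numbers in a file with ~9 million blocks would take 4 bytes.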
I did the indexing changes, and the resulting file sizes were:
6.0G qname.data.bgz
531K qname.index.bgz
71M qname.blocks
That's almost a 50% reduction in index size, so pretty good.
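As a rough sanity check of those numbers, the arithmetic can be sketched like this. It assumes the listed sizes use binary units (as `ls -h` style output usually does), treats the rounded figures from this issue as exact, and takes the original "11 GB" as 11 GiB, so the results are approximate.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

# Rounded sizes quoted in this issue (assumed to be binary units).
data_bytes = 6 * GIB      # qname.data.bgz
index_bytes = 531 * KIB   # qname.index.bgz
blocks_bytes = 71 * MIB   # qname.blocks
old_total_bytes = 11 * GIB

new_total_bytes = data_bytes + index_bytes + blocks_bytes
reduction = 1 - new_total_bytes / old_total_bytes

# At 8 bytes per pointer, the .blocks size implies this many blocks.
implied_blocks = blocks_bytes // 8

print(f"reduction: {reduction:.1%}, implied blocks: {implied_blocks:,}")
```

Under these assumptions the reduction comes out at roughly 45%, and the implied block count is a little over 9 million, which agrees with the ~9 million blocks mentioned above.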
I haven't had time to check the effect on search performance yet, but I don't think it would be too bad. In the average case, we'd have to scan a single block in the BAM, which has a maximum size of 64 KB.
- I kept the .blocks file uncompressed to allow for seeking, and 71 MB is small enough. But we could compress it, read it all into memory and then use that to retrieve the block positions.
- Instead of a separate .blocks file, could we just store the block pointer in .data.bgz and hope that compression takes care of things? Answer: I don't think it would work because the compression is done on chunks of QNAMEs, which don't necessarily contain the same block pointers.