Git Product home page Git Product logo

atlantool's People

Contributors

amitdev avatar badgersow avatar huy avatar huyleatlassian avatar robinst avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

robinst

atlantool's Issues

Smaller index size by storing block-level pointers only

Writing this up in an issue so that it's not lost: As of writing this, the index size for a 120 GB file is 11 GB. In order to reduce that more, I looked into storing only block-level pointers.

Index changes

Currently, in the .data.bgz file we store the virtual offset (location to record in BAM file) with each QNAME. That means each offset is a different 8 byte number, and hard to compress.

Instead of storing individual offsets, we can instead:

  • On the initial scan, remember the position of the first record per block. Write those positions to a .blocks file, 8 bytes per pointer, one after the other. Can be compressed or uncompressed.
  • When storing the value for QNAME in .data.bgz, instead of storing 8 bytes, just store a block index instead (block number). In the 120 GB file, there are ~9 million blocks, so that means a block index takes up 3 bytes at worst. Because we don't know how many blocks we have, using a variable encoding like Varint makes sense. So for earlier indexes, we only use 1, 2, 3 bytes, while not putting a (potentially too low) limit on the number of blocks.

Search changes

Now to look up a record in the BAM, we need to:

  • Find the QNAME in data (same as before)
  • Read its block number
  • Look up the block position in the blocks index. If it's uncompressed, all you need is to seek to byte position block_number * 8
  • Go to that position in the BAM file and do a linear scan until that QNAME is found

Results

I did the indexing changes, and the resulting file sizes were:

6.0G  qname.data.bgz
531K  qname.index.bgz
71M   qname.blocks

That's almost a 50% reduction in index size, so pretty good.

I didn't have time to check the effect on search performance yet, but don't think it would be too bad. In the average case, we'd have to scan a single block in the BAM which has a maximum size of 64 KB.

Thoughts

  • I didn't compress the .blocks file to allow for seeking, and 71 MB is small enough. But we could compress it, read it all into memory and then use that to retrieve the block positions.
  • Instead of having an additional .blocks file, could we just store the block pointer in .data.bgz and hope that compression takes care of things? Answer: I don't think it would work because the compression is done on chunks of QNAMEs, which don't necessarily contain the same block pointers.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.