Git Product home page Git Product logo

2i_bench's People

Contributors

jermp avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

2i_bench's Issues

Feature Request: Binomial Coefficient Entropy Coding for Inverted Index Postings

  • Feature Name: Binomial Coefficient Entropy Coding for Inverted Index Postings
  • Status: draft
  • Start Date: 2021-12-13
  • Authors: Murage Kibicho

Summary

This is a proposal for encoding inverted index integer sequences using a number system based on binomial coefficients.
This number system is chosen because the minimum number of bits needed to encode a list of strictly increasing integers
is given by the equation ceilinglog2
where U is largest number in the list and n is the number of elements in the list.
For example to encode the sequence 1,2,3,4,10,11 we need at least 9 bits: 11 choose 6 and
log2(462) = 8.85 ~ 9 bits.
Binary interpolative coding, one of the best methods for inverted list compression takes at least 20 bits to represent the example sequence. You can confirm this here.
However, if we represent the same sequence as a sum of binomial coefficients, it takes 10 bits to encode the list, and an extra 3 bits to store the length of the list. The total is 13 bits. You can confirm this here.

Motivation

Compression is a solved problem. The best compressors work by changing radixes, or number bases to find the most concise representation of data.
For instance, arithmetic coding is a generalized change of radix for coding
at the information theoretic entropy bound. This bound is a measure of redundancy - how many duplicates are in your data. Strictly increasing integer sequences have no duplicates, therefore, cannot be compressed
according to the information theoretic bound. This means huffman coding and arithmetic coding methods are inefficient with non-repetitive integer sequences.

This rfc proposes the use of the combinatorial number system to encode inverted index integer sequences at the combinatorial information theoretic entropy bound.
Just like arithmetic coding, this change in number systems allows us to encode integer sequences with the least number of bits.

I am a Math major in my junior year and if this RFC succeeds I would love to take a gap year and work on this library full time. The library
has sample code for converting between binary and the combinatorial number system using a greedy algorithm, or by generating a lookup table.

Technical design

Overview of Combinatorial Number System

Any natural number N can be uniquely written as a sum of binomial coefficients
using this greedy algorithm.

  1. Find the largest binomial coefficient such that akChoosek
  2. Subtract to find the residue NminusAkChoosek
  3. Find the largest binomial coefficient such that Repeat

Example: Find the combinatorial representation of n63K4

Convert to Binomial

To reverse the process, sum your list of binomial coefficients
Sum

Comparison to Binary Interpolative Coding

In the example above, it can be seen that the sequence forward sequence
can be encoded in for2 using the combinatorial number system with an extra 3 bits to store the length of the sequence. This is a total of 9 bits to encode this sequence.

Using this library it can be confirmed that binary interpolative coding takes between 15 to 23 bits to encode the same sequence.

This is a screenshot of the result of binary interpolative coding.
Screenshot

The combinatorial number systems always encodes integer sequences at the entropy limit.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.