Git Product home page Git Product logo

hyperloglog's Introduction

HyperLogLog Build Status codecov Maven Central

HyperLogLog is an amazing data structure for estimating the cardinality (with very high accuracy) of large data sets that uses very little memory. This implementation of HyperLogLog contains the original algorithm by Flajolet et. al as well hyperloglog++ algorithm by Heule et. al. Refer 'References' section for blog posts/paper to find out the inner workings of hyperloglog.

Features

  • Built-in support for 32-bit and 64-bit hashcodes (Murmur3_32 and Murmur3_128 respectively)
  • API support for specifying hashcode directly (instead of using internal ones)
  • SPARSE and DENSE encoding support
  • Bit-packing of DENSE registers for better compression. Serialized hyperloglog size with bitpacking is ~10KB for millions of distinct items, ~12K for few billion distinct items. When bit-packing is disabled the serialized size is ~16KB.
  • Delta encoding and varints for SPARSE registers. Serialized hyperloglog size with sparse representation is from as low as 10s of bytes (boolean column) and above.
  • Bias correction using lookup table for better accuracy
  • Command line tool (hll)
  • Configurable options to enable/disable the above features

Installation

git clone https://github.com/prasanthj/hyperloglog.git hyperloglog
cd hyperloglog
mvn package -DskipTests

hll - Command Line Tool

After running mvn package -DskipTests, run hll to display the usage options

Example usage: hll -n 1000 <OR> hll -f /tmp/input.txt <OR> hll -d -i /tmp/out.hll
usage: HyperLogLog
 -b,--enable-bitpacking <arg>   enable bit-packing of registers. default =
                                true
 -c,--no-bias <arg>             use bias correction table (no-bias
                                algorithm). default = true
 -d,--deserialize               deserialize hyperloglog from file. specify
                                -i for input file
 -e,--encoding <arg>            specify encoding to use (SPARSE or DENSE).
                                default = SPARSE
 -f,--file <arg>                specify file to read input data
 -i,--input-file <arg>          specify input file for deserialization
 -n,--num-random-values <arg>   number of random values to generate
 -o,--output-file <arg>         specify output file for serialization
 -p,--num-register-bits <arg>   number of bits from hashcode used as
                                register index between 4 and 16 (both
                                inclusive). default = 14
 -r,--relative-error            print relative error calculation
 -s,--serialize                 serialize hyperloglog to file. specify -o
                                for output file                                
 -t,--standard-in               read data from standard in
  

Examples

Test with 'n' random numbers

#./hll -r -n 20000
Actual count: 20000
Encoding: DENSE, p: 14, estimatedCardinality: 19993
Relative error: 0.034999847%

Test with input file

#./hll -r -f /etc/passwd
Actual count: 84
Encoding: SPARSE, p: 14, estimatedCardinality: 84
Relative error: 0.0%

Test serialization

#./hll -r -n 100000000 -s -o /tmp/out.hll
Actual count: 100000000
Encoding: DENSE, p: 14, estimatedCardinality: 100069607
Relative error: -0.069606304%
Serialized hyperloglog to /tmp/out.hll
Serialized size: 10248 bytes
Serialization time: 20 ms

./hll -r -f /etc/passwd -s -o /tmp/out.hll
Actual count: 84
Encoding: SPARSE, p: 14, estimatedCardinality: 84
Relative error: 0.0%
Serialized hyperloglog to /tmp/out.hll
Serialized size: 337 bytes
Serialization time: 5 ms

Test deserialization

#./hll -d -i /tmp/passwd.hll
Encoding: SPARSE, p: 14, estimatedCardinality: 84
Count after deserialization: 84
Deserialization time: 42 ms

Test disabling bit-packing of registers

#./hll -r -n 10000000 -b false -s -o /tmp/out.hll
Actual count: 10000000
Encoding: DENSE, p: 14, estimatedCardinality: 10052011
Relative error: -0.52011013%
Serialized hyperloglog to /tmp/out.hll
Serialized size: 16392 bytes
Serialization time: 27 ms

Test reading from standard in

#cat /etc/passwd | ./hll -r -t
Actual count: 84
Encoding: SPARSE, p: 14, estimatedCardinality: 84
Relative error: 0.0%

Issues

Bug fixes or improvements are welcome! Please fork the project and send pull request on github. Or report issues here https://github.com/prasanthj/hyperloglog/issues

License

Apache licensed.

References

[1] http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

[2] http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/

[3] http://research.neustar.biz/tag/flajolet-martin-sketch/

[4] http://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/

[5] http://antirez.com/news/75

hyperloglog's People

Contributors

babadopulos avatar prasanthj avatar t3rmin4t0r avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

hyperloglog's Issues

simple performance increase for HLLDenseRegister

I propose that we increase the performance (and at the same time decrease memory usage) of HLLDenseRegister by changing the implementation of the set() and getSumInversePow2() methods.

First, in HLLConstants, we add a:
public static final double[] inversePow2Data
array with 64 values

item 0 would be the constant 2^0
item 1 would be the constant 2^-1
item 2 would be the constant 2^-2
etc.

Then in HLLDenseRegister, we remove the member array:
private double[] invPow2Register;

In the set() method we remove:
invPow2Register[idx] = Math.pow(2, -value);

We change the implementation of getSumInversePow2() to:

  public double getSumInversePow2() {
    double sum = 0;
    for (byte b : register) {
      sum += HLLConstants.inversePow2Data[b];
    }
    return sum;
  }

The set method will be significantly faster for two reasons:

  • It will not have the expensive Math.pow() call for every set call.
  • There will only be random memory access to register, and not also to invPow2Register. With p=14, with just a register array, there will be random access to 16kiB of data, which fits in an Intel CPU's L1 cache. With the current code, one also needs the invPow2Register, and that is a total of 144kiB of data, which means random access to the L2 cache.

For the getSumInversePow2 method, the performance will probably be about the same, perhaps a bit faster, or perhaps slightly slower. The performance of the set() method is more important, because most applications will call set() orders-of-magnitude more times than the application will call HyperLogLog.count(). With the current getSumInversePow2() method, it's reading sequentially through an array that would fit in the L2 cache. With the proposed new code, it'd do a sequential read of the register array that would almost always be in the L1 cache, and a random read of the HLLConstants.inversePow2Data array that'd be in the L1 cache (it would be 512 bytes).

Installation issue

Installation fails with the following warning and error:
[WARNING] The POM for org.apache.hive:hive-standalone-metastore:jar:3.0.0-SNAPSHOT is missing, no dependency information available

[ERROR] Failed to execute goal on project hyperloglog: Could not resolve dependencies for project com.github.prasanthj:hyperloglog:jar:1.2-SNAPSHOT: Could not find artifact org.apache.hive:hive-standalone-metastore:jar:3.0.0-SNAPSHOT -> [Help 1]

I attach the detailed log as well. Please check.
Thanks in advance.
hll_installation_error.txt

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.