Git Product home page Git Product logo

hyperminhash-java's Introduction

Build Status

HyperMinHash-java

A Java implementation of the HyperMinHash algorithm, presented by Yu and Weber. HyperMinHash allows approximating set unions, intersections, Jaccard Indices, and cardinalities of very large sets with high accuracy using only loglog space. It also supports streaming updates and merging sketches, just the same as HyperLogLog.

This repo implements two flavors of HyperMinHash:

  1. HyperMinHash: An implementation based on HyperLogLog with the addition of the bias correction seen in HyperLogLog++.
  2. BetaMinHash: An implementation which uses LogLog-Beta for the underlying LogLog implementation. Loglog-beta is almost identical in accuracy to HyperLogLog++, except it performs better on cardinality estimations for small datasets (n <= 80k), holding memory fixed. Since we use Loglog-Beta, we refer to our implementation as BetaMinHash. However, our implementation currently only supports a fixed precision p=14.

If you expect to be dealing with low cardinality datasets (<= 80,000 unique elements), we recommend using BetaMinHash as it has a smaller memory footprint and is more accurate than HyperLogLog in the range from 20,000-80,000, holding memory fixed. However, note that different sketch types are not interchangeable i.e: obtaining the intersection of an HMH and a BMH is not currently supported.

Both implementations are equipped with serialization/deserialization capabilities out of the box for sending sketches over the wire or persisting them to disk.

Usage

Importing via Maven

<dependency>
  <groupId>com.liveramp</groupId>
  <artifactId>hyperminhash</artifactId>
  <version>0.2</version>
</dependency>

Cardinality estimation

Set<byte[]> mySet = getMySet();
BetaMinHash sketch = new BetaMinHash();
for (byte[] element : mySet){
    sketch.add(element);
}

long estimatedCardinality = sketch.cardinality();

Merging (unioning) sketches

Collection<BetaMinHash> sketches = getSketches();
SketchCombiner<BetaMinHash> combiner = BetaMinHashCombiner.getInstance();
BetaMinHash combined = combiner.union(sketches);

// to get cardinality of the union
long unionCardinality = combined.cardinality();

// using HyperMinHash instead of BetaMinHash
Collection<HyperMinHash> sketches = getSketches();
SketchCombiner<HyperMinHash> combiner = HyperMinHashCombinre.getInstance();
HyperMinHash combined = combiner.union(sketches);

Cardinality of unions

BetaMinHash combined = combiner.union(sketches);
long estimatedCardinality = combined.cardinality();

Cardinality of intersection

Collection<BetaMinHash> sketches = getSketches();
SketchCombiner<BetaMinHash> combiner = BetaMinHashComber.getInstance();
long intersectionCardinality = combiner.intersectionCardinality(sketches);

Serializing a sketch

To get a byte[] representation of a sketch, use the IntersectionSketch.SerDe interface:

HyperMinHash sketch = getSketch();
HyperMinHashSerde serde = new HyperMinHashSerde();

byte[] serialized = serde.toBytes(sketch);
HyperMinHash deserialized = serde.fromBytes(serialized);

int sizeInBytes = serde.sizeInBytes(sketch);

Maintainers

Commit authorship was lost when merging code. The maintainers of the library, in alphabetical order, are:

  1. Christian Hansen (github.com/ChristianHansen)
  2. Harry Rackmil (github.com/harryrackmil)
  3. Shrif Nada (github.com/sherifnada)

Acknowledgements

Thanks to Seif Lotfy for implementing a Golang version of HyperMinHash. We use some of his tests in our library, and our BetaMinHash implementation references his implementation.

hyperminhash-java's People

Contributors

aqeelat avatar atlantis-github-bakohsh5 avatar brucesw avatar dependabot[bot] avatar joshk0 avatar naiqusun avatar ops-github-du4joawe avatar roshan avatar sherifnada avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

hyperminhash-java's Issues

intersectionCardinality giving the wrong result

I'm trying to experiment with the code in scala and wrote the following pieces as a test case

import com.liveramp.hyperminhash._
import java.util.ArrayList
val sketch1 = new BetaMinHash()
val sketch2 = new BetaMinHash()
val set1 = (1 to 10000).map(_.toByte).toArray
val set2 = (5000 to 15000).map(_.toByte).toArray
sketch1.offer(set1)
sketch2.offer(set2)
val sketches = new ArrayList[BetaMinHash]()
sketches.add(sketch1)
sketches.add(sketch2)
val combiner = BetaMinHashCombiner.getInstance();
val combined = combiner.intersectionCardinality(sketches);

combined gives me 0 while the actual result should be something around 5000. Am I doing something wrong here?

Question about similarity of multiple sketches

First of all, thanks a lot for your useful, nice and interesting Java implementation of HyperMinHash.

My question is about this: https://github.com/LiveRamp/HyperMinHash-java/blob/master/src/main/java/com/liveramp/hyperminhash/BetaMinHashCombiner.java#L54

As I read arXiv:1710.08436, while mergeability is trivial, I think it's not trivial that Jaccard Index estimation for multiple (> 2) sets works properly.

  1. Does the estimation still have same accuracy as of 2-set Jaccard Index ?
  2. If so, is there any proof ?

Sorry for obscure question. Thanks again.

HyperMinHash Jaccard similarity calculation

The HyperMinHash Jaccard similarity calculation appears to only compare the "mantissa" portion of the register to increase c, rather than the whole register. The algorithm 2.1.4 in the paper seems to compare the full tuple to increase c. Is there a reason why only the mantissa is compared in your implementation?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.