zetasketch's Introduction

ZetaSketch

ZetaSketch is a collection of libraries for single-pass, distributed, approximate aggregation and sketching algorithms.

These algorithms estimate statistics that are often too expensive to compute exactly.

The estimates use far less memory than exact calculations. For example, the HyperLogLog++ algorithm can estimate daily active users without storing every distinct user ID.

ZetaSketch currently includes libraries to implement the following algorithms:

Algorithm     | Statistics                              | Libraries
HyperLogLog++ | Estimates the number of distinct values | Java

What is a sketch?

ZetaSketch libraries calculate statistics from sketches. A sketch is a compact summary of a large data stream. You can query a sketch to estimate a statistic of the original data, or merge sketches to summarize multiple data streams.

After choosing an algorithm, you can use its corresponding libraries to:

  • Create sketches
  • Add new data to existing sketches
  • Merge multiple sketches
  • Extract statistics from sketches

HyperLogLog++

The HyperLogLog++ (HLL++) algorithm estimates the number of distinct values in a data stream. HLL++ is based on HyperLogLog and estimates the number of distinct values more accurately for both very small and very large data streams.
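
To make the four sketch operations concrete, here is a toy version of plain HyperLogLog written for illustration. It is not HLL++ and not ZetaSketch's code: the class name ToyHyperLogLog, the SplitMix64-style stand-in hash, and the restriction to long values are all assumptions of this sketch.

```java
/** Toy HyperLogLog (plain HLL, not HLL++); illustrative only, not ZetaSketch code. */
final class ToyHyperLogLog {
    private final int precision;    // p: number of hash bits used to pick a register
    private final byte[] registers; // m = 2^p registers, each storing a maximum rank

    /** Creates an empty sketch. */
    ToyHyperLogLog(int precision) {
        if (precision < 7 || precision > 16) {
            throw new IllegalArgumentException("precision must be in [7, 16]");
        }
        this.precision = precision;
        this.registers = new byte[1 << precision];
    }

    /** Stand-in 64-bit hash (SplitMix64 mixing function); an assumption of this sketch. */
    private static long hash(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Adds one value: the top p hash bits pick a register, the remaining bits give the rank. */
    void add(long value) {
        long h = hash(value);
        int index = (int) (h >>> (64 - precision));
        int rank = Math.min(Long.numberOfLeadingZeros(h << precision), 64 - precision) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    /** Merges another sketch of the same precision by taking register-wise maxima. */
    void merge(ToyHyperLogLog other) {
        if (other.precision != precision) {
            throw new IllegalArgumentException("precision mismatch");
        }
        for (int i = 0; i < registers.length; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }

    /** Extracts the distinct-count estimate. */
    long estimate() {
        int m = registers.length;
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / m); // bias constant, valid for m >= 128
        double raw = alpha * m * m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            // Small-range correction: fall back to linear counting on empty registers.
            return Math.round(m * Math.log((double) m / zeros));
        }
        return Math.round(raw);
    }

    public static void main(String[] args) {
        ToyHyperLogLog monday = new ToyHyperLogLog(14);
        ToyHyperLogLog tuesday = new ToyHyperLogLog(14);
        for (long user = 0; user < 60_000; user++) monday.add(user);
        for (long user = 40_000; user < 100_000; user++) tuesday.add(user);
        monday.merge(tuesday); // now summarizes both days: 100,000 distinct users
        System.out.println("estimated distinct users: " + monday.estimate());
    }
}
```

Because merging only takes register-wise maxima, merging two sketches gives exactly the same registers as sketching the combined stream, which is what makes this kind of summary suitable for distributed aggregation.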

Creating a sketch

import com.google.zetasketch.HyperLogLogPlusPlus;

// Create a sketch for estimating the number of unique strings in a data stream.
// You can also create sketches for estimating the number of unique byte
// sequences, integers, and longs.
HyperLogLogPlusPlus<String> hll = new HyperLogLogPlusPlus.Builder().buildForStrings();

// You can also set a custom precision. The default normal and sparse precisions
// are 15 and 20, respectively.
HyperLogLogPlusPlus<String> hllCustomPrecision = new HyperLogLogPlusPlus.Builder()
    .normalPrecision(13).sparsePrecision(19).buildForStrings();
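
As a rough guide to what the normal precision controls, the following sketch computes the register count and the textbook relative standard error of plain HyperLogLog, 1.04 / sqrt(m) with m = 2^p. The class and method names are hypothetical, and the figures describe the classic HyperLogLog analysis rather than a guarantee made by ZetaSketch.

```java
/** Back-of-the-envelope arithmetic for HyperLogLog precision; names are hypothetical. */
final class HllPrecisionMath {
    /** Number of registers for precision p: m = 2^p. */
    static int registers(int precision) {
        return 1 << precision;
    }

    /** Approximate relative standard error of plain HyperLogLog: 1.04 / sqrt(m). */
    static double standardError(int precision) {
        return 1.04 / Math.sqrt(registers(precision));
    }

    public static void main(String[] args) {
        for (int p : new int[] {13, 15}) {
            System.out.printf("p=%d registers=%d stderr=%.4f%n",
                p, registers(p), standardError(p));
        }
    }
}
```

Under these assumptions, lowering the normal precision from 15 to 13 cuts the register count by a factor of four and roughly doubles the expected relative error (from about 0.57% to about 1.15%).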

Adding new data to a sketch

// Add three strings to the `hll` sketch. You must first initialize an empty
// sketch and then add data to it.
hll.add("apple");
hll.add("orange");
hll.add("banana");

Merging sketches

// Merge `hll2` and `hll3` with `hll`. The sketches must have the same
// original data type and precision.
hll.merge(hll2);
hll.merge(hll3);

Extracting cardinality estimates

// Return the estimate of the number of distinct values.
long result = hll.result();

How to use ZetaSketch

Please find the instructions for your build tool on the right side of https://search.maven.org/artifact/com.google.zetasketch/zetasketch

How to build ZetaSketch

ZetaSketch uses Gradle as its build system. To build the project, simply run:

./gradlew build

License

Apache License 2.0

Contributing

We are not currently accepting contributions to this project. Please feel free to file bugs and feature requests using GitHub's issue tracker.

Disclaimer

This is not an officially supported Google product.

zetasketch's People

Contributors

phstudy, zfraa

zetasketch's Issues

Maven package

Are you planning on releasing an official Maven package/artifact?

[question] Is this optimization valid?

Is this optimization valid? If I understand correctly, Java arrays are objects, and therefore wouldn't be boxed.

/**
 * Adds {@code value} to the aggregator. This provides a performance optimization over {@link
 * #add(Object)} to avoid unnecessary boxing.
 *
 * @see #add(Object)
 */
public void add(byte[] value) throws IllegalArgumentException {
  checkAndSetType(Type.BYTES);
  addHash(Hash.of(value));
}
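
On the Java-semantics part of the question, a small standalone check can confirm that arrays are reference types and reach an Object parameter unchanged, while a primitive long is autoboxed. The class BoxingDemo and its methods are hypothetical and say nothing about why the library defines this overload.

```java
/** Shows which arguments are boxed when passed to an Object parameter; names are hypothetical. */
final class BoxingDemo {
    /** Accepts any reference type; a primitive argument must be boxed to get here. */
    static String describe(Object value) {
        return value.getClass().getSimpleName();
    }

    /** Returns true if the array reaches an Object variable as the very same reference. */
    static boolean sameReference(byte[] bytes) {
        Object asObject = bytes; // plain widening reference conversion, no wrapper object
        return asObject == bytes;
    }

    public static void main(String[] args) {
        System.out.println(describe(42L));             // long is autoboxed to java.lang.Long
        System.out.println(describe(new byte[] {1}));  // byte[] is passed as-is
        System.out.println(sameReference(new byte[] {1}));
    }
}
```

So a byte[] overload cannot save a boxing allocation by itself; any benefit would have to come from elsewhere, for example from skipping a runtime type dispatch inside add(Object), which is a guess rather than something the source confirms.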

HyperLogLogPlusPlus fails on serializeToByteArray

In Scio, we see the following error with version 0.1.0. Unfortunately, it is unclear how to reproduce it, and I can't provide the input data. However, the stack trace suggests that something is wrong with the management of array slices in com.google.zetasketch.internal.GrowingByteSlice:

16:04:53.990  [info]   java.lang.IllegalArgumentException: 19 > 5
16:04:53.990  [info]   at Due to Exception while trying to `encode` an instance of com.spotify.data.dm.sources.internal.AudienceEngagementMetrics: Can't encode field listeners value com.spotify.scio.extra.hll.zetasketch.ZetaSketchHll@151901cb.(:0)
16:04:53.990  [info]   at java.base/java.util.Arrays.copyOfRange(Arrays.java:4029)
16:04:53.990  [info]   at com.google.zetasketch.internal.GrowingByteSlice.maybeExtendLimit(GrowingByteSlice.java:219)
16:04:53.990  [info]   at com.google.zetasketch.internal.GrowingByteSlice.putNextVarInt(GrowingByteSlice.java:191)
16:04:53.990  [info]   at com.google.zetasketch.internal.GrowingByteSlice.putNextVarInt(GrowingByteSlice.java:30)
16:04:53.990  [info]   at com.google.zetasketch.internal.DifferenceEncoder.putInt(DifferenceEncoder.java:54)
16:04:53.990  [info]   at com.google.zetasketch.internal.hllplus.SparseRepresentation.set(SparseRepresentation.java:424)
16:04:53.990  [info]   at com.google.zetasketch.internal.hllplus.SparseRepresentation.flushBuffer(SparseRepresentation.java:348)
16:04:53.990  [info]   at com.google.zetasketch.internal.hllplus.SparseRepresentation.compact(SparseRepresentation.java:243)
16:04:53.990  [info]   at com.google.zetasketch.HyperLogLogPlusPlus.serializeToByteArray(HyperLogLogPlusPlus.java:298)
16:04:53.990  [info]   at com.spotify.scio.extra.hll.zetasketch.ZetaSketchHll$$anonfun$coder$2.apply(ZetaSketchHLL.scala:121)
16:04:53.990  [info]   at com.spotify.scio.extra.hll.zetasketch.ZetaSketchHll$$anonfun$coder$2.apply(ZetaSketchHLL.scala:121)

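So it fails on array = Arrays.copyOfRange(array, arrayOffset(), Math.max(growthCapacity, limit)); Arrays.copyOfRange reports "from > to" in this exception, so arrayOffset() was 19 while Math.max(growthCapacity, limit) was only 5. In other words, the start of the slice lies beyond the computed end of the range.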
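
The exception message can be reproduced in isolation with the JDK alone. The following sketch uses a hypothetical helper class and a 32-byte array chosen only for illustration; it does not touch GrowingByteSlice.

```java
import java.util.Arrays;

/** Reproduces the "from > to" message of Arrays.copyOfRange; names are hypothetical. */
final class CopyOfRangeDemo {
    /** Returns the exception message for the given range, or null if the copy succeeds. */
    static String failureMessage(int from, int to) {
        byte[] array = new byte[32];
        try {
            Arrays.copyOfRange(array, from, to);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Same shape as the failing call: the start index exceeds the end index.
        System.out.println(failureMessage(19, 5));
    }
}
```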

Expose HyperLogLogPlusPlus#add(byte[], int offset, int size)

Some callers can have binary representations that look like: byte[], int offset, int size.

The only current access pattern for bytes is HyperLogLogPlusPlus#add(byte[] bytes), which forces such callers to copy bytes. This copy can be avoided because HashFunction exposes HashFunction#hashBytes(byte[] input, int off, int len).
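
The cost being described can be illustrated with the JDK alone. In this sketch java.util.zip.CRC32 stands in for the library's hash function, purely because it also accepts (byte[], offset, length); the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

/** Compares hashing a copied slice with hashing the slice in place; CRC32 is a stand-in hash. */
final class SliceHashDemo {
    /** What callers must do today: copy the slice, then hash the copy. */
    static long hashWithCopy(byte[] buffer, int offset, int size) {
        byte[] copy = Arrays.copyOfRange(buffer, offset, offset + size);
        CRC32 crc = new CRC32();
        crc.update(copy, 0, copy.length);
        return crc.getValue();
    }

    /** What an (offset, size) overload would allow: hash the slice without copying. */
    static long hashSlice(byte[] buffer, int offset, int size) {
        CRC32 crc = new CRC32();
        crc.update(buffer, offset, size);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] buffer = {9, 9, 1, 2, 3, 4, 5, 9, 9};
        System.out.println(hashWithCopy(buffer, 2, 5) == hashSlice(buffer, 2, 5));
    }
}
```

Both paths hash the same bytes, so they produce the same value; the in-place variant simply avoids allocating and filling a temporary array for every added element.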

HyperLogLogPlusPlus should be serializable

To allow its use in systems like Apache Beam and Apache Flink, the HyperLogLogPlusPlus class should be serializable.
I ran into this while trying to implement a UDF for Apache Flink, where serializability is mandatory.

Also, putting an instance of the HyperLogLogPlusPlus class through a simple Beam pipeline resulted in:

Caused by: java.io.NotSerializableException: com.google.zetasketch.HyperLogLogPlusPlus
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
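
Until the class itself is serializable, one possible workaround is a Serializable holder that writes the sketch's byte form in custom writeObject/readObject hooks. The sketch below shows only the pattern: FakeSketch is a hypothetical stand-in so the example runs with the JDK alone, and its fromBytes method is an assumption standing in for whatever deserialization entry point the library offers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a sketch class that is not Serializable but has a byte form. */
final class FakeSketch {
    private long count;

    void add() { count++; }

    long result() { return count; }

    byte[] serializeToByteArray() {
        return ByteBuffer.allocate(Long.BYTES).putLong(count).array();
    }

    static FakeSketch fromBytes(byte[] bytes) { // hypothetical deserialization entry point
        FakeSketch sketch = new FakeSketch();
        sketch.count = ByteBuffer.wrap(bytes).getLong();
        return sketch;
    }
}

/** Serializable holder that stores the wrapped sketch as a length-prefixed byte array. */
final class SerializableSketchHolder implements Serializable {
    private static final long serialVersionUID = 1L;
    private transient FakeSketch sketch; // transient: skipped by default serialization

    SerializableSketchHolder(FakeSketch sketch) { this.sketch = sketch; }

    FakeSketch get() { return sketch; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        byte[] bytes = sketch.serializeToByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        sketch = FakeSketch.fromBytes(bytes);
    }

    /** Serializes and deserializes the holder in memory, as a framework would between workers. */
    static SerializableSketchHolder roundTrip(SerializableSketchHolder holder) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (SerializableSketchHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The holder keeps Java serialization details out of the sketch class, at the price of one extra byte-array copy per serialization.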

Tests in the source tree

There are some test classes located under the main Java source tree.
This is then 'fixed' in build.gradle as follows:

// Customize source code layout
sourceSets {
    main {
        java {
            srcDirs = ['java']
            exclude 'java/com/google/zetasketch/testing/**'
        }
        proto {
            srcDirs = ['proto']
        }
    }
    test {
        java {
            srcDirs = ['java/com/google/zetasketch/testing', 'javatests']
        }
    }
}

A consequence of this unusual construct is that IntelliJ gets confused: it no longer automatically recognizes the main sources as sources, and instead treats a directory one level deeper as test sources.

[Image: Screenshot from 2019-07-11 14-23-55]

Why not simply move the test classes to the place where all test classes live?
