
skizze's Introduction



Skizze ([ˈskɪt͡sə]: German for sketch) is a sketch data store for counting and sketching problems, built on probabilistic data structures.

Unlike a key-value store, Skizze does not store values. Instead it appends values to defined sketches, which lets it answer frequency and cardinality queries in near O(1) time with a minimal memory footprint.

Current status ==> pre-Alpha

Motivation

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes that are poorly suited to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, the raw data can be summarized efficiently, for example on a daily basis or with simple in-stream counters. Computing more advanced metrics, like the number of unique visitors or the most frequent items, is more challenging and requires a lot of resources if implemented straightforwardly.

Skizze is a (fire and forget) service that stores probabilistic data structures (sketches). Sketches allow estimation of these and many other metrics, trading precision of the estimates against memory consumption. They can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact (sometimes astonishingly compact) replacement for raw data in stream-based computing.

Example use cases (queries)

  • How many distinct elements are in the data set (i.e. what is the cardinality of the data set)?
  • What are the most frequent elements (the terms “heavy hitters” and “top-k elements” are also used)?
  • What are the frequencies of the most frequent elements?
  • How many elements belong to the specified range (range query, in SQL it looks like SELECT count(v) WHERE v >= c1 AND v < c2)?
  • Does the data set contain a particular element (membership query)?
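To make these query types concrete, here is a minimal sketch that answers each of them exactly on a tiny in-memory stream. It is a baseline for illustration, not how Skizze works: a sketch data store approximates these answers without keeping every value. The variable names and the sample numeric values are assumptions made up for this example.

```python
from collections import Counter

# Toy stream; the string values mirror the CLI examples further down.
stream = ["zod", "joker", "grod", "zod", "zod", "grod"]
counts = Counter(stream)

# Cardinality: number of distinct elements.
cardinality = len(counts)

# Heavy hitters / top-k elements together with their frequencies.
top_2 = counts.most_common(2)

# Membership: has this element been seen at all?
contains_batman = "batman" in counts

# Range query over numeric values: count v with c1 <= v < c2.
values = [3, 7, 7, 10, 15, 21]  # hypothetical numeric stream
c1, c2 = 7, 15
in_range = sum(1 for v in values if c1 <= v < c2)

print(cardinality, top_2, contains_batman, in_range)
```

The exact approach needs memory proportional to the number of distinct values, which is the cost that sketches are meant to avoid.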

How to build and install

make dist
./bin/skizze

Example usage:

./bin/skizze-cli

Create a new Domain (Collection of Sketches):

# CREATE DOM $name $estCardinality $topk
CREATE DOM demostream 10000000 100

Add values to the domain:

# ADD DOM $name $value1 $value2 ...
ADD DOM demostream zod joker grod zod zod grod

Get the cardinality of the domain:

# GET CARD $name
GET CARD demostream

# returns:
# Cardinality: 3
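This README does not say which algorithm backs a CARD sketch. As an illustration of how a cardinality sketch can work in general, the following is a small K-Minimum-Values (KMV) estimator, chosen here for brevity. The class name, the parameter `k`, and the use of SHA-1 as the hash are assumptions for this example, not Skizze's API.

```python
import hashlib
import heapq


class KMVSketch:
    """Estimate distinct counts from the k smallest hash values seen."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, to ignore duplicates

    @staticmethod
    def _hash(value):
        # Map the value to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, value):
        h = self._hash(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes: count them
        kth_smallest = -self._heap[0]
        return int((self.k - 1) / kth_smallest)


sketch = KMVSketch(k=256)
for value in "zod joker grod zod zod grod".split():
    sketch.add(value)
print("Cardinality:", sketch.estimate())
```

Memory stays bounded by `k` regardless of stream size, and a larger `k` gives a tighter estimate (the relative error shrinks roughly as 1/sqrt(k)).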

Get the rankings of the domain:

# GET RANK $name
GET RANK demostream

# returns:
# Rank: 1	  Value: zod	  Hits: 3
# Rank: 2	  Value: grod	  Hits: 2
# Rank: 3	  Value: joker	  Hits: 1
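One common way to keep a bounded top-k ranking over a stream is the Space-Saving algorithm. The sketch below is an illustration under that assumption. The README does not state which algorithm Skizze uses for RANK, and the class and method names here are hypothetical.

```python
class SpaceSaving:
    """Track approximate top-k counts using at most k counters."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, value):
        if value in self.counts:
            self.counts[value] += 1
        elif len(self.counts) < self.k:
            self.counts[value] = 1
        else:
            # Evict the smallest counter; the newcomer inherits its count plus
            # one, so counts may overestimate but never underestimate.
            victim = min(self.counts, key=self.counts.get)
            self.counts[value] = self.counts.pop(victim) + 1

    def top(self, n=None):
        # Highest count first; ties broken alphabetically for stable output.
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]


ranking = SpaceSaving(k=100)
for value in "zod joker grod zod zod grod".split():
    ranking.add(value)
for rank, (value, hits) in enumerate(ranking.top(), start=1):
    print(f"Rank: {rank}\t  Value: {value}\t  Hits: {hits}")
```

While the number of distinct values stays below `k`, the counts are exact. Beyond that, rarely seen values are evicted and the reported hits become upper bounds.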

Get the frequencies of values in the domain:

# GET FREQ $name $value1 $value2 ...
GET FREQ demostream zod joker batman grod

# returns
# Value: zod	  Hits: 3
# Value: joker	  Hits: 1
# Value: batman	  Hits: 0
# Value: grod	  Hits: 2
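Frequency queries like this are often served by a Count-Min Sketch. The following is a minimal illustration of that technique, not a description of Skizze's internals, which this README does not specify. The `width` and `depth` defaults and the salted SHA-256 hashing are assumptions picked for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, value):
        # One column per row, derived from a hash salted with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{value}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, value, count=1):
        for row, col in self._cells(value):
            self.table[row][col] += count

    def estimate(self, value):
        # Collisions only inflate cells, so the minimum is the best estimate.
        return min(self.table[row][col] for row, col in self._cells(value))


cms = CountMinSketch()
for value in "zod joker grod zod zod grod".split():
    cms.add(value)
for value in ["zod", "joker", "batman", "grod"]:
    print(f"Value: {value}\t  Hits: {cms.estimate(value)}")
```

Memory is fixed at `width * depth` counters. A wider table lowers the chance of overestimates, and more rows lowers the chance that every row collides at once.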

Get the membership of values in the domain:

# GET MEMB $name $value1 $value2 ...
GET MEMB demostream zod joker batman grod

# returns
# Value: zod	  Member: true
# Value: joker	  Member: true
# Value: batman	  Member: false
# Value: grod	  Member: true
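Membership queries of this kind are typically answered with a Bloom filter, which can report false positives but never false negatives. The sketch below illustrates the idea under that assumption; the README does not state what backs MEMB sketches. The names `expected_items` and `false_positive_rate` are hypothetical parameters for this example.

```python
import hashlib
import math


class BloomFilter:
    """Set membership with a tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        n, p = expected_items, false_positive_rate
        self.size = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.hashes = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive all bit positions from two base hashes.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hashes):
            yield (h1 + i * h2) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


members = BloomFilter(expected_items=1000)
for value in "zod joker grod zod zod grod".split():
    members.add(value)
for value in ["zod", "joker", "batman", "grod"]:
    print(f"Value: {value}\t  Member: {str(value in members).lower()}")
```

The filter's size is fixed when it is created, which is why an up-front estimate of how many items to expect matters for this kind of structure.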

List all available sketches (created by domains):

LIST

# returns
# Name: demostream  Type: CARD
# Name: demostream  Type: FREQ
# Name: demostream  Type: MEMB
# Name: demostream  Type: RANK

Create a new sketch of type $type (CARD, MEMB, FREQ or RANK):

# CREATE $type $name
CREATE CARD demosketch

Add values to the sketch of type $type (CARD, MEMB, FREQ or RANK):

# ADD $type $name $value1 $value2 ...
ADD CARD demosketch zod joker grod zod zod grod

In Progress:

More REPL

  • Redesign data-structures main interface
  • Add new domains model
  • Add snapshotting
  • Add AOF
  • Add gRPC API
  • Add REPL
    • DELETE DOM $name # delete domain and all its sketches
    • DELETE $type $name # delete a sketch of $type CARD, MEMB, FREQ, RANK and $name
    • LIST DOM # list all domains
    • SAVE # Explicitly save state of all domains and sketches
  • New Docs
  • Clean up

License

Skizze is available under the Apache License, Version 2.0.

Authors

Contributors: gitter-badger, martinpinto, mbarkhau, njpatel, seiflotfy
