hive-hll-udf's Introduction

Approximate Queries for Hive

HyperLogLog and LinearCounting UDFs have been written before, so that part isn't new, but nobody seems to have built a proper workflow-composable HLL++ implementation for actual data storage in Hive tables.

When I was in Zynga's SEG, I used to constantly follow DAU charts as my go-to early warning tool.

The influx of new users would start to flatten out whenever Facebook had an API outage or someone enabled an experiment with bad code.

For similar funnel analysis and cohort tracking with fairly large error bars, most people need time series aggregations for distinct user information and rough intersections between user groupings.

And the lag between data ingestion and it showing up on a graph is a real problem for the people who hold the knobs for experiments.

This use-case is a far more complex work-flow than an approx_distinct implementation, but it is something @prasanth_j's HLL++ library can enable, if you write the right UDAF/UDF combinations.

If you maintained an RRD of daily and hourly values in HLL(uid) tables, then generating an up-to-the-hour approximate MAU calculation would involve no scans of any user data beyond the current hour.
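The reason stored sketches compose this way is that two HyperLogLog register arrays merge by taking the register-wise maximum, so merging per-period sketches gives exactly the sketch you would get from scanning all the raw rows. Here is a minimal sketch of that property in Python. It is a plain HyperLogLog, not the HLL++ library these UDFs wrap, and the register count, the SHA-1 hash and the function bodies are illustrative assumptions that only borrow the UDF names.

```python
import hashlib
import math

P = 10          # assumed precision: 2**10 = 1024 registers
M = 1 << P

def _hash64(value):
    # Stable 64-bit hash of the value's string form (illustrative choice).
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hll(values):
    """Build a register array from raw values (plays the role of hll(uid))."""
    regs = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                       # top P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)          # remaining 54 bits
        rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
        regs[idx] = max(regs[idx], rank)
    return regs

def hll_merge(sketches):
    """Register-wise max (plays the role of hll_merge / hll_union)."""
    merged = [0] * M
    for regs in sketches:
        merged = [max(a, b) for a, b in zip(merged, regs)]
    return merged

def hll_count(regs):
    """Raw HyperLogLog estimate with the linear-counting small-range fix."""
    alpha = 0.7213 / (1 + 1.079 / M)
    est = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if est <= 2.5 * M and zeros:
        est = M * math.log(M / zeros)
    return est

# Two stored "daily" sketches plus the current hour, with overlapping users.
stored = [hll(range(0, 6000)), hll(range(4000, 10000))]
current_hour = hll(range(9000, 12000))

merged = hll_merge(stored + [current_hour])
direct = hll(range(0, 12000))      # what a full scan of raw data would build
estimate = hll_count(merged)
```

Since `merged` and `direct` are identical register arrays, only the current hour ever needs to be hashed from raw rows; everything older is a cheap merge of stored sketches.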

You could start off simple with:

select approx_distinct(uid) from raw_data;

But for the full composability example, you can see why splitting up the components of approx_distinct was a good idea:

with days as (select hll_merge(uid_hll) as d_hll from days_rrd where day between ....),
hours as (select hll_merge(uid_hll) as h_hll from today_rrd where hour between ...)
  select hll_count(hll_union(days.d_hll, hours.h_hll, cur.c_hll)) as mau from
     days, hours,
     (select hll(uid) as c_hll from current_hour_raw) cur;

You can use these with GROUP BY ... WITH CUBE or ROLLUP as well, if you have multiple groupings in the report.

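You can do more set cardinality operations as well, but the error compounds every time you operate on the estimates. For approximate intersections you can do hll_count(a) + hll_count(b) - hll_count(hll_union(a,b)); the related expression 2*hll_count(hll_union(a,b)) - hll_count(a) - hll_count(b) gives the approximate symmetric difference (users in exactly one of the two groups).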
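To make the set arithmetic concrete, here is a small worked example in Python using exact set sizes in place of hll_count results. It shows which quantity each combination of counts yields under inclusion-exclusion, and why the error bar matters so much for small overlaps. The helper names and the 2% per-count relative error are hypothetical, picked only for illustration.

```python
def intersection_estimate(count_a, count_b, count_union):
    # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
    return count_a + count_b - count_union

def symmetric_difference_estimate(count_a, count_b, count_union):
    # |A Δ B| = 2|A ∪ B| - |A| - |B|
    return 2 * count_union - count_a - count_b

def worst_case_abs_error(count_a, count_b, count_union, rel_err):
    # If every count is off by up to rel_err, the absolute errors add up.
    return rel_err * (count_a + count_b + count_union)

# Exact sets stand in for the user groups behind two sketches.
a = set(range(0, 100_000))
b = set(range(95_000, 200_000))
n_a, n_b, n_union = len(a), len(b), len(a | b)

inter = intersection_estimate(n_a, n_b, n_union)
sym_diff = symmetric_difference_estimate(n_a, n_b, n_union)

# Hypothetical 2% relative error on each of the three counts.
err = worst_case_abs_error(n_a, n_b, n_union, 0.02)
```

Here the true overlap is 5,000 users, while the worst-case absolute error from three 2% estimates is about 8,100, so a small intersection between large groups can be swamped entirely by sketch error.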

The current functions are

  • CREATE TEMPORARY FUNCTION hll as 'org.notmysock.hive.UDAFHyperLogLog';
  • CREATE TEMPORARY FUNCTION hll_merge as 'org.notmysock.hive.UDAFHyperLogLogMerge';
  • CREATE TEMPORARY FUNCTION hll_count as 'org.notmysock.hive.UDFHyperLogLogValue';
  • CREATE TEMPORARY FUNCTION hll_debug as 'org.notmysock.hive.UDFHyperLogLogDebug';
  • CREATE TEMPORARY FUNCTION hll_union as 'org.notmysock.hive.UDFHyperLogLogUnion';
  • CREATE TEMPORARY FUNCTION approx_distinct as 'org.notmysock.hive.UDAFApproximateDistinct';

Alternatively, you could add these as permanent functions in your HiveServer2.

Contributors

t3rmin4t0r