Introduction

A data cube is an abstraction for counting things in complicated ways (Wikipedia). This project is a Java implementation of a data cube backed by a pluggable database backend.

The purpose of a data cube is to store aggregate information about large numbers of data points. The data cube stores aggregate information about interesting subsets of the input data points. For example, if you're writing a web server log analyzer, your input points could be log lines, and you might be interested in keeping a count for each browser type, each browser version, OS type, OS version, and other attributes. You might also be interested in counts for a particular combination of (browserType,browserVersion,osType), (browserType,browserVersion,osType,osVersion), etc. It's a challenge to quickly add and change counters without wasting time writing database code and reprocessing old data into new counters. A data cube helps you keep these counts. You declare what you want to count, and the data cube maintains all the counters as you supply new data points.

A bit more mathily, if your input data points have N attributes, then the number of counters you may have to store is, in the worst case, the product of the cardinalities of all N attributes. For example, 100 browser versions, 10 OS types, and 50 countries could mean up to 100 × 10 × 50 = 50,000 counters. The goal of the datacube project is to help you maintain these counters in a simple declarative way, without nested switch statements or other unpleasantness.

Urban Airship uses the datacube project to support its analytics stack for mobile apps, handling roughly 10K events per second per node.

Requires JDK 1.6.

Features

  • Performance: high-speed asynchronous batching IO backend
  • Bulk loading with Hadoop MapReduce
  • Pluggable database interface

IO

Each input data point may affect multiple counts in the data cube. For example, if you're counting events with a timestamp, a single event may increment the count for its hour, day, month, and year, ending up with four increments that must be applied to the database. Updating the database for each of these increments wouldn't scale to thousands of events per second, so we use the standard trick of batching counter updates in the client. When an input data point is given to the data cube, it updates a batch in memory for each of the affected counters. Periodically the batches are flushed to the backing database. If a single counter is incremented multiple times in the same batch, the increments are combined into a single database update.
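
As a rough sketch of the batching idea (not the library's actual implementation, which is asynchronous and tunable), combining increments client-side looks roughly like this:

import java.util.HashMap;
import java.util.Map;

// Sketch only: combine repeated increments to the same counter in memory,
// then apply each combined count as a single database update at flush time.
class CounterBatch {
    private final Map<String, Long> pending = new HashMap<String, Long>();

    void increment(String counterKey, long amount) {
        Long existing = pending.get(counterKey);
        pending.put(counterKey, existing == null ? amount : existing + amount);
    }

    void flush(Map<String, Long> database) {
        for (Map.Entry<String, Long> entry : pending.entrySet()) {
            Long old = database.get(entry.getKey());
            database.put(entry.getKey(),
                    old == null ? entry.getValue() : old + entry.getValue());
        }
        pending.clear();
    }
}

In datacube, DataCubeIo plays this role, with asynchronous flushing and tunable batching.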

TODO parameters to tune, implementation details to explain parameters

Bulk loading / backfilling

Database backend

A data cube can be backed by any database that supports a key-value interface and allows iterating over keys. To add support for a new database backend, implement the DbHarness interface and optionally the IdService interface. See HBaseDbHarness.java and HBaseIdService.java for examples. If you add support for a new database, we'd love to have you contribute your work back into the datacube project.

Currently HBase is the only supported backing database.
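
To illustrate the minimal key-value contract such a backend needs (this standalone sketch is not the DbHarness interface itself; see DbHarness.java for the actual methods to implement):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Standalone sketch of the capabilities a backend database must offer:
// get/put by key, plus key iteration, which datacube requires.
// A real datacube backend exposes these through the DbHarness interface.
class SketchKeyValueStore {
    private final ConcurrentMap<String, byte[]> store =
            new ConcurrentHashMap<String, byte[]>();

    void put(String key, byte[] serializedOp) {
        store.put(key, serializedOp);
    }

    byte[] get(String key) {
        return store.get(key);
    }

    Iterable<String> keys() {
        return store.keySet();
    }
}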

Example

// Imports for this example: Guava, Joda-Time, JUnit, and the datacube classes.
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.junit.Assert;

import com.google.common.base.Optional;
import com.google.common.collect.ImmutableList;

import com.urbanairship.datacube.*;
import com.urbanairship.datacube.DbHarness.CommitType;
import com.urbanairship.datacube.bucketers.*;
import com.urbanairship.datacube.dbharnesses.MapDbHarness;
import com.urbanairship.datacube.idservices.*;
import com.urbanairship.datacube.ops.LongOp;

// An in-memory IdService and DbHarness; useful for trying things out
// without a real database.
IdService idService = new CachingIdService(5, new MapIdService());
ConcurrentMap<BoxedByteArray, byte[]> backingMap =
        new ConcurrentHashMap<BoxedByteArray, byte[]>();

DbHarness<LongOp> dbHarness = new MapDbHarness<LongOp>(backingMap, LongOp.DESERIALIZER,
        CommitType.READ_COMBINE_CAS, idService);

// The time bucketer maps each DateTime to hour, day, and month buckets.
HourDayMonthBucketer hourDayMonthBucketer = new HourDayMonthBucketer();

// Two dimensions: time uses 8 bytes per coordinate; zipcode uses the IdService
// (the "true" flag) to substitute 5-byte IDs for its string coordinates.
Dimension<DateTime> time = new Dimension<DateTime>("time", hourDayMonthBucketer, false, 8);
Dimension<String> zipcode = new Dimension<String>("zipcode", new StringToBytesBucketer(),
        true, 5);
        
// One Rollup per kind of counter: (hour, zipcode), (day, zipcode), hour alone,
// and day alone.
Rollup hourAndZipRollup = new Rollup(zipcode, time, HourDayMonthBucketer.hours);
Rollup dayAndZipRollup = new Rollup(zipcode, time, HourDayMonthBucketer.days);
Rollup hourRollup = new Rollup(time, HourDayMonthBucketer.hours);
Rollup dayRollup = new Rollup(time, HourDayMonthBucketer.days);
        
List<Dimension<?>> dimensions = ImmutableList.<Dimension<?>>of(time, zipcode);
List<Rollup> rollups = ImmutableList.of(hourAndZipRollup, dayAndZipRollup, hourRollup,
        dayRollup);

DataCube<LongOp> cube = new DataCube<LongOp>(dimensions, rollups);

// Batch size 1 with SyncLevel.FULL_SYNC: every write is flushed immediately.
DataCubeIo<LongOp> cubeIo = new DataCubeIo<LongOp>(cube, dbHarness, 1, Long.MAX_VALUE,
        SyncLevel.FULL_SYNC);
        
DateTime now = new DateTime(DateTimeZone.UTC);
        
// Do an increment of 5 for a certain time and zipcode
cubeIo.writeSync(new LongOp(5), new WriteBuilder(cube)
        .at(time, now)
        .at(zipcode, "97201"));
        
// Do an increment of 10 for the same zipcode in a different hour of the same day
DateTime differentHour = now.withHourOfDay((now.getHourOfDay()+1)%24);
cubeIo.writeSync(new LongOp(10), new WriteBuilder(cube)
        .at(time, differentHour)
        .at(zipcode, "97201"));

// Read back the value that we wrote for the current hour, should be 5 
Optional<LongOp> thisHourCount = cubeIo.get(new ReadBuilder(cube)
        .at(time, HourDayMonthBucketer.hours, now)
        .at(zipcode, "97201"));
Assert.assertTrue(thisHourCount.isPresent());
Assert.assertEquals(5L, thisHourCount.get().getLong());
        
// Read back the value we wrote for the other hour, should be 10
Optional<LongOp> differentHourCount = cubeIo.get(new ReadBuilder(cube)
        .at(time, HourDayMonthBucketer.hours, differentHour)
        .at(zipcode, "97201"));
Assert.assertTrue(differentHourCount.isPresent());
Assert.assertEquals(10L, differentHourCount.get().getLong());

// The total for today should be the sum of the two increments
Optional<LongOp> todayCount = cubeIo.get(new ReadBuilder(cube)
        .at(time, HourDayMonthBucketer.days, now)
        .at(zipcode, "97201"));
Assert.assertTrue(todayCount.isPresent());
Assert.assertEquals(15L, todayCount.get().getLong());

Quickstart

Add datacube to your Maven build. (TODO upload to a public repo)

Figure out your dimensions. These are the attributes of your incoming data points. Some examples of dimensions are time, latitude, and browser version. Create one Dimension object for each dimension, and use these dimensions to instantiate a DataCube.

You can skip using an IdService for now. This is an optional optimization for dimensions that have long coordinates with low cardinality. For example, if you have a "country" dimension, the country name might be dozens of characters long, but there are only a few bytes of entropy. You could assign integers to countries and only use a few bytes to represent a country coordinate.
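
For illustration, reusing the Dimension constructor from the example above, a hypothetical country dimension could turn on ID substitution (the boolean argument) and store each country as a 2-byte ID:

// Hypothetical "country" dimension: "true" enables IdService substitution, so
// country-name coordinates are stored as compact 2-byte IDs.
Dimension<String> country = new Dimension<String>("country", new StringToBytesBucketer(),
        true, 2);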

Create one Rollup object for each kind of counter you want to keep. For example, if you want to keep a counter of web hits by (time,browser), this would be one Rollup object.
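
For instance, with the time dimension from the example above and a hypothetical browser dimension:

// One Rollup is one kind of counter: web hits bucketed by (day, browser).
Rollup dayAndBrowserRollup = new Rollup(browser, time, HourDayMonthBucketer.days);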

Create a DbHarness object that will handle writing to the database. Currently, only HBaseDbHarness exists for real databases; the MapDbHarness used in the example above is an in-memory harness for testing.

Create a DataCubeIo object, passing your DataCube object and your DbHarness.

Insert data points into your cube by passing them to DataCubeIo.writeSync().

Read back your rollup values by calling DataCubeIo.get().

Building

The POM is configured to build with specific versions of HBase and Hadoop. If your versions differ from those in the POM, you can override them by passing hbaseVersion and hadoopVersion. For example:

$ mvn -DhbaseVersion=0.90.6 -DhadoopVersion=0.20.2 package

The build artifact jars each have a classifier of the form hbase${hbaseVersion}-hadoop${hadoopVersion}, so you can depend on them in another project by doing something like:

<dependency>
  <groupId>com.urbanairship</groupId>
  <artifactId>datacube</artifactId>
  <version>${datacube.version}</version>
  <classifier>hbase0.94.0-hadoop1.0.3</classifier>
</dependency>

The main build artifact jar (without a classifier) uses the default HBase and Hadoop versions, which may change between datacube releases.

Hadoop 2 is not yet supported, since its artifact names are different.

To build against your own version of HBase or Hadoop, just add your repository to the POM and pass -DhbaseVersion or -DhadoopVersion to the datacube build.
