datahem's Introduction

Overview

DataHem is a serverless, real-time, end-to-end ML pipeline built entirely on Google Cloud Platform services: App Engine, Pub/Sub, Dataflow, BigQuery and Cloud ML.

Benefits

When building ML/Data products, your most valuable asset is your data. Hence, the purpose of DataHem is to give you:

  1. full control and ownership of your data
  2. unsampled data
  3. data in real time
  4. the ability to replay/reprocess your data unlimited times
  5. data synergies: collect once and use for multiple purposes (reporting, analytics and building data/ML products)
  6. low cost of operations and maintenance
  7. scalability
  8. data as a stream and at rest
  9. activation of data
  10. ability to delete data on a row by row basis

Target architecture

[Figure: target architecture diagram]

Use cases

1. Digital Analytics

The first use case is to leverage your existing implementation of Google Analytics / Measurement Protocol. Google Analytics is awesome, but it has some limitations worth addressing in order to take reporting, analytics and machine learning to the next level. By adding a custom task to your Google Analytics tracker, DataHem eliminates many of the limitations of both the free and the premium version of Google Analytics and gives you:

  • Unsampled data
  • Real-time data
  • Unlimited custom dimensions and metrics
  • Unlimited data volume
  • Enriched data as a stream
  • Unlimited reprocessing of data
  • No licensing fees (open source)
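
The custom task mentioned above can be sketched roughly as follows. This is a minimal illustration, not DataHem's actual tracker code: it relies on the analytics.js task fields `customTask`, `sendHitTask` and `hitPayload`, while `makeCustomTask`, `collectorUrl` and the `send` callback are hypothetical names introduced here. The idea is to let Google Analytics send the hit as usual and then forward the same Measurement Protocol payload to a collector endpoint.

```typescript
// Minimal shape of the analytics.js model object that tasks receive.
interface GaModel {
  get(field: string): unknown;
  set(field: string, value: unknown): void;
}

type Task = (model: GaModel) => void;
// Hypothetical transport callback; in a browser it could wrap navigator.sendBeacon.
type Sender = (url: string, payload: string) => void;

// Builds a customTask that duplicates every hit payload to a second endpoint.
// `collectorUrl` is an assumed collector address, not a documented DataHem URL.
function makeCustomTask(collectorUrl: string, send: Sender): Task {
  return (model: GaModel): void => {
    // Keep a reference to the task that sends the hit to Google Analytics.
    const originalSendHitTask = model.get("sendHitTask") as Task;
    model.set("sendHitTask", (sendModel: GaModel): void => {
      // Send the hit to Google Analytics as usual.
      originalSendHitTask(sendModel);
      // Forward the identical payload to the collector.
      const payload = sendModel.get("hitPayload") as string;
      send(collectorUrl, payload);
    });
  };
}
```

In a page that loads analytics.js, the task could be registered with something like `ga('set', 'customTask', makeCustomTask(url, (u, body) => navigator.sendBeacon(u, body)))`, where `url` is whatever address your collector is deployed at.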

License

DataHem is licensed under AGPL 3.0 or later.

DataHem ecosystem

The architecture of DataHem consists of loosely coupled parts, so that individual parts can be replaced or extended in the future.

  • tracker: sends data to the collector; currently supports the Google Analytics JavaScript tracker
  • collector: collects data sent from trackers and publishes it to Pub/Sub; currently runs on Google App Engine Standard (Java)
  • processor: processes bounded and unbounded data and writes to Pub/Sub and BigQuery; currently uses Google Dataflow (Apache Beam) and supports processing of Google Analytics hits and AWS Kinesis events
  • serializer: serializes structured data; currently uses protocol buffers
  • infrastructor: infrastructure as code to easily set up the required APIs and services; currently uses Google Deployment Manager
  • predictor (backlog): predictions made on streaming data
  • pseudonymizor (backlog): pseudonymizing personal and/or sensitive data
  • ruler (backlog): processing rules for personal data
  • activator (backlog): serving predictions via REST/gRPC
  • orchestrator (backlog): workflow management DAGs using Google Cloud Composer
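
To make the hand-off between collector and processor more concrete, here is a small sketch of one step a processor might perform on a Google Analytics hit: turning the raw Measurement Protocol payload into key/value pairs before further enrichment. It is an illustration only and is not taken from DataHem's Beam pipelines; the function name `parseHitPayload` is hypothetical, and malformed percent-encoding would make `decodeURIComponent` throw.

```typescript
// Decodes one URL-encoded component, treating "+" as a space.
function decodeComponent(s: string): string {
  return decodeURIComponent(s.replace(/\+/g, " "));
}

// Parses a Measurement Protocol payload such as "v=1&t=pageview&dp=%2Fhome"
// into a flat record of parameter names and decoded values.
function parseHitPayload(payload: string): Record<string, string> {
  const params: Record<string, string> = {};
  for (const pair of payload.split("&")) {
    if (pair === "") continue; // skip empty segments, e.g. a trailing "&"
    const idx = pair.indexOf("=");
    const rawKey = idx === -1 ? pair : pair.slice(0, idx);
    const rawValue = idx === -1 ? "" : pair.slice(idx + 1);
    params[decodeComponent(rawKey)] = decodeComponent(rawValue);
  }
  return params;
}
```

A record like this could then be mapped onto a structured schema (the serializer's job) before being written to Pub/Sub or BigQuery.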

Setup

Follow the instructions in the wiki to set up the various parts of DataHem.

Background

DataHem was started in June 2017 by robertsahlin / ML-engineer. It was open sourced, officially brought under MatHem's mhlabs GitHub account, and announced in May 2018.

The name DataHem is a play on words resembling MatHem, the Swedish online grocery store where DataHem is developed. "Data" = "data". "Hem" is the Swedish word for "home".
