snowplow.github.com's People

Contributors

aalekh, alexanderdean, andrioni, benfradet, bigsnarfdude, bogaert, chuwy, colmsnowplow, dmpacheco, fblundun, ggaviani, idanby1, ihortom, jbeemster, jonalmeida, mrcurtis, ndingwall, ninjabear, petermonte, ronnyml, yalisassoon

snowplow.github.com's Issues

Write Berlin & Buda blog post

Alex is going to be speaking at Budapest DW Forum 2014

On Data warehouse day (Thursday 5th), Alex will be speaking about:

With the release of Amazon Kinesis in late 2013, the Snowplow team set themselves the challenge of porting Snowplow's Hadoop-based architecture to Kinesis. Alex Dean from Snowplow will share their experiences porting Snowplow to Kinesis, including: "hero" use cases for event streaming (with a live demo); building a lambda architecture with Kinesis and EMR; moving from a batch to streaming mindset.

Since its inception, the Snowplow open source event analytics platform (https://github.com/snowplow/snowplow) has always been tightly coupled to the batched-based Hadoop ecosystem, and Elastic MapReduce in particular. With the release of Amazon Kinesis in late 2013, we set ourselves the challenge of porting Snowplow to Kinesis, to give our users access to their Snowplow event stream in near-real-time.

With this porting process now complete, Alex Dean, Snowplow Analytics co-founder and technical lead, will share Snowplow's experiences in adopting stream processing as a complementary architecture to Hadoop and batch-based processing. In particular, Alex will explore:

  • "Hero" use cases for event streaming which drove our adoption of Kinesis
  • Why we waited for Kinesis, and thoughts on how Kinesis fits into the wider streaming ecosystem
  • How Snowplow achieved a lambda architecture with minimal code duplication, allowing Snowplow users to choose which (or both) platforms to use
  • Key considerations when moving from a batch mindset to a streaming mindset - including aggregate windows, recomputation, backpressure
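
The "minimal code duplication" point above can be sketched as follows. This is an illustrative sketch only (the names `Enrich`, `RawEvent`, etc. are hypothetical, not Snowplow's actual codebase): keep the enrichment logic as a pure function, then wrap it once for the batch (Hadoop) path and once for the streaming (Kinesis) path, so the logic is written exactly once.

```scala
object Enrich {
  final case class RawEvent(payload: String)
  final case class EnrichedEvent(payload: String, appId: String)

  // The shared, pure core: identical for batch and stream.
  def enrich(raw: RawEvent, appId: String): EnrichedEvent =
    EnrichedEvent(raw.payload.trim.toLowerCase, appId)

  // Batch path: applied over a whole collection of raw events.
  def enrichBatch(raws: Seq[RawEvent], appId: String): Seq[EnrichedEvent] =
    raws.map(enrich(_, appId))

  // Streaming path: applied per record as each one arrives.
  def enrichRecord(raw: RawEvent, appId: String): EnrichedEvent =
    enrich(raw, appId)
}
```

Because the core is a pure function, both pipelines stay behaviourally identical by construction - the lambda-architecture risk of the two code paths drifting apart is confined to the thin wrappers.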

On Friday June 6th (the training day), Alex will give a half-day workshop:

Abstract

Hadoop is everywhere these days, but it can seem like a complex, intimidating ecosystem to those who have yet to jump in. In this hands-on workshop, Alex Dean, co-founder of Snowplow Analytics, will take you "from zero to Hadoop", showing you how to run a variety of simple (but powerful) Hadoop jobs on Elastic MapReduce, Amazon's hosted Hadoop service. Alex will start with a no-nonsense overview of what Hadoop is, explaining its strengths and weaknesses and why it's such a powerful platform for data warehouse practitioners. Then Alex will help you get set up with EMR and Amazon S3, before leading you through a first job in Pig Latin, a simple language for writing Hadoop jobs. After this we will move on to writing a more advanced job in Scalding, Twitter's Scala API for writing Hadoop jobs. For our final job, we will consolidate everything we have learnt by building a multi-step job combining Pig, Scalding and Apache HBase, the Hadoop database.

In detail

Agenda

Introducing Hadoop
Our simple job:
    Setting up EMR, S3 and our local client tools
    Writing our Pig Latin script
    Running and inspecting our results
Scalding:
    Introduction to Scalding and Cascading
    Writing our Scalding app
    Running and inspecting our results
Putting it all together:
    Introduction to HBase
    Writing our second Pig Latin script
    Updating our Scalding app
    Running and inspecting our results
Conclusions & next steps
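
The jobs in the agenda above all compute variants of the classic word count. As a flavour of what the Scalding step produces, here is an in-memory Scala analogue (not an actual Hadoop or Scalding job - those require a running cluster) that mirrors the same map / group / reduce phases:

```scala
object WordCount {
  // Tokenize lines (map), group by word (shuffle), count per key (reduce).
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\W+")) // map: line -> words
      .filter(_.nonEmpty)
      .groupBy(identity)                    // shuffle: group by key
      .map { case (w, ws) => (w, ws.size) } // reduce: count per key
}
```

In the workshop, the same three phases are expressed in Pig Latin (`TOKENIZE`/`GROUP`/`COUNT`) and then in Scalding's pipe operations, running on EMR rather than in local memory.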

Sort out homepage

Need to nail down our messaging really tightly, and put up customer testimonials

Write blog post explaining entities

Core flow

  • Entities exist through time, with slowly changing state
  • Transactional databases, RESTful APIs and many programming languages model entities by allowing an entity's state to mutate through time: CRUD
  • Many people in analytics think that events are somehow different from entities
  • Events are not different from entities - events contain entities. In subject-verb-object, the subject and object are entities. In context, if I add to basket on a webpage, then the webpage is an entity
  • The difference is that events model entities in a different way: events snapshot the entity, freezing it in amber at the moment the event took place. The entities in an event are frozen snapshots - they are immutable facts

Differences from classic data warehousing

  • Classic data warehousing attempts to deal with the problem of slowly changing entity state in transactional systems
  • It does this by regularly snapshotting the entities in the transactional system and storing each entity's state at that time in a dimension table (as a slowly changing dimension)
  • This means that you have a history of the slowly changing entity, precise only to the window size, i.e. how often the ETL batch process is run (typically once a day)
  • However, this does not tell you when the state changed, and thus you cannot know definitively the entity's state at a given point in time (e.g. when an event happened)
  • By contrast, in an event model, you capture the exact entity state at the time of the event, so you know exactly what state the entities were in when the event happened
  • If that is still not detailed enough, then add more granular events, potentially down to the level of tracking individual state changes (a reflexive verb)

Snowplow world

  • We believe that you should simply snapshot the entity at collection time
  • At analytics time, you decide how to treat those snapshots of the slowly changing entity
  • There is no right or wrong way of doing this - it depends purely on contextual factors
  • If you are analysing how behaviour changes with user age, then you would use the age at event time
  • If you are audience targeting for an email campaign, you would use their current age
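
The two analyses above can be sketched in Scala (hypothetical types, not Snowplow's actual schema): the event freezes the user's state at collection time, and analytics-time code chooses between the snapshotted value and the current one.

```scala
import java.time.{LocalDate, Period}

object EntitySnapshots {
  // The entity as frozen into the event - an immutable fact.
  final case class UserSnapshot(userId: String, birthDate: LocalDate)
  final case class Event(verb: String, user: UserSnapshot, occurredAt: LocalDate)

  private def age(birth: LocalDate, on: LocalDate): Int =
    Period.between(birth, on).getYears

  // Behavioural analysis: the user's age as it was when the event happened.
  def ageAtEventTime(e: Event): Int = age(e.user.birthDate, e.occurredAt)

  // Audience targeting: the user's age as it is today.
  def currentAge(e: Event, today: LocalDate): Int = age(e.user.birthDate, today)
}
```

Both functions read the same immutable snapshot; the choice of reference date is the only "interpretation" step, and it happens at analytics time rather than at collection time.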

Notes

  • This is why you have to be careful when doing sidepipe joins - you are often joining an event which took place a year ago to the current state of an entity (e.g. a customer record), which is not a true reflection of the entity's state when the event took place
  • It's interesting to look at recent database innovations which are trying to bring databases much closer to the idea of immutable entities - e.g. Datomic and eventsourced-persistence
  • Also note how databases can generate an event stream of mutation events, and how Kafka has support for handling mutation events efficiently (e.g. with compaction)

On reflexive verbs

  • Almost all state changes are the result of an external actor, and so should be completely capturable in the event flow
  • However, some are not - my age increasing from 33 to 34 is not the product of an external actor. In this case we could say that I reflexively change my own age: it is still subject verb object, but the object is the self, i.e. a reflexive verb

jekyll serve not working for me

I'm getting this:

vagrant@precise64:/vagrant/snowplow.github.com$ jekyll serve
Configuration file: /vagrant/snowplow.github.com/_config.yml
            Source: /vagrant/snowplow.github.com
       Destination: /vagrant/snowplow.github.com/_site
      Generating...   Liquid Exception: incompatible character encodings: ISO-8859-1 and UTF-8 in atom.xml
error: incompatible character encodings: ISO-8859-1 and UTF-8. Use --trace to view backtrace
vagrant@precise64:/vagrant/snowplow.github.com$ jekyll serve --trace
Configuration file: /vagrant/snowplow.github.com/_config.yml
            Source: /vagrant/snowplow.github.com
       Destination: /vagrant/snowplow.github.com/_site
      Generating...   Liquid Exception: incompatible character encodings: ISO-8859-1 and UTF-8 in atom.xml
/home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/tags/for.rb:117:in `block (2 levels) in render': incompatible character encodings: ISO-8859-1 and UTF-8 (Encoding::CompatibilityError)
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/tags/for.rb:105:in `each'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/tags/for.rb:105:in `each_with_index'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/tags/for.rb:105:in `block in render'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/context.rb:105:in `stack'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/tags/for.rb:104:in `render'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/block.rb:106:in `block in render_all'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/block.rb:93:in `each'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/block.rb:93:in `render_all'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/block.rb:82:in `render'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/template.rb:124:in `render'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/liquid-2.5.5/lib/liquid/template.rb:132:in `render!'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/lib/jekyll/convertible.rb:88:in `render_liquid'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/lib/jekyll/convertible.rb:150:in `do_layout'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/lib/jekyll/page.rb:115:in `render'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/lib/jekyll/site.rb:239:in `block in render'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/lib/jekyll/site.rb:238:in `each'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/lib/jekyll/site.rb:238:in `render'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/lib/jekyll/site.rb:39:in `process'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/lib/jekyll/command.rb:18:in `process_site'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/lib/jekyll/commands/build.rb:23:in `build'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/lib/jekyll/commands/build.rb:7:in `process'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/jekyll-1.4.3/bin/jekyll:97:in `block (2 levels) in <top (required)>'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/commander-4.1.6/lib/commander/command.rb:180:in `call'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/commander-4.1.6/lib/commander/command.rb:180:in `call'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/commander-4.1.6/lib/commander/command.rb:155:in `run'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/commander-4.1.6/lib/commander/runner.rb:422:in `run_active_command'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/commander-4.1.6/lib/commander/runner.rb:82:in `run!'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/commander-4.1.6/lib/commander/delegates.rb:12:in `run!'
    from /home/vagrant/.rvm/gems/ruby-1.9.3-p484@global/gems/commander-4.1.6/lib/commander/import.rb:10:in `block in <top (required)>'
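
This `Encoding::CompatibilityError` typically means Ruby 1.9 is reading the site's source files as ISO-8859-1 because no UTF-8 locale is set inside the Vagrant VM. A common fix (an assumption - not yet confirmed for this repo) is to force a UTF-8 locale before running Jekyll:

```shell
# Force a UTF-8 locale so Ruby reads files as UTF-8
# rather than falling back to ISO-8859-1.
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
# then re-run: jekyll serve
```

Adding the two exports to the vagrant user's `~/.bashrc` would make the fix stick across sessions.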

Add testimonial from Eric @ Viadeo

Eric's testimonial (we can clean up the English):

"Snowplow was a game changer here at Viadeo. In a few months we dropped Google Analytics and Omniture to rely fully on Snowplow in production, handling about 10 million analytics events per day. We like the versatility of the infrastructure - we use our internal Hadoop/Spark infrastructure to create deep BI reports merging analytics data with our back-end business logs. At the same time we love the simplicity and the open-innovation approach offered by Redshift storage: today at Viadeo every single engineer or product manager is able to set up a rich metrics dashboard in a few minutes."

Add jobs page

Spec for an engineer, listing the qualities that we look for. Explain that we are building the team in London at the moment, but will look at working with remote engineers / opening an office in the US in due course.

cc @alexanderdean

Add testimonial from Joao

Having a web analytics platform with summarized data, like Google Analytics, can provide immense value to many organizations, but the future of web analytics is not summarized data, it's large amounts of unstructured and highly granular enriched data that can provide even more value in the hands of the right team.

Snowplow provides me complete ownership of my clickstream data with total flexibility at an affordable cost. I also see it as an insurance policy for the future.

Snowplow democratized clickstream data

Add partners page

Technology partners

  • AWS
  • Looker
  • Qubole

Implementation partners

  • Simon Rumble
  • Thoughtworks
  • Havas Media / DBi
  • Dutch group (?)
  • Torque (?)

Supporting organisations

  • GitHub
  • Navicat
  • Microsoft
