snowplow / snowplow Goto Github PK

The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP

License: Apache License 2.0

CSS 0.10% HTML 3.20% JavaScript 6.70% Scala 47.25% Thrift 3.25% PLpgSQL 23.84% Python 7.13% Shell 8.53%

snowplow analytics data data-pipeline data-collection product-analytics marketing-analytics snowplow-pipeline snowplow-events

snowplow's Introduction

As of January 8, 2024, Snowplow is introducing the Snowplow Limited Use License Agreement, and we will be releasing new versions of our core behavioral data pipeline technology under this license.

Our mission to empower everyone to own their first-party customer behavioral data remains the same. We value all of our users and remain dedicated to helping our community use Snowplow in the optimal capacity that fits their business goals and needs.

We reflect on our Snowplow origins and provide more information about these changes in our blog post here → https://eu1.hubs.ly/H06QJZw0

Overview

Snowplow is a developer-first engine for collecting behavioral data. In short, it allows you to:

Collect events such as impressions, clicks, video playback (or even custom events of your choosing).
Store the data in a scalable data warehouse you control (Amazon Redshift, Databricks, Elasticsearch, Google BigQuery, Snowflake) or emit it via a stream (Amazon Kinesis, Google PubSub, Kafka).
Leverage a wide range of tools to model and analyze the behavioral data: dbt, Looker, Metabase, Mode, Streamlit, Superset, Redash, and more.

Thousands of organizations around the world generate, enhance, and model behavioral data with Snowplow to fuel advanced analytics, AI/ML initiatives, or composable CDPs.

Why Snowplow?
Where to start?
Snowplow technology 101
About this umbrella repository
Community

Why Snowplow?

🏔️ Rock solid architecture capable of processing billions of events per day.
🛠️ Over 20 SDKs to collect data from web, mobile, server-side, and other sources.
✅ A unique approach based on schemas and validation ensures your data is as clean as possible.
🪄 Over 15 enrichments to get the most out of your data.
🏭 Send data to popular warehouses and streams — Snowplow fits nicely within the Modern Data Stack.

➡ Where to start? ⬅️

Snowplow Community Edition	Snowplow Behavioral Data Platform
Community Edition equips you with everything you need to start creating behavioral data in a high-fidelity, machine-readable way. Head over to the Quick Start Guide to set things up.	Looking for an enterprise solution with a console, APIs, data governance, workflow tooling? The Behavioral Data Platform is our managed service that runs in your AWS, Azure or GCP cloud. Book a demo.

The documentation is a great place to learn more, especially:

Tracking design — discover how to approach creating your data the Snowplow way.
Pipelines — understand what’s under the hood of Snowplow.

Would rather dive into the code? Then you are already in the right place!

Snowplow technology 101

The repository structure follows the conceptual architecture of Snowplow, which consists of six loosely-coupled sub-systems connected by five standardized data protocols/formats.

To briefly explain these six sub-systems:

Trackers fire Snowplow events. Currently we have 15 trackers, covering web, mobile, desktop, server and IoT
Collector receives Snowplow events from trackers. Currently we have one official collector implementation with different sinks: Amazon Kinesis, Google PubSub, Amazon SQS, Apache Kafka and NSQ
Enrich cleans up the raw Snowplow events, enriches them and puts them into storage. Currently we have several implementations, built for different environments (GCP, AWS, Apache Kafka) and one core library
Storage is where the Snowplow events live. Currently we store the Snowplow events in a flat file structure on S3, and in the Redshift, Postgres, Snowflake and BigQuery databases
Data modeling is where event-level data is joined with other data sets and aggregated into smaller data sets, and business logic is applied. This produces a clean set of tables which make it easier to perform analysis on the data. We officially support data models for Redshift, Snowflake and BigQuery.
Analytics are performed on the Snowplow events or on the aggregate tables.

For more information on the current Snowplow architecture, please see the Technical architecture.

About this repository

This repository is an umbrella repository for all loosely-coupled Snowplow components and is updated on each component release.

Since June 2020, all components have been extracted into their dedicated repositories (more info here) and this repository serves as an entry point for Snowplow users and as a historical artifact.

Components that have been extracted to their own repository are still here as git submodules.

Trackers

A full list of supported trackers can be found on our documentation site. Popular trackers and use cases include:

Web	Mobile	Gaming	TV	Desktop & Server
JavaScript	Android	Unity	Roku	Command line
AMP	iOS	C++	iOS	.NET
	React Native	Lua	Android	Go
	Flutter		React Native	Java
				Node.js
				PHP
				Python
				Ruby
				Scala
				C++
				Rust
				Lua

Collector

Enrich

Loaders

Iglu

Data modeling

Web

Mobile

Parsing enriched event

Bad rows

Terraform Modules

Community

We want to make it super easy for Snowplow users and contributors to talk to us and connect with one another, to share ideas, solve problems and help make Snowplow awesome. Join the conversation:

Meetups. Don’t miss your chance to talk to us in person. We are often on the move with meetups in Amsterdam, Berlin, Boston, London, and more.
Discourse. Our forum for all Snowplow users: engineers setting up Snowplow, data modelers structuring the data, and data consumers building insights. You can find guides, recipes, questions and answers from Snowplow users and the Snowplow team. All questions and contributions are welcome!
Twitter. Follow @Snowplow for official news and @SnowplowLabs for engineering-heavy conversations and release announcements.
GitHub. If you spot a bug, please raise an issue in the GitHub repository of the component in question. Likewise, if you have developed a cool new feature or an improvement, please open a pull request, we’ll be glad to integrate it in the codebase! For brainstorming a potential new feature, Discourse is the best place to start.
Email. If you want to talk to Snowplow directly, email is the easiest way. Get in touch at [email protected].

Copyright and license

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

snowplow's People

Contributors

Stargazers

Watchers

Forkers

gsterndale richo ramn loocadesign 99designs dpayonk atkinson tzuryby legutierr mmoulton niekbeimer lboudard blayze chiehwen gitlisted e3test tims duilio d3mcfadden ddines pvaughn genejones hodgesz carlyl ippy04 christiankohler rluta pkmishra michaelbernstein wowgeeker soitun fotanus rcs netvarun breno begger freekh ankbul shermozle toliver dominiclovell tmmoto gregakespret cnh bunshar asherbond dongzhi bilenskaya prabhjotsl andxh frutik evisoft shuoc eychih bhandrigan satya-ak pekap nsangeeth fedorov dk-dev greywillfade tonyzhu wethanet vtemian web5design bytebiscuit pouyajaf arunpn itayw timodwhit thebeansgroup briceshatzer pdaniel andar yotpoltd angleman justinstern jasperspijkstra kinabalu emperiusm j-flynn mvogelgesang bachnguyen pkallos 3mp8r3 code2sys acinader mindis someapp bruhman78 agvstin mvitorino pedroprez goldentarek chris-biblio hzbadr stevencasey perisri101 pitchtarget majiao419

snowplow's Issues

Add sp.js to CDNJS

So SnowPlow users can retrieve the minified sp.js from CDNJS as well as our own CloudFront distribution

Add skip/restart mode to EmrEtlRunner

Sometimes the ETL process doesn't complete successfully (e.g. because the Processing Bucket is not empty, or there's a problem in the logs being processed).

It would be nice to be able to rerun EmrEtlRunner from a specific point via a command-line argument, e.g:

--from start (the default)
--from emr
--from archive

Add support for extracting the five marketing fields

Still needs doing

Add versioning to JavaScript tracker

Rename rdm to txn_id

Because rdm isn't used for cachebusting - it's used for de-duping when Cf requests the same file from multiple sources

JavaScript tracker missing a Url-based sync constructor

The sync constructor looks like this:

getTracker: function (accountId) {
  return new SnowPlow.Tracker(accountId);
},

There is no equivalent option for creating a sync tracker based on a URL, rather than account ID.

Add new page-id field to JavaScript tracker and Serde

Currently SnowPlow tracks the page name and page URL - but these are not very precise from an analytics perspective:

What if a new version of the page is launched on the same URL?
What if split testing means two different users see different versions of the page on the same URL?

The idea then is to allow the JavaScript to set a custom page-id:

setPageId('<<CMS/JavaScript generated page ID>>');

Then obviously this page ID would need to be passed through on the querystring and then extracted by the Serde...

The alternative to this approach would be to have a big, constantly changing lookup table on the analytics side when different versions of pages went live - and even this wouldn't meet the A/B testing requirement.

Add HiveQL which generates a flatfile format compatible with ICE, Postgres, BigQuery etc

Currently we have hive-format-etl.q let's have non-hive-format-etl.q too

Wiki: faulty link "for engineers"

On https://github.com/snowplow/snowplow/wiki/SnowPlow-overview
the link "For engineers" seems wrong

Add a test to check the URL fallback

Add a `EventTest` set of tests, similar to the `PageViewTest`

Update ETL to support /i as well as /ice.png

We're planning on moving the pixel to /i (see #29) from /ice.png to gain some extra chars.

Might as well add this support into the Hive serde sooner rather than later.

Add tracking of JS errors to JavaScript tracker

So that you do something like this:

window.onerror = function(errorMessage, url, line) {
      _snaq.push['trackError', errorMessage, line];
  };

Allow HiveQL/Serde upload into a non-US bucket

Current limitation where the HiveQL and Serde need to be uploaded into a US bucket in S3.

Add support for Amazon spot instances etc to EmrEtlRunner

Elasticity already supports this, so just need to add this into the EmrEtlRunner configuration file somehow...

Build Scalding ETL tool

Removing need for Hive in the ETL step.

Add social tracking features to JS tracker

See Google Analytics addSocial here: https://developers.google.com/analytics/devguides/collection/gajs/gaTrackingSocial

Again, make the fields first class citizens

Add in browser language detection to SnowPlow.js

Add SiteCatalyst-style redirect support

Waiting on clarification from @shermozle in shermozle/SnowCannon#23

Update tracking code to support HTTPS

Change this line in your tag:

sp.src = ('https:' == document.location.protocol ? 'https' : 'http') + '://snplow.com/sp.js';

To be:

sp.src = ('https:' == document.location.protocol ? 'https://d2nqfiix2qwfci.cloudfront.net' : 'http://snplow.com') + '/sp.js';

Add platform field to ETL output

Where platform is 'web', 'mobile', 'desktop', 'console' etc

Add setSiteId()

Functionality so you can use the same SnowPlow install for multiple sites

Amazon setup guide out of date

Screenshots, wizard steps in the guide doesn't correspond to whats in AWS

Rename setAccount() to setCollectorCf() for consistency

setAccount() is actually CloudFront collector specific - should reflect this in the name:

setCollectorCf('xxx')

To complement the existing setCollectorUrl('yyy')

Introduce this in two phases: first deprecate the existing setAccount with a warning like so:

if (typeof console !== 'undefined') {
  console.log("setAccount is deprecated and will be removed in an upcoming version. Please use setCollectorCf() instead.");
}

Fix Specs tests which run in parallel

Some weirdness between Specs2 and Hadoop - it looks like Specs is running tests in parallel and they're getting into conflict because they're sharing a cached Hive record.

Add support for "client-timestamp" field

For mobile, sometimes you want to batch the updates and send them in after the fact. This means that you cannot rely on the CloudFront timestamp. Update the serde to support a client-timestamp field which overrides the CloudFront timestamp if set.

Remove goal tracking behaviour

Wiki updates

(I'm making this issue vague since I'll probably update it a few times)

GitHub doesn't support pull requests on wiki's, for shame.

So I've pushed your wiki wtih a few patches at github.comricho/snowplow.wiki

It'd be great if you could pull from the doc_fixes branch

Integration documentation: extra char

On https://github.com/snowplow/snowplow/wiki/Integrating-SnowPlow-into-your-website
there is an extra ] here:
_trackEvent('Checkout', 'Add', 'ASO01043', 'blue:xxl', '2.0']);

Implement the SnowPlowAdImpDeserializer

Lower priority - please vote if you want this one.

Scala Hadoop Shred: deduplicate event_ids with different event_fingerprints (synthetic duplicates)

When you observe two events in your event stream with the same event_id, one of three things could be happening:

category	cause	payload matches?	probable time between duplicates	fix
ID collision	huge event volumes / algorithm flaws	no	far apart	give one event a new ID
synthetic copy	browser pre-cachers, anti-virus software, adult content screeners, web scrapers	partially (most or all client-sent fields)	close by logical time	either a) delete synthetic copy or b) give it a new ID & preserve relationship to "parent" event
natural copy	at least once processing	yes	close by ETL time	delete all but one event

Thinking about this further, a simple de-duplication algorithm would be:

If the payload matches exactly, then delete all but one copy
If the payload differs in any way, then give one event a new ID and preserve its relationship to "parent" event

With this approach, distinguishing between ID collisions and synthetic copies can still be done (if needed) at analysis time.

Could this be done using bloom filters? (http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf) Not directly:

unless event_id **definitely not in set** of bloom[event_ids]:
    if hash(event) **definitely not in set** of bloom[event hashes]:
        assign new event_id and store old event_id in original_event_id field
    else:
        delete event # Deletes some false positives!

Rather than delete some false positives (causing data loss), it would be safer to err on the side of caution:

unless event_id **definitely not in set** of bloom[event_ids]:
    assign new event_id and store old event_id in original_event_id field

But this approach will still cause inflated counts with at least once processing systems. So a hybrid model, using a KV cache of N days of events and another KV cache of N days of event hashes:

if event_id in n_days_cache[event_ids]:
    if hash(event) in n_days_cache[event hashes]:
        delete event
    else:
         assign new event_id and store old event_id in original_event_id field
else:
    unless event_id **definitely not in set** of bloom[event_ids]:
        assign new event_id and store old event_id in original_event_id field

This prevents natural copies within an N day window and safely renames ID collisions and synthetic copies all the way back through time.

Title was: Occasionally CloudFront registers an event twice

Content was:

Presumably because it hits two separate nodes at the same time.

Update the ETL to dedupe this - algorithm would be to hash the querystring and check for duplicates within a X minute timeframe. (Hashing the full raw querystring would implicitly include txn_id, the random JavaScript-side identifier, in the uniqueness check.)

Fix serde so tabs are changed to ' ' (four spaces)

Leaving tabs in the serde output can break tab delimited files

Support for Eucalyptus Walrus (Amazon S3 Private Cloud Equiv)

Hi,

I notice that you depend on Amazon S3, are there any plans on supporting or has anyone tried to see if this works with the private cloud equivalents such as Eucalyptus Walrus (http://open.eucalyptus.com/wiki/EucalyptusStorage_v1.4)?

Why do the null check tests not work?

Have had to be commented out. Tests don't work, but Deserializer works fine (returns null for appropriate rows)

Update querystring to include page url on end

Because CloudFront does not store cs(Referer) if HTTP status is not 200 (e.g. 304 Not Modified or 403 Forbidden).

Update docs too.

Upgrade to httpclient 4.2 when out of beta

httpclient 4.2 has a querystring parse method which avoids creating a new URI object. This should be a slight speedup over the current verison.

Here's the javadoc:

http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/client/utils/URLEncodedUtils.html#parse(java.lang.String, java.nio.charset.Charset)

Strip out legacy ecomm tracking from SnowPlow.js

Update ETL to handle new site/app ID

Field passed on the querystring is called aid, short for Application ID.

Implement strategy pattern within the Scalding ETL

An idea from Simon Andersson.

Basically: there is a standard ETL job, but you can add extra capabilities to it via plugins, or even swap out the standard querystring logic with an alternative implementation. The ETL process becomes closer to a framework, rather than an opinionated implementation.

Need to make sure that this doesn't impact the subsequent sub-systems (storage and analytics) too badly (in terms of those sub-systems having to deal with lots of different potential data structures).

Remove legacy custom vars behaviour

Inconsistent method names doc: trackEvent

On https://github.com/snowplow/snowplow/wiki/Integrating-SnowPlow-into-your-website

We have these:
_snaq.push(['trackEvent', 'Videos', ...
_snaq.push(['_trackEvent', 'Mixes', ...

trackEvent or _trackEvent?

Add version handling to ETL output

ETL records its own version, e.g. "hive-0.4.3"
ETL records versioning from both tracker (e.g. "js-0.6.1") and collector (e.g. "snowcannon-0.1.0")

Add support for sending event as multiple GETs

Currently the SnowPlow tracker uses a GET on an image tag to send an event. This has a limitation around URL length in IE - see http://support.microsoft.com/kb/208427

Potentially for mobile/server clients we can use POST in the future (with a non-CloudFront collector), but browsers have a couple of limitations which rule out POST:

You can't do POSTs asynchronously, which is a key advantage of the GET image tag approach
You can't do POSTs cross-domain from JavaScript

A solution, then, for big events, is to split them over multiple GETs (probably only in IE). And then the ETL phase would be responsible for stitching them back together.

Rotate through ice1.png-ice10.png, to avoid occasional browser caching issue

To avoid the occasional 403 from CloudFront.

ETL should tolerate new querystring fields

With the current Hive Deserializer (0.4.8/0.4.9), if a row which has a new querystring field in it, the whole row is nullified. This is an overreaction - the field should be silently ignored instead.

This scenario could easily happen if an older collector is run on newer tracker data.

In other words: not sure the behaviour in the BadQsFieldTest specification is fair.

Hive serde/EQL breaks with line breaks in any field

If the page title contains an (escaped) line break, then this line break is unescaped and written out to the event output file, breaking the data row onto two lines.

Need to update the serde so that line breaks (and possibly the row delimiter) are not unescaped.

Rename ice.png to i and change file to a GIF

That gives us an extra 6 characters for data! Plus a 1x1 GIF is a bit smaller than a PNG, and there is no code for generating a 1x1 transparent PNG (versus GIF) easily available (because everybody creates GIFs).

Need to update:

The install guide - there is a ticket to fix this anyway: #25
Change the static ice.png file in /1-collectors/cloudfront to i, and make it GIF format
The snowplow.js JavaScript

Note this will be a breaking change for anybody who has uploaded ice.png and is using our hosted JavaScript.

Check if there's a octet-stream bug with ice.png

Chrome giving warnings:

Resource interpreted as Image but transferred with MIME type application/octet-stream:

Update ETL to handle new ecommerce tracking capabilities

JavaScript v0.6 has new ecommerce tracking fields. These need to be handled in our ETL and make their way into the final flatfile structure.

In EmrEtlRunner, only copy the first two hours of files from the following day

See discussion here for rationale:

72ab49c#commitcomment-2024343