
snowplow-rdb-loader's Introduction

Relational Database Loader

Build Status Release License

Introduction

This project contains applications required to load Snowplow data into various data warehouses.

It consists of two types of applications: Transformers and Loaders.

Transformers

Transformers read Snowplow enriched events, transform them into a format ready to be loaded into a data warehouse, and write them to the respective blob storage.

There are two types of Transformers: Batch and Streaming.

Stream Transformer

Stream Transformers read enriched events from the respective stream service, transform them, and write the transformed events to a specified blob storage path. They write transformed events in periodic windows.

There are three Stream Transformer applications: Transformer Kinesis, Transformer Pubsub and Transformer Kafka, the variants for AWS, GCP and Azure respectively.
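The periodic-window idea can be illustrated with a small fs2 / cats-effect sketch (libraries this stack is built on). Everything below (the event source, the transformation and the write step) is a hypothetical stand-in, not the actual transformer code:

```scala
import scala.concurrent.duration._
import cats.effect.{IO, IOApp}
import fs2.{Chunk, Stream}

object WindowedTransformSketch extends IOApp.Simple {

  // Hypothetical stand-in for a stream of enriched events.
  def enrichedEvents: Stream[IO, String] =
    Stream.awakeEvery[IO](100.millis).map(d => s"enriched-event-$d")

  // Hypothetical stand-in for the real transformation.
  def transform(event: String): String = event.toUpperCase

  // Hypothetical stand-in for writing one window of output to blob storage.
  def writeWindow(events: Chunk[String]): IO[Unit] =
    IO.println(s"writing ${events.size} transformed events to blob storage")

  def run: IO[Unit] =
    enrichedEvents
      .map(transform)
      .groupWithin(10000, 5.minutes) // close a window after 5 minutes or 10000 events
      .evalMap(writeWindow)
      .compile
      .drain
}
```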

Batch Transformer

The Batch Transformer is a Spark job that works only with AWS services. It reads enriched events from a given S3 path, transforms them, and writes the transformed events to a specified S3 path.
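As a rough illustration of that batch shape, here is a minimal Spark sketch; the S3 paths and the transformation itself are placeholders, not the real shredder logic:

```scala
import org.apache.spark.sql.SparkSession

object BatchTransformSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("batch-transformer-sketch").getOrCreate()
    import spark.implicits._

    spark.read
      .textFile("s3://my-bucket/enriched/archive/") // hypothetical input path
      .map(_.toUpperCase)                           // stand-in for the real transformation
      .write
      .text("s3://my-bucket/transformed/")          // hypothetical output path

    spark.stop()
  }
}
```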

Loaders

After a Transformer finishes transforming a batch and writing it to blob storage, it sends a message to a message queue. This message describes the transformed data, such as where it is stored and what it looks like.

Loaders subscribe to the message queue. When a message is received, it is parsed and the necessary details are extracted in order to load the transformed events into the destination. Loaders construct the necessary SQL statements to load the transformed events, then send these SQL statements to the specified destination.

At the moment, we have loader applications for Redshift, Databricks and Snowflake.
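To make the flow concrete, here is a minimal, hypothetical sketch of a loader step: parse the queue message for the location of the transformed batch, build a load statement (a Redshift COPY in this example), and run it over JDBC. None of these names, paths or the IAM role come from the actual loader codebase:

```scala
import java.sql.DriverManager

// Hypothetical, simplified shape of the queue message described above.
final case class ShreddingComplete(base: String)

object LoaderSketch {

  // Hypothetical parser: assume the message carries the blob storage prefix of the batch.
  def parseMessage(raw: String): ShreddingComplete =
    ShreddingComplete(base = raw.trim)

  // Build the warehouse-specific load statement (Redshift COPY shown here; the
  // IAM role ARN is a placeholder).
  def copyStatement(message: ShreddingComplete): String =
    s"COPY atomic.events FROM '${message.base}/output=good/' " +
      "IAM_ROLE 'arn:aws:iam::000000000000:role/redshift-load-role' " +
      "FORMAT AS JSON 'auto'"

  def load(jdbcUrl: String, rawMessage: String): Unit = {
    val connection = DriverManager.getConnection(jdbcUrl)
    try connection.createStatement().execute(copyStatement(parseMessage(rawMessage)))
    finally connection.close()
  }
}
```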

Find out more

Technical Docs | Setup Guide | Roadmap & Contributing

Copyright and license

Copyright (c) 2012-present Snowplow Analytics Ltd. All rights reserved.

Licensed under the Snowplow Limited Use License Agreement. (If you are uncertain how it applies to your use case, check our answers to frequently asked questions.)

snowplow-rdb-loader's People

Contributors

alexanderdean, alexbenny, benfradet, benjben, chuwy, colmsnowplow, dennisatspaceape, dilyand, dkbrkjni, drphrozen, fblundun, github-actions[bot], istreeter, kingo55, lmath, marcin-j, oguzhanunlu, pondzix, rgabo, richo, saj1th, scala-steward, shermozle, spenes, staymanhou, stdfalse, voropaevp, zcei


snowplow-rdb-loader's Issues

Scala Tracker problem when running inside Docker container

Here's the error:

RDB Loader successfully completed following steps: [Discover, Load, Analyze]
INFO: Logs successfully dumped to S3 [s3://foo]
[INFO] [08/22/2017 17:58:18.018] [snowplow-scala-tracker-akka.actor.default-dispatcher-4] [akka://snowplow-scala-tracker/user/IO-HTTP/group-0] Message [spray.can.Http$Connect] from Actor[akka://snowplow-scala-tracker/user/IO-HTTP/host-connector-0/0#52160612] to Actor[akka://snowplow-scala-tracker/user/IO-HTTP/group-0#1063404807] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
Exception in thread "Thread-2" akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://snowplow-scala-tracker/user/IO-HTTP#1209416068]] after [1000 ms]
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
        at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
        at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
        at java.lang.Thread.run(Thread.java:748)

Allow for setting wlm_query_slot_count for Redshift imports

Redshift/PostgreSQL has a massive number of knobs to turn for optimizing query performance, and one recommended by AWS is increasing wlm_query_slot_count in a session before running a query that needs extra CPU and memory.

Docs:
http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html

It would be awesome to allow users to set this in the Redshift config. We saw massive improvements (10x faster vacuums) when increasing ours to 45, but for now we need to do that manually.
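A hedged sketch of what this could look like from the loader side: run SET wlm_query_slot_count for the session before the heavy statement and reset it afterwards. The SQL is standard Redshift; the surrounding Scala names are made up:

```scala
import java.sql.Connection

object QuerySlotSketch {

  // Temporarily raise wlm_query_slot_count for this session, run the block,
  // then drop back to the default of 1.
  def withQuerySlots[A](connection: Connection, slots: Int)(block: => A): A = {
    val statement = connection.createStatement()
    try {
      statement.execute(s"SET wlm_query_slot_count TO $slots;")
      block
    } finally {
      statement.execute("SET wlm_query_slot_count TO 1;")
      statement.close()
    }
  }
}
```

Usage would be something like `withQuerySlots(connection, 5) { runVacuum(connection) }`, where `runVacuum` is whatever heavy statement (VACUUM, COPY) needs the extra slots.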

Use derived_tstamp as the primary tstamp in Redshift

Currently we use the collector_tstamp for the root_tstamp value.
It would be preferable to use the derived_tstamp once all our client-side trackers support generating a dvce_sent_tstamp, because at that point, from an analytics perspective, you're only interested in the derived_tstamp.

We need to figure out how we migrate from collector_tstamp to derived_tstamp, e.g. what happens for old users who have events without derived_tstamp values.
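One possible fallback for that migration question, sketched with hypothetical types: prefer derived_tstamp for root_tstamp, but keep collector_tstamp for old events that never had one set:

```scala
import java.time.Instant

// Hypothetical, trimmed-down view of an enriched event's timestamps.
final case class EventTstamps(collectorTstamp: Instant, derivedTstamp: Option[Instant])

object RootTstampSketch {
  // Use derived_tstamp when present, otherwise fall back to collector_tstamp.
  def rootTstamp(event: EventTstamps): Instant =
    event.derivedTstamp.getOrElse(event.collectorTstamp)
}
```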

RDB Loader: add a message for failed consistency check

The consistency check algorithm makes 5 attempts to compare lists of S3 keys. If the 5th comparison still gives an inconsistent result, the loader just continues to load. We can warn users that it failed.

It won't fix anything, but at least users will be aware of why they receive cryptic Redshift errors.
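A sketch of the warning being asked for, with made-up names: retry the key comparison up to five times, and if the listing still disagrees, log an explicit warning before carrying on with the load:

```scala
import scala.annotation.tailrec

object ConsistencyCheckSketch {

  // Compare the expected keys against a fresh S3 listing up to `attemptsLeft`
  // times; warn (instead of staying silent) if they still disagree.
  @tailrec
  def check(expected: Set[String], listKeys: () => Set[String], attemptsLeft: Int = 5): Unit =
    if (listKeys() == expected) ()
    else if (attemptsLeft > 1) {
      Thread.sleep(30000) // wait before the next attempt
      check(expected, listKeys, attemptsLeft - 1)
    } else
      println(
        "WARNING: S3 listing is still inconsistent after 5 attempts; continuing to load. " +
          "Subsequent cryptic Redshift errors may be caused by missing keys."
      )
}
```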

RDB Loader: bump to 0.13.0

I decided to split changelog items into RDB Loader and RDB Shredder categories as per #27.

We can unite them later, when they start to share code.

RDB Loader: make logkey optional

--logkey is EmrEtlRunner-specific and doesn't make a lot of sense for Dataflow Runner or a CLI run, as everything will be printed to stdout anyway.
