
snowplow-rdb-loader's Introduction

Relational Database Loader

Build Status Release License

Introduction

This project contains applications required to load Snowplow data into various data warehouses.

It consists of two types of applications: Transformers and Loaders.

Transformers

Transformers read Snowplow enriched events, transform them into a format ready to be loaded into a data warehouse, and write them to the respective blob storage.

There are two types of Transformers: Batch and Streaming.

Stream Transformer

Stream Transformers read enriched events from the respective stream service, transform them, and write the transformed events to a specified blob storage path. They write transformed events in periodic windows.

There are three Stream Transformer applications: Transformer Kinesis, Transformer Pubsub and Transformer Kafka, the variants for AWS, GCP and Azure respectively.
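The periodic-window idea can be illustrated with a small fs2 / cats-effect sketch (libraries this stack is built on). Everything below (the event source, the transformation and the write step) is a hypothetical stand-in, not the actual transformer code:

```scala
import scala.concurrent.duration._
import cats.effect.{IO, IOApp}
import fs2.{Chunk, Stream}

object WindowedTransformSketch extends IOApp.Simple {

  // Hypothetical stand-in for a stream of enriched events.
  def enrichedEvents: Stream[IO, String] =
    Stream.awakeEvery[IO](100.millis).map(d => s"enriched-event-$d")

  // Hypothetical stand-in for the real transformation.
  def transform(event: String): String = event.toUpperCase

  // Hypothetical stand-in for writing one window of output to blob storage.
  def writeWindow(events: Chunk[String]): IO[Unit] =
    IO.println(s"writing ${events.size} transformed events to blob storage")

  def run: IO[Unit] =
    enrichedEvents
      .map(transform)
      .groupWithin(10000, 5.minutes) // close a window after 5 minutes or 10000 events
      .evalMap(writeWindow)
      .compile
      .drain
}
```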

Batch Transformer

The Batch Transformer is a Spark job that works only with AWS services. It reads enriched events from a given S3 path, transforms them, and writes the transformed events to a specified S3 path.
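As a rough illustration of that batch shape, here is a minimal Spark sketch; the S3 paths and the transformation itself are placeholders, not the real shredder logic:

```scala
import org.apache.spark.sql.SparkSession

object BatchTransformSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("batch-transformer-sketch").getOrCreate()
    import spark.implicits._

    spark.read
      .textFile("s3://my-bucket/enriched/archive/") // hypothetical input path
      .map(_.toUpperCase)                           // stand-in for the real transformation
      .write
      .text("s3://my-bucket/transformed/")          // hypothetical output path

    spark.stop()
  }
}
```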

Loaders

After a Transformer finishes transforming a batch and writing it to blob storage, it sends a message to a message queue. This message describes the transformed data, such as where it is stored and what it looks like.

Loaders subscribe to the message queue. When a message is received, it is parsed and the necessary details are extracted in order to load the transformed events into the destination. Loaders construct the necessary SQL statements to load the transformed events, then send these SQL statements to the specified destination.

At the moment, we have loader applications for Redshift, Databricks and Snowflake.
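To make the flow concrete, here is a minimal, hypothetical sketch of a loader step: parse the queue message for the location of the transformed batch, build a load statement (a Redshift COPY in this example), and run it over JDBC. None of these names, paths or the IAM role come from the actual loader codebase:

```scala
import java.sql.DriverManager

// Hypothetical, simplified shape of the queue message described above.
final case class ShreddingComplete(base: String)

object LoaderSketch {

  // Hypothetical parser: assume the message carries the blob storage prefix of the batch.
  def parseMessage(raw: String): ShreddingComplete =
    ShreddingComplete(base = raw.trim)

  // Build the warehouse-specific load statement (Redshift COPY shown here; the
  // IAM role ARN is a placeholder).
  def copyStatement(message: ShreddingComplete): String =
    s"COPY atomic.events FROM '${message.base}/output=good/' " +
      "IAM_ROLE 'arn:aws:iam::000000000000:role/redshift-load-role' " +
      "FORMAT AS JSON 'auto'"

  def load(jdbcUrl: String, rawMessage: String): Unit = {
    val connection = DriverManager.getConnection(jdbcUrl)
    try connection.createStatement().execute(copyStatement(parseMessage(rawMessage)))
    finally connection.close()
  }
}
```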

Find out more

Technical Docs | Setup Guide | Roadmap & Contributing

Copyright and license

Copyright (c) 2012-present Snowplow Analytics Ltd. All rights reserved.

Licensed under the Snowplow Limited Use License Agreement. (If you are uncertain how it applies to your use case, check our answers to frequently asked questions.)

snowplow-rdb-loader's People

Contributors

alexanderdean, alexbenny, benfradet, benjben, chuwy, colmsnowplow, dennisatspaceape, dilyand, dkbrkjni, drphrozen, fblundun, github-actions[bot], istreeter, kingo55, lmath, marcin-j, oguzhanunlu, pondzix, rgabo, richo, saj1th, scala-steward, shermozle, spenes, staymanhou, stdfalse, voropaevp, zcei


snowplow-rdb-loader's Issues

Scala Tracker problem when running inside Docker container

Here's the error:

RDB Loader successfully completed following steps: [Discover, Load, Analyze]
INFO: Logs successfully dumped to S3 [s3://foo]
[INFO] [08/22/2017 17:58:18.018] [snowplow-scala-tracker-akka.actor.default-dispatcher-4] [akka://snowplow-scala-tracker/user/IO-HTTP/group-0] Message [spray.can.Http$Connect] from Actor[akka://snowplow-scala-tracker/user/IO-HTTP/host-connector-0/0#52160612] to Actor[akka://snowplow-scala-tracker/user/IO-HTTP/group-0#1063404807] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
Exception in thread "Thread-2" akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://snowplow-scala-tracker/user/IO-HTTP#1209416068]] after [1000 ms]
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
        at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
        at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
        at java.lang.Thread.run(Thread.java:748)

Allow for setting wlm_query_slot_count for Redshift imports

Redshift/PostgreSQL has a massive number of knobs to turn for optimizing query performance, and one recommended by AWS is increasing wlm_query_slot_count in a session before running a query that needs extra CPU and memory.

Docs:
http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html

It would be awesome to allow users to set this in the Redshift config. We saw massive improvements (10x faster vacuums) when increasing ours to 45, but for now we need to do that manually.
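A hedged sketch of what this could look like from the loader side: run SET wlm_query_slot_count for the session before the heavy statement and reset it afterwards. The SQL is standard Redshift; the surrounding Scala names are made up:

```scala
import java.sql.Connection

object QuerySlotSketch {

  // Temporarily raise wlm_query_slot_count for this session, run the block,
  // then drop back to the default of 1.
  def withQuerySlots[A](connection: Connection, slots: Int)(block: => A): A = {
    val statement = connection.createStatement()
    try {
      statement.execute(s"SET wlm_query_slot_count TO $slots;")
      block
    } finally {
      statement.execute("SET wlm_query_slot_count TO 1;")
      statement.close()
    }
  }
}
```

Usage would be something like `withQuerySlots(connection, 5) { runVacuum(connection) }`, where `runVacuum` is whatever heavy statement (VACUUM, COPY) needs the extra slots.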

Use derived_tstamp as the primary tstamp in Redshift

Currently we use the collector_tstamp for the root_tstamp value.
It would be preferable to use the derived_tstamp once all our client-side trackers support generating a dvce_sent_tstamp, because at that point, from an analytics perspective, you're only interested in the derived_tstamp.

We need to figure out how we migrate from collector_tstamp to derived_tstamp, e.g. what happens for old users who have events without derived_tstamp values.
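One possible fallback for that migration question, sketched with hypothetical types: prefer derived_tstamp for root_tstamp, but keep collector_tstamp for old events that never had one set:

```scala
import java.time.Instant

// Hypothetical, trimmed-down view of an enriched event's timestamps.
final case class EventTstamps(collectorTstamp: Instant, derivedTstamp: Option[Instant])

object RootTstampSketch {
  // Use derived_tstamp when present, otherwise fall back to collector_tstamp.
  def rootTstamp(event: EventTstamps): Instant =
    event.derivedTstamp.getOrElse(event.collectorTstamp)
}
```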

RDB Loader: add a message for failed consistency check

The consistency check algorithm makes 5 attempts to compare lists of S3 keys. If the 5th comparison still gives an inconsistent result, the loader just continues to load. We can warn users that it failed.

It won't fix anything, but at least users will be aware of why they receive cryptic Redshift errors.
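A sketch of the warning being asked for, with made-up names: retry the key comparison up to five times, and if the listing still disagrees, log an explicit warning before carrying on with the load:

```scala
import scala.annotation.tailrec

object ConsistencyCheckSketch {

  // Compare the expected keys against a fresh S3 listing up to `attemptsLeft`
  // times; warn (instead of staying silent) if they still disagree.
  @tailrec
  def check(expected: Set[String], listKeys: () => Set[String], attemptsLeft: Int = 5): Unit =
    if (listKeys() == expected) ()
    else if (attemptsLeft > 1) {
      Thread.sleep(30000) // wait before the next attempt
      check(expected, listKeys, attemptsLeft - 1)
    } else
      println(
        "WARNING: S3 listing is still inconsistent after 5 attempts; continuing to load. " +
          "Subsequent cryptic Redshift errors may be caused by missing keys."
      )
}
```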

RDB Loader: bump to 0.13.0

I decided to split changelog items into RDB Loader and RDB Shredder categories as per #27.

We can unite them later, when they start to share code.

RDB Loader: make logkey optional

--logkey is EmrEtlRunner-specific and doesn't make a lot of sense for Dataflow Runner or a CLI run, as everything will be printed to stdout anyway.
