TurboCow

TurboCow is a framework library for common data validation and enrichment tasks in Spark. You define your enrichment and validation steps in JSON, so you write less code for these common ETL activities.

Currently only Spark 1.5 is supported. We have plans to upgrade our cluster soon, and 1.6 and 2.0 updates will follow soon after.

Setup

To use this library in your SBT project, add the following line to your build.sbt:

libraryDependencies += "com.cars.bigdata" %% "turbocow" % "{VERSION}"

... where {VERSION} is the version you wish to use. List the tags in this project (git tag -n) to determine which one you want.

The ExampleApp.scala file in this project contains example code that you can use as a template to get started.

Overview

The framework supports "Actions" that are defined in the JSON file. Each action list works on one field of an input data set. The data set is processed row by row in an RDD. When those actions are finished, the RDD is converted to a DataFrame, which is then written out in Avro format. (Only Avro is currently supported.)

The code flow is as follows:

  1. Read in the input files (JSON format) into an RDD.
  2. Process the JSON files using ActionEngine.processJsonRDD(), which returns RDD[Map[String, String]].
     • This transforms the JSON into enriched & validated Map structures.
  3. You can transform the RDD into a DataFrame using the function convertEnrichedRDDToDataFrame().
  4. Further custom processing can occur on the DataFrame.
  5. When done, you can write out the DataFrame to Avro using AvroOutputWriter.write().
     • Before writing, you may wish to set default values according to the Avro schema, which can be done with the function DataFrameUtil.setDefaultValues().
  6. AvroOutputWriter.writeEnrichedRDD() writes out the DataFrame in Avro format.
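
To make the shape of the data at each step more concrete, here is a minimal Spark-free sketch in plain Scala. It does not call TurboCow or Spark: a Seq of Maps stands in for the RDD[Map[String, String]] from step 2, and a Seq of ordered rows stands in for the DataFrame. The names FlowSketch, schemaDefaults, and toRows, as well as the sample fields and default values, are hypothetical and chosen only for illustration.

```scala
object FlowSketch {
  // Stand-in for the output of step 2: one enriched Map per input record.
  // (In the framework this would be an RDD[Map[String, String]].)
  val enriched: Seq[Map[String, String]] = Seq(
    Map("id" -> "1", "make" -> "Honda"),
    Map("id" -> "2") // this record has no "make" field
  )

  // Hypothetical schema: field name paired with the default value to use
  // when a record lacks that field.
  val schemaDefaults: Seq[(String, String)] = Seq(
    "id"   -> "",
    "make" -> "unknown"
  )

  // Lay each map out as a row in schema order, filling in defaults for
  // missing fields. This mimics the idea behind steps 3 and 5, not their
  // actual implementation.
  def toRows(records: Seq[Map[String, String]]): Seq[Seq[String]] =
    records.map { rec =>
      schemaDefaults.map { case (name, default) => rec.getOrElse(name, default) }
    }
}
```

Calling FlowSketch.toRows(FlowSketch.enriched) yields one row per record with the columns in schema order, which is roughly the kind of tabular structure that gets written out at the end of the flow.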

Action Engine

The ActionEngine is what performs the row processing.

Actions

See notes/configuration-schema.json for detailed examples of each JSON action and how to write the JSON.

Lookup ("lookup")

Here's how lookup works:

The framework searches the config file for every Lookup action, determines exactly which fields to cache from which tables, and creates an in-memory map of Row objects keyed by the index fields needed. There are multiple maps, because different lookups may need different keys, but they all point to the same Row objects (to save memory). The cached tables are then broadcast to every Spark executor and 'queried' when a Lookup action is performed.
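
The idea of several indexes sharing the same row objects can be sketched in plain Scala as follows. This is only an illustration under stated assumptions, not the framework's code: CachedRow is a hypothetical stand-in for a Spark Row, the table contents are made up, and the broadcast step is omitted so the sketch runs without Spark.

```scala
object LookupCacheSketch {
  // Hypothetical stand-in for a cached Spark Row: column name -> value.
  final case class CachedRow(fields: Map[String, String])

  // Build one index over the given rows, keyed by the values of keyFields.
  // The index stores references to the existing row objects, not copies.
  def indexBy(rows: Seq[CachedRow], keyFields: Seq[String]): Map[Seq[String], CachedRow] =
    rows.map(row => keyFields.map(row.fields) -> row).toMap

  // Made-up table contents for the example.
  val rows: Seq[CachedRow] = Seq(
    CachedRow(Map("id" -> "1", "code" -> "A", "name" -> "alpha")),
    CachedRow(Map("id" -> "2", "code" -> "B", "name" -> "beta"))
  )

  // Two indexes with different keys over the same row objects.
  val byId: Map[Seq[String], CachedRow]   = indexBy(rows, Seq("id"))
  val byCode: Map[Seq[String], CachedRow] = indexBy(rows, Seq("code"))

  // "Query" an index the way a Lookup action would query the cache.
  def lookup(index: Map[Seq[String], CachedRow], key: Seq[String]): Option[CachedRow] =
    index.get(key)
}
```

Because both indexes hold references to the same CachedRow instances, adding another key costs one extra map of references rather than another copy of the table.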

In the "onPass" and "onFail" sections, you can run any action list. You can even reject the record in "onFail" (or even in "onPass") by adding a "reject" actionType (see below).

Check

This is the validation action. It supports validation using the following operators:

Unary operators:

  • empty
  • non-empty
  • null
  • non-null
  • numeric
  • non-numeric

Binary operators (two operands):

  • equals
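
As a rough illustration of how these operators could behave, here is a small plain-Scala sketch. The semantics are assumptions made for the example and may differ from the framework's: a missing or null field is modeled as None, "empty" is taken to mean missing or whitespace-only, "numeric" is taken to mean an optionally signed decimal number, and each "non-" operator is taken to be the negation of its counterpart. The names CheckSketch, unaryOps, and equalsOp are hypothetical.

```scala
object CheckSketch {
  // A field value; None models a missing or null field (assumption).
  type Field = Option[String]

  // Assumed semantics for the three base unary operators.
  val isNullField: Field => Boolean    = _.isEmpty
  val isEmptyField: Field => Boolean   = _.forall(_.trim.isEmpty)
  val isNumericField: Field => Boolean = _.exists(_.trim.matches("-?\\d+(\\.\\d+)?"))

  // Negate a predicate; used for the "non-" variants.
  private def not(p: Field => Boolean): Field => Boolean = f => !p(f)

  // Unary operators, looked up by the names used in the list above.
  val unaryOps: Map[String, Field => Boolean] = Map(
    "empty"       -> isEmptyField,
    "non-empty"   -> not(isEmptyField),
    "null"        -> isNullField,
    "non-null"    -> not(isNullField),
    "numeric"     -> isNumericField,
    "non-numeric" -> not(isNumericField)
  )

  // Binary operator: the field must be present and equal to the operand.
  def equalsOp(field: Field, operand: String): Boolean =
    field.exists(_ == operand)
}
```

For example, CheckSketch.unaryOps("numeric")(Some("-12.5")) passes under these assumptions, while CheckSketch.unaryOps("numeric")(Some("12a")) does not.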

DataFrames

Note that the project now supports using DataFrames for processing. For projects that started out using the ActionEngine and RDDs, there is a function that allows you to transform the enriched RDD into a DataFrame (convertEnrichedRDDToDataFrame()), and then you can process the enriched data as a DataFrame.

Reject ("reject")

Rejection is a separate action, but it has ramifications in the framework beyond its immediate effect in the action chain. When a reject action runs... TODO

Testing

When writing tests for your application code or the framework code, add "import test.SparkTestContext._" to get access to global Spark, SQL, and Hive contexts for use in your tests. Do not create new contexts. Sharing the global contexts allows you to separate the tests into different files.
