
forklift's Introduction

Forklift

Welcome to Forklift

If you're building a 21st-century data pipeline on Hadoop and orchestrating it with Airflow, you'll quickly become overwhelmed. How do you move a table from Redshift to HDFS? How do you clean up data in S3? How do you convert from TSV to Parquet? How do you quickly rename columns? Forklift solves all of these problems. It's the first ETL (Extract-Transform-Load) suite for Hadoop and Airflow. So sit back, clone the repo, and let Forklift do the heavy lifting.

Setup

  1. Run pip install -e git+ssh://git@github.com/jerzygangi/forklift.git on all nodes in the cluster
  2. Launch pyspark
  3. Type from forklift import Forklift
  4. Have fun!

What can Forklift do?

Move data around

A common ETL task is to move data between stores in your warehouse or lake. HDFS, Redshift, SparkSQL, PostgreSQL, S3, CSV/TSV, and Parquet are all natively supported. Watch how easy Forklift.move() makes it to Move Data:

# Let's move some data from Redshift into HDFS
move_from = {"url": "jdbc:redshift://db1.example.com/cars"}
move_to = {"url": "hdfs:///warehouse/cars.parquet"}

# Make a forklift, and move it
from forklift import Forklift
Forklift(sqlContext).move(move_from, move_to)
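
Converting between formats is just another move. Below is a minimal sketch, assuming the S3 bucket and paths are placeholders and that Forklift selects the right adapter from each URL; any TSV-specific options your files need can be discovered with the adapter option listings below.

# Hypothetical example: convert a TSV file on S3 into Parquet on HDFS
move_from = {"url": "s3://my-bucket/raw/cars.tsv"}
move_to = {"url": "hdfs:///warehouse/cars.parquet"}

from forklift import Forklift
Forklift(sqlContext).move(move_from, move_to)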

The arguments accepted by each warehouse adapter can be listed as follows:

# For CSV and TSV files
from forklift.adapters.dsv import DSVAdapter
DSVAdapter.read_options()
DSVAdapter.write_options()

# For Parquet files in HDFS or S3
from forklift.adapters.parquet import ParquetAdapter
ParquetAdapter.read_options()
ParquetAdapter.write_options()

# For Postgres tables
from forklift.adapters.postgresql import PostgreSQLAdapter
PostgreSQLAdapter.read_options()
PostgreSQLAdapter.write_options()

# For Redshift tables
from forklift.adapters.redshift import RedshiftAdapter
RedshiftAdapter.read_options()
RedshiftAdapter.write_options()

# For SparkSQL tables
from forklift.adapters.sparksql import SparkSQLAdapter
SparkSQLAdapter.read_options()
SparkSQLAdapter.write_options()

Read & write dataframes

The same engine, Warehouse, that powers Forklift.move() can be used to quickly Read and Write DataFrames:

# Tell Forklift where to save the DataFrame
save_to = {"url": "hdfs:///warehouse/cars.parquet"}

# And save it
from forklift.warehouse import Warehouse
Warehouse().write(my_dataframe, save_to)
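
Reading works through the same style of location dict. The sketch below assumes Warehouse also exposes a read() counterpart to write(); that method name is an assumption of this sketch, not something documented above.

# Assumption: Warehouse offers a read() counterpart to write()
read_from = {"url": "hdfs:///warehouse/cars.parquet"}
my_dataframe = Warehouse().read(read_from)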

Sanitize and normalize a dataframe

Another common ETL task is to force a myriad of schemas into a common schema, casting datatypes along the way. In Forklift, this process is called Normalize and Sanitize:

# Define what columns to include, and what to name them
cat ~/my_remappings.json
{
  "remappings":{
    "timestamp": "date",
    "url": "domain",
    "clicks": "clicks"
  }
}

and in pyspark:

# Create a dataframe
df1 = sqlContext.read.parquet("hdfs:///website_data.parquet")

# Create a Spark output schema
from pyspark.sql.types import *
def my_schema():
  return StructType([
    StructField("date", DateType(), True),
    StructField("domain", StringType(), True),
    StructField("clicks", LongType(), True)
  ])

# Tell Forklift how you want to cast datatypes
from decimal import Decimal
from forklift.cell_caster import CastProcessor

class MyCaster(CastProcessor):
	# Applied to every cell: turn floats into exact Decimals
	def cast_all_cells(self, value):
		if isinstance(value, float):
			return Decimal(value)
		else:
			return value
	# Column-specific cast for the "domain" column: prepend a scheme
	def cast_domain(self, domain):
		return "http://" + domain

# Normalize and sanitize the dataframe
from forklift import Forklift
from forklift.stages import *
fork = Forklift(sqlContext)
df1_normalized_and_sanitized = fork.normalize_and_sanitize(df1, my_schema(), "~/my_remappings.json", MyCaster, [NS_ALL])
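
The result is an ordinary Spark dataframe, so it can go straight back through the Warehouse writer shown earlier; in this sketch the output path is a placeholder.

# Persist the normalized and sanitized dataframe (placeholder path)
from forklift.warehouse import Warehouse
save_to = {"url": "hdfs:///warehouse/website_data_clean.parquet"}
Warehouse().write(df1_normalized_and_sanitized, save_to)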

Make an Excel file from dataframes

XLSBuilder builds an Excel file with a tab for each Spark dataframe:

# Each addTab() call turns one dataframe into one worksheet
xls = XLSBuilder("my_file.xls")
df1 = sqlContext.read.parquet("hdfs:///data1")
xls.addTab(df1, "Data 1")
df2 = sqlContext.read.parquet("hdfs:///data2")
xls.addTab(df2, "Data 2")
# Collect the dataframes on the master node and write the .xls file
xls.build()

N.B. XLSBuilder collects and writes dataframes on the master node of your Spark cluster. If you attempt to use large dataframes with XLSBuilder, undefined behavior may occur. This is not considered a bug, since desktop usage of Excel is itself limited to small .xls files of only a few megabytes. Therefore, XLSBuilder should only be used with small dataframes.
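
One way to stay within those limits is to cap the row count before handing a dataframe to XLSBuilder; the sketch below uses standard Spark, and the 10,000-row cap is an arbitrary example.

# Cap the rows collected to the master node (10,000 is an arbitrary example)
small_df = sqlContext.read.parquet("hdfs:///data1").limit(10000)
xls = XLSBuilder("my_small_file.xls")
xls.addTab(small_df, "Sample")
xls.build()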

forklift's People

Contributors

emmanueljob, jerzygangi, jlmille4

forklift's Issues

Decorate is deleting data

Currently, the decorate functionality uses an inner join, so any source rows without a matching decoration are silently dropped. This is illogical, since you never want to delete data from the source.

Therefore, change the inner join to a left join.
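
For illustration only (this is not Forklift's internal code), the difference between the two joins in plain PySpark looks like this; source_df, lookup_df, and the "id" key are hypothetical.

# Hypothetical dataframes: source_df holds the rows being decorated,
# lookup_df holds the decoration attributes, joined on "id"

# Inner join: source rows with no match in lookup_df are dropped
decorated_inner = source_df.join(lookup_df, on="id", how="inner")

# Left join: every source row is kept; unmatched rows get NULL decoration columns
decorated_left = source_df.join(lookup_df, on="id", how="left")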
