aws_glue_etl_docker's Introduction

AWS Glue ETL in Docker and Jupyter

This project is a helper for creating scripts that run in AWS Glue, in Jupyter notebooks, and in Docker containers with spark-submit. Glue supports running Zeppelin notebooks against a dev endpoint, but for quick development you sometimes just want to run locally against a subset of data without paying to keep the dev endpoints running.

Glue Shim

Glue has specific methods to load and save data to S3 that won't work when running in a Jupyter notebook. The glueshim provides a higher-level API that works in both scenarios.

from pprint import pprint

from aws_glue_etl_docker import glueshim

shim = glueshim.GlueShim()

# Default arguments, used when no job parameters are supplied
params = shim.arguments({'data_bucket': "examples"})
pprint(params)


files = shim.get_all_files_with_prefix(params['data_bucket'], "data/")
print(files)

data = shim.load_data(files, 'example_data')
data.printSchema()
data.show()

shim.write_parquet(data, params['data_bucket'], "parquet", None, 'parquetdata')
shim.write_parquet(data, params['data_bucket'], "parquetpartition", "car", 'partitioneddata')

shim.write_csv(data, params['data_bucket'], "csv", 'csvdata')

shim.finish()
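
To illustrate the idea behind the shim, here is a minimal sketch of how one API could resolve output locations differently in Glue and locally. It is not the real GlueShim implementation: `MiniShim`, `running_in_glue`, `output_path`, and the local layout (`local_root/bucket/prefix`) are all assumptions made for illustration. The only outside fact it relies on is that Glue jobs ship with the `awsglue` package.

```python
import importlib.util


def running_in_glue() -> bool:
    """Guess the environment: assume we are in a Glue job if `awsglue` is importable."""
    return importlib.util.find_spec("awsglue") is not None


class MiniShim:
    """Hypothetical, stripped-down shim used only to show the dispatch pattern."""

    def __init__(self, in_glue=None, local_root="/data"):
        # Allow the caller to force a mode, otherwise auto-detect.
        self.in_glue = running_in_glue() if in_glue is None else in_glue
        self.local_root = local_root

    def output_path(self, bucket: str, prefix: str) -> str:
        # In Glue, target S3; locally, target a folder mapped into the container.
        # The local layout below is an assumption, not taken from the library.
        if self.in_glue:
            return f"s3://{bucket}/{prefix}"
        return f"{self.local_root}/{bucket}/{prefix}"
```

Calling code then asks the shim for a location and never branches on the environment itself, which is what lets the same script run in both places.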

Local environment

Running locally is easiest in a Docker container:

  1. Copy the data locally and map that folder into your Docker container at the /data// path.
  2. Start the Docker container, mapping your local notebook directory to /home/jovyan/work.

Example Docker command:

docker run -p 8888:8888 -v "$PWD/examples":/home/jovyan/work -v "$PWD":/data jupyter/pyspark-notebook

Installing package in Jupyter

import sys
!{sys.executable} -m pip install git+https://github.com/purecloudlabs/aws_glue_etl_docker

AWS Deployment

For deployment to AWS, this library must be packaged and put into S3. You can use the helper script deploytos3.sh to package and copy.

Usage:

./deploytos3.sh s3://example-bucket/myprefix/aws-glue-etl-jupyter.zip

Then, when starting the Glue job, use your S3 zip path in the Python library path configuration.

Bookmarks

The shim is currently set up to delete any data in the output folder so that if you run with bookmarks enabled and then need to reprocess the entire dataset and

Converting Workbook to Python Script

aws_glue_etl_docker can also be used as a CLI tool to clean up Jupyter metadata from a workbook or to convert it to a Python script.

Clean

The clean command opens all workbooks in a given path and removes any metadata, output, and execution information. This keeps the workbooks cleaner in source control.

aws_glue_etl_docker clean --path /dir/to/workbooks
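
For readers curious what this kind of cleaning involves, the sketch below strips the same three kinds of information from a notebook using only the standard library. It is not the tool's actual code: the function names `clean_notebook` and `clean_path` are hypothetical, and it assumes the nbformat 4 JSON layout (`cells`, `outputs`, `execution_count`, `metadata`).

```python
import json
from pathlib import Path


def clean_notebook(nb: dict) -> dict:
    """Strip metadata, outputs, and execution counts from a notebook dict."""
    nb["metadata"] = {}
    for cell in nb.get("cells", []):
        cell["metadata"] = {}
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb


def clean_path(path: str) -> None:
    """Clean every .ipynb file directly under `path`, rewriting each in place."""
    for nb_file in Path(path).glob("*.ipynb"):
        nb = json.loads(nb_file.read_text(encoding="utf-8"))
        cleaned = json.dumps(clean_notebook(nb), indent=1) + "\n"
        nb_file.write_text(cleaned, encoding="utf-8")
```

Because outputs and execution counts change on every run, removing them keeps diffs limited to real edits of the cell sources.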

Build

The build command opens all workbooks in a given path and converts them to Python scripts. Markdown cells are converted to multiline comments. The command skips any cell that contains #LOCALDEV and any line that starts with !, such as !{sys.executable} -m pip install git+https://github.com/purecloudlabs/aws_glue_etl_docker

aws_glue_etl_docker build --path /dir/to/workbooks
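
The conversion rules described above can be sketched roughly as follows. This is an illustration only, not the tool's implementation: `notebook_to_script` is a hypothetical name, the input is assumed to be a notebook already loaded as a dict in nbformat 4 layout, and "multiline comment" is assumed to mean a triple-quoted string.

```python
def notebook_to_script(nb: dict) -> str:
    """Convert a notebook dict to Python source, following the rules above."""
    parts = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat stores source as a list of lines
            source = "".join(source)
        if "#LOCALDEV" in source:
            continue  # cell is only meant for local development
        if cell.get("cell_type") == "markdown":
            # Note: markdown that itself contains triple quotes would break this.
            parts.append('"""\n' + source.strip() + '\n"""')
        elif cell.get("cell_type") == "code":
            # Drop shell escapes such as "!pip install ...".
            lines = [ln for ln in source.splitlines()
                     if not ln.lstrip().startswith("!")]
            if lines:
                parts.append("\n".join(lines))
    return "\n\n".join(parts) + "\n"
```

Tagging the pip-install cell with #LOCALDEV, or relying on the leading !, keeps notebook-only setup out of the script that Glue runs.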

aws_glue_etl_docker's People

Contributors

inindevevangelists
