
Sparkify S3 to AWS/Redshift ETL

A Lab Project on Cloud Data Warehousing

This lab project builds a data warehouse based on AWS Redshift. A data warehouse is a large store of data collected from a wide range of sources within a company and used to guide management decisions.

The lab exercises an Extract Transform Load (ETL) analytical pipeline for a fictional music streaming startup called Sparkify.

The data are extracted from two datasets residing in Amazon S3, then transformed and loaded into a Redshift cluster, as depicted below.

Fig.1 - Sparkify S3 to Redshift ETL

Project Datasets

The pipeline's input data come from two datasets located in Amazon S3:

  • Song Dataset: song metadata in JSON format.
    • s3://udacity-dend/song_data
  • Log Dataset: log files of user activity, also in JSON format, plus a JSONPaths file that describes how to parse them.
    • s3://udacity-dend/log_data
    • s3://udacity-dend/log_json_path.json

Extract-Transform-Load Pipeline

The setup and operation of the ETL pipeline are controlled by four Python scripts (create_cluster.py, create_tables.py, etl.py, delete_cluster.py), a Python module (sql_queries.py), and a configuration file (myDWH.cfg).

During ETL processing, the data are loaded from the S3 datasets above into staging tables on the Redshift cluster. The star schema shown below, suited for data analytics, is then built from SQL statements defined in sql_queries.py and executed by the script etl.py.

Fig.2 - Star schema
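
To make this concrete, here is a minimal sketch of the two kinds of statements sql_queries.py might define: a COPY that fills a staging table from S3 (using the JSONPaths file listed above), and an INSERT ... SELECT that populates a star-schema table from staging. All table, column, and role names below are illustrative assumptions, not the project's actual schema.

```python
# Illustrative sketch only -- table/column names and the IAM role are assumptions.

# Load the raw log data from S3 into a staging table.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '<redshift-s3-read-role-arn>'
    REGION 'us-west-2'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""

# Populate a dimension table of the star schema from the staging table.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE userId IS NOT NULL;
"""
```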

How To Run the ETL Pipeline

STEP-1: AWS access key and secret key

  • Add your AWS access key and AWS secret key to the myDWH.cfg file, or configure them via the AWS CLI with aws configure.
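
For reference, a minimal sketch of how the scripts might read these credentials with configparser (the section and key names are assumptions; match them to your myDWH.cfg):

```python
import configparser

# Read AWS credentials from the config file.
# The section/key names ('AWS', 'KEY', 'SECRET') are assumptions.
config = configparser.ConfigParser()
config.read("myDWH.cfg")

KEY = config.get("AWS", "KEY")        # AWS access key id
SECRET = config.get("AWS", "SECRET")  # AWS secret access key
```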

STEP-2: Create a Redshift cluster on AWS

Note: Before you create a cluster as defined in the section '[CLUSTER]' of the myDWH.cfg file, you may want to specify the AWS region that you want to operate in.

  • Create a new cluster by running the command:
    • python create_cluster.py.
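
Under the hood, create_cluster.py presumably calls the Redshift API via boto3, roughly as in this sketch. All parameter values should come from the '[CLUSTER]' section of myDWH.cfg; the ones shown here are placeholders.

```python
import boto3

# Region and all cluster parameters are placeholders -- take them from myDWH.cfg.
redshift = boto3.client("redshift", region_name="us-west-2")

redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    ClusterIdentifier="dwhcluster",
    DBName="dwh",
    MasterUsername="dwhuser",
    MasterUserPassword="<password-from-myDWH.cfg>",
    IamRoles=["<redshift-s3-read-role-arn>"],  # role that lets Redshift read S3
)
```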

STEP-3: Connect to Redshift cluster and create the DB tables

  • Run python create_tables.py.
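
A rough sketch of what create_tables.py likely does: connect with psycopg2 using the parameters from myDWH.cfg, then drop and recreate all tables from the query lists in sql_queries.py. The list names and config keys below are assumptions.

```python
import configparser
import psycopg2
from sql_queries import drop_table_queries, create_table_queries  # assumed names

config = configparser.ConfigParser()
config.read("myDWH.cfg")

# Connection parameters are assumed to live in a [CLUSTER] section.
conn = psycopg2.connect(
    host=config.get("CLUSTER", "HOST"),
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    port=config.get("CLUSTER", "DB_PORT"),
)
cur = conn.cursor()

# Drop any existing tables, then create them fresh.
for query in drop_table_queries + create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()
```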

STEP-4: Connect to Redshift cluster and populate the ETL tables

  • Run python etl.py.
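
etl.py presumably follows the same pattern: first run the COPY statements that fill the staging tables from S3, then the INSERT ... SELECT statements that build the star schema. Query-list names and config keys are again assumptions.

```python
import configparser
import psycopg2
from sql_queries import copy_table_queries, insert_table_queries  # assumed names

config = configparser.ConfigParser()
config.read("myDWH.cfg")

# Same connection parameters as in create_tables.py (section/keys assumed).
conn = psycopg2.connect(
    host=config.get("CLUSTER", "HOST"),
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    port=config.get("CLUSTER", "DB_PORT"),
)
cur = conn.cursor()

# Stage the raw S3 data first, then populate the star schema from it.
for query in copy_table_queries + insert_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()
```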

STEP-5: Try some queries on the cluster

Open a terminal and connect to the database of the Redshift cluster with the command:

  • psql -h <CLUSTER_ENDPOINT> -p <CLUSTER_DB_PORT> -U <CLUSTER_DB_USER> -d <CLUSTER_DB_NAME>
    • Use the connection parameters defined in myDWH.cfg.
    • When prompted for a password, use the one defined in myDWH.cfg.
  • Example:
    • psql -h dwhcluster.c4p6b3uqdbp8.us-west-2.redshift.amazonaws.com -p 5439 -U dwhuser -d dwh
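  • Once connected, you can sanity-check the load with a simple query, e.g. SELECT COUNT(*) FROM songplays; (assuming songplays is the name of the fact table in the star schema of Fig.2).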

STEP-6: Pause your cluster

During development, you can put your cluster into a 'Paused' state via the AWS Redshift console. While paused, you are not charged for compute (storage charges still apply), and you can resume the cluster quickly from the same console.
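
The same pause/resume operations are also available programmatically via boto3, as in this sketch (the cluster identifier is a placeholder):

```python
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")  # your region

# Pause the cluster during idle periods...
redshift.pause_cluster(ClusterIdentifier="dwhcluster")

# ...and resume it when you need it again.
redshift.resume_cluster(ClusterIdentifier="dwhcluster")
```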

STEP-7: Delete your Redshift cluster on AWS

  • If you no longer need your cluster, you can delete it by running the command:
    • python delete_cluster.py.
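
delete_cluster.py presumably issues the corresponding boto3 call, roughly:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")  # your region

# Skipping the final snapshot is an assumption -- adjust if you want a backup.
redshift.delete_cluster(
    ClusterIdentifier="dwhcluster",  # placeholder -- take it from myDWH.cfg
    SkipFinalClusterSnapshot=True,
)
```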

