Skytrax Data Warehouse

A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase to serve data visualization needs such as analytical dashboards.


Architecture

(Architecture diagram)

The data warehouse consists of several modules, described below:

Overview

Data is obtained from here. The collected data is stored on local disk and periodically moved to the landing bucket on AWS S3. ETL jobs are written in SQL and scheduled in Airflow to run every hour, keeping the data in the cloud warehouse fresh.
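As a rough sketch, the hourly schedule can be expressed in an Airflow DAG definition like the one below. Only the dag_id matches the DAG shown later in the Airflow UI; the owner, start date, and placeholder tasks are illustrative, not the project's actual code.

```python
# Minimal sketch of an hourly Airflow DAG (Airflow 1.10-style imports).
# Only the dag_id is taken from the project; the rest is illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "owner": "skytrax",
    "start_date": datetime(2020, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="skytrax_etl_pipeline",
    default_args=default_args,
    schedule_interval="@hourly",  # keeps the warehouse data fresh every hour
    catchup=False,
)

start = DummyOperator(task_id="begin_execution", dag=dag)
stop = DummyOperator(task_id="stop_execution", dag=dag)

start >> stop
```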

Data Modeling

The following fact and dimension tables are created (a DDL sketch follows the lists):

Dimension Tables

aircrafts
airlines
passengers
airports
lounges

Fact Tables

fact_ratings
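As a rough illustration of the star schema, the snippet below shows how the Redshift DDL for one dimension table and the fact table might look when kept as Python string constants for the Airflow tasks to execute. Column names and types are assumptions; the project's own SQL defines the real schema.

```python
# Illustrative Redshift DDL kept as Python constants; column names/types
# are assumptions, not the project's actual schema.
CREATE_AIRLINES_DIM = """
CREATE TABLE IF NOT EXISTS airlines (
    airline_id   INT IDENTITY(0, 1) PRIMARY KEY,
    airline_name VARCHAR(256) NOT NULL
);
"""

CREATE_FACT_RATINGS = """
CREATE TABLE IF NOT EXISTS fact_ratings (
    rating_id      INT IDENTITY(0, 1) PRIMARY KEY,
    airline_id     INT REFERENCES airlines (airline_id),
    overall_rating NUMERIC(3, 1),
    review_date    DATE
)
DISTKEY (airline_id)
SORTKEY (review_date);
"""
```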

ETL Flow

  • Data collected from here is moved to the landing zone S3 buckets.
  • The ETL job's S3 module copies data from the landing zone to the staging area in Redshift.
  • Once the data is in Redshift, an Airflow task is triggered that reads the data from the staging area and applies transformations.
  • Using the Redshift staging tables, an UPSERT operation is performed on the dimension & fact data warehouse tables to update the data (see the sketch after this list).
  • The Airflow DAG runs data quality checks on the warehouse tables between the ETL steps to ensure the data is correct.
  • DAG execution completes once the data warehouse is updated.
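Redshift has no native UPSERT statement, so the usual pattern is to delete the matching rows and then insert from the staging table inside one transaction. The sketch below illustrates this for the ratings fact table; the table and column names are assumptions, not necessarily those used in the project's SQL.

```python
# Sketch of the staging-to-warehouse "UPSERT" pattern on Redshift:
# delete matching rows, then insert from staging, in one transaction.
# Table and column names are assumptions for illustration.
UPSERT_FACT_RATINGS = """
BEGIN;

DELETE FROM fact_ratings
USING staging_ratings
WHERE fact_ratings.rating_id = staging_ratings.rating_id;

INSERT INTO fact_ratings (rating_id, airline_id, overall_rating, review_date)
SELECT rating_id, airline_id, overall_rating, review_date
FROM staging_ratings;

END;
"""
```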

Environment Setup

Hardware Used

Redshift: a 2-node cluster of dc2.large instances.

Setting Up Infrastructure

Run the following commands in a terminal to set up the whole infrastructure locally:

  1. git clone https://github.com/iam-mhaseeb/Skytrax-Data-Warehouse
  2. cd Skytrax-Data-Warehouse
  3. Assuming the Docker service is installed and running, run docker-compose up. It will take some time to pull the latest images and set everything up automatically inside Docker.

Setting up Redshift

You can follow the AWS Guide to run a Redshift cluster.
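Alternatively, the 2-node dc2.large cluster described above can be scripted with boto3, roughly as below. The cluster identifier, credentials, region, and IAM role ARN are placeholders.

```python
# Hedged sketch of creating the 2-node dc2.large Redshift cluster with boto3.
# All identifiers, credentials, and the IAM role ARN are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

redshift.create_cluster(
    ClusterIdentifier="skytrax-cluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=2,
    DBName="skytrax",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe123",  # placeholder
    IamRoles=["arn:aws:iam::123456789012:role/redshift-s3-read"],  # placeholder
    PubliclyAccessible=True,
)
```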

How to run

Airflow

Make sure the Docker containers are running. Open the Airflow UI at http://localhost:8080 in a browser and set up the required connections.
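Connections can be added through the UI (Admin → Connections) or pre-created programmatically. The sketch below shows the programmatic route; the connection ids (aws_credentials, redshift) and all credentials are assumptions used for illustration.

```python
# Assumed connection ids and placeholder credentials; adjust to match the DAG.
from airflow import settings
from airflow.models import Connection

aws_conn = Connection(
    conn_id="aws_credentials",
    conn_type="aws",
    login="YOUR_AWS_ACCESS_KEY_ID",
    password="YOUR_AWS_SECRET_ACCESS_KEY",
)
redshift_conn = Connection(
    conn_id="redshift",
    conn_type="postgres",
    host="your-cluster.xxxxxxxx.us-west-2.redshift.amazonaws.com",  # placeholder
    schema="skytrax",
    login="awsuser",
    password="ChangeMe123",
    port=5439,
)

session = settings.Session()
session.add(aws_conn)
session.add(redshift_conn)
session.commit()
```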

You should be able to see the skytrax_etl_pipeline DAG, as in the pictures below:

(Skytrax pipeline DAG screenshot)

You can explore the DAG further in different views, as shown below:

DAG View: (screenshot)

DAG Tree View: (screenshot)

DAG Gantt View: (screenshot)

Metabase

Make sure the Docker containers are running. Open the Metabase UI at http://localhost:3000 in a browser and set up your Metabase account and database connection.

After the DAG has run successfully, you should be able to explore the data and build dashboards like the ones in the pictures below:

Dashboard 1: (screenshot)

Dashboard 2: (screenshot)

Scenarios

  • Data increases by 100x (whether the workload is read-heavy or write-heavy):

    • Redshift is an analytical database optimized for aggregation, and it also performs well for read-heavy workloads.
    • An EMR cluster can be introduced (or scaled up) to handle the bigger volume of data.
  • Pipelines need to run at 7 AM daily. How would the dashboard be updated? Would it still work?

    • The DAG is scheduled to run every hour and can be configured to run every morning at 7 AM if required.
    • Data quality operators are used at the appropriate positions, and email alerts can be configured so the team is notified of pipeline failures (see the sketch after this list).
  • Make the warehouse available to 100+ people:

    • We can set the concurrency limit for the Amazon Redshift cluster. The concurrency limit is 50 parallel queries at a time, but this is per cluster, so you can launch as many clusters as fit your business.
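A minimal sketch of those two tweaks (a 7 AM daily schedule and failure emails), assuming the same DAG definition as earlier; the email address is a placeholder.

```python
# Sketch: daily 7 AM schedule plus email alerts on task failure.
from datetime import datetime

from airflow import DAG

default_args = {
    "owner": "skytrax",
    "start_date": datetime(2020, 1, 1),
    "email": ["data-team@example.com"],  # placeholder address
    "email_on_failure": True,
    "email_on_retry": False,
}

dag = DAG(
    dag_id="skytrax_etl_pipeline",
    default_args=default_args,
    schedule_interval="0 7 * * *",  # every day at 07:00 instead of @hourly
    catchup=False,
)
```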

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details.
