
Daily incremental load ETL pipeline for an e-commerce company using AWS Lambda and Apache Airflow

Pipeline Diagram

Daily incremental load ETL pipeline for an e-commerce company using AWS Lambda and an AWS EMR cluster, deployed with Apache Airflow. Here, we extract the daily data from an OLTP database (PostgreSQL) by date and load it into AWS S3 buckets. The new object in S3 triggers a Lambda function, which adds a job-flow step to the EMR cluster for processing and transformation, and the processed data is saved back to S3.

STEP 1 : Extracting data from PostgreSQL and loading it into AWS S3 buckets

  • For extraction I used an Airflow DAG to pull the data and load it into S3. Daily data has to be extracted from six SQL tables, filtered by the date each row was inserted into the database (I have only considered the orders and order-items tables for the daily incremental load); a minimal sketch of such a DAG follows the diagram below.
  • Airflow was deployed using Docker along with all of its other components (such as Redis and a Postgres database for metadata).
  • Because the incremental load is for e-commerce data, I scheduled the DAG to run daily at 11 PM.

Airflow DAG for daily incremental load
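A minimal sketch of such a DAG, assuming illustrative connection IDs, table and column names, and bucket/key layout (the repository's actual DAG may differ):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract_orders_to_s3(ds, **_):
    # ds is the run's logical date (YYYY-MM-DD); pull only rows inserted that day.
    pg = PostgresHook(postgres_conn_id="oltp_postgres")  # assumed connection ID
    df = pg.get_pandas_df(
        "SELECT * FROM orders WHERE order_date::date = %s", parameters=[ds]
    )
    S3Hook(aws_conn_id="aws_default").load_string(
        df.to_csv(index=False),
        key=f"raw/orders/{ds}/orders.csv",            # assumed key layout
        bucket_name="ecommerce-incremental-raw",      # assumed bucket name
        replace=True,
    )


with DAG(
    dag_id="daily_incremental_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 23 * * *",   # daily at 11 PM, as described above
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_orders_to_s3",
        python_callable=extract_orders_to_s3,
    )
```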

Note:

  • Rather than using another Postgres database as a dummy OLTP source, I used the Postgres database that Airflow keeps for its own metadata. It might save you some time.
  • A cron job could also be scheduled instead of Airflow DAGs to orchestrate the daily extraction tasks (Airflow is better for visualization and monitoring).

STEP 2 : Triggering the Lambda function

  • When the orders file gets loaded into the S3 bucket, it triggers a Lambda function.
  • This Lambda function submits a job step to the idle, long-running EMR cluster for processing and transformation of the data according to the requirement.

AWS-lambda-s3-trigger

Note

  • The trigger was configured so that any object-creation event in the S3 bucket invokes the Lambda function, written in Python, which submits a step to the EMR cluster using the command-runner.jar utility that AWS provides on EMR clusters; a sketch of such a handler follows below.
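A minimal sketch of such a handler, assuming an illustrative cluster ID, script location, and argument layout (not the repository's exact function):

```python
import boto3

emr = boto3.client("emr")

CLUSTER_ID = "j-XXXXXXXXXXXXX"  # assumed: ID of the idle, long-running EMR cluster


def lambda_handler(event, context):
    # The S3 event tells us which object was just created.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    response = emr.add_job_flow_steps(
        JobFlowId=CLUSTER_ID,
        Steps=[
            {
                "Name": f"transform-{key}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "s3://ecommerce-scripts/transform_orders.py",  # assumed script path
                        f"s3://{bucket}/{key}",                        # input landed by the DAG
                    ],
                },
            }
        ],
    )
    return {"StepIds": response["StepIds"]}
```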

STEP 3 : Writing the transformed data back to S3

  • After the PySpark script running on EMR finishes processing and transforming the data, it writes the transformed data back to S3 (see the sketch after this list).
  • The data is saved according to the extraction date so that it is easier to pick up a given day's data for further loading into a data warehouse (such as Redshift, HBase, or Hive). In this way the data can be made available to dashboarding teams for their client-facing analytics platforms.
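A minimal PySpark sketch of that step, with paths and column names as illustrative assumptions: read the raw orders file, apply a transformation, and write the result back to S3 partitioned by extraction date so downstream loads can pick it up day by day.

```python
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform_orders").getOrCreate()

input_path = sys.argv[1]  # e.g. the S3 object passed in by the Lambda step
output_path = "s3://ecommerce-incremental-processed/orders/"  # assumed output bucket

df = spark.read.option("header", True).csv(input_path)

transformed = (
    df.withColumn(
        "order_total",
        F.col("quantity").cast("int") * F.col("unit_price").cast("double"),
    )
    .withColumn("extraction_date", F.to_date(F.col("order_date")))
)

# Partitioning by extraction_date keeps each day's load in its own S3 prefix.
transformed.write.mode("append").partitionBy("extraction_date").parquet(output_path)

spark.stop()
```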

Future scope

  • Rather than using EMR clusters, I could write a Glue job that gets triggered by the Lambda function and then load the result into the Redshift data warehouse using the COPY command (a sketch of the triggering side follows below).
  • By using Glue I would pay only for the resources the Glue script actually uses rather than provisioning many EMR cluster nodes, saving the cost of unused EC2 instances. (Will try this in the next commit!)
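A minimal sketch of that alternative, with the Glue job name and argument names as assumed placeholders: the same S3-triggered Lambda could start a Glue job run instead of adding an EMR step.

```python
import boto3

glue = boto3.client("glue")


def lambda_handler(event, context):
    record = event["Records"][0]["s3"]
    run = glue.start_job_run(
        JobName="ecommerce-incremental-transform",  # assumed Glue job name
        Arguments={
            "--input_path": f"s3://{record['bucket']['name']}/{record['object']['key']}",
        },
    )
    return {"JobRunId": run["JobRunId"]}
```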

Ideas for improvement

  • Data validation and data quality check jobs could be introduced after processing and loading of the data; a minimal sketch of such a check follows below.
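A minimal sketch of a post-load data quality check, with the path, column name, and thresholds as illustrative assumptions: fail the step if the day's output is empty or contains null order IDs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq_check_orders").getOrCreate()

# Assumed partition path written by the transformation step above.
df = spark.read.parquet(
    "s3://ecommerce-incremental-processed/orders/extraction_date=2023-01-01/"
)

row_count = df.count()
null_ids = df.filter(F.col("order_id").isNull()).count()

if row_count == 0 or null_ids > 0:
    raise ValueError(
        f"Data quality check failed: rows={row_count}, null order_ids={null_ids}"
    )

spark.stop()
```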
