
tickit-data-lake-demo's Introduction

DevOps, GitOps, and DataOps for Analytics on AWS

Source code for the following blogs and videos

Video demonstration: Building a Simple Data Lake on AWS. Build a simple data lake on AWS using a combination of services, including Amazon MWAA, AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, and Amazon S3.

Video demonstration: Building a Data Lake with Apache Airflow. Programmatically build a simple Data Lake on AWS using Amazon Managed Workflows for Apache Airflow, AWS Glue, and Amazon Athena.

Video demonstration: Lakehouse Automation on AWS with Apache Airflow. Programmatically load data into and unload data from Amazon Redshift using Apache Airflow.

Blog post: DevOps for DataOps: Building a CI/CD Pipeline for Apache Airflow DAGs. Build an effective CI/CD pipeline to test and deploy your Apache Airflow DAGs to Amazon MWAA using GitHub Actions.

Architectures/Workflows

DevOps for DataOps

Data Lake Architecture

Building a Data Lake with Apache Airflow

Data Lake Architecture

Lakehouse Automation on AWS with Apache Airflow

Redshift Architecture

TICKIT Sample Database

Amazon Redshift TICKIT Sample Database

Instructions for "Building a Simple Data Lake on AWS"

TICKIT Tables

  • tickit.saas.category
  • tickit.saas.event
  • tickit.saas.venue
  • tickit.crm.users
  • tickit.date
  • tickit.listing
  • tickit.sales

Naming Conventions

+-------------+--------------------------------------------------------------------+
| Prefix      | Description                                                        |
+-------------+--------------------------------------------------------------------+
| _source     | Data Source metadata only (was _raw in the video)                  |
| _raw        | Raw/Bronze data from data sources (was _converted in the video)    |
| _refined    | Refined/Silver data - raw data with initial ELT/cleansing applied  |
| _aggregated | Gold/Aggregated data - aggregated/joined refined data              |
+-------------+--------------------------------------------------------------------+

AWS CLI Commands

Two small changes were made to the source code, relative to the video demonstration, to clarify the flow of data. The prefix of the seven data source AWS Glue Data Catalog tables was switched from raw_ to source_. Likewise, the prefix of the seven Raw/Bronze AWS Glue Data Catalog tables was switched from converted_ to raw_. The final data flow is 1) source_, 2) raw_, 3) refined_, and 4) agg_ (aggregated).
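
The data flow above maps onto table-name filters in the AWS Glue Data Catalog. The following is a minimal sketch, not part of the original demonstration. It assumes the tickit_demo database and reuses the `<prefix>_*` expression form that this README passes to `aws glue get-tables`. The function name `list_stage_tables` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original repo): list the Glue Data Catalog
# tables that belong to one stage of the data flow.
# Usage: list_stage_tables <source|raw|refined|agg> [database]
list_stage_tables() {
  local stage="$1"
  local database="${2:-tickit_demo}"  # assumed default, taken from this README

  # Accept only the four prefixes named in the data flow.
  case "$stage" in
    source|raw|refined|agg) ;;
    *) echo "unknown stage: ${stage}" >&2; return 1 ;;
  esac

  # Same expression form as the "source_*" filter used in this README.
  aws glue get-tables \
    --database-name "$database" \
    --expression "${stage}_*" \
    --query "TableList[].Name" \
    --output table
}
```

For example, `list_stage_tables refined` would list only the Refined/Silver tables, provided they have already been created.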

DATA_LAKE_BUCKET="your-data-lake-bucket"

aws s3 rm "s3://${DATA_LAKE_BUCKET}/tickit/" --recursive

aws glue delete-database --name tickit_demo

aws glue create-database \
  --database-input '{"Name": "tickit_demo", "Description": "Track sales activity for the fictional TICKIT web site"}'

aws glue get-tables \
  --database-name tickit_demo \
  --query "TableList[].Name" \
  --output table

aws glue start-crawler --name tickit_postgresql
aws glue start-crawler --name tickit_mysql
aws glue start-crawler --name tickit_mssql
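
`start-crawler` returns as soon as the crawl is requested, so the source_ tables may not exist yet when the next command runs. A minimal polling sketch, assuming the crawler names above. The function name `wait_for_crawler` and its default timings are hypothetical.

```shell
# Hypothetical helper: block until a Glue crawler reports the READY state.
# Usage: wait_for_crawler <crawler-name> [poll-seconds] [max-polls]
wait_for_crawler() {
  local name="$1"
  local interval="${2:-15}"   # assumed default poll interval, in seconds
  local max_polls="${3:-80}"  # assumed upper bound on the number of polls
  local polls=0
  local state

  while [ "$polls" -lt "$max_polls" ]; do
    # Crawler.State is one of RUNNING, STOPPING, or READY.
    state="$(aws glue get-crawler --name "$name" \
      --query "Crawler.State" --output text)" || return 1
    if [ "$state" = "READY" ]; then
      echo "${name}: READY"
      return 0
    fi
    polls=$((polls + 1))
    sleep "$interval"
  done

  echo "${name}: still ${state} after ${max_polls} polls" >&2
  return 1
}
```

Calling `wait_for_crawler tickit_postgresql`, and likewise for the other two crawlers, before listing the tables is one way to avoid reading a partially populated catalog.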

aws glue get-tables \
  --database-name tickit_demo \
  --query "TableList[].Name" \
  --expression "source_*" \
  --output table

aws glue start-job-run --job-name tickit_public_category_raw
aws glue start-job-run --job-name tickit_public_date_raw
aws glue start-job-run --job-name tickit_public_event_raw
aws glue start-job-run --job-name tickit_public_listing_raw
aws glue start-job-run --job-name tickit_public_sales_raw
aws glue start-job-run --job-name tickit_public_users_raw
aws glue start-job-run --job-name tickit_public_venue_raw
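
`start-job-run` is asynchronous. If the refine jobs read the output of the raw jobs, as the data flow suggests, each raw job should finish before its refine job starts. The sketch below shows one way to do that, under that assumption. The function name `run_job_and_wait` and the poll interval are hypothetical.

```shell
# Hypothetical helper: start one Glue job and wait for a terminal state.
# Usage: run_job_and_wait <job-name> [poll-seconds]
run_job_and_wait() {
  local job="$1"
  local interval="${2:-30}"  # assumed default poll interval, in seconds
  local run_id state

  # start-job-run returns the ID of the new run.
  run_id="$(aws glue start-job-run --job-name "$job" \
    --query "JobRunId" --output text)" || return 1

  while true; do
    state="$(aws glue get-job-run --job-name "$job" --run-id "$run_id" \
      --query "JobRun.JobRunState" --output text)" || return 1
    case "$state" in
      SUCCEEDED)
        echo "${job}: SUCCEEDED"
        return 0 ;;
      FAILED|ERROR|TIMEOUT|STOPPED|EXPIRED)
        echo "${job}: ${state}" >&2
        return 1 ;;
    esac
    # Any other state (for example STARTING or RUNNING): keep polling.
    sleep "$interval"
  done
}
```

For example, `run_job_and_wait tickit_public_sales_raw && run_job_and_wait tickit_public_sales_refine` would run the two stages for one table in order.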

aws glue start-job-run --job-name tickit_public_category_refine
aws glue start-job-run --job-name tickit_public_date_refine
aws glue start-job-run --job-name tickit_public_event_refine
aws glue start-job-run --job-name tickit_public_listing_refine
aws glue start-job-run --job-name tickit_public_sales_refine
aws glue start-job-run --job-name tickit_public_users_refine
aws glue start-job-run --job-name tickit_public_venue_refine
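
The fourteen commands above follow a single naming pattern, tickit_public_<table>_<stage>, so they can also be written as a loop. A minimal sketch, assuming exactly the job names listed above. The function name `start_tickit_jobs` is hypothetical.

```shell
# Hypothetical helper: start all seven Glue jobs for one stage.
# Usage: start_tickit_jobs <raw|refine>
start_tickit_jobs() {
  local stage="$1"
  local table

  # Only the two job suffixes that appear in this README are accepted.
  case "$stage" in
    raw|refine) ;;
    *) echo "usage: start_tickit_jobs raw|refine" >&2; return 1 ;;
  esac

  # The seven TICKIT tables, as they appear in the job names above.
  for table in category date event listing sales users venue; do
    echo "Starting tickit_public_${table}_${stage}"
    aws glue start-job-run \
      --job-name "tickit_public_${table}_${stage}" || return 1
  done
}
```

`start_tickit_jobs raw` is then equivalent to the first group of seven commands and `start_tickit_jobs refine` to the second. The loop stops at the first job that fails to start.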

aws glue get-tables \
  --database-name tickit_demo \
  --query "TableList[].Name" \
  --output table

aws s3api list-objects-v2 \
  --bucket "${DATA_LAKE_BUCKET}" \
  --prefix "tickit/" \
  --query "Contents[].Key" \
  --output table

tickit-data-lake-demo's People

Contributors

edgelytical, garystafford

tickit-data-lake-demo's Issues

upgrade click version

When running the run_local script, got the following error:

~/Projects/airflow/cicd/devopsworld-airflow/demos/demo-02-localdev/git-precommit-demo
\n⌛ Starting SQLFluff tests...
~/Projects/airflow/cicd/devopsworld-airflow/demos/demo-02-localdev/git-precommit-demo/dags ~/Projects/airflow/cicd/devopsworld-airflow/demos/demo-02-localdev/git-precommit-demo
Traceback (most recent call last):
..
..
    from .core import ParameterSource
ImportError: cannot import name 'ParameterSource' from 'click.core' (/Users/ricsue/opt/anaconda3/lib/python3.8/site-packages/click/core.py)

Missing dependency in requirements_local

When running the local test script as part of the pre-commit hooks, I needed to install typing-extensions==4.3.0 as I was getting the following error:

dags/__init__.py::BLACK SKIPPED (could not import 'black': cannot import name 'TypeGuard' from 'typing_extensions'...) [  6%]
dags/data_lake__01_clean_and_prep_demo.py::BLACK 
..
..
INTERNALERROR>     for name, fixture in item.funcargs.items():
INTERNALERROR> AttributeError: 'BlackItem' object has no attribute 'funcargs'

workaround to achieve multi-tenancy using ci/cd pipeline

If users from multiple AWS accounts have DAGs in a single Amazon MWAA environment, it is important to restrict, at the DAG level, which AWS resources each user can access. @garystafford could we do this by adding a validation step to the CI/CD pipeline that checks whether the users who write the DAGs meet the DAG policies?
