
arqlab-data-platform

Language-Support: Stable

A Data Platform

This solution helps you build and deploy data lake infrastructure on AWS using the AWS Cloud Development Kit (CDK). AWS CDK is an open-source software development framework for defining your cloud application resources using familiar programming languages.

This solution helps you:

  1. deploy data lake infrastructure on AWS using CDK
  2. increase the speed of prototyping, testing, and deployment of new ETL workloads

This repository contains only the S3 Zones stack as a sample. If you require the full version of the code, please contact us here.
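Since the repository ships only the S3 Zones stack as a sample, the following is a minimal sketch of what such a stack can look like in CDK TypeScript. It is an illustration only: construct names such as S3BucketZonesStack, LandingZoneBucket and GoldZoneBucket, and the property choices, are assumptions rather than the project's actual s3_stack code.

    // Minimal sketch (assumed names) of an S3 zones stack with KMS encryption.
    import * as cdk from 'aws-cdk-lib';
    import * as s3 from 'aws-cdk-lib/aws-s3';
    import * as kms from 'aws-cdk-lib/aws-kms';
    import { Construct } from 'constructs';

    export class S3BucketZonesStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        // One customer-managed KMS key shared by the zone buckets.
        const dataLakeKey = new kms.Key(this, 'DataLakeKey', { enableKeyRotation: true });

        // Bucket receiving server access logs for the zone buckets.
        const logBucket = new s3.Bucket(this, 'AccessLogsBucket', {
          blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
        });

        for (const zone of ['LandingZone', 'GoldZone']) {
          new s3.Bucket(this, `${zone}Bucket`, {
            encryption: s3.BucketEncryption.KMS,
            encryptionKey: dataLakeKey,
            blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
            versioned: true,
            serverAccessLogsBucket: logBucket,
          });
        }
      }
    }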


Contents


Data Platform

This section explains the Data Platform architecture and its infrastructure.


Architecture Overview

The Data Lake can have multiple producers which ingest files into the landing zone bucket. The architecture uses AWS Lambda, Glue ETL and AWS Step Functions for orchestration and scheduling of ETL workloads to clean, validate and generate rich metadata. AWS Glue Catalog will be used to store the metadata.

AWS Glue Studio will be used to extract and transform the data into a relational data model and load it into a Data Mart (PostgreSQL or Aurora DB).

We use Amazon Athena for interactive queries and analysis. The solution also uses various AWS services for logging, monitoring, security, authentication, authorisation, notification, build, and deployment.
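To make the orchestration idea concrete, here is a hedged sketch of how a Step Functions state machine that runs a Glue ETL job could be defined with CDK. The job name, state names, and timeout are placeholders, not resources from this project.

    // Hypothetical sketch: a Step Functions state machine that runs a Glue ETL job.
    import * as cdk from 'aws-cdk-lib';
    import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
    import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
    import { Construct } from 'constructs';

    export class EtlOrchestrationStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        // Run the (assumed) Glue job and wait for it to finish before moving on.
        const runCleanAndValidate = new tasks.GlueStartJobRun(this, 'RunCleanAndValidate', {
          glueJobName: 'clean-and-validate', // placeholder job name
          integrationPattern: sfn.IntegrationPattern.RUN_JOB,
        });

        new sfn.StateMachine(this, 'EtlWorkflow', {
          definitionBody: sfn.DefinitionBody.fromChainable(runCleanAndValidate),
          timeout: cdk.Duration.hours(2),
        });
      }
    }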

Conceptual Data Lake


Infrastructure

Now that we have the Data Lake design, let's deploy its infrastructure. It includes the following resources:

  1. Amazon Virtual Private Cloud (VPC)
  2. Subnets
  3. Security Groups
  4. Route Table(s)
  5. VPC Endpoints
  6. Key Management Service
  7. Amazon S3 buckets for:
    1. Landing Zone (Staging)
    2. Gold Zone (Production/conformed layer)
    3. Log buckets
  8. A PostgreSQL DB for a Data Mart and auditing DB
  9. API Gateway

Glue ETL and Lambda functions are created through a separate pipeline.

The figure below represents the infrastructure resources we provision for the Data Lake.

![Data Lake Infrastructure Architecture](add diagram here)
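The networking resources listed above (VPC, subnets, security groups, route tables, and VPC endpoints) can be expressed in CDK roughly as in the sketch below. Names, subnet layout, and sizing are assumptions, not the project's actual vpc_stack.

    // Hypothetical sketch of the networking resources (names and sizing are assumptions).
    import * as cdk from 'aws-cdk-lib';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import { Construct } from 'constructs';

    export class VpcStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        // VPC with isolated subnets; route tables are created implicitly by CDK.
        const vpc = new ec2.Vpc(this, 'DataLakeVpc', {
          maxAzs: 2,
          subnetConfiguration: [
            { name: 'isolated', subnetType: ec2.SubnetType.PRIVATE_ISOLATED, cidrMask: 24 },
          ],
        });

        // Gateway endpoint so workloads inside the VPC can reach S3 privately.
        vpc.addGatewayEndpoint('S3Endpoint', {
          service: ec2.GatewayVpcEndpointAwsService.S3,
        });

        // Security group for ETL components running inside the VPC.
        new ec2.SecurityGroup(this, 'EtlSecurityGroup', {
          vpc,
          allowAllOutbound: true,
          description: 'Security group for data lake ETL workloads',
        });
      }
    }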

Deployment Architecture

We use two AWS accounts, which can be used as follows:

  1. Dev account for dev and test data lake
  2. Prod account for production data lake

The figure below represents the deployment model.

Data Lake Infrastructure Deployment

There are a few interesting details to point out here:

  1. Data Lake infrastructure source code is organized into three branches in GitHub - dev, test, and main (prod).
  2. Each branch is mapped to a target environment. This way, code changes made to the branches are deployed iteratively to their respective target environment.
  3. From a CDK perspective, we apply the standard bootstrapping principles, which are explained in subsequent sections.

Continuous delivery of infrastructure using GitHub Actions

Figure below illustrates the continuous delivery of data lake infrastructure.

NOTE: This is not implemented yet.

Data Lake Infrastructure continuous delivery

There are a few interesting details to point out here:

  1. The DevOps administrator checks in the code to the repository.
  2. GitHub Actions listens to commit events on the source code repositories.
  3. Code changes made to the dev branch of the repo are automatically deployed to the dev environment of the data lake.
  4. Code changes to the test branch of the repo are automatically deployed to the test environment.
  5. Code changes to the main branch of the repo are automatically deployed to the prod environment.

Source code structure

The table below explains how this source code is structured:

| File / Folder | Description |
| --- | --- |
| data-platform.ts | Application entry point. |
| data-platform-stack.ts | Pipeline deploy stage entry point. |
| iam_stack | Contains all resources to create IAM roles. |
| s3_stack | Creates the S3 buckets - LandingZone, GoldZone, and Log buckets. Also creates AWS KMS keys to enable server-side encryption for all buckets. |
| lambda_stack | Contains resources for Lambda functions. |
| tagging | Program to tag all provisioned resources. |
| vpc_stack | Contains all resources related to the VPC used by the Data Lake infrastructure and services. This includes the VPC, Security Groups, and VPC Endpoints (Gateway Endpoint). |
| glue_stack | Contains scripts to provision AWS Glue resources. |
| config | Contains all configurations. |
| step_functions | Contains all resources related to AWS Step Functions. |
| lib | The CDK code for deploying stacks. |
| resources | Static resources such as architecture diagrams, the developer guide, etc. |
| glue_etl_job_auditor | Lambda code for auditing Glue ETL jobs. |
| glue_etl_script | Contains a copy of the Glue ETL scripts. |
| assets | Contains assets for S3 deployment (predefined prefixes and default files). |
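To show how these pieces could fit together, below is a hedged sketch of an application entry point in the spirit of data-platform.ts. The imported stack classes, context lookup, and tag keys are assumptions based on the table above, not the actual file contents.

    // Hypothetical entry point wiring the stacks together (names are assumptions).
    import * as cdk from 'aws-cdk-lib';
    // These imports assume the stacks sketched elsewhere in this README.
    import { VpcStack } from '../lib/vpc_stack';
    import { S3BucketZonesStack } from '../lib/s3_stack';

    const app = new cdk.App();

    // Target environment, e.g. Dev, Test or Prod (see the config folder).
    const targetEnv = app.node.tryGetContext('env') ?? 'Dev';
    const env = { account: process.env.AWS_ACCOUNT, region: process.env.AWS_REGION };

    new VpcStack(app, `${targetEnv}DataLakeInfrastructureVpc`, { env });
    new S3BucketZonesStack(app, `${targetEnv}DataLakeInfrastructureS3BucketZones`, { env });

    // Tag everything the app provisions (the tagging module's role, per the table above).
    cdk.Tags.of(app).add('Environment', targetEnv);
    cdk.Tags.of(app).add('Application', 'data-platform');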

Automation scripts

This repository has the following automation scripts to complete steps before the deployment:

| # | Script | Purpose |
| --- | --- | --- |
| 1 | bootstrap_deployment_account.sh | Used to bootstrap the deployment account (currently this method is not being used). |
| 2 | bootstrap_target_account.sh | Used to bootstrap target environments, for example dev, test, and production. |

Prerequisites

This section has various steps you need to perform before you deploy data lake resources on AWS.


Software installation

  1. AWS CLI - make sure you have AWS CLI configured on your system. If not, refer to Configuring the AWS CLI for more details.

  2. AWS CDK - install compatible AWS CDK version

    npm install -g aws-cdk
  3. Python - make sure you have Python installed on your system. We recommend Python 3.7 or above.

Other requirements

  1. AWS accounts - we recommend having at least three accounts: dev, test, and prod. To test this solution with one target environment (e.g. dev), refer to developer_instructions.md for detailed instructions.

  2. Number of branches on your GitHub repo - start with main; the dev and test branches can be added at the beginning or after the first deployment of the data lake infrastructure to the PROD environment.

  3. Administrator privileges - you need administrator privileges to bootstrap AWS environments and complete the initial deployment. Usually, these steps can be performed by a DevOps administrator of your team. After these steps, you can revoke administrative privileges. Subsequent deployments are based on the self-mutating nature of CDK Pipelines.

  4. AWS Region selection - if possible use the same AWS region (e.g. eu-west-2) for dev, test, and prod accounts for simplicity.


AWS environment bootstrapping

Environment bootstrapping is a standard CDK process that prepares an AWS environment for deployment. Follow the steps below:

Important:

  1. This command is based on the AWS CLI Named Profiles feature.

  2. Configure the AWS CLI to use AWS SSO for login via the CLI.

  3. Before you bootstrap the dev account, set the following environment variables

    export ENV=<environment e.g. Dev, Test or Prod>
    export AWS_ACCOUNT=<aws account>
    export AWS_REGION=<region>
  4. Bootstrap dev account

    Important: Your configured environment must target the Dev account

    npm run cdk bootstrap aws://<dev account>/<region>

Application configuration

Before we deploy our resources, we must provide the manual variables; upon deployment, the CDK Pipelines will programmatically export outputs for managed resources. Follow the steps below to set up your custom configuration:

  1. Note: You can safely commit these values to your repository

  2. Go to config and make sure the values under the local_mapping Record within the function get_local_configuration are correct for your environments.
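As a rough orientation, the sketch below shows one way the local_mapping Record inside get_local_configuration might be shaped. The keys, account IDs, and CIDR ranges are placeholders; check the actual file under config for the real structure.

    // Hypothetical shape of the local configuration (keys and values are placeholders).
    export interface EnvironmentConfig {
      awsAccount: string;
      awsRegion: string;
      vpcCidr: string;
    }

    export function get_local_configuration(environment: string): EnvironmentConfig {
      const local_mapping: Record<string, EnvironmentConfig> = {
        Dev:  { awsAccount: '111111111111', awsRegion: 'eu-west-2', vpcCidr: '10.10.0.0/16' },
        Test: { awsAccount: '222222222222', awsRegion: 'eu-west-2', vpcCidr: '10.20.0.0/16' },
        Prod: { awsAccount: '333333333333', awsRegion: 'eu-west-2', vpcCidr: '10.30.0.0/16' },
      };
      return local_mapping[environment];
    }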


GitHub Actions integration

GitHub Actions requires an AWS IAM user Access Key and Secret Access Key to deploy stacks to the environment. These credentials are stored in GitHub Secrets. For security reasons, the keys must be rotated periodically. Follow the steps below:

  1. Note: Do NOT commit these values to your repository

  2. Go to the GitHub repository, open the Settings tab, and create the following secret entries:

    * AWS_ACCESS_KEY_ID
    * AWS_SECRET_ACCESS_KEY

    GitHub Settings


Deployment

This section explains the steps to deploy the stack in different environments.


Deploying for the first time

Configure your AWS profile to target the Deployment account as an Administrator and perform the following steps:

  1. Open command line (terminal)
  2. Go to the project root directory where cdk.json and app.py exist
  3. Run the command cdk ls
  4. Expected output: It lists CDK Pipelines and target account stacks on the console. A sample is below:
  Dev/DevDataLakeInfrastructureS3BucketZones
  Dev/DevDataLakeInfrastructureVpc
  Test/TestDataLakeInfrastructureS3BucketZones
  Test/TestDataLakeInfrastructureVpc
  5. Run the following command to deploy all stacks to all environments

    npm run cdk-deploy-all
    
  6. Run the command below to deploy a specific stack to a specific environment

    npm run cdk deploy <stack id>
    
    e.g. npm run cdk deploy Dev/DevDataLakeInfrastructureS3BucketZones
    
    
  7. Expected outputs:

    In the DEV account's CloudFormation console, you will see that the following stacks have completed successfully:

    cdk_deploy_output_dev_account_cfn_stacks


Additional resources

Clean up

  1. Delete stacks using the command npm run cdk destroy --all. When you see the following text, enter y, and press enter/return.

    Are you sure you want to delete: TestDataLakeInfrastructurePipeline, ProdDataLakeInfrastructurePipeline, DevDataLakeInfrastructurePipeline (y/n)?

    Note: This operation deletes stacks only in central deployment account

  2. To delete stacks in the development account, log on to the Dev account, go to the AWS CloudFormation console and delete the following stacks:

    1. Dev-DataLakeInfrastructureVpc
    2. Dev-DataLakeInfrastructureS3BucketZones
    3. Dev-DataLakeInfrastructureIam

    Note:

    1. Deletion of Dev-DataLakeInfrastructureS3BucketZones will delete the S3 buckets (raw, conformed, and purpose-built). This behavior can be changed by modifying the retention policy in s3_stack (see the sketch after this list).
  3. To delete stacks in the test account, log on to the Test account, go to the AWS CloudFormation console and delete the following stacks:

    1. Test-DataLakeInfrastructureVpc
    2. Test-DataLakeInfrastructureS3BucketZones
    3. Test-DataLakeInfrastructureIam

    Note:

    1. The S3 buckets (raw, conformed, and purpose-built) have retention policies attached and must be removed manually when they are no longer needed.
  4. To delete stacks in the prod account, log on to the Prod account, go to the AWS CloudFormation console and delete the following stacks:

    1. Prod-DataLakeInfrastructureVpc
    2. Prod-DataLakeInfrastructureS3BucketZones
    3. Prod-DataLakeInfrastructureIam

    Note:

    1. The S3 buckets (raw, conformed, and purpose-built) have retention policies attached and must be removed manually when they are no longer needed.
  5. Optional:

    1. If you are not using AWS CDK for other purposes, you can also remove CDKToolkit stack in each target account.

    2. Note: The asset S3 bucket has a retention policy and must be removed manually.

  6. For more details refer to AWS CDK Toolkit
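For reference, the retention behaviour mentioned in the notes above is controlled by the CDK removal policy on the buckets. The sketch below shows the kind of change in s3_stack that would let a stack deletion also remove the buckets and their contents; names are placeholders.

    // Hypothetical tweak to s3_stack: let stack deletion remove the buckets and their contents.
    import * as cdk from 'aws-cdk-lib';
    import * as s3 from 'aws-cdk-lib/aws-s3';
    import { Construct } from 'constructs';

    export class S3BucketZonesStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        new s3.Bucket(this, 'LandingZoneBucket', {
          removalPolicy: cdk.RemovalPolicy.DESTROY, // default for S3 buckets is RETAIN
          autoDeleteObjects: true,                  // empty the bucket before deletion
        });
      }
    }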


AWS CDK

Refer to CDK Instructions for detailed instructions


Developer guide

Refer to Developer guide for more details of this project.
