
arqlab-data-platform

Language-Support: Stable

A Data Platform

This solution helps you build and deploy data lake infrastructure on AWS using the AWS Cloud Development Kit (CDK). AWS CDK is an open-source software development framework for defining your cloud application resources using familiar programming languages.

This solution helps you:

  1. deploy data lake infrastructure on AWS using CDK
  2. increase the speed of prototyping, testing, and deployment of new ETL workloads

This repository contains only the S3 Zones stack as a sample. If you require the full version of the code, please contact us here.
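Since the repository ships only the S3 Zones stack as a sample, the following is a minimal sketch of what such a stack can look like in CDK TypeScript. It is an illustration only: construct names such as S3BucketZonesStack, LandingZoneBucket and GoldZoneBucket, and the property choices, are assumptions rather than the project's actual s3_stack code.

    // Minimal sketch (assumed names) of an S3 zones stack with KMS encryption.
    import * as cdk from 'aws-cdk-lib';
    import * as s3 from 'aws-cdk-lib/aws-s3';
    import * as kms from 'aws-cdk-lib/aws-kms';
    import { Construct } from 'constructs';

    export class S3BucketZonesStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        // One customer-managed KMS key shared by the zone buckets.
        const dataLakeKey = new kms.Key(this, 'DataLakeKey', { enableKeyRotation: true });

        // Bucket receiving server access logs for the zone buckets.
        const logBucket = new s3.Bucket(this, 'AccessLogsBucket', {
          blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
        });

        for (const zone of ['LandingZone', 'GoldZone']) {
          new s3.Bucket(this, `${zone}Bucket`, {
            encryption: s3.BucketEncryption.KMS,
            encryptionKey: dataLakeKey,
            blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
            versioned: true,
            serverAccessLogsBucket: logBucket,
          });
        }
      }
    }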


Contents


Data Platform

This section explains the Data Platform architecture and its infrastructure.


Architecture Overview

The Data Lake can have multiple producers which ingest files into the landing zone bucket. The architecture uses AWS Lambda, Glue ETL and AWS Step Functions for orchestration and scheduling of ETL workloads to clean, validate and generate rich metadata. AWS Glue Catalog will be used to store the metadata.

AWS Glue Studio will be used to extract and transform the data into a relational data model and load it into a Data Mart (PostgreSQL or Aurora DB).

We use Amazon Athena for interactive queries and analysis. The solution also uses various AWS services for logging, monitoring, security, authentication, authorisation, notification, build, and deployment.
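To make the orchestration idea concrete, here is a hedged sketch of how a Step Functions state machine that runs a Glue ETL job could be defined with CDK. The job name, state names, and timeout are placeholders, not resources from this project.

    // Hypothetical sketch: a Step Functions state machine that runs a Glue ETL job.
    import * as cdk from 'aws-cdk-lib';
    import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
    import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
    import { Construct } from 'constructs';

    export class EtlOrchestrationStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        // Run the (assumed) Glue job and wait for it to finish before moving on.
        const runCleanAndValidate = new tasks.GlueStartJobRun(this, 'RunCleanAndValidate', {
          glueJobName: 'clean-and-validate', // placeholder job name
          integrationPattern: sfn.IntegrationPattern.RUN_JOB,
        });

        new sfn.StateMachine(this, 'EtlWorkflow', {
          definitionBody: sfn.DefinitionBody.fromChainable(runCleanAndValidate),
          timeout: cdk.Duration.hours(2),
        });
      }
    }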

Conceptual Data Lake


Infrastructure

Now that we have the Data Lake design, let's deploy its infrastructure. It includes the following resources:

  1. Amazon Virtual Private Cloud (VPC)
  2. Subnets
  3. Security Groups
  4. Route Table(s)
  5. VPC Endpoints
  6. Key Management Service
  7. Amazon S3 buckets for:
    1. Landing Zone (Staging)
    2. Gold Zone (Production/conformed layer)
    3. Log buckets
  8. A PostgreSQL DB for a Data Mart and auditing DB
  9. API Gateway

Glue ETL and Lambda functions are created through a separate pipeline.

The figure below represents the infrastructure resources we provision for the Data Lake.

![Data Lake Infrastructure Architecture](add diagram here)
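The networking resources listed above (VPC, subnets, security groups, route tables, and VPC endpoints) can be expressed in CDK roughly as in the sketch below. Names, subnet layout, and sizing are assumptions, not the project's actual vpc_stack.

    // Hypothetical sketch of the networking resources (names and sizing are assumptions).
    import * as cdk from 'aws-cdk-lib';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import { Construct } from 'constructs';

    export class VpcStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        // VPC with isolated subnets; route tables are created implicitly by CDK.
        const vpc = new ec2.Vpc(this, 'DataLakeVpc', {
          maxAzs: 2,
          subnetConfiguration: [
            { name: 'isolated', subnetType: ec2.SubnetType.PRIVATE_ISOLATED, cidrMask: 24 },
          ],
        });

        // Gateway endpoint so workloads inside the VPC can reach S3 privately.
        vpc.addGatewayEndpoint('S3Endpoint', {
          service: ec2.GatewayVpcEndpointAwsService.S3,
        });

        // Security group for ETL components running inside the VPC.
        new ec2.SecurityGroup(this, 'EtlSecurityGroup', {
          vpc,
          allowAllOutbound: true,
          description: 'Security group for data lake ETL workloads',
        });
      }
    }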

Deployment Architecture

We use two AWS accounts, which can be used as follows:

  1. Dev account for dev and test data lake
  2. Prod account for production data lake

The figure below represents the deployment model.

Data Lake Infrastructure Deployment

There are a few interesting details to point out here:

  1. Data Lake infrastructure source code is organized into three branches in GitHub - dev, test, and main (prod).
  2. Each branch is mapped to a target environment. This way, code changes made to the branches are deployed iteratively to their respective target environment.
  3. From a CDK perspective, we apply the standard bootstrapping principles, which are explained in subsequent sections.

Continuous delivery of infrastructure using GitHub Actions

Figure below illustrates the continuous delivery of data lake infrastructure.

NOTE: This is not implemented yet.

Data Lake Infrastructure continuous delivery

There are a few interesting details to point out here:

  1. The DevOps administrator checks in the code to the repository.
  2. GitHub Actions listens to commit events on the source code repositories.
  3. Code changes made to the dev branch of the repo are automatically deployed to the dev environment of the data lake.
  4. Code changes to the test branch of the repo are automatically deployed to the test environment.
  5. Code changes to the main branch of the repo are automatically deployed to the prod environment.

Source code structure

The table below explains how this source code is structured:

| File / Folder | Description |
| --- | --- |
| data-platform.ts | Application entry point. |
| data-platform-stack.ts | Pipeline deploy stage entry point. |
| iam_stack | Contains all resources to create IAM roles. |
| s3_stack | Creates the S3 buckets - LandingZone, GoldZone, and Log buckets. Also creates AWS KMS keys to enable server-side encryption for all buckets. |
| lambda_stack | Contains resources for Lambda functions. |
| tagging | Program to tag all provisioned resources. |
| vpc_stack | Contains all resources related to the VPC used by the Data Lake infrastructure and services. This includes the VPC, Security Groups, and VPC Endpoints (Gateway Endpoint). |
| glue_stack | Contains scripts to provision AWS Glue resources. |
| config | Contains all configurations. |
| step_functions | Contains all resources related to AWS Step Functions. |
| lib | The CDK code for deploying stacks. |
| resources | Static resources such as architecture diagrams, the developer guide, etc. |
| glue_etl_job_auditor | Lambda code for auditing Glue ETL jobs. |
| glue_etl_script | Contains a copy of the Glue ETL scripts. |
| assets | Contains assets for S3 deployment (predefined prefixes and default files). |
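To show how these pieces could fit together, below is a hedged sketch of an application entry point in the spirit of data-platform.ts. The imported stack classes, context lookup, and tag keys are assumptions based on the table above, not the actual file contents.

    // Hypothetical entry point wiring the stacks together (names are assumptions).
    import * as cdk from 'aws-cdk-lib';
    // These imports assume the stacks sketched elsewhere in this README.
    import { VpcStack } from '../lib/vpc_stack';
    import { S3BucketZonesStack } from '../lib/s3_stack';

    const app = new cdk.App();

    // Target environment, e.g. Dev, Test or Prod (see the config folder).
    const targetEnv = app.node.tryGetContext('env') ?? 'Dev';
    const env = { account: process.env.AWS_ACCOUNT, region: process.env.AWS_REGION };

    new VpcStack(app, `${targetEnv}DataLakeInfrastructureVpc`, { env });
    new S3BucketZonesStack(app, `${targetEnv}DataLakeInfrastructureS3BucketZones`, { env });

    // Tag everything the app provisions (the tagging module's role, per the table above).
    cdk.Tags.of(app).add('Environment', targetEnv);
    cdk.Tags.of(app).add('Application', 'data-platform');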

Automation scripts

This repository has the following automation scripts to complete steps before the deployment:

| # | Script | Purpose |
| --- | --- | --- |
| 1 | bootstrap_deployment_account.sh | Used to bootstrap the deployment account (currently this method is not being used). |
| 2 | bootstrap_target_account.sh | Used to bootstrap target environments, for example dev, test, and production. |

Prerequisites

This section has various steps you need to perform before you deploy data lake resources on AWS.


Software installation

  1. AWS CLI - make sure you have AWS CLI configured on your system. If not, refer to Configuring the AWS CLI for more details.

  2. AWS CDK - install compatible AWS CDK version

    npm install -g aws-cdk
  3. Python - make sure you have Python installed on your system. We recommend Python 3.7 or above.

Other requirements

  1. AWS accounts - we recommend having at least three accounts: dev, test, and prod. To test this solution with one target environment (e.g. dev), refer to developer_instructions.md for detailed instructions.

  2. Number of branches on your GitHub repo - start with main; the dev and test branches can be added at the beginning or after the first deployment of the data lake infrastructure to the PROD environment.

  3. Administrator privileges - you need administrator privileges to bootstrap AWS environments and complete the initial deployment. Usually, these steps can be performed by a DevOps administrator of your team. After these steps, you can revoke administrative privileges. Subsequent deployments are based on the self-mutating nature of CDK Pipelines.

  4. AWS Region selection - if possible use the same AWS region (e.g. eu-west-2) for dev, test, and prod accounts for simplicity.


AWS environment bootstrapping

Environment bootstrapping is a standard CDK process that prepares an AWS environment for deployment. Follow the steps below:

Important:

  1. This command is based on the AWS CLI Named Profiles feature.

  2. Configure the AWS CLI to use AWS SSO for login via the CLI.

  3. Before you bootstrap the dev account, set the following environment variables

    export ENV=<environment e.g. Dev, Test or Prod>
    export AWS_ACCOUNT=<aws account>
    export AWS_REGION=<region>
  4. Bootstrap dev account

    Important: Your configured environment must target the Dev account

    npm run cdk bootstrap aws://<dev account>/<region>

Application configuration

Before we deploy our resources, we must provide the manual variables; upon deployment, the CDK Pipelines will programmatically export outputs for managed resources. Follow the steps below to set up your custom configuration:

  1. Note: You can safely commit these values to your repository

  2. Go to config and make sure the values under the local_mapping Record within the function get_local_configuration are correct for your environments.
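As a rough orientation, the sketch below shows one way the local_mapping Record inside get_local_configuration might be shaped. The keys, account IDs, and CIDR ranges are placeholders; check the actual file under config for the real structure.

    // Hypothetical shape of the local configuration (keys and values are placeholders).
    export interface EnvironmentConfig {
      awsAccount: string;
      awsRegion: string;
      vpcCidr: string;
    }

    export function get_local_configuration(environment: string): EnvironmentConfig {
      const local_mapping: Record<string, EnvironmentConfig> = {
        Dev:  { awsAccount: '111111111111', awsRegion: 'eu-west-2', vpcCidr: '10.10.0.0/16' },
        Test: { awsAccount: '222222222222', awsRegion: 'eu-west-2', vpcCidr: '10.20.0.0/16' },
        Prod: { awsAccount: '333333333333', awsRegion: 'eu-west-2', vpcCidr: '10.30.0.0/16' },
      };
      return local_mapping[environment];
    }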


GitHub Actions integration

GitHub Actions requires an AWS IAM user Access Key and Secret Access Key to deploy stacks to the environment. These credentials are stored in GitHub Secrets. For security reasons, the keys must be rotated periodically. Follow the steps below:

  1. Note: Do NOT commit these values to your repository

  2. Go to the GitHub repository, open the Settings tab, and create the following secret entries:

    * AWS_ACCESS_KEY_ID
    * AWS_SECRET_ACCESS_KEY

    GitHub Settings


Deployment

This section explains the steps to deploy the stack in different environments.


Deploying for the first time

Configure your AWS profile to target the Deployment account as an Administrator and perform the following steps:

  1. Open command line (terminal)
  2. Go to the project root directory where cdk.json and app.py exist
  3. Run the command cdk ls
  4. Expected output: It lists CDK Pipelines and target account stacks on the console. A sample is below:
  Dev/DevDataLakeInfrastructureS3BucketZones
  Dev/DevDataLakeInfrastructureVpc
  Test/TestDataLakeInfrastructureS3BucketZones
  Test/TestDataLakeInfrastructureVpc
  5. Run the following command to deploy all stacks to all environments

    npm run cdk-deploy-all
    
  6. Run the command below to deploy a specific stack to a specific environment

    npm run cdk deploy <stack id>
    
    e.g. npm run cdk deploy Dev/DevDataLakeInfrastructureS3BucketZones
    
    
  7. Expected outputs:

    In the DEV account's CloudFormation console, you will see that the following stacks have completed successfully:

    cdk_deploy_output_dev_account_cfn_stacks


Additional resources

Clean up

  1. Delete stacks using the command npm run cdk destroy --all. When you see the following text, enter y, and press enter/return.

    Are you sure you want to delete: TestDataLakeInfrastructurePipeline, ProdDataLakeInfrastructurePipeline, DevDataLakeInfrastructurePipeline (y/n)?

    Note: This operation deletes stacks only in central deployment account

  2. To delete stacks in the development account, log on to the Dev account, go to the AWS CloudFormation console and delete the following stacks:

    1. Dev-DataLakeInfrastructureVpc
    2. Dev-DataLakeInfrastructureS3BucketZones
    3. Dev-DataLakeInfrastructureIam

    Note:

    1. Deletion of Dev-DataLakeInfrastructureS3BucketZones will delete the S3 buckets (raw, conformed, and purpose-built). This behavior can be changed by modifying the retention policy in s3_stack (see the sketch after this list).
  3. To delete stacks in the test account, log on to the Test account, go to the AWS CloudFormation console and delete the following stacks:

    1. Test-DataLakeInfrastructureVpc
    2. Test-DataLakeInfrastructureS3BucketZones
    3. Test-DataLakeInfrastructureIam

    Note:

    1. The S3 buckets (raw, conformed, and purpose-built) have retention policies attached and must be removed manually when they are no longer needed.
  4. To delete stacks in the prod account, log on to the Prod account, go to the AWS CloudFormation console and delete the following stacks:

    1. Prod-DataLakeInfrastructureVpc
    2. Prod-DataLakeInfrastructureS3BucketZones
    3. Prod-DataLakeInfrastructureIam

    Note:

    1. The S3 buckets (raw, conformed, and purpose-built) have retention policies attached and must be removed manually when they are no longer needed.
  5. Optional:

    1. If you are not using AWS CDK for other purposes, you can also remove CDKToolkit stack in each target account.

    2. Note: The asset S3 bucket has a retention policy and must be removed manually.

  6. For more details refer to AWS CDK Toolkit
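For reference, the retention behaviour mentioned in the notes above is controlled by the CDK removal policy on the buckets. The sketch below shows the kind of change in s3_stack that would let a stack deletion also remove the buckets and their contents; names are placeholders.

    // Hypothetical tweak to s3_stack: let stack deletion remove the buckets and their contents.
    import * as cdk from 'aws-cdk-lib';
    import * as s3 from 'aws-cdk-lib/aws-s3';
    import { Construct } from 'constructs';

    export class S3BucketZonesStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        new s3.Bucket(this, 'LandingZoneBucket', {
          removalPolicy: cdk.RemovalPolicy.DESTROY, // default for S3 buckets is RETAIN
          autoDeleteObjects: true,                  // empty the bucket before deletion
        });
      }
    }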


AWS CDK

Refer to CDK Instructions for detailed instructions


Developer guide

Refer to Developer guide for more details of this project.
