AWS CMAPSS ML Lab

A lab to demo AWS data and machine learning services

Agenda

Intros
Log in to Event Engine
Discuss architecture
Work through lab
Questions
Wrap up, next steps

So what have we deployed

VPC - A secure network for our resources to run in, where we can control access from the public internet
Created some S3 buckets for our raw, ETL and processed data
Created a Cloud9 deployment, a handy terminal environment to run scripts
Created some Glue resources which we'll use to perform our ETL
Created various IAM roles which control access to parts of our infrastructure based on the principle of least privilege
Created a SageMaker notebook to do some data analysis and submit SageMaker training jobs from

Important

This lab has been deployed to the us-east-1 region (N. Virginia). Make sure you access resources from this region!

Architecture

Lab Exercise

Part 1 - Publish data onto our delivery stream

This will just be a simple Python script, but imagine we're collecting this data in real time over IoT Core, publishing the data onto a Kinesis stream and using Firehose to buffer the data onto S3 for us!
Go to "Cloud9" in the AWS console
Open the only Cloud9 instance (a micro EC2 instance)
We need to install the Python boto3 package

sudo pip install boto3

Next we need to clone this git repo so we can use the resources

git clone https://github.com/darrenbrien/aws-bb-cmapss.git

Publish some of our data onto a delivery stream

cd aws-bb-cmapss

python src/publisher.py 'data/train_*'

The quotes matter: without them you'll publish just a single file :(
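For readers curious what a publisher like this might do under the hood, here is a minimal sketch. It is not the repo's actual src/publisher.py. It assumes the script expands the glob pattern itself and sends each CSV line to Firehose in batches. The helper names and the stream name are hypothetical; put_record_batch is the boto3 Firehose call, which accepts up to 500 records per request.

```python
import glob
from typing import Iterable, Iterator, List

MAX_RECORDS_PER_CALL = 500  # Firehose PutRecordBatch limit per request


def read_lines(pattern: str) -> Iterator[str]:
    """Yield every line from every file matching the glob pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            yield from handle


def to_batches(lines: Iterable[str], size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[dict]]:
    """Group lines into lists of Firehose records, each ending in one newline."""
    batch: List[dict] = []
    for line in lines:
        batch.append({"Data": (line.rstrip("\n") + "\n").encode("utf-8")})
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def publish(client, stream_name: str, lines: Iterable[str]) -> int:
    """Send lines to a delivery stream and return how many records were accepted."""
    accepted = 0
    for batch in to_batches(lines):
        response = client.put_record_batch(DeliveryStreamName=stream_name, Records=batch)
        accepted += len(batch) - response.get("FailedPutCount", 0)
    return accepted
```

In Cloud9 this could be called as publish(boto3.client("firehose"), "<your-delivery-stream>", read_lines("data/train_*")). A more robust publisher would also retry any records counted in FailedPutCount.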

Part 2 - Run the glue crawler and perform some ETL

Navigate to the S3 area of the console and review the contents and structure of the submissions bucket
We can see our data has arrived. We haven't really done much with it yet, just simulated a real-time (ish) data arrival
AWS Glue is a serverless ETL service built on top of Apache Spark. Glue is great for working with arbitrarily large datasets as it supports parallelism and divides work amongst workers.

Before we can perform ETL we need to know what shape (columns, types, partitions) the data has. This is what a Glue crawler can do for us

Navigate to AWS Glue in the console, click "Crawlers" and take a look at the cmapss_submissions crawler

The sole job of a Glue crawler is to discover the metadata associated with a data repository.

Tick the crawler and select run. This should take ~1 minute.

Our crawler should have discovered some of the metadata of the CSV data which Firehose has saved to S3 for us

Note we haven't looked at this data ourselves yet; our crawler has discovered the partitions and column structure of the data
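If you'd rather inspect what the crawler found from code than from the console, a sketch like the following could list the discovered columns and partitions. It uses the boto3 Glue get_tables call. The function name is made up for illustration, and the database name you pass in is whatever the crawler writes to in your environment (not stated in this lab, so treat it as a placeholder).

```python
from typing import Dict, List, Tuple

Field = Tuple[str, str]  # (name, type)


def summarize_tables(glue_client, database: str) -> Dict[str, Dict[str, List[Field]]]:
    """Return {table: {"columns": [...], "partitions": [...]}} for a catalog database."""
    summary: Dict[str, Dict[str, List[Field]]] = {}
    kwargs = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            partitions = table.get("PartitionKeys", [])
            summary[table["Name"]] = {
                "columns": [(c["Name"], c.get("Type", "")) for c in columns],
                "partitions": [(p["Name"], p.get("Type", "")) for p in partitions],
            }
        token = page.get("NextToken")
        if not token:
            return summary
        kwargs["NextToken"] = token  # keep paging until the catalog is exhausted
```

For example, summarize_tables(boto3.client("glue"), "<your-database>") after the crawler has finished.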

Now that we know a little about the submissions data, we can perform some ETL to make it usable

A little about the dataset

We're collecting data in real time about some engine tests that are going on
We collect data until the engine "fails" and requires maintenance
Maybe NASA can get back to the Moon if they can do a better job of predicting when engines are close to failing
That way they could perform maintenance before the engine fails, keeping it online more of the time!
Today we'll collect this data, structure it so that we can run a machine learning model over it, and show what our predictions are for engines we haven't seen fail yet!
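One common way to turn run-to-failure data like this into something a model can learn from is to label each reading with the number of cycles left before that engine's last observed reading. The sketch below illustrates the idea in plain Python. It assumes each row carries an engine id and a cycle counter (hypothetical names), and the lab's own Glue job may structure its target differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def add_remaining_cycles(rows: List[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Take (unit, cycle) rows and return (unit, cycle, cycles_remaining) rows."""
    last_cycle: Dict[int, int] = defaultdict(int)
    for unit, cycle in rows:
        # The final cycle we saw for an engine is the point where it "failed".
        last_cycle[unit] = max(last_cycle[unit], cycle)
    return [(unit, cycle, last_cycle[unit] - cycle) for unit, cycle in rows]
```

An engine observed for three cycles would get labels 2, 1 and 0, so a model trained on these rows predicts how close an engine is to needing maintenance.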

Navigate to the jobs section of the Glue console.
Open the script tab. We'll each need to make some changes specific to our own AWS environment.
Go back to the Cloud9 window and open the glue_job.py file (it should be the same script). We need to edit it to reference the S3 bucket in your environment whose name contains "curated"
Go to the Cloud9 terminal

aws s3 ls

You should have a bucket with curated in the name, for example datalake-curated-dataset-123456789-us-east-1-qwertyu
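The same lookup can be done with boto3 if you prefer not to scan the listing by eye. This is a small sketch using the S3 list_buckets call; the helper name is made up for illustration.

```python
from typing import List


def buckets_containing(s3_client, fragment: str) -> List[str]:
    """Return the names of all buckets whose name contains the given fragment."""
    response = s3_client.list_buckets()
    return [bucket["Name"] for bucket in response["Buckets"] if fragment in bucket["Name"]]
```

Calling buckets_containing(boto3.client("s3"), "curated") should return a single name in this lab environment.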

Copy this bucket name and paste it over the similarly named S3 bucket in the glue_job.py file
Now we need to make this file available to AWS Glue. To do this, let's first create a new bucket

aws s3 mb s3://aws-glue-$(openssl rand -hex 5)

Important

Your bucket must start with the aws-glue prefix or bad things will happen
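For reference, the bucket creation step could be sketched in Python as follows. The helper names are hypothetical. secrets.token_hex(5) produces 10 random hex characters, like openssl rand -hex 5, and the prefix check guards against the mistake the note above warns about.

```python
import secrets

REQUIRED_PREFIX = "aws-glue-"


def new_glue_bucket_name() -> str:
    """Build a name shaped like aws-glue-$(openssl rand -hex 5)."""
    return REQUIRED_PREFIX + secrets.token_hex(5)


def create_glue_bucket(s3_client, name: str) -> str:
    """Create the bucket, refusing any name without the required prefix."""
    if not name.startswith(REQUIRED_PREFIX):
        raise ValueError(f"bucket name must start with {REQUIRED_PREFIX!r}: {name}")
    # In us-east-1 no CreateBucketConfiguration is needed.
    s3_client.create_bucket(Bucket=name)
    return name
```

Usage would look like create_glue_bucket(boto3.client("s3"), new_glue_bucket_name()).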

Back in the Glue console click Action => Save As

Enter s3://<the-bucket-you-just-created>

Click save

Select run-job. The job should take about 4-5 minutes to complete, including start-up time
Now let's run our second Glue crawler so we can query this data in Athena
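The run-job step can also be driven from code rather than the console. Below is a minimal sketch using the boto3 Glue calls start_job_run and get_job_run. The function name is made up, and the job name is whatever your Glue job is called in the console (a placeholder here).

```python
import time
from typing import Callable

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue_client, job_name: str, poll_seconds: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Start a Glue job run and block until it reaches a terminal state."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)  # still starting or running, check again shortly
```

For example, run_glue_job(boto3.client("glue"), "<your-job-name>") should eventually return "SUCCEEDED" if the script and bucket names are correct.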

Part 3 - Use AWS Athena to create train / validation data

Machine learning models work well when they generalize to "unseen" data
A model that fits training data with 99% accuracy but unseen evaluation data with only 10% accuracy is said to be "overfitted" to the training data.
We want to create a training set and an evaluation set so we can tell when our model is overfitting.
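To make overfitting concrete, here is a toy illustration that is unrelated to the lab's real model. The "model" simply memorizes its training examples and guesses 0 for anything it has not seen, so it looks perfect on training data and poor on evaluation data. All names are invented for the example.

```python
from typing import Dict, List, Tuple

Example = Tuple[int, int]  # (feature, label)


def true_label(x: int) -> int:
    """The pattern we would like a model to learn: 1 for x >= 50, else 0."""
    return 1 if x >= 50 else 0


class Memorizer:
    """Remembers every training example and guesses 0 for anything unseen."""

    def fit(self, examples: List[Example]) -> "Memorizer":
        self.table: Dict[int, int] = dict(examples)
        return self

    def predict(self, x: int) -> int:
        return self.table.get(x, 0)


def accuracy(model: Memorizer, examples: List[Example]) -> float:
    correct = sum(model.predict(x) == y for x, y in examples)
    return correct / len(examples)


train = [(x, true_label(x)) for x in range(0, 100, 2)]       # even numbers
evaluation = [(x, true_label(x)) for x in range(1, 100, 2)]  # odd numbers, never seen
model = Memorizer().fit(train)
train_accuracy = accuracy(model, train)
evaluation_accuracy = accuracy(model, evaluation)
```

Here train_accuracy is 1.0 while evaluation_accuracy is 0.5, no better than always guessing 0. Only the evaluation set reveals that nothing general was learned.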

In Cloud9 open ctas_training_evaluation_file.sql. We'll need to replace the S3 paths in this query to reflect your data. Leave everything else the same.
Open the Athena console and paste in the query. Be careful to avoid typos:

otherwise you may have to delete some S3 objects which may be created incorrectly
this "create table as" command is not idempotent

This query splits our data into 3 and uses two thirds for training and one third for evaluation. Ideally we'd get a similar level of performance on both datasets.
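The real split logic lives in ctas_training_evaluation_file.sql. As an illustration of how a two thirds / one third split might work, the sketch below assigns whole engines to a set based on an integer engine id (an assumption, as are the function names). Splitting by engine rather than by row keeps all readings from one engine in the same set, so the evaluation engines really are unseen.

```python
from typing import Dict, Iterable, List


def assign_split(unit: int) -> str:
    """Send every third engine to evaluation and the rest to training."""
    return "evaluation" if unit % 3 == 0 else "training"


def split_units(units: Iterable[int]) -> Dict[str, List[int]]:
    """Group engine ids by the split they fall into."""
    result: Dict[str, List[int]] = {"training": [], "evaluation": []}
    for unit in units:
        result[assign_split(unit)].append(unit)
    return result
```

With engine ids 1 to 99 this puts 66 engines in training and 33 in evaluation.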

Part 4 - SageMaker Notebook, Training Jobs and inference endpoints

80% of machine learning is working with data, that's why most of this lab uses data tools.

Open up SageMaker from the console and navigate to the SageMaker notebooks.
Click the JupyterLab link to open the Jupyter notebook environment.
We need to pull our git repo here (again)

Click terminal

cd SageMaker

run

git clone https://github.com/darrenbrien/aws-bb-cmapss.git

In the left-hand panel you should now see the aws-bb-cmapss folder. Click through into src
Open both ipynb files, eda and model
eda.ipynb is a little buggy. Usually when building a model from scratch, data scientists perform an Exploratory Data Analysis to help understand the data better. You can see how the Jupyter environment can be useful to iterate quickly and understand your data with tables and charts in a REPL environment.
Now let's train a SageMaker model. Open model.ipynb
We'll work through this notebook together and finish up the lab with a model inference endpoint we can send new data to!
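Once the endpoint exists, sending it new data from outside the notebook could look roughly like this. It is a sketch, assuming the deployed model accepts a single CSV row (text/csv) and returns plain text. The endpoint name and helper names are placeholders; invoke_endpoint is the boto3 sagemaker-runtime call.

```python
from typing import Sequence


def to_csv_row(features: Sequence[float]) -> str:
    """Join feature values into one comma-separated line."""
    return ",".join(str(value) for value in features)


def predict(runtime_client, endpoint_name: str, features: Sequence[float]) -> str:
    """Send one CSV row to a SageMaker endpoint and return the raw response text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_row(features).encode("utf-8"),
    )
    return response["Body"].read().decode("utf-8")
```

For example, predict(boto3.client("sagemaker-runtime"), "<your-endpoint-name>", sensor_values), where sensor_values holds the features in the same order the model was trained on.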
