AWS Machine Learning Cheat Sheet

Overview
Data Engineering
- Data Repositories
  - AWS Lake Formation
  - AWS S3
- Data Ingestion
- Data Transformation
Exploratory Data Analysis
Modelling
ML Implementation and Operations

Overview

Duration	Questions	Formats
170 mins	65 questions	Multiple choice & multiple response

Data Engineering

Data Repositories

AWS Lake Formation

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days.

AWS S3

Backbone for AWS ML
Eleven 9 durability
Decoupling of storage (S3) to compute (EC2, Amazon Athena, Amazon Redshift Spectrum)
Support all file formats
Partition (ex. by date) to speed up range queries

Storage Classes	Use cases	Availability Zones	Access Time	Retrieval Fee
S3 Standard	For active, frequently accessed data	>= 3	Milliseconds	None
S3 Intelligent Tiering	For data with changing access patterns	>= 3	Milliseconds	None
S3 Standard-IA	For infrequently accessed data	>= 3	Milliseconds	per GB
S3 One -IA	For re-creatable, less accessde data	1	Milliseconds	per GB
S3 Glacier	For archive data	>=3	Minutes	per GB
S3 Glacier Deep Archive	For lowest storage cost	>=3	Hours	per GB

Amazon S3 with Amazon SageMaker

S3 Lifecycle Rules

Transition actions: objects transitioned to another storage class.
Expiration actions: objects deleted

S3 Encryption for Objects

SSE-S3: encrypt using keys by AWS
SSE-KMS: use AWS Key Management Service to manage encryption
SSE-C: manage own keys

S3 Security

User based: which API calls are allowed for a specific user
Resource based: bucket policies

Amazon FSx for Lustre

When your training data is already in Amazon S3 and you plan to run training jobs several times using different algorithms and parameters, consider using Amazon FSx for Lustre, a file system service. FSx for Lustre speeds up your training jobs by serving your Amazon S3 data to Amazon SageMaker at high speeds. The first time you run a training job, FSx for Lustre automatically copies data from Amazon S3 and makes it available to Amazon SageMaker. You can use the same Amazon FSx file system for subsequent iterations of training jobs, preventing repeated downloads of common Amazon S3 objects.

Amazon S3 with Amazon EFS

Alternatively, if your training data is already in Amazon Elastic File System (Amazon EFS), we recommend using that as your training data source. Amazon EFS has the benefit of directly launching your training jobs from the service without the need for data movement, resulting in faster training start times. This is often the case in environments where data scientists have home directories in Amazon EFS and are quickly iterating on their models by bringing in new data, sharing data with colleagues, and experimenting with including different fields or labels in their dataset. For example, a data scientist can use a Jupyter notebook to do initial cleansing on a training set, launch a training job from Amazon SageMaker, then use their notebook to drop a column and re-launch the training job, comparing the resulting models to see which works better.

Amazon EBS

Amazon Elastic Block Store (EBS) is an easy to use, high-performance, block-storage service designed for use with Amazon Elastic Compute Cloud (EC2) for both throughput and transaction intensive workloads at any scale.

Data Ingestion

Batch Processing

For batch ingestions to the AWS Cloud, you can use services like AWS Glue, an ETL (extract, transform, and load) service that you can use to categorize your data, clean it, enrich it, and move it between various data stores. AWS Database Migration Service (AWS DMS) is another service to help with batch ingestions. This service reads from historical data from source systems, such as relational database management systems, data warehouses, and NoSQL databases, at any desired interval. You can also automate various ETL tasks that involve complex workflows by using AWS Step Functions.

Stream Processing

Stream processing, which includes real-time processing, involves no grouping at all. Data is sourced, manipulated, and loaded as soon as it is created or recognized by the data ingestion layer. This kind of ingestion is less cost-effective, since it requires systems to constantly monitor sources and accept new information. But you might want to use it for real-time predictions using an Amazon SageMaker endpoint that you want to show your customers on your website or some real-time analytics that require continually refreshed data, like real-time dashboards.

Amazon Kinesis

Video Streams: ingest and analyze video and audio data.
Data Streams: use Kinese Producer Library to ingest data and use Kinesis Client Library to develop custom cunsumer applicaitons that can process data from KDS.
Data Firehose: batch, compress, and execute custom transformation logic (AWS Lambda) data.
Data Analytics: process and transform data through KDS or KDF using SQL near-real time.

AWS Glue

AWS Glue is a serverless ETL service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

Unified Data Catalog

Amazon MSK (Managed Streaming for Apache Kafka)

Amazon MSK is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications.

Data Transformation

Amazon EMR

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.

Exploratory Data Analysis

Preprocessing

Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning.

Feature Engineering

Dimension Reduction
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Principal Component Analysis (PCA)
Visualization

Modelling

Amazon SageMaker

Amazon SageMaker Estimators

Local mode: without loading training data
Pipe mode: improve loading time

Amazon SageMaker DeepAR Forecasting Algoirthm

The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).

Amazon EC2 P3 Instances

Amazon EC2 P3 instances deliver high performance compute in the cloud with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications.

Hyperparameters Tuning

Grid Search
Random Search
Amazon SageMaker Automated Hyperparameter Tuning

Metrics

Precision: TP/(TP+FP)

Recall/Sensitivity: TP/(TP+FN)

Specificity: TN/(TN+FP)

ML Implementation and Operations

Queues

Amazon SQS

Amazon CloudWatch

AWS CloudTrail

AWS CloudTrail captures API calls and related events made by or on behalf of your AWS account and delivers the log files to an Amazon S3 bucket that you specify.

ryanxjhan / aws-machine-learning-cheat-sheet Goto Github PK