AWS Machine Learning Cheat Sheet

Table of Contents

| Duration | Questions | Format |
| --- | --- | --- |
| 170 mins | 65 questions | Multiple choice & multiple response |
  1. AWS Lake Formation

    AWS Lake Formation is a service that makes it easy to set up a secure data lake in days.

  1. AWS S3
  • Backbone for AWS ML
  • Eleven 9s (99.999999999%) of durability
  • Decouples storage (S3) from compute (EC2, Amazon Athena, Amazon Redshift Spectrum)
  • Supports any file format
  • Partition data (e.g., by date) to speed up range queries
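
As a minimal sketch of date partitioning, the helper below builds object keys in the Hive-style `key=value` layout that query engines such as Amazon Athena can use to prune partitions. The function name, prefix, and file name are hypothetical and chosen only for illustration.

```python
from datetime import date


def partitioned_key(prefix: str, day: date, filename: str) -> str:
    """Build an S3 object key with Hive-style year/month/day partitions."""
    return (
        f"{prefix}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


# Hypothetical prefix and file name, for illustration only.
print(partitioned_key("logs", date(2021, 3, 7), "events.parquet"))
```

A range query over a few days then only needs to list and read the matching `year=/month=/day=` prefixes instead of scanning the whole bucket.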

| Storage Class | Use Case | Availability Zones | Access Time | Retrieval Fee |
| --- | --- | --- | --- | --- |
| S3 Standard | Active, frequently accessed data | >= 3 | Milliseconds | None |
| S3 Intelligent-Tiering | Data with changing access patterns | >= 3 | Milliseconds | None |
| S3 Standard-IA | Infrequently accessed data | >= 3 | Milliseconds | Per GB |
| S3 One Zone-IA | Re-creatable, less frequently accessed data | 1 | Milliseconds | Per GB |
| S3 Glacier | Archive data | >= 3 | Minutes | Per GB |
| S3 Glacier Deep Archive | Lowest storage cost | >= 3 | Hours | Per GB |

Amazon S3 with Amazon SageMaker

S3 Lifecycle Rules

  • Transition actions: objects are moved to another storage class after a set time.
  • Expiration actions: objects are deleted after a set time.
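
The two action types can be sketched as a lifecycle configuration document. The dictionary below follows the shape accepted by boto3's `put_bucket_lifecycle_configuration` call; the rule ID, prefix, and day counts are assumptions picked for illustration, not recommendations.

```python
import json

# Hypothetical rule: move objects under logs/ to cheaper classes, then delete them.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},     # hypothetical prefix
            # Transition actions: change the storage class as objects age.
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Expiration action: delete objects after one year.
            "Expiration": {"Days": 365},
        }
    ]
}

# With boto3 installed and credentials configured, this document could be passed as
# s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle_configuration)
print(json.dumps(lifecycle_configuration, indent=2))
```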

S3 Encryption for Objects

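  • SSE-S3: server-side encryption with keys managed by Amazon S3
  • SSE-KMS: server-side encryption with keys managed in AWS Key Management Service (KMS)
  • SSE-C: server-side encryption with customer-provided keys that you manage yourself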

S3 Security

  • User based: IAM policies define which API calls are allowed for a specific user
  • Resource based: bucket policies attached to the bucket itself
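
As an illustration of a resource-based policy, the sketch below builds a bucket policy that denies uploads not encrypted with SSE-KMS. The bucket name and statement ID are hypothetical; the condition key `s3:x-amz-server-side-encryption` is the standard one for this check.

```python
import json

BUCKET = "example-ml-bucket"  # hypothetical bucket name

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUploadsWithoutSseKms",  # hypothetical statement ID
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            # Deny any PutObject whose encryption header is not aws:kms.
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms"
                }
            },
        }
    ],
}

# The JSON string is what would be attached to the bucket as its policy.
policy_json = json.dumps(bucket_policy, indent=2)
print(policy_json)
```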
  1. Amazon FSx for Lustre

When your training data is already in Amazon S3 and you plan to run training jobs several times using different algorithms and parameters, consider using Amazon FSx for Lustre, a file system service. FSx for Lustre speeds up your training jobs by serving your Amazon S3 data to Amazon SageMaker at high speeds. The first time you run a training job, FSx for Lustre automatically copies data from Amazon S3 and makes it available to Amazon SageMaker. You can use the same Amazon FSx file system for subsequent iterations of training jobs, preventing repeated downloads of common Amazon S3 objects.

  1. Amazon S3 with Amazon EFS

Alternatively, if your training data is already in Amazon Elastic File System (Amazon EFS), we recommend using that as your training data source. Amazon EFS has the benefit of directly launching your training jobs from the service without the need for data movement, resulting in faster training start times. This is often the case in environments where data scientists have home directories in Amazon EFS and are quickly iterating on their models by bringing in new data, sharing data with colleagues, and experimenting with including different fields or labels in their dataset. For example, a data scientist can use a Jupyter notebook to do initial cleansing on a training set, launch a training job from Amazon SageMaker, then use their notebook to drop a column and re-launch the training job, comparing the resulting models to see which works better.

  1. Amazon EBS

    Amazon Elastic Block Store (EBS) is an easy-to-use, high-performance block storage service designed for use with Amazon Elastic Compute Cloud (EC2) for both throughput-intensive and transaction-intensive workloads at any scale.

  1. Batch Processing

    For batch ingestion into the AWS Cloud, you can use services like AWS Glue, an ETL (extract, transform, and load) service that you can use to categorize your data, clean it, enrich it, and move it between various data stores. AWS Database Migration Service (AWS DMS) is another service that helps with batch ingestion. It reads historical data from source systems, such as relational database management systems, data warehouses, and NoSQL databases, at any desired interval. You can also automate various ETL tasks that involve complex workflows by using AWS Step Functions.

  1. Stream Processing

Stream processing, which includes real-time processing, involves no grouping at all. Data is sourced, manipulated, and loaded as soon as it is created or recognized by the data ingestion layer. This kind of ingestion is less cost-effective, since it requires systems to constantly monitor sources and accept new information. But you might want to use it for real-time predictions using an Amazon SageMaker endpoint that you want to show your customers on your website or some real-time analytics that require continually refreshed data, like real-time dashboards.

Amazon Kinesis

  1. Video Streams: ingest and analyze video and audio data.
  2. Data Streams: use the Kinesis Producer Library (KPL) to ingest data and the Kinesis Client Library (KCL) to develop custom consumer applications that process data from Kinesis Data Streams (KDS).
  3. Data Firehose: batch, compress, and apply custom transformation logic (AWS Lambda) to streaming data.
  4. Data Analytics: process and transform data from KDS or Kinesis Data Firehose (KDF) with SQL in near-real time.
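
To make the Data Streams item more concrete: Kinesis Data Streams maps each record's partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The sketch below imitates that routing locally. The helper names are hypothetical, and the even split of the hash key space across shards is an assumption made for illustration.

```python
import hashlib

HASH_KEY_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_ranges(shard_count: int) -> list:
    """Split the hash key space into contiguous (start, end) ranges (assumed even split)."""
    size = HASH_KEY_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        end = HASH_KEY_SPACE - 1 if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges: list) -> int:
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash value fell outside all shard ranges")


ranges = shard_ranges(4)
for key in ("user-1", "user-2", "user-3"):  # hypothetical partition keys
    print(key, "-> shard", shard_for(key, ranges))
```

Records that share a partition key always land on the same shard, which is why a skewed key distribution can create a hot shard.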

AWS Glue

AWS Glue is a serverless ETL service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

  1. ETL

  1. Unified Data Catalog

Amazon MSK (Managed Streaming for Apache Kafka)

Amazon MSK is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications.

Amazon EMR

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.

  1. Preprocessing

Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning.

  1. Feature Engineering

    Dimension Reduction

    • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    • Principal Component Analysis (PCA)
  2. Visualization
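
To give a feel for what PCA does before plotting, here is a minimal pure-Python sketch that finds the first principal component with power iteration on the covariance matrix. It is a teaching aid under simplifying assumptions (small data, non-constant features), not how a production library implements PCA, and the function names are hypothetical.

```python
import math


def first_principal_component(points, iterations=100):
    """Estimate the direction of greatest variance via power iteration."""
    n = len(points)
    dim = len(points[0])
    # Center the data so the covariance is computed around the mean.
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix (dim x dim).
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    # Repeatedly multiply by the covariance matrix and renormalize.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(points, component):
    """Project centered points onto a component, giving one coordinate each."""
    n = len(points)
    dim = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    return [sum((p[j] - means[j]) * component[j] for j in range(dim)) for p in points]


data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy points on the line y = 2x
pc1 = first_principal_component(data)
print(pc1)
print(project(data, pc1))
```

The projected coordinates are what you would plot after reducing the data to one dimension. t-SNE, by contrast, is non-linear and is not sketched here.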

Amazon SageMaker

Amazon SageMaker Estimators

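  • Local mode: train in a container on the local machine or notebook instance, without launching a separate training instance
  • Pipe mode: stream training data from S3 directly to the algorithm instead of downloading it first, which shortens start-up time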

Amazon SageMaker DeepAR Forecasting Algorithm

The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).
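
DeepAR reads training data as JSON Lines, one time series per line, with a `start` timestamp and a `target` array of values (an optional `cat` array carries categorical features). The helper below is a minimal sketch of that serialization. The function name and sample series are hypothetical.

```python
import json


def to_deepar_jsonl(series):
    """Serialize time series dictionaries into DeepAR-style JSON Lines text."""
    lines = []
    for item in series:
        record = {"start": item["start"], "target": list(item["target"])}
        # Categorical features are optional, so copy them only when present.
        if "cat" in item:
            record["cat"] = list(item["cat"])
        lines.append(json.dumps(record))
    return "\n".join(lines)


# Hypothetical daily series, for illustration only.
sample = [
    {"start": "2021-01-01 00:00:00", "target": [5.0, 7.5, 6.1], "cat": [0]},
    {"start": "2021-01-03 00:00:00", "target": [1.0, 2.0]},
]
print(to_deepar_jsonl(sample))
```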

Amazon EC2 P3 Instances

Amazon EC2 P3 instances deliver high performance compute in the cloud with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications.

Hyperparameter Tuning

  • Grid Search
  • Random Search
  • Amazon SageMaker Automated Hyperparameter Tuning
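
The first two strategies can be contrasted with a small sketch. The toy objective, parameter names, and value ranges below are hypothetical stand-ins for a real validation loss. SageMaker's automated tuning is a managed service that also offers Bayesian optimization and is not reproduced here.

```python
import itertools
import random


def toy_objective(params):
    """Hypothetical loss with its minimum at learning_rate=0.1, max_depth=5."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["max_depth"] - 5) ** 2


def grid_search(objective, grid):
    """Evaluate every combination of the listed values; return (best_score, best_params)."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best


def random_search(objective, ranges, trials, seed=0):
    """Sample each parameter uniformly from its (low, high) range for a fixed number of trials."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best


print(grid_search(toy_objective, {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [3, 5, 7]}))
print(random_search(toy_objective, {"learning_rate": (0.01, 1.0), "max_depth": (3, 7)}, trials=50))
```

Grid search cost grows multiplicatively with each added parameter, while random search keeps a fixed budget of trials regardless of how many parameters are tuned.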

Metrics

Precision: TP/(TP+FP)

Recall/Sensitivity: TP/(TP+FN)

Specificity: TN/(TN+FP)
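
The three formulas translate directly into code. The confusion-matrix counts in the usage lines are made-up numbers used only to exercise the functions.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are truly positive: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were found (sensitivity): TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of actual negatives that were correctly rejected: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 8, 2, 4, 86
print("precision  ", precision(tp, fp))
print("recall     ", recall(tp, fn))
print("specificity", specificity(tn, fp))
```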

Queues

Amazon SQS

Amazon CloudWatch

AWS CloudTrail

AWS CloudTrail captures API calls and related events made by or on behalf of your AWS account and delivers the log files to an Amazon S3 bucket that you specify.

Contributors

ryanxjhan
