
sagemaker-distributed-training-workshop's Introduction

Distributed Training Workshop on Amazon SageMaker

Welcome to the art and science of optimizing neural networks at scale! In this workshop, you'll get hands-on experience with our high-performance distributed training libraries to achieve the best performance on AWS.

Workshop Content

Today you'll walk through two hands-on labs: the first focuses on data parallelism, and the second on model parallelism.
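
If you're curious what launching a data-parallel job looks like before you open Lab 1, here is a minimal sketch using the SageMaker Python SDK with the native PyTorch DDP distribution option. The script name, S3 path, and versions are placeholders; the lab notebooks define the exact settings they need.

# Minimal sketch of a multi-node, data-parallel training job (placeholder names).
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()  # assumes you are running inside SageMaker Studio

estimator = PyTorch(
    entry_point="train.py",            # hypothetical training script
    source_dir="scripts",              # hypothetical source directory
    role=role,
    framework_version="1.12",          # the labs may pin different versions
    py_version="py38",
    instance_type="ml.g4dn.12xlarge",
    instance_count=2,
    distribution={"pytorchddp": {"enabled": True}},  # launch one process per GPU across nodes
)

estimator.fit({"train": "s3://<your-bucket>/path/to/train"})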

Prerequisites

This workshop is self-contained. All of the content you need is produced by the notebooks themselves or included in the directory. However, if you are in an AWS-led workshop, you will most likely use the Event Engine to manage your AWS account.

If not, please make sure you have an AWS account with a SageMaker Studio domain created. In this account, please request a service limit increase for the ml.g4dn.12xlarge instance type within SageMaker Training.
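
If you prefer to script the quota check rather than use the console, the sketch below lists the SageMaker quotas with boto3 and filters for the g4dn.12xlarge training quota. The quota name and region here are assumptions, so confirm them in the Service Quotas console before submitting a request.

# Sketch: inspect (and optionally request) the SageMaker training quota for ml.g4dn.12xlarge.
import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")  # use your workshop region

quotas, token = [], None
while True:
    kwargs = {"ServiceCode": "sagemaker"}
    if token:
        kwargs["NextToken"] = token
    resp = sq.list_service_quotas(**kwargs)
    quotas.extend(resp["Quotas"])
    token = resp.get("NextToken")
    if not token:
        break

targets = [q for q in quotas
           if "ml.g4dn.12xlarge" in q["QuotaName"] and "training" in q["QuotaName"].lower()]
for q in targets:
    print(q["QuotaName"], q["QuotaCode"], "current limit:", q["Value"])

# Uncomment to submit the increase request (requires servicequotas:RequestServiceQuotaIncrease):
# sq.request_service_quota_increase(
#     ServiceCode="sagemaker", QuotaCode=targets[0]["QuotaCode"], DesiredValue=2
# )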

Other helpful links

If you're interested in learning more about distributed training on Amazon SageMaker, here are some helpful links for your journey.

  • Preparing data for distributed training. This blog post introduces different modes of working with data on SageMaker training.
  • Distributing tabular data. This example notebook uses a built-in algorithm, TabTransformer, to provide state-of-the-art transformer neural networks for tabular data. TabTransformer runs on multiple CPU-based instances.
  • SageMaker Training Compiler. This feature enables faster training on smaller cluster sizes, decreasing the overall job time by as much as 50%. Find example notebooks for Hugging Face and TensorFlow models here, including GPT2, BERT, and VisionTransformer. Training Compiler is also commonly used alongside hyperparameter tuning, and can be helpful in finding the right batch size.
  • Hyperparameter tuning. You can use SageMaker hyperparameter tuning, including our Syne Tune project, to find the right hyperparameters for your model, including learning rate, number of epochs, overall model size, batch size, and anything else you like. Syne Tune offers multi-objective search; a minimal tuner sketch follows this list.
  • Hosting distributed models with DeepSpeed on SageMaker. In this example notebook we demonstrate using SageMaker hosting to deploy a GPT-J model using DeepSpeed.
  • Shell scripts as SageMaker entrypoint. Want to bring a shell script so you can add any extra modifications or non-pip-installable packages? Or use a wheel? No problem. This link shows you how to use a bash script to run your program on SageMaker Training.
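
As referenced in the hyperparameter tuning bullet above, here is a minimal tuner sketch built on the SageMaker Python SDK. It assumes your training script accepts lr, batch_size, and epochs as hyperparameters and prints a val_acc value the regex can capture; all names and ranges are illustrative.

# Sketch: tune learning rate, batch size, and epochs for an existing estimator.
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                      # any SageMaker estimator you have already defined
    objective_metric_name="val_acc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "lr": ContinuousParameter(1e-5, 1e-2),
        "batch_size": IntegerParameter(16, 256),
        "epochs": IntegerParameter(1, 10),
    },
    metric_definitions=[{"Name": "val_acc", "Regex": "val_acc=([0-9\\.]+)"}],
    max_jobs=8,
    max_parallel_jobs=2,
)

tuner.fit({"train": "s3://<your-bucket>/path/to/train"})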

Top papers and case studies

Some relevant papers for your reference:

  1. SageMaker Data Parallel, aka Herring. In this paper we introduce a custom high performance computing configuration for distributed gradient descent on AWS, available within Amazon SageMaker Training.
  2. SageMaker Model Parallel. In this paper we propose a model parallelism framework available within Amazon SageMaker Training to reduce memory errors and enable training GPT-3 sized models and more! See our case study achieving 32 samples / second with 175B parameters on SageMaker over 140 p4d nodes.
  3. Amazon Search speeds up training by 7.3x on SageMaker. In this blog post we introduce two new features on Amazon SageMaker: support for native PyTorch DDP and PyTorch Lightning integration with SM DDP. We also discuss how Amazon Search sped up their overall training time by 7.3x by moving to distributed training. A minimal Lightning configuration sketch follows this list.
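
As a rough illustration of that Lightning integration, the snippet below selects the SageMaker data parallel (smddp) process-group backend inside a training script. It is a sketch only: it assumes a SageMaker PyTorch training container where the smdistributed package is present, plus your own LightningModule and data module.

# Sketch: use the SageMaker data parallel backend from a PyTorch Lightning script.
import json
import os

import pytorch_lightning as pl
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401 -- registers the "smddp" backend
from pytorch_lightning.strategies import DDPStrategy

num_gpus = int(os.environ.get("SM_NUM_GPUS", "1"))                   # set by SageMaker per instance
num_nodes = len(json.loads(os.environ.get("SM_HOSTS", '["algo-1"]')))

trainer = pl.Trainer(
    accelerator="gpu",
    devices=num_gpus,
    num_nodes=num_nodes,
    strategy=DDPStrategy(process_group_backend="smddp"),
    max_epochs=5,
)
# trainer.fit(model, datamodule=dm)  # model and dm come from your own code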

Upcoming book

If you'd like to read my upcoming book on the topic, check it out on Amazon here! It's coming out in April 2023.

sagemaker-distributed-training-workshop's People

Contributors

amazon-auto, emilywebber, hasanp87, kanwaljitkhurmi, mathephysicist, sheldonlsides


sagemaker-distributed-training-workshop's Issues

PyTorch Lightning example in LAB_1 not running as expected

I am running the lab 1 example as it is. Everything goes fine and training succeeds. But when I check the training logs, it is all happening on [1,mpirank:0,algo-1]. I am passing instance_count as two and can see there are two hosts [algo-1 and algo-2]. Each host has 8 GPUs, so the mpirank should go from 0-15, but all training logs show just [1,mpirank:0,algo-1]. Below is a sample from the log.

[1,mpirank:0,algo-1]<stdout>:#015Epoch 0: 50% 1/2 [00:00<00:00, 6.54it/s, loss=2.29, v_num=0] [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0it [00:00, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0% 0/1 [00:00<?, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 0% 0/1 [00:00<?, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 100% 1/1 [00:00<00:00, 1113.73it/s]#033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 0: 100% 2/2 [00:00<00:00, 12.33it/s, loss=2.29, v_num=0] [1,mpirank:0,algo-1]<stdout>:#015Epoch 0: 100% 2/2 [00:00<00:00, 12.33it/s, loss=2.29, v_num=0] [1,mpirank:0,algo-1]<stdout>:#015Epoch 0: 100% 2/2 [00:00<00:00, 12.10it/s, loss=2.29, v_num=0, val_acc=0.166] [1,mpirank:0,algo-1]<stdout>:#015 #033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 0: 100% 2/2 [00:00<00:00, 12.05it/s, loss=2.29, v_num=0, val_acc=0.166] [1,mpirank:0,algo-1]<stdout>:#015Epoch 0: 0% 0/2 [00:00<?, ?it/s, loss=2.29, v_num=0, val_acc=0.166] #015Epoch 1: 0% 0/2 [00:00<?, ?it/s, loss=2.29, v_num=0, val_acc=0.166] [1,mpirank:0,algo-1]<stdout>:#015Epoch 1: 50% 1/2 [00:00<00:00, 35.14it/s, loss=2.29, v_num=0, val_acc=0.166] [1,mpirank:0,algo-1]<stdout>:#015Epoch 1: 50% 1/2 [00:00<00:00, 9.28it/s, loss=2.29, v_num=0, val_acc=0.166] [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0it [00:00, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0% 0/1 [00:00<?, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 0% 0/1 [00:00<?, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 100% 1/1 [00:00<00:00, 1333.22it/s]#033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 1: 100% 2/2 [00:00<00:00, 17.19it/s, loss=2.29, v_num=0, val_acc=0.166] [1,mpirank:0,algo-1]<stdout>:#015Epoch 1: 100% 2/2 [00:00<00:00, 17.18it/s, loss=2.29, v_num=0, val_acc=0.166] [1,mpirank:0,algo-1]<stdout>:#015Epoch 1: 100% 2/2 [00:00<00:00, 16.85it/s, loss=2.29, v_num=0, val_acc=0.206] [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015 [1,mpirank:0,algo-1]<stdout>:#033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 1: 100% 2/2 [00:00<00:00, 16.77it/s, loss=2.29, v_num=0, val_acc=0.206] [1,mpirank:0,algo-1]<stdout>:#015Epoch 1: 0% 0/2 [00:00<?, ?it/s, loss=2.29, v_num=0, val_acc=0.206] #015Epoch 2: 0% 0/2 [00:00<?, ?it/s, loss=2.29, v_num=0, val_acc=0.206] [1,mpirank:0,algo-1]<stdout>:#015Epoch 2: 50% 1/2 [00:00<00:00, 34.22it/s, loss=2.29, v_num=0, val_acc=0.206] [1,mpirank:0,algo-1]<stdout>:#015Epoch 2: 50% 1/2 [00:00<00:00, 33.82it/s, loss=2.29, v_num=0, val_acc=0.206] [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0it [00:00, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0% 0/1 [00:00<?, ?it/s]#033[A[1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 0% 0/1 [00:00<?, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 100% 1/1 [00:00<00:00, 1283.05it/s]#033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 2: 100% 2/2 [00:00<00:00, 52.55it/s, loss=2.29, v_num=0, val_acc=0.206] [1,mpirank:0,algo-1]<stdout>:#015Epoch 2: 100% 2/2 [00:00<00:00, 47.22it/s, loss=2.29, v_num=0, val_acc=0.246] [1,mpirank:0,algo-1]<stdout>:#015 #033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 2: 100% 2/2 [00:00<00:00, 46.59it/s, loss=2.29, v_num=0, val_acc=0.246] 
[1,mpirank:0,algo-1]<stdout>:#015Epoch 2: 0% 0/2 [00:00<?, ?it/s, loss=2.29, v_num=0, val_acc=0.246] #015Epoch 3: 0% 0/2 [00:00<?, ?it/s, loss=2.29, v_num=0, val_acc=0.246] [1,mpirank:0,algo-1]<stdout>:#015Epoch 3: 50% 1/2 [00:00<00:00, 35.53it/s, loss=2.29, v_num=0, val_acc=0.246] [1,mpirank:0,algo-1]<stdout>:#015Epoch 3: 50% 1/2 [00:00<00:00, 34.17it/s, loss=2.29, v_num=0, val_acc=0.246] [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0it [00:00, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0% 0/1 [00:00<?, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 0% 0/1 [00:00<?, ?it/s][1,mpirank:0,algo-1]<stdout>:#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 100% 1/1 [00:00<00:00, 1230.36it/s]#033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 3: 100% 2/2 [00:00<00:00, 52.96it/s, loss=2.29, v_num=0, val_acc=0.246] [1,mpirank:0,algo-1]<stdout>:#015Epoch 3: 100% 2/2 [00:00<00:00, 47.93it/s, loss=2.29, v_num=0, val_acc=0.277][1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015 #033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 3: 100% 2/2 [00:00<00:00, 47.29it/s, loss=2.29, v_num=0, val_acc=0.277] [1,mpirank:0,algo-1]<stdout>:#015Epoch 3: 0% 0/2 [00:00<?, ?it/s, loss=2.29, v_num=0, val_acc=0.277] #015Epoch 4: 0% 0/2 [00:00<?, ?it/s, loss=2.29, v_num=0, val_acc=0.277] [1,mpirank:0,algo-1]<stdout>:#015Epoch 4: 50% 1/2 [00:00<00:00, 35.43it/s, loss=2.29, v_num=0, val_acc=0.277] [1,mpirank:0,algo-1]<stdout>:#015Epoch 4: 50% 1/2 [00:00<00:00, 34.41it/s, loss=2.28, v_num=0, val_acc=0.277] [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0it [00:00, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0% 0/1 [00:00<?, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 0% 0/1 [00:00<?, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 100% 1/1 [00:00<00:00, 1197.69it/s]#033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 4: 100% 2/2 [00:00<00:00, 52.35it/s, loss=2.28, v_num=0, val_acc=0.277] [1,mpirank:0,algo-1]<stdout>:#015Epoch 4: 100% 2/2 [00:00<00:00, 48.23it/s, loss=2.28, v_num=0, val_acc=0.305] [1,mpirank:0,algo-1]<stdout>:#015 #033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 4: 100% 2/2 [00:00<00:00, 47.55it/s, loss=2.28, v_num=0, val_acc=0.305] [1,mpirank:0,algo-1]<stdout>:#015Epoch 4: 0% 0/2 [00:00<?, ?it/s, loss=2.28, v_num=0, val_acc=0.305] #015Epoch 5: 0% 0/2 [00:00<?, ?it/s, loss=2.28, v_num=0, val_acc=0.305] [1,mpirank:0,algo-1]<stdout>:#015Epoch 5: 50% 1/2 [00:00<00:00, 35.41it/s, loss=2.28, v_num=0, val_acc=0.305] [1,mpirank:0,algo-1]<stdout>:#015Epoch 5: 50% 1/2 [00:00<00:00, 34.12it/s, loss=2.28, v_num=0, val_acc=0.305] [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0it [00:00, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation: 0% 0/1 [00:00<?, ?it/s]#033[A[1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 0% 0/1 [00:00<?, ?it/s]#033[A [1,mpirank:0,algo-1]<stdout>: [1,mpirank:0,algo-1]<stdout>:#015Validation DataLoader 0: 100% 1/1 [00:00<00:00, 1276.42it/s]#033[A [1,mpirank:0,algo-1]<stdout>:#015Epoch 5: 100% 2/2 [00:00<00:00, 52.82it/s, loss=2.28, v_num=0, val_acc=0.305] [1,mpirank:0,algo-1]<stdout>:#015Epoch 5: 100% 2/2 [00:00<00:00, 48.07it/s, loss=2.28, v_num=0, val_acc=0.333] [1,mpirank:0,algo-1]<stdout>:#015 #033[A 
[1,mpirank:0,algo-1]<stdout>:#015Epoch 5: 100% 2/2 [00:00<00:00, 47.45it/s, loss=2.28, v_num=0, val_acc=0.333] [1,mpirank:0,algo-1]<stdout>:#015Epoch 5: 0% 0/2 [00:00<?, ?it/s, loss=2.28, v_num=0, val_acc=0.333] [1,mpirank:0,algo-1]<stdout>:#015Epoch 6: 0% 0/2 [00:00<?, ?it/s, loss=2.28, v_num=0, val_acc=0.333] [1,mpirank:0,algo-1]<stdout>:#015Epoch 6: 50% 1/2 [00:00<00:00, 35.15it/s, loss=2.28, v_num=0, val_acc=0.333] [1,mpirank:0,algo-1]<stdout>:#015Epoch 6: 50% 1/2 [00:00<00:00, 34.69it/s, loss=2.28, v_num=0, val_acc=0.333] [1,mpirank:0,algo-1]<stdout>:
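
One way to see how many processes are actually running (independent of which rank owns the progress bar) is to log the rank and world size from inside the training script. This is only a small diagnostic sketch, assuming the process group has already been initialized by the launcher.

# Sketch: print the rank / world size seen by each process at startup.
import os
import socket

import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    local_rank = os.environ.get("LOCAL_RANK", os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK"))
    print(f"host={socket.gethostname()} rank={dist.get_rank()} "
          f"local_rank={local_rank} world_size={dist.get_world_size()}")
else:
    print(f"host={socket.gethostname()} torch.distributed is not initialized")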

Lab1 training failed at estimator.fit

I'm running lab1 on SageMaker.
Image: Pytorch 1.13 Python 3.9 CPU optimized
Kernel: Python3.9
Instance: ml.t3.medium

Here's the error message when running estimator.fit

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
Cell In[17], line 3
      1 # Passing True will halt your kernel, passing False will not. Both create a training job.
      2 # here we are defining the name of the input train channel. you can use whatever name you like! up to 20 channels per job.
----> 3 estimator.fit(wait=True, inputs = {'train':s3_train_path})

File /opt/conda/lib/python3.9/site-packages/sagemaker/workflow/pipeline_context.py:346, in runnable_by_pipeline.<locals>.wrapper(*args, **kwargs)
    342         return context
    344     return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
--> 346 return run_func(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/sagemaker/estimator.py:1341, in EstimatorBase.fit(self, inputs, wait, logs, job_name, experiment_config)
   1339 self.jobs.append(self.latest_training_job)
   1340 if wait:
-> 1341     self.latest_training_job.wait(logs=logs)

File /opt/conda/lib/python3.9/site-packages/sagemaker/estimator.py:2680, in _TrainingJob.wait(self, logs)
   2678 # If logs are requested, call logs_for_jobs.
   2679 if logs != "None":
-> 2680     self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2681 else:
   2682     self.sagemaker_session.wait_for_job(self.job_name)

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:5766, in Session.logs_for_job(self, job_name, wait, poll, log_type, timeout)
   5745 def logs_for_job(self, job_name, wait=False, poll=10, log_type="All", timeout=None):
   5746     """Display logs for a given training job, optionally tailing them until job is complete.
   5747 
   5748     If the output is a tty or a Jupyter cell, it will be color-coded
   (...)
   5764         exceptions.UnexpectedStatusException: If waiting and the training job fails.
   5765     """
-> 5766     _logs_for_job(self, job_name, wait, poll, log_type, timeout)

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:7995, in _logs_for_job(sagemaker_session, job_name, wait, poll, log_type, timeout)
   7992             last_profiler_rule_statuses = profiler_rule_statuses
   7994 if wait:
-> 7995     _check_job_status(job_name, description, "TrainingJobStatus")
   7996     if dot:
   7997         print()

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:8048, in _check_job_status(job, desc, status_key_name)
   8042 if "CapacityError" in str(reason):
   8043     raise exceptions.CapacityError(
   8044         message=message,
   8045         allowed_statuses=["Completed", "Stopped"],
   8046         actual_status=status,
   8047     )
-> 8048 raise exceptions.UnexpectedStatusException(
   8049     message=message,
   8050     allowed_statuses=["Completed", "Stopped"],
   8051     actual_status=status,
   8052 )

UnexpectedStatusException: Error for Training job shuxucao-ddp-mnist-2024-03-19-03-40-53-406: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "TypeError: Descriptors cannot be created directly.
 If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
 If you cannot immediately regenerate your protos, some other possible workarounds are
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
 
 More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
 File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
 File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
 File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
 # may not use this file except in compliance with the License. A copy of
 File "<frozen importlib._bootstrap>", line 991, in _find_and_load
 File "<frozen zipimport>", line 259, in load_module
 File 

The installed protobuf pip package is 3.20.2. Should I run this lab with Python 3.8?
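
Following the workarounds listed in the error message itself, one path is to set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python on the training job; another is to pin protobuf below 3.21 in the job's requirements. The sketch below shows both; the script name and versions are illustrative and should match whatever the lab notebook already uses.

# Sketch: apply the protobuf workarounds suggested by the error message above.
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

# Option 1: force the pure-Python protobuf implementation inside the training container.
estimator = PyTorch(
    entry_point="train.py",            # same entry point the lab uses (name illustrative)
    source_dir="scripts",              # illustrative source directory
    role=role,
    framework_version="1.12",
    py_version="py38",
    instance_type="ml.g4dn.12xlarge",
    instance_count=2,
    environment={"PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION": "python"},
)

# Option 2: pin protobuf in a requirements.txt placed next to the entry point, e.g.
#   protobuf<=3.20.3
# SageMaker framework containers install requirements.txt from source_dir before training starts.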
