ray-project / ray-educational-materials

This is a suite of hands-on training materials that shows how to scale CV, NLP, and time-series forecasting workloads with Ray.

License: Apache License 2.0

Jupyter Notebook 96.43% Python 3.57%
deep-learning distributed-machine-learning ray-distributed ray-tune ray ray-train ray-data ray-serve generative-ai llm

ray-educational-materials's Introduction

Ray Educational Materials

© 2022, Anyscale Inc. All Rights Reserved


Welcome to a collection of educational materials focused on Ray, a distributed compute framework for scaling your Python and machine learning workloads from a laptop to a cluster.

Recommended Learning Path

  • Overview of Ray - An overview of Ray and the entire Ray ecosystem.
  • Introduction to Ray AI Runtime - An overview of the Ray AI Runtime.
  • Ray Core: Remote Functions as Tasks - Learn how arbitrary functions can be executed asynchronously on separate Python workers.
  • Ray Core: Remote Objects - Learn about objects that can be stored anywhere in a Ray cluster.
  • Ray Core: Remote Classes as Actors, part 1 - Work with stateful actors.
  • Ray Core: Remote Classes as Actors, part 2 - Learn the "Tree of Actors" pattern.
  • Ray Core: Ray API best practices - Learn Ray patterns, anti-patterns, and best practices.
  • Scaling batch inference - Learn about scaling batch inference in computer vision with Ray.
  • Optional: Batch inference with Ray Datasets - Bonus content for scaling batch inference using Ray Datasets.
  • Scaling model training - Learn about scaling model training in computer vision with Ray.
  • Ray observability part 1 - Introduces the Ray State API and Ray Dashboard UI as tools for observing the Ray cluster and applications.
  • LLM model fine-tuning and batch inference - Fine-tuning a Hugging Face Transformer (FLAN-T5) on the Alpaca dataset. Also includes distributed hyperparameter tuning and batch inference.
  • Multilingual chat with Ray Serve - Serving a Hugging Face LLM chat model with Ray Serve. Integrates multiple models and services within Ray Serve (language detection and translation) to implement multilingual chat.
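
For orientation, here is a minimal sketch of the Ray Core concepts covered by the modules above (tasks, objects, and actors). It is illustrative only; the notebooks use their own examples and datasets.

import ray

ray.init()  # start Ray locally; on a cluster this would connect to it

# Task: an arbitrary function executed asynchronously on a separate worker.
@ray.remote
def square(x):
    return x * x

# Tasks return ObjectRefs (futures); ray.get resolves them to values.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

# Actor: a stateful worker process created from a class.
@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1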

Connect with the Ray community

You can learn and get more involved with the Ray community of developers and researchers:

  • Ray documentation

  • Official Ray site: Browse the ecosystem and use this site as a hub to get the information you need to get going and building with Ray.

  • Join the community on Slack: Find friends to discuss your new learnings in our Slack space.

  • Use the discussion board: Ask questions, follow topics, and view announcements on this community forum.

  • Join a meetup group: Tune in to meetups to listen to compelling talks, get to know other users, and meet the team behind Ray.

  • Open an issue: Ray is constantly evolving to improve the developer experience. Submit feature requests, bug reports, and get help via GitHub issues.

  • Become a Ray contributor: We welcome community contributions to improve our documentation and the Ray framework.

ray-educational-materials's People

Contributors

ammirato, dependabot[bot], dmatrix, emmyscode, haochunchang, jcoffi, kamil-kaczmarek, markintoshz


ray-educational-materials's Issues

[Suggestion]: Improve Readability of Ray Serve Use Case Image

Please share your suggestion here

The collection of diagrams for the Ray Serve use case under the section "Multi-model composition for model serving" is illegible and cluttered. Replace this image with a more readable diagram whenever one becomes available.

[Bug]: Failing to read AWS S3 file(s)

Notebook with bug

https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Introduction_to_Ray_AI_Runtime.ipynb

What happened?

Failed to execute the following Python code:

# Read Parquet file to Ray Dataset.
dataset = ray.data.read_parquet(
    "s3://anyscale-training-data/intro-to-ray-air/nyc_taxi_2021.parquet"
)
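
A likely cause is missing or misconfigured AWS credentials when reading from this bucket. One possible workaround (an assumption, not a confirmed fix for this notebook) is to read through an explicitly anonymous S3 filesystem:

import pyarrow.fs
import ray

# Assumes the bucket allows unsigned access; passing an anonymous
# filesystem bypasses the local AWS credential lookup.
dataset = ray.data.read_parquet(
    "s3://anyscale-training-data/intro-to-ray-air/nyc_taxi_2021.parquet",
    filesystem=pyarrow.fs.S3FileSystem(anonymous=True),
)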




Environment info

Python version: 3.11.3
Ray version: 2.5.0

Issue Severity

High: It blocks me from completing my task.

[Suggestion]: Ray use cases section should split simple scaling vs advanced use cases

Please share your suggestion here

Currently the list of use cases in https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Overview_of_Ray.ipynb contains the following:

  • Exoshuffle
  • Building a custom feature engineering library
  • Alpa
  • RLlib / FIFA
  • Multi-model serving
  • RL training / Riot
  • ML platform / Shopify
  • ML platform / Spotify

This is skewed toward advanced use cases, which I don't think accurately reflects the entire target audience of Ray. I think it would be productive to break this down into two categories:

  • Scaling simple ML workloads
    • Batch inference on CPUs and GPUs (Core / Data)
    • Parallel training of many small models / Distributed training of large models (Core / Train)
    • Managing parallel experiments and hyperparameter tuning (Tune)
    • Serving model pipelines or multiple models (Serve)
    • Reinforcement Learning (RLlib)
    • ML platform use cases (Shopify, Spotify)
  • Implementing advanced ML workloads
  • Alpa
    • Exoshuffle
    • Custom feature eng library
    • RL training / Riot / FIFA

[Suggestion]: add "Part 3" to the Overview of Ray

Please share your suggestion here

Add Part 3, which will consist of small coding exercises:

Work with the Object store (a sketch follows below)

  • add an object with ray.put()
  • print the returned object reference
  • use ray.get() to access the value of the object
  • Mention that tasks and actors return futures that are references as well.

Compute pi digits
Use this docs example to show a highly_parallel computational job: computing pi digits.
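
A minimal sketch of what the Object store exercise could look like (the variable names are illustrative, not taken from any existing notebook):

import ray

ray.init()

# Put a Python object into the distributed object store.
numbers = list(range(10))
numbers_ref = ray.put(numbers)

# The returned value is an ObjectRef, i.e. a future pointing at the object.
print(numbers_ref)

# ray.get resolves the reference back into the value.
print(ray.get(numbers_ref))

# Tasks (and actor methods) also return ObjectRefs; ray.get works the same way.
@ray.remote
def total(values):
    return sum(values)

print(ray.get(total.remote(numbers_ref)))  # 45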

[Suggestion]: Reorganize this repository under consistent directories centered around workflows.

Please share your suggestion here

As the number of different notebooks grows, it becomes more and more difficult to surface what users are interested in. Right now, the directories are named around either the relevant library (e.g. "Ray Core") or the type of data (e.g. "Computer_vision_workloads").

At the very least, these conventions should be consistent and, ideally, centered around workflows that developers would relate to. In addition, the README should be improved to better describe this repository and to direct attention and traffic to the relevant modules more quickly.

[Suggestion]: add descriptions on how many Actors are needed given my cluster

Please share your suggestion here

Help Ray users understand how they can estimate the number of Actors and the compute needed to achieve performant batch prediction. Mention the following (a sketch follows this list):

  • actor defaults (1 CPU) and how to change them
  • how to assign a GPU to actors
  • the total number of actors as a function of the number of CPUs or GPUs in the cluster
  • for large clusters, mention the good practice of limiting the number of CPUs made available on the head node (docs).
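
A hedged sketch of how such a description could be illustrated; the resource numbers here are assumptions for the example, not recommendations:

import ray

ray.init()

# By default an actor requests 1 CPU; override with num_cpus / num_gpus.
@ray.remote
class CpuPredictor:
    def predict(self, batch):
        return batch

@ray.remote(num_gpus=1)  # pin one GPU to each actor
class GpuPredictor:
    def predict(self, batch):
        return batch

# Size the actor pool from what the cluster actually has.
# On larger clusters, subtract CPUs reserved on the head node (see docs).
resources = ray.cluster_resources()
num_cpu_actors = int(resources.get("CPU", 0))  # one 1-CPU actor per CPU
num_gpu_actors = int(resources.get("GPU", 0))  # one 1-GPU actor per GPU

cpu_actors = [CpuPredictor.remote() for _ in range(num_cpu_actors)]
gpu_actors = [GpuPredictor.remote() for _ in range(num_gpu_actors)]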

What is the meaning of these sentences in "Part 5: Distributed batch inference with Ray Core API"?

When using Ray, you can pass objects as arguments to remote functions. Ray will automatically store these objects in the local object store (on the worker node where the function is running) using the ray.put() function. This makes the objects available to all local tasks. However, if the objects are large, this can be inefficient as the objects will need to be copied every time they are passed to a remote function.

To improve performance, you can explicitly store both the model and feature extractor in the object store by using ray.put(). This avoids the need to create multiple copies of the objects.


I am confused about the statements regarding ray.put():

  1. "However, if the objects are large, this can be inefficient as the objects will need to be copied every time they are passed to a remote function "
  2. "To improve performance, you can explicitly store both the model and feature extractor in the object store by using ray.put(). This avoids the need to create multiple copies of the objects."

Which sentence should I follow?
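
For context, here is a minimal sketch of the pattern the second quoted sentence describes: calling ray.put() once and passing the resulting ObjectRef to many tasks, so the large object is stored once instead of being copied on every call. The model stand-in and task names are illustrative, not the notebook's actual code.

import ray

ray.init()

large_model = {"weights": list(range(1_000_000))}  # stand-in for a big model

# Store the object in the object store once; tasks receive only the ObjectRef.
model_ref = ray.put(large_model)

@ray.remote
def predict(model, item):
    return len(model["weights"]) + item

# Passing model_ref (not large_model) avoids re-copying the model per task call.
results = ray.get([predict.remote(model_ref, i) for i in range(8)])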

Ray Website "Try It Out" Quick Start with Ray AIR Colab Error on Import

Notebook with bug

https://colab.research.google.com/github/ray-project/ray-educational-materials/blob/main/Introductory_modules/Quickstart_with_Ray_AIR_Colab.ipynb

What happened?

Description
Running the "try it out" Colab on the website fails with an import error:
AttributeError: 'NoneType' object has no attribute 'replace'
Using the latest version of xgboost-ray (0.1.18) fixes the problem.


Environment info

ray==2.3.0 xgboost_ray==0.1.15

Issue Severity

Low: Minor problem.

[Bug]: ray.air checkpoints has moved to ray.train checkpoints

Notebook with bug

Computer_vision_workloads/Semantic_segmentation/Scaling_batch_inference.ipynb

What happened?

Imports as well as other dependencies need to be fixed for checkpoint-related changes.

#from ray.air import Checkpoint
from ray.train import Checkpoint

Further, Checkpoint.from_dict() does not work:

AttributeError: The new ray.train.Checkpoint class does not support from_dict(). Instead, only directories are supported.
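
One possible way to adapt the notebook (a sketch, assuming the checkpoint contents can be written to a temporary directory first; not a confirmed fix for this notebook):

import os
import tempfile
import torch
from ray.train import Checkpoint

model = torch.nn.Linear(4, 2)  # stand-in for the segmentation model

# Instead of Checkpoint.from_dict({...}), write the state to a directory
# and build the checkpoint from that directory, which is what the new
# ray.train.Checkpoint supports.
checkpoint_dir = tempfile.mkdtemp()
torch.save({"model_state_dict": model.state_dict()},
           os.path.join(checkpoint_dir, "model.pt"))
checkpoint = Checkpoint.from_directory(checkpoint_dir)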

Environment info

Ray 2.10.0
Python 3.10.13
Ubuntu

Issue Severity

None

[Suggestion]: batch prediction module: merge Actors and ActorPool sections

Please share your suggestion here

Merge Actors and ActorPool approaches into one.

As ActorPool is a utility, it can be presented as a convenience wrapper that is easy to work with. It provides load balancing and Actor management so that Ray users do not need to implement these themselves (as presented in the Actors section).
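
A minimal sketch of the ActorPool convenience wrapper this suggestion refers to (the actor class and method names are illustrative):

import ray
from ray.util import ActorPool

ray.init()

@ray.remote
class Doubler:
    def double(self, x):
        return 2 * x

# ActorPool load-balances submitted work across the actors it wraps.
pool = ActorPool([Doubler.remote(), Doubler.remote()])
results = list(pool.map(lambda actor, item: actor.double.remote(item), range(8)))
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]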

[Suggestion]: It's better to test the examples in the educational materials

Please share your suggestion here

predictions_dataset = predictor.predict(data=dataset, batch_size=1)

If I run this on a GPU server, this line raises a RayTaskError. It seems the returned segmentation_maps_postprocessed has to be moved to a CPU NumPy array and `num_gpus_per_worker=1` has to be set. It took me a long time to realize the example has that issue. For a newbie, even a minor issue may lead to confusion.
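
For context, a hedged sketch of the kind of fix being described: moving GPU tensors to CPU NumPy before returning them from the prediction step. This assumes PyTorch and uses illustrative names, not the notebook's actual code.

import numpy as np
import torch

def postprocess(segmentation_maps: torch.Tensor) -> np.ndarray:
    # Ray serializes task and actor results; a CUDA tensor can fail to
    # deserialize in a process without GPU access, so move it to CPU NumPy.
    return segmentation_maps.detach().cpu().numpy()

# In addition, the GPU has to be requested explicitly for the prediction
# workers, e.g. num_gpus_per_worker=1 in the notebook's predict() call.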

Thanks

[suggestion] batch inference module - merge sections to better present Ray AIR

Please share your suggestion here

Merge Datasets and BatchPredictor approaches into one: "Distributed batch inference with Ray AIR".

The Datasets approach is more basic; BatchPredictor is more specialized, easy to use, and feature-rich, as it also:

  • supports various predictors (TorchPredictor, HFPredictor)
  • handles framework-native batch conversions
  • gives options to resume operations from an AIR checkpoint to prediction, select / keep columns, etc.

Note in this section that BatchPredictor calls dataset.map_batches() under the hood. From that perspective they are similar.
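
A minimal sketch of the underlying dataset.map_batches() pattern that both approaches build on (the batch function here is illustrative, not a real predictor):

import ray

ray.init()

ds = ray.data.from_items([{"value": i} for i in range(1000)])

# map_batches applies a function to batches of records in parallel;
# BatchPredictor builds its inference loop on this same primitive.
def add_one(batch):  # with batch_format="pandas", batch is a DataFrame
    batch["value"] = batch["value"] + 1
    return batch

predictions = ds.map_batches(add_one, batch_format="pandas", batch_size=128)
print(predictions.take(3))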

[Suggestion]: incorporate feedback from "Overview of Ray" dry run

Please share your suggestion here

Here is a list of small changes to make based on feedback from the "Overview of Ray" dry run:

  • include an object store visualization under the section "Put data in the object store"
  • change the naming of the training and testing set components to be more readable
  • redirect use case links to YouTube videos rather than our site
  • lower the number of models to be trained
  • start with n_estimators at 8 and then increment by 8 to achieve a more satisfying convergence (a sketch follows this list)
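
A hedged sketch of what that last item could look like; scikit-learn is used here purely for illustration, and the actual notebook may use a different model and dataset:

import ray
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

ray.init()
X, y = make_regression(n_samples=500, n_features=10, random_state=0)

@ray.remote
def train_and_score(X, y, n_estimators):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X, y)
    return model.score(X, y)

# Start at 8 estimators and increment by 8 for each subsequent model.
sizes = list(range(8, 8 * 6 + 1, 8))  # 8, 16, 24, 32, 40, 48
scores = ray.get([train_and_score.remote(X, y, n) for n in sizes])
print(dict(zip(sizes, scores)))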

[Bug]:

Notebook with bug

LLM_finetuning_and_batch_inference.ipynb

What happened?

I get the following error while running this cell:
trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": train_dataset,
        "evaluation": validation_dataset,
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
    preprocessor=batch_preprocessor,
)

(error screenshot attached to the issue)

Environment info

Ray 2.8, Python 3.9

Issue Severity

High: It blocks me from completing my task.

[Bug]: Halt due to resources are not available

Example 3: How to use Ray distributed tasks for image transformation and computation

What happened?

When I run "run_distribued", I get the following errors:

In my case I set the batch to 100, but even when I set it to 35, the errors were raised too.

I am new to Ray and cannot figure out what is going on. What resources are unavailable, and why does the system halt?

(error screenshot attached to the issue)
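
For diagnosing this kind of halt, a hedged sketch of how to compare what the tasks request with what the cluster reports; this is general Ray usage, not taken from the notebook:

import ray

ray.init()

# A task or actor that requests more CPUs/GPUs/memory than any node can ever
# provide will wait indefinitely with a "resources not available" warning.
print(ray.cluster_resources())    # total resources registered with Ray
print(ray.available_resources())  # resources not currently in use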

Environment info

System: CentOS 7
CPUs: 128
Ray: 2.3
Python: 3.9

Issue Severity

None
