
ray_shuffling_data_loader's Introduction

Deprecated

This library has been deprecated, with our learnings absorbed into the much more general Ray Datasets library.
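
For reference, a rough equivalent of this library's per-epoch shuffle in Ray Datasets might look like the following minimal sketch (API names are from Ray Data; exact usage depends on your Ray version):

import ray

# Read the training files, then reshuffle the dataset before each epoch.
ds = ray.data.read_parquet("s3://foo/bar")
for epoch in range(10):
    shuffled = ds.random_shuffle()
    for batch in shuffled.iter_batches(batch_size=25000):
        # Train on this batch.
        pass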

Ray-based Shuffling Data Loader

A Ray-based data loader with per-epoch shuffling and configurable pipelining, for shuffling and loading training data for distributed training of machine learning models.

Installation

You can install the latest version from the main branch via:

pip install git+https://github.com/ray-project/ray_shuffling_data_loader.git@main#egg=ray_shuffling_data_loader

Usage

This shuffling data loader exposes a generic ShufflingDataset abstraction that takes a list of input files and shuffling configuration, and yields batch_size-sized GPU batches via an iterator. This abstraction is framework-agnostic.
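
For illustration, direct use of the generic dataset might look like the following minimal sketch (the ShufflingDataset constructor is assumed to mirror the TorchShufflingDataset signature shown below, and the configuration values are the same as in that example):

from ray_shuffling_data_loader.dataset import ShufflingDataset

ds = ShufflingDataset(
    filenames,
    num_epochs,
    num_trainers,
    batch_size,
    rank,
    num_reducers=num_reducers,
    max_concurrent_epochs=max_concurrent_epochs)

for epoch in range(num_epochs):
    ds.set_epoch(epoch)
    for batch in ds:
        # Each yielded batch is a batch_size-sized Pandas DataFrame.
        pass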

We also expose a class deriving from Torch IterableDataset, for distributed Torch training use. This dataset abstraction also takes a feature and label column spec, used for converting the Pandas DataFrame GPU batches to Torch tensors.

import torch

from ray_shuffling_data_loader.torch_dataset import TorchShufflingDataset

# Input and shuffling configuration for the Torch shuffling dataset.
# Files containing training dataset.
filenames = ["s3://foo/bar"]
# Number of training epochs.
num_epochs = 10
# Number of model training workers.
num_trainers = 1
# Size of a GPU batch.
batch_size = 25000
# Number of reducers for the shuffler to use.
num_reducers = 8
# Maximum number of epoch shuffling runs that are allowed to run concurrently.
max_concurrent_epochs = 2
# The rank of this trainer. This can typically be retrieved via your
# distributed training framework, e.g. `hvd.rank()` for Horovod.
rank = 0

# Spec for feature and label columns, used to convert Pandas DataFrame GPU batches
# into Torch tensors.
# The column names in your Parquet data files.
feature_columns = ["col_0_name", "col_1_name"]
# The Torch types of your columns, e.g. `torch.float64`.
feature_types = [torch.float64, torch.float64]
# The label column name in your Parquet data files.
label_column = "label_col_name"
# The Torch type of your label column, e.g. `torch.float64`.
label_type = torch.float64

# Construct a Torch shuffling dataset that yields shuffled batch_size-sized
# GPU batches, transforming those Pandas DataFrame GPU batches into Torch tensors.
# The shuffling will be kicked off upon construction of this dataset.
ds = TorchShufflingDataset(
    filenames,
    num_epochs,
    num_trainers,
    batch_size,
    rank,
    num_reducers=num_reducers,
    max_concurrent_epochs=max_concurrent_epochs,
    feature_columns=feature_columns,
    feature_types=feature_types,
    label_column=label_column,
    label_type=label_type)

# Train a model on the yielded GPU batches.
for epoch in range(num_epochs):
    # We must set the epoch in order for the dataset to give us the GPU
    # batches for the right epoch.
    ds.set_epoch(epoch)
    for batch_idx, (data, target) in enumerate(ds):
        # Do a model training step on this GPU batch.
        pass
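
A minimal body for that training step might look like the following (illustrative only; `model`, `optimizer`, and `loss_fn` are assumed to come from your own training setup and are not part of this library):

        # Inside the batch loop above:
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()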

ray_shuffling_data_loader's People

Contributors

chongxiaoc, clarkzinzow, krfricke, richardliaw

ray_shuffling_data_loader's Issues

[Shuffle] Support num_mappers != num_files

Right now, the number of shuffle mappers is equal to the number of input files, with each mapper reading a single input file. However:

  1. For very large files, we may want multiple mappers reading segments of a single file in parallel, to bound per-mapper memory utilization and/or maximize read throughput.
  2. For many very small files, we may want a single mapper to read multiple files, to reduce per-task overhead and possibly improve read throughput.

(1) has already been brought up by a user as a concrete need. We should therefore support this by allowing num_mappers != num_files, and letting num_mappers be set by the user with a best-effort default as a fallback.

Implementation Notes

(2) (i.e. num_files > num_mappers) won't involve much more than the typical np.array_split() pattern, e.g.

for files in np.array_split(filenames, num_mappers):
    reducer_parts = shuffle_map.options(num_returns=num_reducers).remote(files, ...)

(1) (i.e. num_files < num_mappers), however, will require us to read a file starting at a certain offset; given that the Pandas read_parquet() API doesn't expose that functionality, the most likely course of action is to use the pyarrow Parquet reading facilities directly. For the MVP, we will most likely want to use a ParquetDataset with split_row_groups=True and rely on row-group granularity, as sketched below.
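
For illustration, a minimal sketch of row-group-granularity reads with pyarrow (filename and mappers_per_file are hypothetical placeholders; the real partitioning logic would live in the shuffler):

import numpy as np
import pyarrow.parquet as pq

pf = pq.ParquetFile(filename)
# Assign each mapper a contiguous slice of the file's row groups.
for group_ids in np.array_split(range(pf.num_row_groups), mappers_per_file):
    # Each mapper would read and yield only its assigned row groups.
    df = pf.read_row_groups([int(g) for g in group_ids]).to_pandas()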

[Docs] Add user guide.

Add user guide to docs. This should include:

  • Walkthrough of installation, setup, and example.
  • Cataloguing of all configuration details, e.g. what to set num_reducers to, why max_concurrent_epochs should almost always be 2, etc.

[Testing] Add unit tests.

Add unit tests for the following:

  1. Shuffler (shuffle.py)
  2. BatchQueue (batch_queue.py)
  3. ShufflingDataset (dataset.py)
  4. TorchShufflingDataset (torch_dataset.py)

Implementation Notes

  • ShufflingDataset and TorchShufflingDataset testing could initially be as simple as running the example in the main block of dataset.py and torch_dataset.py, respectively, although these don't do much to assert correctness of the yielded batches (see the smoke-test sketch after this list).
  • See these tests for easily portable examples for testing BatchQueue.
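
As a starting point, a smoke test for TorchShufflingDataset might look like the following hypothetical pytest sketch (it assumes the constructor from the Usage section, and only checks batch sizes, not shuffle correctness):

import pandas as pd
import torch
from ray_shuffling_data_loader.torch_dataset import TorchShufflingDataset

def test_torch_shuffling_dataset_smoke(tmp_path):
    # Write a small two-column Parquet file to shuffle.
    path = str(tmp_path / "part-0.parquet")
    pd.DataFrame({"feature": range(100), "label": range(100)}).to_parquet(path)
    ds = TorchShufflingDataset(
        [path], num_epochs=1, num_trainers=1, batch_size=10, rank=0,
        num_reducers=2, max_concurrent_epochs=1,
        feature_columns=["feature"], feature_types=[torch.int64],
        label_column="label", label_type=torch.int64)
    ds.set_epoch(0)
    for data, target in ds:
        assert len(target) <= 10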

[feature] support TensorFlow dataset binding in ray_data_loader

We need to build a connector to the TF dataset iterator.

Impl idea from @clarkzinzow:
We’d take the base shuffling dataset, create a ShufflingTFDataset that converts each batch dataframe to feature and target tensors, and then pass that ShufflingTFDataset as the generator to tf.data.Dataset.from_generator, which can then be used as your typical TensorFlow dataset:

ds = tf.data.Dataset.from_generator(ShufflingTFDataset(filenames, num_epochs, num_trainers, batch_size, dataframe_to_tensor_spec))
for batch_idx, (features, targets) in enumerate(ds):
    print(f"Processing batch {batch_idx}!")

I can’t see any obvious issues with doing this except for mapping TensorFlow’s distributed dataset + data-parallel training paradigms onto our current rank-based shuffling dataset, where we kick off the shuffle from the rank-0 training worker and give each worker an independent queue of batches. The latter should be doable by getting the replica ID from tf.distribute.get_replica_context() during iteration and using it to access the correct queue, but the former paradigm may need to be tweaked.
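
One additional wrinkle: tf.data.Dataset.from_generator expects a callable (plus a description of the yielded tensors) rather than an iterable, so the wiring would likely look more like the following sketch (ShufflingTFDataset, num_features, and the dtypes are placeholders):

import tensorflow as tf

shuffling_ds = ShufflingTFDataset(
    filenames, num_epochs, num_trainers, batch_size, dataframe_to_tensor_spec)
ds = tf.data.Dataset.from_generator(
    lambda: iter(shuffling_ds),
    output_signature=(
        tf.TensorSpec(shape=(None, num_features), dtype=tf.float64),
        tf.TensorSpec(shape=(None,), dtype=tf.float64)))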
