
ray_shuffling_data_loader's Introduction

Deprecated

This library has been deprecated, with our learnings absorbed into the much more general Ray Datasets library.
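
For reference, a rough equivalent of this library's per-epoch shuffle in Ray Datasets might look like the following minimal sketch (API names are from Ray Data; exact usage depends on your Ray version):

import ray

# Read the training files, then reshuffle the dataset before each epoch.
ds = ray.data.read_parquet("s3://foo/bar")
for epoch in range(10):
    shuffled = ds.random_shuffle()
    for batch in shuffled.iter_batches(batch_size=25000):
        # Train on this batch.
        pass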

Ray-based Shuffling Data Loader

A Ray-based data loader with per-epoch shuffling and configurable pipelining, for shuffling and loading training data for distributed training of machine learning models.

Installation

You can install the latest version from the main branch via:

pip install git+https://github.com/ray-project/ray_shuffling_data_loader.git@main#egg=ray_shuffling_data_loader

Usage

This shuffling data loader exposes a generic ShufflingDataset abstraction that takes a list of input files and shuffling configuration, and yields batch_size-sized GPU batches via an iterator. This abstraction is framework-agnostic.
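
For illustration, direct use of the generic dataset might look like the following minimal sketch (the ShufflingDataset constructor is assumed to mirror the TorchShufflingDataset signature shown below, and the configuration values are the same as in that example):

from ray_shuffling_data_loader.dataset import ShufflingDataset

ds = ShufflingDataset(
    filenames,
    num_epochs,
    num_trainers,
    batch_size,
    rank,
    num_reducers=num_reducers,
    max_concurrent_epochs=max_concurrent_epochs)

for epoch in range(num_epochs):
    ds.set_epoch(epoch)
    for batch in ds:
        # Each yielded batch is a batch_size-sized Pandas DataFrame.
        pass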

We also expose a class deriving from Torch IterableDataset, for distributed Torch training use. This dataset abstraction also takes a feature and label column spec, used for converting the Pandas DataFrame GPU batches to Torch tensors.

import torch

from ray_shuffling_data_loader.torch_dataset import TorchShufflingDataset

# Input and shuffling configuration for the Torch shuffling dataset.
# Files containing training dataset.
filenames = ["s3://foo/bar"]
# Number of training epochs.
num_epochs = 10
# Number of model training workers.
num_trainers = 1
# Size of a GPU batch.
batch_size = 25000
# Number of reducers for the shuffler to use.
num_reducers = 8
# Maximum number of epoch shuffling runs that are allowed to run concurrently.
max_concurrent_epochs = 2
# The rank of this trainer. This can typically be retrieved via your
# distributed training framework, e.g. `hvd.rank()` for Horovod.
rank = 0

# Spec for feature and label columns, used to convert Pandas DataFrame GPU batches
# into Torch tensors.
# The column names in your Parquet data files.
feature_columns = ["col_0_name", "col_1_name"]
# The Torch types of your columns, e.g. `torch.float64`.
feature_types = [torch.float64, torch.float64]
# The label column name in your Parquet data files.
label_column = "label_col_name"
# The Torch type of your label column, e.g. `torch.float64`.
label_type = torch.float64

# Construct a Torch shuffling dataset that yields shuffled batch_size-sized
# GPU batches, transforming those Pandas DataFrame GPU batches into Torch tensors.
# The shuffling will be kicked off upon construction of this dataset.
ds = TorchShufflingDataset(
    filenames,
    num_epochs,
    num_trainers,
    batch_size,
    rank,
    num_reducers=num_reducers,
    max_concurrent_epochs=max_concurrent_epochs,
    feature_columns=feature_columns,
    feature_types=feature_types,
    label_column=label_column,
    label_type=label_type)

# Train a model on the yielded GPU batches.
for epoch in range(num_epochs):
    # We must set the epoch in order for the dataset to give us the GPU
    # batches for the right epoch.
    ds.set_epoch(epoch)
    for batch_idx, (data, target) in enumerate(ds):
        # Do a model training step on this GPU batch.
        pass
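
A minimal body for that training step might look like the following (illustrative only; `model`, `optimizer`, and `loss_fn` are assumed to come from your own training setup and are not part of this library):

        # Inside the batch loop above:
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()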

ray_shuffling_data_loader's People

Contributors

chongxiaoc, clarkzinzow, krfricke, richardliaw

ray_shuffling_data_loader's Issues

[Shuffle] Support num_mappers != num_files

Right now, the number of shuffle mappers is equal to the number of input files, with each mapper reading a single input file. However:

  1. For very large files, we may want multiple mappers reading segments of a single file in parallel, to bound per-mapper memory utilization and/or maximize read throughput.
  2. For many very small files, we may want a single mapper to read multiple files, to reduce per-task overhead and possibly improve read throughput.

(1) has already been brought up by a user as a concrete need. We should therefore support this by allowing num_mappers != num_files, and letting num_mappers be set by the user with a best-effort default as a fallback.

Implementation Notes

(2) (i.e. num_files > num_mappers) won't involve much more than the typical np.array_split() pattern, e.g.

for files in np.array_split(filenames, num_mappers):
    reducer_parts = shuffle_map.options(num_returns=num_reducers).remote(files, ...)

(1) (i.e. num_files < num_mappers), however, will require us to read a file starting at a certain offset; given that the Pandas read_parquet() API doesn't expose that functionality, the most likely course of action is to use the pyarrow Parquet reading facilities directly. For the MVP, we will most likely want to use a ParquetDataset with split_row_groups=True and rely on row-group granularity, as sketched below.
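
For illustration, a minimal sketch of row-group-granularity reads with pyarrow (filename and mappers_per_file are hypothetical placeholders; the real partitioning logic would live in the shuffler):

import numpy as np
import pyarrow.parquet as pq

pf = pq.ParquetFile(filename)
# Assign each mapper a contiguous slice of the file's row groups.
for group_ids in np.array_split(range(pf.num_row_groups), mappers_per_file):
    # Each mapper would read and yield only its assigned row groups.
    df = pf.read_row_groups([int(g) for g in group_ids]).to_pandas()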

[Docs] Add user guide.

Add user guide to docs. This should include:

  • Walkthrough of installation, setup, and example.
  • Cataloguing of all configuration details, e.g. what to set num_reducers to, why max_concurrent_epochs should almost always be 2, etc.

[Testing] Add unit tests.

Add unit tests for the following:

  1. Shuffler (shuffle.py)
  2. BatchQueue (batch_queue.py)
  3. ShufflingDataset (dataset.py)
  4. TorchShufflingDataset (torch_dataset.py)

Implementation Notes

  • ShufflingDataset and TorchShufflingDataset testing could initially be as simple as running the example in the main block of dataset.py and torch_dataset.py, respectively, although these don't do much to assert correctness of the yielded batches (see the smoke-test sketch after this list).
  • See these tests for easily portable examples for testing BatchQueue.
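
As a starting point, a smoke test for TorchShufflingDataset might look like the following hypothetical pytest sketch (it assumes the constructor from the Usage section, and only checks batch sizes, not shuffle correctness):

import pandas as pd
import torch
from ray_shuffling_data_loader.torch_dataset import TorchShufflingDataset

def test_torch_shuffling_dataset_smoke(tmp_path):
    # Write a small two-column Parquet file to shuffle.
    path = str(tmp_path / "part-0.parquet")
    pd.DataFrame({"feature": range(100), "label": range(100)}).to_parquet(path)
    ds = TorchShufflingDataset(
        [path], num_epochs=1, num_trainers=1, batch_size=10, rank=0,
        num_reducers=2, max_concurrent_epochs=1,
        feature_columns=["feature"], feature_types=[torch.int64],
        label_column="label", label_type=torch.int64)
    ds.set_epoch(0)
    for data, target in ds:
        assert len(target) <= 10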

[feature] support TensorFlow dataset binding in ray_data_loader

We need to build a connector to the TF dataset iterator.

Impl idea from @clarkzinzow:
We’d take the base shuffling dataset, create a ShufflingTFDataset that converts each batch dataframe to feature and target tensors, and then pass that ShufflingTFDataset as the generator to tf.data.Dataset.from_generator, which can then be used as your typical TensorFlow dataset:

ds = tf.data.Dataset.from_generator(ShufflingTFDataset(filenames, num_epochs, num_trainers, batch_size, dataframe_to_tensor_spec))
for batch_idx, (features, targets) in enumerate(ds):
    print(f"Processing batch {batch_idx}!")

I can’t see any obvious issues with doing this except for mapping TensorFlow’s distributed dataset + data-parallel training paradigms onto our current rank-based shuffling dataset, where we kick off the shuffle from the rank-0 training worker and give each worker an independent queue of batches. The latter should be doable by getting the replica ID from tf.distribute.get_replica_context() during iteration and using it to access the correct queue, but the former paradigm may need to be tweaked.
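
One additional wrinkle: tf.data.Dataset.from_generator expects a callable (plus a description of the yielded tensors) rather than an iterable, so the wiring would likely look more like the following sketch (ShufflingTFDataset, num_features, and the dtypes are placeholders):

import tensorflow as tf

shuffling_ds = ShufflingTFDataset(
    filenames, num_epochs, num_trainers, batch_size, dataframe_to_tensor_spec)
ds = tf.data.Dataset.from_generator(
    lambda: iter(shuffling_ds),
    output_signature=(
        tf.TensorSpec(shape=(None, num_features), dtype=tf.float64),
        tf.TensorSpec(shape=(None,), dtype=tf.float64)))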
