mosaicml / composer Goto Github PK

View Code? Open in Web Editor NEW

5.1K 51.0 408.0 14.25 MB

Supercharge Your Model Training

Home Page: http://docs.mosaicml.com

License: Apache License 2.0

Python 99.60% Dockerfile 0.37% Makefile 0.03%

deep-learning pytorch neural-networks ml-systems ml-efficiency ml-training machine-learning neural-network

composer's People

Contributors

Stargazers

Watchers

Forkers

hanlint brando90 jbloxham landanjs ajaysaini725 dblalock corymosaicml jacobfulano anisehsani zeeroocooll averylamp siriuslee abhi-mosaic zhanglianng ravi-mosaicml lupesko rishikumarray stanford-crfm cronblocks-ai techthiyanes af-74413592 swagshaw neildg celestialized sidney1994 murthyn anhnguyendepocen christindbose stjordanis ai-hub-deep-learning-fundamental joskid zonezone12 ianworley dlmgary moinnadeem murilo elvishelvis vwxyzjn mrcodechef geoffreyporto mcneela urnotpopcorn huaxingxu mathpopo chatchanan-v liangrongzhi136 xuliji pavithranrao joeupwu strategist922 pikus16 ma-zx geothinking zenithez imflash217 fellowship ofirpress icodein tuozhanjun fdoperezi mbrukman azgo14 psyrtsov arslan-z ronenj changdong-zhou a-jacobson razaulazam alextrott16 knighton dskhudia mvpatel2000 zhouleidcc milocress ejyuen ishanashastri growlix codestar12 vladd-i renesugar codeaudit kaihsiangl arita37 navyjeff caorenzhi iainwh 9cat zinyurita bandish-shah florescl nqn dylankipkemoi neleon linden-li hurricanejin hongbo-miao shinyfoil behradtoghi hanrui-wang bdutta19

composer's Issues

Lazy loading of non-core dependencies

All non-core dependencies should be lazily loaded, so one can use the library without having to install composer[all]

This likely means that functions that depend on a non-core dependency should import that dependency inside the function, not at module-level.

Launch DDP processes before initializing trainer

🚀 Feature Request

Our current trainer relaunches itself N times to create N processes for DDP. The problem with this is that DDP does so by rerunning the very script that launched the trainer in the first place. This is problematic for any user invoking DDP via a custom script, and also for testing.

The canonical solution to this problem is to provide a launch executable that wraps a user provided script to initialize a trainer. The launch executable runs the script N times to create N processes. This appears to be the direction that many ML frameworks, including DeepSpeed, are moving towards.

Motivation

This will simplify testing and allow us to accurately calculate coverage metrics. This is also essentially a prerequisite to integrating the trainer with DeepSpeed, which also uses an executable.

DeepSpeed Integration

🚀 Feature Request

Integration with DeepSpeed. The V0 use case is targeted only on data parallelism strategies like ZeRO.

Motivation

Necessary to train GPT models above 1.3B parameters.

Fix model surgery so `Event.INIT` can be removed from Trainer `init`

Right now model surgery does not work after the model parameters have been passed to an optimizer. As a result, we call the Event.INIT (which is used by model-modifying methods such as Blurpool and SqueezeExcite) call back in the Trainer __init__ before the optimizer is constructed rather than in the training loop.

This yields API complications because the user cannot pass a pre-constructed optimizer into the Trainer __init__.

We need to get surgery working properly and test it on Blurpool and SqueezeExcite to make sure there are no regressions.

Test support for Pytorch v1.10

Run regression tests on pytorch v1.10 (https://pytorch.org/blog/pytorch-1.10-released/)

Python 3.7 support (for use with colab)

🚀 Feature Request

add support for python 3.7 build

Motivation

I wanted to play with composer but was not able to install via pip because google Colab runs in a python 3.7 environment

Add `callback.run_event()`

Add a helper method in callback for run_event. This helper method would then call the correct method on callback. It would look something like:

class Callback:
  def run_event(self, state; State, logger; Logger, event: Event):
    if event == Event.TRAINING_START:
        self.training_start(state, logger)
    if event == Event.BEFORE_FORWARD:
        self.before_forward(state, logger)
    ...

Then, the engine would do callback.run_event(state, logger, event).

This would help clean up code in the following places:

RankZeroCallback: Instead of monkeypatching each callback function, it would simply override run_event.
RankZeroLogger: No need for a private _training_start method that is different from all of the other callbacks
Checkpointing tests: The EventCounterCallback basically does this, via monkeypatching

Add GLUE Benchmark for NLP

Synthetic Data Generation for Brats Dataset

Brats does not support synthetic data. It would be great to add support for it.

Add Colab Example

Add Example Jupyter notebook to the examples folder
Add "Open in Colab" to the README.md

Auto Grad Accum

🚀 Feature Request

The trainer can automatically determine the appropriate grad_accum to use based off of hardware properties.

Motivation

It is cumbersome to manually specify the grad accum for every hardware and model.

Implementation

while True:
  try:
      train_model()
  except CudaOOOM:
      state.grad_accum += 1

Implement ESAM

Efficient SAM (anonymous, 2022) is a proposed duo of SAM optimizations to reduce the throughput hit of SAM. The composer repo already supports an interval hyperparameter which has empirically been found to maintain much of the quality improvement of SAM while sacrificing little throughput, but it would be interesting to see if ESAM could enable setting lower values of interval.

Precision Handling Support with DeepSpeed

DeepSpeed currently crashes if you try using it to train RN50 with FP16 (FP32 works fine). The problem is that the model needs the input tensor to also be in FP16, but the dataloader does nothing to change the dtype of the batches it returns according to the current precision. This isn't a problem for NLP models because the dtypes of NLP batches are generally all integer types anyways, so those models already handle casting the batch types (or something like that, I'm a bit unclear on exactly what's happening).

My proposed fix is fairly hacky. I'd like to avoid having to add code to dataloaders, datasets, and models to handle FP16 precision settings. Instead, I'd like to have the trainer itself handle casting batches to FP16 as appropriate. The hacky part of this is that the trainer needs to be able to determine when this cast should be done, as for ImageNet, but not for NLP. There's no perfect way to do this. I'm going to try having it just cast any FP32 tensor it sees in loaded batches to FP16.

Support `loss.backward(create_graph=True)` in the Trainer

Methods such as AdaHessian or similar need support for this feature.

Fix the `load_model` test for unet and GPT

The unet and gpt models currently fail on tests/test_load.py due to something about the mock model.
They likely need a mock model of the appropriate type.
Need to debug and fix these tests.

Enable small "smoke test" runs

🚀 Feature Request

Add a --smoke-test flag or something similar.

Motivation

I would like to be able to start a run that simply checks one step of training and one step of validation to ensure as well as possible that the training pipeline is working. This will make it easier when running many runs in parallel, where a small bug in the validation loop can waste a lot of time and compute resources.

Algorithm Composability API

When running tests, we validate that algorithms run on each model type. Some algorithms are not compatible with some models (e.g. NLP algs on image classification models), so we manually hard-code this in the tests. It would be helpful to have a first-class API to get which models support which algorithms, and which algorithms support which models.

The engine could also use this information to perform a static analysis to detect runtime issues before they arise.

One possible design could be to have a ModelType that would work like this:

class ModelType(StringEnum):
    CLASSIFICATION = "Classification"
    NLP = "Nlp"
    ...

class BaseMosaicModel:
    model_type: ModelType   # would be set on each model
    ...

class Algorithm:
    @classmethod
    def get_supported_model_types(cls):
        return list(ModelType)    # can be overridden on each algorithm

Blob Store Uploading for the Run Directory

🚀 Feature Request

Add callbacks to upload the run directory to blob stores (s3, gcs)

Motivation

Currently, the run directory is only saved locally (or, uploaded to WANDB, but we're running into issues with that). When a K8S pod dies, we lose the run directory. We store logs, checkpoints, traces, etc... in the run directory, so this should be persisted.

[Optional] Implementation

This can be implemented via a callback, quite trivially. It would be best to delegate the directory monitoring / uploading to a subprocess (not sub thread), as not to use GIL time in the main training loop. While network I/O happens outside the GIL, other work related to uploading (e.g. computing file hashes) does occur within the GIL, so it would be best to offload this. However, an initial implementation can use a background thread.

For cross-cloud compatibility, going to use apache libcloud.

Training run processes do not stop at the end of training

Environment

mosaicml/research:latest docker container on 3080s.

To reproduce

Command:
python examples/run_mosaic_trainer.py -f composer/yamls/models/resnet50.yaml --loggers wandb --loggers.wandb.entity mosaic-ml --loggers.wandb.project landan-random --callbacks speed_monitor lr_monitor --callbacks.speed_monitor.window_size 100

I believe Cory saw hanging at the end of the CIFAR-10 benchmark as well, so that may be sufficient to reproduce the bug.

Expected behavior

All (sub)processes to be killed at the end of training.

Additional context

Training runs hang at the end of training. This means the processes will continue to run although training is complete.

Add Memory Monitor Callback

🚀 Feature Request

Add a callback to monitor memory statistics during training such as memory reserved by the caching allocator, number of malloc calls, number of free calls, etc...

Motivation

Having memory allocator statistics during training available is very helpful for debugging issues such as OOM and memory leaks.

Implementation

See: https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html#torch.cuda.memory_stats for the API that gives this information.

Remove curriculum learning algorithm

This was renamed to seq_length_warmup but for some reason curriculum_learning.py still exists.

run_mosaic_trainer.py to print help text when invoked without arguments

🚀 Feature Request

Instead of getting a ValueError (and stack trace) when running run_mosaic_trainer.py without arguments, it might be a bit more friendly to print out the help text from -h.

Motivation

Getting a good first impression (and not feeling accused) as a CLI user is good practice, and we can print out the help text easily.

Synthetic Data Generation

When testing, benchmarking, smoke testing, and profiling, it is helpful to be able to easily get synthetic data that can then be passed into the model.forward() function for any type of model. However, it is impossible to automatically read the input (tensor) shape off of the model graph, so we are currently manually specifying the input shape wherever we perform synthetic passes (e.g. in tests, when constructing the synthetic dataset, etc...)

Because different models have different input formats, it would be difficult to describe this via a static parameter such as input shape -- e.g. nlp models use an input dictionary. As such, generating a synthetic batch would be preferred.

Option A: Add `get_synethic_batch(batch_size)` on each BaseMosaicModel:

Proposed Example

class BaseMosaicModel:
    @abc.abstractclass
    def get_synethic_batch(self, batch_size: int, synthetic_data_distribution: SynethicDataDistributionEnum) -> Batch:
        # for ease of subclass implementation, a set of helper methods would be available
        pass

Then, anything that needs to perform a forward pass could do:

def my_profiling_script(model: BaseMosaicModel):
    batch = model.get_synethic_batch(batch_size=10)  # returns a batch size of 10 samples that the model can train on
    output = model(batch)

We could also generalize the synthetic dataset to do something like:

class SyntheticDataset:
   def __init__(self, model):
        self.model = model
        
   def __getitem__(self, i):
         return self.model.get_synthetic_batch(1)

Option B: Add a SyntheticDatasetGenerator

Instead of storing how to generate synthetic batch information on each model, this could instead be stored in a common registry-like design. For example:

class SyntheticDatasetGenerator:
    def get_synethic_dataset(self, model, *args, **kwargs):
        if isinstance(model, MNIST):
             return SyntheticDataset(input_shape=(1, 28, 28), *args, **kwargs)
        if isinstance(model, Resent):
             return SyntheticDataset(input_shape=(3, 224, 224), *args, **kwargs)

This option would require generalization of the SyntheticDataset to support NLP data.

Errors are not printed to stdout when using multi-gpu training

Anytime an error occurs while I am using multi-gpu training, the job crashes, but the error is not printed. I need to run the experiment with a single GPU to find what the error was.

Is there a way to fix this? It makes determining issue very difficult.

I can try to create an example with the current release if needed.

Graceful Trainer Cleanup upon `KeyboardInterrupt`

Trainers do not cleanup properly when KeyboardInterrupted. Should cleanup the model/possibly keep the model in a state where it can be partially trained but evaluated .fit() is exited early. Probably should gracefully exit and cleanup for interactive composer users

** To reproduce

from composer import algorithms, trainer, Trainer
from composer.core.types import Precision

hparams = trainer.load("classify_mnist_cpu")  # loads from composer/yamls/models/classify_mnist_cpu.yaml
hparams.algorithms = algorithms.load_multiple("blurpool", "label_smoothing")

# edit other properties in the hparams object
hparams.precision = Precision.FP32
hparams.grad_accum = 2
hparams.set_datadir("~/datasets")

trainer = Trainer.create_from_hparams(hparams)
trainer.fit()

then CTRL-C, Keyboard Escape
then

trainer.fit()

Produces

>>> trainer.fit()
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
BrokenPipeError: [Errno 32] Broken pipe
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
  File "/home/avery/mosaic/composer/composer/trainer/trainer.py", line 356, in fit
BrokenPipeError: [Errno 32] Broken pipe
    self._train_loop()
  File "/home/avery/mosaic/composer/composer/trainer/trainer.py", line 488, in _train_loop
    assert isinstance(original_model, BaseMosaicModel)
AssertionError

Remove `soft_cross_entropy`

Motivation

As of this update in pytorch: pytorch/pytorch#61044, we no longer need our implementation of soft_cross_entropy in composer/models/loss.py and should remove it in favor of the one in pytorch.

DDP Spawn `can only test a child process` error

** To reproduce

Steps to reproduce the behavior:
Produces a traceback in DDP spawn (cpu only), where workers crash (still trains fine)

from composer.trainer import TrainerHparams, Trainer

hparams = TrainerHparams.create('composer/yamls/models/classify_mnist_cpu.yaml')
hparams.set_datadir("~/datasets")
trainer = Trainer.create_from_hparams(hparams)

trainer.fit()

/home/avery/mosaic/composer/composer/utils/ddp.py:20: UserWarning: DDPDefaultValueWarning: RANK env var not set and process group not initialized; returning 0 for global rank.
  warnings.warn(f"DDPDefaultValueWarning: {env_var} env var not set"
Epoch 1:   3%|▎         | 1/29 [00:00<00:21,  1.32it/s]                                                                                      /home/avery/mosaic/composer/composer/utils/ddp.py:20: UserWarning: DDPDefaultValueWarning: WORLD_SIZE env var not set and process group not initialized; returning 1 for world size.
  warnings.warn(f"DDPDefaultValueWarning: {env_var} env var not set"
Epoch 1:   3%|▎         | 1/29 [00:00<00:21,  1.32it/s, loss/train=2.3191]                                                                   Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7ff33771e5e0>
Traceback (most recent call last):
  File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
    self._shutdown_workers()
  File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1320, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7ff33771e5e0>
Traceback (most recent call last):
  File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
    self._shutdown_workers()
  File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1320, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

Synthetic Data Generation for LM Datasets

LM datasets do not support synthetic data. It would be great to add support for it.

Fix flaky convergence unit test

The trainer convergence test is flaky right now. This is likely due to the fact that we are using a CNN for the test which does significant dimensionality reduction and is thus hard to reason about in terms of linear separability of gaussian data. A fix would be to convert the test into training logistic regression.

** To reproduce

Run the test many times on the same code (seems to fail once every ~50-100 times)

Expected behavior

The test behavior should be consistent (i.e. if it passes once on some code then it should always pass on that code).

Add supported docker image matrix to docker/README.md

Add in the officially supported docker images to docker/README.md

Originally posted by @ravi-mosaicml in #66 (comment)

Support for subset sampler

🚀 Feature Request

Add support for training on only a subset of the dataset on each epoch.

Motivation

During testing and profiling, it can be important to skip over the first epoch (e.g. to ignore io bandwidth), but it is usually not needed to train over the entire dataset. Only a small subset is needed.

Implementation

Add support for https://pytorch.org/docs/stable/_modules/torch/utils/data/sampler.html#SubsetRandomSampler.
It will be a bit more complicated to make a DDP version of this.

Configure Jenkins

Enable github actions for:

pytest CPU runner
formatting and type checking (yapf, pyright)
docker builds
docs builds

ZeRO Configuration with DeepSpeed

Supporting stage 3 is expected to be non-trivial, since we can no longer store a complete copy of the model on each node.

Proper seeding for DDP

If the seed is not set in hparams, it is randomly selected in __init__. Each DDP process, when it starts up, gets a different random seed.

The seed from the rank 0 process is saved in checkpoints

When resuming from a checkpoint, the seed from the rank 0 process is restored across all DDP processes.
This leads to inconsistent behavior, since the non-rank-0 process now resume with a different seed than they first trained with.

To fix: add the seed to the RNG state, and sync across all DDP processes

`eval_only` flag

🚀 Feature Request

For post-hoc measurements on different datasets, we want to be able to load a checkpoint and run --eval_only.

[Optional] Implementation

Add a --eval_only flag that loads from checkpoint and only runs eval. User would need to specify a new dataset/dataloader that differs from the checkpointed hparams.

Cityscapes + Deeplabv3 benchmark

🚀 Feature Request

Add a semantic segmentation benchmark based on the Cityscapes dataset and the Deeplabv3 architecture.

Motivation

Prior work

Our current segmentation benchmark is based on the Multimodal Brain Tumor Segmentation Challenge (BraTS) and the Unet architecture. There are a couple of reasons why we may want to add another segmentation benchmark:

Dataset size: BraTS has lower resolution (192 x 160) and a smaller number of training images (500) than we expect from other segmentation datasets. As of now, we can train on BraTS in ~3 minutes
Recognition: BraTS does not seem as recognizable in the ML community, so it may be difficult for people to interpret our results. Also, a frequently used dataset would be beneficial to the community since proposed methods can be easily compared to prior work using the dataset.
Domain: this may just be a me thing, but I think it would be helpful to have a dataset in a similar domain to the ImageNet benchmark. This could help in determining whether the success/failure of a method depends on the task or the domain.

Cityscapes

Cityscapes appears to be the second most common semantic segmentation benchmark (behind Pascal VOC), so evaluating methods on Cityscapes should be relevant to the community. Cityscapes image resolution is 1024 x 2048 and the training set contains 2,975 densely and 20,000 coarsely annotated images (not as many as we would like, but a start). Alternatively, we could use ADE20k or Pascal VOC segmentation if others feel strongly towards either dataset.

Deeplabv3

It would be easier to benchmark with Deeplabv3 since the hyperparameters and target performance on Cityscapes are known. As of now, we have no numbers on training time, so this will be unknown. For Unet, we would need to tune hyperparameters and would not be sure if we are achieving an expected performance.

Implementation

Simple implementation outline, but should be made more detailed:

Cityscapes DataloaderSpec: will try to use torchvision.datasets.cityscapes if it fits our use case
Deeplabv3 BaseMosaicModel: will try to use torchvision.models.segmentation.deeplabv3_resnet101 if it fits our use case
Implement intersection-over-union (IoU) metric for evaluation: should be easyish?
Dataset and model throughput profiling
Dataset and model card

Fix engine compile to ensure `selective_backprop` is first algorithm in `AFTER_DATALOADER` event

selective_backprop needs to be the first algorithm used in the AFTER_DATALOADER event because it prunes data samples and we only want to run other data-modification algorithms on the pruned set of data samples.

Auto-TCP Port Selection by default

Steps to reproduce the behavior:

Run .fit() twice on the same Trainer in a script or notebook (not cleaned up port usage for torch.distributed)

Expected behavior

Ideally the Trainer by default won't use a static port in TCPStore and instead select an open port to use for torch.distributed coordination.

Symbolic links in models directory are broken

For example: https://github.com/mosaicml/composer/tree/main/composer/models/resnet50

The links seem to point to themselves.

Support multiple `eval` datasets

🚀 Feature Request

For fine-tuning tasks (e.g. GLUE) and also many vision experiments, need to support multiple eval datasets. The metrics needed could be different across different datasets.

[Optional] Implementation

support for eval_dataloaders as a List
during the eval loop, run through multiple dataloaders and log the metrics for each dataset
to support different metrics, may need to either (1) store the metric with the datasets, or (2) have the model's metric function return different metrics depending on the dataset.

Remove deferred logging

With #65, the global rank is now known when the python process starts. Thus, for rank zero loggers, it is not necessary to wait until training start to initialize the logger. Instead, loggers should initialize on the INIT event, and process all logging calls immediately.
By convention, there will not be any calls to the loggers before the init event.

WandB error due to multiple artifacts with the same ID

When running a baseline resnet50 model on imagenet, I encountered this error:

wandb: ERROR Error while calling W&B API: Error 1062: Duplicate entry '6394579-1' for key 'unique_artifact_collection_membership_version' (<Response [409]>)
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/filesync/step_upload.py", line 50, in _thread_body
    self._handle_event(event)
  File "/usr/local/lib/python3.8/dist-packages/wandb/filesync/step_upload.py", line 79, in _handle_event
    self._maybe_commit_artifact(job.artifact_id)
  File "/usr/local/lib/python3.8/dist-packages/wandb/filesync/step_upload.py", line 161, in _maybe_commit_artifact
    self._api.commit_artifact(artifact_id)
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 2235, in commit_artifact
    response = self.gql(mutation, variable_values={"artifactID": artifact_id})
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/lib/retry.py", line 102, in __call__
    result = self._call_fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 147, in execute
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.8/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 141, in execute
    return self.client.execute(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
    result = self._get_result(document, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
    return self.transport.execute(document, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 39, in execute
    request.raise_for_status()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://api.wandb.ai/graphql

I've asked the WandB folks and they think it's from an attempted upload of an artifact with the same ID as another. The recent addition of artifact uploading from run_directory seems to be causing this, so PR #89 will disable it by default, but we need to verify that artifact uploads are working as expected.

Remove seed from state

The seed is stored in the State object in the Trainer but instead it should be stored in the checkpoint_rng object.

Note that right now, if the user does not set a seed on trainer init, then a different seed is created on each process but only the rank 0 seed its saved in the checkpoint. We want to enforce each device using the same seed which will be addressed by #12

Checkpointing with DeepSpeed

Override `max_epochs` on resume from checkpoint with SSR

When resuming from a checkpoint, max_epochs currently defaults to the original max_epochs which prevents users from being able to train for more than the original max_epochs when resuming from a checkpoint.

It would be good to be able to resume from checkpoint and train for more epochs than the original max_epochs. However, we need to come up with a scheme to make this work with scale_schedule_ratio because scale schedule ratios are computed assuming that max_epochs does not change.

How should we go about handling this?

Implement `predict()` in `Trainer`

The Trainer class should have a predict() function as a convenience for a user who wants to run inference on a trained model.

GPT models differ between code and docs

The gpt2 models in the code (38m, 85m, 114m) are different from what's in the docs (52m, 83m, 125m). Also p sure d_attn for GPT2-52m is incorrect in the table in the docs.

Use GPUs in tests

🚀 Feature Request

We have a number of unit tests (ex. tests/trainer/test_trainer.py and tests/trainer/test_checkpoint.py which use GPUs as a part of the test. However, these tests are not run as a part of the GitHub actions tests, which results in the potential for GPU-related bugs. We should have a system in place which runs GPU tests before code can be merged into dev.

Motivation

There have been GPU-specific bugs in the past that were not caught because GPU tests do not run in our unit testing suite.

Implementation

Can use CircleCI for this.

More efficient microbatch DDP sync behavior when `find_unused_parameters` is True

Ordinarily, when training with gradient accumulation, we only need to do a DDP sync on the final microbatch, because synced gradients aren't needed until the optimizer runs at the end of the batch. However, the find_unused_parameters flag indicates that some algorithms (such as stochastic depth) may cause not all gradients to be generated. Critically, the set of unused parameters may vary between microbatches. Syncing on only the last microbatch may cause some parameters used in earlier microbatches but unused in the final microbatch to not be properly synced - resulting in severe quality degradations.

Our current solution to this issue is just to sync all microbatches when the find_unused_paramaters flag is set, but this incurs a throughput penalty of about 5%, depending on gradient accumulation setting. We would like to investigate whether it is possible to sync all parameters used in any microbatch, to avoid this throughput penalty.

WandB logger should have `every_n_batches`

In order to space out calls to the wandb client, we should support the same frequency settings as the FileLogger.

mosaicml / composer Goto Github PK

composer's People

Contributors

Stargazers

Watchers

Forkers

composer's Issues

🚀 Feature Request

Motivation

🚀 Feature Request

Motivation

🚀 Feature Request

Motivation

🚀 Feature Request

Motivation

Implementation

🚀 Feature Request

Motivation

🚀 Feature Request

Motivation

[Optional] Implementation

Environment

To reproduce

Expected behavior

Additional context

🚀 Feature Request

Motivation

Implementation

🚀 Feature Request

Motivation

Synthetic Data Generation

Option A: Add get_synethic_batch(batch_size) on each BaseMosaicModel:

Proposed Example

Option B: Add a SyntheticDatasetGenerator

Motivation

Expected behavior

🚀 Feature Request

Motivation

Implementation

🚀 Feature Request

[Optional] Implementation

🚀 Feature Request

Motivation

Prior work

Cityscapes

Deeplabv3

Implementation

Expected behavior

🚀 Feature Request

[Optional] Implementation

🚀 Feature Request

Motivation

Implementation

Recommend Projects

Recommend Topics

Recommend Org

Option A: Add `get_synethic_batch(batch_size)` on each BaseMosaicModel: