
pytorch_tempest's Introduction

tempest


This repository contains my pipeline for training neural nets.

Main frameworks used: PyTorch, PyTorch Lightning, and Hydra.

The main ideas of the pipeline:

  • all parameters and modules are defined in configs;
  • prepare configs beforehand for different optimizers/schedulers and so on, so it is easy to switch between them;
  • have templates for different deep learning tasks. Currently, image classification and named entity recognition are supported;

Examples of running the pipeline. This command runs training on MNIST (the data will be downloaded automatically):

>>> python train.py --config-name mnist_config model.encoder.params.to_one_channel=True

Running on MPS (M1 MacBook):

python train.py --config-name mnist_config model.encoder.params.to_one_channel=True trainer.accelerator=mps +trainer.devices=1 optimizer=adan training.lr=0.001

The default run:

>>> python train.py

The default version of the pipeline runs on the imagenette dataset. To use it, download the data from this repository: https://github.com/fastai/imagenette, unzip it, and define the path to it in the path key of conf/datamodule/image_classification.yaml.
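For illustration, the relevant part of that config might look like the following (a sketch: only the path key is confirmed by the description above, and the value is a placeholder for wherever you unzipped the data):

# conf/datamodule/image_classification.yaml (fragment, sketch)
path: /home/user/data/imagenette2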

pytorch_tempest's People

Contributors

erlemar, shrey-b, utisetur


pytorch_tempest's Issues

Switch to Train/EvalResult

Train/EvalResult seems promising, but has some bugs. When everything is fixed, I should switch to it.

First run does not work

If I run the command

python train.py --config-name mnist_config model.encoder.params.to_one_channel=True

I get an error:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

| Name | Type | Params

0 | model | Net | 23.6 M
1 | loss | CrossEntropyLoss | 0
2 | metric | Accuracy | 0

23.6 M Trainable params
0 Non-trainable params
23.6 M Total params
94.253 Total estimated model params size (MB)
Epoch 0: 0%| | 0/16 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 92, in
run_model()
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
_run_hydra(
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
run_and_report(
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
raise ex
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/hydra/_internal/utils.py", line 347, in
lambda: hydra.run(
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 107, in run
return run_job(
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/hydra/core/utils.py", line 125, in run_job
ret.return_value = task_function(task_cfg)
File "train.py", line 88, in run_model
run(cfg)
File "train.py", line 61, in run
trainer.fit(model, dm)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
self._run(model)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
self.dispatch()
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
self.accelerator.start_training(self)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
return self.run_train()
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
self.train_loop.run_training_epoch()
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
model_ref.optimizer_step(
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 329, in optimizer_step
self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in run_optimizer_step
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 193, in optimizer_step
optimizer.step(closure=lambda_closure, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torch/optim/adamw.py", line 65, in step
loss = closure()
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 732, in train_step_and_backward_closure
result = self.training_step_and_backward(
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
return self.training_type_plugin.training_step(*args)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/dp.py", line 98, in training_step
return self.model(*args, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 77, in forward
output = super().forward(*inputs, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
output = self.module.training_step(*inputs, **kwargs)
File "/home/joefox/data/nextcloud/projects/pytorch_tempest/src/lightning_classes/lightning_image_classification.py", line 54, in training_step
score = self.metric(logits.argmax(1), target)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torchmetrics/metric.py", line 190, in forward
self.update(*args, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torchmetrics/metric.py", line 249, in wrapped_func
return update(*args, **kwargs)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torchmetrics/classification/accuracy.py", line 231, in update
mode = _mode(preds, target, self.threshold, self.top_k, self.num_classes, self.multiclass)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torchmetrics/functional/classification/accuracy.py", line 36, in _mode
mode = _check_classification_inputs(
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torchmetrics/utilities/checks.py", line 288, in _check_classification_inputs
_check_num_classes_mc(preds, target, num_classes, multiclass, implied_classes)
File "/home/joefox/.pyenv/versions/hydra/lib/python3.8/site-packages/torchmetrics/utilities/checks.py", line 164, in _check_num_classes_mc
raise ValueError("The highest label in target should be smaller than num_classes.")
ValueError: The highest label in target should be smaller than num_classes.
Epoch 0: 0%| | 0/16 [00:00<?, ?it/s]

What have I done wrong?
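The final ValueError comes from torchmetrics' input checks: the Accuracy metric was built with a num_classes smaller than the largest label it saw. A minimal reproduction, assuming an older torchmetrics release like the one in the traceback (where Accuracy took num_classes directly):

import torch
from torchmetrics import Accuracy

# the metric is told to expect 5 classes...
metric = Accuracy(num_classes=5)
preds = torch.tensor([0, 1, 2])
target = torch.tensor([0, 1, 9])  # ...but one target label (9) is >= num_classes
metric(preds, target)  # ValueError: The highest label in target should be smaller than num_classes.

So the usual cause is a num_classes value in the config that does not match the labels actually produced by the dataset.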

mnist_config.yaml is broken (private, to_one_channel)

Hi,

I tried to run the MNIST example, but it looks like there are some minor problems with mnist_config.

issue 7.1 (private/custom):

>>python train.py --config-name mnist_config 
Could not load private/custom.
Available options:
        default

I tried to fix it by overriding it on the command line, but ran into the next problem.

issue 7.2:

>>python train.py --config-name mnist_config private=default

RuntimeError: Given groups=1, weight of size [64, 3, 7, 7], expected input[128, 1, 28, 28] to have 3 channels, but got 1 channels instead

The solution is to set the parameter "model.encoder.params.to_one_channel" to True:

way 1: on the command line

python train.py --config-name mnist_config private=default model.encoder.params.to_one_channel=True

way 2: update 1-2 files (fix the "private" node in the root config, and create a standalone simple_model_mnist.yaml with the correct value of encoder.params.to_one_channel).

way 3 (?): override something in mnist_config.yaml

Which way is preferable?

I am unsure about way 2, because having several simple_model_xxx.yaml files is not very convenient.
P.S. Perhaps a better way is to take the number of input channels from the datamodule.
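A minimal sketch of that idea, with hypothetical names (neither MNISTDataModule.num_channels nor build_encoder exists in the repository as-is):

import torch.nn as nn

# sketch: the datamodule reports the channel count of its dataset
class MNISTDataModule:
    num_channels = 1  # MNIST images are single-channel

def build_encoder(num_channels: int) -> nn.Module:
    # the encoder stem adapts to whatever the datamodule reports,
    # instead of hard-coding to_one_channel in a per-dataset yaml
    return nn.Conv2d(num_channels, 64, kernel_size=7, stride=2, padding=3)

dm = MNISTDataModule()
stem = build_encoder(dm.num_channels)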

Thanks

Use of hydra.utils.instantiate

Hi, I really like your template! Thanks for putting it out there.

I was wondering what your thoughts are on using Hydra's upcoming instantiate feature vs. load_obj. Have you had a look at it or played around with it? It seems to work for me in some cases but can also cause weird effects. I would be curious to hear your thoughts.
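For reference, the two approaches look roughly like this (a sketch: the load_obj shown here is a simplified stand-in for the repository's helper, not its actual implementation):

import torch.nn as nn
from hydra.utils import instantiate
from omegaconf import OmegaConf

model = nn.Linear(2, 2)

# Hydra's way: _target_ names the class, the remaining keys become kwargs,
# and extra kwargs can be supplied at call time
cfg = OmegaConf.create({"_target_": "torch.optim.AdamW", "lr": 1e-3})
optimizer = instantiate(cfg, params=model.parameters())

# load_obj-style: import the class by its dotted path, then call it yourself
def load_obj(obj_path: str):
    module_path, obj_name = obj_path.rsplit(".", 1)
    module = __import__(module_path, fromlist=[obj_name])
    return getattr(module, obj_name)

optimizer_cls = load_obj("torch.optim.AdamW")
optimizer = optimizer_cls(model.parameters(), lr=1e-3)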

Add documentation

Add documentation with sphinx

Basic docs are done; now it would be great if they were automatically generated for all the code.
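One way to do that, assuming a standard Sphinx layout with sources under docs/source and the package under src, is sphinx-apidoc:

# regenerate .rst stubs for every module under src, then rebuild the HTML docs
sphinx-apidoc -f -o docs/source src
sphinx-build -b html docs/source docs/build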

Add new schedulers

It is always good to have more options to choose from, so it would be a good idea to add more schedulers. The steps are the following:

  • in conf/scheduler add a config for a new scheduler
  • if this scheduler requires some other library, update requirements
  • run tests to check that everything works

Example: https://github.com/Erlemar/pytorch_tempest/blob/master/conf/scheduler/cyclic.yaml

# @package _group_
class_name: torch.optim.lr_scheduler.CyclicLR
step: step
params:
  base_lr: ${training.lr}
  max_lr: 0.1
  • # @package _group_ - a required header line for Hydra
  • class_name - the full import path of the object
  • params - parameters that are overridden; if the scheduler has more parameters than are defined in the config, their default values will be used

There are 3 possible cases when adding a scheduler:

  • a default PyTorch scheduler: simply add a config for it (see the sketch below);
  • a scheduler from another library: add the library to the requirements and define a config with class_name based on the library, for example cyclicLR.CyclicCosAnnealingLR;
  • a scheduler from a custom class: add the class to src/scheduler and add a config with the full path to the class, starting with src.
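For the first case, a config for the built-in torch.optim.lr_scheduler.StepLR could look like this (a sketch: the step value and the params shown are illustrative, not taken from the repository):

# @package _group_
class_name: torch.optim.lr_scheduler.StepLR
step: epoch
params:
  step_size: 10
  gamma: 0.1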

protobuf crash

Hi, thanks for putting this out there. I would love to see your refinements.
Currently I'm trying to run the MNIST example from the README, and I'm getting the following crash when it tries to load the data. I tried changing the protobuf version a few times (3.8, 3.14, 3.11), but it didn't seem to matter:

[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/descriptor_database.cc:393] Invalid file descriptor data passed to EncodedDescriptorDatabase::Add().
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/descriptor.cc:1367] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
libc++abi.dylib: terminating with uncaught exception of type google::protobuf::FatalException: CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):

Add new optimizers

It is always good to have more options to choose from, so it would be a good idea to add more optimizers. The steps are the following:

  • in conf/optimizer add a config for a new optimizer
  • if this optimizer requires some other library, update requirements
  • run tests with the pytest command to check that everything works

Example: https://github.com/Erlemar/pytorch_tempest/blob/master/conf/optimizer/adamw.yaml

# @package _group_
class_name: torch.optim.AdamW
params:
  lr: ${training.lr}
  weight_decay: 0.001
  • # @package _group_ - a required header line for Hydra
  • class_name - the full import path of the object
  • params - parameters that are overridden; if the optimizer has more parameters than are defined in the config, their default values will be used

There are 3 possible cases when adding an optimizer:

  • a default PyTorch optimizer: simply add a config for it;
  • an optimizer from another library: add the library to the requirements and define a config with class_name based on the library, for example adamp.AdamP (see the sketch below);
  • an optimizer from a custom class: add the class to src/optimizers and add a config with the full path to the class.
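For the second case, a config for AdamP from the adamp package could look like this (a sketch: the params shown are a common subset, so check the library's actual signature before relying on them):

# @package _group_
class_name: adamp.AdamP
params:
  lr: ${training.lr}
  weight_decay: 0.001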
