inferno's Introduction

Inferno

Inferno is a little library providing utilities and convenience functions/classes around PyTorch. It's a work-in-progress, but the releases from v0.4 on should be fairly stable!

Features

Current features include:
import torch.nn as nn
from inferno.io.box.cifar import get_cifar10_loaders
from inferno.trainers.basic import Trainer
from inferno.trainers.callbacks.logging.tensorboard import TensorboardLogger
from inferno.extensions.layers.convolutional import ConvELU2D
from inferno.extensions.layers.reshape import Flatten

# Fill these in:
LOG_DIRECTORY = '...'
SAVE_DIRECTORY = '...'
DATASET_DIRECTORY = '...'
DOWNLOAD_CIFAR = True
USE_CUDA = True

# Build torch model
model = nn.Sequential(
    ConvELU2D(in_channels=3, out_channels=256, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2),
    ConvELU2D(in_channels=256, out_channels=256, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2),
    ConvELU2D(in_channels=256, out_channels=256, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2),
    Flatten(),
    nn.Linear(in_features=(256 * 4 * 4), out_features=10),
    nn.LogSoftmax(dim=1)
)

# Load loaders
train_loader, validate_loader = get_cifar10_loaders(DATASET_DIRECTORY,
                                                    download=DOWNLOAD_CIFAR)

# Build trainer
trainer = Trainer(model) \
  .build_criterion('NLLLoss') \
  .build_metric('CategoricalError') \
  .build_optimizer('Adam') \
  .validate_every((2, 'epochs')) \
  .save_every((5, 'epochs')) \
  .save_to_directory(SAVE_DIRECTORY) \
  .set_max_num_epochs(10) \
  .build_logger(TensorboardLogger(log_scalars_every=(1, 'iteration'),
                                  log_images_every='never'),
                log_directory=LOG_DIRECTORY)

# Bind loaders
trainer \
    .bind_loader('train', train_loader) \
    .bind_loader('validate', validate_loader)

if USE_CUDA:
  trainer.cuda()

# Go!
trainer.fit()

To visualize the training progress, navigate to LOG_DIRECTORY and fire up tensorboard with

$ tensorboard --logdir=${PWD} --port=6007

and navigate to localhost:6007 with your browser.

Installation

Conda packages for Python >= 3.6 for all distributions are available on conda-forge:

$ conda install -c pytorch -c conda-forge inferno

Future Features:

Planned features include:
  • a class to encapsulate Hogwild! training over multiple GPUs,
  • minimal shape inference with a dry-run,
  • proper packaging and documentation,
  • cutting-edge fresh-off-the-press implementations of what the future has in store. :)

Credits

All contributors are listed here: https://inferno-pytorch.github.io/inferno/html/authors.html

This package was partially generated with Cookiecutter and the audreyr/cookiecutter-pypackage project template + lots of work by Thorsten.

inferno's People

Contributors

abailoni, bstriner, constantinpape, dependabot[bot], derthorsten, fynnbe, imagirom, manuelhaussmann, nasimrahaman, ottogin, steffen-wolf, svenpeter42, vzinche

inferno's Issues

Batches with Tags

I think it would be useful to have a nicer interface for input and output batches with multiple elements.

Currently it can be cumbersome to keep track of what is where in the batch, especially when using multiple transforms that add or remove elements from the batch, or when using multiple loss functions that act on different ground truth/predictions.

This would be a lot easier if elements in the batch could have tags (such as 'raw', 'segmentation', 'affinities'). Transforms and loss functions could use these tags to select what they act on, and also label their outputs.

I will probably implement this at least for myself, but doing so in a nice way while keeping the current functionality will be harder. So I am interested in whether this feature would be useful to others, and if someone has ideas on how to implement it.
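
One way to picture the proposal is a batch stored as a dict keyed by tag, with each transform declaring which tag it reads and which it writes. The sketch below is purely illustrative: `TaggedTransform`, `apply_to` and `output_tag` are hypothetical names, not part of inferno, and plain lists stand in for tensors.

```python
class TaggedTransform:
    """Hypothetical wrapper: apply `fn` to one tagged element of a dict batch."""

    def __init__(self, fn, apply_to, output_tag=None):
        self.fn = fn
        self.apply_to = apply_to                  # tag to read from
        self.output_tag = output_tag or apply_to  # tag to write to

    def __call__(self, batch):
        out = dict(batch)  # shallow copy so the input batch is left untouched
        out[self.output_tag] = self.fn(batch[self.apply_to])
        return out


# Plain lists stand in for tensors here.
batch = {'raw': [0.0, 2.0, 4.0], 'segmentation': [0, 1, 1]}

halve = TaggedTransform(lambda xs: [x / 2 for x in xs], apply_to='raw')
binarize = TaggedTransform(lambda xs: [int(x > 0) for x in xs],
                           apply_to='segmentation', output_tag='foreground')

result = binarize(halve(batch))
```

Because every transform names its tags, adding or removing batch elements does not shift the position of the others.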

IOU is broken

With the current pytorch (1.0), the IOU metric fails with

File "/home/pape/Work/software/conda/miniconda3/envs/torch10/lib/python3.7/site-packages/inferno/extensions/metrics/categorical.py", line 104, in forward
numerator = (flattened_prediction * onehot_targets).sum(-1)
RuntimeError: expected type torch.cuda.FloatTensor but got torch.cuda.LongTensor

I could fix this by casting onehot_targets to float before this line:
https://github.com/inferno-pytorch/inferno/blob/master/inferno/extensions/metrics/categorical.py#L104
But we should probably double check that this is the right thing to do.

Gradient clip?

Is there a way to register a callback for gradient clipping?

ImportError from torchvision

  • inferno version: 0.1.8 (installed via conda install -c pytorch -c conda-forge inferno)
  • Python version: 3.6.8
  • Operating System: Ubuntu 18.04

Description

When importing inferno, I got the error:

  File "<stdin>", line 1, in <module>
  File "/home/sdamrich/anaconda3/envs/condaenv/inferno/__init__.py", line 6, in <module>
    from . import io
  File "/home/sdamrich/anaconda3/envs/condaenv/inferno/io/__init__.py", line 1, in <module>
    from . import box
  File "/home/sdamrich/anaconda3/envs/condaenv/inferno/io/box/__init__.py", line 3, in <module>
    from .camvid import CamVid, get_camvid_loaders
  File "/home/sdamrich/anaconda3/envs/condaenv/inferno/io/box/camvid.py", line 9, in <module>
    from torchvision.datasets.folder import is_image_file, default_loader
ImportError: cannot import name 'is_image_file'

I use torchvision 0.2.1 and pytorch 1.0.1

What I Did

Commenting out the first two lines in

/home/sdamrich/anaconda3/envs/condaenv/inferno/io/box/__init__.py 

solved the issue for me.

Implement infinite training

As of now, Trainer does not work if max_num_epochs or max_num_iterations is not specified. Not providing either should result in the trainer training till interrupted (via Ctrl+C or SIGINT).
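
The requested behaviour could look roughly like the loop below. This is a minimal sketch, not inferno's Trainer: `limit_reached`, `fit` and their parameters are hypothetical stand-ins, and the only point is that a missing limit is treated as "no limit" and that Ctrl+C ends the run cleanly.

```python
def limit_reached(epoch_count, iteration_count,
                  max_num_epochs=None, max_num_iterations=None):
    """Return True once a configured limit is hit; never if none is set."""
    if max_num_epochs is not None and epoch_count >= max_num_epochs:
        return True
    if max_num_iterations is not None and iteration_count >= max_num_iterations:
        return True
    return False


def fit(train_one_iteration, iterations_per_epoch,
        max_num_epochs=None, max_num_iterations=None):
    """Hypothetical training loop; returns the number of iterations run."""
    iteration_count = 0
    try:
        while not limit_reached(iteration_count // iterations_per_epoch,
                                iteration_count,
                                max_num_epochs, max_num_iterations):
            train_one_iteration()
            iteration_count += 1
    except KeyboardInterrupt:
        # Ctrl+C (SIGINT) ends an open-ended run cleanly.
        pass
    return iteration_count
```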

readthedocs vs self-build + githubpage hosted

Maintaining a documentation build on readthedocs is a pain in the ass:

  • Hard / impossible to debug
  • No GPU: all examples that produce plots rely in some way on a CUDA GPU. We will never get one on readthedocs, so the example gallery will look poor there.

Building the docs ourselves and hosting them via https://pages.github.com/ is not very hard; it just means we need to do this on a regular basis. But we get a super nice automatic example gallery, and it is not fragile at all.

@nasimrahaman what do you think?

Question about save_now logic

I think there is a logic bug in save_now. Please correct me if I am wrong.

https://github.com/inferno-pytorch/inferno/blob/master/inferno/trainers/basic.py#L484

The second condition is currently:

elif self._is_iteration_with_best_validation_score:
            return self._save_at_best_validation_score

Shouldn't that be:

elif self._save_at_best_validation_score:
            return self._is_iteration_with_best_validation_score

If you only save at the best score, then save_now should be true only when the current iteration has the best score.

With the current code, however, if you are at the best score and save-at-best is off, it will not save. It should be an easy fix: just swap those two variables.
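
To see where the two branch orders actually disagree, they can be compared as plain functions. This is a simplified model, not the real save_now: the first condition is left out, and `periodic_save_due` is an assumed stand-in for the remaining branches, which are not quoted in this issue.

```python
from itertools import product


def save_now_current(is_best, save_at_best, periodic_save_due):
    """Branch order as quoted from basic.py above."""
    if is_best:
        return save_at_best
    return periodic_save_due


def save_now_proposed(is_best, save_at_best, periodic_save_due):
    """Branch order with the two flags swapped, as suggested."""
    if save_at_best:
        return is_best
    return periodic_save_due


# Every (is_best, save_at_best, periodic_save_due) combination on which
# the two variants return different answers.
differences = [flags for flags in product([False, True], repeat=3)
               if save_now_current(*flags) != save_now_proposed(*flags)]
```

Under these assumptions the variants differ only when a periodic save is due. The current order skips a periodic save that lands on a best-score iteration while save-at-best is off; the swapped order skips periodic saves on non-best iterations while save-at-best is on.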

Tensorboard logging fails with tensorboardX 1.4

The TensorboardLogger fails when logging images and using tensorboardX 1.4 with the
stack trace below.
Note that this error does not occur in tensorboardX 1.2.

  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/inferno/trainers/callbacks/logging/tensorboard.py", line 354, in log_image_or_volume_batch
    self.log_images(tag, image_list, step)
  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/inferno/trainers/callbacks/logging/tensorboard.py", line 395, in log_images
    self.writer.add_image(tag, img_tensor=image, global_step=step)
  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/tensorboardX/writer.py", line 412, in add_image
    self.file_writer.add_summary(image(tag, img_tensor), global_step, walltime)
  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/tensorboardX/summary.py", line 205, in image
    image = make_image(tensor, rescale=rescale)
  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/tensorboardX/summary.py", line 243, in make_image
    image = Image.fromarray(tensor)
  File "/home/pape/Work/software/conda/miniconda3/envs/torch41/lib/python3.6/site-packages/PIL/Image.py", line 2463, in fromarray
    raise TypeError("Cannot handle this data type")

Potential Collaboration

Description

You guys seem to be doing something very similar to the torchsample project and it seems to me that there's a ton of room for collaboration / code merging. Please consider at least pulling in some of the functionality from that library. Also, I really like the way the callbacks are structured in torchsample. I didn't see anything similar in inferno but I think it would be a good idea.

Thanks

Recursion error since there is no tensor/variable distinction

  • inferno version: 0.1.8 (installed via conda install -c pytorch -c conda-forge inferno)
  • Python version: 3.6.8
  • Operating System: Ubuntu 18.04

Description

Initialising a model raised a recursion error.
Specifically, in inferno/extensions/initializers/presets.py line 23

if isinstance(tensor, Variable):
    self.call_on_tensor(tensor.data)

the if clause is always true and one gets stuck in infinite recursion.

What I Did

I deleted the lines

        if isinstance(tensor, Variable):
            self.call_on_tensor(tensor.data)
            return tensor

Extended Support for PyTorch v0.4+

This is an issue to track the next inferno release built around PyTorch 0.4. Below is a list of what is to come, feel free to populate it and/or suggest changes.

Core

  • Adapt to the PyTorch v0.4 paradigm and deprecate Variables,
  • Integrate zero-dimensional tensors (and get rid of all variable.data[0] in the codebase),
  • Integrate the new device-agnostic constructs (tensors.to(...) or model.to(...)),
  • Wrapper to manage gradient checkpointing,
  • Integrate support for reduce=False in all inferno-managed loss functions,

Visualization

General

  • Make the trainer class more modular without compromising on functionality. Break-up the oversized Trainer class to smaller classes to facilitate future support for multi-model trainers.

Updates

15 Aug 2018

To fully implement all 0.4+ features without bloating the codebase, we'd need to deprecate v0.3 and below, potentially invalidating a lot of code. I guess this can wait till v1.0.

Normalize channels separately?

  • inferno version: n/a
  • Python version: n/a
  • Operating System: n/a

Description

inferno/io/transform/generic.py
class Normalize(Transform)

def tensor_function(self, tensor):
    mean = np.asarray(tensor.mean()) if self.mean is None else self.mean
    std = np.asarray(tensor.std()) if self.std is None else self.std
    # Figure out how to reshape mean and std
    reshape_as = [-1] + [1] * (tensor.ndim - 1)
    # Normalize
    tensor = (tensor - mean.reshape(*reshape_as))/(std.reshape(*reshape_as) + self.eps)
    return tensor

Issue

I am not sure I understand the intention here, but I guess the reshaping of mean and std is meant to apply a separate mean and std to each channel, right?
In that case it looks like it would not work when mean and std are not supplied as arguments (tensor.mean() returns the mean of the flattened array by default).
Was it meant like this?
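
For comparison, a per-channel version can be sketched with NumPy by reducing over every axis except the channel axis. This is not inferno's implementation: `normalize_per_channel` is a hypothetical name, the channel axis is assumed to be axis 0 (as the reshape above suggests), and the `eps` value is an arbitrary choice.

```python
import numpy as np


def normalize_per_channel(tensor, eps=1e-4):
    """Normalize each channel (axis 0) of `tensor` to zero mean, unit std."""
    # Reduce over every axis except the leading channel axis.
    reduce_axes = tuple(range(1, tensor.ndim))
    mean = tensor.mean(axis=reduce_axes, keepdims=True)
    std = tensor.std(axis=reduce_axes, keepdims=True)
    return (tensor - mean) / (std + eps)


# Two channels with very different offsets and scales.
image = np.stack([np.arange(16.0).reshape(4, 4),
                  1000.0 + 10.0 * np.arange(16.0).reshape(4, 4)])
normalized = normalize_per_channel(image)
```

With `keepdims=True` the statistics already have shape (channels, 1, 1), so no explicit reshape is needed before broadcasting.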

Neuroglancer integration in tensorboard logger.

It would be great to have the neuroglancer viewer available for 3D volumetric data during inference.
This would make data inspection much easier especially for data with multiple channels.

Better support for Tensorboard

This project looks like a good replacement for the manual tensorboard business we currently have going. It makes it much easier to integrate histograms, distributions, and even audio.

TensorboardLogger logs stale states for validation_*

The TensorboardLogger needs an end_of_validation_iteration method. Also, the _trainer_states_being_observed attribute needs to be split into two or more subsets (e.g. one for training and one for validation).

Clean up tests and set up Travis CI

  • Some tests require GPU, and they need to be unittest.skip-ped.
  • Need more travis-friendly tests for Trainer, perhaps with a dummy model on a dummy dataset with a dummy criterion and a dummy metric.
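
A common pattern for the first point is to guard GPU tests with `unittest.skipUnless`. The sketch below is an illustration with placeholder test bodies; `TestTrainerOnGPU` and `CUDA_AVAILABLE` are hypothetical names, and the import guard only exists so the file also loads where PyTorch is not installed.

```python
import unittest

try:
    import torch
    CUDA_AVAILABLE = torch.cuda.is_available()
except ImportError:
    # Without PyTorch there is certainly no CUDA device to test on.
    CUDA_AVAILABLE = False


class TestTrainerOnGPU(unittest.TestCase):
    @unittest.skipUnless(CUDA_AVAILABLE, "requires a CUDA-capable GPU")
    def test_runs_on_gpu(self):
        # Placeholder body: a real test would build a model and call .cuda().
        self.assertTrue(CUDA_AVAILABLE)

    def test_runs_on_cpu(self):
        # CPU-only tests run everywhere, including on CI.
        self.assertEqual(1 + 1, 2)
```

On a machine without a GPU the first test is reported as skipped rather than failed, so the CI run stays green.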

tensorboard logger asking for a detach()

  • inferno version:
    output from conda list | grep inferno
inferno                   v0.4.0                     py_0    conda-forge
inferno-pytorch           0.4.0                    pypi_0    pypi
  • Python version:
    3.7.4
  • Operating System:
    centOS

Description

I get this error (I show only the last line of the backtrace)

  File "/home/my_username/anaconda3/envs/my_project/lib/python3.7/site-packages/inferno/trainers/callbacks/logging/tensorboard.py", line 292, in extract_images_from_batch
    batch = batch.float().numpy()
RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

Produced by running the following code (I post only the part that should be relevant)

    # ...
    trainer = Trainer(vae)
    trainer.save_to_directory(folder)
    trainer.cuda()
    trainer.build_criterion(vae.loss_function())
    trainer.build_optimizer('Adam', lr=0.001)
    trainer.save_every((1, 'epochs'))
    trainer.set_max_num_epochs(100)
    trainer.build_logger(TensorboardLogger(log_scalars_every=(1, 'iteration'),
                                           log_images_every=(1, 'iteration'),
                                           log_directory=folder))
    # ...

TensorboardLogger defaults don't make sense

There are some issues with the TensorboardLogger default arguments.
All log_X_every get the default argument None, which will be mapped to once every iteration.
This is problematic:

  • log_histograms_every set to once every iteration will lead to calling log_histogram and raise a NotImplementedError
  • log_images_every set to once every iteration can result in huge log-files, because it stores a lot of images.

Probably the best solution is to change how None is handled for log_images_every and log_histograms_every.
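
One possible shape for this is a per-quantity default table, so that None means "every iteration" only for scalars. The names `DEFAULT_FREQUENCIES` and `resolve_frequency` are hypothetical and the chosen defaults are just a suggestion; only the `(1, 'iteration')` and `'never'` formats are taken from the existing API.

```python
# Hypothetical per-quantity defaults used when the caller passes None.
DEFAULT_FREQUENCIES = {
    'scalars': (1, 'iteration'),
    'images': 'never',
    'histograms': 'never',
}


def resolve_frequency(kind, value=None):
    """Return `value`, or the default for `kind` if `value` is None."""
    if value is None:
        return DEFAULT_FREQUENCIES[kind]
    return value
```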

Example does not make sense

I just checked the example we have in the README and I think it does not make sense: we are adding a Softmax and then using CrossEntropyLoss (which already combines LogSoftmax and NLLLoss).
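
The effect is easy to check by hand. The pure-Python sketch below reimplements the two losses for a single sample (it does not call PyTorch) and compares cross entropy on raw logits with cross entropy on outputs that were already passed through a softmax; the logits are made-up numbers.

```python
import math


def log_softmax(logits):
    # Subtract the max for numerical stability.
    shifted = [x - max(logits) for x in logits]
    log_norm = math.log(sum(math.exp(x) for x in shifted))
    return [x - log_norm for x in shifted]


def nll_loss(log_probs, target):
    return -log_probs[target]


def cross_entropy(logits, target):
    # Cross entropy applies log-softmax internally, then NLL.
    return nll_loss(log_softmax(logits), target)


logits, target = [2.0, 0.5, -1.0], 0

# Intended loss: cross entropy on the raw logits.
correct = cross_entropy(logits, target)

# What happens when the model already ends in a softmax layer.
softmax_probs = [math.exp(lp) for lp in log_softmax(logits)]
double_softmax = cross_entropy(softmax_probs, target)
```

Here `double_softmax` comes out larger than `correct` because the second softmax flattens the distribution. The README example shown at the top of this page pairs nn.LogSoftmax with 'NLLLoss', which is the consistent combination.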

make GarbageCollection a default

We should make the behavior of the GarbageCollection callback the default and make this callback obsolete.
We should provide an API like the following and have reasonable defaults:

trainer.garbage_collect(collect_every=(1, 'iteration'))

Power users can disable gc via

trainer.garbage_collect(collect_every='never')
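
Behind such an API, the bookkeeping could be as small as the helper below. This is a sketch of the proposal, not existing inferno code: `GarbageCollector` and `end_of_iteration` are hypothetical names, and only the `(n, 'iteration')` and `'never'` forms are handled.

```python
import gc


class GarbageCollector:
    """Hypothetical helper mirroring the proposed `collect_every` argument."""

    def __init__(self, collect_every=(1, 'iteration')):
        # 'never' disables collection; otherwise expect (n, 'iteration').
        self.every = None if collect_every == 'never' else collect_every[0]
        self.num_collections = 0

    def end_of_iteration(self, iteration_count):
        """Run the garbage collector if this iteration is due."""
        if self.every is not None and iteration_count % self.every == 0:
            gc.collect()
            self.num_collections += 1
```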

Add more examples:

We should add more examples for the following things:

  • inferno's transformation pipeline, show / highlight the difference between torchvision's and inferno's transformations
  • usage of trainer with a non-trivial dataset (something where num_input > 1 and num_output > 1)
  • load a model which was trained/saved via inferno and use this trained model to predict
  • ...

Can't run the basic example

  • inferno version: v0.3.1
  • Python version: 3.6.7
  • Operating System: macOS Mojave 10.14.4

Description

I have tried to run the script in the Readme, after having set the three directories that must be set and disabling CUDA. When running the script with python3 hello_world.py I got two errors, I made the first disappear (see below), but the second is still present. The expected behavior is to get no error.

What I Did

The full code is reported below, in a file called hello_world.py.

import torch.nn as nn
from inferno.io.box.cifar import get_cifar10_loaders
from inferno.trainers.basic import Trainer
from inferno.trainers.callbacks.logging.tensorboard import TensorboardLogger
from inferno.extensions.layers.convolutional import ConvELU2D
from inferno.extensions.layers.reshape import Flatten

# Fill these in:
LOG_DIRECTORY = 'log'
SAVE_DIRECTORY = 'save'
DATASET_DIRECTORY = 'data'
DOWNLOAD_CIFAR = True
USE_CUDA = False

# Build torch model
model = nn.Sequential(
    ConvELU2D(in_channels=3, out_channels=256, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2),
    ConvELU2D(in_channels=256, out_channels=256, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2),
    ConvELU2D(in_channels=256, out_channels=256, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2),
    Flatten(),
    nn.Linear(in_features=(256 * 4 * 4), out_features=10),
    nn.LogSoftmax(dim=1)
)

# Load loaders
train_loader, validate_loader = get_cifar10_loaders(DATASET_DIRECTORY,
                                                    download=DOWNLOAD_CIFAR)

# Build trainer
trainer = Trainer(model) \
  .build_criterion('NLLLoss') \
  .build_metric('CategoricalError') \
  .build_optimizer('Adam') \
  .validate_every((2, 'epochs')) \
  .save_every((5, 'epochs')) \
  .save_to_directory(SAVE_DIRECTORY) \
  .set_max_num_epochs(10) \
  .build_logger(TensorboardLogger(log_scalars_every=(1, 'iteration'),
                                  log_images_every='never'),
                log_directory=LOG_DIRECTORY)

# Bind loaders
trainer \
    .bind_loader('train', train_loader) \
    .bind_loader('validate', validate_loader)

if USE_CUDA:
  trainer.cuda()

# Go!
trainer.fit()

I first created the three folders specified in the script with mkdir log, mkdir save, mkdir data. I then ran the script with python3 hello_world.py. I first got the error:

  File "hello_world.py", line 2, in <module>
    from inferno.io.box.cifar import get_cifar10_loaders
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/__init__.py", line 6, in <module>
    from . import io
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/io/__init__.py", line 4, in <module>
    from . import volumetric
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/io/volumetric/__init__.py", line 1, in <module>
    from .volume import VolumeLoader, HDF5VolumeLoader, TIFVolumeLoader
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/io/volumetric/volume.py", line 8, in <module>
    from ...utils import io_utils as iou
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/utils/io_utils.py", line 5, in <module>
    from scipy.misc import imsave
ImportError: cannot import name 'imsave'

which I could solve by running conda install -c anaconda scipy. I was not expecting this error because, since I installed inferno with conda, I expected all the dependencies to be already installed.

The second error that I now get is the following:

Traceback (most recent call last):
  File "hello_world.py", line 2, in <module>
    from inferno.io.box.cifar import get_cifar10_loaders
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/__init__.py", line 7, in <module>
    from . import trainers
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/trainers/__init__.py", line 1, in <module>
    from . import basic
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/trainers/basic.py", line 20, in <module>
    from .callbacks.logging.base import Logger
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/trainers/callbacks/logging/__init__.py", line 4, in <module>
    from .tensorboard import TensorboardLogger
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/inferno/trainers/callbacks/logging/tensorboard.py", line 1, in <module>
    import tensorboardX as tX
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/__init__.py", line 5, in <module>
    from .torchvis import TorchVis
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/torchvis.py", line 11, in <module>
    from .writer import SummaryWriter
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/writer.py", line 15, in <module>
    from .event_file_writer import EventFileWriter
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
    from .proto import event_pb2
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
    from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
    from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
    from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
  File "/miniconda3/envs/my_test/lib/python3.6/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 22, in <module>
    serialized_pb=_b('\n(tensorboardX/proto/resource_handle.proto\x12\x0ctensorboardX\"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')

How to fix it?

SaveAtBestValidationScore does not save the first computed value.

If one uses SaveAtBestValidationScore, the first computed validation score is never considered the best, which is wrong, and is therefore not saved.

[INFO    ] Breaking to validate.                                                                                                         
[INFO    ] Validating.                                                                                                                   
[INFO    ] validate generator exhausted, breaking.                                                                                       
[INFO    ] Done validating. Logging results...                                                                                           
[INFO    ] Validation loss: 3.2442782860133543; validation error: None                                                                   
[INFO    ] Current smoothed validation score 3.2442782860133543 is not better than the best smoothed validation score 3.2442782860133543.
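
The log suggests the first score is being compared against itself with a strict "better than" test. One way to avoid that is to start with no best score at all, as in the sketch below; `BestScoreTracker` is a hypothetical class, not the inferno callback, and it assumes lower scores are better.

```python
class BestScoreTracker:
    """Hypothetical tracker; lower scores are better (e.g. validation loss)."""

    def __init__(self):
        self.best = None  # no score seen yet

    def update(self, score):
        """Record `score` and return True if it is the best so far."""
        is_best = self.best is None or score < self.best
        if is_best:
            self.best = score
        return is_best


tracker = BestScoreTracker()
flags = [tracker.update(s) for s in [3.244, 3.5, 2.9, 2.9]]
```

The first score is flagged as best because nothing precedes it, while a later tie with the current best is not.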

cannot import name 'is_image_file'

  • inferno version: v0.1.7
  • Python version: 3.6
  • Operating System: Ubuntu 16

Description

I tried to use the inferno trainers and got the following output:
cannot import name 'is_image_file'

What I Did

$ python alderley_patchcwganp.py 
Traceback (most recent call last):
  File "alderley_patchcwganp.py", line 9, in <module>
    from inferno.trainers.basic import Trainer
  File "/home/jiangsht/anaconda2/envs/chi/inferno/__init__.py", line 6, in <module>
    from . import io
  File "/home/jiangsht/anaconda2/envs/chi/inferno/io/__init__.py", line 1, in <module>
    from . import box
  File "/home/jiangsht/anaconda2/envs/chi/inferno/io/box/__init__.py", line 3, in <module>
    from .camvid import CamVid, get_camvid_loaders
  File "/home/jiangsht/anaconda2/envs/chi/inferno/io/box/camvid.py", line 9, in <module>
    from torchvision.datasets.folder import is_image_file, default_loader
ImportError: cannot import name 'is_image_file'


Unit tests too aggressive

@nasimrahaman I think the unit tests are too aggressive.

They tend to fail too often on Travis.
I guess one should relax these tests a bit.

self.assertLess(trainer.get_state('validation_error_averaged'), (1 - 1/self.NUM_CLASSES))
E       AssertionError: 0.9244186046511628 not less than 0.9

Clean up `extensions/layers`

We currently have a bunch of files in extensions/layers that implement somewhat redundant functionality:

  • building_blocks: Implements residual block in ResBlockBase and ResBlock
  • prefab: Implements residual block in ResidualBlock
  • res_unet: Implements residual u-net.
  • unet_base: Implements u-net base class.

I would vote to merge building_blocks and prefab and, if possible, also merge the residual block implementations in there. I like @DerThorsten's suggestion to name the new file conv_blocks, because this makes clear what's in there.

Regarding the unet:
Maybe put everything into a single unet file?

Momentum is not suitable for smoothing validation score

In the current implementation of validation smoothing, we use momentum.
This puts a very high importance on the first validation score.
E.g. for 3 validation scores [.75, .2, .1] the smoothed value would be something like 0.7.

I think using a sliding window with some decay would be more appropriate.
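
The difference can be made concrete with the three scores from above. Both functions are illustrative sketches rather than inferno code, and the momentum, window size and decay values are assumptions picked for the example.

```python
def momentum_smoothing(scores, momentum=0.9):
    """Exponential smoothing seeded with the first score."""
    smoothed = scores[0]
    for score in scores[1:]:
        smoothed = momentum * smoothed + (1 - momentum) * score
    return smoothed


def decayed_window(scores, window=3, decay=0.5):
    """Weighted mean of the last `window` scores; the newest has weight 1."""
    recent = scores[-window:]
    weights = [decay ** age for age in range(len(recent) - 1, -1, -1)]
    return sum(w * s for w, s in zip(weights, recent)) / sum(weights)


scores = [0.75, 0.2, 0.1]
print(round(momentum_smoothing(scores), 4))  # 0.6355, still close to 0.75
print(round(decayed_window(scores), 4))      # 0.2214, follows the recent scores
```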

Require a save directory?

Several parts of the Trainer class require a location to save to but don't complain until it is too late.
Examples are (of course) when a save point is specified via save_every, but the trainer also defaults to saving after a validation run even without a necessary directory.
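
A fail-fast check before training starts would surface the problem immediately. The sketch below uses a hypothetical `SaveSettings` class as a stand-in for the save-related state of a trainer; it is not inferno's Trainer, and it only covers the save_every case.

```python
class SaveSettings:
    """Hypothetical stand-in for the save-related part of a trainer."""

    def __init__(self, save_every=None, save_directory=None):
        self.save_every = save_every
        self.save_directory = save_directory

    def validate(self):
        """Raise early if saving is configured without a target directory."""
        if self.save_every is not None and self.save_directory is None:
            raise ValueError(
                "save_every is set but no save directory was given; "
                "call save_to_directory(...) before fit().")
```

Calling `validate()` at the start of fitting would turn a late failure after hours of training into an immediate error.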

Error when trying to continue training a saved model

  • inferno version:
  • Python version:
  • Operating System:

Description

I built a model and saved it using

trainer.save_every((1, 'epochs'))
trainer.save_to_directory(folder)

When I rerun my Python script to load and continue training the previous model I get an error.

What I Did

This is my code.

def train(load=False, folder='out'):
    print('starting training')
    os.makedirs(folder, exist_ok=True)

    # setup logger
    Logger.instance().setup('log')

    vae = Vae()

    ds = MyDataset(root_folder=root_folder, training=True)
    train_loader = torch.utils.data.DataLoader(ds, batch_size=512, num_workers=16)

    # build trainer
    trainer = Trainer(vae)
    trainer.cuda()

    trainer.build_criterion(vae.loss_function())
    trainer.build_optimizer('Adam', lr=0.001)
    # trainer.validate_every((2, 'epochs'))
    trainer.save_every((1, 'epochs'))
    trainer.save_to_directory(folder)
    trainer.set_max_num_epochs(100)

    # bind loaders
    trainer.bind_loader('train', train_loader, num_inputs=1, num_targets=1)

    # bind callbacks
    trainer.register_callback(GarbageCollection())
    # trainer.register_callback(ShowMinimalConsoleInfo())

    if load:
        trainer.load()
    trainer.fit()

When calling train(load=True) I get the following error:

  File "main.py", line 104, in my_train
    trainer.fit()
  File "/data/l989o/anaconda3/envs/hemo/lib/python3.7/site-packages/inferno/trainers/basic.py", line 1336, in fit
    self.train_for(break_callback=lambda *args: self.stop_fitting(max_num_iterations,
  File "/data/l989o/anaconda3/envs/hemo/lib/python3.7/site-packages/inferno/trainers/basic.py", line 1410, in train_for
    batch = self.fetch_next_batch('train')
  File "/data/l989o/anaconda3/envs/hemo/lib/python3.7/site-packages/inferno/trainers/basic.py", line 1092, in fetch_next_batch
    self._loader_iters.update({from_loader: self._loaders[from_loader].__iter__()})
KeyError: 'train'

Any ideas how to fix it? Thanks.

Graph model fails to replicate on multiple devices.

As of this commit, the problem can be reproduced as follows:

import torch
from torch.autograd import Variable
import torch.nn as nn
from torch.nn.parallel.data_parallel import data_parallel
from inferno.extensions.containers.graph import Graph

input_shape = [8, 1, 3, 128, 128]
model = Graph()\
    .add_input_node('input')\
    .add_node('conv0', nn.Conv3d(1, 10, 3, padding=1), previous='input')\
    .add_node('conv1', nn.Conv3d(10, 1, 3, padding=1), previous='conv0')\
    .add_output_node('output', previous='conv1')

model.cuda()
input = Variable(torch.rand(*input_shape).cuda())
output = data_parallel(model, input, device_ids=[0, 1, 2, 3])

This raises:

RuntimeError: tensors are on different GPUs

Could this be due to this add_module?

Remove deprecated imsave

scipy.misc.imsave is deprecated and not part of recent scipy versions any more.
We still use it here

We should use skimage.io.imsave or imageio.imwrite instead.

cc @FynnBe
