pytorch-pfn-extras's Introduction

pytorch-pfn-extras

Supplementary components to accelerate research and development in PyTorch.

Installation

pip install pytorch-pfn-extras

# Use `[onnx]` to use onnx submodule like:
#  pip install "pytorch-pfn-extras[onnx]"

### Optional dependencies
# For PlotReport / VariableStatisticsPlot extensions
pip install matplotlib

# For IgniteExtensionsManager
pip install pytorch-ignite torchvision

# For CuPy interoperability (see: https://docs.cupy.dev/en/stable/install.html)
pip install cupy  # or cupy-cudaXXX

Requirements

  • Python 3.8+
  • PyTorch 1.10+

Optional dependencies:

  • CuPy 8.0+ for PyTorch/CuPy interoperability

Documentation

Refer to Read The Docs for the complete documentation.

Below are some quick-links to the most important features of the library.

Examples

Contribution Guide

You can contribute to this project by sending a pull request. After approval, the pull request will be merged by the reviewer.

Before making a contribution, please confirm that:

  • Code quality stays consistent across the script, module or package.
  • Code is covered by unit tests.
  • API is maintainable.

License

MIT License

pytorch-pfn-extras's Issues

Saving snapshots or logs fail in case of unexpected previous failure

If log.bak is left behind, restarting a job that was killed midway always fails, because rename(src, dst) fails on HDFS whenever dst already exists (HDFS is designed to fail in this case). We may need to ensure that the destination file does not exist before renaming, but unexpected process termination likely hides many other pitfalls.

This bug was found in our internal HDFS cluster, and we have a corresponding internal issue.
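
One possible mitigation is to clear a stale backup before renaming. The sketch below uses only the standard library: `rotate_backup` is a hypothetical helper, not part of pytorch-pfn-extras, and local `os` calls stand in for the HDFS client.

```python
import os
import tempfile


def rotate_backup(path, suffix=".bak"):
    """Move `path` to `path + suffix`, discarding any stale backup first.

    Hypothetical helper: a real implementation would call the filesystem
    abstraction in use (for example an HDFS client) instead of `os`.
    """
    bak = path + suffix
    if os.path.exists(bak):
        # Left over from a run that was killed before cleanup.
        os.remove(bak)
    if os.path.exists(path):
        os.rename(path, bak)
    return bak


# Simulate a restart where both `log` and a stale `log.bak` exist.
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "log")
    with open(log, "w") as f:
        f.write("new")
    with open(log + ".bak", "w") as f:
        f.write("stale")
    rotate_backup(log)
    with open(log + ".bak") as f:
        rotated = f.read()
```

The check-then-remove sequence is not atomic, so this narrows the failure window rather than closing it.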

API to handle filesystems that does not support append

Related to: #118

Some filesystems, such as HDFS and S3 (or equivalents), do not support appending to an existing file. It would benefit users if Writer supported these filesystems without sacrificing performance.

An initial idea: provide a BufferedWriter class that can be used as a mix-in with existing writers. The mix-in writes to a local scratch disk, then syncs the data to the underlying HDFS/S3 filesystem in a background thread.
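
A rough sketch of the BufferedWriter idea, assuming a standalone class rather than a mix-in. `BufferedWriterSketch` and its methods are hypothetical names, and `shutil.copyfile` stands in for an upload to HDFS/S3.

```python
import os
import shutil
import tempfile
import threading


class BufferedWriterSketch:
    """Write to local scratch, then copy to the destination in a thread."""

    def __init__(self, scratch_dir, dest_dir):
        self.scratch_dir = scratch_dir
        self.dest_dir = dest_dir
        self._lock = threading.Lock()
        self._threads = []

    def write(self, filename, data, append=False):
        local = os.path.join(self.scratch_dir, filename)
        with self._lock:
            # Appending is cheap locally even if the remote store cannot do it.
            with open(local, "ab" if append else "wb") as f:
                f.write(data)
        thread = threading.Thread(target=self._sync, args=(filename,))
        thread.start()
        self._threads.append(thread)

    def _sync(self, filename):
        # Re-upload the whole file, so the destination never sees an append.
        with self._lock:
            shutil.copyfile(
                os.path.join(self.scratch_dir, filename),
                os.path.join(self.dest_dir, filename),
            )

    def finalize(self):
        # Wait for all pending syncs before the process exits.
        for thread in self._threads:
            thread.join()
        self._threads.clear()


with tempfile.TemporaryDirectory() as scratch, tempfile.TemporaryDirectory() as dest:
    writer = BufferedWriterSketch(scratch, dest)
    writer.write("log", b"line 1\n")
    writer.write("log", b"line 2\n", append=True)
    writer.finalize()
    with open(os.path.join(dest, "log"), "rb") as f:
        synced = f.read()
```

Re-uploading the full file on every write is simple but costly for large logs; a real implementation would probably batch syncs.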

The `Linear` behaves different when the input is multi-D

Problem Statement
Chainer and PyTorch handle Linear differently when the input has more than two dimensions. Chainer keeps the first dimension and flattens the rest before the forward computation, while PyTorch applies the layer to the last dimension only.
https://github.com/chainer/chainer/blob/df53bff3f36920dfea6b07a5482297d27b31e5b7/chainer/links/connection/linear.py#L181
https://github.com/pytorch/pytorch/blob/d035d05080729c30636ff30fcc068de3c7e9badd/torch/nn/functional.py#L1676

Example

>>> chainer_linear = chainer.links.Linear(10, 10)
>>> torch_linear = torch.nn.Linear(10, 10)
>>> input = numpy.arange(100, dtype=numpy.float32).reshape(10, 2, 5)
>>> chainer_linear(input)
variable([[-1.27548418e+01, -6.84057713e-01,  4.33863544e+00,
            6.55931234e+00, -3.39445019e+00, -2.51949596e+00,
            1.03438854e+00,  7.69187212e-02,  1.93967378e+00,
            2.17406702e+00],
          [-3.20976524e+01,  2.56630421e+00,  2.20505409e+01,
            1.03132820e+01, -9.60312557e+00, -1.00995445e+01,
            1.22814524e+00, -2.46707058e+00,  1.31490164e+01,
           -4.19132054e-01],
          [-5.14404640e+01,  5.81666756e+00,  3.97624512e+01,
            1.40672522e+01, -1.58118019e+01, -1.76795921e+01,
            1.42190135e+00, -5.01105976e+00,  2.43583584e+01,
           -3.01233125e+00],
          [-7.07832718e+01,  9.06703091e+00,  5.74743538e+01,
            1.78212242e+01, -2.20204792e+01, -2.52596416e+01,
            1.61565781e+00, -7.55504990e+00,  3.55676994e+01,
           -5.60552549e+00],
          [-9.01260834e+01,  1.23173914e+01,  7.51862717e+01,
            2.15751915e+01, -2.82291527e+01, -3.28396912e+01,
            1.80941379e+00, -1.00990391e+01,  4.67770386e+01,
           -8.19872189e+00],
          [-1.09468895e+02,  1.55677538e+01,  9.28981705e+01,
            2.53291626e+01, -3.44378281e+01, -4.04197388e+01,
            2.00317073e+00, -1.26430283e+01,  5.79863853e+01,
           -1.07919216e+01],
          [-1.28811707e+02,  1.88181152e+01,  1.10610077e+02,
            2.90831299e+01, -4.06464958e+01, -4.79997826e+01,
            2.19692683e+00, -1.51870193e+01,  6.91957169e+01,
           -1.33851223e+01],
          [-1.48154510e+02,  2.20684814e+01,  1.28321976e+02,
            3.28371048e+01, -4.68551788e+01, -5.55798340e+01,
            2.39068103e+00, -1.77310085e+01,  8.04050751e+01,
           -1.59783182e+01],
          [-1.67497330e+02,  2.53188400e+01,  1.46033875e+02,
            3.65910721e+01, -5.30638542e+01, -6.31598854e+01,
            2.58443928e+00, -2.02749958e+01,  9.16144028e+01,
           -1.85715141e+01],
          [-1.86840134e+02,  2.85692101e+01,  1.63745789e+02,
            4.03450470e+01, -5.92725296e+01, -7.07399368e+01,
            2.77819300e+00, -2.28189812e+01,  1.02823753e+02,
           -2.11647110e+01]])
>>> torch_linear(torch.Tensor(input))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tianqi/.pyenv/versions/3.7.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tianqi/.pyenv/versions/3.7.4/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/tianqi/.pyenv/versions/3.7.4/lib/python3.7/site-packages/torch/nn/functional.py", line 1593, in linear
    output = input.matmul(weight.t())
RuntimeError: size mismatch, m1: [20 x 5], m2: [10 x 10] at /home/tianqi/repository/pytorch/aten/src/TH/generic/THTensorMath.cpp:41
>>>

Solution
Is there any plan to bridge this difference?
What I have been doing is adding a Flatten layer before the Linear, like:

import torch


class Flatten(torch.nn.Module):
    def __init__(self, n_batch_axes=1):
        # Module state must be initialized before the module can be called.
        super().__init__()
        self.n_batch_axes = n_batch_axes

    def forward(self, x):
        return torch.flatten(x, start_dim=self.n_batch_axes)
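
As a quick check of this workaround, the sketch below places the Flatten layer in front of `torch.nn.Linear` and feeds it an input with the same shape as in the example above. The variable names are illustrative only.

```python
import torch


class Flatten(torch.nn.Module):
    def __init__(self, n_batch_axes=1):
        super().__init__()
        self.n_batch_axes = n_batch_axes

    def forward(self, x):
        # Keep the leading batch axes and flatten everything after them.
        return torch.flatten(x, start_dim=self.n_batch_axes)


model = torch.nn.Sequential(Flatten(), torch.nn.Linear(10, 10))
x = torch.arange(100, dtype=torch.float32).reshape(10, 2, 5)
y = model(x)  # (10, 2, 5) is flattened to (10, 10) before the Linear
out_shape = tuple(y.shape)
```

PyTorch also ships `torch.nn.Flatten`, whose default `start_dim=1` gives the same behavior for a single batch axis.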

Create internal debug tools

We need some kind of logging mechanism for debugging.

A PPE_LOG_LEVEL environment variable that lets us obtain DEBUG information
(snapshot not found, snapshot invalid, etc.) when loading snapshots or
verifying trigger checks would be very useful.
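
A minimal sketch of how such a variable could drive the standard `logging` module. The `configure_logger` helper, the logger name `"ppe"`, and the snapshot name in the message are assumptions made for illustration, not existing library behavior.

```python
import logging
import os


def configure_logger(name="ppe", env_var="PPE_LOG_LEVEL", default="WARNING"):
    """Create a logger whose level comes from an environment variable."""
    level_name = os.environ.get(env_var, default).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        # Unknown level names fall back to WARNING.
        level = logging.WARNING
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:
        logger.addHandler(logging.StreamHandler())
    return logger


# Set here only for demonstration; normally the user exports it in the shell.
os.environ["PPE_LOG_LEVEL"] = "debug"
logger = configure_logger()
logger.debug("snapshot not found: %s", "snapshot_iter_100")
```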

Bug when using ProcessWriter with extensions

problem statement

I ran code that includes the fragment below, and it raised the error shown after it.

    writer = writing.ProcessWriter(savefun=torch.save, out_dir=save_path)
    manager.extend(extensions.snapshot(writer=writer), trigger=(1, 'iteration'))
    manager.extend(extensions.snapshot(writer=writer, filename='gen_{.epoch}', target=generator.module), trigger=(10, 'iteration'))
    manager.extend(extensions.snapshot(), trigger=(10, 'epoch'))
    manager.extend(extensions.snapshot(filename='gen_{.epoch}', target=generator.module), trigger=(10, 'epoch'))    

error message

Traceback (most recent call last):
  File "main_train.py", line 232, in <module>
    train()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "main_train.py", line 226, in train
    Image.fromarray(x).save(f'{i}.png')
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.7/site-packages/pytorch_pfn_extras/training/manager.py", line 390, in run_iteration
    self.run_extensions()
  File "/usr/local/lib/python3.7/site-packages/pytorch_pfn_extras/training/manager.py", line 272, in run_extensions
    entry.extension(self)
  File "/usr/local/lib/python3.7/site-packages/pytorch_pfn_extras/training/extensions/_snapshot.py", line 397, in __call__
    self._make_snapshot(manager)
  File "/usr/local/lib/python3.7/site-packages/pytorch_pfn_extras/training/extensions/_snapshot.py", line 422, in _make_snapshot
    writer(filename, outdir, serialized_target, savefun=self._savefun)
  File "/usr/local/lib/python3.7/site-packages/pytorch_pfn_extras/writing.py", line 308, in __call__
    savefun, **self._kwds)
TypeError: create_worker() takes 4 positional arguments but 5 were given

The create_worker in StandardWriter accepts five arguments including self.
However, create_worker in ProcessWriter and ThreadWriter accepts only four arguments.

        self._worker = self.create_worker(filename, out_dir, target,
                                          savefun, **self._kwds)
        self._worker.start()
        self._started = True

    def create_worker(self, filename, out_dir, target, savefun, **kwds):

    def create_worker(self, filename, out_dir, target, **kwds):
        return multiprocessing.Process(
            target=self.save,
            args=(filename, out_dir, target, self._savefun),
            kwargs=self._kwds)
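
One way to resolve the mismatch is to let the subclass accept `savefun` positionally, matching the call site. The sketch below uses stand-in classes (`BaseWriterSketch` and `ThreadWriterSketch` are hypothetical, not the library's classes) and a thread instead of a process so that the lambda does not need to be pickled.

```python
import threading


class BaseWriterSketch:
    """Stand-in for the caller shown above."""

    def __init__(self, savefun, **kwds):
        self._savefun = savefun
        self._kwds = kwds

    def __call__(self, filename, out_dir, target, savefun=None):
        if savefun is None:
            savefun = self._savefun
        # `savefun` is passed positionally, so every subclass must accept it.
        worker = self.create_worker(
            filename, out_dir, target, savefun, **self._kwds
        )
        worker.start()
        return worker


class ThreadWriterSketch(BaseWriterSketch):
    def create_worker(self, filename, out_dir, target, savefun, **kwds):
        return threading.Thread(
            target=self.save,
            args=(filename, out_dir, target, savefun),
            kwargs=kwds,
        )

    def save(self, filename, out_dir, target, savefun, **kwds):
        savefun(target, f"{out_dir}/{filename}", **kwds)


saved = []
writer = ThreadWriterSketch(savefun=lambda obj, path: saved.append((obj, path)))
writer("snapshot", "out", {"step": 1}).join()
```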

Slack notification

It would be useful to be able to post experiment results (possibly including media files) to Slack.

Feature request: LazyBatchNorm

In my project, I implemented LazyBatchNorm as shown below, and I think pytorch-pfn-extras should have a similar feature:

import torch
import pytorch_pfn_extras as ppe


class LazyBatchNorm1d(
    ppe.nn.modules.lazy.LazyInitializationMixin, torch.nn.BatchNorm1d
):
    """
    LazyBatchNorm1d is a lazy version of BatchNorm1d.  It does not require to
    specify the number of the input parameters beforehand.
    """

    lazy_parameter_names = ("weight", "bias")

    def __init__(self, num_features, *args, **kwargs):
        super().__init__(num_features or 0, *args, **kwargs)
        if num_features is None:
            self.num_features = None
            self.weight = ppe.nn.modules.lazy.UninitializedParameter()
            self.bias = ppe.nn.modules.lazy.UninitializedParameter()

    def forward(self, input):
        if isinstance(self.weight, ppe.nn.modules.lazy.UninitializedParameter):
            self.num_features = input.shape[-1]
            if self.affine:
                self.weight = torch.nn.Parameter(
                    self.weight.new_empty(self.num_features)
                )
                self.bias = torch.nn.Parameter(self.weight.new_empty(self.num_features))
            if self.track_running_stats:
                self.running_mean = torch.zeros(
                    self.num_features, device=self.running_mean.device
                )
                self.running_var = torch.ones(
                    self.num_features, device=self.running_var.device
                )
            self.reset_parameters()
        return super().forward(input)

    def reset_parameters(self):
        if self.lazy_parmeters_determined:
            super().reset_parameters()

Another discussion:
It would be good to have a single LazyBatchNorm instead of separate LazyBatchNorm*d classes, since lazy modules can infer the expected shape from their input. Any thoughts?

Recommended version specifications

Some projects seem to stay pinned to the same version, pytorch-pfn-extras==0.2.0.
I think we need to declare our versioning rules (e.g., changes to Z in version X.Y.Z do not introduce backward-incompatible changes, so that users can specify pytorch-pfn-extras<0.3.0).

`BestValueTrigger` does not work well in mpi run like multi node setting

What happened

_DistributedSnapshot with BestValueTrigger gets stuck.

code

https://gist.github.com/dhgrs/56424106e00bafee9617b0a15a028c2c

command

CUDA_VISIBLE_DEVICES=0,1 mpiexec -N 2 python3 mnist.py

Why it happens

Reporter runs in every MPI process without an all-reduce operation, so BestValueTrigger checks a different value in each process. As a result, the trigger fires in some processes but not in others. _DistributedSnapshot waits for all MPI processes, but the processes in which the trigger did not fire never reach it, so the wait never finishes.

Workaround

Apply an all-reduce operation manually before reporting. But is this the best way? Will ppe support automatic all-reduce?
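
A minimal sketch of this workaround using `torch.distributed`. The helper name `average_across_processes` is hypothetical, and the sketch assumes the process group has already been initialized elsewhere; without one it simply returns the local value.

```python
import torch
import torch.distributed as dist


def average_across_processes(value, device=None):
    """Return the mean of `value` over all processes.

    Falls back to the local value when no process group is initialized.
    With the NCCL backend, `device` must be the current CUDA device.
    """
    tensor = torch.as_tensor(value, dtype=torch.float32, device=device).clone()
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        tensor /= dist.get_world_size()
    return tensor.item()


local_loss = 0.25
# Every process now holds the same number, so a trigger that compares
# the reported value makes the same decision everywhere.
synced_loss = average_across_processes(local_loss)
```

The synchronized value would then be passed to `ppe.reporting.report` in place of the local one.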

Do you have a plan to support TensorBoard

PyTorch supports TensorBoard natively through torch.utils.tensorboard, but ppe's LogReport does not, because TensorBoard writers don't satisfy the following requirement on the writer argument.

writer (writer object, optional): must be callable.
object to dump the log to. If specified, it needs to have a correct
`savefun` defined. The writer can override the save location in
the :class:`pytorch_pfn_extras.training.ExtensionsManager` object

So my question is: will ppe support it? Or is there another way to do it?

`autoload=True` not working with models in the gpu

If we create the snapshot extension with autoload=True the model will not correctly load its state.

autoload=True loads the state on the CPU. However, the load is not executed until the first iteration, so it overwrites the device of the model. To make it work, we have to call start_extensions manually and then move the model to the device.

# model parameters will be moved to cpu at the beginning of the first iteration due to autoload being executed there
...
manager.extend(extensions.snapshot(autoload=True))
# move the model, but this will be overwritten later
model.cuda()

for batch in train_loader:
    with manager.run_iteration():   # Snapshot load happens the first time this is executed
        model(batch.cuda())   #  Error! weights are in the cpu again

tabular dataset and transform with tuple as return value behave incorrectly

from pytorch_pfn_extras.dataset.tabular import DelegateDataset, from_data
import numpy as np
dataset = from_data(
    (
        ("img", [np.zeros((3, 32, 32))]),
    )
)
sizes = dataset.asdict().transform(
    ("size",), (((("img",), ("size",)), lambda img: img.shape[1:]),)
)
# expected: (32, 32)
# actual: (32,) 
print(sizes[0])

`import pytorch_pfn_extras` fails when the file name is profile.py

This is a bug report. I'm using pytorch-pfn-extras==0.3.1 and I noticed that the import fails when the file name is profile.py. Here is the file.

import pytorch_pfn_extras

When the file name is hoge.py, it doesn't cause any errors. But when it's profile.py, it causes an error.

$ /usr/bin/python3 /repo/profile.py
Traceback (most recent call last):
  File "/repo/profile.py", line 1, in <module>
    import pytorch_pfn_extras
  File "/usr/local/lib/python3.7/dist-packages/pytorch_pfn_extras/__init__.py", line 7, in <module>
    from pytorch_pfn_extras import training  # NOQA
  File "/usr/local/lib/python3.7/dist-packages/pytorch_pfn_extras/training/__init__.py", line 6, in <module>
    from pytorch_pfn_extras.training import extensions  # NOQA
  File "/usr/local/lib/python3.7/dist-packages/pytorch_pfn_extras/training/extensions/__init__.py", line 17, in <module>
    from pytorch_pfn_extras.training.extensions.print_report_notebook import PrintReportNotebook  # NOQA
  File "/usr/local/lib/python3.7/dist-packages/pytorch_pfn_extras/training/extensions/print_report_notebook.py", line 3, in <module>
    from IPython.core.display import display
  File "/home/user/.local/lib/python3.7/site-packages/IPython/__init__.py", line 56, in <module>
    from .terminal.embed import embed
  File "/home/user/.local/lib/python3.7/site-packages/IPython/terminal/embed.py", line 17, in <module>
    from IPython.terminal.ipapp import load_default_config
  File "/home/user/.local/lib/python3.7/site-packages/IPython/terminal/ipapp.py", line 28, in <module>
    from IPython.core.magics import (
  File "/home/user/.local/lib/python3.7/site-packages/IPython/core/magics/__init__.py", line 21, in <module>
    from .execution import ExecutionMagics
  File "/home/user/.local/lib/python3.7/site-packages/IPython/core/magics/execution.py", line 24, in <module>
    import cProfile as profile
  File "/usr/lib/python3.7/cProfile.py", line 22, in <module>
    run.__doc__ = _pyprofile.run.__doc__
AttributeError: module 'profile' has no attribute 'run'
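
The traceback points to name shadowing: the directory of the script being run comes first on `sys.path`, so `import profile` inside `cProfile` finds the user's profile.py instead of the standard-library module. The sketch below reproduces only that mechanism, with a separate, hypothetical `main.py` and an empty `profile.py` in a temporary directory.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # An empty file that happens to share its name with a stdlib module.
    local_profile = os.path.join(tmp, "profile.py")
    open(local_profile, "w").close()

    main = os.path.join(tmp, "main.py")
    with open(main, "w") as f:
        f.write("import profile\nprint(profile.__file__)\n")

    # Running the script puts its directory first on sys.path.
    result = subprocess.run(
        [sys.executable, main], capture_output=True, text=True
    )
    resolved = result.stdout.strip()
    # True when the local file, not the stdlib module, was imported.
    shadowed = os.path.realpath(resolved) == os.path.realpath(local_profile)
```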

Problems with `share_memory` on `Lazy` modules

Problem Statement
The parameters of Lazy modules, such as LazyLinear, are replaced on the first forward pass. If the model has been shared with share_memory before that, it may not be shared correctly, because the new parameters are not in shared memory.

Examples

1. shared_model = some_lazy_model()
2. shared_model.share_memory()
3. fork to process A and B
4. process A: shared_model(input)... # forward and step
5. process B: test_model.load_state_dict(shared_model.state_dict()) # problems here!!!

The parameters replaced in (4) are not in the shared memory set up in (2), so (5) cannot get the updated results.

The problem was first noticed and reported by @shu65

Bug in sequential repeat when the layer has no parameters

Problem Statement
In init mode, the repeated layer is reset. In PyTorch, the reset_parameters function is used to reset the parameters of layers.

However, some layers, such as torch.nn.ReLU, have neither parameters nor reset_parameters. An error is raised when the model contains such a layer.

Error Message

pytorch_pfn_extras/nn/modules/extended_sequential.py:68: in repeat
    model_list.append(self._copy_model(mode))
pytorch_pfn_extras/nn/modules/extended_sequential.py:27: in _copy_model
    return _reset_parameters(copy.deepcopy(self))
pytorch_pfn_extras/nn/modules/extended_sequential.py:9: in _reset_parameters
    _reset_parameters(submodel)
pytorch_pfn_extras/nn/modules/extended_sequential.py:17: in _reset_parameters
    model.reset_parameters()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ReLU(), name = 'reset_parameters'

    def __getattr__(self, name):
        if '_parameters' in self.__dict__:
            _parameters = self.__dict__['_parameters']
            if name in _parameters:
                return _parameters[name]
        if '_buffers' in self.__dict__:
            _buffers = self.__dict__['_buffers']
            if name in _buffers:
                return _buffers[name]
        if '_modules' in self.__dict__:
            modules = self.__dict__['_modules']
            if name in modules:
                return modules[name]
        raise AttributeError("'{}' object has no attribute '{}'".format(
>           type(self).__name__, name))
E       AttributeError: 'ReLU' object has no attribute 'reset_parameters'
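
One possible fix is to call `reset_parameters` only on modules that define it. The sketch below is an illustration under that assumption, not the library's actual patch; `reset_parameters_if_any` is a hypothetical helper.

```python
import copy

import torch


def reset_parameters_if_any(model):
    """Recursively reset every submodule that defines `reset_parameters`."""
    for child in model.children():
        reset_parameters_if_any(child)
    # Parameter-free layers such as ReLU have no such method, so skip them.
    reset = getattr(model, "reset_parameters", None)
    if callable(reset):
        reset()
    return model


block = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU())
fresh = reset_parameters_if_any(copy.deepcopy(block))
# The copy is re-initialized, so its weights no longer match the original.
weights_differ = not torch.equal(block[0].weight, fresh[0].weight)
```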

DistributedSnapshot with NCCL

I want to use DistributedSnapshot with NCCL. However, the current _get_ranks_from_env only accepts environment variables for Open MPI and MVAPICH2.
The PyTorch documentation uses the environment variables WORLD_SIZE, RANK, and LOCAL_RANK to identify a process.

def _get_ranks_from_env():
    if 'OMPI_COMM_WORLD_SIZE' in os.environ:
        # We are running Open MPI
        comm_world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
        comm_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
        comm_local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
    elif 'MV2_COMM_WORLD_SIZE' in os.environ:
        comm_world_size = int(os.environ['MV2_COMM_WORLD_SIZE'])
        comm_rank = int(os.environ['MV2_COMM_WORLD_RANK'])
        comm_local_rank = int(os.environ['MV2_COMM_WORLD_LOCAL_RANK'])
    else:
        comm_world_size = 1
        comm_rank = 0
        comm_local_rank = 0
    return comm_world_size, comm_rank, comm_local_rank
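
A sketch of how the function could also recognize the variables named in the PyTorch documentation. The `environ` parameter and the fallback of `LOCAL_RANK` to 0 are assumptions added for illustration and testing; they are not part of the current implementation.

```python
import os


def get_ranks_from_env(environ=None):
    """Return (world_size, rank, local_rank) from launcher variables."""
    env = os.environ if environ is None else environ
    # MPI launchers: Open MPI and MVAPICH2 share the same suffixes.
    for prefix in ("OMPI_COMM_WORLD_", "MV2_COMM_WORLD_"):
        if prefix + "SIZE" in env:
            return (
                int(env[prefix + "SIZE"]),
                int(env[prefix + "RANK"]),
                int(env[prefix + "LOCAL_RANK"]),
            )
    if "WORLD_SIZE" in env:
        # Variables used in the PyTorch documentation.
        return (
            int(env["WORLD_SIZE"]),
            int(env["RANK"]),
            int(env.get("LOCAL_RANK", 0)),
        )
    # Single-process default.
    return 1, 0, 0


ranks = get_ranks_from_env({"WORLD_SIZE": "4", "RANK": "3", "LOCAL_RANK": "1"})
```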

Release automation

Automate the release process (publish to pypi) by using GitHub Actions.

Error writing snapshot backup file

self.fs.rename(dest, bak)

This line caused an error.
It seems the .bak file already existed at the time of this call. I think such a situation can easily happen if the previous training session was terminated before removing it.

By the way, it also failed to load the snapshot of the previous session and started over from epoch 1 (see the timestamps of the files listed below).
I don't know whether or not it's related to this issue.

epoch       iteration   elapsed_time  lr          train/loss  val/loss    val/top1    val/top5                                                                                                                                       
1           58          17.791        0.1         8.19181     31.9144     0           0.0119048                                                                                                                                      
2           116         40.637        0.1         7.68947     7.59687     0.00892857  0.00892857                                                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                                   
  File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main                                                                                                                                                         
    return _run_code(code, main_globals, None,                                                                                                                                                                                       
  File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code                                                                                                                                                                    
    exec(code, run_globals)                                                                                                                                                                                                          
  File "mywork/__main__.py", line 40, in <module>                                      
    main()                                                                                                                                                                                                                           
  File "mywork/__main__.py", line 34, in main                                          
    args.func(args)                                                                                                                                                                                                                  
  File "mywork/train.py", line 422, in main                                            
    run_train(manager, model, device, train_loader, optimizer,                                                                                                                                                                       
  File "mywork/train.py", line 153, in run_train                                       
    ppe.reporting.report({'train/loss': loss.item()})                                                                                                                                                                                
  File "/usr/local/lib/python3.8/contextlib.py", line 120, in __exit__                                                                                                                                                               
    next(self.gen)                                                                                                                                                                                                                   
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/training/manager.py", line 430, in run_iteration                                                                                                         
    self.run_extensions()                                                                                                                                                                                                            
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/training/manager.py", line 299, in run_extensions                                                                                                        
    entry.extension(self)                                                                                                                                                                                                            
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/training/extensions/_snapshot.py", line 458, in __call__                                                                                                 
    self._make_snapshot(trainer)                                                                                                                                                                                                     
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/training/extensions/_snapshot.py", line 410, in _make_snapshot                                                                                           
    writer(filename, outdir, serialized_target, savefun=self._savefun)                                                                                                                                                               
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/writing.py", line 270, in __call__                                                             
    self.save(filename, out_dir, target, savefun, append, **self._kwds)
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/writing.py", line 213, in save                                                             
    self.fs.rename(dest, bak)                                                                                                                                                                                                        
  File "/usr/local/lib/python3.8/site-packages/pfio/filesystems/hdfs.py", line 295, in rename                                                                                                                                        
    return self.connection.rename(src, dst)                                                                                                          
  File "/usr/local/lib/python3.8/site-packages/pyarrow/hdfs.py", line 86, in rename                                                                                        
    return super().rename(path, new_path)                                           
  File "pyarrow/io-hdfs.pxi", line 178, in pyarrow.lib.HadoopFileSystem.rename                                                                                                          
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status                                                                                                                                                
OSError: HDFS Rename failed, errno: 5 (Input/output error)

-rw-r--r--   3 niboshi niboshi  231673368 2021-05-24 22:40 hdfs:///.../last_snapshot
-rw-r--r--   3 niboshi niboshi  231991832 2021-05-24 21:57 hdfs:///.../last_snapshot.bak
-rw-r--r--   3 niboshi niboshi       554 2021-05-24 23:14 hdfs:///.../log
-rw-r--r--   3 niboshi niboshi  115971823 2021-05-24 22:40 hdfs:///.../model_epoch_001
-rw-r--r--   3 niboshi niboshi  115971823 2021-05-24 23:14 hdfs:///.../model_epoch_002
-rw-r--r--   3 niboshi niboshi  115971823 2021-05-24 11:51 hdfs:///.../model_epoch_003
-rw-r--r--   3 niboshi niboshi  115971823 2021-05-24 11:52 hdfs:///.../model_epoch_004
...

CI failures in torch19

  • test_tensorboard_writing (linux, win): #194
  • test_empty_shared_dataset (win): #194
  • test_transform[tuple-None-indices53-None-True] (linux): #194
  • test_annotate (linux, win) #196
  • test_apply_annotation (linux, win) #196
  • test_scoped_anchor (linux, win) #197
  • test_export_testcase (linux, win) #197
  • test_tensorboard_writer (linux, win): #194
  • test_queue_writer (linux): #194

linux: https://ci.preferred.jp/pytorch-pfn-extras.torch19-linux/73786/
win: https://ci.preferred.jp/pytorch-pfn-extras.torch19-win/73789/

Snapshot extensions can't deal with `reporting.Summary` when the tensor is on GPU

As its docstring describes, Summary.add can handle a tensor on the GPU.

def add(self, value, weight=1):
    """Adds a scalar value.

    Args:
        value: Scalar value to accumulate. It is either a NumPy scalar or
            a zero-dimensional array (on CPU or GPU).
        weight: An optional weight for the value. It is a NumPy scalar or
            a zero-dimensional array (on CPU or GPU).
            Default is 1 (integer).
    """

When add is called after a snapshot extension has loaded the summary from its snapshot, it raises an error because the tensors are on different devices.

import pytorch_pfn_extras as ppe
import torch

SNAPSHOT_FILENAME = "snapshot"

tensor = torch.zeros(0, device="cuda")
summary = ppe.reporting.Summary()
summary.add(tensor)
torch.save(summary.state_dict(), SNAPSHOT_FILENAME)

manager = ppe.training.ExtensionsManager(
    {}, None, None, iters_per_epoch=1, out_dir="./"
)
snapshot = ppe.training.extensions.snapshot_object(
    summary,
    SNAPSHOT_FILENAME,
    autoload=True,
)

snapshot.initialize(manager)
summary.add(tensor)

Output:

Traceback (most recent call last):
  File "/tmp/a.py", line 21, in <module>
    summary.add(tensor)
  File "/usr/local/lib/python3.8/site-packages/pytorch_pfn_extras/reporting.py", line 272, in add
    self._x += weight * value
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
