pfnet / pytorch-pfn-extras

Supplementary components to accelerate research and development in PyTorch

Home Page: https://medium.com/pytorch/migration-from-chainer-to-pytorch-8ed92c12c8

License: MIT License

Python 98.55% Shell 0.67% Dockerfile 0.11% PowerShell 0.53% Batchfile 0.01% Makefile 0.14%

pytorch-pfn-extras's Issues

issue with autoload and `register_buffer`

When register_buffer() is called for an attribute that does not exist yet and a snapshot is then reloaded, loading crashes because autoload does not support strict=False.

Add support for CuPy ndarray / Tensor conversion

https://github.com/chainer/chainer-pytorch-migration/blob/master/chainer_pytorch_migration/tensor.py

DDP and autoload conflicts

When resuming from a snapshot with DDP, autoload makes wrong assumptions about how the model is wrapped.

`autoload=True` not working with models in the gpu

If we create the snapshot extension with autoload=True, the model does not load its state correctly.

autoload=True loads the state on the CPU, but the load is not executed until the first iteration, so it overwrites the device of the model. To make it work, we have to call start_extensions manually and only then move the model to the device.

# model parameters will be moved to the CPU at the beginning of the first iteration because autoload is executed there
...
manager.extend(extensions.snapshot(autoload=True))
# move the model, but this will be overwritten later
model.cuda()

for batch in train_loader:
    with manager.run_iteration():   # Snapshot load happens the first time this is executed
        model(batch.cuda())   #  Error! weights are in the cpu again

Add `py.typed`

Once we complete typing we should add it.

tabular dataset and transform with tuple as return value behave incorrectly

from pytorch_pfn_extras.dataset.tabular import DelegateDataset, from_data
import numpy as np
dataset = from_data(
    (
        ("img", [np.zeros((3, 32, 32))]),
    )
)
sizes = dataset.asdict().transform(
    ("size",), (((("img",), ("size",)), lambda img: img.shape[1:]),)
)
# expected: (32, 32)
# actual: (32,) 
print(sizes[0])

Slack notification

It would be nice to be able to post experiment results (possibly as a media file) to Slack.

Failed to rename a file in writing.py

The following error happens.

I fixed it.

Require PyTorch>=1.8

`Linear` behaves differently when the input is multi-dimensional

Problem Statement
Chainer and PyTorch handle Linear differently when the input has more than two dimensions. Chainer treats the first dimension as the batch axis and flattens the remaining dimensions before the actual forward pass, while PyTorch only looks at the last dimension.
https://github.com/chainer/chainer/blob/df53bff3f36920dfea6b07a5482297d27b31e5b7/chainer/links/connection/linear.py#L181
https://github.com/pytorch/pytorch/blob/d035d05080729c30636ff30fcc068de3c7e9badd/torch/nn/functional.py#L1676

Example

>>> chainer_linear = chainer.links.Linear(10, 10)
>>> torch_linear = torch.nn.Linear(10, 10)
>>> input = numpy.arange(100, dtype=numpy.float32).reshape(10, 2, 5)
>>> chainer_linear(input)
variable([[-1.27548418e+01, -6.84057713e-01,  4.33863544e+00,
            6.55931234e+00, -3.39445019e+00, -2.51949596e+00,
            1.03438854e+00,  7.69187212e-02,  1.93967378e+00,
            2.17406702e+00],
          [-3.20976524e+01,  2.56630421e+00,  2.20505409e+01,
            1.03132820e+01, -9.60312557e+00, -1.00995445e+01,
            1.22814524e+00, -2.46707058e+00,  1.31490164e+01,
           -4.19132054e-01],
          [-5.14404640e+01,  5.81666756e+00,  3.97624512e+01,
            1.40672522e+01, -1.58118019e+01, -1.76795921e+01,
            1.42190135e+00, -5.01105976e+00,  2.43583584e+01,
           -3.01233125e+00],
          [-7.07832718e+01,  9.06703091e+00,  5.74743538e+01,
            1.78212242e+01, -2.20204792e+01, -2.52596416e+01,
            1.61565781e+00, -7.55504990e+00,  3.55676994e+01,
           -5.60552549e+00],
          [-9.01260834e+01,  1.23173914e+01,  7.51862717e+01,
            2.15751915e+01, -2.82291527e+01, -3.28396912e+01,
            1.80941379e+00, -1.00990391e+01,  4.67770386e+01,
           -8.19872189e+00],
          [-1.09468895e+02,  1.55677538e+01,  9.28981705e+01,
            2.53291626e+01, -3.44378281e+01, -4.04197388e+01,
            2.00317073e+00, -1.26430283e+01,  5.79863853e+01,
           -1.07919216e+01],
          [-1.28811707e+02,  1.88181152e+01,  1.10610077e+02,
            2.90831299e+01, -4.06464958e+01, -4.79997826e+01,
            2.19692683e+00, -1.51870193e+01,  6.91957169e+01,
           -1.33851223e+01],
          [-1.48154510e+02,  2.20684814e+01,  1.28321976e+02,
            3.28371048e+01, -4.68551788e+01, -5.55798340e+01,
            2.39068103e+00, -1.77310085e+01,  8.04050751e+01,
           -1.59783182e+01],
          [-1.67497330e+02,  2.53188400e+01,  1.46033875e+02,
            3.65910721e+01, -5.30638542e+01, -6.31598854e+01,
            2.58443928e+00, -2.02749958e+01,  9.16144028e+01,
           -1.85715141e+01],
          [-1.86840134e+02,  2.85692101e+01,  1.63745789e+02,
            4.03450470e+01, -5.92725296e+01, -7.07399368e+01,
            2.77819300e+00, -2.28189812e+01,  1.02823753e+02,
           -2.11647110e+01]])
>>> torch_linear(torch.Tensor(input))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tianqi/.pyenv/versions/3.7.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tianqi/.pyenv/versions/3.7.4/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/tianqi/.pyenv/versions/3.7.4/lib/python3.7/site-packages/torch/nn/functional.py", line 1593, in linear
    output = input.matmul(weight.t())
RuntimeError: size mismatch, m1: [20 x 5], m2: [10 x 10] at /home/tianqi/repository/pytorch/aten/src/TH/generic/THTensorMath.cpp:41
>>>

Solution
Is there any plan to bridge this difference?
My current workaround is to add a Flatten layer before the Linear, like:

class Flatten(torch.nn.Module):
    def __init__(self, n_batch_axes=1):
        super().__init__()  # required so torch.nn.Module is initialized
        self.n_batch_axes = n_batch_axes

    def forward(self, x):
        # Keep the first n_batch_axes dimensions and flatten the rest
        return torch.flatten(x, start_dim=self.n_batch_axes)
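The shape arithmetic behind the mismatch can be sketched in plain Python. The two helper names below are hypothetical; they only mimic how each framework derives the matrix that is multiplied by the weight and do not call either library.

```python
import math


def chainer_style_matrix_shape(shape, n_batch_axes=1):
    """Shape after flattening everything past the batch axes (Chainer-like)."""
    batch = math.prod(shape[:n_batch_axes])
    features = math.prod(shape[n_batch_axes:])
    return (batch, features)


def torch_style_matrix_shape(shape):
    """Shape when only the last axis is treated as features (PyTorch-like)."""
    return (math.prod(shape[:-1]), shape[-1])


in_features = 10
shape = (10, 2, 5)

print(chainer_style_matrix_shape(shape))  # (10, 10): matches in_features
print(torch_style_matrix_shape(shape))    # (20, 5): 5 != in_features
```

The second result is the `[20 x 5]` operand reported in the size-mismatch error above.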

Module to validate a shape and dtype?

Modules like tf.ensure_shape can be useful in PyTorch too.
https://www.tensorflow.org/api_docs/python/tf/ensure_shape
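A minimal sketch of such a check, written against plain tuples so that it does not depend on any framework. `ensure_shape` here is a hypothetical helper, not an existing PyTorch or ppe API; as in tf.ensure_shape, `None` marks a dimension of any size.

```python
def ensure_shape(actual, expected):
    """Return `actual` as a tuple if it matches `expected`.

    A None entry in `expected` accepts any size for that dimension.
    """
    actual = tuple(actual)
    if len(actual) != len(expected):
        raise ValueError(f"rank mismatch: got {actual}, expected {expected}")
    for got, want in zip(actual, expected):
        if want is not None and got != want:
            raise ValueError(f"shape mismatch: got {actual}, expected {expected}")
    return actual


print(ensure_shape((8, 3, 32, 32), (None, 3, 32, 32)))  # (8, 3, 32, 32)
```

Inside a module's forward this could be called as `ensure_shape(x.shape, (None, 3, 32, 32))`, since a tensor's shape behaves like a tuple of ints. A dtype check would follow the same pattern.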

Make TensorBoardWriter no-op if tensorboard is not available

Saving snapshots or logs fails after an unexpected previous failure

If log.bak is left behind, restarting a job that was killed midway always fails, because rename(src, dst) always fails on HDFS when dst exists (HDFS is designed to fail in this case). We may need to ensure that the destination file is absent before renaming, but unexpected process termination probably has many other pitfalls.

This bug was found in our internal HDFS cluster, and we have a corresponding issue internally.
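One way to ensure the destination is absent can be sketched as follows. `StrictRenameFS` is a toy stand-in that only imitates the "rename fails if dst exists" behaviour on local files, and `rename_replacing` is a hypothetical helper; neither is part of ppe or pfio.

```python
import os
import tempfile


class StrictRenameFS:
    """Toy local filesystem whose rename fails if dst exists (HDFS-like)."""

    def exists(self, path):
        return os.path.exists(path)

    def remove(self, path):
        os.remove(path)

    def rename(self, src, dst):
        if os.path.exists(dst):
            raise OSError(f"rename failed: {dst} already exists")
        os.rename(src, dst)


def rename_replacing(fs, src, dst):
    """Rename src to dst, deleting a stale dst left by an earlier crash."""
    if fs.exists(dst):
        fs.remove(dst)
    fs.rename(src, dst)


with tempfile.TemporaryDirectory() as d:
    log, bak = os.path.join(d, "log"), os.path.join(d, "log.bak")
    for path, text in ((log, "new"), (bak, "stale")):
        with open(path, "w") as f:
            f.write(text)
    rename_replacing(StrictRenameFS(), log, bak)
    with open(bak) as f:
        result = f.read()
    print(result)  # new
```

This is not atomic: a crash between the remove and the rename leaves no backup at all, which is one of the pitfalls mentioned above.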

Problems with `share_memory` on `Lazy` modules

Problem Statement
The parameters of Lazy modules, such as LazyLinear, are replaced on the first forward pass. If the model has already been shared with share_memory, it may not be shared correctly, because the new parameters are not in shared memory.

Examples

1. shared_model = some_lazy_model()
2. shared_model.share_memory()
3. fork to process A and B
4. process A: shared_model(input)... # forward and step
5. process B: test_model.load_state_dict(shared_model.state_dict()) # problems here!!!

The parameters created in (4) are not in the shared memory that was set up in (2), so (5) cannot see the updated results.

The problem was first noticed and reported by @shu65

Support IterableDataset in ExtensionsManager

When using IterableDataset, the length of the dataset may not be determined, so users may not be able to specify iters_per_epoch.

`import pytorch_pfn_extras` fails when the file name is profile.py

This is a bug report. I'm using pytorch-pfn-extras==0.3.1 and I noticed that the import fails when the file name is profile.py. Here is the file.

import pytorch_pfn_extras

When the file name is hoge.py, it doesn't cause any errors. But when it's profile.py, it causes an error.

$ /usr/bin/python3 /repo/profile.py
Traceback (most recent call last):
  File "/repo/profile.py", line 1, in <module>
    import pytorch_pfn_extras
  File "/usr/local/lib/python3.7/dist-packages/pytorch_pfn_extras/__init__.py", line 7, in <module>
    from pytorch_pfn_extras import training  # NOQA
  File "/usr/local/lib/python3.7/dist-packages/pytorch_pfn_extras/training/__init__.py", line 6, in <module>
    from pytorch_pfn_extras.training import extensions  # NOQA
  File "/usr/local/lib/python3.7/dist-packages/pytorch_pfn_extras/training/extensions/__init__.py", line 17, in <module>
    from pytorch_pfn_extras.training.extensions.print_report_notebook import PrintReportNotebook  # NOQA
  File "/usr/local/lib/python3.7/dist-packages/pytorch_pfn_extras/training/extensions/print_report_notebook.py", line 3, in <module>
    from IPython.core.display import display
  File "/home/user/.local/lib/python3.7/site-packages/IPython/__init__.py", line 56, in <module>
    from .terminal.embed import embed
  File "/home/user/.local/lib/python3.7/site-packages/IPython/terminal/embed.py", line 17, in <module>
    from IPython.terminal.ipapp import load_default_config
  File "/home/user/.local/lib/python3.7/site-packages/IPython/terminal/ipapp.py", line 28, in <module>
    from IPython.core.magics import (
  File "/home/user/.local/lib/python3.7/site-packages/IPython/core/magics/__init__.py", line 21, in <module>
    from .execution import ExecutionMagics
  File "/home/user/.local/lib/python3.7/site-packages/IPython/core/magics/execution.py", line 24, in <module>
    import cProfile as profile
  File "/usr/lib/python3.7/cProfile.py", line 22, in <module>
    run.__doc__ = _pyprofile.run.__doc__
AttributeError: module 'profile' has no attribute 'run'
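The traceback is consistent with module shadowing: when a script is run directly, its directory is placed first on sys.path, so the `import profile` inside cProfile picks up /repo/profile.py instead of the standard-library module. A small check for this situation might look like the sketch below; `shadows_stdlib` is a hypothetical helper, and the fallback set is only an illustrative sample for Python versions without `sys.stdlib_module_names`.

```python
import sys
from pathlib import Path

# sys.stdlib_module_names exists on Python 3.10+; the fallback set is a
# small illustrative sample, not a complete list.
_STDLIB = getattr(
    sys, "stdlib_module_names",
    frozenset({"profile", "random", "json", "logging"}),
)


def shadows_stdlib(script_path):
    """True if the script's file name equals a standard-library module name."""
    return Path(script_path).stem in _STDLIB


print(shadows_stdlib("/repo/profile.py"))  # True
print(shadows_stdlib("/repo/hoge.py"))     # False
```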

Create internal debug tools

We need some kind of logging mechanism for debugging.

When loading snapshots or verifying trigger checks, a PPE_LOG_LEVEL environment variable that lets us obtain DEBUG information (snapshot not found, snapshot invalid, etc.) would be very useful.
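A minimal sketch of how such a variable could drive the standard `logging` module. The function name and the logger name are hypothetical, and falling back to WARNING for a missing or unknown value is an assumption rather than a decided behaviour.

```python
import logging
import os


def configure_debug_logger(env=os.environ):
    """Build a logger whose level comes from PPE_LOG_LEVEL (default WARNING)."""
    level_name = env.get("PPE_LOG_LEVEL", "WARNING").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.WARNING  # unknown value: fall back to the default
    logger = logging.getLogger("ppe_debug_sketch")  # hypothetical logger name
    logger.setLevel(level)
    return logger


logging.basicConfig()  # attach a stderr handler so messages become visible
logger = configure_debug_logger({"PPE_LOG_LEVEL": "debug"})
logger.debug("snapshot not found: %s", "snapshot_iter_100")
```

Passing the environment as a mapping keeps the function easy to test; in normal use it would be called without arguments and read os.environ.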

Do you have a plan to support TensorBoard?

PyTorch supports TensorBoard natively in torch.utils.tensorboard. But ppe's LogReport doesn't, because TensorBoard writers don't satisfy the requirement below.

pytorch-pfn-extras/pytorch_pfn_extras/training/extensions/log_report.py

Lines 57 to 60 in b2ddc45

  writer (writer object, optional): must be callable.
  object to dump the log to. If specified, it needs to have a correct
  `savefun` defined. The writer can override the save location in
  the :class:`pytorch_pfn_extras.training.ExtensionsManager` object

So my question is: will ppe support it? Or is there another way to do it?

Error writing snapshot backup file

pytorch-pfn-extras/pytorch_pfn_extras/writing.py

Line 216 in 1b265e7

self.fs.rename(dest, bak)

This line caused an error.
It seems the .bak file already existed at the time of this call. I think this situation can easily happen if the previous training session was terminated before the file was removed.

By the way, it also failed to load the snapshot of the previous session and started over from epoch 1 (see the timestamps of the files listed below).
I don't know whether it's related to this issue.

epoch       iteration   elapsed_time  lr          train/loss  val/loss    val/top1    val/top5
1           58          17.791        0.1         8.19181     31.9144     0           0.0119048
2           116         40.637        0.1         7.68947     7.59687     0.00892857  0.00892857
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "mywork/__main__.py", line 40, in <module>
    main()
  File "mywork/__main__.py", line 34, in main
    args.func(args)
  File "mywork/train.py", line 422, in main
    run_train(manager, model, device, train_loader, optimizer,
  File "mywork/train.py", line 153, in run_train
    ppe.reporting.report({'train/loss': loss.item()})
  File "/usr/local/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/training/manager.py", line 430, in run_iteration
    self.run_extensions()
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/training/manager.py", line 299, in run_extensions
    entry.extension(self)
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/training/extensions/_snapshot.py", line 458, in __call__
    self._make_snapshot(trainer)
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/training/extensions/_snapshot.py", line 410, in _make_snapshot
    writer(filename, outdir, serialized_target, savefun=self._savefun)
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/writing.py", line 270, in __call__
    self.save(filename, out_dir, target, savefun, append, **self._kwds)
  File "/home/nishino/.local/lib/python3.8/site-packages/pytorch_pfn_extras/writing.py", line 213, in save
    self.fs.rename(dest, bak)
  File "/usr/local/lib/python3.8/site-packages/pfio/filesystems/hdfs.py", line 295, in rename
    return self.connection.rename(src, dst)
  File "/usr/local/lib/python3.8/site-packages/pyarrow/hdfs.py", line 86, in rename
    return super().rename(path, new_path)
  File "pyarrow/io-hdfs.pxi", line 178, in pyarrow.lib.HadoopFileSystem.rename
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS Rename failed, errno: 5 (Input/output error)

-rw-r--r--   3 niboshi niboshi  231673368 2021-05-24 22:40 hdfs:///.../last_snapshot
-rw-r--r--   3 niboshi niboshi  231991832 2021-05-24 21:57 hdfs:///.../last_snapshot.bak
-rw-r--r--   3 niboshi niboshi       554 2021-05-24 23:14 hdfs:///.../log
-rw-r--r--   3 niboshi niboshi  115971823 2021-05-24 22:40 hdfs:///.../model_epoch_001
-rw-r--r--   3 niboshi niboshi  115971823 2021-05-24 23:14 hdfs:///.../model_epoch_002
-rw-r--r--   3 niboshi niboshi  115971823 2021-05-24 11:51 hdfs:///.../model_epoch_003
-rw-r--r--   3 niboshi niboshi  115971823 2021-05-24 11:52 hdfs:///.../model_epoch_004
...

Recommended version specifications

Some projects seem to pin the exact version pytorch-pfn-extras==0.2.0.
I think we need to declare our versioning rules (e.g. a change in Z of version X.Y.Z does not introduce backward-incompatible changes, so that they can use pytorch-pfn-extras<0.3.0).
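To illustrate what such a rule would permit, the sketch below checks versions against the range `>=0.2.0,<0.3.0` using plain tuple comparison. Both helpers are hypothetical and handle only simple numeric X.Y.Z strings; a real project would express the range as a requirement specifier such as `pytorch-pfn-extras>=0.2.0,<0.3.0`.

```python
def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def allowed(version, lower="0.2.0", upper="0.3.0"):
    """True if lower <= version < upper, i.e. the range '>=0.2.0,<0.3.0'."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


print(allowed("0.2.0"))   # True
print(allowed("0.2.10"))  # True: patch releases stay inside the range
print(allowed("0.3.0"))   # False
```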

Separate coding style check from FlexCI to GitHub Actions

Add release note generator bot

Use pysen for lint check

TODO:

Enable isort and black.
Use flake8 with pysen's default configuration.

Snapshot extensions can't deal with `reporting.Summary` when the tensor is on GPU

As described here, Summary.add can deal with a tensor on GPU.

pytorch-pfn-extras/pytorch_pfn_extras/reporting.py

Lines 261 to 271 in 5024cf7

    def add(self, value, weight=1):
        """Adds a scalar value.

        Args:
            value: Scalar value to accumulate. It is either a NumPy scalar or
                a zero-dimensional array (on CPU or GPU).
            weight: An optional weight for the value. It is a NumPy scalar or
                a zero-dimensional array (on CPU or GPU).
                Default is 1 (integer).

        """

However, calling add after a snapshot extension has loaded the summary from its snapshot raises an error, because the loaded state and the new value are on different devices.

import pytorch_pfn_extras as ppe
import torch

SNAPSHOT_FILENAME = "snapshot"

tensor = torch.zeros(0, device="cuda")
summary = ppe.reporting.Summary()
summary.add(tensor)
torch.save(summary.state_dict(), SNAPSHOT_FILENAME)

manager = ppe.training.ExtensionsManager(
    {}, None, None, iters_per_epoch=1, out_dir="./"
)
snapshot = ppe.training.extensions.snapshot_object(
    summary,
    SNAPSHOT_FILENAME,
    autoload=True,
)

snapshot.initialize(manager)
summary.add(tensor)

Output:

Traceback (most recent call last):
  File "/tmp/a.py", line 21, in <module>
    summary.add(tensor)
  File "/usr/local/lib/python3.8/site-packages/pytorch_pfn_extras/reporting.py", line 272, in add
    self._x += weight * value
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Refactor config.py

Feature request: LazyBatchNorm

In my project, I implemented LazyBatchNorm like this, and I think pytorch-pfn-extras should have a similar feature:

class LazyBatchNorm1d(
    ppe.nn.modules.lazy.LazyInitializationMixin, torch.nn.BatchNorm1d
):
    """
    LazyBatchNorm1d is a lazy version of BatchNorm1d.  It does not require to
    specify the number of the input parameters beforehand.
    """

    lazy_parameter_names = ("weight", "bias")

    def __init__(self, num_features, *args, **kwargs):
        super().__init__(num_features or 0, *args, **kwargs)
        if num_features is None:
            self.num_features = None
            self.weight = ppe.nn.modules.lazy.UninitializedParameter()
            self.bias = ppe.nn.modules.lazy.UninitializedParameter()

    def forward(self, input):
        if isinstance(self.weight, ppe.nn.modules.lazy.UninitializedParameter):
            self.num_features = input.shape[-1]
            if self.affine:
                self.weight = torch.nn.Parameter(
                    self.weight.new_empty(self.num_features)
                )
                self.bias = torch.nn.Parameter(self.weight.new_empty(self.num_features))
            if self.track_running_stats:
                self.running_mean = torch.zeros(
                    self.num_features, device=self.running_mean.device
                )
                self.running_var = torch.ones(
                    self.num_features, device=self.running_var.device
                )
            self.reset_parameters()
        return super().forward(input)

    def reset_parameters(self):
        if self.lazy_parmeters_determined:
            super().reset_parameters()

Another point for discussion:
It would be good to have a single LazyBatchNorm instead of separate LazyBatchNorm*d classes, since a lazy module can infer the desired shape from its input. Any thoughts?

Make onnx optional requirement to import `pytorch_pfn_extras.onnx`

This would be convenient because we could build the docs without requiring onnx.

[ONNX] Single input is not exported correctly

When exporting to ONNX with a single input, the input tensor is split and the exported input shape is wrong.

pytorch-pfn-extras/pytorch_pfn_extras/onnx/export_testcase.py

Line 177 in 2c5e144

args = [args[i] for i in used_input_index_list]

Bug in sequential repeat when the layer has no parameters

Problem Statement
In init mode, the repeated layer is reset. In PyTorch, we use the reset_parameters function to reset the parameters of layers, as here:

pytorch-pfn-extras/pytorch_pfn_extras/nn/modules/extended_sequential.py

Line 15 in 92dad97

model.reset_parameters()

However, some layers, such as torch.nn.ReLU, have neither parameters nor reset_parameters. An error is raised when the model contains such a layer.

Error Message

pytorch_pfn_extras/nn/modules/extended_sequential.py:68: in repeat
    model_list.append(self._copy_model(mode))
pytorch_pfn_extras/nn/modules/extended_sequential.py:27: in _copy_model
    return _reset_parameters(copy.deepcopy(self))
pytorch_pfn_extras/nn/modules/extended_sequential.py:9: in _reset_parameters
    _reset_parameters(submodel)
pytorch_pfn_extras/nn/modules/extended_sequential.py:17: in _reset_parameters
    model.reset_parameters()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ReLU(), name = 'reset_parameters'

    def __getattr__(self, name):
        if '_parameters' in self.__dict__:
            _parameters = self.__dict__['_parameters']
            if name in _parameters:
                return _parameters[name]
        if '_buffers' in self.__dict__:
            _buffers = self.__dict__['_buffers']
            if name in _buffers:
                return _buffers[name]
        if '_modules' in self.__dict__:
            modules = self.__dict__['_modules']
            if name in modules:
                return modules[name]
        raise AttributeError("'{}' object has no attribute '{}'".format(
>           type(self).__name__, name))
E       AttributeError: 'ReLU' object has no attribute 'reset_parameters'
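One possible guard is to call reset_parameters only when the layer defines it. The sketch below uses small stand-in classes (`FakeLinear`, `FakeReLU`, `FakeSequential`, all hypothetical) so that it runs without PyTorch; it is not the fix applied in the repository.

```python
class FakeLinear:
    """Stand-in for a layer that has parameters and reset_parameters()."""

    def __init__(self):
        self.reset_calls = 0

    def children(self):
        return []

    def reset_parameters(self):
        self.reset_calls += 1


class FakeReLU:
    """Stand-in for a parameter-free layer without reset_parameters()."""

    def children(self):
        return []


class FakeSequential:
    """Stand-in container that exposes its layers through children()."""

    def __init__(self, *layers):
        self.layers = list(layers)

    def children(self):
        return self.layers


def reset_parameters_recursively(model):
    """Reset every submodule that defines reset_parameters; skip the rest."""
    for child in model.children():
        reset_parameters_recursively(child)
    reset = getattr(model, "reset_parameters", None)
    if callable(reset):
        reset()
    return model


linear = FakeLinear()
reset_parameters_recursively(FakeSequential(linear, FakeReLU()))
print(linear.reset_calls)  # 1
```

The same `getattr(..., None)` pattern should carry over to real modules, since torch.nn.Module raises AttributeError for missing attributes, as the traceback above shows.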

Wrong counters when restoring snapshots

When a snapshot of a finished training run is loaded with autoload=True, the training runs an additional epoch, although it should not run at all.

Release automation

Automate the release process (publishing to PyPI) by using GitHub Actions.

Avoid using `register_backward_hook` deprecated in PyTorch 1.8

Consider migrating to pysen?

Currently, we are just using the latest flake8. Using pysen (https://github.com/pfnet/pysen) looks like a good way to get reproducible results between local environments and CI.

Make StandardWriter reusable

Currently StandardWriter variants cannot be reused once finalized. It should be able to be reused.

Add mergify bot

API to handle filesystems that do not support append

Related to: #118

Some filesystems, such as HDFS or S3 (and equivalents), do not support appending to an existing file. It would be beneficial for users if Writer supported these filesystems while keeping performance acceptable.

An initial idea: provide a BufferedWriter class that can be used as a mix-in with existing writers. The mix-in writes to a local scratch disk, then syncs the file to the underlying HDFS/S3 filesystem in a background thread.
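The idea can be sketched with the standard library alone. `BufferedAppendSketch` is a hypothetical class, not the proposed BufferedWriter API, and `shutil.copyfile` to a second local directory stands in for the upload to HDFS/S3.

```python
import os
import shutil
import tempfile
import threading


class BufferedAppendSketch:
    """Append to a local scratch file, then copy the whole file to the target."""

    def __init__(self, remote_path, scratch_dir):
        self.remote_path = remote_path
        self.local_path = os.path.join(
            scratch_dir, os.path.basename(remote_path))
        self._thread = None

    def append(self, text):
        # Appending is always supported on the local scratch disk.
        with open(self.local_path, "a") as f:
            f.write(text)

    def sync(self):
        # Upload in the background; copyfile stands in for an HDFS/S3 put.
        self.wait()
        self._thread = threading.Thread(
            target=shutil.copyfile, args=(self.local_path, self.remote_path))
        self._thread.start()

    def wait(self):
        # Block until the pending upload, if any, has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


with tempfile.TemporaryDirectory() as scratch, \
        tempfile.TemporaryDirectory() as remote:
    writer = BufferedAppendSketch(os.path.join(remote, "log"), scratch)
    writer.append("epoch 1\n")
    writer.append("epoch 2\n")
    writer.sync()
    writer.wait()
    with open(os.path.join(remote, "log")) as f:
        synced = f.read()
```

The remote file is always rewritten as a whole, so no append support is needed on the target. The sketch does not guard against appending while an upload is in flight; a real implementation would need a lock or a snapshot copy.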

CI failures in torch19

test_tensorboard_writing (linux, win): #194
test_empty_shared_dataset (win): #194
test_transform[tuple-None-indices53-None-True] (linux): #194
test_annotate (linux, win): #196
test_apply_annotation (linux, win): #196
test_scoped_anchor (linux, win): #197
test_export_testcase (linux, win): #197
test_tensorboard_writer (linux, win): #194
test_queue_writer (linux): #194

linux: https://ci.preferred.jp/pytorch-pfn-extras.torch19-linux/73786/
win: https://ci.preferred.jp/pytorch-pfn-extras.torch19-win/73789/

Make tests stateless

https://github.com/pfnet/pytorch-pfn-extras/pull/196/files#r655265909

At least, temporary directories (e.g., out, result) should be removed after each testcase.
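A minimal sketch of a test that cleans up after itself, using only `unittest` and `tempfile`. The test case and file names are hypothetical and not taken from the ppe test suite.

```python
import os
import tempfile
import unittest


class WriterTest(unittest.TestCase):
    def setUp(self):
        # Fresh directory per test; cleanup runs even if the test fails.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.out_dir = self._tmp.name

    def test_writes_log(self):
        path = os.path.join(self.out_dir, "log")
        with open(path, "w") as f:
            f.write("{}")
        self.assertTrue(os.path.exists(path))
```

With pytest, the built-in `tmp_path` fixture gives each test its own directory in a similar way.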

Add test for `LogWriterSaveFunc`

Add tests that check whether the output of LogWriterSaveFunc follows the specified format.

Introduce type annotations

As we assume users inherit some APIs, it's better to have a type annotation.

Load the log for pre-train

How can I load the log and continue drawing the loss PNG when I pre-train the model?

Bug when using ProcessWriter with extensions

problem statement

I ran code that includes the fragment below, and it raised the error shown after it.

    writer = writing.ProcessWriter(savefun=torch.save, out_dir=save_path)
    manager.extend(extensions.snapshot(writer=writer), trigger=(1, 'iteration'))
    manager.extend(extensions.snapshot(writer=writer, filename='gen_{.epoch}', target=generator.module), trigger=(10, 'iteration'))
    manager.extend(extensions.snapshot(), trigger=(10, 'epoch'))
    manager.extend(extensions.snapshot(filename='gen_{.epoch}', target=generator.module), trigger=(10, 'epoch'))

error message

Traceback (most recent call last):
  File "main_train.py", line 232, in <module>
    train()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "main_train.py", line 226, in train
    Image.fromarray(x).save(f'{i}.png')
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.7/site-packages/pytorch_pfn_extras/training/manager.py", line 390, in run_iteration
    self.run_extensions()
  File "/usr/local/lib/python3.7/site-packages/pytorch_pfn_extras/training/manager.py", line 272, in run_extensions
    entry.extension(self)
  File "/usr/local/lib/python3.7/site-packages/pytorch_pfn_extras/training/extensions/_snapshot.py", line 397, in __call__
    self._make_snapshot(manager)
  File "/usr/local/lib/python3.7/site-packages/pytorch_pfn_extras/training/extensions/_snapshot.py", line 422, in _make_snapshot
    writer(filename, outdir, serialized_target, savefun=self._savefun)
  File "/usr/local/lib/python3.7/site-packages/pytorch_pfn_extras/writing.py", line 308, in __call__
    savefun, **self._kwds)
TypeError: create_worker() takes 4 positional arguments but 5 were given

create_worker in StandardWriter accepts five arguments including self.
However, create_worker in ProcessWriter and ThreadWriter accepts only four arguments.

pytorch-pfn-extras/pytorch_pfn_extras/writing.py

Lines 307 to 312 in 8b16df9

        self._worker = self.create_worker(filename, out_dir, target,
                                          savefun, **self._kwds)
        self._worker.start()
        self._started = True

    def create_worker(self, filename, out_dir, target, savefun, **kwds):

pytorch-pfn-extras/pytorch_pfn_extras/writing.py

Lines 374 to 378 in 8b16df9

    def create_worker(self, filename, out_dir, target, **kwds):
        return multiprocessing.Process(
            target=self.save,
            args=(filename, out_dir, target, self._savefun),
            kwargs=self._kwds)
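The mismatch disappears once the override accepts the same positional arguments that the base class passes. The classes below are simplified, hypothetical stand-ins for the ppe writers (using a thread rather than a process to keep the example short), not the actual patch.

```python
import threading


class BaseWriterSketch:
    def __init__(self, savefun):
        self._savefun = savefun
        self._worker = None

    def __call__(self, filename, out_dir, target, savefun=None, **kwds):
        # The base class passes savefun positionally, so every override of
        # create_worker must accept it.
        self._worker = self.create_worker(
            filename, out_dir, target, savefun or self._savefun, **kwds)
        self._worker.start()

    def create_worker(self, filename, out_dir, target, savefun, **kwds):
        raise NotImplementedError


class ThreadWriterSketch(BaseWriterSketch):
    # Same signature as the base class, including savefun.
    def create_worker(self, filename, out_dir, target, savefun, **kwds):
        return threading.Thread(
            target=savefun, args=(target, filename), kwargs=kwds)


saved = []
writer = ThreadWriterSketch(
    savefun=lambda target, name: saved.append((name, target)))
writer("snapshot", "out", {"w": 1})
writer._worker.join()
print(saved)  # [('snapshot', {'w': 1})]
```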

`BestValueTrigger` does not work well in MPI runs such as multi-node settings

What happened

_DistributedSnapshot with BestValueTrigger gets stuck.

code

https://gist.github.com/dhgrs/56424106e00bafee9617b0a15a028c2c

command

CUDA_VISIBLE_DEVICES=0,1 mpiexec -N 2 python3 mnist.py

Why it happens

The reporter runs in every MPI process without an all-reduce operation, so BestValueTrigger checks a different value in each process. As a result, some processes are triggered while others are not. _DistributedSnapshot waits for all MPI processes, but some of them never arrive because they were not triggered.

Workaround

Apply an all-reduce operation manually before reporting. But is this the best way? Will ppe support automatic all-reduce?
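The effect of the workaround can be simulated in plain Python without MPI. The two helpers are hypothetical: `fires` imitates a best-value rule for a minimized metric, and `all_reduce_mean` imitates what an all-reduce followed by averaging would leave on every rank.

```python
def fires(value, best):
    """Best-value rule for a minimized metric: fire when the value improves."""
    return value < best


def all_reduce_mean(values):
    """Simulate an all-reduce: every rank ends up with the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


best = 0.5
local_losses = [0.25, 0.75]  # rank 0 and rank 1 see different validation loss

# Without the reduction the ranks disagree, so one rank waits forever.
print([fires(v, best) for v in local_losses])                   # [True, False]
# After the reduction every rank makes the same decision.
print([fires(v, best) for v in all_reduce_mean(local_losses)])  # [False, False]
```

In a real run the reduction would be performed by the communication backend (for example torch.distributed.all_reduce) before the value is reported.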

DistributedSnapshot with NCCL

I want to use DistributedSnapshot with NCCL. However, the current _get_ranks_from_env only accepts the environment variables of Open MPI and MVAPICH2 (MV2).
The PyTorch documentation uses the environment variables WORLD_SIZE, RANK, and LOCAL_RANK to identify a process.

pytorch-pfn-extras/pytorch_pfn_extras/training/extensions/_snapshot.py

Lines 473 to 488 in 8b16df9

def _get_ranks_from_env():
    if 'OMPI_COMM_WORLD_SIZE' in os.environ:
        # We are running Open MPI
        comm_world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
        comm_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
        comm_local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
    elif 'MV2_COMM_WORLD_SIZE' in os.environ:
        comm_world_size = int(os.environ['MV2_COMM_WORLD_SIZE'])
        comm_rank = int(os.environ['MV2_COMM_WORLD_RANK'])
        comm_local_rank = int(os.environ['MV2_COMM_WORLD_LOCAL_RANK'])
    else:
        comm_world_size = 1
        comm_rank = 0
        comm_local_rank = 0

    return comm_world_size, comm_rank, comm_local_rank
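A sketch of how the lookup could also honour WORLD_SIZE, RANK, and LOCAL_RANK. The function name and the `env` parameter are hypothetical, and defaulting LOCAL_RANK to 0 when it is missing is an assumption, not behaviour confirmed by ppe.

```python
import os


def get_ranks_from_env(env=os.environ):
    """Return (world_size, rank, local_rank) from launcher variables."""
    # MPI launchers first, in the same order as the existing function.
    for prefix in ("OMPI_COMM_WORLD_", "MV2_COMM_WORLD_"):
        if prefix + "SIZE" in env:
            return (int(env[prefix + "SIZE"]),
                    int(env[prefix + "RANK"]),
                    int(env[prefix + "LOCAL_RANK"]))
    if "WORLD_SIZE" in env:
        # Variables named in the PyTorch documentation.
        return (int(env["WORLD_SIZE"]),
                int(env["RANK"]),
                int(env.get("LOCAL_RANK", 0)))  # assumed default
    # Single-process fallback.
    return 1, 0, 0


print(get_ranks_from_env({"WORLD_SIZE": "4", "RANK": "2", "LOCAL_RANK": "1"}))
# (4, 2, 1)
```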
