Comments (3)
Two things:
AttributeError: 'DistributeFilesDataset' object has no attribute '_workers'
This happens because of an early exception in __init__. We should avoid this error anyway by checking hasattr or similar in __del__, but that's not really the main issue here.
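A minimal sketch of the guarded __del__ suggested above. The class, the list-valued _workers attribute, and the worker.exit() call are hypothetical stand-ins, not the actual DistributeFilesDataset code:

```python
class DatasetSketch:
    """Hypothetical stand-in for a dataset whose __init__ can fail early."""

    def __init__(self, files):
        if not files:
            # Raised before self._workers is assigned, so the object
            # is only partially initialized when __del__ runs later.
            raise ValueError("no files given")
        self._workers = []  # hypothetical: worker handles would live here

    def __del__(self):
        # getattr with a default avoids the AttributeError when __init__
        # raised before _workers was set.
        for worker in getattr(self, "_workers", ()):
            worker.exit()  # hypothetical worker API
```

With the guard in place, cleanup of a partially constructed object is a no-op, so the original exception from __init__ stays the only error reported.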
The main issue is that we call _get_rank_and_size in the sub proc. Instead, this should be passed from the parent to the sub proc in serialization. We can pass some _rank and _size as args and if they are passed, use them. That should already fix it.
I think I know what this is
> The main issue is that we call _get_rank_and_size in the sub proc. Instead, this should be passed from the parent to the sub proc in serialization. We can pass some _rank and _size as args and if they are passed, use them. That should already fix it.
I feel like we should rather fix the DistributedContext to work properly in a child process, no?
Never mind, I don't think this is feasible because torch will always try to "properly" initialize a process and make it part of the worker group, even if we just want to access some worker group metadata like the size and rank of the parent.