When I activate "distrib_shard_files": True and at th

DistributeFilesDataset Sharding with PT Dataloader breaks (returnn, closed)

michelwi commented on September 15, 2024

DistributeFilesDataset Sharding with PT Dataloader breaks

Comments (3)

albertz commented on September 15, 2024

Two things:

`AttributeError: 'DistributeFilesDataset' object has no attribute '_workers'`

This is raised because of an early exception in `__init__`, so `_workers` was never set by the time `__del__` ran. In any case, we should avoid this error by checking `hasattr` or so in `__del__`. But that's not really the main issue here.
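A minimal sketch of the kind of guard meant here. `ShardedDataset` is a hypothetical stand-in class, not the actual `DistributeFilesDataset` code, and the worker cleanup is only assumed to be a `join()` call:

```python
class ShardedDataset:
    """Hypothetical stand-in for a dataset that owns worker processes."""

    def __init__(self, files):
        if not files:
            # Early exception: raised before self._workers is ever assigned.
            raise ValueError("no files given")
        self.files = list(files)
        self._workers = []

    def __del__(self):
        # __del__ also runs on a partially constructed object,
        # so fall back to an empty list if _workers was never set.
        for worker in getattr(self, "_workers", []):
            worker.join()
```

With the guard in place, a failed `__init__` only surfaces its own exception, and the secondary `AttributeError` from `__del__` no longer appears.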

The main issue is that we call `_get_rank_and_size` in the sub proc. Instead, the rank and size should be passed from the parent to the sub proc in serialization. We can pass `_rank` and `_size` as args and, if they are passed, use them. That should already fix it.
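The idea can be sketched roughly as follows. Everything here is illustrative: `get_rank_and_size` is a hypothetical placeholder for the real distributed query, `ShardedDataset` is not the actual dataset class, the use of `__reduce__` assumes the object reaches the sub proc through pickle-style serialization, and the `files[rank::size]` split is just one possible sharding rule.

```python
def get_rank_and_size():
    """Hypothetical placeholder for the distributed query (parent process only)."""
    return 0, 1


class ShardedDataset:
    """Hypothetical stand-in showing rank/size passed through serialization."""

    def __init__(self, files, _rank=None, _size=None):
        self.files = list(files)
        if _rank is None or _size is None:
            # Not passed explicitly: resolve them here, in the parent process.
            _rank, _size = get_rank_and_size()
        self._rank, self._size = _rank, _size

    def __reduce__(self):
        # Serialize the already resolved rank/size as constructor args,
        # so the sub proc never queries the distributed context itself.
        return (type(self), (self.files, self._rank, self._size))

    def shard(self):
        # Illustrative rule: every _size-th file, starting at index _rank.
        return self.files[self._rank :: self._size]
```

When such an object is pickled in the parent, the copy rebuilt in the sub proc receives `_rank` and `_size` as arguments and skips the lookup entirely.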

NeoLegends commented on September 15, 2024

I think I know what this is

NeoLegends commented on September 15, 2024

> The main issue is that we call _get_rank_and_size in the sub proc. Instead, this should be passed from the parent to the sub proc in serialization. We can pass some _rank and _size as args and if they are passed, use them. That should already fix it.

~~I feel like we should rather fix the DistributedContext to properly work in a child process, no?~~

Never mind, I don't think this is feasible because torch will always try to "properly" initialize a process and make it part of the worker group, even if we just want to access some worker group metadata like the size and rank of the parent.
