
Comments (8)

lucabergamini commented on May 24, 2024

Do we know where we use google.protobuf? Is it specific to l5kit or zarr, or is it used by pytorch?

It's specific to l5kit; we use it for the semantic map.
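For context, the semantic map in l5kit is parsed from a protobuf file, and the semantic rasterizer ends up holding protobuf message objects, which the spawn-based pickler later chokes on. A rough sketch of what that loading looks like (this assumes the generated module l5kit.data.proto.road_network_pb2 and its MapFragment message; the helper function name is mine, not l5kit's):

from l5kit.data.proto.road_network_pb2 import MapFragment  # assumed protobuf-generated module

def load_semantic_map(protobuf_map_path):
    # Parse the semantic map protobuf. Its repeated fields are google.protobuf
    # containers, which cannot be pickled when workers are started with "spawn".
    map_fragment = MapFragment()
    with open(protobuf_map_path, "rb") as f:
        map_fragment.ParseFromString(f.read())
    return map_fragment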


lucabergamini commented on May 24, 2024

The downside is that each worker process will load its own copy of the entire pytorch library, which somehow takes around 4 GB of commit memory. During training, each worker can take up to 7 GB of commit memory. So you need to turn on virtual memory as a buffer.

I see. I guess there is no "one solution for them all" in this case :)


lucabergamini commented on May 24, 2024

Hi @RocketFlash

DistributedDataParallel has never been tested with L5Kit, so I'm not surprised that it's not working. I'll try to take a look into it!


louis925 commented on May 24, 2024

I got the same error when running on Windows. I didn't use DistributedDataParallel; I just followed the example notebook with num_workers > 0.


lucabergamini commented on May 24, 2024

I got the same error when running on Windows

Same version of Python (3.8)?


louis925 commented on May 24, 2024

I got the same error when running on Windows

Same version of Python (3.8)?

Oh, I am using Python 3.7. Is that why?


louis925 commented on May 24, 2024

FYI, this is the kind of error that I got. I am not using DistributedDataParallel.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>
c:\users\louis\appdata\local\programs\python\python37\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
    289             return _SingleProcessDataLoaderIter(self)
    290         else:
--> 291             return _MultiProcessingDataLoaderIter(self)
    292 
    293     @property
c:\users\louis\appdata\local\programs\python\python37\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    735             #     before it starts, and __del__ tries to join but will get:
    736             #     AssertionError: can only join a started process.
--> 737             w.start()
    738             self._index_queues.append(index_queue)
    739             self._workers.append(w)
c:\users\louis\appdata\local\programs\python\python37\lib\multiprocessing\process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect
c:\users\louis\appdata\local\programs\python\python37\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):
c:\users\louis\appdata\local\programs\python\python37\lib\multiprocessing\context.py in _Popen(process_obj)
    320         def _Popen(process_obj):
    321             from .popen_spawn_win32 import Popen
--> 322             return Popen(process_obj)
    323 
    324     class SpawnContext(BaseContext):
c:\users\louis\appdata\local\programs\python\python37\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     87             try:
     88                 reduction.dump(prep_data, to_child)
---> 89                 reduction.dump(process_obj, to_child)
     90             finally:
     91                 set_spawning_popen(None)
c:\users\louis\appdata\local\programs\python\python37\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #
TypeError: can't pickle google.protobuf.pyext._message.RepeatedCompositeContainer objects

Do we know where we use google.protobuf? Is it specific to l5kit or zarr, or is it used by pytorch?


louis925 commented on May 24, 2024

FYI, I have a workaround: wrap the Dataset in another class that only constructs the dataset and the rasterizer once the object has been loaded into the worker processes. My code for doing that, in a mydataset.py script:

from torch.utils.data import Dataset, get_worker_info

class MyTrainDataset(Dataset):
    """Lazy wrapper: the real AgentDataset (and its rasterizer) is only built
    inside each worker, so no protobuf-backed object ever needs to be pickled."""
    def __init__(self, cfg, dm):
        self.cfg = cfg
        self.dm = dm
    def initialize(self, worker_id):
        print('initialize called with worker_id', worker_id)
        # import l5kit lazily so the wrapper itself stays trivially picklable
        from l5kit.data import ChunkedDataset
        from l5kit.dataset import AgentDataset  # , EgoDataset
        from l5kit.rasterization import build_rasterizer
        rasterizer = build_rasterizer(self.cfg, self.dm)
        train_cfg = self.cfg["train_data_loader"]
        train_zarr = ChunkedDataset(self.dm.require(train_cfg["key"])).open(cached=False)  # try turning off the cache
        self.dataset = AgentDataset(self.cfg, train_zarr, rasterizer)
    def __len__(self):
        # NOTE: You have to figure out the actual length beforehand, because once the
        # rasterizer and/or AgentDataset has been constructed you can no longer pickle
        # this object, so the size cannot be computed from the real dataset here.
        # The DataLoader still needs __len__ to drive its sampler.
        # (A one-off helper for obtaining this number is sketched after this snippet.)
        return 22496709
    def __getitem__(self, index):
        return self.dataset[index]

def my_dataset_worker_init_func(worker_id):
    # runs once inside each worker process: build the real dataset there
    worker_info = get_worker_info()
    worker_info.dataset.initialize(worker_id)
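For the hard-coded length in __len__, one option is to construct the real dataset once in a throwaway script, record its length, and paste the number into the wrapper. A one-off helper along those lines (my addition; it reuses the same cfg/dm objects and l5kit calls as the workaround above):

def compute_train_dataset_length(cfg, dm):
    # run once, outside of training, to obtain the value returned by __len__ above
    from l5kit.data import ChunkedDataset
    from l5kit.dataset import AgentDataset
    from l5kit.rasterization import build_rasterizer
    rasterizer = build_rasterizer(cfg, dm)
    train_cfg = cfg["train_data_loader"]
    train_zarr = ChunkedDataset(dm.require(train_cfg["key"])).open(cached=False)
    return len(AgentDataset(cfg, train_zarr, rasterizer))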

Then you can load it in the training Jupyter notebook as:

from torch.utils.data import DataLoader
from mydataset import MyTrainDataset, my_dataset_worker_init_func

train_dataset = MyTrainDataset(cfg, dm)
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    batch_size=16,
    num_workers=2,
    persistent_workers=True,  # keep workers alive so initialize() runs only once per worker
    worker_init_fn=my_dataset_worker_init_func,
)
tr_it = iter(train_dataloader)

The downside is that each worker process will load its own copy of the entire pytorch library, which somehow takes around 4 GB of commit memory. During training, each worker can take up to 7 GB of commit memory. So you need to turn on virtual memory as a buffer.
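If you want to check that footprint yourself, you can log each worker's memory right after initialization. A minimal sketch using psutil (psutil and the printed resident-set figure are my additions; commit/private memory on Windows will be somewhat higher than RSS):

import os
import psutil  # third-party, assumed installed
from torch.utils.data import get_worker_info

def memory_logging_worker_init(worker_id):
    # same as my_dataset_worker_init_func, plus a report of this worker's memory
    worker_info = get_worker_info()
    worker_info.dataset.initialize(worker_id)
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"worker {worker_id}: ~{rss_gb:.1f} GB resident after init")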

