
Comments (8)

lucabergamini commented on May 24, 2024

Do we know where we use google.protobuf? Is it specific to l5kit or zarr, or is it used by pytorch?

It's specific to l5kit; we use it for the semantic map.
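For context, the semantic map in l5kit is parsed from a protobuf file, and the semantic rasterizer ends up holding protobuf message objects, which the spawn-based pickler later chokes on. A rough sketch of what that loading looks like (this assumes the generated module l5kit.data.proto.road_network_pb2 and its MapFragment message; the helper function name is mine, not l5kit's):

from l5kit.data.proto.road_network_pb2 import MapFragment  # assumed protobuf-generated module

def load_semantic_map(protobuf_map_path):
    # Parse the semantic map protobuf. Its repeated fields are google.protobuf
    # containers, which cannot be pickled when workers are started with "spawn".
    map_fragment = MapFragment()
    with open(protobuf_map_path, "rb") as f:
        map_fragment.ParseFromString(f.read())
    return map_fragment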


lucabergamini commented on May 24, 2024

The downside is that each worker process will load its own copy of the entire pytorch library, which somehow takes around 4 GB of commit memory. During training, each worker can take up to 7 GB of commit memory. So you need to turn on virtual memory as a buffer.

I see. I guess there is no "one solution for them all" in this case :)


lucabergamini commented on May 24, 2024

Hi @RocketFlash

DistributedDataParallel has never been tested with L5Kit, so I'm not surprised that it's not working. I'll try to take a look into it!


louis925 commented on May 24, 2024

I got the same error when running on Windows. I didn't use DistributedDataParallel; I just followed the example notebook with num_workers > 0.


lucabergamini commented on May 24, 2024

I got the same error when running on Windows

Same version of Python (3.8)?


louis925 commented on May 24, 2024

I got the same error when running on Windows

Same version of Python (3.8)?

Oh, I am using Python 3.7. Is that why?


louis925 commented on May 24, 2024

FYI, this is the kind of error that I got. I am not using DistributedDataParallel.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>
c:\users\louis\appdata\local\programs\python\python37\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
    289             return _SingleProcessDataLoaderIter(self)
    290         else:
--> 291             return _MultiProcessingDataLoaderIter(self)
    292 
    293     @property
c:\users\louis\appdata\local\programs\python\python37\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    735             #     before it starts, and __del__ tries to join but will get:
    736             #     AssertionError: can only join a started process.
--> 737             w.start()
    738             self._index_queues.append(index_queue)
    739             self._workers.append(w)
c:\users\louis\appdata\local\programs\python\python37\lib\multiprocessing\process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect
c:\users\louis\appdata\local\programs\python\python37\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):
c:\users\louis\appdata\local\programs\python\python37\lib\multiprocessing\context.py in _Popen(process_obj)
    320         def _Popen(process_obj):
    321             from .popen_spawn_win32 import Popen
--> 322             return Popen(process_obj)
    323 
    324     class SpawnContext(BaseContext):
c:\users\louis\appdata\local\programs\python\python37\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     87             try:
     88                 reduction.dump(prep_data, to_child)
---> 89                 reduction.dump(process_obj, to_child)
     90             finally:
     91                 set_spawning_popen(None)
c:\users\louis\appdata\local\programs\python\python37\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #
TypeError: can't pickle google.protobuf.pyext._message.RepeatedCompositeContainer objects

Do we know where we use google.protobuf? Is it specific to l5kit or zarr, or is it used by pytorch?


louis925 commented on May 24, 2024

FYI, I have a workaround: wrap the Dataset in another class that only constructs the dataset and the rasterizer once the object has been loaded into the worker processes. My code for doing that, in a mydataset.py script:

from torch.utils.data import Dataset, get_worker_info

class MyTrainDataset(Dataset):
    """Lazy wrapper: the real AgentDataset (and its rasterizer) is only built
    inside each worker, so no protobuf-backed object ever needs to be pickled."""
    def __init__(self, cfg, dm):
        self.cfg = cfg
        self.dm = dm
    def initialize(self, worker_id):
        print('initialize called with worker_id', worker_id)
        # import l5kit lazily so the wrapper itself stays trivially picklable
        from l5kit.data import ChunkedDataset
        from l5kit.dataset import AgentDataset  # , EgoDataset
        from l5kit.rasterization import build_rasterizer
        rasterizer = build_rasterizer(self.cfg, self.dm)
        train_cfg = self.cfg["train_data_loader"]
        train_zarr = ChunkedDataset(self.dm.require(train_cfg["key"])).open(cached=False)  # try turning off the cache
        self.dataset = AgentDataset(self.cfg, train_zarr, rasterizer)
    def __len__(self):
        # NOTE: You have to figure out the actual length beforehand, because once the
        # rasterizer and/or AgentDataset has been constructed you can no longer pickle
        # this object, so the size cannot be computed from the real dataset here.
        # The DataLoader still needs __len__ to drive its sampler.
        # (A one-off helper for obtaining this number is sketched after this snippet.)
        return 22496709
    def __getitem__(self, index):
        return self.dataset[index]

def my_dataset_worker_init_func(worker_id):
    # runs once inside each worker process: build the real dataset there
    worker_info = get_worker_info()
    worker_info.dataset.initialize(worker_id)
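For the hard-coded length in __len__, one option is to construct the real dataset once in a throwaway script, record its length, and paste the number into the wrapper. A one-off helper along those lines (my addition; it reuses the same cfg/dm objects and l5kit calls as the workaround above):

def compute_train_dataset_length(cfg, dm):
    # run once, outside of training, to obtain the value returned by __len__ above
    from l5kit.data import ChunkedDataset
    from l5kit.dataset import AgentDataset
    from l5kit.rasterization import build_rasterizer
    rasterizer = build_rasterizer(cfg, dm)
    train_cfg = cfg["train_data_loader"]
    train_zarr = ChunkedDataset(dm.require(train_cfg["key"])).open(cached=False)
    return len(AgentDataset(cfg, train_zarr, rasterizer))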

Then you can load it in the training Jupyter notebook as:

from torch.utils.data import DataLoader
from mydataset import MyTrainDataset, my_dataset_worker_init_func

train_dataset = MyTrainDataset(cfg, dm)
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    batch_size=16,
    num_workers=2,
    persistent_workers=True,  # keep workers alive so initialize() runs only once per worker
    worker_init_fn=my_dataset_worker_init_func,
)
tr_it = iter(train_dataloader)

The downside is that each worker process will load its own copy of the entire pytorch library, which somehow takes around 4 GB of commit memory. During training, each worker can take up to 7 GB of commit memory. So you need to turn on virtual memory as a buffer.
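If you want to check that footprint yourself, you can log each worker's memory right after initialization. A minimal sketch using psutil (psutil and the printed resident-set figure are my additions; commit/private memory on Windows will be somewhat higher than RSS):

import os
import psutil  # third-party, assumed installed
from torch.utils.data import get_worker_info

def memory_logging_worker_init(worker_id):
    # same as my_dataset_worker_init_func, plus a report of this worker's memory
    worker_info = get_worker_info()
    worker_info.dataset.initialize(worker_id)
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"worker {worker_id}: ~{rss_gb:.1f} GB resident after init")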

