Benchmark the new PyT data loader with the REES46 ecommerce dataset, using multiple GP


Benchmark the new PyT data loader (with sparse tensors support) scalability with multi-GPU and larger datasets (transformers4rec, open issue)

nvidia-merlin commented on May 17, 2024

Benchmark the new PyT data loader (with sparse tensors support) scalability with multi-GPU and larger datasets

from transformers4rec.

Comments (5)

Ahanmr commented on May 17, 2024

@rnyak Does the library currently support training in a multi-GPU configuration? Even though I have multiple GPUs, training runs on only one of them and is not parallelized across both. Is there a way to add this to our trainer?


rnyak commented on May 17, 2024

Gabriel and I did a debugging session and found that the problem with distributed model training occurs between 50% and 70% of the way through training on the first parquet file (day) when these args are set in our dataloader:

NVTDataLoader(
    ...,  # other arguments omitted in the original comment
    global_size=global_size,
    global_rank=global_rank,
)

When these arguments are disabled, we can train on two GPUs (but both use the same dataset). So the issue is most likely caused by our NVT PyT dataloader.
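To make the role of these two arguments concrete, here is a minimal sketch of rank-based sharding. It assumes a simple round-robin split of partitions across ranks, which is not necessarily how the NVT PyT dataloader divides the data. The helper names (`shard_indices`, `batches_per_rank`) and the row counts are hypothetical. The point it illustrates is one possible cause of a mid-epoch stall, not a confirmed diagnosis: if the shards are uneven, the ranks end up with different numbers of batches.

```python
def shard_indices(num_items, global_size, global_rank):
    """Round-robin split (an assumption): rank r gets items r, r + size, r + 2*size, ..."""
    return list(range(global_rank, num_items, global_size))


def batches_per_rank(rows_per_partition, global_size, batch_size):
    """Count how many batches each rank would see under the split above."""
    counts = []
    for rank in range(global_size):
        idx = shard_indices(len(rows_per_partition), global_size, rank)
        rows = sum(rows_per_partition[i] for i in idx)
        counts.append(-(-rows // batch_size))  # ceiling division
    return counts


# Hypothetical parquet file with three partitions of unequal size.
rows_per_partition = [1000, 1000, 400]
counts = batches_per_rank(rows_per_partition, global_size=2, batch_size=100)
print(counts)  # rank 0 gets partitions 0 and 2, rank 1 gets partition 1
```

In this made-up case rank 0 would run 14 steps and rank 1 only 10. With synchronized gradient updates, the rank that still has batches would then wait for a peer that has already finished its shard.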

We can reproduce it quickly with ecom_small


rnyak commented on May 17, 2024

Gabriel, Julio and I did another debugging session, and it looks like one of our workers is not waiting for the other worker, which creates a bottleneck. The options/guidance to explore:

- [ ] torch.distributed.barrier()
- [ ] https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html
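The two options above can be sketched together as follows. This is a minimal, CPU-only illustration that assumes the `gloo` backend and a single process (`world_size=1`) so that it runs without a launcher. The toy `Linear` model and the `run` function are hypothetical and are not the Transformers4Rec trainer. In a real multi-GPU job each process would call `run` with its own rank.

```python
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank, world_size, init_file):
    # File-based rendezvous avoids picking a free TCP port for this demo.
    dist.init_process_group(
        "gloo", init_method=f"file://{init_file}", rank=rank, world_size=world_size
    )
    try:
        torch.manual_seed(0)
        model = DDP(torch.nn.Linear(4, 1))  # toy model standing in for the real one
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        for _ in range(3):
            opt.zero_grad()
            loss = model(torch.randn(8, 4)).pow(2).mean()
            loss.backward()  # DDP all-reduces gradients across ranks here
            opt.step()
        # Every rank blocks here until all ranks have reached this point.
        dist.barrier()
        return loss.item()
    finally:
        dist.destroy_process_group()


init_file = tempfile.NamedTemporaryFile(delete=False).name
final_loss = run(rank=0, world_size=1, init_file=init_file)
```

Note that `backward()` under DDP is itself a synchronization point, so a rank that runs out of batches early can leave the others waiting. The DDP documentation linked above also describes a `join()` context manager intended for uneven inputs across ranks, which may be worth exploring alongside `barrier()`.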


rnyak commented on May 17, 2024

@Ahanmr we are currently working on supporting multi-GPU training.


alan-ai-learner commented on May 17, 2024

@rnyak After preparing a custom dataset, I have 1321 folders, one for each day.
So I'm training on it as shown in the YooChoose dataset example. I have a few questions about that...

Currently I'm training with train batch size = 32 and eval batch size = 16, as I have 16 GB of GPU memory. I'm not sure what numbers would be better for my resources, so any suggestions would be helpful.
After 500 days of training the loss is 0. I'm not sure whether it is overfitting or whether this is expected. Do I need to stop, or is there a better way to do the training?
Also, the evaluation scores are very low for each day, so how does one get the final evaluation score?

Any help would be great.
Thanks!
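On the last question, one common convention for day-by-day (incremental) evaluation is to average each metric over all evaluated days. The sketch below assumes that convention. The metric names and numbers are made-up placeholders, and `average_over_days` is a hypothetical helper, not a Transformers4Rec function.

```python
from statistics import mean

# Placeholder per-day results; real values would come from each day's evaluation.
per_day_metrics = [
    {"ndcg@20": 0.10, "recall@20": 0.20},
    {"ndcg@20": 0.12, "recall@20": 0.25},
    {"ndcg@20": 0.14, "recall@20": 0.30},
]


def average_over_days(results):
    """Return the mean of every metric across the list of per-day dicts."""
    keys = results[0].keys()
    return {k: mean(r[k] for r in results) for k in keys}


final_scores = average_over_days(per_day_metrics)
print(final_scores)
```

A single averaged number per metric makes runs comparable, while the per-day values remain useful for spotting days where the model degrades.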
