Comments (5)
@rnyak Does the library currently support training on a multi-GPU configuration? Even though I have multiple GPUs, training runs on only one of them and is not parallelized across both. Is there a way to add this to our trainer?
from transformers4rec.
Gabriel and I did a debugging session and found that the problem with distributed training occurs between 50% and 70% of the way through training on the first parquet file (day) when these args are set in our dataloader:
NVTDataLoader(
    ...,
    global_size=global_size,
    global_rank=global_rank,
)
When these arguments are disabled, we can train on two GPUs (but both use the same dataset). So the issue is most likely in our NVT PyT dataloader.
We can reproduce it quickly with ecom_small.
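One possible explanation for a stall partway through a file, offered as a hypothesis and not something confirmed in this thread: when `global_size` and `global_rank` give each worker its own shard, the shards may yield different numbers of batches, and the worker with fewer batches finishes while the other is still waiting on gradient synchronization. The sketch below only illustrates that arithmetic. The helper names and the even-split rule are assumptions for illustration and are not how `NVTDataLoader` actually partitions data.

```python
import math


def shard_sizes(num_rows, global_size):
    """Split num_rows as evenly as possible across global_size workers.

    Hypothetical partitioning rule, used only for illustration.
    """
    base, extra = divmod(num_rows, global_size)
    # The first `extra` ranks receive one additional row.
    return [base + (1 if rank < extra else 0) for rank in range(global_size)]


def batches_per_rank(num_rows, global_size, batch_size, drop_last=False):
    """Return how many batches each rank would produce from its shard."""
    sizes = shard_sizes(num_rows, global_size)
    if drop_last:
        # Incomplete final batches are discarded.
        return [size // batch_size for size in sizes]
    # Incomplete final batches are kept as a smaller last batch.
    return [math.ceil(size / batch_size) for size in sizes]


# 1001 rows over 2 workers gives shards of 501 and 500 rows.
uneven = batches_per_rank(num_rows=1001, global_size=2, batch_size=100)
even = batches_per_rank(num_rows=1001, global_size=2, batch_size=100, drop_last=True)
print(uneven, even)
```

If the per-rank counts differ, as in the first call, one rank runs an extra step that the other never joins. Dropping the incomplete last batch is one way to equalize them in this toy case.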
Gabriel, Julio and I did another debugging session, and it looks like one of our workers is not waiting for the other, which creates a bottleneck. Options/guidance to explore:
- [] torch.distributed.barrier()
- [] https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html
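The barrier idea in the first option can be sketched without GPUs. The example below uses Python's standard-library `threading.Barrier` as a stand-in for `torch.distributed.barrier()`. It is a toy analogy, not the Transformers4Rec trainer, and the function name and the "agree on the shortest shard" rule are assumptions made for illustration.

```python
import threading


def run_workers(steps_per_worker):
    """Run one thread per worker and return the steps each one completed.

    steps_per_worker[rank] is how many batches that rank has available.
    """
    world_size = len(steps_per_worker)
    barrier = threading.Barrier(world_size)
    # Assumption: all workers agree up front to stop at the shortest shard,
    # so no worker waits at the barrier for a peer that has already finished.
    agreed_steps = min(steps_per_worker)
    completed = [0] * world_size

    def worker(rank):
        for _ in range(agreed_steps):
            # The forward/backward pass on this rank's batch would go here.
            # No rank starts the next step until every rank has arrived.
            barrier.wait(timeout=5)
            completed[rank] += 1

    threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(world_size)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return completed


result = run_workers([6, 5])
print(result)
```

In a real PyTorch job, the agreement on the step count would itself need a collective call, for example `torch.distributed.all_reduce` with `ReduceOp.MIN`. `DistributedDataParallel` also provides a `join()` context manager intended for uneven inputs across ranks.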
@Ahanmr we are currently working on supporting multi-GPU training.
@rnyak after preparing a custom dataset, I have 1321 folders, one for each day.
I'm training as described in the YooChoose dataset example. I have a few questions about that:
- I'm currently training with train batch size = 32 and eval batch size = 16, since I have 16 GB of GPU memory. I'm not sure what values would suit my resources better, so any suggestions would be helpful.
- After about 500 days of training, the loss is 0. I'm not sure whether this is overfitting or expected. Should I stop, or is there a better way to do the training?
- Also, the evaluation scores are very low for each day, so how can one determine the final evaluation score?
Any help would be great.
Thanks!
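On the last question, one simple way to get a single number from per-day evaluation is to average each metric over all evaluated days. This is a minimal sketch under that assumption, not a description of what the library reports. The metric names and values are hypothetical placeholders.

```python
def average_over_days(daily_metrics):
    """Return the unweighted mean of each metric across evaluation days.

    daily_metrics is a list of dicts, one per day, all with the same keys.
    """
    if not daily_metrics:
        raise ValueError("daily_metrics must contain at least one day")
    names = daily_metrics[0].keys()
    num_days = len(daily_metrics)
    return {
        name: sum(day[name] for day in daily_metrics) / num_days
        for name in names
    }


# Hypothetical per-day results, for illustration only.
example_days = [
    {"ndcg@10": 0.10, "recall@10": 0.20},
    {"ndcg@10": 0.20, "recall@10": 0.40},
]
summary = average_over_days(example_days)
print(summary)
```

If days differ a lot in size, weighting each day by its number of evaluated sessions may be a fairer summary than the plain mean used here.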