Comments (6)
Did you change the node id for the second command?
from dora.
@adefossez Yes, sure. I did some debugging and noticed that the time to complete each step increased greatly. Training does make progress, but very slowly, whereas on a single machine it runs quickly. I expected that adding N nodes would reduce the time per epoch, but it ended up increasing greatly instead.
It will depend on the batch size, and on whether you are specifying a per-GPU or an overall batch size (I recommend the latter, so that the meaning of the XP doesn't change with the number of GPUs used). What codebase is this for?
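To make the per-GPU versus overall distinction concrete, here is a minimal sketch of deriving the per-GPU batch size from an overall one. The function name and the numbers are hypothetical and are not part of dora's API; in a real run the world size would come from the distributed setup.

```python
def per_gpu_batch_size(overall_batch_size: int, world_size: int) -> int:
    """Split an overall batch size evenly across `world_size` processes.

    Hypothetical helper for illustration only, not a dora function.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if overall_batch_size % world_size != 0:
        raise ValueError(
            f"overall batch size {overall_batch_size} is not divisible by "
            f"world size {world_size}"
        )
    return overall_batch_size // world_size


if __name__ == "__main__":
    # Assumed example: an overall batch of 64 spread over 1, 2 and 8 GPUs.
    for world_size in (1, 2, 8):
        print(world_size, per_gpu_batch_size(64, world_size))
```

With an overall batch size, adding GPUs shrinks the work each process does per step while the number of steps per epoch stays the same. With a fixed per-GPU batch size, the effective batch grows with the number of GPUs, so step times and epoch lengths are not directly comparable between runs.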
It will also depend on whether you have a good interconnect between nodes!
Check your network config: you might have a firewall, security group, etc. blocking access to the ports torchrun is using. If you get this working, please update us. I'm currently going through the same headache...
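One rough way to test this from a worker node is to try opening a TCP connection to the master node while the master process is already running. The sketch below uses only the Python standard library; the host name is a placeholder, and port 29500 is an assumption (a commonly used default for PyTorch distributed), so substitute whatever address and port your launch command actually uses.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Placeholder values: replace with your master node address and port.
    master_host = "master-node.example"
    master_port = 29500  # assumed port, check your own launch command
    print(can_connect(master_host, master_port))
```

A `False` result only means the connection failed. It does not say whether a firewall dropped it or nothing was listening on that port at the time, so run it while the master process is up.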
@Tristan-Kosciuch Yes, everything works for me. To be honest, I don't remember what the problem was. I reinstalled the environment, updated all the libraries and the problem was solved.