Comments (4)
It looks like either the PyTorch version you are using doesn't match, or the checkpoint file is corrupted.
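One way to rule out a corrupted checkpoint is to hash the file and compare the digest with a fresh download of the same file. This is a minimal sketch using only the Python standard library; the function name `file_sha256` and the path in the usage note are hypothetical, not part of diffae.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read fixed-size chunks until an empty bytes object signals end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_sha256("checkpoints/last.ckpt")` (a hypothetical path) on the local copy and on a re-downloaded copy should give identical digests; if they differ, one of the two copies is damaged or incomplete.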
from diffae.
Thanks for your help. I have solved the earlier problem. However, when I trained the latent DPM, I ran into new errors. The first was: RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout.
The other errors were:
[E ProcessGroupNCCL.cpp:474] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1132, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800188 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. | 0/157 [00:00<?, ?it/s]
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1132, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800188 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1132, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800188 milliseconds before timing out.
I ran the code on two RTX A6000s. Have you seen these errors before? Any advice would be much appreciated.
from diffae.
The problem you encountered is caused by parallel training, which is not needed for the latent DPM; it wasn't designed to be trained in parallel in the first place. Please try training the latent DPM on a single GPU. It won't take that long!
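A generic way to force a single-GPU run, independent of how the training script selects devices, is to hide the other GPUs from the process with the `CUDA_VISIBLE_DEVICES` environment variable. This is a minimal sketch, not diffae's own mechanism; the helper name `restrict_to_single_gpu` is hypothetical, and the variable must be set before CUDA is initialized (in practice, before importing `torch`).

```python
import os


def restrict_to_single_gpu(gpu_index=0):
    """Expose only one physical GPU to this process.

    Call this before any library initializes CUDA; the chosen GPU then
    appears to the process as device 0.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


restrict_to_single_gpu(0)
# Import torch and start the latent DPM training only after this point.
```

The same effect can be had from the shell by prefixing the training command with `CUDA_VISIBLE_DEVICES=0`. With only one device visible, no NCCL communicator between ranks needs to be set up.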
from diffae.
Thanks for your help. The problem has been solved.
from diffae.