Comments (11)
Question 1: What is the purpose of this project? What significance does it have?
Answer: This project is a basic reimplementation of DDPM (Denoising Diffusion Probabilistic Models) and DDIM (Denoising Diffusion Implicit Models), two classic deep learning algorithms for image generation, and is meant as an entry point to the field. It gives an intuitive view of how the algorithms work underneath, and the code structure mirrors the structure of the papers, which makes it easier to learn from.
from integrated-design-diffusion-model.
Question 2: How should I choose appropriate parameters during training?
Answer: In the tools/train.py file, you can customize the values defined in argparse. For the meaning of each training parameter, refer to the Parameter Explanation.
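As a rough illustration of what customizing argparse values looks like, here is a minimal sketch. The flag names are the ones mentioned in this FAQ; the defaults, the types, and the str2bool helper are assumptions made for the example and are not taken from tools/train.py.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'True'/'False' style strings, since bool('False') is True in Python."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Flag names come from this FAQ; every default below is a placeholder.
    parser = argparse.ArgumentParser(description="hypothetical training options")
    parser.add_argument("--dataset_path", type=str, default="/path/dataset/conditional")
    parser.add_argument("--num_classes", type=int, default=10)
    parser.add_argument("--sample", type=str, default="ddpm")
    parser.add_argument("--resume", type=str2bool, default=False)
    parser.add_argument("--start_epoch", type=int, default=None)
    return parser


# Passing a list to parse_args mimics what would be typed on the command line.
args = build_parser().parse_args(
    ["--sample", "ddim", "--num_classes", "2", "--resume", "True"]
)
print(args.sample, args.num_classes, args.resume)
```

Changing a default inside add_argument and passing the flag on the command line have the same effect; the command-line value wins when both are present.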
Question 3: How can I accelerate image generation during training?
Answer: Set --sample to ddim.
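For context on why the DDIM sampler tends to be faster: it can run the reverse process over a short subsequence of the training timesteps instead of visiting every one. The sketch below only shows how such a subsequence might be chosen. The function name and the step counts (1000 training steps, 100 sampling steps) are illustrative assumptions, not this project's defaults.

```python
def ddim_timesteps(num_train_steps: int, num_sample_steps: int) -> list[int]:
    """Pick an evenly spaced subsequence of timesteps, noisiest step first."""
    if not 0 < num_sample_steps <= num_train_steps:
        raise ValueError("num_sample_steps must be between 1 and num_train_steps")
    stride = num_train_steps // num_sample_steps
    steps = list(range(0, num_train_steps, stride))[:num_sample_steps]
    return steps[::-1]  # sampling walks from high noise down to low noise


# Assumed step counts, for illustration only.
schedule = ddim_timesteps(num_train_steps=1000, num_sample_steps=100)
print(len(schedule), schedule[:3], schedule[-3:])
```

With these assumed numbers the sampler would call the network 100 times per image rather than 1000.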
Question 4: Why am I encountering a flood of CUDA or cuDNN errors during training, such as THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp or RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR?
Answer: Check whether the --num_classes value in argparse matches the number of classes in your current dataset. A major cause of this issue is a --num_classes value that is smaller than the number of classes in the dataset.
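One way to catch this mismatch before it turns into an opaque GPU error is to count the class folders up front. The sketch below assumes the conditional layout where each class has its own subfolder under the dataset path; the helper names are hypothetical and not part of the project.

```python
import tempfile
from pathlib import Path


def count_class_folders(dataset_path: str) -> int:
    """Count the immediate subfolders of dataset_path, assuming one per class."""
    return sum(1 for entry in Path(dataset_path).iterdir() if entry.is_dir())


def check_num_classes(dataset_path: str, num_classes: int) -> None:
    """Raise a readable error when --num_classes disagrees with the folders."""
    found = count_class_folders(dataset_path)
    if num_classes != found:
        raise ValueError(
            f"--num_classes is {num_classes}, but {found} class folders "
            f"were found in {dataset_path}"
        )


# Demo on a throwaway directory with two class folders.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("class0", "class1"):
        (Path(tmp) / name).mkdir()
    check_num_classes(tmp, 2)  # matches, so nothing is raised
    try:
        check_num_classes(tmp, 1)  # too small, the case described above
    except ValueError as error:
        print(error)
```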
Question 5: Why do the generated images contain noise?
Answer: A major cause of noise is that the model configuration used for generation does not match the one used during training. Check that --act, --num_classes, and --sample are set correctly. Also inspect the validation images saved in each round of training to verify that the model has converged.
The image and question are from #7.
Question 6: How should the dataset be divided? How do I set up conditional and unconditional training?
Answer: You can store the dataset anywhere on your computer; just set --dataset_path accordingly. For unconditional training, place all data in one folder. For example, if the image folder is /path/dataset/unconditional/images, store all images in the images folder and set --dataset_path to /path/dataset/unconditional. For conditional training, put images of the same class into the same folder. For instance, if the main directory is /path/dataset/conditional and you have two folders, class0 and class1, their paths are /path/dataset/conditional/class0 and /path/dataset/conditional/class1. Once the dataset is organized, conditional training also requires setting --num_classes to the number of classes. After that, the configuration is complete.
Refer to the diagram below for a detailed structure.
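The folder conventions described above can be sketched as a small indexing routine. It assumes that labels are assigned from the sorted subfolder names, the same convention torchvision's ImageFolder uses; whether this project loads data that way is an assumption, and index_dataset and IMAGE_SUFFIXES are hypothetical names.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions, for illustration only.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def index_dataset(dataset_path: str) -> list[tuple[str, int]]:
    """Return (image path, label) pairs; the label is the subfolder's sorted index."""
    root = Path(dataset_path)
    class_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    samples = []
    for label, class_dir in enumerate(class_dirs):
        for image in sorted(class_dir.iterdir()):
            if image.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((str(image), label))
    return samples


# Demo: build the conditional layout from the answer in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "conditional"
    for class_name, file_name in [("class0", "a.png"), ("class1", "b.png")]:
        (root / class_name).mkdir(parents=True)
        (root / class_name / file_name).touch()
    for path, label in index_dataset(str(root)):
        print(label, Path(path).name)
```

Under the same assumption, the unconditional layout has a single images subfolder, so every sample gets label 0 and the label is simply not used.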
Question 7: Training was interrupted unexpectedly. How can I resume it?
Answer: Don't worry, the trainer provides a resume feature, controlled by the --resume and --start_epoch parameters. To resume training on a single GPU, run python train.py --resume True, which resumes from ckpt_last.pt by default. To resume from a specific epoch, say epoch 50, run python train.py --resume True --start_epoch 50. The trainer then reads the weights saved for epoch 49 and starts training from epoch 50 (this requires --save_model_interval to be set to True). For distributed training, make sure all processes have been terminated before resuming; if any process is still alive, you will see an error saying the address is already in use.
Question 8: Each training epoch takes too long. How can I use a pretrained model?
Answer: Pretrained models are published with each major version Release, so keep an eye on the releases. To use one, download a model whose parameters (network, image_size, act, and so on) match yours to any local folder. Then run python train.py --pretrain True --pretrain_path /your/pretrain/model.pt to load the pretrained weights. Alternatively, edit the --pretrain and --pretrain_path values directly in train.py.
Question 9: Why does using a 32×32 model to generate 64×64 or 128×128 images produce distorted images with extra objects?
Answer: This is caused by the mismatch between the size the model was trained on and the size being generated. For images such as defect textures, where there are no distinct feature objects, generating a larger size directly usually does not cause these problems; the NRSD and NEU datasets are examples. For images that have a background and specific, distinctive features, such as Cifar10 or CelebA-HQ, use super-resolution or resizing to increase the size instead. If you really need large images and have enough GPU memory, train directly on large images.
Question 10: Why do I get a RuntimeError: Address already in use error when starting training?
Answer: This usually happens with distributed training. To resolve it, run htop or top in a console, look for a process whose name starts with mp and that is running at high usage, and terminate it with the kill command. Then use nvidia-smi to check that GPU memory usage has dropped back to 0.
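Before relaunching, it can help to confirm that the address is actually free again. The sketch below assumes the distributed processes rendezvous on a local TCP port; which port this project uses is not stated here, so the demo lets the operating system pick one. The is_port_free helper is a hypothetical name.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # typically "Address already in use"
    return True


# Demo: occupy a port the way a leftover training process would.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
holder.listen(1)
port = holder.getsockname()[1]
print("while held:", is_port_free(port))
holder.close()  # stands in for killing the stale process
print("after close:", is_port_free(port))
```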
Question 11: I get a ValueError: Imaginary component XXX error when calculating FID. How can I resolve it?
Answer: The error is caused by a scipy version that is too new. Downgrade scipy to version 1.11.1 to resolve it.
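To tell whether an environment is affected, the installed version can be compared against 1.11.1. The helpers below are a hypothetical sketch that works on plain version strings, so they run even where scipy is not installed; treating every version above 1.11.1 as affected is a simplification of the answer above.

```python
import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '1.11.1' (or '1.12.0rc1') into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_newer_than_pinned(installed: str, pinned: str = "1.11.1") -> bool:
    """Return True when the installed version is above the pinned one."""
    return version_tuple(installed) > version_tuple(pinned)


# In a real environment the string could come from
# importlib.metadata.version("scipy"); literals keep the demo self-contained.
for candidate in ("1.9.3", "1.11.1", "1.12.0"):
    print(candidate, is_newer_than_pinned(candidate))
```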