Comments (6)
The word "input" is a bit ambiguous because there are actually two inputs:
- Input to the CNN encoder (semantic encoder): the input here is the clean, original image, with no noise.
- Input to the UNet (diffusion model): the input here is the noisy image, depending on t.
There is only one loss function, and it's for the UNet (the CNN is learned end-to-end as part of the UNet):
loss = ||UNET(x_t) - x_clean|| or ||UNET(x_t) - noise||
In the paper, we used the latter version, but both should be equally valid.
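The single training loss described above can be sketched as follows. This is a minimal illustration, not diffae's actual code: `SemanticEncoder`, `UNet`, and `training_loss` are hypothetical stand-ins, and the epsilon-prediction (latter) version of the loss is used.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """CNN encoder: sees the CLEAN image x0 (toy architecture)."""
    def __init__(self, z_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, z_dim),
        )
    def forward(self, x0):
        return self.net(x0)

class UNet(nn.Module):
    """Toy denoiser: sees the NOISY image x_t, conditioned on (t, z_sem)."""
    def __init__(self, z_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)
        self.cond = nn.Linear(z_dim + 1, 3)
    def forward(self, x_t, t, z_sem):
        h = self.cond(torch.cat([z_sem, t[:, None].float()], dim=1))
        return self.conv(x_t) + h[:, :, None, None]

def training_loss(encoder, unet, x0, alphas_bar):
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_bar), (b,))
    eps = torch.randn_like(x0)
    a = alphas_bar[t][:, None, None, None]
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps   # noisy input, depends on t
    z_sem = encoder(x0)                          # semantic code from CLEAN x0
    eps_pred = unet(x_t, t, z_sem)
    return ((eps_pred - eps) ** 2).mean()        # ||UNET(x_t) - noise||^2
```

Because `z_sem` is computed from the clean image inside the same loss, gradients flow back into the encoder, which is what "learned end-to-end as part of the UNet" means here.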
from diffae.
Many thanks for your reply!
Then I have a small question. Does the diffusion model itself have the ability to reconstruct? Have you ever tried training an unconditional DDIM to reconstruct the images? I've used img2img in the Stable Diffusion webui to try to reconstruct the original image, but failed. What is your intuition here: if I use deterministic DDIM to reverse an image to the latent x_T (as your paper mentioned) and then feed this x_T to an unconditional DDIM, would this be a feasible way to reconstruct the original image?
from diffae.
Moreover, another question: why isn't a reconstruction loss needed to constrain the output of your DDIM decoder and ensure reconstruction of the original image? Why is the denoising loss alone sufficient? I can't follow the logic.
Much appreciation for your patience!
from diffae.
Another question: I noticed in your code that in your DDIM decoder, the condition z_sem is only used in the ResBlocks, together with the timestep. In the attention modules, you only use self-attention instead of cross-attention. Can I ask the reason for this?
from diffae.
Diffusion autoencoders or even plain DDIM can definitely reconstruct the image.
I have heard (though not from first-hand experience) that classifier-free guidance models (like Stable Diffusion) have problems with inversion, but I don't have any further intuition beyond that.
"Why isn't a reconstruction loss needed to constrain the output of your DDIM decoder and ensure reconstruction of the original image?"
This is a property of DDIM itself. An intuition is that DDIM is an ODE, and an ODE can be thought of as an invertible mapping (so you have a way to go from the input to the latent and back, hence reconstruction). I refer you to the DDIM paper (Song et al., 2020) for more detail.
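The invertibility intuition can be illustrated numerically. Below is a toy sketch of one deterministic (eta = 0) DDIM step and its algebraic inverse; with the same predicted noise `eps` (here just a fixed tensor standing in for the trained noise predictor), running the step backwards recovers the input exactly, up to floating-point error.

```python
import torch

def ddim_step(x_t, eps, a_t, a_prev):
    """Deterministic DDIM update from timestep t to the previous timestep.
    a_t, a_prev are the cumulative alpha-bar values at the two timesteps."""
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps

def ddim_invert(x_prev, eps, a_t, a_prev):
    """Algebraic inverse of ddim_step: recovers x_t from x_{t-1}."""
    x0_pred = (x_prev - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
    return a_t.sqrt() * x0_pred + (1 - a_t).sqrt() * eps

a_t, a_prev = torch.tensor(0.5), torch.tensor(0.8)
x_t = torch.randn(3)
eps = torch.randn(3)          # stand-in for eps_model(x_t, t)
x_prev = ddim_step(x_t, eps, a_t, a_prev)
x_back = ddim_invert(x_prev, eps, a_t, a_prev)
# x_back matches x_t up to floating-point error
```

In practice the noise predictor depends on its input, so real DDIM inversion is only approximately exact at finite step counts, but each step is still an invertible map in this sense.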
"Why is z_sem only used in the ResBlocks, together with the timestep, while the attention modules use self-attention instead of cross-attention?"
You can definitely add conditioning signals to the attention modules as well. Is it worth it or not? It's hard to tell without experiments. But the high-level motivation is that attention "works with the inputs" rather than adding something new to them. Convolution, in this case, is a better choice for a layer that "adds something new".
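A minimal sketch of this conditioning pattern: the timestep/z_sem embedding modulates the ResBlock via a learned scale and shift, while the attention layer stays pure self-attention with no external condition. Module names and sizes are illustrative, not diffae's actual code.

```python
import torch
import torch.nn as nn

class CondResBlock(nn.Module):
    """ResBlock conditioned on an embedding (e.g. timestep emb + z_sem)."""
    def __init__(self, ch, emb_dim):
        super().__init__()
        self.norm = nn.GroupNorm(8, ch)
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.to_scale_shift = nn.Linear(emb_dim, 2 * ch)

    def forward(self, h, emb):
        scale, shift = self.to_scale_shift(emb).chunk(2, dim=1)
        # the condition "adds something new" to the features
        h_norm = (self.norm(h) * (1 + scale[:, :, None, None])
                  + shift[:, :, None, None])
        return h + self.conv(torch.relu(h_norm))

class SelfAttn2d(nn.Module):
    """Plain self-attention over spatial positions: no conditioning input."""
    def __init__(self, ch, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)

    def forward(self, h):
        b, c, H, W = h.shape
        seq = h.flatten(2).transpose(1, 2)     # (b, H*W, c)
        out, _ = self.attn(seq, seq, seq)      # query = key = value = h
        return h + out.transpose(1, 2).reshape(b, c, H, W)
```

Cross-attention would replace the key/value in `SelfAttn2d` with projections of z_sem; as the answer notes, whether that helps over scale/shift injection in the ResBlocks would need experiments.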
from diffae.
Thank you so much for your reply!
from diffae.