
Comments (9)

phizaz commented on August 11, 2024

You mean you want to apply DiffAE not to RGB images but to a matrix of image features, e.g. VQ-VAE-like features? And what you expect to get from this is the meaningful z_sem that DiffAE may provide?
If so, it seems possible, and you don't need a ground-truth z_sem for it. You just need to train DiffAE on top of that image feature space instead of training it on the RGB space as usual. In this case, DiffAE learns to reconstruct the image features, i.e. 64x64x256, and at the same time learns to come up with a useful z_sem.
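To make this concrete, here is a minimal sketch of one training step on top of the feature space. The names are hypothetical stand-ins: `feature_encoder` for the frozen pretrained feature extractor, `semantic_encoder` for DiffAE's encoder, and `denoiser` for its conditional UNet; the loss is the usual simple DDPM objective (Eq. (6) in the paper).

```python
import torch
import torch.nn.functional as F

def diffae_feature_step(feature_encoder, semantic_encoder, denoiser,
                        x_rgb, alphas_cumprod):
    with torch.no_grad():                          # the feature space itself stays fixed
        feat = feature_encoder(x_rgb)              # (B, 256, 64, 64) replaces the RGB input
    z_sem = semantic_encoder(feat)                 # (B, 512) semantic code z_sem
    t = torch.randint(0, len(alphas_cumprod), (feat.size(0),), device=feat.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(feat)
    feat_t = a.sqrt() * feat + (1 - a).sqrt() * eps    # forward diffusion q(x_t | x_0)
    eps_pred = denoiser(feat_t, t, z_sem)              # conditional noise prediction
    return F.mse_loss(eps_pred, eps)                   # simple DDPM loss
```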


mdv3101 commented on August 11, 2024

Hi @phizaz,
Yes, I am looking for something similar. I need to reconstruct the image using the diffusion model. The conditioning, i.e. z_sem, should come from the image feature space, while the DDIM should work on RGB space.
I have a few doubts:

  • Do I have to train only the Diffusion AutoEncoder, or the DDIM as well?
  • If we train only the Diffusion AutoEncoder, will the z_sem it generates be compatible with a pre-trained DDIM?
  • What about losses like LPIPS if we train in feature space instead of RGB?
  • As per my understanding, Diffusion AutoEncoder training uses an autoencoder as well as a diffusion model. If we train a model using the 'ffhq128_autoenc_130M' config file, it will use both the autoencoder and the diffusion model. Am I right?


phizaz commented on August 11, 2024

A few terms need to be made clear first.

  1. You already have an autoencoder that provides the feature space on which everything else will be built.
  2. A diffusion autoencoder is itself a kind of DDIM, so it is not quite right to speak of DiffAE and DDIM separately. In any case, I don't think you need another DDIM besides the DiffAE.
  3. You mentioned a "pretrained DDIM", which I'm not sure what you mean by.
  4. The word "autoencoder" in DiffAE is definitely NOT the same as in 1). You need to be careful with the terminology here.

My mental picture of how this should look is as follows:

  1. You have a pretrained autoencoder that provides the image feature space.
  2. You train DiffAE on that image feature space with some loss function; I don't think you need LPIPS.

I don't think you need any more than these two components; a sketch of the resulting pipeline is below.
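A minimal sketch of that round trip, assuming a frozen `pretrained_ae` with encode/decode and a feature-space `diffae` exposing encode/encode_stochastic/render in the style of this repo's interface (the feature-space variant itself is hypothetical):

```python
import torch

def reconstruct(pretrained_ae, diffae, x_rgb, T=20):
    """RGB -> features -> (z_sem, x_T) -> features -> RGB (illustrative only)."""
    with torch.no_grad():
        feat = pretrained_ae.encode(x_rgb)             # frozen feature space, e.g. 64x64x256
    z_sem = diffae.encode(feat)                        # semantic code (512-d)
    x_T = diffae.encode_stochastic(feat, z_sem, T=T)   # DDIM inversion for the stochastic code
    feat_rec = diffae.render(x_T, z_sem, T=T)          # conditional DDIM denoises back to features
    return pretrained_ae.decode(feat_rec)              # frozen decoder maps features to RGB
```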


mdv3101 commented on August 11, 2024
  1. I have an autoencoder that transforms an RGB image into a feature space.
  2. Thanks for clarifying.
  3. Pretrained DDIM: I was referring to the model produced by the 'ffhq128_ddpm_130M' config.
  4. Thanks for removing the confusion.

If I train DiffAE on the image feature space, then I will need the decoder from my original autoencoder to map the generated feature vector back to RGB space.
Is there any way I can train only the semantic encoder of DiffAE, keeping the DDIM part fixed? That way, the semantic encoder would take the image feature space -> generate z_sem (a 512-d vector, via model.encode()) -> z_sem would be used to condition the Conditional DDIM, which still works on RGB space.
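In code, the proposed setup would look roughly like this (all names illustrative; `unet` stands in for the pretrained conditional DDIM, which stays frozen while `new_encoder` reads the feature space):

```python
import torch
import torch.nn.functional as F

def setup_and_step(unet, new_encoder, opt, x_rgb, feat, alphas_cumprod):
    for p in unet.parameters():
        p.requires_grad_(False)                        # keep the DDIM part fixed
    z_sem = new_encoder(feat)                          # 512-d z_sem from the *feature* space
    t = torch.randint(0, len(alphas_cumprod), (x_rgb.size(0),), device=x_rgb.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x_rgb)
    x_t = a.sqrt() * x_rgb + (1 - a).sqrt() * eps      # diffusion still runs in RGB space
    loss = F.mse_loss(unet(x_t, t, z_sem), eps)        # only new_encoder receives gradients
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```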


phizaz commented on August 11, 2024

Is there any way I can only train the semantic encoder of DIFFAE, keeping the DDIM part fixed?

I think you mean training only the semantic encoder while keeping the DDIM part fixed. Assuming we have a DiffAE pretrained on a potentially related dataset, it might be possible; I'm not sure, as I haven't run any experiments on this.


HanshuYAN commented on August 11, 2024

Dear Authors,

Thanks for sharing this work. I have a question about how the Semantic Encoder (shown in Figure 2) is trained. I cannot find a loss dedicated to training the Semantic Encoder. The paper only shows Eqns. (6) and (9), but these two losses are used to train the "conditional DDIM" and the "latent DDIM", not the "Semantic Encoder".


phizaz commented on August 11, 2024

The semantic encoder is trained end-to-end, which means the training signal propagates from the reconstruction loss, through the diffusion model (the UNet), and arrives at the semantic encoder.
This means the encoder is encouraged to encode information that helps denoise the whole image, while only a corrupted version of the image is available to the UNet at the time. A toy illustration of this gradient path is below.
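A runnable toy illustration (the modules are trivial stand-ins, not the real DiffAE architecture): the only loss is the noise-prediction MSE, yet it trains the encoder, because z_sem conditions the denoiser.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Flatten(), nn.Linear(64, 512))       # "semantic encoder"
unet = nn.Linear(64 + 512, 64)                               # "conditional UNet"

x0 = torch.randn(4, 1, 8, 8)                                 # clean image (seen by the encoder)
x_t = torch.randn(4, 1, 8, 8)                                # corrupted image (seen by the UNet)
eps = torch.randn(4, 64)                                     # target noise

z_sem = enc(x0)                                              # encoder sees the clean image
eps_pred = unet(torch.cat([x_t.flatten(1), z_sem], dim=1))   # UNet sees x_t plus z_sem
F.mse_loss(eps_pred, eps).backward()

print(enc[1].weight.grad is not None)                        # True: the signal reached the encoder
```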


HanshuYAN commented on August 11, 2024

I see. Thanks for the reply.

Then, in this case, how can we ensure that the latent code will have two separate parts, one for linear semantics (z_sem) and the other for stochastic details, as mentioned in the Abstract? There seems to be no explicit regularization to encourage this kind of disentanglement. Do you have any insight into this?


phizaz commented on August 11, 2024

z_sem should only encode the semantic information, leaving the stochastic part as the job of x_T. Since z_sem is a small 512-d bottleneck, it cannot hold all the pixel-level detail; whatever it misses has to be carried by x_T.
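One way to see this split in practice: hold z_sem fixed and resample x_T, and the outputs keep their semantics while low-level details vary. A sketch follows; the encode()/render() signatures mirror this repo's interactive notebook and should be treated as assumptions, as should `img` of shape (1, 3, 128, 128).

```python
import torch

z_sem = model.encode(img)                 # the semantic part, reused below
x_T_a = torch.randn_like(img)             # fresh stochastic code, sample A
x_T_b = torch.randn_like(img)             # fresh stochastic code, sample B
out_a = model.render(x_T_a, z_sem, T=20)  # same semantics...
out_b = model.render(x_T_b, z_sem, T=20)  # ...different stochastic details
```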

