
Comments (9)

phizaz commented on August 11, 2024

You mean you want to apply DiffAE not to RGB images but to a matrix of image features, e.g. VQ-VAE-like features? And what you expect to get from this is the meaningful z_sem that DiffAE may provide?
If so, it seems possible, and you don't need a ground-truth z_sem for it. You just need to train DiffAE on top of that image feature space instead of training it on the RGB space as usual. In this case, DiffAE learns to reconstruct the image features, i.e. 64x64x256, and at the same time learns to come up with a useful z_sem.
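To make this concrete, here is a minimal sketch of one training step on top of the feature space. The names are hypothetical stand-ins: `feature_encoder` for the frozen pretrained feature extractor, `semantic_encoder` for DiffAE's encoder, and `denoiser` for its conditional UNet; the loss is the usual simple DDPM objective (Eq. (6) in the paper).

```python
import torch
import torch.nn.functional as F

def diffae_feature_step(feature_encoder, semantic_encoder, denoiser,
                        x_rgb, alphas_cumprod):
    with torch.no_grad():                          # the feature space itself stays fixed
        feat = feature_encoder(x_rgb)              # (B, 256, 64, 64) replaces the RGB input
    z_sem = semantic_encoder(feat)                 # (B, 512) semantic code z_sem
    t = torch.randint(0, len(alphas_cumprod), (feat.size(0),), device=feat.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(feat)
    feat_t = a.sqrt() * feat + (1 - a).sqrt() * eps    # forward diffusion q(x_t | x_0)
    eps_pred = denoiser(feat_t, t, z_sem)              # conditional noise prediction
    return F.mse_loss(eps_pred, eps)                   # simple DDPM loss
```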


mdv3101 commented on August 11, 2024

Hi @phizaz,
Yes, I am looking for something similar. I need to reconstruct the image using the diffusion model. The conditioning, i.e. z_sem, should come from the image feature space, while the DDIM should work on RGB space.
I have a few doubts:

  • Do I have to train only the Diffusion AutoEncoder, or the DDIM as well?
  • If we train only the Diffusion AutoEncoder, will the z_sem it generates be compatible with a pre-trained DDIM?
  • What about losses like LPIPS if we train in feature space instead of RGB?
  • As per my understanding, Diffusion AutoEncoder training uses an autoencoder as well as a diffusion model. If we train a model using the 'ffhq128_autoenc_130M' config file, it will use both the autoencoder and the diffusion model. Am I right?


phizaz commented on August 11, 2024

A few terms need to be made clear first.

  1. You already have an autoencoder that provides the feature space on which everything else will be built.
  2. A diffusion autoencoder is itself a kind of DDIM, so it is not quite right to speak of DiffAE and DDIM separately. In any case, I don't think you need another DDIM besides the DiffAE.
  3. You mentioned a "pretrained DDIM", which I'm not sure what you mean by.
  4. The word "autoencoder" in DiffAE is definitely NOT the same as in 1). You need to be careful with the terminology here.

My mental picture of how this should look is as follows:

  1. You have a pretrained autoencoder that provides the image feature space.
  2. You train DiffAE on that image feature space with some loss function; I don't think you need LPIPS.

I don't think you need any more than these two components; a sketch of the resulting pipeline is below.
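A minimal sketch of that round trip, assuming a frozen `pretrained_ae` with encode/decode and a feature-space `diffae` exposing encode/encode_stochastic/render in the style of this repo's interface (the feature-space variant itself is hypothetical):

```python
import torch

def reconstruct(pretrained_ae, diffae, x_rgb, T=20):
    """RGB -> features -> (z_sem, x_T) -> features -> RGB (illustrative only)."""
    with torch.no_grad():
        feat = pretrained_ae.encode(x_rgb)             # frozen feature space, e.g. 64x64x256
    z_sem = diffae.encode(feat)                        # semantic code (512-d)
    x_T = diffae.encode_stochastic(feat, z_sem, T=T)   # DDIM inversion for the stochastic code
    feat_rec = diffae.render(x_T, z_sem, T=T)          # conditional DDIM denoises back to features
    return pretrained_ae.decode(feat_rec)              # frozen decoder maps features to RGB
```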


mdv3101 commented on August 11, 2024
  1. I have an autoencoder that transforms an RGB image into a feature space.
  2. Thanks for clarifying.
  3. Pretrained DDIM: I was referring to the model produced by the 'ffhq128_ddpm_130M' config.
  4. Thanks for removing the confusion.

If I train DiffAE on the image feature space, then I will need the decoder from my original autoencoder to map the generated feature vector back to RGB space.
Is there any way I can train only the semantic encoder of DiffAE, keeping the DDIM part fixed? That way, the semantic encoder would take the image feature space -> generate z_sem (a 512-d vector, via model.encode()) -> z_sem would be used to condition the Conditional DDIM, which still works on RGB space.
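In code, the proposed setup would look roughly like this (all names illustrative; `unet` stands in for the pretrained conditional DDIM, which stays frozen while `new_encoder` reads the feature space):

```python
import torch
import torch.nn.functional as F

def setup_and_step(unet, new_encoder, opt, x_rgb, feat, alphas_cumprod):
    for p in unet.parameters():
        p.requires_grad_(False)                        # keep the DDIM part fixed
    z_sem = new_encoder(feat)                          # 512-d z_sem from the *feature* space
    t = torch.randint(0, len(alphas_cumprod), (x_rgb.size(0),), device=x_rgb.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x_rgb)
    x_t = a.sqrt() * x_rgb + (1 - a).sqrt() * eps      # diffusion still runs in RGB space
    loss = F.mse_loss(unet(x_t, t, z_sem), eps)        # only new_encoder receives gradients
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```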


phizaz commented on August 11, 2024

Is there any way I can only train the semantic encoder of DIFFAE, keeping the DDIM part fixed?

I think you mean training only the semantic encoder while keeping the DDIM part fixed. Assuming we have a DiffAE pretrained on a potentially related dataset, it might be possible; I'm not sure, as I haven't run any experiments on this.


HanshuYAN commented on August 11, 2024

Dear Authors,

Thanks for sharing this work. I have a question about how the Semantic Encoder (shown in Figure 2) is trained. I cannot find a loss dedicated to training the Semantic Encoder. The paper only shows Eqns. (6) and (9), but these two losses are used to train the "conditional DDIM" and the "latent DDIM", not the "Semantic Encoder".


phizaz commented on August 11, 2024

The semantic encoder is trained end-to-end, which means the training signal propagates from the reconstruction loss, through the diffusion model (the UNet), and arrives at the semantic encoder.
This means the encoder is encouraged to encode information that helps denoise the whole image, while only a corrupted version of the image is available to the UNet at the time. A toy illustration of this gradient path is below.
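A runnable toy illustration (the modules are trivial stand-ins, not the real DiffAE architecture): the only loss is the noise-prediction MSE, yet it trains the encoder, because z_sem conditions the denoiser.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Flatten(), nn.Linear(64, 512))       # "semantic encoder"
unet = nn.Linear(64 + 512, 64)                               # "conditional UNet"

x0 = torch.randn(4, 1, 8, 8)                                 # clean image (seen by the encoder)
x_t = torch.randn(4, 1, 8, 8)                                # corrupted image (seen by the UNet)
eps = torch.randn(4, 64)                                     # target noise

z_sem = enc(x0)                                              # encoder sees the clean image
eps_pred = unet(torch.cat([x_t.flatten(1), z_sem], dim=1))   # UNet sees x_t plus z_sem
F.mse_loss(eps_pred, eps).backward()

print(enc[1].weight.grad is not None)                        # True: the signal reached the encoder
```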


HanshuYAN commented on August 11, 2024

I see. Thanks for the reply.

Then, in this case, how can we ensure that the latent code will have two separate parts, one for linear semantics (z_sem) and the other for stochastic details, as mentioned in the Abstract? There seems to be no explicit regularization to encourage this kind of disentanglement. Do you have any insight into this?


phizaz commented on August 11, 2024

z_sem should only encode the semantic information, leaving the stochastic part as the job of x_T. Since z_sem is a small 512-d bottleneck, it cannot hold all the pixel-level detail; whatever it misses has to be carried by x_T.
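One way to see this split in practice: hold z_sem fixed and resample x_T, and the outputs keep their semantics while low-level details vary. A sketch follows; the encode()/render() signatures mirror this repo's interactive notebook and should be treated as assumptions, as should `img` of shape (1, 3, 128, 128).

```python
import torch

z_sem = model.encode(img)                 # the semantic part, reused below
x_T_a = torch.randn_like(img)             # fresh stochastic code, sample A
x_T_b = torch.randn_like(img)             # fresh stochastic code, sample B
out_a = model.render(x_T_a, z_sem, T=20)  # same semantics...
out_b = model.render(x_T_b, z_sem, T=20)  # ...different stochastic details
```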

