Experiments and utilities to reverse engineer Stable Diffusion model mechanics and neural network structure on a CPU laptop, for ease of inspection.
- Run TensorFlow on CPU
- 8 GB or less of host memory instead of 8 GB of GPU VRAM
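One common way to pin TensorFlow to the CPU is sketched below; this is a general recipe, not necessarily how this repo configures it:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide GPUs before TensorFlow initializes

import tensorflow as tf
tf.config.set_visible_devices([], "GPU")   # or hide GPUs via the TF API
print(tf.config.get_visible_devices())     # should list only CPU devices
```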
The diffusion model generator orchestrates image generation through the AE, UNet, Text Encoder, and Noise Scheduler:
- Autoencoder
- VAE encodes the input image into a compressed latent vector
- VAE decodes the denoised latent vector back into an image
- UNet
- ResNet-based blocks predict noise in latent vector (estimated residuals)
- Cross-attention layers steer the residuals using the conditional text embedding vector
- Text Encoder CLIP ViT-L/14
- Text Encoder tokenizes the prompt
- Text Encoder produces embeddings for text conditioning (input to the attention layers in the UNet).
- Noise Scheduler (DDPM / DDIM)
- runs the forward noising of the latent vector for several steps (training)
- runs the UNet denoiser backwards for the required time steps (training, inference)
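As a rough map of how these parts interact, here is a minimal generation-loop sketch. The function names (`tokenize`, `clip_encode`, `unet_predict_noise`, `scheduler_timesteps`, `scheduler_step`, `vae_decode`) and the latent shape are hypothetical placeholders, not this repo's actual API:

```python
import torch

def generate(prompt, steps=50):
    # text conditioning: tokenize the prompt and embed it with CLIP
    cond = clip_encode(tokenize(prompt))
    # start from a Gaussian latent vector (hypothetical SD-like shape)
    latents = torch.randn(1, 4, 64, 64)
    for t in scheduler_timesteps(steps):
        # UNet predicts the corrupting noise, steered by cross-attention on cond
        pred_noise = unet_predict_noise(latents, t, cond)
        # scheduler removes the predicted noise for this time step
        latents = scheduler_step(latents, t, pred_noise)
    # VAE decodes the denoised latent vector back into an image
    return vae_decode(latents)
```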
An interesting sampling algorithm idea: predicted "corrupting" noise vs extra "stabilizing" noise.
At each step,
- the sampling algorithm removes the predicted corrupting noise (the image pixel sample is no longer normally distributed)
- the sampling algorithm adds back a smaller, scaled stabilizing noise to the currently denoised image (the image pixel sample is normally distributed again)
- the normally distributed image sample becomes compatible again with the distribution expected by the UNet residual predictor.
Feature | Modelling Equation | Description |
---|---|---|
Stabilizing noise | z_noise = sqrt(b_t) * z, z ~ N(0, I) | Make the denoised image normally distributed again. |
Mean image | mean = (x - pred_noise * ((1 - a_t) / sqrt(1 - ab_t))) / sqrt(a_t) | Remove the predicted noise from the corrupted image x. |
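In standard DDPM notation, with a_t = 1 - b_t and ab_t the cumulative product of the a_t (matching the `a_t`, `b_t`, `ab_t` schedule tensors in the code below), the full reverse step combines both rows of the table:

```math
x_{t-1} = \underbrace{\frac{1}{\sqrt{a_t}}\left(x_t - \frac{1 - a_t}{\sqrt{1 - \bar{a}_t}}\,\epsilon_\theta(x_t, t)\right)}_{\text{mean image}} + \underbrace{\sqrt{b_t}\, z}_{\text{stabilizing noise}}, \qquad z \sim \mathcal{N}(0, I)
```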
```python
import torch

# remove the predicted noise from the corrupted image x, then mix
# z noise back in to avoid denoising collapse due to the changed
# noise distribution; a_t, b_t, ab_t are the noise schedule tensors
def denoise_add_noise(x, t, pred_noise):
    # stabilizing noise z ~ N(0, I), scaled by sqrt(b_t)
    z_noise = b_t.sqrt()[t] * torch.randn_like(x)
    # mean image: remove the predicted noise from x
    mean = (x - pred_noise * ((1 - a_t[t]) / (1 - ab_t[t]).sqrt())) / a_t[t].sqrt()
    return mean + z_noise
```
If the extra noise is not added back, the UNet denoiser predicts the wrong noise levels and the image mean values collapse.
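A minimal DDPM sampling loop built around `denoise_add_noise` might look like the sketch below; `nn_model`, `timesteps`, `height`, and `device` are assumed to be defined as in the `sample_ddim` snippet further down:

```python
@torch.no_grad()
def sample_ddpm(n_sample):
    # start from pure Gaussian noise
    samples = torch.randn(n_sample, 3, height, height).to(device)
    for i in range(timesteps, 0, -1):
        t = torch.tensor([i / timesteps])[:, None, None, None].to(device)
        pred_noise = nn_model(samples, t)  # predicted corrupting noise
        # remove predicted noise, add back scaled stabilizing noise
        samples = denoise_add_noise(samples, i, pred_noise)
    return samples
```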
For each time step,
- a Time Embedding is added to the UNet so the UNet decoder offsets the image noise at the right noise step
- a Context Embedding is added to the UNet so the text embedding controls the UNet decoder's noise prediction

Training then proceeds as follows (see the training-step sketch after this list):
- sample a random image, and sample a random timestep (noise level)
- compare the predicted noise at that time step with the actual injected noise
- compute the MSE(noise_true, noise_pred) loss from the actual and predicted noise
- the context embedding cemb is passed to the UNet along with the noised input during training
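A hedged training-step sketch under those assumptions; `perturb_input` (forward-noising of x at timestep t), `nn_model`, `timesteps`, and `device` are placeholders kept consistent with the other snippets here, not necessarily the repo's exact code:

```python
import torch
import torch.nn.functional as F

def train_step(x, cemb, optimizer):
    # sample a random timestep (noise level) per image in the batch
    t = torch.randint(1, timesteps + 1, (x.shape[0],)).to(device)
    noise = torch.randn_like(x)           # actual injected noise
    x_t = perturb_input(x, t, noise)      # forward-noise the image to level t
    # predict the noise at that time step, conditioned on the text embedding
    pred_noise = nn_model(x_t, (t / timesteps)[:, None, None, None], cemb)
    loss = F.mse_loss(pred_noise, noise)  # MSE(noise_true, noise_pred)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```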
DDIM sampling skips through the schedule deterministically in n larger steps:

```python
def sample_ddim(n_sample, n):
    # start from pure Gaussian noise
    samples = torch.randn(n_sample, 3, height, height).to(device)
    step_size = timesteps // n
    for i in range(timesteps, 0, -step_size):
        t = torch.tensor([i / timesteps])[:, None, None, None].to(device)
        pred_noise = nn_model(samples, t)  # predicted corrupting noise
        samples = denoise_ddim(samples, i, i - step_size, pred_noise)
    return samples
```
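`denoise_ddim` is referenced above but not shown; a deterministic (eta = 0) DDIM update consistent with the `ab_t` schedule tensor is a reasonable reconstruction:

```python
def denoise_ddim(x, t, t_prev, pred_noise):
    ab = ab_t[t]
    ab_prev = ab_t[t_prev]
    # estimate the fully denoised image x0, rescaled to the previous noise level
    x0_pred = ab_prev.sqrt() / ab.sqrt() * (x - (1 - ab).sqrt() * pred_noise)
    # deterministic direction term: no stabilizing noise is added back
    dir_xt = (1 - ab_prev).sqrt() * pred_noise
    return x0_pred + dir_xt
```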
All experiments are scripts whose filenames start with test* or text*. Each tests a building block or a simpler internal module of the diffusion model:
- test noise corruptions in the image encoder
- test the image decoder
- test the text encoder
- attempts to visualize semantic similarity in the text encodings' latent space
- latent-level interpolations (see the slerp sketch below)
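For the latent-level interpolations, spherical interpolation (slerp) is a common choice because linear blends of Gaussian latents drift off the norm the model expects; a minimal sketch (the helper name `slerp` is my own, not necessarily what the scripts use):

```python
import torch

def slerp(v0, v1, alpha):
    # spherically interpolate between two latent tensors of the same shape
    v0_flat, v1_flat = v0.flatten(), v1.flatten()
    omega = torch.acos(torch.clamp(
        torch.dot(v0_flat / v0_flat.norm(), v1_flat / v1_flat.norm()), -1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < 1e-6:
        return (1.0 - alpha) * v0 + alpha * v1  # nearly parallel: fall back to lerp
    return (torch.sin((1.0 - alpha) * omega) / so) * v0 + \
           (torch.sin(alpha * omega) / so) * v1
```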
Module xray/tensor_debug contains a DebugTensor class to support reverse engineering an encoded vector at various (de)noising stages.
- pip install -r requirements (warning: uninstall the GPU Keras build, as it would conflict with the CPU Keras version)
- download weights
The backend image encoder and decoder are based on Keras but could in theory be swapped for PyTorch or another DL framework. The Keras implementation makes it possible to run in CPU-accessible host memory (not possible on NVIDIA GPUs with 4 GB VRAM or less).