Visual Concepts Tokenization

This is the official PyTorch implementation of the NeurIPS 2022 paper Visual Concepts Tokenization (arXiv | OpenReview).

Visual Concepts Tokenization
Tao Yang, Yuwang Wang, Yan Lu, Nanning Zheng
arXiv preprint arXiv:2102.10303
NeurIPS 2022

Requirements

A suitable conda environment named vct can be created and activated with:

conda env create -f environment.yaml
conda activate vct

Usage

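import torch
from models.visual_concept_tokenizor import VCT_Decoder, VCT_Encoder

# Build the concept tokenizer (encoder) and detokenizer (decoder).
vct_enc = VCT_Encoder(z_index_dim=20)
vct_dec = VCT_Decoder(index_num=256, z_index_dim=256, ce_loss=True)
vct_enc.cuda()
vct_dec.cuda()

# Random input of shape (32, 256, 256), moved to the GPU.
x = torch.randn((32, 256, 256)).cuda()
z = vct_enc(x)      # concept tokens
x_hat = vct_dec(z)  # reconstruction from the concept tokens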

Datasets

Download datasets for training VCT (Shapes3D as an example)

Download Shapes3D

Manually set "get_args.data_dir = /my/datasets/path" in each gin-config, e.g., shapes3d_ce.gin, or run the following script to set the dataset path:

python set_dataset_dir.py --datasets /path/to/your/datasets
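For readers who prefer to see what such a rewrite involves, here is a minimal sketch of updating the data_dir binding in the text of a gin-config. It is not the repository's set_dataset_dir.py; the helper name set_data_dir is hypothetical, and the sketch assumes the binding sits on its own line and that the path is written as a quoted string.

```python
import re

# Hypothetical helper, NOT the repository's set_dataset_dir.py.
# Matches a whole "get_args.data_dir = ..." line and keeps the left-hand side.
_DATA_DIR_LINE = re.compile(
    r"^([ \t]*get_args\.data_dir[ \t]*=[ \t]*).*$", re.MULTILINE
)


def set_data_dir(gin_text: str, data_dir: str) -> str:
    """Return gin_text with every get_args.data_dir binding set to data_dir."""
    # A callable replacement avoids backslash-escape surprises in the path.
    # Assumption: the value is emitted as a quoted Python string literal.
    return _DATA_DIR_LINE.sub(lambda m: m.group(1) + repr(data_dir), gin_text)
```

To apply it to a file, read the config with pathlib.Path.read_text(), pass the text through set_data_dir, and write the result back with write_text().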

Dataset path structure:

/path/to/your/datasets/
├──shapes3d/3dshapes.h5
├──mpi_toy/mpi3d_toy.npz
├──...
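Before launching a long training run, it can help to confirm that the files sit where the tree above expects them. The following is a small sketch of such a check; the function name missing_dataset_files is hypothetical, and the list of expected files covers only the two entries shown in the tree.

```python
from pathlib import Path

# Relative paths taken from the directory tree above; extend for other datasets.
EXPECTED_FILES = ("shapes3d/3dshapes.h5", "mpi_toy/mpi3d_toy.npz")


def missing_dataset_files(root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist as files under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list means every listed file was found; otherwise the returned paths show which downloads are still missing.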

VQVAE as Image Tokenizor

Train VQVAE

The main_vqvae.py script trains the VQVAE; the pretrained VQVAE then serves as the image tokenizor and detokenizor. A VQVAE can be trained with the following command:

python train_vqvae.py --dataset {dataset} --model vqvae --data-dir /path/to/your/datasets/ --epochs 200

{dataset} can be one of cars3d, shapes3d, mpi3d, or clevr. The training settings are the same as in VQ-VAE. For example:

python train_vqvae.py --dataset shapes3d --model vqvae --data-dir /path/to/your/datasets/ --epochs 200
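When several datasets are trained in a row, the command above can be assembled programmatically. The sketch below only builds the argument list using the flags shown in this section; the helper name vqvae_command is hypothetical and it does not launch anything by itself.

```python
import shlex

# Dataset names listed in the text above.
VQVAE_DATASETS = ("cars3d", "shapes3d", "mpi3d", "clevr")


def vqvae_command(dataset, data_dir, epochs=200):
    """Build the train_vqvae.py argument list for one dataset."""
    if dataset not in VQVAE_DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {VQVAE_DATASETS}")
    return [
        "python", "train_vqvae.py",
        "--dataset", dataset,
        "--model", "vqvae",
        "--data-dir", data_dir,
        "--epochs", str(epochs),
    ]


def as_shell(command):
    """Render an argument list as a single shell-safe string."""
    return shlex.join(command)
```

The argument list can be passed directly to subprocess.run(command, check=True), or printed with as_shell for copying into a terminal.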

Train VCT

main_vct.py is the main script for training the VCT autoencoder with a pretrained VQ-VAE. After the VQ-VAE is trained, set the checkpoint path of the pretrained model with "get_args.data_dir = /my/checkpoint/path" in each gin-config, e.g., shapes3d_shared.gin, and then run:

python main_vct.py --config {dataset}_ce.gin

For scene decomposition

To train a VQ-VAE based VCT model for scene decomposition, use the following commands:

python train_vqvae.py --dataset clevr --model vqvae --data-dir /path/to/your/datasets/ --epochs 200

python main_vct.py --config clevr_ce.gin

main_vct_mask.py is the script for training the VCT autoencoder with a mask-based decoder (see the description in the main paper).

python main_vct_mask.py --config clevr_ce.gin
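The scene decomposition workflow above is a two-stage pipeline: train the VQ-VAE on clevr, then train VCT with either the standard or the mask-based decoder. A minimal sketch of scripting those stages follows; the function names are hypothetical, and the commands are exactly the ones listed in this section.

```python
import subprocess


def scene_decomposition_commands(data_dir, epochs=200, mask_decoder=False):
    """Return the two training commands for clevr, in execution order."""
    # Choose between the standard and the mask-based VCT training script.
    vct_script = "main_vct_mask.py" if mask_decoder else "main_vct.py"
    return [
        ["python", "train_vqvae.py", "--dataset", "clevr", "--model", "vqvae",
         "--data-dir", data_dir, "--epochs", str(epochs)],
        ["python", vct_script, "--config", "clevr_ce.gin"],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling run_all(scene_decomposition_commands("/path/to/your/datasets/", mask_decoder=True)) would run both stages back to back; remember that the checkpoint path still has to be set in the gin-config between the two stages.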

CLIP as Image Tokenizor

The main_vct_clip.py script trains VCT with the CLIP image encoder as the Image Tokenizor.

python main_vct_clip.py --config shapes3d_ce.gin

Pretrained GAN as Image Detokenizor

Following DisCo, we adopt different pretrained GANs as the Image Detokenizor. Use the command that matches the dataset:

For dataset Cat/Church

python main_gan_based.py --config {church/cat}_ce.gin

For dataset FFHQ

python main_ffhq_gan_based.py --config ffhq_low_ce.gin

For dataset ImageNet

python main_biggan_based.py --config imgnet_ce.gin
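Because each dataset uses a different script and config, a small lookup table can keep the pairing straight. This is only a sketch: the dictionary keys (cat, church, ffhq, imagenet) are labels chosen here for illustration, while the script and config names are the ones given in the commands above.

```python
# Keys are illustrative labels; scripts and configs come from the commands above.
GAN_RUNS = {
    "cat": ("main_gan_based.py", "cat_ce.gin"),
    "church": ("main_gan_based.py", "church_ce.gin"),
    "ffhq": ("main_ffhq_gan_based.py", "ffhq_low_ce.gin"),
    "imagenet": ("main_biggan_based.py", "imgnet_ce.gin"),
}


def gan_command(dataset):
    """Return the training command for a GAN-based detokenizor dataset."""
    try:
        script, config = GAN_RUNS[dataset.lower()]
    except KeyError:
        raise ValueError(
            f"unknown dataset {dataset!r}; expected one of {sorted(GAN_RUNS)}"
        ) from None
    return ["python", script, "--config", config]
```

The returned list can be handed to subprocess.run(..., check=True) or joined with spaces for display.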

Visualization Results

For VCT with VQVAE as Image Tokenizor

For VCT with Pretrained GAN as Image Detokenizor

Acknowledgement

This project is built upon VQ-VAE, DisCo, and Perceiver. The evaluation code is built upon disentanglement_lib.

Citation

@article{yang2022visual,
  title={Visual Concepts Tokenization},
  author={Yang, Tao and Wang, Yuwang and Lu, Yan and Zheng, Nanning},
  journal={NeurIPS},
  year={2022}
}
