drboog / lafite

Code for paper LAFITE: Towards Language-Free Training for Text-to-Image Generation (CVPR 2022)

License: MIT License

Python 91.88% Jupyter Notebook 1.54% C++ 2.02% Cuda 4.56%

lafite's Issues

Pretrained language free MM CelebA-HQ model

Hi, I'm currently putting together a paper on a different language-free text-to-image generation method, and I was wondering whether you have a pre-trained CelebA-HQ model that you could share with me for comparison purposes.

training time

Hi, you mentioned that all of your experiments were conducted on 4 Nvidia Tesla V100 GPUs, and that the datasets considered are MS-COCO, CUB, LN-COCO, and CelebA.

How long does training take on each of these datasets, and how many iterations and epochs does it run for (and at what batch size)?
(You mentioned 4 days to reach an FID of 18 on MS-COCO; how about the other datasets?)

Custom Dataset training

Hi, would you leave a detailed comment on how to train on a custom dataset?

Truly thankful.

Intuition of injecting noises for pseudo text features

Hi authors,

I really appreciate you sharing your code.
I would like to ask about the intuition behind injecting random noise instead of using the image feature embedded by CLIP as is.

Thank you in advance!
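
For readers following this question, here is a minimal sketch of what "injecting noise" into a CLIP image feature could look like. It assumes a fixed-magnitude Gaussian perturbation followed by re-normalization; the function name `pseudo_text_feature` and the noise level `xi=0.1` are hypothetical and are not taken from the repository.

```python
import math
import random


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))


def pseudo_text_feature(image_feature, xi=0.1, rng=None):
    """Perturb an image feature with Gaussian noise and re-normalize.

    Assumption: the noise is rescaled so that its norm equals
    xi * ||image_feature||. xi=0.1 is an illustrative value only.
    """
    rng = rng or random.Random()
    eps = [rng.gauss(0.0, 1.0) for _ in image_feature]
    scale = xi * l2_norm(image_feature) / l2_norm(eps)
    perturbed = [h + scale * e for h, e in zip(image_feature, eps)]
    norm = l2_norm(perturbed)
    return [p / norm for p in perturbed]


feature = [0.5, -0.2, 0.7, 0.1]  # stand-in for a CLIP image feature
pseudo = pseudo_text_feature(feature, xi=0.1, rng=random.Random(0))
print(cosine(feature, pseudo))  # close to 1, but not exactly 1
```

With a small `xi` the pseudo feature stays close to the image feature in cosine similarity while no longer being identical to it, which is the property the question is about.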

some problem in dataset_tool.py

I don't know why this program runs so slowly. Is there anything I can do to speed up the process?

PS: Out of habit, I usually run my code in Jupyter Notebook, but I couldn't run it there even though all the packages were installed successfully and the function itself had no problems. I could run it from CMD (albeit slowly). What might be the reason for this?

About training details in Table 3

Lafite is a very important model. I am excited to find that it can generate complex images of such high quality, with strong performance even compared with large pre-trained models.
I am interested in some of its training details.
Are the standard text-to-image Lafite models for the different datasets in Table 3 fine-tuned from pre-trained models?
If so, how was the model pre-trained for each dataset? What was the pre-training dataset, and which training mode (language-free or with ground truth) was used for pre-training?

Btw, thanks for sharing this wonderful work.

Horizontal flip on COCO

The pre-trained model COCO2014_CLIP_ViT32_all_text_FID.pkl has 'xflip': True. So can I assume that the score in the paper was achieved with horizontal flip and no other augmentation?
If so, I assume --mirror should be added to the command line.

evaluate issue

Thanks for your excellent work! I have some questions about calculating FID on MS-COCO.
First, how do you randomly sample text from the COCO2014 validation set? I can think of two ways: one is to sample 30k images from the 40k images and then sample one caption for each image; the other is to sample 30k captions from the 200k captions in the COCO2014 validation set. The first way will give a better FID because only one caption is sampled per image.
Second, after generating the 30k images, I am unsure whether the COCO2014 train set or validation set is used for calculating FID, especially for the best FID of 8.12 in your paper. Calculating against the train set will give a better FID.
I am going to cite your paper, and for reasons of fairness I want to fully understand your evaluation method. I would appreciate it very much if you could answer my questions.
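
The two sampling strategies described in this question can be sketched as follows. This is only an illustration on toy data; the function names and the `captions_by_image` mapping are hypothetical, and nothing here confirms which strategy the authors used.

```python
import random


def sample_one_caption_per_image(captions_by_image, n_images, seed=0):
    """Way 1: pick n_images distinct images, then one caption for each."""
    rng = random.Random(seed)
    image_ids = rng.sample(sorted(captions_by_image), n_images)
    return [(img, rng.choice(captions_by_image[img])) for img in image_ids]


def sample_from_caption_pool(captions_by_image, n_captions, seed=0):
    """Way 2: pool every caption, then pick n_captions distinct captions.

    The same image can appear several times in the result.
    """
    rng = random.Random(seed)
    pool = [(img, cap)
            for img in sorted(captions_by_image)
            for cap in captions_by_image[img]]
    return rng.sample(pool, n_captions)


# Toy stand-in for the validation set: 10 images with 5 captions each.
toy = {f"img{i}": [f"caption {i}-{j}" for j in range(5)] for i in range(10)}

way1 = sample_one_caption_per_image(toy, 6)
way2 = sample_from_caption_pool(toy, 6)
print(len({img for img, _ in way1}), "distinct images in way 1")
print(len({img for img, _ in way2}), "distinct images in way 2")
```

For the real setting, the toy mapping would be replaced by the validation captions, with 30,000 as the sample size in both functions.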

Reproducibility with ground-truth pairs

I am trying to reproduce "Training with ground-truth pairs" using the command line:

python train.py --gpus=4 --outdir=./outputs/ --temp=0.5 --itd=10 --itc=20 --gamma=10 --data=./datasets/COCO2014_train_CLIP_ViTB32.zip --test_data=./datasets/COCO2014_val_CLIP_ViTB32.zip --mixing_prob=0.0

with the downloaded CLIP features. I didn't touch any hyperparameters. I resumed the training with --resume path/to/pkl.

Unfortunately, I had to resume the training, and it now looks unlikely that I can reproduce FID=8.6 by 25,000 kimg as in the pretrained model (I checked the pretrained model, and it's fine).

3,024 kimg got 17.425
At this point, I resumed.
3,024 (previously) + 3,024 (resumed log says) = 6,048 kimg got 14.27
3,024 (previously) + 6,048 (resumed log says) = 9,072 kimg got 12.65
3,024 (previously) + 9,072 (resumed log says) = 12,096 kimg got 12.15
3,024 (previously) + 14,515 (resumed log says) = 17,539 kimg got 11.96
At this point, I resumed again.
17,539 (previously in total) + 0 (resumed log says) = 17,539 kimg got 11.93 (confirmed correctly resumed)
17,539 (previously in total) + 8,064 (resumed log says) = 25,603 kimg got 11.55
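
The bookkeeping in the list above can be reproduced with a small helper. This is a sketch that assumes the kimg counter in the log restarts at 0 after every resume and that each run resumed from the last logged value of the previous one; `cumulative_kimg` is a hypothetical name, not part of the repository.

```python
def cumulative_kimg(runs):
    """Convert per-run kimg readings into cumulative totals.

    runs: one list per training run, holding the kimg values that run's
    log reported (assumed to restart at 0 on every resume).
    """
    totals = []
    offset = 0
    for readings in runs:
        totals.extend(offset + k for k in readings)
        if readings:
            offset += readings[-1]  # next run continues from here
    return totals


runs = [
    [3024],                    # first run
    [3024, 6048, 9072, 14515], # after the first resume
    [0, 8064],                 # after the second resume
]
print(cumulative_kimg(runs))
```

The output matches the totals reported in the list, which at least confirms that the arithmetic behind the FID-versus-kimg table is consistent.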

So I am wondering whether the resuming mechanism may hurt the integrity of the optimization, for example the optimizer states. Do you have any suggestions for this situation?

By the way, it would be helpful if you could share your training log.

Understanding Multi-Line Text input for training and testing model

Hi, from my understanding, dataset_tool.py supports datasets with more than one line of caption text per image, but the testing code only allows a single line to be passed as input per query. I would like to know how and where in the codebase these multiple caption lines are passed for training, and whether the inference code can be updated to test with multi-line caption queries.

I am using the Ground-Truth pair based training approach for evaluation. Thanks in advance.

Tune hyperparameters

Hi,
Thanks for sharing this wonderful work. Could you please explain how to tune itd, itc, and gamma for a custom dataset?
Thanks in advance.

data preprocess

Can you share how to preprocess the COCO dataset into your format:
1.png
1.txt
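
One possible way to get from COCO caption annotations to numbered image/text pairs is sketched below. It relies on the standard COCO captions JSON layout (`images` with `id` and `file_name`, `annotations` with `image_id` and `caption`). The function name `pair_captions` and the choice of one caption per line in each .txt file are assumptions, not something confirmed by the authors.

```python
from collections import defaultdict


def pair_captions(coco):
    """Map each source image to (new image name, text name, text body).

    coco: a dict in the COCO captions annotation layout.
    Assumption: every caption goes on its own line of the .txt file.
    """
    captions = defaultdict(list)
    for ann in coco["annotations"]:
        # Collapse stray whitespace and newlines inside a caption.
        captions[ann["image_id"]].append(" ".join(ann["caption"].split()))

    pairs = {}
    images = sorted(coco["images"], key=lambda im: im["id"])
    for index, image in enumerate(images, start=1):
        body = "\n".join(captions[image["id"]])
        pairs[image["file_name"]] = (f"{index}.png", f"{index}.txt", body)
    return pairs


# Toy annotation dict; real data would come from json.load() on the
# COCO captions file.
coco = {
    "images": [{"id": 42, "file_name": "a.jpg"},
               {"id": 7, "file_name": "b.jpg"}],
    "annotations": [{"image_id": 42, "caption": "A cat.\n"},
                    {"image_id": 7, "caption": "A dog."},
                    {"image_id": 42, "caption": "A  sleeping cat."}],
}
for source, (png, txt, body) in pair_captions(coco).items():
    print(source, "->", png, txt, repr(body))
```

Writing the files is then a matter of converting each source image to the .png name and saving the body under the .txt name.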

Questions on the class Model and Gaussian noise in loss.py, and data preprocessing

Hi, when I read the code I noticed the comment "# This implementation is incorrect, it should be sim=sim/temp." at L149 in loss.py. I later found the same "bug" at L45. If I correct L149, do I also need to correct L45?

I also found that the Model class in loss.py is only defined at L75-L77 and never called later. What is the function of the Model class here?
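
To make the difference behind that code comment concrete, the sketch below compares the probability assigned to the matched pair when the softmax is taken over sim/temp versus over exp(sim/temp). It assumes a softmax is applied after the similarity is scaled, which is not shown in this thread; the similarity values are made up.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matched_probability(similarities, temp, double_exp=False):
    """Softmax probability of the first (matched) pair.

    double_exp=False: logits are sim/temp (the form the comment calls correct).
    double_exp=True:  logits are exp(sim/temp), so the softmax
                      exponentiates a second time.
    """
    logits = [s / temp for s in similarities]
    if double_exp:
        logits = [math.exp(l) for l in logits]
    return softmax(logits)[0]


sims = [0.3, 0.2, 0.1]  # made-up cosine similarities, matched pair first
print(matched_probability(sims, temp=0.5))
print(matched_probability(sims, temp=0.5, double_exp=True))
```

The two variants give different probabilities for the same inputs, so hyperparameters tuned for one are not guaranteed to transfer to the other.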

format of dataset

It seems that the format of the dataset you provided is wrong

Any plans for Lafite2?

Reproduce the experimental results of the paper

First of all, thank you very much for your paper; it has been a huge help to me. The project you uploaded has also greatly helped my research. I want to ask you a few questions.

1. Are the results shown in the paper based on "sim = torch.exp(sim/temp), itd=10, itc=20"? What is the result with "sim = sim/temp, itd=5, itc=10"? Under the "sim = sim/temp" setting, is "itd=5, itc=10" optimal?

2. I am using 4 Nvidia 1080 GPUs for training, and a 25,000 kimg experiment takes me 15 days. I would like to know what hardware you used and how long one training run takes.
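
For comparing hardware setups, the figures in question 2 can be turned into a throughput number. This is plain arithmetic on the values quoted in the question (25,000 kimg in 15 days); the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600


def images_per_second(kimg, days):
    """Average training throughput for a run of `kimg` thousand images."""
    return kimg * 1000 / (days * SECONDS_PER_DAY)


def days_needed(kimg, throughput):
    """Days required to process `kimg` thousand images at `throughput` img/s."""
    return kimg * 1000 / throughput / SECONDS_PER_DAY


rate = images_per_second(kimg=25000, days=15)
print(f"{rate:.2f} images/s")                  # about 19.29 images/s
print(f"{days_needed(25000, 2 * rate):.1f} days at twice the throughput")
```

A reply that states either the wall-clock time or the images per second of another setup can then be converted to the other quantity with the same two functions.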

Exporting weights to Pytorch StyleGAN2 implementation

Hi,

Thanks for the wonderful work! I am trying to export the pretrained weights released in this repository to rosinality's implementation. I am using the export script from the stylegan2-ada-pytorch repository which is known to work for checkpoints using that repo (see here). However, when I used the converted Lafite checkpoints for the pretrained model COCO2014_CLIP_ViTB32_all_text.pkl, I am facing the following error:

raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Generator:
size mismatch for conv1.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for to_rgb1.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.0.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.1.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.2.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.3.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.4.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.5.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.6.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.7.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.8.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.9.conv.modulation.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([256, 512]).
size mismatch for convs.10.conv.modulation.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([256, 512]).
size mismatch for convs.11.conv.modulation.weight: copying a param with shape torch.Size([128, 1024]) from checkpoint, the shape in current model is torch.Size([128, 512]).
size mismatch for to_rgbs.0.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for to_rgbs.1.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for to_rgbs.2.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for to_rgbs.3.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for to_rgbs.4.conv.modulation.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([256, 512]).
size mismatch for to_rgbs.5.conv.modulation.weight: copying a param with shape torch.Size([128, 1024]) from checkpoint, the shape in current model is torch.Size([128, 512]).

Any idea why this mismatch might be happening?
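
A quick way to summarize an error like the one above is to diff the parameter shapes of the checkpoint and the target model before calling load_state_dict. The sketch below works on plain name-to-shape dicts (with PyTorch these could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`); the helper names and the `conv1.conv.weight` entry are hypothetical.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """List (name, checkpoint shape, model shape) for shared names that differ."""
    shared = sorted(set(checkpoint_shapes) & set(model_shapes))
    return [(name, tuple(checkpoint_shapes[name]), tuple(model_shapes[name]))
            for name in shared
            if tuple(checkpoint_shapes[name]) != tuple(model_shapes[name])]


def differing_axes(mismatches):
    """Set of axis indices on which same-rank mismatched shapes disagree."""
    axes = set()
    for _, ckpt_shape, model_shape in mismatches:
        if len(ckpt_shape) == len(model_shape):
            axes.update(i for i, (a, b) in enumerate(zip(ckpt_shape, model_shape))
                        if a != b)
    return axes


# Two shapes copied from the error message, plus one hypothetical matching entry.
ckpt = {"conv1.conv.modulation.weight": (512, 1024),
        "convs.9.conv.modulation.weight": (256, 1024),
        "conv1.conv.weight": (1, 512, 512, 3, 3)}
model = {"conv1.conv.modulation.weight": (512, 512),
         "convs.9.conv.modulation.weight": (256, 512),
         "conv1.conv.weight": (1, 512, 512, 3, 3)}

found = shape_mismatches(ckpt, model)
print(found)
print(differing_axes(found))  # only the input axis of the modulation layers
```

In the log above every mismatch is on the second axis, 1024 in the checkpoint versus 512 in the target model. One plausible reading, which is an assumption and not confirmed in this thread, is that the Lafite modulation layers take a 512-dimensional style vector concatenated with a 512-dimensional conditioning feature, so a stock StyleGAN2 generator cannot load these weights without a matching change to its modulation input size.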

lowering gpu requirement hyperparameter setting

First of all, thank you very much for this great work. It really helped me a lot, but I have a few questions.

I'm a beginner in text-to-image synthesis, so the following questions may be a little stupid.

Regarding the training environment, I only have one GPU (RTX 3090) available right now. I think this will not influence gamma (batch size = 16 on one GPU) or other hyperparameters such as itc and itd (on the COCO dataset with the ground-truth setting). Am I correct? If not, can you give me any tuning advice for this situation?

I think gamma is only sensitive to the dataset and the batch size per GPU.

In the paper, are you using the ada='noaug' setting because the training data is sufficient (after x-flip)?

Data set download

Thank you very much for your work. I am having trouble downloading the datasets. When downloading the MS-COCO Training Set, MS-COCO Validation Set, LN-COCO Training Set, and LN-COCO Testing Set, each download failed partway through.

Pre-trained Model on CUB dataset

Hi,
Thanks for the nice work! Is it possible to provide pre-trained models on the CUB dataset?
It would be helpful. Thanks

Best,
Kilichbek

pretrained model link

Hi, the Model trained on MS-COCO with Ground-truth Image-text Pairs, CLIP-ViT/B-16 is actually linked to the same link as Model Pre-trained On Google CC3M in the README. Could you update the correct ViT/B-16 model link? Thank you so much!

Can Lafite do the image-to-image generation?

Can Lafite do image-to-image generation?
When I fed an image feature to the Generator, the generated image was not as good as the images generated from text.
Is there another way to generate images from image input?
The image feature was extracted from CLIP as below.

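import torch
from PIL import Image

# Assumes clip_model, preprocess, generator, device, img_path and
# num_images_to_generate are already defined elsewhere.
image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
img_fts = clip_model.encode_image(image)
img_fts = img_fts/img_fts.norm(dim=-1,keepdim=True)  # L2-normalize the CLIP image feature

z = torch.randn((num_images_to_generate, 512)).to(device)
c = torch.randn((num_images_to_generate, 1)).to(device)  # label is actually not used
img, _ = generator.generate(z=z, c=c, fts=img_fts)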

Btw, thanks for sharing this wonderful work.

I created a notebook that runs on Google Colab

If there is any interest in including it here:
https://colab.research.google.com/github/pollinations/hive/blob/main/interesting_notebooks/LAFITE_generate.ipynb

Pretrained checkpoints for MM-CelebA-HQ

Hi,
The checkpoints that you have already shared are a great help to the project I'm working on.
Can you provide the checkpoints for the model trained from scratch on MM-CelebA-HQ?

Thanks!

resume pretrained model problem

I trained on a CelebA dataset with an image size of 256, loading a pretrained FFHQ256 model. I tried and ffhq256, but both failed with RuntimeError: The size of tensor a (512) must match the size of tensor b (256) at non-singleton dimension 0.
Strangely, when I switched to ffhq512, training ran successfully but produced terrible output.

Question about FID-k

Hi authors,

Again, I really appreciate you sharing your code.
I would like to ask a question about FID-k.
As written in your paper, FID-k means the FID is computed after blurring all the images with a Gaussian filter of radius k.

In that sentence, what is the value of sigma for the Gaussian filter? (sigma is the hyperparameter of torchvision.transforms.GaussianBlur, described at https://pytorch.org/vision/stable/generated/torchvision.transforms.GaussianBlur.html )

A question about the processed datasets (with CLIP-ViT/B-32)

Could you help me? You provide several commonly used datasets that you have already processed (with CLIP-ViT/B-32) in datasets.json. How should I process my own dataset, and how can I keep the processed images in the same order as the original unprocessed images?
Another question: why does running train.py need the val_dataset parameter?

About generated results

Hi,
I used the COCO2014_CLIP_ViTB32_all_text.pkl file to infer images based on the text 'A city street line with brick buildings and trees.' However, the results I obtained were inferior, as shown in the following image.

CUDA problem ：）

Training with language-free methods (pseudo image-text feature pairs)
My code:

python train.py --outdir=D:\\python_learning\\lafite\\Lafite-main\\data\\training-runs --data=D:\\python_learning\\data\\pokemon数据集\\905all\\256\\white\\dest_train\\train.zip --test_data=D:\\python_learning\\data\\pokemon数据集\\905all\\256\\white\\dest_test\\test.zip --mixing_prob=1.0 --cfg=auto

feedback:

Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"... Failed!

D:\python_learning\lafite\Lafite-main\torch_utils\ops\bias_act.py:43: UserWarning: Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation.
Details:

Traceback (most recent call last):
File "D:\python_learning\lafite\Lafite-main\torch_utils\ops\bias_act.py", line 41, in _init
_plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py", line 57, in get_plugin
raise RuntimeError(f'Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "{file}".')

RuntimeError: Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py".

warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())

Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
D:\python_learning\lafite\Lafite-main\torch_utils\ops\upfirdn2d.py:27: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Traceback (most recent call last):
File "D:\python_learning\lafite\Lafite-main\torch_utils\ops\upfirdn2d.py", line 25, in _init
_plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py", line 57, in get_plugin
raise RuntimeError(f'Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "{file}".')

RuntimeError: Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py".

warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
D:\python_learning\lafite\Lafite-main\torch_utils\ops\upfirdn2d.py:27: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
D:\python_learning\lafite\Lafite-main\torch_utils\ops\upfirdn2d.py:27: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Traceback (most recent call last):
File "train.py", line 636, in
main() # pylint: disable=no-value-for-parameter
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\click\core.py", line 1128, in call
return self.main(*args, **kwargs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\click\core.py", line 1053, in main
rv = self.invoke(ctx)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\click\core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\click\core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\click\decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "train.py", line 629, in main
subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
File "train.py", line 460, in subprocess_fn
training_loop.training_loop(rank=rank, **args)
File "D:\python_learning\lafite\Lafite-main\training\training_loop.py", line 181, in training_loop
img = misc.print_module_summary(G, [z, c])
File "D:\python_learning\lafite\Lafite-main\torch_utils\misc.py", line 205, in print_module_summary
outputs = module(*inputs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\python_learning\lafite\Lafite-main\training\networks.py", line 812, in forward
img = self.synthesis(ws, fts=fts, styles=styles, return_styles=return_styles, **synthesis_kwargs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\python_learning\lafite\Lafite-main\training\networks.py", line 759, in forward
x, img, style_list, style_init_list, style_res_list = block(x, img, cur_ws, fts=fts, styles=cur_style, **block_kwargs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\python_learning\lafite\Lafite-main\training\networks.py", line 667, in forward
x, style_init, style_res = self.conv1(x, next(w_iter), styles=next(s_iter), fts=fts, fused_modconv=fused_modconv, **layer_kwargs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\python_learning\lafite\Lafite-main\training\networks.py", line 472, in forward
padding=self.padding, resample_filter=self.resample_filter, flip_weight=flip_weight, fused_modconv=fused_modconv)
File "D:\python_learning\lafite\Lafite-main\torch_utils\misc.py", line 94, in decorator
return fn(*args, **kwargs)
File "D:\python_learning\lafite\Lafite-main\training\networks.py", line 51, in modulated_conv2d
dcoefs = (w.square().sum(dim=[2,3,4]) + 1e-8).rsqrt() # [NO]
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 2.00 GiB total capacity; 618.39 MiB already allocated; 0 bytes free; 760.00 MiB reserved in total by PyTorch)

The questions about parameters img_img_c and img_img_d

Thank you very much for your work. I was confused by something when reading the source code.

Why did you set the parameters img_img_c and img_img_d to 0? Is that setting good for the FID and IS of Lafite on COCO?

A Question on Injecting Text Condition into Each Generator Layer.

Hi, I found what seems to be a discrepancy between Figure 3 of the paper (and the related description in Section 3.2) and the code. Could you help verify whether my understanding is correct? Thanks a lot! Of course, the discrepancy is not an error and does not affect correctness at all. I am currently developing a TTI work based on LAFITE and would like to be clearer about the details.

In Figure 3 of the LAFITE paper, there are two types of affine mappings: (1) mapping W to S, and (2) mapping the concatenation [S, C] to U.

However, in networks.py [L454-L457], there is only one affine mapping: it simply concatenates W (rather than S) with C and maps the result to U directly.

FID evaluation

Hi,

I tried to evaluate the model with the command:
python calc_metrics.py --network=./some_pre-trained_models.pkl --metrics=fid50k_full,is50k --data=./training_data.zip --test_data=./testing_data.zip

I found that in the function compute_feature_stats_for_dataset in metric_utils.py, the training images are used, not the test images.
This confuses me. Did I make a mistake, or is this what you intended to implement?

I think the features of the test images should be computed for FID.
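
To illustrate why the choice of reference set matters, here is a deliberately simplified sketch: a one-dimensional Fréchet distance between Gaussians fitted to scalar "features". Real FID uses multivariate Inception features with full covariance matrices, so this is only an analogy, and all numbers below are made up.

```python
import statistics


def frechet_1d(reference, generated):
    """Squared Frechet distance between 1-D Gaussians fitted to two samples.

    For univariate Gaussians this reduces to
    (mean difference)^2 + (standard deviation difference)^2.
    """
    mu_r, mu_g = statistics.fmean(reference), statistics.fmean(generated)
    sd_r, sd_g = statistics.pstdev(reference), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


# Made-up scalar features standing in for Inception activations.
train_feats = [0.0, 1.0, 2.0, 3.0]
test_feats = [1.0, 2.0, 3.0, 4.0]
generated_feats = [0.0, 1.0, 2.0, 4.0]

print(frechet_1d(train_feats, generated_feats))  # reference = training split
print(frechet_1d(test_feats, generated_feats))   # reference = test split
```

The same generated samples receive different scores depending on which split supplies the reference statistics, which is why the split used for a reported FID should be stated explicitly.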

best

Question about finetuning hyperparams for mscoco

Hi,
I was wondering what hyperparameters you used for finetuning the pre-trained CC3M model on MSCOCO (language-free gaussian) and how many kimgs was it finetuned for? Also, I'm starting to finetune the model myself and it seems to take very long. I'm only using one GPU but it takes roughly 3 hours for 50 ticks. Is that expected?

Thanks!

A Question on the Temperature in the Contrastive Loss

Hi Yufan, thank you again for your great work! I have a small question about the role of the temperature: there seems to be a small mismatch between the formula in the paper and the code, and I got a bit confused. It would be super helpful if you could kindly provide more details or thoughts.

In the paper the contrastive loss is defined in formulas (6) and (7) as L = - τ log(A/B), but in the code L153-L60 in loss.py, it is actually L = - log(τ * A/B). Considering that - log(τ * A/B) = - log(τ) - log(A/B), and that the term - log(τ) is a constant, it will have no influence on the gradient. So I would recommend rewriting formulas (6) and (7) in the paper as simply L = - log(A/B) without τ as a multiplier. Please correct me if my understanding is not correct.

I was also wondering if you could let me know why, in your original design, there is a τ multiplier outside the log(), since in a standard InfoNCE loss τ usually appears only inside exp(·/τ) and is not used outside the log().
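
The argument above can be checked numerically. The sketch below assumes A = exp(s_0/τ) for the matched pair and B = Σ_j exp(s_j/τ), then compares finite-difference gradients of the three loss forms mentioned in the question; the function names and similarity values are made up for illustration.

```python
import math


def log_ratio(sims, temp):
    """log(A/B), with A = exp(sims[0]/temp) and B = sum_j exp(sims[j]/temp)."""
    logits = [s / temp for s in sims]
    peak = max(logits)
    log_b = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return logits[0] - log_b


def loss_code(sims, temp):
    """-log(temp * A/B): the form the question attributes to the code."""
    return -(math.log(temp) + log_ratio(sims, temp))


def loss_plain(sims, temp):
    """-log(A/B): the form the question proposes."""
    return -log_ratio(sims, temp)


def loss_paper(sims, temp):
    """-temp * log(A/B): the form the question attributes to the paper."""
    return -temp * log_ratio(sims, temp)


def numeric_grad(fn, sims, temp, step=1e-6):
    """Central finite-difference gradient of fn with respect to sims."""
    grad = []
    for i in range(len(sims)):
        up, down = list(sims), list(sims)
        up[i] += step
        down[i] -= step
        grad.append((fn(up, temp) - fn(down, temp)) / (2 * step))
    return grad


sims, temp = [0.3, 0.2, 0.1], 0.5
print(numeric_grad(loss_code, sims, temp))
print(numeric_grad(loss_plain, sims, temp))
print(numeric_grad(loss_paper, sims, temp))
```

The first two gradients coincide because the losses differ only by the constant -log(τ), while the paper's form gives gradients scaled by τ, which acts like a change of learning rate or loss weight for that term.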

Some questions about the experimental setting

Thanks for your excellent work. By the way, happy Mid-Autumn Festival.
I have some questions about the experimental setting.

How many times have you run Lafite on COCO in the supervised setting? Do you report the best FID, 8.12, out of all runs?
How do you pre-process the original COCO images to 256x256 resolution: interpolation or center crop?
