drboog / lafite
Code for paper LAFITE: Towards Language-Free Training for Text-to-Image Generation (CVPR 2022)
License: MIT License
Hi, I'm currently putting together a paper on a different language-free text-to-image generation method, and I was wondering whether you have a pre-trained CelebA-HQ model that you could share with me for comparison purposes.
Hi, you mentioned that all of your experiments were conducted on 4 Nvidia Tesla V100 GPUs, and that MS-COCO, CUB, LN-COCO and CelebA are the datasets considered.
I am wondering how long the training process takes for each of these datasets, and how many iterations and epochs (and what batch size) were used.
(You mentioned 4 days to reach an FID of 18 on MS-COCO; how about the other datasets?)
Hi, would you leave a detailed comment on how to train on a custom dataset?
Truly thankful.
Hi authors,
I really appreciate you sharing your code.
I would like to ask a question about the intuition behind injecting random noise instead of using the image feature embedded by CLIP.
Thank you in advance!
I don't know why this program runs so slowly. Is there anything I can do to speed it up?
PS: Out of habit I usually run my code in Jupyter Notebook, but this code would not run there even though all the packages were installed successfully and the functions themselves had no problems. It does run from CMD (albeit slowly). What could be the reason for this?
Your Lafite is a very important model. It is very exciting for me to find that Lafite can generate complex images of such high quality, showing amazing performance even compared with the large pre-trained models.
I am interested in some training details of Lafite.
Are the standard text-to-image Lafite models of different datasets in table 3 fine-tuned from pre-trained models?
If they are fine-tuned, how was the model pre-trained for the different datasets? What is the pre-training dataset? Which training mode (language-free or with ground truth) was used for pre-training?
Btw, thanks for sharing this wonderful work.
The pre-trained model COCO2014_CLIP_ViT32_all_text_FID.pkl has the 'xflip': True. So, could I assume that to achieve the score in the paper we use horizontal flip but no other augmentation?
Then, --mirror should be added in the command line, I assumed.
Thanks for your excellent work! I have some questions about calculating FID on MS-COCO.
First, how do you randomly sample text from the COCO2014 validation set? I think there are two ways: one is to sample 30k images from the 40k images and then sample one caption for each image; the other is to sample 30k captions from the 200k captions in the COCO2014 validation set. The first way will give a better FID because only one caption is sampled for each image.
Second, after getting the 30k generated images, I am confused about whether the COCO2014 train set or validation set is used for calculating FID, especially for the best FID of 8.12 in your paper. Using the train set for the calculation will give a better FID.
I am going to cite your paper and, for reasons of fairness, I want to understand your evaluation method fully. I would very much appreciate it if you could answer my questions.
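For readers comparing the two sampling strategies described above, here is a rough Python sketch. It is only an illustration on toy data: the function names and the `captions_by_image` structure are made up for this example, and it says nothing about which protocol the authors actually used.

```python
import random


def sample_one_caption_per_image(captions_by_image, n_images, seed=0):
    """Strategy 1: pick n_images distinct images, then one caption for each."""
    rng = random.Random(seed)
    image_ids = rng.sample(sorted(captions_by_image), n_images)
    return [(image_id, rng.choice(captions_by_image[image_id])) for image_id in image_ids]


def sample_captions_globally(captions_by_image, n_captions, seed=0):
    """Strategy 2: pick n_captions from the pool of all (image, caption) pairs."""
    rng = random.Random(seed)
    pairs = [(image_id, caption)
             for image_id in sorted(captions_by_image)
             for caption in captions_by_image[image_id]]
    return rng.sample(pairs, n_captions)


# Toy stand-in for the validation set: 40 images with 5 captions each.
toy = {image_id: [f"caption {image_id}-{k}" for k in range(5)] for image_id in range(40)}

per_image = sample_one_caption_per_image(toy, 30)
pooled = sample_captions_globally(toy, 30)

# Strategy 1 always covers 30 distinct images; strategy 2 may repeat images.
print(len({image_id for image_id, _ in per_image}))
print(len({image_id for image_id, _ in pooled}))
```

With strategy 2 the same image can contribute several captions, so the number of distinct images covered can be smaller than the number of captions sampled.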
I am trying to reproduce "Training with ground-truth pairs" using the command line:
python train.py --gpus=4 --outdir=./outputs/ --temp=0.5 --itd=10 --itc=20 --gamma=10 --data=./datasets/COCO2014_train_CLIP_ViTB32.zip --test_data=./datasets/COCO2014_val_CLIP_ViTB32.zip --mixing_prob=0.0
with the downloaded CLIP features. I didn't touch any hyperparameter. I resumed the training with --resume path/to/pkl.
Unfortunately, I had to resume the training, and it now looks unlikely that I can reproduce the target of FID=8.6 at 25,000 kimg, as in the pretrained model (I checked the pretrained model, and it's fine).
3,024 kimg got 17.425
At this point, I resumed.
3,024 (previously) + 3,024 (resumed log says) = 6,048 kimg got 14.27
3,024 (previously) + 6,048 (resumed log says) = 9,072 kimg got 12.65
3,024 (previously) + 9,072 (resumed log says) = 12,096 kimg got 12.15
3,024 (previously) + 14,515 (resumed log says) = 17,539 kimg got 11.96
At this point, I resumed.
17,539 (previously in total) + 0 (resumed log says) = 17,539 kimg got 11.93 (confirmed correctly resumed)
17,539 (previously in total) + 8,064 (resumed log says) = 25,603 kimg got 11.55
So I am wondering whether the resuming mechanism may hurt the integrity of the optimization, for example through the optimizer states. Do you have any suggestions for this situation?
By the way, if you can share your training log, it would be helpful.
Hi, from my understanding, dataset_tool.py supports datasets having more than one line of text caption per image, but the testing code only allows a single line to be passed as input per query. I would like to know how/where in the codebase these multiple caption lines are passed for training, and whether the inference code can be updated for testing with multi-line caption queries.
I am using the Ground-Truth pair based training approach for evaluation. Thanks in advance.
Hi,
Thanks for sharing this wonderful work. Could you please explain how to tune itd, itc and gamma for a custom dataset?
Thanks in advance.
Can you share how to preprocess the COCO dataset into your format:
1.png
1.txt
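One possible way to produce the caption half of that layout is sketched below. It assumes the standard COCO captions annotation JSON (an "images" list with "id" and "file_name", and an "annotations" list with "image_id" and "caption"). The function name, the 1-based numbering and the one-caption-per-line convention are assumptions made for this example, not something confirmed by the authors; converting the images themselves to N.png is left out because it needs an image library.

```python
import json
from collections import defaultdict
from pathlib import Path


def write_caption_files(annotation_json, out_dir):
    """Write <index>.txt (one caption per line) per image; return index -> file_name."""
    data = json.loads(Path(annotation_json).read_text(encoding="utf-8"))

    # Group captions by image id, collapsing stray whitespace and newlines.
    captions = defaultdict(list)
    for ann in data["annotations"]:
        captions[ann["image_id"]].append(" ".join(ann["caption"].split()))

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    mapping = {}
    images = sorted(data["images"], key=lambda image: image["id"])
    for index, image in enumerate(images, start=1):
        text = "\n".join(captions[image["id"]]) + "\n"
        (out / f"{index}.txt").write_text(text, encoding="utf-8")
        # The returned mapping tells you which source image should become <index>.png.
        mapping[index] = image["file_name"]
    return mapping
```

The returned mapping can then drive whatever image conversion step you prefer, so that 1.png and 1.txt refer to the same COCO image.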
Hi, when I read the code I noticed the comment "# This implementation is incorrect, it should be sim=sim/temp." at L149 in loss.py. I later found that L45 has the same "bug". If I correct L149, do I also need to correct L45?
Also, I found that the Model class in loss.py is defined at L75-L77 but never called afterwards. What is the purpose of the Model class here?
It seems that the format of the dataset you provided is wrong
First of all, thank you very much for your paper; it has been a huge help to me. The project you uploaded has also greatly helped my research. I want to ask you a few questions.
1. Are the results shown in the paper based on "sim = torch.exp(sim/temp), itd=10, itc=20"? And what is the result with "sim = sim/temp, itd=5, itc=10"? Under the "sim = sim/temp" setting, is "itd=5, itc=10" optimal?
2. I am using 4 Nvidia 1080 GPUs for training, and a 25000 kimg experiment takes me 15 days. I would like to know what hardware you used and how long one training run takes.
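To make the difference between the two settings in question 1 concrete, here is a small pure-Python sketch. It assumes the similarities are passed through a softmax afterwards, which is an assumption for illustration only; the similarity values are made up and this is not the repository's loss code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


temp = 0.5
# Made-up cosine similarities; index 0 plays the role of the matching pair.
sims = [0.9, 0.2, -0.1, 0.3]

# Variant "sim = torch.exp(sim/temp)": the softmax sees already-exponentiated values.
probs_exp = softmax([math.exp(s / temp) for s in sims])

# Variant "sim = sim/temp": the softmax sees temperature-scaled similarities.
probs_lin = softmax([s / temp for s in sims])

print(probs_exp)
print(probs_lin)
```

On this toy input the doubly exponentiated variant puts more mass on the matching pair than the plain variant, which may be one reason hyperparameters such as itd and itc do not transfer directly between the two settings.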
Hi,
Thanks for the wonderful work! I am trying to export the pretrained weights released in this repository to rosinality's implementation. I am using the export script from the stylegan2-ada-pytorch repository which is known to work for checkpoints using that repo (see here). However, when I used the converted Lafite checkpoints for the pretrained model COCO2014_CLIP_ViTB32_all_text.pkl, I am facing the following error:
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Generator:
size mismatch for conv1.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for to_rgb1.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.0.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.1.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.2.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.3.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.4.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.5.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.6.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.7.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.8.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for convs.9.conv.modulation.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([256, 512]).
size mismatch for convs.10.conv.modulation.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([256, 512]).
size mismatch for convs.11.conv.modulation.weight: copying a param with shape torch.Size([128, 1024]) from checkpoint, the shape in current model is torch.Size([128, 512]).
size mismatch for to_rgbs.0.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for to_rgbs.1.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for to_rgbs.2.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for to_rgbs.3.conv.modulation.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
size mismatch for to_rgbs.4.conv.modulation.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([256, 512]).
size mismatch for to_rgbs.5.conv.modulation.weight: copying a param with shape torch.Size([128, 1024]) from checkpoint, the shape in current model is torch.Size([128, 512]).
Any idea why this mismatch might be happening?
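A quick way to summarize such errors is to compare parameter shapes directly. The sketch below uses a few names and shapes copied from the log above; the helper name is hypothetical. In every reported entry the checkpoint's modulation weight has 1024 input features where the target model expects 512, which would be consistent with the style layers taking an extra 512-dimensional conditioning input, though that reading is an assumption here.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters that differ."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Shapes taken from the error message. With PyTorch they could be collected via
# {k: tuple(v.shape) for k, v in state_dict.items()}.
checkpoint_shapes = {
    "conv1.conv.modulation.weight": (512, 1024),
    "convs.9.conv.modulation.weight": (256, 1024),
    "to_rgbs.5.conv.modulation.weight": (128, 1024),
}
model_shapes = {
    "conv1.conv.modulation.weight": (512, 512),
    "convs.9.conv.modulation.weight": (256, 512),
    "to_rgbs.5.conv.modulation.weight": (128, 512),
}

for name, (ckpt, model) in find_shape_mismatches(checkpoint_shapes, model_shapes).items():
    extra_inputs = ckpt[1] - model[1]
    print(f"{name}: checkpoint {ckpt} vs model {model}, extra input features: {extra_inputs}")
```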
First of all, thank you very much for this great work. It really helped me a lot, but I have a few questions that I want to ask.
I'm a beginner in the text-to-image synthesis task, so the following questions may be a little stupid.
I think gamma is only sensitive to the dataset and the batch size per GPU.
Thank you very much for your work. I now have a problem with the dataset downloads failing. When downloading the MS-COCO Training Set, MS-COCO Validation Set, LN-COCO Training Set and LN-COCO Testing Set, each download failed partway through.
Hi,
Thanks for the nice work! Is it possible to provide pre-trained models on the CUB dataset?
It would be helpful. Thanks
Best,
Kilichbek
Hi, the "Model trained on MS-COCO with Ground-truth Image-text Pairs, CLIP-ViT/B-16" entry in the README actually points to the same link as "Model Pre-trained On Google CC3M". Could you update it with the correct ViT/B-16 model link? Thank you so much!
Can Lafite do image-to-image generation?
When I fed an image feature to the generator, the generated image was not as good as the images generated from text.
Is there another way to generate images from an image input?
The image feature was extracted with CLIP as below.
image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
img_fts = clip_model.encode_image(image)
img_fts = img_fts/img_fts.norm(dim=-1,keepdim=True)
z = torch.randn((num_images_to_generate, 512)).to(device)
c = torch.randn((num_images_to_generate, 1)).to(device) # label is actually not used
img, _ = generator.generate(z=z, c=c, fts=img_fts)
Btw, thanks for sharing this wonderful work.
If there is any interest in including it here:
https://colab.research.google.com/github/pollinations/hive/blob/main/interesting_notebooks/LAFITE_generate.ipynb
Hi,
The checkpoints that you have already shared are a great help to the project I'm working on.
Can you provide the checkpoints for the MM-CelebA-HQ model trained from scratch?
Thanks!
I trained on a CelebA dataset with image size 256, loading the pretrained FFHQ256 model. I tried and ffhq256, but both failed with RuntimeError: The size of tensor a (512) must match the size of tensor b (256) at non-singleton dimension 0.
Strangely, when I switched to ffhq512, training ran successfully but with terrible output.
Hi authors,
Again, I really appreciate you sharing your code.
I would like to ask a question about FID-k.
As written in your paper, FID-k means the FID is computed after blurring all the images by a Gaussian filter with radius k.
In that sentence, what is the value of sigma for the Gaussian filter? (sigma is the hyperparameter of torchvision.transforms.GaussianBlur, which is described at https://pytorch.org/vision/stable/generated/torchvision.transforms.GaussianBlur.html )
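To illustrate why the radius alone does not pin down the filter, here is a small pure-Python sketch of a 1D Gaussian kernel. It assumes a kernel of size 2k+1 for radius k, which is only one possible convention; the function name is made up, and this is neither the paper's definition nor torchvision's implementation.

```python
import math


def gaussian_kernel_1d(radius, sigma):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in offsets]
    total = sum(weights)
    return [w / total for w in weights]


# Same radius, two different sigmas: the amount of blur differs noticeably.
k_narrow = gaussian_kernel_1d(radius=2, sigma=0.5)
k_wide = gaussian_kernel_1d(radius=2, sigma=2.0)

print([round(w, 3) for w in k_narrow])
print([round(w, 3) for w in k_wide])
```

With a small sigma most of the weight stays on the center pixel, while a large sigma spreads it across the window, so two FID-k numbers computed with the same k but different sigmas would not be comparable.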
Could you help me? In datasets.json you provide several commonly used datasets that you have already processed (with CLIP-ViT/B-32). How should I process my own dataset, and how do I keep the processed images in the same order as the original unprocessed images?
Another question: why does running train.py need the val_dataset parameter?
Training with language-free methods (pseudo image-text feature pairs)
My code:
python train.py --outdir=D:\\python_learning\\lafite\\Lafite-main\\data\\training-runs --data=D:\\python_learning\\data\\pokemon数据集\\905all\\256\\white\\dest_train\\train.zip --test_data=D:\\python_learning\\data\\pokemon数据集\\905all\\256\\white\\dest_test\\test.zip --mixing_prob=1.0 --cfg=auto
feedback:
Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"... Failed!
D:\python_learning\lafite\Lafite-main\torch_utils\ops\bias_act.py:43: UserWarning: Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation.
Details:
Traceback (most recent call last):
File "D:\python_learning\lafite\Lafite-main\torch_utils\ops\bias_act.py", line 41, in _init
_plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py", line 57, in get_plugin
raise RuntimeError(f'Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "{file}".')
RuntimeError: Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py".
warnings.warn('Failed to build CUDA kernels for bias_act. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
D:\python_learning\lafite\Lafite-main\torch_utils\ops\upfirdn2d.py:27: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:
Traceback (most recent call last):
File "D:\python_learning\lafite\Lafite-main\torch_utils\ops\upfirdn2d.py", line 25, in _init
_plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py", line 57, in get_plugin
raise RuntimeError(f'Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "{file}".')
RuntimeError: Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py".
warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
D:\python_learning\lafite\Lafite-main\torch_utils\ops\upfirdn2d.py:27: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:
Traceback (most recent call last):
File "D:\python_learning\lafite\Lafite-main\torch_utils\ops\upfirdn2d.py", line 25, in _init
_plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py", line 57, in get_plugin
raise RuntimeError(f'Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "{file}".')
RuntimeError: Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py".
warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
D:\python_learning\lafite\Lafite-main\torch_utils\ops\upfirdn2d.py:27: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:
Traceback (most recent call last):
File "D:\python_learning\lafite\Lafite-main\torch_utils\ops\upfirdn2d.py", line 25, in _init
_plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
File "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py", line 57, in get_plugin
raise RuntimeError(f'Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "{file}".')
RuntimeError: Could not find MSVC/GCC/CLANG installation on this computer. Check _find_compiler_bindir() in "D:\python_learning\lafite\Lafite-main\torch_utils\custom_ops.py".
warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Traceback (most recent call last):
File "train.py", line 636, in
main() # pylint: disable=no-value-for-parameter
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\click\core.py", line 1128, in call
return self.main(*args, **kwargs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\click\core.py", line 1053, in main
rv = self.invoke(ctx)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\click\core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\click\core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\click\decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "train.py", line 629, in main
subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
File "train.py", line 460, in subprocess_fn
training_loop.training_loop(rank=rank, **args)
File "D:\python_learning\lafite\Lafite-main\training\training_loop.py", line 181, in training_loop
img = misc.print_module_summary(G, [z, c])
File "D:\python_learning\lafite\Lafite-main\torch_utils\misc.py", line 205, in print_module_summary
outputs = module(*inputs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\python_learning\lafite\Lafite-main\training\networks.py", line 812, in forward
img = self.synthesis(ws, fts=fts, styles=styles, return_styles=return_styles, **synthesis_kwargs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\python_learning\lafite\Lafite-main\training\networks.py", line 759, in forward
x, img, style_list, style_init_list, style_res_list = block(x, img, cur_ws, fts=fts, styles=cur_style, **block_kwargs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\python_learning\lafite\Lafite-main\training\networks.py", line 667, in forward
x, style_init, style_res = self.conv1(x, next(w_iter), styles=next(s_iter), fts=fts, fused_modconv=fused_modconv, **layer_kwargs)
File "D:\app\Anaconda\envs\torch-py37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\python_learning\lafite\Lafite-main\training\networks.py", line 472, in forward
padding=self.padding, resample_filter=self.resample_filter, flip_weight=flip_weight, fused_modconv=fused_modconv)
File "D:\python_learning\lafite\Lafite-main\torch_utils\misc.py", line 94, in decorator
return fn(*args, **kwargs)
File "D:\python_learning\lafite\Lafite-main\training\networks.py", line 51, in modulated_conv2d
dcoefs = (w.square().sum(dim=[2,3,4]) + 1e-8).rsqrt() # [NO]
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 2.00 GiB total capacity; 618.39 MiB already allocated; 0 bytes free; 760.00 MiB reserved in total by PyTorch)
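Two separate things show up in this log: the plugin warnings ("Could not find MSVC/GCC/CLANG installation"), which only trigger the slow reference implementation, and the final CUDA out-of-memory error on a GPU with 2.00 GiB total capacity, which is what actually stops the run. For the first part, a minimal check of whether any common compiler executable is visible on PATH could look like the sketch below. Note the assumption: custom_ops.py may search fixed install directories rather than PATH, so this is only a first diagnostic, and the helper name is made up.

```python
import shutil


def find_compilers(candidates=("cl", "gcc", "g++", "clang")):
    """Map each candidate compiler name to its path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in candidates}


found = find_compilers()
for name, path in found.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If nothing is found, installing a C++ toolchain would be the first thing to try for the speed problem; the out-of-memory error is independent of that and points at the GPU's memory size.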
Thank you very much for your work. I have some confusion when I read the source code.
Why did you set the parameters img_img_c and img_img_d to 0? Are they good for FID and IS of Lafite on COCO?
Hi, I found that there seems to be a discrepancy between the paper's Figure 3 (and the related description in Section 3.2) and the code. Could you help verify whether my understanding is correct? Thanks a lot! Of course, the discrepancy is not an error and does not affect correctness at all. I am just currently developing a TTI work based on LAFITE and hope to be clearer about the details.
In the LAFITE paper's Fig. 3, there are two types of affine mappings: (1) mapping W to S, and (2) mapping the concatenation [S, C] to U.
However, in networks.py [L454-L457], there is only one affine mapping. It simply concatenates W (rather than S) with C, and a single affine mapping maps the result to U directly.
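The two variants described above can be contrasted with a toy sketch. Everything here is illustrative: the dimensions are tiny made-up numbers, the weights are random, and the helper names are hypothetical. It only shows the shape of the two computations, not the repository's actual layers.

```python
import random


def affine(weight, bias, x):
    """Compute weight @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def random_affine(out_dim, in_dim, rng):
    """Random (weight, bias) pair with a zero bias, for illustration only."""
    weight = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


rng = random.Random(0)
w_dim, c_dim, s_dim, u_dim = 4, 3, 5, 5  # toy sizes, not the real ones
w = [rng.gauss(0.0, 1.0) for _ in range(w_dim)]  # intermediate latent W
c = [rng.gauss(0.0, 1.0) for _ in range(c_dim)]  # conditioning feature C

# As described for the paper's Fig. 3: W -> S, then [S, C] -> U (two affine maps).
map_w_to_s = random_affine(s_dim, w_dim, rng)
map_sc_to_u = random_affine(u_dim, s_dim + c_dim, rng)
s = affine(*map_w_to_s, w)
u_paper = affine(*map_sc_to_u, s + c)  # list "+" concatenates S and C

# As described for the code: [W, C] -> U with a single affine map.
map_wc_to_u = random_affine(u_dim, w_dim + c_dim, rng)
u_code = affine(*map_wc_to_u, w + c)

print(len(u_paper), len(u_code))
```

Since the first mapping has no nonlinearity after it, the composition of the two affine maps is itself an affine function of [W, C], which supports the point that the difference does not change what the layer can represent.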
Hi,
I tried to evaluate the model with the command:
python calc_metrics.py --network=./some_pre-trained_models.pkl --metrics=fid50k_full,is50k --data=./training_data.zip --test_data=./testing_data.zip
I found that in the function compute_feature_stats_for_dataset in metric_utils.py, the training images are used, not the test images.
This confuses me. Did I make a mistake, or is this what you intended to implement?
I think the features of the test images should be computed for FID.
Best
Hi,
I was wondering what hyperparameters you used for finetuning the pre-trained CC3M model on MSCOCO (language-free gaussian) and how many kimgs was it finetuned for? Also, I'm starting to finetune the model myself and it seems to take very long. I'm only using one GPU but it takes roughly 3 hours for 50 ticks. Is that expected?
Thanks!
Hi Yufan, thank you again for your great work! I have a small question about the role of the temperature: there seems to be a small mismatch between the formula in the paper and the code, and I got a bit confused. It would be super helpful if you could kindly provide more details/thoughts.
In the paper the contrastive loss is defined in formulas (6) and (7) as L = - τ log(A/B), but in the code at L153-L60 in loss.py, it is actually L = - log(τ * A/B). Since - log(τ * A/B) = - log(τ) - log(A/B), and the term - log(τ) is a constant, it has no influence on the gradient. So I would recommend rewriting formulas (6) and (7) in the paper as simply L = - log(A/B), without τ as a multiplier. Please correct me if my understanding is wrong.
I was also wondering if you could let me know why, in your original design, there is a τ multiplier outside of the log(), since in a standard InfoNCE loss τ usually only appears inside exp(·/τ) and is not used outside of the log().
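The gradient argument above can be checked numerically with a small pure-Python sketch. The similarity values and helper names are made up for illustration; A = exp(s_pos/τ) and B = Σ_j exp(s_j/τ) are assumed as in a standard InfoNCE-style ratio, and gradients are taken with respect to the similarities by central finite differences.

```python
import math


def log_ratio(sims, pos, tau):
    """log(A / B) with A = exp(sims[pos]/tau) and B = sum_j exp(sims[j]/tau)."""
    logits = [s / tau for s in sims]
    m = max(logits)
    log_b = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[pos] - log_b


def loss_plain(sims, pos, tau):
    return -log_ratio(sims, pos, tau)                      # L = -log(A/B)


def loss_tau_inside(sims, pos, tau):
    return -(math.log(tau) + log_ratio(sims, pos, tau))    # L = -log(tau * A/B)


def loss_tau_outside(sims, pos, tau):
    return -tau * log_ratio(sims, pos, tau)                # L = -tau * log(A/B)


def grad(loss, sims, pos, tau, eps=1e-6):
    """Central finite-difference gradient of loss with respect to each similarity."""
    result = []
    for i in range(len(sims)):
        up = list(sims)
        down = list(sims)
        up[i] += eps
        down[i] -= eps
        result.append((loss(up, pos, tau) - loss(down, pos, tau)) / (2 * eps))
    return result


sims = [0.8, 0.1, -0.3, 0.4]  # made-up similarities; index 0 is the positive pair
tau = 0.5

g_plain = grad(loss_plain, sims, 0, tau)
g_inside = grad(loss_tau_inside, sims, 0, tau)
g_outside = grad(loss_tau_outside, sims, 0, tau)

print(g_plain)
print(g_inside)   # matches g_plain: the -log(tau) term is constant
print(g_outside)  # equals tau * g_plain: the outer tau only rescales the gradient
```

So for a fixed τ, placing τ inside the log leaves the gradient unchanged, while placing it outside acts like a constant scale on the loss weight.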
Thanks for your excellent work. By the way, Happy Mid Autumn Festival.
I have some questions about the experimental setting.