
Comments (20)

nipponjo commented on June 26, 2024

Hi, that looks strange to me. Have you adapted the mask options to the different image size? When do these artifacts start to show?


ohayonguy commented on June 26, 2024

@nipponjo Thanks for the quick response!

Yes. I have adapted the mask options. This is my training configuration file.

# resume training
model_restore: '' # start new training
#model_restore: 'checkpoints/celebahq/model_exp0/states.pth'

# dataloading
dataset_path: '/home/ohayonguy/research/datasets/celeba/img_align_celeba_splits/train/img_align_celeba'
scan_subdirs: True # Are the images organized in subfolders?
random_crop: False  # Set to false when dataset is 'celebahq', meaning only resize the images to img_shapes, instead of crop img_shapes from a larger raw image. This is useful when you train on images with different resolutions like places2. In these cases, please set random_crop to true.
random_horizontal_flip: False
batch_size: 16
num_workers: 10

# training
tb_logging: True # Enable Tensorboard logging?
log_dir: 'tb_logs/celebahq/model_exp0' # Tensorboard logging folder
checkpoint_dir: 'checkpoints/celebahq/model_exp0' # Checkpoint folder

use_cuda_if_available: True
random_seed: False # options: False | <int>

g_lr: 0.0001 # lr for Adam optimizer (generator)
g_beta1: 0.5 # beta1 for Adam optimizer (generator)
g_beta2: 0.999 # beta2 for Adam optimizer (generator)

d_lr: 0.0001 # lr for Adam optimizer (discriminator)
d_beta1: 0.5 # beta1 for Adam optimizer (discriminator)
d_beta2: 0.999 # beta2 for Adam optimizer (discriminator)

max_iters: 1000000 # number of batches to train the models

# logging
viz_max_out: 10 # number of images from batch 
# if optional: set to False to deactivate 
print_iter: 100 # write losses to console and tensorboard
save_checkpoint_iter: 100 # save checkpoint file and overwrite last one
save_imgs_to_tb_iter: 500 # (optional) add image grids to tensorboard
save_imgs_to_dics_iter: 500 # (optional) save image grids in checkpoint folder
save_cp_backup_iter: 5000 # (optional) save checkpoint file named states_{n_iter}.pth

img_shapes: [128, 128, 3]

# mask options
height: 50
width: 50
max_delta_height: 32
max_delta_width: 32
vertical_margin: 0
horizontal_margin: 0

# loss
gan_loss: 'hinge' # options: 'hinge', 'ls'
gan_loss_alpha: 1.

ae_loss: True
l1_loss_alpha: 1.
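
For reference, here is my understanding of how these mask options are used, as a rough sketch of DeepFill's random bbox sampling (my paraphrase, not this repo's exact code):

import numpy as np

def random_bbox_mask(img_h=128, img_w=128, h=50, w=50,
                     max_delta_h=32, max_delta_w=32,
                     v_margin=0, h_margin=0):
    # sample the top-left corner of an h x w box inside the margins
    top = np.random.randint(v_margin, img_h - v_margin - h + 1)
    left = np.random.randint(h_margin, img_w - h_margin - w + 1)
    # the deltas randomly shrink the hole, so it is *at most* h x w
    dh = np.random.randint(max_delta_h // 2 + 1)
    dw = np.random.randint(max_delta_w // 2 + 1)
    mask = np.zeros((img_h, img_w), dtype=np.float32)
    mask[top + dh: top + h - dh, left + dw: left + w - dw] = 1.0
    return mask

So with height/width 50 and max_delta 32, the hole can shrink down to 18x18 pixels.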

Does this seem okay to you?

These artifacts start to show right at the beginning:
[image]

Also, take a look at stage 2:
[image: stage 2 output]


nipponjo commented on June 26, 2024

The config looks okay, though I would also reduce max_delta_height/width. Do you use batch or instance norm? I had some complications with those. When I tested with 128px, I also removed some layers in the generator, and one in the discriminator.


ohayonguy commented on June 26, 2024

I am using the same networks that are provided in this repo, without any changes.
I took another look at the discriminator architecture. Why would we need to remove a layer from the discriminator and some layers from the generator when the image size changes?

So you're saying that 128x128 images worked for you in some experiment you performed?


nipponjo commented on June 26, 2024

With an image size of 128px, the discriminator with 6 layers outputs 2x2 feature maps (instead of 4x4), but I don't think that should be a problem. I think it worked in some experiments, but I am currently unable to find them. I will try with your config tomorrow (Jan 14).
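
(A quick sanity check of where the 2x2 comes from, assuming six stride-2 convolutions:)

size = 128
for _ in range(6):   # six stride-2 conv layers in the discriminator
    size //= 2       # each halves the spatial resolution
print(size)          # 2, i.e. 2x2 feature maps (4x4 for 256px inputs)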


ohayonguy commented on June 26, 2024

Thanks a lot!!


nipponjo commented on June 26, 2024

Hi, I ran a few tests, and here is what I think:
With 128px images, in general, I would recommend making these changes:

mask options:
height: 64
width: 64
max_delta_height: 16
max_delta_width: 16

remove these layers:
conv_bn5 (CoarseGenerator)
conv_conv_bn5 and ca_conv_bn5 (FineGenerator)
conv6 (Discriminator)

However, for this face dataset, I think the 32x32 bottleneck is too narrow to produce good results.
With 256px images, the bottleneck resolution is 64x64, which makes it much easier to preserve detailed information.

One can remove one up-/downsampling stage (that's what I did), but that won't save much compute compared to using the 256px images.
Alternatively, I assume that skip connections (as in U-Net, for example) could help in this case.
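
A minimal sketch of what such a skip connection could look like (a toy encoder-decoder with illustrative layer names and channel counts, not this repo's actual generator):

import torch
import torch.nn as nn

class SkipSketch(nn.Module):
    def __init__(self, cnum=48):
        super().__init__()
        self.down1 = nn.Conv2d(3, cnum, 3, stride=2, padding=1)                # 128 -> 64
        self.down2 = nn.Conv2d(cnum, 2 * cnum, 3, stride=2, padding=1)         # 64 -> 32
        self.up1 = nn.ConvTranspose2d(2 * cnum, cnum, 4, stride=2, padding=1)  # 32 -> 64
        self.fuse = nn.Conv2d(2 * cnum, cnum, 3, padding=1)   # fuse skip + upsampled
        self.up2 = nn.ConvTranspose2d(cnum, 3, 4, stride=2, padding=1)         # 64 -> 128

    def forward(self, x):
        f64 = torch.relu(self.down1(x))       # 64x64 features, kept for the skip
        f32 = torch.relu(self.down2(f64))
        u64 = torch.relu(self.up1(f32))
        u64 = torch.cat([u64, f64], dim=1)    # the skip connection
        u64 = torch.relu(self.fuse(u64))
        return torch.tanh(self.up2(u64))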


ohayonguy commented on June 26, 2024

Have you tried running the code with these changes, and did it work for you?
It does not seem to me like an architecture issue. Why would these artifacts appear in the first place? And why would removing a layer have such a huge effect on the results?


nipponjo commented on June 26, 2024

Hi, yes I tried that, but the training quickly became unstable (orange loss curve). The architecture was designed for 256px images, so I wouldn't expect it to just work with 128px without problems. It seems to me that the generator can't keep up with the discriminator. Removing (unnecessary) layers can make optimization easier, especially since there are no skip connections or normalization layers in the net. I trained with these changes (red graph), but found that the generator still can't keep up after a while. I assume that learning an upsampling from 32px to 128px is considerably harder than from 64px to 256px, as a 64px face still shows some important details. I also trained the net with an added skip connection between the down-/upsampled 64x64 feature maps (shown here) and got some more reasonable results (2nd orange graph).

[image: loss curves]
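
For reference, the 'hinge' loss named in the config, in its standard form (a generic formulation, not copied from this repo):

import torch

def d_hinge_loss(d_real, d_fake):
    # discriminator: push real scores above 1 and fake scores below -1
    return torch.mean(torch.relu(1. - d_real)) + torch.mean(torch.relu(1. + d_fake))

def g_hinge_loss(d_fake):
    # generator: raise the critic's score on generated samples
    return -torch.mean(d_fake)

The ae (l1) loss then has to compete with this term, weighted by gan_loss_alpha and l1_loss_alpha.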


ohayonguy commented on June 26, 2024

I trained with 256x256 images from CelebA, using the original config file you provided, and the issue persists.
Which version of PyTorch are you using?


nipponjo commented on June 26, 2024

I used version 1.10. What does your ae loss curve look like?


ohayonguy commented on June 26, 2024

[image: ae loss curve]
Like so.

Around 16k iterations, the images seem okay! It looks like some sort of mode collapse? Or too high a learning rate? What do you think causes this instability?

I am also using torch 1.10, btw. Lucky me. All deepfillv2 repos use either a very old torch or a very old tensorflow. Not sure why.


nipponjo commented on June 26, 2024

That's strange. When I trained, it looked like this. I don't think I changed anything, but I will try another run and see if the beginning is different.

[image: loss curves]


ohayonguy commented on June 26, 2024

Are you trying to work with the exact same dataset? Regular CelebA?

Here is my config:

# resume training
model_restore: '' # start new training
#model_restore: 'checkpoints/celebahq/model_exp0/states.pth'

# dataloading
dataset_path: '/home/ohayonguy/research/datasets/celeba/img_align_celeba_splits/train/img_align_celeba'
scan_subdirs: True # Are the images organized in subfolders?
random_crop: False  # Set to false when dataset is 'celebahq', meaning only resize the images to img_shapes, instead of crop img_shapes from a larger raw image. This is useful when you train on images with different resolutions like places2. In these cases, please set random_crop to true.
random_horizontal_flip: False
batch_size: 16
num_workers: 10

# training
tb_logging: True # Enable Tensorboard logging?
log_dir: 'tb_logs/celebahq/model_exp0' # Tensorboard logging folder
checkpoint_dir: 'checkpoints/celebahq/model_exp0' # Checkpoint folder

use_cuda_if_available: True
random_seed: False # options: False | <int>

g_lr: 0.0001 # lr for Adam optimizer (generator)
g_beta1: 0.5 # beta1 for Adam optimizer (generator)
g_beta2: 0.999 # beta2 for Adam optimizer (generator)

d_lr: 0.0001 # lr for Adam optimizer (discriminator)
d_beta1: 0.5 # beta1 for Adam optimizer (discriminator)
d_beta2: 0.999 # beta2 for Adam optimizer (discriminator)

max_iters: 1000000 # number of batches to train the models

# logging
viz_max_out: 10 # number of images from batch 
# if optional: set to False to deactivate 
print_iter: 100 # write losses to console and tensorboard
save_checkpoint_iter: 100 # save checkpoint file and overwrite last one
save_imgs_to_tb_iter: 500 # (optional) add image grids to tensorboard
save_imgs_to_dics_iter: 500 # (optional) save image grids in checkpoint folder
save_cp_backup_iter: 5000 # (optional) save checkpoint file named states_{n_iter}.pth

img_shapes: [256, 256, 3]

# mask options
height: 128
width: 128
max_delta_height: 32
max_delta_width: 32
vertical_margin: 0
horizontal_margin: 0

# loss
gan_loss: 'hinge' # options: 'hinge', 'ls'
gan_loss_alpha: 1.

ae_loss: True
l1_loss_alpha: 1.


nipponjo commented on June 26, 2024

I used CelebA-HQ.


ohayonguy commented on June 26, 2024

I see. I am using the regular CelebA. Maybe that causes the difference?


nipponjo commented on June 26, 2024

That's hard to tell.


ohayonguy commented on June 26, 2024

I will give it a shot on CelebA-HQ, although it seems odd to me that an algorithm would work on one dataset of faces but not on another.


nipponjo commented on June 26, 2024

It seems like there is some problem with the spectral_norm in the discriminator. When I trained the model on the dataset, I used the Conv2DSpectralNorm layer, which I implemented to match the original implementation. Having now tested both variants, I found that the one with spectral_norm from torch.nn.utils.parametrizations caused instability (the gan loss overpowers the ae loss). Maybe the model is so fragile that this small difference causes a problem. You can just switch them out:

in networks.py -> DConv

self.conv_sn = Conv2DSpectralNorm(cnum_in, cnum_out, ksize, stride, padding)
#self.conv_sn = spectral_norm(nn.Conv2d(cnum_in, cnum_out, ksize, stride, padding))
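
For the record, the gist of such a power-iteration spectral-norm conv, as a sketch (hypothetical class name; the repo's Conv2DSpectralNorm may differ in details):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv2DSpectralNormSketch(nn.Conv2d):
    def __init__(self, cnum_in, cnum_out, ksize, stride=1, padding=0, n_iter=1):
        super().__init__(cnum_in, cnum_out, ksize, stride, padding)
        # persistent estimate of the leading left-singular vector of the kernel
        self.register_buffer('u', F.normalize(torch.randn(cnum_out, 1), dim=0))
        self.n_iter = n_iter

    def forward(self, x):
        w = self.weight.flatten(1)    # (cnum_out, cnum_in * k * k)
        with torch.no_grad():         # power iteration, no gradient through u/v
            u = self.u
            for _ in range(self.n_iter):
                v = F.normalize(w.t() @ u, dim=0)
                u = F.normalize(w @ v, dim=0)
            self.u.copy_(u)
        sigma = (u.t() @ w @ v).squeeze()   # estimate of the largest singular value
        return F.conv2d(x, self.weight / sigma, self.bias, self.stride, self.padding)

Both variants should compute the same normalization in principle, so the difference probably comes down to details like the number of power iterations or when u is updated.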


ohayonguy commented on June 26, 2024

Ok. Will give it a shot. Thanks!

