
feed_forward_vqgan_clip's People

Contributors

afiaka87, mehdidc


feed_forward_vqgan_clip's Issues

Repo License

Hi, what is the license for this repo/the pretrained models? I may have a way to build upon it.

Unavailable and broken links

When I run the notebook, some links seem unavailable.
I don't know why this happens, because it seems that I can manually download the files in my web browser.

Unavailable links

Moreover, the links in the README are broken.

Broken links

clarifying differences between available models

Hi @mehdidc 👋🏼 I'm a new team member at @replicate.

I was trying out your model on replicate.ai and noticed that the names of the models are a bit cryptic, so it's hard to know what differences to expect when using each:

Screen Shot 2021-09-23 at 6 21 40 PM

Here's where those are declared:

MODELS = [
"cc12m_32x1024_vitgan_v0.1.th",
"cc12m_32x1024_vitgan_v0.2.th",
"cc12m_32x1024_mlp_mixer_v0.2.th",
]

Looking at the source for cog's Input class, it looks like options can be a list of anything:

options: Optional[List[Any]] = None

I'm not sure if this is right, but maybe this means that each model could be declared as a tuple with an accompanying label:

MODELS = [
    ("cc12m_32x1024_vitgan_v0.1.th", "This model does x"),
    ("cc12m_32x1024_vitgan_v0.2.th" "This model does y"),,
    ("cc12m_32x1024_mlp_mixer_v0.2.th", "This model does z"),
]

We could then display those labels on the model form on replicate.ai to make the available options more clear to users.
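
Purely as an illustration of how those (filename, label) pairs could be consumed once declared (MODEL_LABELS and describe are hypothetical helpers, not part of cog or this repo):

MODEL_LABELS = dict(MODELS)  # checkpoint filename -> human-readable label

def describe(model_name: str) -> str:
    # Fall back to the raw filename if no label has been registered for it.
    return MODEL_LABELS.get(model_name, model_name)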

Curious to hear your thoughts!

cc @cjwbw @bfirsh @andreasjansson

VQGAN - blended models

I want to take a film (say, The Shining) and:

  • caption it using Amazon AI label detection (maybe 1 in every 100 frames)
  • throw these image + text pairs into training
  • then have the trained model / neural nets spit out something in the style of the movie...

Is it possible? In the nerdyrodent/VQGAN-CLIP repo there's a style transfer,

  • but I'm asking how to merge the model layers so that the content is skewed to a certain style / aesthetic.

@Norod + @justinpinkney were successful in blending models together (the FFHQ + cartoon designs). Could the same be achieved in this VQGAN domain? They essentially perform some neural surgery / hacking of the layers to force the results.
https://github.com/justinpinkney/toonify

Does the VQGAN give us some access to hack these layers?

UPDATE
@JCBrouwer seems to have combined style transfer with video here:
https://github.com/JCBrouwer/maua-style

fyi @nerdyrodent

How to condition the model output z so that it looks like it came from a standard normal distribution?

Hi, this is a nice repo and I'm trying to reimplement something similar for StyleGAN2. Using a list of texts, I'm mapping CLIP text embeddings to StyleGAN2 latent vectors, which are fed to the StyleGAN2 generator to produce images, and then optimizing this MLP mapper model with a CLIP loss. However, I'm quickly getting blown-out images for entire batches. I suspect this is because the output of the MLP is not conditioned to look like it came from a standard normal distribution. I wonder if you could point me in the right direction on how to do this.
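
For concreteness, here is a rough sketch of the kind of regulariser I imagine, assuming z is the MLP output of shape (batch, 512) and reg_weight is a made-up weight; I don't know if this is the right approach:

import torch

def gaussian_prior_penalty(z):
    # Push the batch of predicted latents toward zero mean and unit variance,
    # i.e. the N(0, I) prior that the StyleGAN2 mapping network expects.
    mean = z.mean(dim=0)
    var = z.var(dim=0, unbiased=False)
    # Per-dimension KL(N(mean, var) || N(0, 1)), averaged over dimensions.
    return 0.5 * (mean.pow(2) + var - var.clamp_min(1e-8).log() - 1.0).mean()

# loss = clip_loss + reg_weight * gaussian_prior_penalty(z)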

How to get more variation in the null image

I've been generating images using this model, which is delightfully fast, but I've noticed that it produces images that are all alike. I tried generating the "null" image by doing:

H = perceptor.encode_text(toks.to(device)).float()
z = net(0 * H)

This resulted in:

base image

And indeed, everything I generated kind of matched that: you can see the fleshly protrusion on the left in "gold coin":

gold-coin--0 0

The object and matching mini-object in "tent":

tent-0 5

And it always seems to try to caption the image with nonsense lettering ("lion"):

lion--0 0

So I'm wondering if there's a way to "prime" the model and suggest it use a different zero image for each run. Is there a variable I can set, or is this deeply ingrained in the training data?
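
For concreteness, the kind of "priming" I have in mind is something like the following, reusing the names from the snippet above; the noise term and noise_scale are made up, not something I know the model exposes:

import torch

# Hypothetical: perturb the zeroed text embedding with Gaussian noise so that
# each run starts from a different "null" image.
noise_scale = 0.5
H = perceptor.encode_text(toks.to(device)).float()
z = net(0 * H + noise_scale * torch.randn_like(H))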

Any advice would be appreciated, thank you!

(Apologies if this is the same as #8, but it sounded like #8 was solved by using priors which doesn't seem to help with this.)

Error in Load Model

Two issues found:

(1) A Restart Runtime occurs on !pip install -r requirements.txt. This, in turn, resets the current directory to /current. But even after manually updating the current directory...

(2) Under Load Model: ImportError: /usr/local/lib/python3.7/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZN3c106ivalue6Future15extractDataPtrsERKNS_6IValueE

Slow Training Speed

Hi,
First of all, great work! I really loved it. To replicate, I tried training on the Conceptual 12M dataset with the same depth and dims as the pretrained models, but training was too slow: even after 4 days it was still going through the first (or 0th) epoch. I'm training on an NVIDIA Quadro RTX A6000, which I don't think is that slow.
Any suggestions to improve the training speed? I have multi-GPU access, but it seems that isn't supported right now.
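
(For reference, the quick single-node fallback I would try in plain PyTorch is below, assuming the network is an ordinary nn.Module called model; I realise the training script may need more than this.)

import torch

# Replicate the model across all visible GPUs; gradients are averaged
# automatically. DistributedDataParallel would scale better but needs more setup.
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
model = model.cuda()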
Thanks!

training GPU configuration

Thanks for your excellent repo.

When training cc12m_32x1024 with the VitGAN or MLP Mixer type, what kind of GPU environment did you use?
A Tesla V100 with 32 GB of memory, or something else?

Thanks

CLIP-guided-diffusion updates

Katherine has released a better notebook for CLIP-guided diffusion. Output on a P100 is quite slow, but results can be very good. I've put the new notebook in my current repo as the "HQ" version.

Is there any chance of using these concepts for diffusion in a similar way? The main issue I'm seeing is that the output from guided-diffusion is 256x256xRGB rather than 16x16 or 32x32 patches. It's also a much larger checkpoint than the VQGAN, and diffusion is sort of inherently tough to reason about, in my experience.

https://github.com/afiaka87/clip-guided-diffusion

Positional Stickiness

For lack of a better word, I've noticed during training that the VitGAN tends to get stuck on one, two, or three (I don't see four happen very often, if at all) "positional blobs".

Effectively, what I'm seeing is that the VitGAN needs to slide from one generation to the next in its latent space. In doing so, it seems to find that it's easier to just create two "spots" in the image that are highly likely to contain specific concepts from each caption.

Does this match your experience? Any idea whether this is bad or good? In my experience with the "chimera" examples, it seems to hurt things.

progress_0000422000
progress_0000421900
progress_0000421400

I hope you can see what I mean: there's a position in particular that seems designated for the "head" of the animal. But it also biases the outputs from other captions; for instance:

tri - x 4 0 0 tx a cylinder made of coffee beans . a cylinder with the texture of coffee beans .
progress_0000418200

New Checkpoint Idea

The Gumbel VQGAN from ruDALLE may prove to be the best VQGAN available. It might be worth training a checkpoint on it.

Observations training with different modifying words/phrases

Searching for a more photo-realistic output, I've found that training on certain words is likely to bias the output heavily.

"illustration"/"cartoon" biases heavily towards a complete lack of photorealism in favor of very abstract interpretations that are often too simple in fact.

Here is an example from training on the blog post captions with the word "minimalist" prepended to each caption (and all mannequin captions removed, which make up about 1/16 of all the captions):

progress_0000019700

In the EleutherAI Discord, a user @kingdomakrillic posted a very useful link https://imgur.com/a/SnSIQRu showing the effect a starting caption/modifier has on various other words when generating an image using the VQGAN + CLIP method.

With that list in hand, I decided to randomly prepend the modifying words/phrases which produced a (subjectively) photo-realistic output to the blog post captions (a rough sketch of the idea follows the list):

        "8k resolution",
        "Flickr",
        "Ambient occlusion",
        "filmic",
        "global illumination",
        "Photo taken with Nikon D750",
        "DSLR",
        "20 megapixels",
        "photo taken with Ektachrome",
        "photo taken with Fugifilm Superia",
        "photo taken with Provia",
        "criterion collection",
        "National Geographic photo ",
        "Associated Press photo",
        "detailed",
        "shot on 70mm",
        "3840x2160",
        "ISO 200",
        "Tri-X 400 TX",
        "Ilford HPS",
        "matte photo",
        "Kodak Gold 200",
        "Kodak Ektar",
        "Kodak Portra",
        "geometric",

With this in place, outputs tend to be much more photorealistic (similar caption to above, less than 1 epoch trained):
<|startoftext|>2 0 megapixels photo of richmond district , san francisco , from a tall vantage point in the morning <|endoftext|>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
progress_0000005100

None of this is very principled, however, and my next attempts were going to be either "add noise to the captions" or "train on image-text pairs as well", both of which seem to be in the codebase already! So I'm going to have a try with that.

In the meantime, here is a checkpoint from the first round of captions ("minimalist" prepended to every blog caption, all captions containing "mannequin" removed). I trained it using the VitGAN for 8 epochs, 128 dim, 8 depth, ViT-B/16, 32 cutn. The loss was perhaps still going down at this point, but with very diminished returns.

model.th.zip

How to improve so we could get results closer to the "regular" VQGAN+CLIP?

Hi! I really love this idea and think that this concept solves the main bottleneck of the current VQGAN+CLIP approach, which is the per-prompt optimisation. I love how instantaneous this approach is at generating new images. However, results with the different CC12M or blog-caption models fall short in comparison to the most recent VQGAN+CLIP optimisation approaches.

I am wondering where it could potentially be improved. One thing could be trying to incorporate the MSE-regularised and z+quantize variants of the most recent VQGAN+CLIP approaches. The other is whether a bigger training dataset would improve the quality. Would it make sense to train it on ImageNet captions, or maybe even a bigger 100M+ caption dataset (maybe C@H)?

As you can see, I can't actually contribute much (though I could help with a bigger-dataset training effort), but I'm cheering for this project not to die!
