
Comments (16)

aleksusklim commented on August 13, 2024

Yes, there is a big difference.

Each prompt (line of text) is first converted to tokens (an array of integers), and those tokens are converted to embeddings (an array where each element is itself a vector of floats).
This step is straightforward, and it is where Embedding Merge works: it adds or multiplies those embedding vectors, not words or their tokens.

But! Those embeddings (representing all the words of the prompt) aren't fed to Stable Diffusion directly. Instead, a CLIP or OpenCLIP transformer network recalculates this two-dimensional array.

It is CLIP that "understands" what the text means. The numbers change drastically and no longer represent simple words but their meanings.

The transformed array represents the high-level prompt, ready to be sent to the U-Net of Stable Diffusion.
And this is where two other control methods come in: prompt weighting and prompt merging.

When you write (green) hair, you are not just increasing the weight of the "word": you are changing the weight of the vectors output by CLIP, which also contain positional information and semantic relations, and could have been influenced by ClipSkip.
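To make the point concrete, here is a minimal sketch, assuming that weighting is a plain per-token multiplication applied to the vectors that come out of CLIP. The function name `apply_weights` and the two-dimensional numbers are made up for illustration; the factor 1.1 for one pair of parentheses is the webui's documented default, and any renormalization the real implementation applies afterwards is left out.

```python
def apply_weights(clip_output, weights):
    """Scale each post-CLIP vector by the weight of its token."""
    return [[w * x for x in vec] for vec, w in zip(clip_output, weights)]

# Hypothetical CLIP output for the two tokens of "green hair" (made-up numbers).
clip_output = [[0.2, -0.4], [1.0, 0.5]]

# "(green) hair": one pair of parentheses corresponds to a weight of 1.1.
weights = [1.1, 1.0]

weighted = apply_weights(clip_output, weights)
print(weighted)
```

Note that the scaled vector for "green" is whatever CLIP produced at that position, so it already carries context from "hair" and from the rest of the prompt.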

When your prompt is longer than 75 tokens, or if you put BREAK explicitly, your prompt is split and its parts are transformed by CLIP independently of each other.
Then you have several valid "prompts" (and each of them can be partially weighted independently).

Before sending them to Stable Diffusion, those parts are summed by elements, so each vector becomes a sum of corresponding vectors, each of which was already transformed with CLIP.

So here you are not merging words, but their meanings. green hair BREAK blue eyes becomes something that simultaneously means both "green hair" and "blue eyes".
(Which doesn't prevent SD from generating blue hair with green eyes, because wrong property binding is an inherent problem, both in the U-Net and in CLIP itself!)

EmbeddingMerge works at a much lower level, merging things at the "word" stage, before CLIP.
This means that the merged parts change their properties and no longer represent what they were.

<'green hair'+'blue eyes'> is the same as <'blue hair'+'green eyes'> or <'green'+'blue'><'eyes'+'hair'>, and at the end we will see what CLIP thinks it is.
So probably the first word is a color, and the second word is a part of the face.
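A minimal sketch of why these three expressions coincide, assuming the merge is a plain element-wise sum of per-token embedding vectors, position by position. The `EMBED` table and its 3-dimensional vectors are made up for illustration; real embeddings come from the model's own tokenizer and have hundreds of dimensions.

```python
# Toy embedding table: one made-up vector per word (hypothetical values).
EMBED = {
    "green": [1.0, 0.0, 0.0],
    "blue": [0.0, 1.0, 0.0],
    "hair": [0.0, 0.0, 1.0],
    "eyes": [0.5, 0.5, 0.0],
}

def embed(text):
    """Stand-in for tokenize + embed: one vector per word."""
    return [EMBED[word] for word in text.split()]

def merge(a, b):
    """Element-wise sum of two equally long vector sequences."""
    return [[x + y for x, y in zip(va, vb)] for va, vb in zip(a, b)]

first = merge(embed("green hair"), embed("blue eyes"))
second = merge(embed("blue hair"), embed("green eyes"))
split = merge(embed("green"), embed("blue")) + merge(embed("eyes"), embed("hair"))

# All three produce the same two vectors: (green+blue), (hair+eyes).
print(first == second == split)
```

Because addition is commutative, the sum cannot remember which color belonged to which noun; only the position of each vector survives.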

On the other hand, <'green'+'hair'> is something different, meaning both a color and an object. Unfortunately, this doesn't in any way help CLIP or SD to separate or localize objects and their properties together.

The importance of CLIP is huge: it transforms groups of words together, and their meaning may change. In your example, low quality is a concept of bad generation, while low and quality on their own mean different things.
By putting low in the negative prompt, what would it actually negate? Will it make buildings taller?
It's worse with quality: don't you want the concept of "quality" to be positive, not negative?

So what I see is <'low'+'resolution'> being something that means both "low" and "resolution" simultaneously, but not "low resolution". On the other hand, <'bad'+'low'><'quality'+'resolution'> might work more or less as expected (just be sure to check the token lengths of your vectors to account for alignment).

Still, CLIP tends to understand even messed-up concepts, so <'eyes'+'blue'> might work too, and my extension serves more of a research purpose than a practical one.

from stable-diffusion-webui-embedding-merge.

aleksusklim commented on August 13, 2024

By default, the shorter string is padded with zero vectors.
The side effect is that the "amount" of information in the padded positions is low.

In your example, adding a 4-token text to a 5-token text gives you 5 tokens, where the first 4 are merged (and thus have double length unless you put =/2 at the end) while the last token is unmodified from the second text (and would be halved in length if you did put =/2 at the end).

The good news is that, firstly, absolute vector length (in the Cartesian sense) is not too important, and SD tolerates 0.5<=X<=3 just fine: half a dog or thrice a dog is still a dog; and secondly, zero vectors don't mess up general concept understanding, and adding them causes even fewer artifacts than putting extra commas here and there.
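A minimal sketch of the 4-plus-5 case, assuming right-side zero padding and a plain element-wise sum. The helper `merge_padded` and its `divide_by` parameter are hypothetical names standing in for the merge and for the effect of =/2; the 2-dimensional vectors are made up.

```python
def merge_padded(a, b, divide_by=1.0):
    """Sum two vector sequences position by position, zero-padding the shorter one."""
    dim = len(a[0])
    length = max(len(a), len(b))

    def pad(seq):
        # Append zero vectors on the right until the sequence is `length` long.
        return seq + [[0.0] * dim] * (length - len(seq))

    return [
        [(x + y) / divide_by for x, y in zip(va, vb)]
        for va, vb in zip(pad(a), pad(b))
    ]

four = [[1.0, 0.0]] * 4  # stand-in for a 4-token text
five = [[0.0, 2.0]] * 5  # stand-in for a 5-token text

plain = merge_padded(four, five)
halved = merge_padded(four, five, divide_by=2.0)

print(len(plain))   # number of resulting vectors
print(plain[0])     # a merged position
print(plain[4])     # the last position, taken from the second text only
print(halved[4])    # the same position after dividing everything by 2
```

The last position shows the asymmetry described above: without the division it is untouched, with the division it is half as long as the original token.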

I've heard that BREAK gives very good person identity merging, like: your main prompt BREAK person1 BREAK person2 BREAK masterpiece etc.
Sometimes you will need to account for alignment too, if you repeat the same prompt in those parts but with a changed subject.


miasik commented on August 13, 2024

> Try it!

I must say that you've shown me a way to use it, and I'm going to use it more often. Thanks again! ;-)


aleksusklim commented on August 13, 2024

> I'm going to use it more often.

Those who can use BREAK often wonder why nobody else is using such power!


miasik commented on August 13, 2024

Perfect! I couldn't imagine I would get such a broad answer!
Token length is my second question, related to my favorite trick of mixing faces. It works like a charm with the usual [name1 | name2], but mixing tokens is unclear to me.
For example:
Laura Vandervoort and Katheryn Winnick have 4 vs. 5 parts
[image]
Should I do something more than just <'Laura Vandervoort' + 'Katheryn Winnick'> to make the mix work correctly?


miasik commented on August 13, 2024

So much information, it's hard to take it all in at once.
You have an example:
'kuvshinov' + 'kuvshinov':-1 + 'kuvshinov':-2 + 'kuvshinov':-3 =: 1
As I understand it, in this example you make a single vector from a complex last name.
If the example exists, it must have some purpose. What is it?
Should I convert all my complex names into single-vector ones like this?
'Vandervoort' + 'Vandervoort':-1 + 'Vandervoort':-2 =: 1


aleksusklim commented on August 13, 2024

> As I understand in the example you make a single vector from a complex last name.

And it gave nothing!
Concepts are destroyed by taking their intermediate tokens.
(The example just showed how to do it, not that it would be useful.)
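For reference, a rough sketch of what the 'kuvshinov' expression quoted in this thread computes, under two assumptions that are mine and not confirmed here: that 'text':-n drops the first n vectors (a left shift), and that =:1 keeps only the first vector of the result. The helper names and the 2-dimensional vectors are hypothetical.

```python
def shift_left(seq, n):
    """Assumed meaning of 'text':-n, dropping the first n vectors."""
    return seq[n:]

def add_padded(a, b):
    """Element-wise sum, zero-padding the shorter sequence on the right."""
    dim = len(a[0])
    length = max(len(a), len(b))
    a = a + [[0.0] * dim] * (length - len(a))
    b = b + [[0.0] * dim] * (length - len(b))
    return [[x + y for x, y in zip(va, vb)] for va, vb in zip(a, b)]

def crop(seq, n):
    """Assumed meaning of =:n, keeping only the first n vectors."""
    return seq[:n]

# Four made-up token vectors standing in for a four-token last name.
name = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]

total = name
for n in range(1, len(name)):
    total = add_padded(total, shift_left(name, n))

single = crop(total, 1)
print(single)  # one vector: the sum of all four token vectors
```

Under these assumptions the first position ends up holding the sum of every token of the name, which is exactly the kind of intermediate-token mixing that destroys the concept.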

> Should I convert all my complex names into single vectored ones like this?

You might have more luck with <'first name'+'last name'>, but probably not even then.
What do you want to achieve? BREAK is better both at merging and at shortening prompts, so you can describe a character and the scene separately, for example.

One practical application of EM is simply making chimeras out of simple objects (as I showed in the linked Discussion); it can be fun.

But even then, my preliminary tests with SDXL show that, for example, <'cat'*X+'girl'*Y> generates either a cat (X≈1, Y≈0.5), or a girl (X<0.5), or a girl with a cat (X==Y), but not a catgirl!
I got kids with feline ears only in very specific ranges like X=0.87, and the results were really unstable and seed-dependent.
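A toy sketch of the weighted merge itself, assuming <'cat'*X+'girl'*Y> is a plain weighted sum of two single-token embedding vectors. The vectors and the helper names are made up; cosine similarity is used here only to show which input the pre-CLIP mix points towards, and says nothing about what CLIP and the U-Net do with it afterwards.

```python
import math

def weighted_merge(a, b, x, y):
    """Weighted element-wise sum of two vectors: x*a + y*b."""
    return [x * p + y * q for p, q in zip(a, b)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Made-up single-token embeddings (hypothetical values).
cat = [1.0, 0.0, 0.2]
girl = [0.0, 1.0, 0.2]

for x, y in [(1.0, 0.5), (0.4, 1.0), (1.0, 1.0)]:
    mix = weighted_merge(cat, girl, x, y)
    print(x, y, round(cosine(mix, cat), 3), round(cosine(mix, girl), 3))
```

Before CLIP the mix moves smoothly between the two inputs as X and Y change; the abrupt cat / girl / girl-with-cat outcomes reported above would then come from the non-linear stages that follow.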


miasik commented on August 13, 2024

I can't use the typical mixing with [ | | ] because the Civitai automoderator reads prompts and sends such images to a long queue.
Using EM lets me avoid that check, so I'm looking for information on how to get the same visual effect with EM as [ | | ] gives.


aleksusklim commented on August 13, 2024

Can't you just [ <'one'> | <'two'> | <'three'> ] ?


miasik commented on August 13, 2024

Sure, I can. As I've already checked, it works the same.
I'd just like to learn something new, something useful ;-)


miasik commented on August 13, 2024

Actually, joining vectors and word switching work differently.
The disadvantage of [|] shows up with persons: at each step, the current person tries to change not only the face but the whole image.
<''> + <''> has more "healthy" behavior, and as a result the final image might be more "consistent".


miasik commented on August 13, 2024

One more thing to keep in mind.
All parts inside [|] have different weights by their nature, but inside <''+''> their weights are the same.
So if I have a good face made with [|], I can't get the same one just by swapping the constructions; I have to play with the weights inside <''+''>.


aleksusklim commented on August 13, 2024

Have you tried BREAK?
a female model Cameron Diaz BREAK a female model Lucy Liu
or
a female model BREAK Cameron Diaz BREAK Lucy Liu
(You can still hide words with EM syntax if needed.)


miasik commented on August 13, 2024

BREAK? Why do I need to use it?
I want to mix faces, not separate them.


aleksusklim commented on August 13, 2024

Try it!


miasik commented on August 13, 2024

> Try it!

Actually, I had used it before, but rarely.
This is my latest work with it: https://civitai.com/images/5744964
