Comments (6)

jdongca2003 commented on August 26, 2024

Thanks, joein. Your clarification is very reasonable.

from qdrant-client.

joein commented on August 26, 2024

Hi @jdongca2003

I don't think an empty string should necessarily produce a zero vector; it depends on the model you are using.

Could you please check whether you get a similar result with the original BAAI/bge-small-en model? (What I mean is: take the model from Hugging Face, compute the embeddings for your documents manually, and check whether the result is the same.)

jdongca2003 commented on August 26, 2024

Thanks, Joein, for the quick response. I checked the embedding vector of the empty document. It is not a zero vector! But it is still not good behavior.

from typing import List
import numpy as np
from fastembed import TextEmbedding
import json

documents: List[str] = [
    "",
    "email address",
    "placeholder",
    "",
    "wireless customer",
    "He died in 1597 at the age of 57",
    "Maharana Pratap is considered a symbol of Rajput resistance against foreign rule",
    "He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar",
    "total active lines",
    ""
]

embedding_model = TextEmbedding(model_name="BAAI/bge-small-en", max_length=512)

embeddings: List[np.ndarray] = list(
    embedding_model.passage_embed(documents)
)  # notice that we are casting the generator to a list


#print(embeddings[0].shape, len(embeddings))

query = "Count the number of active residential customer"
query_embedding = list(embedding_model.query_embed(query))[0]

def print_top_k(query_embedding, embeddings, documents, k=5):
    # dot product of each document embedding with the query; this equals cosine similarity only if the vectors are unit-normalized
    scores = np.dot(embeddings, query_embedding)
    for score, doc in zip(scores, documents):
        print(f'{doc}|score: {score}')
    # sort the scores in descending order
    sorted_scores = np.argsort(scores)[::-1]
    # print the top 5
    #for i in range(k):
    #    print(f"score: {scores[sorted_scores[i]]} Rank {i+1}: {documents[sorted_scores[i]]}")

print_top_k(query_embedding, embeddings, documents, k=5)

I calculated the cosine similarity scores directly:
|score: 0.7972172498703003
email address|score: 0.8228306174278259
placeholder|score: 0.814424991607666
|score: 0.7972172498703003
wireless customer|score: 0.8295177221298218
He died in 1597 at the age of 57|score: 0.7157479524612427
Maharana Pratap is considered a symbol of Rajput resistance against foreign rule|score: 0.7073748111724854
He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar|score: 0.7003960609436035
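
The scores above come from `np.dot`, which matches cosine similarity only when every vector has unit length. As a minimal sketch of the explicit calculation, the snippet below normalizes by the vector norms. It uses small made-up vectors rather than real model output, and the helper name `cosine_scores` is hypothetical, not part of fastembed or qdrant-client.

```python
import numpy as np


def cosine_scores(doc_vectors, query_vector):
    """Cosine similarity between each row of doc_vectors and query_vector."""
    docs = np.asarray(doc_vectors, dtype=np.float64)
    query = np.asarray(query_vector, dtype=np.float64)
    # Divide the dot products by the product of the vector lengths.
    doc_norms = np.linalg.norm(docs, axis=1)
    query_norm = np.linalg.norm(query)
    return docs @ query / (doc_norms * query_norm)


# Toy vectors for illustration only; these are not real embeddings.
toy_docs = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
toy_query = [1.0, 0.0]
toy_scores = cosine_scores(toy_docs, toy_query)
print(toy_scores)
```

If the embeddings are already unit-normalized, this returns the same values as the plain dot product.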

Embedding vector for the empty document:

dim = 384
[-2.53819916e-02 -5.44682052e-03 -5.09282853e-03 -1.49776395e-02
-1.08098146e-02 1.19938692e-02 1.92262717e-02 4.08581644e-02
-9.28279664e-03 1.56196468e-02 1.86153606e-03 -4.88135368e-02
6.96400367e-03 3.49483788e-02 3.50163616e-02 4.01080912e-03
3.18448767e-02 1.36998445e-02 -1.56665053e-02 1.64450370e-02
2.16239858e-02 -1.99406147e-02 1.17815230e-02 -1.80905703e-02
4.76054614e-03 2.72297114e-02 -5.90159511e-03 -8.18434451e-03
-4.85137738e-02 -1.91728160e-01 -3.33202034e-02 -1.37138087e-02
3.19078634e-03 -9.87244491e-03 -1.03822276e-02 -9.70588345e-03
-1.62116215e-02 1.38158510e-02 -1.09591316e-02 4.05766815e-02
2.16749441e-02 1.38471741e-02 -1.54241202e-02 -1.06100161e-02
5.69914840e-03 -2.26438437e-02 -1.67865120e-02 -6.69355411e-03
5.80454506e-02 -6.32909359e-03 2.05236953e-03 1.03720073e-02 ...

joein commented on August 26, 2024

Could you elaborate on "not good behavior"? What do you mean exactly?

jdongca2003 commented on August 26, 2024

I mean that good scoring behavior would give an empty document a low score when the query is natural text.

joein commented on August 26, 2024

Unfortunately, we can't do anything about this. Qdrant provides a way to operate on embeddings, but it can't change the embedding values themselves.
Embedding values are determined by the model you've chosen.
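
Since the embedding values come from the model, one option is to handle empty documents on the client side before they are embedded. This is not suggested in the thread; it is a minimal sketch under that assumption. `drop_empty_documents` is a hypothetical helper, not part of fastembed or qdrant-client.

```python
from typing import List, Tuple


def drop_empty_documents(documents: List[str]) -> Tuple[List[int], List[str]]:
    """Return the original indices and texts of documents that contain non-whitespace text."""
    kept = [(i, doc) for i, doc in enumerate(documents) if doc.strip()]
    indices = [i for i, _ in kept]
    texts = [doc for _, doc in kept]
    return indices, texts


# A shortened version of the document list from the thread.
sample_documents = [
    "",
    "email address",
    "placeholder",
    "",
    "wireless customer",
    "total active lines",
    "",
]
kept_indices, kept_texts = drop_empty_documents(sample_documents)
print(kept_indices)
print(kept_texts)
```

Only `kept_texts` would then be passed to the embedding model, and `kept_indices` maps each score back to its position in the original list.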