Comments (6)
Thanks, joein. Your clarification is very reasonable.
from qdrant-client.
Hi @jdongca2003
I don't think an empty string should necessarily map to a zero vector; it actually depends on the model you are using.
Could you please check whether you get a similar result with the original BAAI/bge-small-en model? (What I mean is: take the model from Hugging Face, compute the embeddings manually for your documents, and check whether the situation is the same.)
Thanks, Joein, for the quick response. I checked the embedding vector of an empty document. It is not a zero vector! But it is still not a good behavior.
from typing import List

import numpy as np
from fastembed import TextEmbedding

documents: List[str] = [
    "",
    "email address",
    "placeholder",
    "",
    "wireless customer",
    "He died in 1597 at the age of 57",
    "Maharana Pratap is considered a symbol of Rajput resistance against foreign rule",
    "He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar",
    "total active lines",
    "",
]

embedding_model = TextEmbedding(model_name="BAAI/bge-small-en", max_length=512)
embeddings: List[np.ndarray] = list(
    embedding_model.passage_embed(documents)
)  # notice that we are casting the generator to a list
# print(embeddings[0].shape, len(embeddings))

query = "Count the number of active residential customer"
query_embedding = list(embedding_model.query_embed(query))[0]


def print_top_k(query_embedding, embeddings, documents, k=5):
    # dot product of each document embedding with the query embedding;
    # this equals cosine similarity only if the embeddings are unit-normalized
    scores = np.dot(embeddings, query_embedding)
    for score, doc in zip(scores, documents):
        print(f'{doc}|score: {score}')
    # sort the scores in descending order
    sorted_scores = np.argsort(scores)[::-1]
    # print the top k
    # for i in range(k):
    #     print(f"score: {scores[sorted_scores[i]]} Rank {i+1}: {documents[sorted_scores[i]]}")


print_top_k(query_embedding, embeddings, documents, k=5)
I calculated the cosine similarity scores directly:
|score: 0.7972172498703003
email address|score: 0.8228306174278259
placeholder|score: 0.814424991607666
|score: 0.7972172498703003
wireless customer|score: 0.8295177221298218
He died in 1597 at the age of 57|score: 0.7157479524612427
Maharana Pratap is considered a symbol of Rajput resistance against foreign rule|score: 0.7073748111724854
He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar|score: 0.7003960609436035
embedding vector for empty document:
dim = 384
[-2.53819916e-02 -5.44682052e-03 -5.09282853e-03 -1.49776395e-02
-1.08098146e-02 1.19938692e-02 1.92262717e-02 4.08581644e-02
-9.28279664e-03 1.56196468e-02 1.86153606e-03 -4.88135368e-02
6.96400367e-03 3.49483788e-02 3.50163616e-02 4.01080912e-03
3.18448767e-02 1.36998445e-02 -1.56665053e-02 1.64450370e-02
2.16239858e-02 -1.99406147e-02 1.17815230e-02 -1.80905703e-02
4.76054614e-03 2.72297114e-02 -5.90159511e-03 -8.18434451e-03
-4.85137738e-02 -1.91728160e-01 -3.33202034e-02 -1.37138087e-02
3.19078634e-03 -9.87244491e-03 -1.03822276e-02 -9.70588345e-03
-1.62116215e-02 1.38158510e-02 -1.09591316e-02 4.05766815e-02
2.16749441e-02 1.38471741e-02 -1.54241202e-02 -1.06100161e-02
5.69914840e-03 -2.26438437e-02 -1.67865120e-02 -6.69355411e-03
5.80454506e-02 -6.32909359e-03 2.05236953e-03 1.03720073e-02 ...
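The script above scores documents with a plain dot product, which matches cosine similarity only when every vector has unit length. A minimal NumPy sketch of an explicitly normalized version follows; `cosine_scores` and its `eps` guard are hypothetical names introduced for illustration, not part of fastembed or qdrant-client. It also shows why a true zero vector would score 0 against any query, which is the behavior the empty document does not exhibit here.

```python
import numpy as np


def cosine_scores(doc_vectors, query_vector, eps=1e-12):
    """Cosine similarity of each row of doc_vectors against query_vector.

    Hypothetical helper for illustration only.
    """
    docs = np.asarray(doc_vectors, dtype=np.float64)
    query = np.asarray(query_vector, dtype=np.float64)
    doc_norms = np.linalg.norm(docs, axis=1)
    query_norm = np.linalg.norm(query)
    # Guard against division by zero: a zero vector ends up with score 0.0.
    denom = np.maximum(doc_norms * query_norm, eps)
    return docs @ query / denom


# Toy 2-D vectors (not real embeddings): one ordinary, one zero, one aligned with the query.
docs = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
query = np.array([1.0, 0.0])
scores = cosine_scores(docs, query)
print(scores)
```

Passing the list of fastembed embeddings and the query embedding to such a function should give the same ranking as the dot product if the model's outputs are already normalized.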
Could you elaborate on "not a good behavior"? What do you mean exactly?
I mean that, for good scoring behavior, an empty document should receive a low score when the query is natural-language text.
Unfortunately, we can't do anything about this. Qdrant provides a way to operate on embeddings, but it has no control over the embedding values.
Embedding values are determined by the model you've chosen.
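Since the model decides what an empty string maps to, one possible client-side workaround is to skip empty documents before they are embedded or uploaded. The sketch below is only an illustration under that assumption; `drop_empty` is a hypothetical helper, not a function of qdrant-client or fastembed.

```python
from typing import List, Tuple


def drop_empty(documents: List[str]) -> Tuple[List[int], List[str]]:
    """Return (original_indices, texts) for documents with non-whitespace content.

    Hypothetical helper for illustration only.
    """
    kept = [(i, doc) for i, doc in enumerate(documents) if doc.strip()]
    indices = [i for i, _ in kept]
    texts = [doc for _, doc in kept]
    return indices, texts


documents = ["", "email address", "placeholder", "", "wireless customer"]
indices, texts = drop_empty(documents)
# Only `texts` would be passed to the embedding model;
# `indices` maps each kept text back to its position in the original list.
print(indices, texts)
```

Whether dropping empty documents is acceptable depends on the application; the alternative is to keep them and filter them out of the search results afterwards.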