Comments (14)
Hey there, @caes27! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this together while we wait for a human maintainer.
The issue you're encountering is likely due to the refine_factor parameter in the LanceDBVectorStore class. This parameter fetches more results than requested and then re-ranks them in memory. If this factor is set, it will fetch more nodes than the similarity_top_k value specified in your query.
Here's the relevant code snippet:
table = self.connection.open_table(self.table_name)
lance_query = (
    table.search(
        query=query.query_embedding,
        vector_column_name=self.vector_column_name,
    )
    .limit(query.similarity_top_k)
    .where(where)
    .nprobes(self.nprobes)
)
if self.refine_factor is not None:
    lance_query.refine_factor(self.refine_factor)
In this code, if self.refine_factor is set, it will refine the results by reading extra elements and re-ranking them in memory. This could result in initially fetching 1080 nodes before any further ranking is applied [1][2].
To address this, you can try adjusting the refine_factor parameter when initializing the LanceDBVectorStore:
vector_store = LanceDBVectorStore.from_table(table, refine_factor=None)
Additionally, ensure that the nprobes parameter is set appropriately. The default value is 20, which balances accuracy and speed. You can adjust it based on your performance needs:
vector_store = LanceDBVectorStore.from_table(table, nprobes=20)
By fine-tuning these parameters, you should be able to control the number of nodes retrieved and improve the accuracy of your queries.
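To make the "read extra elements and re-rank them in memory" idea concrete, here is a minimal pure-Python sketch. It does not call LanceDB: search_with_refine, approx_distance, and exact_distance are hypothetical names, and the coarse rounding only stands in for a lossy (quantized) index. As far as I understand the documented behavior, refining enlarges the candidate pool to top_k * refine_factor, but the number of rows handed back is still capped at top_k.

```python
from typing import List, Optional, Sequence


def exact_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance on the full-precision vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def approx_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in for a lossy index: distance on coarsely rounded vectors."""
    return exact_distance([round(x) for x in a], [round(y) for y in b])


def search_with_refine(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    top_k: int,
    refine_factor: Optional[int] = None,
) -> List[int]:
    """Return the indices of the top_k nearest vectors (hypothetical helper)."""
    pool_size = top_k * (refine_factor or 1)
    # Step 1: approximate search, over-fetching when refine_factor is set.
    candidates = sorted(
        range(len(vectors)), key=lambda i: approx_distance(query, vectors[i])
    )[:pool_size]
    # Step 2: re-rank the enlarged pool with exact distances.
    if refine_factor is not None:
        candidates.sort(key=lambda i: exact_distance(query, vectors[i]))
    # The caller still receives at most top_k results.
    return candidates[:top_k]


# Illustrative data only: vectors 0 and 1 look identical to the lossy index.
example_query = [0.4, 0.4]
example_vectors = [[0.0, 0.0], [0.45, 0.45], [1.4, 1.4], [0.6, 0.6]]
```

With this data, the unrefined search picks index 0 while the refined one picks index 1, and both return exactly top_k rows. If that matches the real behavior, refine_factor alone would not explain receiving a different number of nodes than requested.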
from llama_index.
You're right, the current LanceDBVectorStore.from_table() method only takes a table as a parameter and does not accept additional parameters like refine_factor or nprobes. This is a limitation in the current implementation.
However, it is feasible to modify the from_table method to accept these parameters. Here is an example of how the method could be updated to include refine_factor and nprobes:
@classmethod
def from_table(cls, table: Any, refine_factor: Optional[int] = None, nprobes: int = 20) -> "LanceDBVectorStore":
    """Create instance from table."""
    try:
        if not isinstance(
            table, (lancedb.db.LanceTable, lancedb.remote.table.RemoteTable)
        ):
            raise Exception("argument is not lancedb table instance")
        return cls(table=table, refine_factor=refine_factor, nprobes=nprobes)
    except Exception as e:
        print("ldb version", lancedb.__version__)
        raise
This modification would allow users to specify refine_factor and nprobes when creating an instance from a table. The LanceDBVectorStore class already supports these parameters in its __init__ method [1][2].
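A more general variant of the same idea is to have from_table forward arbitrary keyword arguments to the constructor, so that any option __init__ accepts can also be set when wrapping an existing table. The sketch below uses a hypothetical stand-in class, SketchVectorStore, rather than the real LanceDBVectorStore, and its constructor parameters are assumptions chosen for illustration.

```python
from typing import Any, Optional


class SketchVectorStore:
    """Hypothetical stand-in for LanceDBVectorStore; not the real class."""

    def __init__(
        self,
        table: Any = None,
        refine_factor: Optional[int] = None,
        nprobes: int = 20,
        query_type: str = "vector",
    ) -> None:
        self.table = table
        self.refine_factor = refine_factor
        self.nprobes = nprobes
        self.query_type = query_type

    @classmethod
    def from_table(cls, table: Any, **kwargs: Any) -> "SketchVectorStore":
        """Create an instance from an existing table, forwarding extra options."""
        if table is None:
            raise ValueError("table must not be None")
        # Unknown keyword arguments fail loudly inside __init__ (TypeError).
        return cls(table=table, **kwargs)


# Usage: options that previously required editing the class can be passed in.
example_store = SketchVectorStore.from_table(
    object(), nprobes=50, query_type="hybrid"
)
```

Forwarding **kwargs keeps from_table in step with __init__ as new options are added, at the cost of a less explicit signature.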
For now, you might need to manually adjust these parameters in the LanceDBVectorStore class. I'll loop in a maintainer to see if this feature can be added in a future update.
@logan-markewich, could you step in and provide your guidance on this feature request?
from llama_index.
The retrieval of nodes from your LanceDBVectorStore is limited to 1498 nodes because of the query.similarity_top_k parameter in the query method. This parameter specifies the maximum number of top similar nodes to retrieve.
Here is the relevant code snippet:
lance_query = (
    self._table.search(
        query=_query,
        vector_column_name=self.vector_column_name,
    )
    .limit(query.similarity_top_k * self.overfetch_factor)
    .where(where)
)
The .limit(query.similarity_top_k * self.overfetch_factor) part of the code limits the number of nodes retrieved to the value specified by query.similarity_top_k multiplied by self.overfetch_factor. If this value is set to 1498, then only 1498 nodes will be retrieved, even if there are more nodes stored in the database. Adjusting the query.similarity_top_k or self.overfetch_factor values can help retrieve more nodes if needed [1][2].
Additionally, the default value of self.overfetch_factor in the LanceDBVectorStore implementation is 1. It can be adjusted by passing a different value to the overfetch_factor parameter when initializing the LanceDBVectorStore object:
vector_store = LanceDBVectorStore(overfetch_factor=2)
This should help you retrieve more nodes if needed [2].
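The arithmetic behind that .limit(...) call can be sketched in a few lines. The helper names rows_requested and rows_returned are hypothetical, and the sketch assumes only that a search can never return more rows than the table actually holds; the table size used in the comment is illustrative.

```python
def rows_requested(similarity_top_k: int, overfetch_factor: int = 1) -> int:
    """Value passed to .limit(): top_k multiplied by the over-fetch factor."""
    return similarity_top_k * overfetch_factor


def rows_returned(
    similarity_top_k: int, overfetch_factor: int, rows_in_table: int
) -> int:
    """Upper bound on rows coming back: the limit, capped by the table size."""
    return min(rows_requested(similarity_top_k, overfetch_factor), rows_in_table)


# With the default factor of 1, asking for 2500 requests 2500 rows. Getting
# fewer back would then point at the table holding fewer rows than expected,
# not at the limit itself.
```

Under these assumptions, a result count below similarity_top_k * overfetch_factor suggests checking how many rows the table really contains before tuning either parameter.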
from llama_index.
vector_store = LanceDBVectorStore.from_table() only takes in a table as a parameter, nothing else.
Speaking about this, can this be a feature added in the future? I've been having to go into the LanceDBVectorStore and manually change the query_type myself when conducting some evaluation on my RAG system.
from llama_index.
Update: I ran some more tests, and the issue is not caused by creating a vector store from a lancedb table, as I thought it might be. Here is some code:
question = "test"
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
vector_store = LanceDBVectorStore(uri="./lancedb_RL_test_2", table_name="docs_8.1", query_type="hybrid")
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)
index.storage_context.persist(persist_dir="./RL_test_2_8.1")
retriever = index.as_retriever(similarity_top_k = 2500)
nodes = retriever.retrieve(question)
db = lancedb.connect("lancedb_RL_test_2")
table = db.open_table("docs_8.1")
vector_store2 = LanceDBVectorStore.from_table(table)
index2 = VectorStoreIndex.from_vector_store(vector_store)
retriever2 = index2.as_retriever(similarity_top_k = 2500)
nodes2 = retriever2.retrieve(question)
Keep in mind, "all_leaf_nodes" contains 3558 nodes, but both times I retrieve nodes using the VectorStoreIndex as a retriever, it is being limited to 1498. Any idea of what might be happening? I can see the 3500+ nodes inside of my lancedb table directory.
from llama_index.
Hi @caes27, thanks for reporting the issue.
I tested from the integration end and came to the following conclusions (I used the hierarchical parser and ingested 768 nodes into the DB):
- len(index.vector_store._table.search().where(None).limit(700).to_pandas()) gives the correct result and returns 700.
- I added a print statement / logged the len(results) fetched in the query function, and similarity_top_k seems to be correctly parsed by the lancedb query function. response = index.as_retriever(similarity_top_k = 700).retrieve('test') returns nodes : 700
- but when I check len(response), it returns 234, which seems odd.
I am not sure, but it seems to be an issue in how the final results are built by the llama index retriever API / query engine API. I can see the VectorIndexRetriever._build_node_list_from_query_result() function being called, but @logan-markewich, could you have a look, as you would have a better idea?
From the lancedb integration API end, it seems to be fine. Perhaps there is some minor docstore or storage context issue, and I can make the fix if needed, but I am not sure what the fix is.
Adding the query function debug code snippet:
def query(
    self,
    query: VectorStoreQuery,
    **kwargs: Any,
) -> VectorStoreQueryResult:
    """Query index for top k most similar nodes."""
    if query.filters is not None:
        if "where" in kwargs:
            raise ValueError(
                "Cannot specify filter via both query and kwargs. "
                "Use kwargs only for lancedb specific items that are "
                "not supported via the generic query interface."
            )
        where = _to_lance_filter(query.filters, self._metadata_keys)
    else:
        where = kwargs.pop("where", None)
    query_type = kwargs.pop("query_type", self.query_type)
    # debug: use a format placeholder so the logger can render the argument
    _logger.info("query_type : %s", query_type)
    if query_type == "vector":
        _query = query.query_embedding
    else:
        if not isinstance(self._table, lancedb.db.LanceTable):
            raise ValueError(
                "creating FTS index is not supported for LanceDB Cloud yet. "
                "Please use a local table for FTS/Hybrid search."
            )
        if self._fts_index is None:
            self._fts_index = self._table.create_fts_index(
                self.text_key, replace=True
            )
        if query_type == "hybrid":
            _query = (query.query_embedding, query.query_str)
        elif query_type == "fts":
            _query = query.query_str
        else:
            raise ValueError(f"Invalid query type: {query_type}")
    lance_query = (
        self._table.search(
            query=_query,
            vector_column_name=self.vector_column_name,
        )
        .limit(query.similarity_top_k * self.overfetch_factor)
        .where(where)
    )
    if query_type != "fts":
        lance_query.nprobes(self.nprobes)
    if query_type == "hybrid" and self._reranker is not None:
        _logger.info(f"using {self._reranker} for reranking results.")
        lance_query.rerank(reranker=self._reranker)
    if self.refine_factor is not None:
        lance_query.refine_factor(self.refine_factor)
    results = lance_query.to_pandas()
    if len(results) == 0:
        raise Warning("query results are empty..")
    nodes = []
    for _, item in results.iterrows():
        try:
            node = metadata_dict_to_node(item.metadata)
            node.embedding = list(item[self.vector_column_name])
        except Exception:
            # deprecated legacy logic for backward compatibility
            _logger.debug(
                "Failed to parse Node metadata, fallback to legacy logic."
            )
            if item.metadata:
                metadata, node_info, _relation = legacy_metadata_dict_to_node(
                    item.metadata, text_key=self.text_key
                )
            else:
                metadata, node_info = {}, {}
            node = TextNode(
                text=item[self.text_key] or "",
                id_=item.id,
                metadata=metadata,
                start_char_idx=node_info.get("start", None),
                end_char_idx=node_info.get("end", None),
                relationships={
                    NodeRelationship.SOURCE: RelatedNodeInfo(
                        node_id=item[self.doc_id_key]
                    ),
                },
            )
        nodes.append(node)
    # _logger.info("nodes :", len(nodes))
    print("nodes :", len(nodes))  # this returns the correct no. of nodes as per similarity_top_k
    return VectorStoreQueryResult(
        nodes=nodes,
        similarities=_to_llama_similarities(results),
        ids=results["id"].tolist(),
    )
from llama_index.
Hello @raghavdixit99,
Thank you for helping me, I really appreciate it.
There are a bunch of things that are weird.
I rechunked a smaller set of documents and ingested 3500 nodes into a separate lancedb table. I set similarity_top_k to 1500 and by adding your debugging statement of:
print("nodes :", len(nodes)) # this returns the correct no. of nodes as per similarity_top_k
It correctly showed 1500 nodes being returned, but in the final response:
response = index.as_retriever(similarity_top_k = 1500).retrieve('test')
print(len(response))
This output 1488 nodes, so some nodes were lost in this process. It was kinda fascinating how yours went from 700 to 234. But there is also another issue.
Since there are 3500 documents, I wanted to test it with a larger limit/similarity_top_k.
I set it to 2500 and, every time, ran both of the following:
table_nodes = table.search().limit(2500).to_list()
print(len(table_nodes))
response = index.as_retriever(similarity_top_k = 2500).retrieve('test')
print(len(response))
The top piece of code returned 1510 nodes.
For the bottom piece of code, the debugging statement added into the query function showed 1510 nodes, and then it went down to 1498.
The limit/similarity_top_k was set to 2500, so what is going on here? I think this is a bigger issue than the nodes being lost in the final stages of the retrieval process.
Tagging for visibility: @logan-markewich
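One way to narrow down where nodes disappear between the vector store and the retriever is to compare counts at each stage and check for repeated ids. The sketch below is a diagnostic only. It assumes, without any confirmation in this thread, that rows sharing an id could be collapsed somewhere downstream, and summarize_ids is a hypothetical helper rather than part of either library.

```python
from collections import Counter
from typing import Dict, List


def summarize_ids(ids: List[str]) -> Dict[str, int]:
    """Count total rows, distinct ids, and ids that occur more than once."""
    counts = Counter(ids)
    return {
        "total": len(ids),
        "unique": len(counts),
        "duplicated_ids": sum(1 for c in counts.values() if c > 1),
    }


# Illustrative input; in practice the list could come from the ids that the
# query function returns, e.g. results["id"].tolist().
example_summary = summarize_ids(["a", "b", "a", "c", "a"])
```

If "unique" comes out lower than "total" and matches the length of the final response, duplicate ids would be a plausible explanation; if the two are equal, the loss happens for some other reason.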
from llama_index.
@caes27, a lancedb search such as table.search().limit(x) will return the correct result, as that calls our OSS API, which is a simple vector search and has been tested without any issues.
Additionally, I tested it locally via len(index.vector_store._table.search().where(None).limit(None).to_pandas()) and got the entire table (768 nodes), which is the correct result. You can refer to our API reference for more details: https://lancedb.github.io/lancedb/python/python/#lancedb.query.LanceQueryBuilder.limit
Perhaps your table has not ingested all the data, or your uri needs a refresh (rm -rf /your_lancedb_path).
As for the final retrieval results coming back lower than expected, I have already covered that in my comment and tagged Logan. We should wait for his response, as it seems like a parsing problem in the base retriever class.
Thanks
from llama_index.
Hey @raghavdixit99,
I believe you when you say the table.search().limit(x) method works lol
I have refreshed the uri multiple times and still get the same issue. Maybe it's a matter of how nodes are being ingested into the lancedb table when you do this:
vector_store = LanceDBVectorStore(uri="./lancedb_RL_test_3", table_name="docs_8.1", query_type="hybrid")
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)
I can't see anywhere else where it can go wrong.
If you have time, maybe you can try it on your end by populating the table with 2000+ nodes and see if you get the same issue?
Thank you!
from llama_index.
Did more digging. As I was populating the table little by little, instead of sending it 25000+ nodes at once, I realized something.
Suppose my table has 500 nodes in it currently and I want to add 300 more nodes to the table. I run the following code:
vector_store = LanceDBVectorStore(uri='lancedb_TEST', table_name='docs', query_type='hybrid')
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)
After this is done, there should be 800 nodes in the lancedb table, but after I execute the following code:
db3 = lancedb.connect("lancedb_TEST")
table3 = db3.open_table("docs")
vector_store3 = LanceDBVectorStore.from_table(table3)
index3 = VectorStoreIndex.from_vector_store(vector_store3)
index3.insert_nodes(all_leaf_nodes)
retriever3 = index3.as_retriever(similarity_top_k = 1500)
nodes3 = retriever3.retrieve(question)
nodes3 is of length 300, which were the nodes I just added. It ignores the 500 nodes that were in the lancedb table previously.
Is this not the correct way to add nodes to an existing lancedb table?
I appreciate any help, thank you!
from llama_index.
Hi @caes27
Thanks for the update.
Since you are trying to ingest data iteratively, you should try changing the mode to "append". By default the table overwrites the data, which could be the reason for this behavior.
vector_store = LanceDBVectorStore(uri='lancedb_TEST', table_name='docs', query_type='hybrid', mode="append")
from llama_index.
Hello @raghavdixit99,
I think I might have found the issue that was causing problems.
First, I noticed some faulty logic in the LanceDBVectorStore's "add" method, and fixed it myself. At the same time, I thought about upgrading the package and this also fixed it lol.
I also tried your solution yesterday, and it works if the table already exists and has some sort of data in it. However, it throws an error when the table is empty. So that is a workaround, but this fix in the "add" method seemed to solve ingesting data into a fresh table:
Previous:
if self._table is None:
    self._table = self._connection.create_table(
        self._table_name, data, mode=self.mode
    )
else:
    if self.api_key is None:
        self._table.add(data, mode=self.mode)
    else:
        self._table.add(data)
After:
if self._table is None:
    self._table = self._connection.create_table(
        self._table_name, data, mode=self.mode
    )
else:
    if self.api_key is None:
        self._table.add(data, mode="append")
    else:
        self._table.add(data)
From what I saw, the data was ingested in batches and when the second batch came around, because it was in "overwrite" mode, the second batch's data would completely wipe the first batch's data and so on.
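The batch-by-batch effect described above can be reproduced with a tiny in-memory simulation. FakeTable and ingest_in_batches are hypothetical stand-ins, not the LanceDB API, and the batch size of 1000 is an assumption picked for illustration; only the total of 3558 nodes comes from this thread.

```python
from typing import List


class FakeTable:
    """Hypothetical in-memory stand-in for a table; not the LanceDB API."""

    def __init__(self) -> None:
        self.rows: List[str] = []

    def add(self, data: List[str], mode: str = "append") -> None:
        if mode == "overwrite":
            # Every call replaces whatever was stored before.
            self.rows = list(data)
        elif mode == "append":
            self.rows.extend(data)
        else:
            raise ValueError(f"Unknown mode: {mode}")


def ingest_in_batches(
    table: FakeTable, nodes: List[str], batch_size: int, mode: str
) -> int:
    """Add nodes batch by batch and return the final row count."""
    for start in range(0, len(nodes), batch_size):
        table.add(nodes[start : start + batch_size], mode=mode)
    return len(table.rows)


example_nodes = [f"node-{i}" for i in range(3558)]
overwrite_count = ingest_in_batches(FakeTable(), example_nodes, 1000, "overwrite")
append_count = ingest_in_batches(FakeTable(), example_nodes, 1000, "append")
```

In this simulation, "overwrite" leaves only the final batch in the table, while "append" keeps all 3558 nodes. The real table may batch differently, so the surviving count in practice need not match the simulated one.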
The other problem with LlamaIndex losing some nodes in the retrieval process still persists, so I'm still waiting.
Thanks again Raghav for your help throughout this whole thread.
from llama_index.
Hi @caes27, that is not faulty logic. You are hardcoding the mode argument, whereas we set it according to the user's requirement/input. Please follow the usage from my last comment; for the rest, we are waiting on Logan's response.
from llama_index.