[Question]: Index created from a LanceDBVectorStore is not returning all nodes? (run-llama/llama_index)

Comments (14)

dosubot commented on July 17, 2024

Hey there, @caes27! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this together while we wait for a human maintainer.

The issue you're encountering is likely due to the refine_factor parameter in the LanceDBVectorStore class. This parameter fetches more results than requested and then re-ranks them in memory. If this factor is set, it will fetch more nodes than the similarity_top_k value specified in your query.

Here's the relevant code snippet:

table = self.connection.open_table(self.table_name)
lance_query = (
    table.search(
        query=query.query_embedding,
        vector_column_name=self.vector_column_name,
    )
    .limit(query.similarity_top_k)
    .where(where)
    .nprobes(self.nprobes)
)

if self.refine_factor is not None:
    lance_query.refine_factor(self.refine_factor)

In this code, if self.refine_factor is set, it will refine the results by reading extra elements and re-ranking them in memory. This could result in initially fetching 1080 nodes before any further ranking is applied ^[1]^[2].
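To make the "read extra elements and re-rank them in memory" idea concrete, here is a minimal pure-Python sketch. It is not LanceDB's implementation: the `search`, `approx_distance`, and `exact_distance` helpers are hypothetical, and the sketch assumes that refinement only re-ranks a larger candidate pool and then truncates back to the requested limit.

```python
import random

def approx_distance(query, vec, dims=2):
    # Stand-in for a compressed index: only the first `dims` coordinates are compared.
    return sum((q - v) ** 2 for q, v in zip(query[:dims], vec[:dims]))

def exact_distance(query, vec):
    # Full squared Euclidean distance over every coordinate.
    return sum((q - v) ** 2 for q, v in zip(query, vec))

def search(vectors, query, limit, refine_factor=None):
    """Return the indices of the `limit` nearest vectors (hypothetical helper)."""
    # With a refine factor, limit * refine_factor candidates are read ...
    n_candidates = limit * (refine_factor or 1)
    candidates = sorted(
        range(len(vectors)), key=lambda i: approx_distance(query, vectors[i])
    )[:n_candidates]
    # ... re-ranked in memory with the exact distance ...
    if refine_factor is not None:
        candidates.sort(key=lambda i: exact_distance(query, vectors[i]))
    # ... and cut back to the requested limit (an assumption of this sketch).
    return candidates[:limit]

rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(1000)]
query = [rng.random() for _ in range(8)]

print(len(search(vectors, query, limit=10)))                   # 10
print(len(search(vectors, query, limit=10, refine_factor=5)))  # 10
```

Under that assumption the refine factor changes which rows come back, not how many, so it is worth confirming the returned count directly before attributing a node-count mismatch to it.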

To address this, you can try adjusting the refine_factor parameter when initializing the LanceDBVectorStore:

vector_store = LanceDBVectorStore.from_table(table, refine_factor=None)

Additionally, ensure that the nprobes parameter is set appropriately. The default value is 20, which balances accuracy and speed. You can adjust it based on your performance needs:

vector_store = LanceDBVectorStore.from_table(table, nprobes=20)

By fine-tuning these parameters, you should be able to control the number of nodes retrieved and improve the accuracy of your queries.

To continue talking to Dosu, mention @dosu.

from llama_index.

dosubot commented on July 17, 2024

You're right, the current LanceDBVectorStore.from_table() method only takes a table as a parameter and does not accept additional parameters like refine_factor or nprobes. This is a limitation in the current implementation.

However, it is feasible to modify the from_table method to accept these parameters. Here is an example of how the method could be updated to include refine_factor and nprobes:

@classmethod
def from_table(cls, table: Any, refine_factor: Optional[int] = None, nprobes: int = 20) -> "LanceDBVectorStore":
    """Create instance from table."""
    try:
        if not isinstance(
            table, (lancedb.db.LanceTable, lancedb.remote.table.RemoteTable)
        ):
            raise Exception("argument is not lancedb table instance")
        return cls(table=table, refine_factor=refine_factor, nprobes=nprobes)
    except Exception as e:
        print("ldb version", lancedb.__version__)
        raise

This modification would allow users to specify refine_factor and nprobes when creating an instance from a table. The LanceDBVectorStore class already supports these parameters in its __init__ method ^[1]^[2].

For now, you might need to manually adjust these parameters in the LanceDBVectorStore class. I'll loop in a maintainer to see if this feature can be added in a future update.

@logan-markewich, could you step in and provide your guidance on this feature request?

dosubot commented on July 17, 2024

The retrieval of nodes from your LanceDBVectorStore is limited to 1498 nodes because of the query.similarity_top_k parameter in the query method. This parameter specifies the maximum number of top similar nodes to retrieve.

Here is the relevant code snippet:

lance_query = (
    self._table.search(
        query=_query,
        vector_column_name=self.vector_column_name,
    )
    .limit(query.similarity_top_k * self.overfetch_factor)
    .where(where)
)

The .limit(query.similarity_top_k * self.overfetch_factor) part of the code limits the number of nodes retrieved to the value specified by query.similarity_top_k multiplied by self.overfetch_factor. If this value is set to 1498, then only 1498 nodes will be retrieved, even if there are more nodes stored in the database. Adjusting the query.similarity_top_k or self.overfetch_factor values can help retrieve more nodes if needed ^[1]^[2].

Additionally, the default value of self.overfetch_factor in the LanceDBVectorStore implementation is 1. It can be adjusted by passing a different value to the overfetch_factor parameter when initializing the LanceDBVectorStore object:

vector_store = LanceDBVectorStore(overfetch_factor=2)

This should help you retrieve more nodes if needed ^[2].
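The arithmetic behind that limit can be checked with a small sketch. The `expected_row_count` helper is hypothetical, and it assumes no filter is applied and every row in the table is searchable; the numbers are the ones reported in this thread.

```python
def expected_row_count(similarity_top_k: int, overfetch_factor: int, table_rows: int) -> int:
    """Upper bound on rows a query can return (hypothetical helper)."""
    requested = similarity_top_k * overfetch_factor
    # A query cannot return more rows than the table holds.
    return min(requested, table_rows)

print(expected_row_count(2500, 1, 3558))  # 2500
print(expected_row_count(2500, 2, 3558))  # 3558
print(expected_row_count(700, 1, 768))    # 700
```

With similarity_top_k = 2500 and 3558 rows in the table, the limit alone would allow 2500 results, so a count of 1498 points to something other than this parameter.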

caes27 commented on July 17, 2024

@dosu

vector_store = LanceDBVectorStore.from_table() only takes in a table as a parameter, nothing else.

Speaking of this, could it be added as a feature in the future? I've had to go into the LanceDBVectorStore and manually change the query_type myself when running evaluations on my RAG system.

caes27 commented on July 17, 2024

@dosu

Update: I ran some more tests, and the issue is not caused by creating a vector store from a lancedb table, as I thought it might be. Here is some code:

question = "test"

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
vector_store = LanceDBVectorStore(uri="./lancedb_RL_test_2", table_name="docs_8.1", query_type="hybrid")
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)
index.storage_context.persist(persist_dir="./RL_test_2_8.1")

retriever = index.as_retriever(similarity_top_k = 2500)
nodes = retriever.retrieve(question)

db = lancedb.connect("lancedb_RL_test_2")
table = db.open_table("docs_8.1")
vector_store2 = LanceDBVectorStore.from_table(table)
index2 = VectorStoreIndex.from_vector_store(vector_store2)  # use the store reopened from the table

retriever2 = index2.as_retriever(similarity_top_k = 2500)
nodes2 = retriever2.retrieve(question)

Keep in mind, "all_leaf_nodes" contains 3558 nodes, but both times I retrieve nodes using the VectorStoreIndex as a retriever, it is being limited to 1498. Any idea of what might be happening? I can see the 3500+ nodes inside of my lancedb table directory.

raghavdixit99 commented on July 17, 2024

Hi @caes27, thanks for reporting the issue.

I tested from the integration end and came to the following conclusions (I used the hierarchical parser and ingested 768 nodes into the DB):

len(index.vector_store._table.search().where(None).limit(700).to_pandas()) gives the correct result and returns 700.
I added a print statement to log len(results) fetched in the query function, and similarity_top_k seems to be correctly passed to the lancedb query function: for response = index.as_retriever(similarity_top_k = 700).retrieve('test') it reports nodes : 700.
But when I check len(response), it returns 234, which seems odd.

I am not sure, but it seems to be an issue in how the final results are built by the llama index retriever API / query engine API. I can see the VectorIndexRetriever._build_node_list_from_query_result() function being called, but @logan-markewich, could you have a look, as you would have a better idea?

From the lancedb integration API end it seems to be fine. Perhaps there is some minor docstore or storage context issue, and I can make the fix if needed, but I am not sure what the fix is.

Adding the query function debug code snippet:

    def query(
        self,
        query: VectorStoreQuery,
        **kwargs: Any,
    ) -> VectorStoreQueryResult:
        """Query index for top k most similar nodes."""
        if query.filters is not None:
            if "where" in kwargs:
                raise ValueError(
                    "Cannot specify filter via both query and kwargs. "
                    "Use kwargs only for lancedb specific items that are "
                    "not supported via the generic query interface."
                )
            where = _to_lance_filter(query.filters, self._metadata_keys)
        else:
            where = kwargs.pop("where", None)

        query_type = kwargs.pop("query_type", self.query_type)

        _logger.info("query_type : %s", query_type)

        if query_type == "vector":
            _query = query.query_embedding
        else:
            if not isinstance(self._table, lancedb.db.LanceTable):
                raise ValueError(
                    "creating FTS index is not supported for LanceDB Cloud yet. "
                    "Please use a local table for FTS/Hybrid search."
                )
            if self._fts_index is None:
                self._fts_index = self._table.create_fts_index(
                    self.text_key, replace=True
                )

            if query_type == "hybrid":
                _query = (query.query_embedding, query.query_str)
            elif query_type == "fts":
                _query = query.query_str
            else:
                raise ValueError(f"Invalid query type: {query_type}")

        lance_query = (
            self._table.search(
                query=_query,
                vector_column_name=self.vector_column_name,
            )
            .limit(query.similarity_top_k * self.overfetch_factor)
            .where(where)
        )

        if query_type != "fts":
            lance_query.nprobes(self.nprobes)
            if query_type == "hybrid" and self._reranker is not None:
                _logger.info(f"using {self._reranker} for reranking results.")
                lance_query.rerank(reranker=self._reranker)

        if self.refine_factor is not None:
            lance_query.refine_factor(self.refine_factor)

        results = lance_query.to_pandas()

        if len(results) == 0:
            raise Warning("query results are empty..")

        nodes = []

        for _, item in results.iterrows():
            try:
                node = metadata_dict_to_node(item.metadata)
                node.embedding = list(item[self.vector_column_name])
            except Exception:
                # deprecated legacy logic for backward compatibility
                _logger.debug(
                    "Failed to parse Node metadata, fallback to legacy logic."
                )
                if item.metadata:
                    metadata, node_info, _relation = legacy_metadata_dict_to_node(
                        item.metadata, text_key=self.text_key
                    )
                else:
                    metadata, node_info = {}, {}
                node = TextNode(
                    text=item[self.text_key] or "",
                    id_=item.id,
                    metadata=metadata,
                    start_char_idx=node_info.get("start", None),
                    end_char_idx=node_info.get("end", None),
                    relationships={
                        NodeRelationship.SOURCE: RelatedNodeInfo(
                            node_id=item[self.doc_id_key]
                        ),
                    },
                )

            nodes.append(node)

        # _logger.info("nodes :", len(nodes))
        print("nodes :", len(nodes)) # this returns the correct no. of nodes as per similarity_top_k

        return VectorStoreQueryResult(
            nodes=nodes,
            similarities=_to_llama_similarities(results),
            ids=results["id"].tolist(),
        )
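To narrow down where nodes go missing between the print above and the retriever's final list, one option is to compare the ids on both sides. The sketch below is a hypothetical diagnostic helper, not part of llama_index or lancedb. Duplicate ids in the store are only one hypothesis worth checking; nothing in this thread confirms them as the cause.

```python
from collections import Counter

def compare_ids(store_ids, retrieved_ids):
    """Report duplicate and missing ids between two stages (hypothetical helper)."""
    store_counts = Counter(store_ids)
    retrieved = set(retrieved_ids)
    return {
        # ids the vector store returned more than once
        "duplicates_in_store": sorted(i for i, c in store_counts.items() if c > 1),
        # ids the vector store returned but the retriever dropped
        "missing_from_retriever": sorted(set(store_counts) - retrieved),
        # number of distinct ids the vector store returned
        "unique_in_store": len(store_counts),
    }

report = compare_ids(["a", "b", "b", "c"], ["a", "b"])
print(report)
# {'duplicates_in_store': ['b'], 'missing_from_retriever': ['c'], 'unique_in_store': 3}
```

In practice, store_ids could come from results["id"].tolist() inside the query function, and retrieved_ids from [n.node.node_id for n in response] on the retriever output.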

caes27 commented on July 17, 2024

Hello @raghavdixit99,

Thank you for helping me, I really appreciate it.

There are a bunch of things that are weird.

I rechunked a smaller set of documents and ingested 3500 nodes into a separate lancedb table. I set similarity_top_k to 1500 and by adding your debugging statement of:

 print("nodes :", len(nodes)) # this returns the correct no. of nodes as per similarity_top_k

It correctly showed 1500 nodes being returned, but in the final response:

response = index.as_retriever(similarity_top_k = 1500).retrieve('test')
print(len(response))

This output 1488 nodes, so some nodes were lost in the process. It was kinda fascinating how yours went from 700 to 234. But there is also another issue.

Since there are 3500 documents, I wanted to test it with a larger limit/similarity_top_k.

I set it to 2500 and got the same result every time with both of these:

table_nodes = table.search().limit(2500).to_list()
print(len(table_nodes))

response = index.as_retriever(similarity_top_k = 2500).retrieve('test')
print(len(response))

The top piece of code returned 1510 nodes.
For the bottom piece of code, the debugging statement added into the query function showed 1510 nodes, and then it went down to 1498.

The limit/similarity_top_k was set to 2500, so what is going on here? I think this is a bigger issue than the nodes being lost in the final stages of the retrieval process.

Tagging for visibility: @logan-markewich

raghavdixit99 commented on July 17, 2024

@caes27, a lancedb search, table.search().limit(x), will return the correct result, as that calls our OSS API, which is a simple vector search and has been tested without any issues.

Additionally, I locally tested it via len(index.vector_store._table.search().where(None).limit(None).to_pandas()) and got the entire table (768 nodes), which is the correct result. You can refer to our API reference for more details: https://lancedb.github.io/lancedb/python/python/#lancedb.query.LanceQueryBuilder.limit

Perhaps your table has not ingested all the data or your uri needs a refresh (rm -rf /your_lancedb_path).
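A quick way to test the "table has not ingested all the data" hypothesis is to compare the table's row count with the number of nodes that were supposed to be written. The sketch below assumes the table object exposes a count_rows() method; `check_row_count` and `FakeTable` are hypothetical names, and the fake table exists only so the example runs without a database.

```python
def check_row_count(table, expected_rows: int) -> bool:
    """Return True when the table holds exactly `expected_rows` rows."""
    actual = table.count_rows()  # assumed to be available on the table object
    if actual != expected_rows:
        print(f"table holds {actual} rows, expected {expected_rows}")
    return actual == expected_rows

class FakeTable:
    """Stand-in for a real table so the sketch is self-contained."""

    def __init__(self, rows: int):
        self._rows = rows

    def count_rows(self) -> int:
        return self._rows

print(check_row_count(FakeTable(1510), 3500))
```

If the count is already short at this point, the problem lies in ingestion rather than in the search or retrieval path.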

As for the final retrieval results being fewer than expected, I have already covered that in my comment and tagged Logan. We should wait for his response, as it seems like a parsing problem in the base retriever class.

Thanks

caes27 commented on July 17, 2024

Hey @raghavdixit99,

I believe you when you say the table.search().limit(x) method works lol

I have refreshed the uri multiple times and same issue. Maybe it's a matter of how nodes are being ingested into the lancedb table when you do this:

vector_store = LanceDBVectorStore(uri="./lancedb_RL_test_3", table_name="docs_8.1", query_type="hybrid")
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)

I can't see anywhere else where it can go wrong.

If you have time, maybe you can try it on your end by populating the table with 2000+ nodes and see if you get the same issue?

Thank you!

caes27 commented on July 17, 2024

Did more digging. As I was populating the table little by little, instead of sending it 25000+ nodes at once, I realized something.

Suppose my table has 500 nodes in it currently and I want to add 300 more nodes to the table. I run the following code:

vector_store = LanceDBVectorStore(uri='lancedb_TEST', table_name='docs', query_type='hybrid')
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)

After this is done, there should be 800 nodes in the lancedb table, but after I execute the following code:

db3 = lancedb.connect("lancedb_TEST")
table3 = db3.open_table("docs")
vector_store3 = LanceDBVectorStore.from_table(table3)
index3 = VectorStoreIndex.from_vector_store(vector_store3)
index3.insert_nodes(all_leaf_nodes)
retriever3 = index3.as_retriever(similarity_top_k = 1500)
nodes3 = retriever3.retrieve(question)

nodes3 is of length 300, which were the nodes I just added. It ignores the 500 nodes that were in the lancedb table previously.

Is this not the correct way to add nodes to an existing lancedb table?
I appreciate any help, thank you!

raghavdixit99 commented on July 17, 2024

Hi @caes27
Thanks for the update.

Since you are trying to ingest data iteratively, you should try changing the mode to "append". By default the table overwrites existing data, which could be the reason for this behavior.

vector_store = LanceDBVectorStore(uri='lancedb_TEST', table_name='docs', query_type='hybrid', mode="append")

caes27 commented on July 17, 2024

Hello @raghavdixit99,

I think I might have found the issue that was causing problems.

First, I noticed some faulty logic in the LanceDBVectorStore's "add" method, and fixed it myself. At the same time, I thought about upgrading the package and this also fixed it lol.

I also tried your solution yesterday, and it works if the table already exists and has some data in it. However, it throws an error when the table is empty. So that is a workaround, but this fix in the "add" method seemed to solve ingesting data into a fresh table:

Previous:

if self._table is None:
    self._table = self._connection.create_table(
        self._table_name, data, mode=self.mode
    )
else:
    if self.api_key is None:
        self._table.add(data, mode=self.mode)
    else:
        self._table.add(data)

After:

if self._table is None:
    self._table = self._connection.create_table(
        self._table_name, data, mode=self.mode
    )
else:
    if self.api_key is None:
        self._table.add(data, mode="append")
    else:
        self._table.add(data)

From what I saw, the data was ingested in batches and when the second batch came around, because it was in "overwrite" mode, the second batch's data would completely wipe the first batch's data and so on.
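The batch-by-batch overwrite described here can be reproduced with a small in-memory simulation. `InMemoryTable` and `ingest` are hypothetical stand-ins, not lancedb classes, and the batch size of 2048 is an assumption chosen for illustration.

```python
class InMemoryTable:
    """Toy table whose add() mimics "overwrite" and "append" modes."""

    def __init__(self):
        self.rows = []

    def add(self, data, mode="append"):
        if mode == "overwrite":
            self.rows = list(data)   # earlier batches are discarded
        elif mode == "append":
            self.rows.extend(data)   # earlier batches are kept
        else:
            raise ValueError(f"unknown mode: {mode}")

def ingest(nodes, batch_size, mode):
    """Write `nodes` in batches and return the final row count."""
    table = InMemoryTable()
    for start in range(0, len(nodes), batch_size):
        table.add(nodes[start:start + batch_size], mode=mode)
    return len(table.rows)

nodes = list(range(3558))
print(ingest(nodes, 2048, "overwrite"))  # 1510
print(ingest(nodes, 2048, "append"))     # 3558
```

With 3558 nodes and the assumed batch size, only the second batch of 1510 rows survives in "overwrite" mode. That is consistent with the 1510 rows reported earlier in the thread, although the thread does not confirm the actual batch size used.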

The other problem with LlamaIndex losing some nodes in the retrieval process still persists, so I'm still waiting.

Thanks again Raghav for your help throughout this whole thread.

raghavdixit99 commented on July 17, 2024

Hi @caes27, that is not faulty logic. You are hardcoding the mode argument, whereas we expose it so that it follows the user's input. Please follow the usage from my last comment; for the rest, we are waiting on Logan's response.
