
Comments (1)

dosubot commented on August 16, 2024

To split the process of chunking, embedding documents, and inserting them into a vector database into separate components, you can follow these steps:

  1. Load the Data:
    Use a reader to load your data into Document objects.

    from llama_index.core import SimpleDirectoryReader
    
    loader = SimpleDirectoryReader(input_dir="./data")
    documents = loader.load_data()
  2. Chunk the Data:
    Use a text splitter to chunk the documents into smaller pieces (Nodes).

    from llama_index.core.node_parser import SentenceSplitter
    
    text_splitter = SentenceSplitter()
    nodes = text_splitter.get_nodes_from_documents(documents)
  3. Embed the Data:
    Apply embeddings to the nodes (a batched variant is sketched just after this list).

    from llama_index.embeddings.openai import OpenAIEmbedding
    
    embedding_model = OpenAIEmbedding()
    for node in nodes:
        node.embedding = embedding_model.get_text_embedding(node.get_content())
  4. Insert into Vector Database:
    Create a vector index and insert the nodes.

    from llama_index.vector_stores.qdrant import QdrantVectorStore
    import qdrant_client
    
    client = qdrant_client.QdrantClient(location=":memory:")
    vector_store = QdrantVectorStore(client=client, collection_name="test_store")
    
    # QdrantVectorStore.add accepts a list of nodes
    vector_store.add(nodes)

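The per-node loop in step 3 issues one embedding call per node; for larger document sets you can batch the calls instead. A minimal batched sketch of step 3 (it assumes the nodes from step 2 and an OpenAI API key in the environment):

from llama_index.embeddings.openai import OpenAIEmbedding

embedding_model = OpenAIEmbedding()

# Embed all node texts in batched requests rather than one call per node
texts = [node.get_content() for node in nodes]
embeddings = embedding_model.get_text_embedding_batch(texts, show_progress=True)

for node, embedding in zip(nodes, embeddings):
    node.embedding = embedding
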
Alternatively, you can streamline the process with the IngestionPipeline class, which applies the specified transformations to the documents and stores the results in a vector store:

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)
_nodes = pipeline.run(documents=documents, num_workers=4)
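
If the same files may be ingested more than once, you can attach a document store so the pipeline can skip inputs it has already processed. A sketch building on the snippet above (SimpleDocumentStore is in-memory; a persistent docstore would be needed to survive restarts):

from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),  # tracks document hashes so unchanged documents can be skipped on re-runs
)
_nodes = pipeline.run(documents=documents, num_workers=4)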

By following these steps, you can modularize the process into separate components for loading, chunking, embedding, and indexing; when you don't need that separation, the IngestionPipeline class collapses the same steps into a single, easier-to-manage run [1][2][3][4].
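
With either approach, once the nodes are in the vector store you can query them without re-embedding by wrapping the store in a VectorStoreIndex. A minimal sketch (the question string is a placeholder):

from llama_index.core import VectorStoreIndex

# Build an index view over the already-populated store; no embedding work is repeated here
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

query_engine = index.as_query_engine()
print(query_engine.query("What are the main topics in these documents?"))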

To continue talking to Dosu, mention @dosu.
