
Comments (1)

dosubot commented on August 16, 2024

To split the process of chunking, embedding documents, and inserting them into a vector database into separate components, you can follow these steps:

  1. Load the Data:
    Use a reader to load your data into Document objects.

    from llama_index.core import SimpleDirectoryReader
    
    loader = SimpleDirectoryReader(input_dir="./data")
    documents = loader.load_data()
  2. Chunk the Data:
    Use a text splitter to chunk the documents into smaller pieces (Nodes).

    from llama_index.core.node_parser import SentenceSplitter
    
    text_splitter = SentenceSplitter()
    nodes = text_splitter.get_nodes_from_documents(documents)
  3. Embed the Data:
    Apply embeddings to the nodes (a batched variant is sketched just after this list).

    from llama_index.embeddings.openai import OpenAIEmbedding
    
    embedding_model = OpenAIEmbedding()
    for node in nodes:
        node.embedding = embedding_model.get_text_embedding(node.get_content())
  4. Insert into Vector Database:
    Create a vector index and insert the nodes.

    from llama_index.vector_stores.qdrant import QdrantVectorStore
    import qdrant_client
    
    client = qdrant_client.QdrantClient(location=":memory:")
    vector_store = QdrantVectorStore(client=client, collection_name="test_store")
    
    # QdrantVectorStore.add accepts a list of nodes
    vector_store.add(nodes)

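The per-node loop in step 3 issues one embedding call per node; for larger document sets you can batch the calls instead. A minimal batched sketch of step 3 (it assumes the nodes from step 2 and an OpenAI API key in the environment):

from llama_index.embeddings.openai import OpenAIEmbedding

embedding_model = OpenAIEmbedding()

# Embed all node texts in batched requests rather than one call per node
texts = [node.get_content() for node in nodes]
embeddings = embedding_model.get_text_embedding_batch(texts, show_progress=True)

for node, embedding in zip(nodes, embeddings):
    node.embedding = embedding
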
Alternatively, you can streamline the process with the IngestionPipeline class, which applies the specified transformations to the documents and stores the results in a vector store:

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)
_nodes = pipeline.run(documents=documents, num_workers=4)
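
If the same files may be ingested more than once, you can attach a document store so the pipeline can skip inputs it has already processed. A sketch building on the snippet above (SimpleDocumentStore is in-memory; a persistent docstore would be needed to survive restarts):

from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),  # tracks document hashes so unchanged documents can be skipped on re-runs
)
_nodes = pipeline.run(documents=documents, num_workers=4)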

By following these steps, you can modularize the process into separate components for loading, chunking, embedding, and indexing; when you don't need that separation, the IngestionPipeline class collapses the same steps into a single, easier-to-manage run [1][2][3][4].
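
With either approach, once the nodes are in the vector store you can query them without re-embedding by wrapping the store in a VectorStoreIndex. A minimal sketch (the question string is a placeholder):

from llama_index.core import VectorStoreIndex

# Build an index view over the already-populated store; no embedding work is repeated here
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

query_engine = index.as_query_engine()
print(query_engine.query("What are the main topics in these documents?"))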

To continue talking to Dosu, mention @dosu.
