Comments (1)
To split the process of chunking, embedding documents, and inserting them into a vector database into separate components, you can follow these steps:
-
Load the Data:
Use a reader to load your data into Document objects.
from llama_index.core import SimpleDirectoryReader
loader = SimpleDirectoryReader(input_dir="./data")
documents = loader.load_data()
-
Chunk the Data:
Use a text splitter to chunk the documents into smaller pieces (Nodes).
from llama_index.core.node_parser import SentenceSplitter
text_splitter = SentenceSplitter()
# get_nodes_from_documents() turns the list of Documents into a list of Nodes
nodes = text_splitter.get_nodes_from_documents(documents)
-
Embed the Data:
Apply embeddings to the nodes.
from llama_index.embeddings.openai import OpenAIEmbedding
# OpenAIEmbedding reads OPENAI_API_KEY from the environment
embedding_model = OpenAIEmbedding()
for node in nodes:
    node.embedding = embedding_model.get_text_embedding(node.get_content())
-
Insert into Vector Database:
Create a vector store and add the embedded nodes to it.
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")
# add() takes the whole list of nodes, each with its embedding already set
vector_store.add(nodes)
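To make the separation of stages concrete without any external services, here is a minimal standard-library sketch of the same four steps. Everything in it is a hypothetical stand-in and not LlamaIndex API: the `Node` dataclass, the fixed-size character splitter, the toy character-count "embedding", and the `InMemoryStore` class are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a chunk of text plus its embedding."""
    text: str
    embedding: list[float] = field(default_factory=list)


def load(texts: list[str]) -> list[str]:
    # Stage 1: a real loader would read files; here the documents are plain strings.
    return [t.strip() for t in texts if t.strip()]


def chunk(documents: list[str], size: int = 40, overlap: int = 10) -> list[Node]:
    # Stage 2: fixed-size character windows with overlap (requires overlap < size).
    step = size - overlap
    nodes = []
    for doc in documents:
        for start in range(0, len(doc), step):
            nodes.append(Node(text=doc[start:start + size]))
            if start + size >= len(doc):
                break  # the current window already reaches the end of the document
    return nodes


def embed(nodes: list[Node], dim: int = 16) -> list[Node]:
    # Stage 3: toy embedding, not a real model. Counts alphanumeric characters
    # into `dim` buckets and normalizes the vector to unit length.
    for node in nodes:
        vec = [0.0] * dim
        for ch in node.text.lower():
            if ch.isalnum():
                vec[ord(ch) % dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        node.embedding = [v / norm for v in vec]
    return nodes


class InMemoryStore:
    """Hypothetical vector store: keeps nodes in a list, searches by brute force."""

    def __init__(self) -> None:
        self.nodes: list[Node] = []

    def add(self, nodes: list[Node]) -> None:
        # Stage 4: insertion is just appending the already-embedded nodes.
        self.nodes.extend(nodes)

    def query(self, embedding: list[float], top_k: int = 1) -> list[Node]:
        # Vectors are unit length, so the dot product equals cosine similarity.
        def score(node: Node) -> float:
            return sum(a * b for a, b in zip(node.embedding, embedding))
        return sorted(self.nodes, key=score, reverse=True)[:top_k]


docs = load(["LlamaIndex splits documents into nodes before embedding them.", "  "])
nodes = embed(chunk(docs))
store = InMemoryStore()
store.add(nodes)
```

Because each stage only depends on the output of the previous one, any of them can be swapped (a different splitter, a real embedding model, a real vector database) without touching the others.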
Alternatively, you can streamline the process using the IngestionPipeline class, which applies the specified transformations to the documents and stores the resulting nodes in a vector store:
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)
_nodes = pipeline.run(documents=documents, num_workers=4)
By following these steps, you can modularize the process into separate components for loading, chunking, embedding, and indexing, or use the IngestionPipeline class to run the same transformations in a single call [1][2][3][4].
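The idea behind a transformation pipeline can be sketched in a few lines: each transformation takes a list of nodes and returns a new list, and the runner applies them in order before handing the result to a store. This is only an illustration of the pattern, not how IngestionPipeline is implemented; `run_pipeline`, `split_sentences`, `add_length_embedding`, and the dict-based nodes are hypothetical names chosen for this example.

```python
from typing import Callable, Optional

# A transformation maps a list of node dicts to a new list of node dicts.
Transform = Callable[[list[dict]], list[dict]]


def split_sentences(nodes: list[dict]) -> list[dict]:
    # Naive splitter: one output node per sentence, split on ". ".
    out = []
    for node in nodes:
        for sentence in node["text"].split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                out.append({"text": sentence})
    return out


def add_length_embedding(nodes: list[dict]) -> list[dict]:
    # Placeholder "embedding": a one-element vector holding the text length.
    return [{**n, "embedding": [float(len(n["text"]))]} for n in nodes]


def run_pipeline(
    documents: list[dict],
    transformations: list[Transform],
    store: Optional[list] = None,
) -> list[dict]:
    nodes = documents
    for transform in transformations:
        # Each stage consumes the previous stage's output.
        nodes = transform(nodes)
    if store is not None:
        # A plain list stands in for the vector store here.
        store.extend(nodes)
    return nodes


store: list[dict] = []
result = run_pipeline(
    [{"text": "Chunk first. Embed second."}],
    [split_sentences, add_length_embedding],
    store,
)
```

The order of the transformations matters: the splitter has to run before the embedding step so that each chunk, rather than each whole document, gets its own vector.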