faiss_builder's Introduction

Language Processing & Indexing with FAISS

This repository hosts main.py, a Python script that walks a specified directory, processes several document formats (.doc, .xlsx, .pdf, .csv, and .txt), and indexes them with FAISS (Facebook AI Similarity Search) using embeddings generated by OpenAI's models. The goal is an efficient search and retrieval system for a variety of text documents.

Prerequisites

To successfully run this project, ensure that you have installed:

  • Python 3.8.1 or later (langchain 0.1+ does not support older versions).
  • python-docx Python library (imported as docx).
  • pandas Python library.
  • logging, which is part of the Python standard library and needs no installation.
  • langchain Python library version 0.1+. This library provides utilities to load various file types (TextLoader, CSVLoader, PyPDFLoader) and embeddings (OpenAIEmbeddings).
  • faiss Python library (installed as faiss-cpu or faiss-gpu).

To install the third-party dependencies, run the following pip command (logging ships with Python and must not be installed separately):

pip install python-docx pandas langchain faiss-cpu

If you have a CUDA-capable GPU, you can install faiss-gpu instead of faiss-cpu to use GPU acceleration.

Usage

To run the script, follow the steps outlined below:

  1. Clone this repository to your local machine.

    git clone <repo_url>
  2. Populate a directory with the documents you wish to process.

  3. Set your OpenAI API key in the script by replacing 'YOUR_OPENAI_KEY'.

    openai_key = 'YOUR_OPENAI_KEY'
  4. Set the path to your documents directory by replacing '/path/to/your/directory'.

    root_dir = '/path/to/your/directory'
  5. Run the script.

    python main.py
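
Hard-coding the key in the script works, but it makes it easy to commit a secret by accident. As a minimal sketch of an alternative, assuming you are willing to edit the script, both values can be read from environment variables. The helper name load_settings and the variable name FAISS_ROOT_DIR are illustrative assumptions and are not defined by main.py.

```python
import os


def load_settings(env=None):
    """Return (openai_key, root_dir) read from environment variables.

    Hypothetical helper: the variable names below are assumptions,
    not something main.py defines.
    """
    env = os.environ if env is None else env
    openai_key = env.get("OPENAI_API_KEY")
    if not openai_key:
        raise RuntimeError("Set OPENAI_API_KEY before running the script")
    # Fall back to the current directory when no document directory is given.
    root_dir = env.get("FAISS_ROOT_DIR", ".")
    return openai_key, root_dir
```

With this in place, the two assignments in the script could become `openai_key, root_dir = load_settings()`.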

The script traverses all files in the specified directory and its sub-directories. It converts .doc files into .txt files, and .xls files into .csv files. These converted documents, alongside existing .pdf, .csv, and .txt files, are then loaded into memory one by one. Each file is transformed into an embedding using an OpenAI model, then added to the FAISS index. Once all documents have been processed, the final FAISS index is saved locally as faiss.index.
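
The traversal step described above can be sketched roughly as follows. This is not the code from main.py: the function name plan_files and the two extension sets are assumptions that mirror the description, and the sketch only decides which files would be converted and which would be loaded directly. It does not convert, embed, or index anything.

```python
import os

# Extension groups taken from the description above; the names are illustrative.
CONVERT_EXTENSIONS = {".doc", ".xls"}        # converted to .txt / .csv first
LOAD_EXTENSIONS = {".pdf", ".csv", ".txt"}   # loaded directly


def plan_files(root_dir):
    """Walk root_dir recursively and split files into convert/load lists."""
    to_convert, to_load = [], []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            ext = os.path.splitext(name)[1].lower()
            path = os.path.join(dirpath, name)
            if ext in CONVERT_EXTENSIONS:
                to_convert.append(path)
            elif ext in LOAD_EXTENSIONS:
                to_load.append(path)
            # Files with any other extension are skipped.
    return to_convert, to_load
```

Matching on the lower-cased extension means files such as REPORT.PDF are picked up as well.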

Please note that the script is not fully local: to generate embeddings, it sends the text of each document to OpenAI's API using your key. The FAISS index itself is built and saved on your machine. Review your documents for sensitive content before running the script.

The script also logs any error that occurs while processing a document. These messages appear in your command-line console output.
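
If console output is not enough, the standard logging module can also write the same messages to a file. The sketch below is one possible setup, not what main.py does: the logger name, the log file name, and both helper functions are assumptions, and it assumes that a failure on one document should be logged and skipped rather than stop the run.

```python
import logging


def configure_logging(log_path="faiss_builder.log"):
    """Send log records to the console and to a file (hypothetical file name)."""
    logger = logging.getLogger("faiss_builder")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def process_all(paths, process_one, logger):
    """Call process_one on each path; log failures and keep going."""
    failed = []
    for path in paths:
        try:
            process_one(path)
        except Exception:
            # logger.exception records the message plus the traceback.
            logger.exception("Failed to process %s", path)
            failed.append(path)
    return failed
```

The returned list of failed paths makes it easy to retry or inspect the problem documents after the run.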

License

This project is released under the Unlicense, which dedicates it to the public domain: you may use, modify, and distribute it without restriction.
