
embedchain's Introduction

Embedchain Logo

PyPI Downloads Slack Discord Twitter Open in Colab codecov


What is Embedchain?

Embedchain is an open-source framework for personalizing LLM responses. It makes it easy to create and deploy personalized AI apps. At its core, Embedchain follows the design principle of being "Conventional but Configurable" to serve both software engineers and machine learning engineers.

Embedchain streamlines the creation of personalized LLM applications, offering a seamless process for managing various types of unstructured data. It efficiently segments data into manageable chunks, generates relevant embeddings, and stores them in a vector database for optimized retrieval. With a suite of diverse APIs, it enables users to extract contextual information, find precise answers, or engage in interactive chat conversations, all tailored to their own data.
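The pipeline described above (segment into chunks, generate embeddings, store in a vector database, retrieve at query time) can be sketched in miniature. Everything below is a simplified stand-in: the bag-of-words "embedding" and the `ToyVectorStore` class are illustrative, not Embedchain's actual implementation.

```python
import math
from collections import Counter

def chunk(text, size=200):
    """Split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb or 1.0)

class ToyVectorStore:
    """Minimal stand-in for the vector database."""
    def __init__(self):
        self.rows = []  # (embedding, chunk) pairs

    def add(self, text):
        for c in chunk(text):
            self.rows.append((embed(c), c))

    def query(self, question, n_results=1):
        q = embed(question)
        ranked = sorted(self.rows, key=lambda r: -cosine(q, r[0]))
        return [c for _, c in ranked[:n_results]]

store = ToyVectorStore()
store.add("Elon Musk is the CEO of SpaceX.")
store.add("Paris is the capital of France.")
print(store.query("Who runs SpaceX?")[0])  # most relevant chunk
```

A real deployment swaps the toy embedding for a model (e.g. an OpenAI embedding endpoint) and the list scan for an indexed vector store; the control flow stays the same.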

🔧 Quick install

Python API

pip install embedchain

✨ Live demo

Check out the Chat with PDF live demo we created using Embedchain. You can find the source code here.

๐Ÿ” Usage

Embedchain Demo

For example, you can create an Elon Musk bot using the following code:

import os
from embedchain import App

# Create a bot instance
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"
app = App()

# Embed online resources
app.add("https://en.wikipedia.org/wiki/Elon_Musk")
app.add("https://www.forbes.com/profile/elon-musk")

# Query the app
app.query("How many companies does Elon Musk run and name those?")
# Answer: Elon Musk currently runs several companies. As of my knowledge, he is the CEO and lead designer of SpaceX, the CEO and product architect of Tesla, Inc., the CEO and founder of Neuralink, and the CEO and founder of The Boring Company. However, please note that this information may change over time, so it's always good to verify the latest updates.

You can also try it in your browser with Google Colab:

Open in Colab

📖 Documentation

Comprehensive guides and API documentation are available to help you get the most out of Embedchain.

🔗 Join the Community

๐Ÿค Schedule a 1-on-1 Session

Book a 1-on-1 session with the founders to discuss any issues, provide feedback, or explore how we can improve Embedchain for you.

๐ŸŒ Contributing

Contributions are welcome! Please check out the issues on the repository, and feel free to open a pull request. For more information, please see the contributing guidelines.

For more reference, please go through the Development Guide and the Documentation Guide.

Anonymous Telemetry

We collect anonymous usage metrics to enhance our package's quality and user experience. This includes data like feature usage frequency and system info, but never personal details. The data helps us prioritize improvements and ensure compatibility. If you wish to opt-out, set the environment variable EC_TELEMETRY=false. We prioritize data security and don't share this data externally.
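For example, the opt-out can be set in code before the library is imported (only the `EC_TELEMETRY` variable name comes from the text above; the rest is a minimal sketch):

```python
import os

# Opt out of Embedchain's anonymous telemetry. Environment variables are
# typically read at import time, so set this before importing embedchain.
os.environ["EC_TELEMETRY"] = "false"

print(os.environ["EC_TELEMETRY"])  # false
```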

Citation

If you utilize this repository, please consider citing it with:

@misc{embedchain,
  author = {Taranjeet Singh and Deshraj Yadav},
  title = {Embedchain: The Open Source RAG Framework},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/embedchain/embedchain}},
}

embedchain's People

Contributors

aaishikdutta, ahnedeee, aryankhanna475, cachho, cclauss, deshraj, dev-khant, deven298, dhravya, dtee1, eltociear, gasolin, ianupamsingh, jonasiwnl, juananpe, kapilm26, maccuryj, pc9, prikshit7766, rayhanpatel, rishiraj2594, rupeshbansal, sahilyadav902, sandrasgg, sersamgy, sidmohanty11, subhajit20, sukkritsharmaofficial, sw8fbar, taranjeet


embedchain's Issues

Non-feature request - Modularize the application

The Embedchain class has a lot of methods, and abstracting it a bit would add value in terms of code readability. There are many open issues about integrating multiple LLMs, vector DBs, or embeddings. While I see a level of abstraction in the vector db folder that can be leveraged for further integration options, I believe we should do something similar for the methods where we use the embedding models and the LLM model. I have raised a PR for this, #92, which abstracts the data formats for loaders and chunkers. @taranjeet @cachho please let me know if this is something we can add, so we can have further discussions on how to structure the more critical pieces like the embedding models and chat completions.

ImportError: cannot import name 'App' from partially initialized module 'embedchain' (most likely due to a circular import)

I encountered a strange problem: my Python code consists of only one file, and when the name of this file is the same as the name of the library it imports (embedchain.py), an error is reported: ImportError: cannot import name 'App' from partially initialized module 'embedchain' (most likely due to a circular import).

So, just rename the file to another name and it will be fixed.

openai.error.ServiceUnavailableError: The server is overloaded or not ready yet.

My code:

import os
from keys import *
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

from embedchain import App

naval_chat_bot = App()

naval_chat_bot.add("web_page", "https://psymplicity.com/")

print(naval_chat_bot.query("what is the three-step approach to private mental health care"))

The Error:

Unable to connect optimized C data functions [No module named '_testbuffer'], falling back to pure Python
All data from https://psymplicity.com/ already exists in the database.
Traceback (most recent call last):
  File "c:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\code\flask_app_2\embedchain_test.py", line 21, in <module>
    print(naval_chat_bot.query("what is the three-step approach to private mental health care"))
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\embedchain\embedchain.py", line 225, in query
    answer = self.get_answer_from_llm(prompt)
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\embedchain\embedchain.py", line 211, in get_answer_from_llm
    answer = self.get_openai_answer(prompt)
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\embedchain\embedchain.py", line 162, in get_openai_answer
    response = openai.ChatCompletion.create(
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\openai\api_resources\chat_completion.py", line 25, in create
    return super().create(*args, **kwargs)
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\openai\api_resources\abstract\engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\openai\api_requestor.py", line 298, in request
    resp, got_stream = self._interpret_response(result, stream)
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\openai\api_requestor.py", line 700, in _interpret_response
    self._interpret_response_line(
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\openai\api_requestor.py", line 743, in _interpret_response_line
    raise error.ServiceUnavailableError(
openai.error.ServiceUnavailableError: The server is overloaded or not ready yet.

Issue on TypeVar

When trying to run the sample code I get this:
ImportError: cannot import name 'TypeVar' from 'typing_extensions' (/databricks/python/lib/python3.10/site-packages/typing_extensions.py)

I am running this in a Databricks notebook.

Project Tools

Setup Following Project Management Tools

  1. Project package and environment manager: Poetry is recommended
  2. pytest and pylint setup
  3. Contributing guide
  4. Sphinx documentation, deployed on the Read the Docs server
  5. Docstrings for the API: Google style is recommended
  6. CI/CD workflows

I can help with the above.

Using GPT-4 for prompting

Hi there, I see that the framework is using GPT-3.5 in the last release for prompting.
How can I change to GPT-4?

My Best for this project !
Regards
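The model name appears hard-coded inside `get_openai_answer` (its source is quoted in a later issue on this page), so one workaround is to override that method, or just the model name, in a subclass. Below is a minimal stand-in sketch: `App` and `get_openai_answer` mirror the library's names, but the bodies here are simplified stubs, not the real implementation.

```python
class App:
    """Simplified stand-in for embedchain's App class."""
    model = "gpt-3.5-turbo-0613"  # the model name the library hard-codes

    def get_openai_answer(self, prompt):
        return self._chat_completion(self.model, prompt)

    def _chat_completion(self, model, prompt):
        # Stub standing in for openai.ChatCompletion.create(...)
        return f"[{model}] {prompt}"

class GPT4App(App):
    model = "gpt-4"  # override just the model name

print(GPT4App().get_openai_answer("hello"))  # [gpt-4] hello
```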

Fine tune tone for the answer

  • Wondering if it is possible to fine-tune the tone the AI uses when replying to me? For example, if I provided the dialogue of Sherlock Holmes, could it reply in the tone that Sherlock talks in? Ty!
  • This issue is opened on behalf of twitter user ring_hyacinth, tweet

epub format

Please add epub as one of the supported formats.

Add new format - sitemap

Hi @taranjeet I was working on my mini project to chat over a small-sized blog, and I found myself writing some code to iterate over the sitemap of the website. I think it would be valuable if we could provide format support for a sitemap to automate multiple web page loading and chunking. Do you already have an issue tracking that, or is it something that can be added?
Right now I am doing something like this:

import requests
from bs4 import BeautifulSoup

# Download the sitemap.xml file from a website and extract all the links
def get_links(url):
    url = f'{url}/sitemap.xml'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        links = [link.text for link in soup.find_all('loc')]
        return links
    else:
        print(f'Error: {response.status_code}')
        return None
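For reference, the `<loc>` extraction can also be done with only the standard library, avoiding the requests/BeautifulSoup dependencies. A sketch, assuming the sitemap XML has already been downloaded:

```python
import xml.etree.ElementTree as ET

def parse_sitemap(xml_text):
    # Extract all <loc> entries; tag names carry the sitemap XML
    # namespace as a prefix, so match on the local-name suffix.
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if el.tag.endswith("loc")]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

print(parse_sitemap(sitemap))  # ['https://example.com/a', 'https://example.com/b']
```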

Add support to load codebase

  • Thanks, such a handy repo! Loving the user-friendly API. Can't wait to see it support a whole codebase (just like other types of documents) in the future :)
  • Opened on behalf of twitter user ericman65204539, tweet

Add meta data

  • Is there a way to add more metadata on each document? Something like a document ID, and get it back in the response?
  • opened on behalf of discord user ikinnrot, message link

Add tests

  • need to setup tests so that contributing to the repo becomes easier and faster

feature request: Add New Format "Image"

Embedchain will parse uploaded images, extract text information and embed.
Ex, Screenshot of a book chapter.

The parser package should be configurable, the default should be opensource.
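A configurable-parser design could look like the following sketch. `register_image_parser` and `extract_text` are hypothetical names, not embedchain API, and the default parser is a stub where an open-source OCR engine (e.g. Tesseract) would plug in.

```python
# Registry of pluggable image-to-text parsers (hypothetical design).
_image_parsers = {}

def register_image_parser(name, fn):
    """Register a parser: a callable taking image bytes, returning text."""
    _image_parsers[name] = fn

def extract_text(image_bytes, parser="default"):
    """Run the chosen parser over the raw image bytes."""
    return _image_parsers[parser](image_bytes)

# Default: a stub that a real open-source OCR function would replace.
register_image_parser("default", lambda img: "<no OCR engine configured>")

# Plugging in a custom parser (toy example: just uppercase the bytes).
register_image_parser("upper", lambda img: img.decode().upper())
print(extract_text(b"chapter one", parser="upper"))  # CHAPTER ONE
```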

Issue with get_openai_answer

max_tokens parameter being set to 1000 is an issue. With having multiple sources (with long urls) and larger webpages, this is quickly eaten up. When the token amount is exceeded no warning is given except from openAI.

openai.error.RateLimitError: The server had an error while processing your request. Sorry about that!

def get_openai_answer(self, prompt):
    messages = []
    messages.append({
        "role": "user", "content": prompt
    })
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=messages,
        temperature=0,
        max_tokens=1000,
        top_p=1,
    )
    return response["choices"][0]["message"]["content"]
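One way to avoid silently blowing past the budget is a pre-flight cut of the context before it reaches the API. A rough sketch using the common ~4-characters-per-token heuristic for English text; a real implementation would count tokens exactly with a tokenizer (e.g. tiktoken), and `truncate_context` is a hypothetical helper:

```python
def truncate_context(context, max_tokens=1000, chars_per_token=4):
    """Cut the context to a rough character budget derived from the
    token limit (~4 chars/token is a common English-text heuristic)."""
    budget = max_tokens * chars_per_token
    if len(context) <= budget:
        return context
    return context[:budget]

long_context = "word " * 2000               # 10,000 characters
print(len(truncate_context(long_context)))  # 4000
```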

What are the implications of allowing more documents as context?

Let's talk about this method:

def query(self, input_query):
        """
        Queries the vector database based on the given input query.
        Gets relevant doc based on the query and then passes it to an
        LLM as context to get the answer.

        :param input_query: The query to use.
        :return: The answer to the query.
        """
        result = self.collection.query(
            query_texts=[input_query,],
            n_results=1,
        )
        result_formatted = self._format_result(result)
        answer = self.get_answer_from_llm(input_query, result_formatted[0][0].page_content)
        return answer

As far as I can tell (and I'm just reading, not necessarily understanding, correct me if I'm wrong), it will return the single closest document: n_results=1.

What if we have a more granular database, cut into smaller pieces?

E.g. the webpages and documents we added are only a paragraph long. Then it will only return that one paragraph. So let's keep imagining that a user asks a complex question for which the correct answer is stored in more than one document. Then it would only answer part of the question with limited knowledge.

Here's a simple example. Let's say we are in the car business and feed our database information about the Corvette, one page for each generation. Then a user asks "how much horsepower does the current Corvette make and how much did the first one make?". If my understanding is correct, it could not answer that question (for this specific question, ChatGPT knows the answer out of the box, but you get the point).

For these kinds of use cases I'm proposing to allow the retrieval of more than one document, configurable by the user. 1 can stay as the default. These are then all passed as context so an LLM can do its magic and process the information.

The downside I can see is that it will require more tokens, and thus cost more. This is a compromise the user has to make for better results. The max token limit should also be considered, especially in cases where the database contains both short and long text; for this edge case, max tokens should be configurable by the user, and if a limit is set, the tokens of the prompt should be counted and cut off if necessary. Edit: OpenAI has a max tokens parameter that does all of this.

P.S. Why are we prompting with prompt = f"""Use the following pieces of context to answer the query at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. {context} if we just use one piece of context?

I will propose a PR for this.
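The proposal can be sketched as a context builder that takes the top-n retrieved documents and joins them under a budget before prompting. `build_context` is a hypothetical helper, and the Corvette snippets are illustrative data:

```python
def build_context(documents, n_results=1, max_chars=4000):
    """Join up to n_results retrieved documents into one context string,
    stopping once a rough character budget would be exceeded."""
    picked, used = [], 0
    for doc in documents[:n_results]:
        if used + len(doc) > max_chars:
            break
        picked.append(doc)
        used += len(doc)
    return "\n\n".join(picked)

# Documents as ranked by the vector store, closest first.
docs = ["The current Corvette makes 495 hp.", "The first Corvette made 150 hp."]
print(build_context(docs, n_results=2))
```

With n_results=1 this reduces to the current behavior; with a higher value the LLM sees enough context to answer multi-document questions like the Corvette example above.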

[BUG] Chroma DB Duplicate ID Error

This is my code:

import os
os.environ["OPENAI_API_KEY"] = "sk-???"
from embedchain import App
naval_chat_bot = App()
naval_chat_bot.add_local("pdf_file", "docs/masnavi-en.pdf")
print(naval_chat_bot.query("Who is the most powerful man?"))

I get chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates for. Where is the problem?

P.S: This was my second attempt. The first one with a different pdf document was successful.
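A common fix for this class of error is deriving chunk IDs deterministically from the content and deduplicating before inserting, so re-adding the same chunk maps to the same ID instead of raising. `chunk_id` and `dedupe` below are hypothetical helpers, not embedchain functions:

```python
import hashlib

def chunk_id(text, source):
    """Deterministic ID from the chunk text plus its source document."""
    return hashlib.sha256(f"{source}:{text}".encode()).hexdigest()

def dedupe(chunks, source):
    """Drop chunks whose ID was already seen, keeping first occurrences."""
    seen, out = set(), []
    for c in chunks:
        cid = chunk_id(c, source)
        if cid not in seen:
            seen.add(cid)
            out.append((cid, c))
    return out

pages = ["same paragraph", "same paragraph", "other paragraph"]
print(len(dedupe(pages, "docs/masnavi-en.pdf")))  # 2
```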

Feature Request - Add DataFrames (Spark or Pandas) as Sources

Currently, embedchain allows the addition of various types of data sources such as YouTube videos, PDF files, and web pages to be processed and used in the application. This feature request proposes to extend this functionality to include DataFrames, specifically those from the Spark or Pandas libraries, as potential data sources.

DataFrames are a commonly used data structure for handling and manipulating data in Python, especially in data science and machine learning applications. They are particularly effective when dealing with large, structured datasets, which can include text data.

The ability to use DataFrames as a source of data would add a significant amount of flexibility to embedchain, as users could directly input their preprocessed and transformed data into the application. This could be beneficial in scenarios where the data is already available in a DataFrame format, such as when it has been preprocessed or transformed as part of a larger data pipeline.

The implementation of this feature would involve adding a new method to the App class (or modifying the existing .add() method) that accepts a DataFrame and its format (Spark or Pandas) as arguments. The method would then handle the loading of the data from the DataFrame into the application in the appropriate format, ready to be processed and used in the application.

This feature would increase the flexibility and usefulness of embedchain, making it more applicable to a wider range of scenarios and use-cases, and potentially attracting a broader user base. It would also align well with common data science workflows, which often involve the use of DataFrames for data manipulation and analysis.

Please consider adding this feature in a future update of embedchain.
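A loader along these lines could accept rows-of-dicts, which is the shape pandas' `DataFrame.to_dict("records")` produces, keeping this sketch free of a pandas dependency. `rows_to_documents` is a hypothetical name, not proposed API:

```python
def rows_to_documents(rows, id_column=None):
    """Turn DataFrame-like rows (list of dicts) into (id, text) documents
    ready for chunking and embedding."""
    docs = []
    for i, row in enumerate(rows):
        doc_id = str(row.get(id_column, i)) if id_column else str(i)
        text = "\n".join(f"{k}: {v}" for k, v in row.items())
        docs.append((doc_id, text))
    return docs

rows = [
    {"id": "a1", "title": "Q1 report", "body": "Revenue grew 10%."},
    {"id": "a2", "title": "Q2 report", "body": "Revenue grew 12%."},
]
for doc_id, text in rows_to_documents(rows, id_column="id"):
    print(doc_id, "->", text.splitlines()[1])
```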

[Feature Request] Auto-detect data type, make it optional

First off... Great job!!! Simple and tight code. Much appreciate you making/sharing it.

There was one quick suggestion I had: in order to minimize boilerplate code, it would be good to modify the interface to make the file_type argument optional and detect it based on the input content. If the argument is provided, the code would check the file to ensure that it is of the specified type.

This ease-of-life modification should be added early in development to minimize more extensive refactors down the line.

But I wholly understand if you have a different design goal for making this a required input.
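Detection could key off the URL scheme and file extension. A sketch: the returned strings "web_page"/"pdf_file" mirror type names used elsewhere on this page, while `detect_data_type` and the fallback "text" are hypothetical:

```python
from urllib.parse import urlparse

def detect_data_type(source):
    """Heuristic data-type detection from the source string."""
    path = urlparse(source).path.lower()
    if path.endswith(".pdf"):
        return "pdf_file"
    if source.startswith(("http://", "https://")) or path.endswith((".html", ".htm")):
        return "web_page"
    return "text"

print(detect_data_type("https://en.wikipedia.org/wiki/Elon_Musk"))  # web_page
print(detect_data_type("docs/masnavi-en.pdf"))                      # pdf_file
```

Content sniffing (e.g. checking for the %PDF magic bytes) would make this robust for sources whose name carries no extension.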

Not installing

Trying to install using pip3 and it returns this error:

Building wheels for collected packages: hnswlib
  Building wheel for hnswlib (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for hnswlib (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [199 lines of output]
      running bdist_wheel
      running build
      running build_ext
      creating var
      creating var/folders
      creating var/folders/8c
      creating var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn
      creating var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T
      x86_64-apple-darwin13.4.0-clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -isystem /Users/acf/opt/anaconda3/include -D_FORTIFY_SOURCE=2 -isystem /Users/acf/opt/anaconda3/include -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c /var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/tmp4e6jgsj0.cpp -o var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/tmp4e6jgsj0.o -std=c++14
      x86_64-apple-darwin13.4.0-clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -isystem /Users/acf/opt/anaconda3/include -D_FORTIFY_SOURCE=2 -isystem /Users/acf/opt/anaconda3/include -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c /var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/tmpsl27hkck.cpp -o var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/tmpsl27hkck.o -fvisibility=hidden
      building 'hnswlib' extension
      creating build
      creating build/temp.macosx-13-arm64-cpython-311
      creating build/temp.macosx-13-arm64-cpython-311/python_bindings
      x86_64-apple-darwin13.4.0-clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -isystem /Users/acf/opt/anaconda3/include -D_FORTIFY_SOURCE=2 -isystem /Users/acf/opt/anaconda3/include -I/private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include -I/opt/homebrew/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include -I./hnswlib/ -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c ./python_bindings/bindings.cpp -o build/temp.macosx-13-arm64-cpython-311/./python_bindings/bindings.o -O3 -stdlib=libc++ -mmacosx-version-min=10.7 -DVERSION_INFO=\"0.7.0\" -std=c++14 -fvisibility=hidden
      In file included from ./python_bindings/bindings.cpp:6:
      In file included from ./hnswlib/hnswlib.h:199:
      ./hnswlib/hnswalg.h:755:27: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
              for (int i = 0; i < dim; i++) {
                              ~ ^ ~~~
      ./python_bindings/bindings.cpp:102:13: warning: format specifies type 'int' but the argument has type 'pybind11::ssize_t' (aka 'long') [-Wformat]
                  buffer.ndim);
                  ^~~~~~~~~~~
      ./python_bindings/bindings.cpp:126:17: warning: format specifies type 'int' but the argument has type 'pybind11::ssize_t' (aka 'long') [-Wformat]
                      ids_numpy.ndim, feature_rows);
                      ^~~~~~~~~~~~~~
      ./python_bindings/bindings.cpp:126:33: warning: format specifies type 'int' but the argument has type 'size_t' (aka 'unsigned long') [-Wformat]
                      ids_numpy.ndim, feature_rows);
                                      ^~~~~~~~~~~~
      ./python_bindings/bindings.cpp:121:58: warning: comparison of integers of different signs: 'std::__vector_base<long, std::allocator<long>>::value_type' (aka 'long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
              if (!((ids_numpy.ndim == 1 && ids_numpy.shape[0] == feature_rows) ||
                                            ~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
      ./python_bindings/bindings.cpp:383:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:386:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:389:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:392:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:395:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:647:28: warning: unused variable 'data' [-Wunused-variable]
                          float* data = (float*)items.data(row);
                                 ^
      ./python_bindings/bindings.cpp:667:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:670:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:853:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:856:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:876:1: warning: 'pybind11_init' is deprecated: PYBIND11_PLUGIN is deprecated, use PYBIND11_MODULE [-Wdeprecated-declarations]
      PYBIND11_PLUGIN(hnswlib) {
      ^
      /private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include/pybind11/detail/common.h:432:20: note: expanded from macro 'PYBIND11_PLUGIN'
                  return pybind11_init();                                                               \
                         ^
      ./python_bindings/bindings.cpp:876:1: note: 'pybind11_init' has been explicitly marked deprecated here
      /private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include/pybind11/detail/common.h:426:5: note: expanded from macro 'PYBIND11_PLUGIN'
          PYBIND11_DEPRECATED("PYBIND11_PLUGIN is deprecated, use PYBIND11_MODULE")                     \
          ^
      /private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include/pybind11/detail/common.h:194:43: note: expanded from macro 'PYBIND11_DEPRECATED'
      #    define PYBIND11_DEPRECATED(reason) [[deprecated(reason)]]
                                                ^
      In file included from ./python_bindings/bindings.cpp:6:
      In file included from ./hnswlib/hnswlib.h:199:
      ./hnswlib/hnswalg.h:95:11: warning: field 'link_list_locks_' will be initialized after field 'label_op_locks_' [-Wreorder-ctor]
              : link_list_locks_(max_elements),
                ^
      ./python_bindings/bindings.cpp:488:39: note: in instantiation of member function 'hnswlib::HierarchicalNSW<float>::HierarchicalNSW' requested here
                  new_index->appr_alg = new hnswlib::HierarchicalNSW<dist_t>(
                                            ^
      ./python_bindings/bindings.cpp:880:38: note: in instantiation of member function 'Index<float>::createFromParams' requested here
              .def(py::init(&Index<float>::createFromParams), py::arg("params"))
                                           ^
      ./python_bindings/bindings.cpp:667:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:892:28: note: in instantiation of member function 'Index<float>::knnQuery_return_numpy' requested here
                  &Index<float>::knnQuery_return_numpy,
                                 ^
      ./python_bindings/bindings.cpp:670:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:619:22: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
                  if (rows <= num_threads * 4) {
                      ~~~~ ^  ~~~~~~~~~~~~~~~
      ./python_bindings/bindings.cpp:257:22: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
              if (features != dim)
                  ~~~~~~~~ ^  ~~~
      ./python_bindings/bindings.cpp:898:28: note: in instantiation of member function 'Index<float>::addItems' requested here
                  &Index<float>::addItems,
                                 ^
      ./python_bindings/bindings.cpp:261:18: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
              if (rows <= num_threads * 4) {
                  ~~~~ ^  ~~~~~~~~~~~~~~~
      In file included from ./python_bindings/bindings.cpp:6:
      In file included from ./hnswlib/hnswlib.h:199:
      ./hnswlib/hnswalg.h:755:27: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
              for (int i = 0; i < dim; i++) {
                              ~ ^ ~~~
      ./python_bindings/bindings.cpp:323:47: note: in instantiation of function template specialization 'hnswlib::HierarchicalNSW<float>::getDataByLabel<float>' requested here
                  data.push_back(appr_alg->template getDataByLabel<data_t>(id));
                                                    ^
      ./python_bindings/bindings.cpp:903:49: note: in instantiation of member function 'Index<float>::getDataReturnList' requested here
              .def("get_items", &Index<float, float>::getDataReturnList, py::arg("ids") = py::none())
                                                      ^
      ./python_bindings/bindings.cpp:383:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:467:27: note: in instantiation of member function 'Index<float>::getAnnData' requested here
              auto ann_params = getAnnData();
                                ^
      ./python_bindings/bindings.cpp:945:43: note: in instantiation of member function 'Index<float>::getIndexParams' requested here
                      return py::make_tuple(ind.getIndexParams()); /* Return dict (wrapped in a tuple) that fully encodes state of the Index object */
                                                ^
      ./python_bindings/bindings.cpp:386:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:389:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:392:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:395:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      In file included from ./python_bindings/bindings.cpp:6:
      In file included from ./hnswlib/hnswlib.h:198:
      ./hnswlib/bruteforce.h:105:27: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
              for (int i = 0; i < k; i++) {
                              ~ ^ ~
      ./hnswlib/bruteforce.h:59:5: note: in instantiation of member function 'hnswlib::BruteforceSearch<float>::searchKnn' requested here
          ~BruteforceSearch() {
          ^
      ./python_bindings/bindings.cpp:748:13: note: in instantiation of member function 'hnswlib::BruteforceSearch<float>::~BruteforceSearch' requested here
                  delete alg;
                  ^
      /Users/acf/opt/anaconda3/bin/../include/c++/v1/memory:1397:5: note: in instantiation of member function 'BFIndex<float>::~BFIndex' requested here
          delete __ptr;
          ^
      /Users/acf/opt/anaconda3/bin/../include/c++/v1/memory:1658:7: note: in instantiation of member function 'std::default_delete<BFIndex<float>>::operator()' requested here
            __ptr_.second()(__tmp);
            ^
      /Users/acf/opt/anaconda3/bin/../include/c++/v1/memory:1612:19: note: in instantiation of member function 'std::unique_ptr<BFIndex<float>>::reset' requested here
        ~unique_ptr() { reset(); }
                        ^
      /private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include/pybind11/pybind11.h:1872:40: note: in instantiation of member function 'std::unique_ptr<BFIndex<float>>::~unique_ptr' requested here
                  v_h.holder<holder_type>().~holder_type();
                                             ^
      /private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include/pybind11/pybind11.h:1535:26: note: in instantiation of member function 'pybind11::class_<BFIndex<float>>::dealloc' requested here
              record.dealloc = dealloc;
                               ^
      ./python_bindings/bindings.cpp:957:9: note: in instantiation of function template specialization 'pybind11::class_<BFIndex<float>>::class_<>' requested here
              py::class_<BFIndex<float>>(m, "BFIndex")
              ^
      In file included from ./python_bindings/bindings.cpp:6:
      In file included from ./hnswlib/hnswlib.h:198:
      ./hnswlib/bruteforce.h:113:27: warning: comparison of integers of different signs: 'int' and 'const size_t' (aka 'const unsigned long') [-Wsign-compare]
              for (int i = k; i < cur_element_count; i++) {
                              ~ ^ ~~~~~~~~~~~~~~~~~
      ./python_bindings/bindings.cpp:853:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:960:44: note: in instantiation of member function 'BFIndex<float>::knnQuery_return_numpy' requested here
              .def("knn_query", &BFIndex<float>::knnQuery_return_numpy, py::arg("data"), py::arg("k") = 1, py::arg("filter") = py::none())
                                                 ^
      ./python_bindings/bindings.cpp:856:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:778:22: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
              if (features != dim)
                  ~~~~~~~~ ^  ~~~
      ./python_bindings/bindings.cpp:961:44: note: in instantiation of member function 'BFIndex<float>::addItems' requested here
              .def("add_items", &BFIndex<float>::addItems, py::arg("data"), py::arg("ids") = py::none())
                                                 ^
      In file included from ./python_bindings/bindings.cpp:6:
      ./hnswlib/hnswlib.h:80:13: warning: unused function 'AVX512Capable' [-Wunused-function]
      static bool AVX512Capable() {
                  ^
      34 warnings generated.
      creating build/lib.macosx-13-arm64-cpython-311
      x86_64-apple-darwin13.4.0-clang++ -bundle -undefined dynamic_lookup -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -Wl,-pie -Wl,-headerpad_max_install_names -Wl,-dead_strip_dylibs -Wl,-rpath,/Users/acf/opt/anaconda3/lib -L/Users/acf/opt/anaconda3/lib -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -isystem /Users/acf/opt/anaconda3/include -D_FORTIFY_SOURCE=2 -isystem /Users/acf/opt/anaconda3/include build/temp.macosx-13-arm64-cpython-311/./python_bindings/bindings.o -o build/lib.macosx-13-arm64-cpython-311/hnswlib.cpython-311-darwin.so -stdlib=libc++ -mmacosx-version-min=10.7
      ld: warning: -pie being ignored. It is only used when linking a main executable
      ld: unsupported tapi file type '!tapi-tbd' in YAML file '/Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/lib/libSystem.tbd' for architecture x86_64
      clang-12: error: linker command failed with exit code 1 (use -v to see invocation)
      error: command '/Users/acf/opt/anaconda3/bin/x86_64-apple-darwin13.4.0-clang++' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for hnswlib
Failed to build hnswlib
ERROR: Could not build wheels for hnswlib, which is required to install pyproject.toml-based projects

Add Huggingface embeddings

I would appreciate it if you added Hugging Face embeddings, because they would be free to use, in contrast to OpenAI's embeddings, which I believe use the ada model. Something along these lines would be great:

```python
from langchain.embeddings import HuggingFaceEmbeddings

embeddings_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
```

Although I must admit that I do not know the difference between OpenAI's model and this one when it comes to embeddings. If anyone knows, please let me know what those differences are.

Add new format: SQL database

Specifically, I'm working with Snowflake, but I would love to be able to select a table, or a set of tables, as a data source from my data warehouse.
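As a sketch of what such a loader might look like (this is an assumption about the design, not embedchain's actual API): each row of a table becomes a text chunk that can then be embedded. Snowflake would use its own connector, but the shape of the loader is the same; `sqlite3` keeps the example runnable.

```python
import sqlite3

def table_to_chunks(conn, table):
    # Turn every row of `table` into a "col=value, ..." text chunk
    # suitable for embedding.
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    return [", ".join(f"{c}={v}" for c, v in zip(cols, row)) for row in cur]

# Demo with an in-memory database standing in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
chunks = table_to_chunks(conn, "users")
```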

Insert Local File instead of link

How do I train the model with my local files? Suppose I have a PDF in the root directory and I want to add it like `mygpt.add("pdf_file", "book.pdf")`. Is that possible?
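Under the hood, a local loader only needs to read the file and split it into chunks before embedding. A minimal sketch with a plain-text file (a PDF loader would swap in a PDF parser such as pypdf; the chunk size and function name here are illustrative):

```python
import os
import tempfile

def load_local_text(path: str, chunk_size: int = 500) -> list[str]:
    # Read a local file and split it into fixed-size chunks for embedding.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Demo: a throwaway local file standing in for "book.pdf".
path = os.path.join(tempfile.mkdtemp(), "book.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("x" * 1200)
chunks = load_local_text(path)
```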

Feature Request - Integrate Azure's OpenAI API as an Option

Currently, embedchain is designed to use OpenAI's API for creating embeddings and leveraging the power of GPT-3 for generating answers in the context of chatbots. This feature request proposes to include the option of using Azure's OpenAI API as an alternative.

Azure, a comprehensive suite of cloud services offered by Microsoft, also provides an implementation of OpenAI API. Integration with Azure's OpenAI API would provide a choice to the users to select between OpenAI's original API and Azure's version based on their specific requirements and preferences.
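For reference, a minimal sketch of what the switch looks like with the pre-1.0 `openai` Python SDK; the endpoint, deployment name, and API version below are placeholders, not working values:

```python
import openai

# Azure's OpenAI service uses the same SDK but different connection
# settings, and addresses models by deployment name rather than model name.
openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"
openai.api_version = "2023-05-15"  # placeholder API version
openai.api_key = "<AZURE_OPENAI_KEY>"

response = openai.ChatCompletion.create(
    engine="<deployment-name>",  # instead of model="gpt-3.5-turbo"
    messages=[{"role": "user", "content": "Hello"}],
)
```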

Reset the database

  • It would also be nice if there were a method to reset the database. I don't know much about Chroma, but I'm sure you can just delete the db folder.
  • This issue is opened on behalf of Discord user cachho, message link
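A minimal sketch of such a reset, assuming the Chroma data lives in a local directory (the "db" folder mentioned above is an assumption about where embedchain persists it): deleting that directory wipes all stored embeddings.

```python
import os
import shutil
import tempfile

def reset_db(db_dir: str) -> None:
    # Remove the vector store's on-disk directory, if it exists.
    if os.path.isdir(db_dir):
        shutil.rmtree(db_dir)

db_dir = tempfile.mkdtemp()  # stand-in for the real db folder
reset_db(db_dir)
```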

Add support for caching

How does the framework handle caching? Does it embed everything again and add it to the database each time you run the script, or does it know that a given data source is already embedded and in the database, and therefore avoid incurring that expense?

Note: This issue is opened on behalf of Discord user bodech, message link
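One common way to get this behavior, as a sketch (assuming the vector DB can be asked which source ids it already holds): hash the source content to a stable id and skip the embedding step when the id is already present.

```python
import hashlib

def source_id(content: str) -> str:
    # Deterministic id: the same source content always hashes to the
    # same id, so re-runs of the script can detect it.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

stored = set()  # stand-in for ids already present in the vector DB

def add_source(content: str) -> bool:
    sid = source_id(content)
    if sid in stored:
        return False  # already embedded; skip the re-embedding expense
    stored.add(sid)   # embedding + DB insert would happen here
    return True
```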

Feature Request: Parameters and OpenAI model

Parameters to specify OpenAI model and settings.

For example, I'm subclassing App and overriding the model this way for testing:

```python
import openai

def get_openai_answer(self, prompt):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model="gpt-4-0613",
        messages=messages,
        temperature=0.25,
        max_tokens=1000,
        top_p=1,
    )
    return response["choices"][0]["message"]["content"]
```

It would be awesome to have a few parameters when querying for temperature, max_tokens, and top_p as well. Or should they be set globally / in the environment? I'm not sure what's best, but I'm happy to create a PR.
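One possible shape for this, purely as a sketch: a small config object passed at query time, with defaults matching OpenAI's. The class and field names here are illustrative, not embedchain's actual API.

```python
from dataclasses import dataclass

@dataclass
class QueryConfig:
    # Defaults a caller can override per query.
    model: str = "gpt-3.5-turbo"
    temperature: float = 0.25
    max_tokens: int = 1000
    top_p: float = 1.0

# Override only what you need; everything else keeps its default.
config = QueryConfig(model="gpt-4-0613", temperature=0.1)
```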
