
embedchain's Introduction

Embedchain Logo

PyPI Downloads Slack Discord Twitter Open in Colab codecov


What is Embedchain?

Embedchain is an open-source framework for personalizing LLM responses. It makes it easy to create and deploy personalized AI apps. At its core, Embedchain follows the design principle of being "Conventional but Configurable" to serve both software engineers and machine learning engineers.

Embedchain streamlines the creation of personalized LLM applications, offering a seamless process for managing various types of unstructured data. It efficiently segments data into manageable chunks, generates relevant embeddings, and stores them in a vector database for optimized retrieval. With a suite of diverse APIs, it enables users to extract contextual information, find precise answers, or engage in interactive chat conversations, all tailored to their own data.
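The pipeline described above (segment into chunks, generate embeddings, store in a vector database, retrieve at query time) can be sketched in miniature. Everything below is a simplified stand-in: the bag-of-words "embedding" and the `ToyVectorStore` class are illustrative, not Embedchain's actual implementation.

```python
import math
from collections import Counter

def chunk(text, size=200):
    """Split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb or 1.0)

class ToyVectorStore:
    """Minimal stand-in for the vector database."""
    def __init__(self):
        self.rows = []  # (embedding, chunk) pairs

    def add(self, text):
        for c in chunk(text):
            self.rows.append((embed(c), c))

    def query(self, question, n_results=1):
        q = embed(question)
        ranked = sorted(self.rows, key=lambda r: -cosine(q, r[0]))
        return [c for _, c in ranked[:n_results]]

store = ToyVectorStore()
store.add("Elon Musk is the CEO of SpaceX.")
store.add("Paris is the capital of France.")
print(store.query("Who runs SpaceX?")[0])  # most relevant chunk
```

A real deployment swaps the toy embedding for a model (e.g. an OpenAI embedding endpoint) and the list scan for an indexed vector store; the control flow stays the same.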

🔧 Quick install

Python API

pip install embedchain

✨ Live demo

Check out the Chat with PDF live demo we created using Embedchain. You can find the source code here.

๐Ÿ” Usage

Embedchain Demo

For example, you can create an Elon Musk bot using the following code:

import os
from embedchain import App

# Create a bot instance
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"
app = App()

# Embed online resources
app.add("https://en.wikipedia.org/wiki/Elon_Musk")
app.add("https://www.forbes.com/profile/elon-musk")

# Query the app
app.query("How many companies does Elon Musk run and name those?")
# Answer: Elon Musk currently runs several companies. As of my knowledge, he is the CEO and lead designer of SpaceX, the CEO and product architect of Tesla, Inc., the CEO and founder of Neuralink, and the CEO and founder of The Boring Company. However, please note that this information may change over time, so it's always good to verify the latest updates.

You can also try it in your browser with Google Colab:

Open in Colab

📖 Documentation

Comprehensive guides and API documentation are available to help you get the most out of Embedchain.

🔗 Join the Community

๐Ÿค Schedule a 1-on-1 Session

Book a 1-on-1 session with the founders to discuss any issues, provide feedback, or explore how we can improve Embedchain for you.

๐ŸŒ Contributing

Contributions are welcome! Please check out the issues on the repository, and feel free to open a pull request. For more information, please see the contributing guidelines.

For more reference, please go through the Development Guide and the Documentation Guide.

Anonymous Telemetry

We collect anonymous usage metrics to enhance our package's quality and user experience. This includes data like feature usage frequency and system info, but never personal details. The data helps us prioritize improvements and ensure compatibility. If you wish to opt-out, set the environment variable EC_TELEMETRY=false. We prioritize data security and don't share this data externally.
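For example, the opt-out can be set in code before the library is imported (only the `EC_TELEMETRY` variable name comes from the text above; the rest is a minimal sketch):

```python
import os

# Opt out of Embedchain's anonymous telemetry. Environment variables are
# typically read at import time, so set this before importing embedchain.
os.environ["EC_TELEMETRY"] = "false"

print(os.environ["EC_TELEMETRY"])  # false
```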

Citation

If you utilize this repository, please consider citing it with:

@misc{embedchain,
  author = {Taranjeet Singh and Deshraj Yadav},
  title = {Embedchain: The Open Source RAG Framework},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/embedchain/embedchain}},
}

embedchain's People

Contributors

aaishikdutta, ahnedeee, aryankhanna475, cachho, cclauss, deshraj, dev-khant, deven298, dhravya, dtee1, eltociear, gasolin, ianupamsingh, jonasiwnl, juananpe, kapilm26, maccuryj, pc9, prikshit7766, rayhanpatel, rishiraj2594, rupeshbansal, sahilyadav902, sandrasgg, sersamgy, sidmohanty11, subhajit20, sukkritsharmaofficial, sw8fbar, taranjeet


embedchain's Issues

Non-feature request - Modularize the application

The Embedchain class has a lot of methods, and abstracting it a bit would add value in terms of code readability. There are many open issues about integrating multiple LLMs, vector DBs, or embeddings. While I see a level of abstraction in the vector db folder that can be leveraged for further integration options, I believe we should do something similar for the methods where we use the embedding models and the LLM model. I have raised a PR for this, #92, which abstracts the data formats for loaders and chunkers. @taranjeet @cachho please let me know if this is something we can add, so we can have further discussions on how to structure the more critical pieces like the embedding models and chat completions.

ImportError: cannot import name 'App' from partially initialized module 'embedchain' (most likely due to a circular import)

I encountered a strange problem: my Python code consists of only one file, and when the name of this file is the same as the name of the library it imports (embedchain.py), an error is reported: ImportError: cannot import name 'App' from partially initialized module 'embedchain' (most likely due to a circular import).

So, just rename the file to another name and it will be fixed.

openai.error.ServiceUnavailableError: The server is overloaded or not ready yet.

My code:

import os
from keys import *
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

from embedchain import App

naval_chat_bot = App()

naval_chat_bot.add("web_page", "https://psymplicity.com/")

print(naval_chat_bot.query("what is the three-step approach to private mental health care"))

The Error:

Unable to connect optimized C data functions [No module named '_testbuffer'], falling back to pure Python
All data from https://psymplicity.com/ already exists in the database.
Traceback (most recent call last):
  File "c:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\code\flask_app_2\embedchain_test.py", line 21, in <module>
    print(naval_chat_bot.query("what is the three-step approach to private mental health care"))
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\embedchain\embedchain.py", line 225, in query
    answer = self.get_answer_from_llm(prompt)
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\embedchain\embedchain.py", line 211, in get_answer_from_llm
    answer = self.get_openai_answer(prompt)
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\embedchain\embedchain.py", line 162, in get_openai_answer
    response = openai.ChatCompletion.create(
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\openai\api_resources\chat_completion.py", line 25, in create
    return super().create(*args, **kwargs)
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\openai\api_resources\abstract\engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\openai\api_requestor.py", line 298, in request
    resp, got_stream = self._interpret_response(result, stream)
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\openai\api_requestor.py", line 700, in _interpret_response
    self._interpret_response_line(
  File "C:\Users\moshe\OneDrive - University College London\Code\gpt-autopilot\venv\lib\site-packages\openai\api_requestor.py", line 743, in _interpret_response_line
    raise error.ServiceUnavailableError(
openai.error.ServiceUnavailableError: The server is overloaded or not ready yet.

Issue on TypeVar

When trying to run the sample code I get this:
ImportError: cannot import name 'TypeVar' from 'typing_extensions' (/databricks/python/lib/python3.10/site-packages/typing_extensions.py)

I am running this in a Databricks notebook.

Project Tools

Setup Following Project Management Tools

  1. Project package and environment manager: Poetry is recommended
  2. pytest and pylint setup
  3. Contributing guide
  4. Sphinx documentation, deployed on the Read the Docs server
  5. Docstrings for the API: Google style is recommended
  6. CI/CD workflows

I can help with the above.

Using GPT-4 for prompting

Hi there, I see that the framework is using GPT-3.5 in the last release for prompting.
How can I change to GPT-4?

My Best for this project !
Regards
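The model name appears hard-coded inside `get_openai_answer` (its source is quoted in a later issue on this page), so one workaround is to override that method, or just the model name, in a subclass. Below is a minimal stand-in sketch: `App` and `get_openai_answer` mirror the library's names, but the bodies here are simplified stubs, not the real implementation.

```python
class App:
    """Simplified stand-in for embedchain's App class."""
    model = "gpt-3.5-turbo-0613"  # the model name the library hard-codes

    def get_openai_answer(self, prompt):
        return self._chat_completion(self.model, prompt)

    def _chat_completion(self, model, prompt):
        # Stub standing in for openai.ChatCompletion.create(...)
        return f"[{model}] {prompt}"

class GPT4App(App):
    model = "gpt-4"  # override just the model name

print(GPT4App().get_openai_answer("hello"))  # [gpt-4] hello
```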

Fine tune tone for the answer

  • Wondering if it is possible to fine-tune the tone the AI uses when replying to me? For example, if I provided the dialogue of Sherlock Holmes, could it reply in the tone that Sherlock talks in? Ty!
  • This issue is opened on behalf of twitter user ring_hyacinth, tweet

epub format

Please add epub as one of the supported formats.

Add new format - sitemap

Hi @taranjeet I was working on my mini project to chat over a small-sized blog, and I found myself writing some code to iterate over the sitemap of the website. I think it would be valuable if we could provide format support for a sitemap to automate multiple web page loading and chunking. Do you already have an issue tracking that, or is it something that can be added?
Right now I am doing something like this:

import requests
from bs4 import BeautifulSoup

# Download the sitemap.xml file from a website and extract all the links
def get_links(url):
    url = f'{url}/sitemap.xml'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        links = [link.text for link in soup.find_all('loc')]
        return links
    else:
        print(f'Error: {response.status_code}')
        return None
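For reference, the `<loc>` extraction can also be done with only the standard library, avoiding the requests/BeautifulSoup dependencies. A sketch, assuming the sitemap XML has already been downloaded:

```python
import xml.etree.ElementTree as ET

def parse_sitemap(xml_text):
    # Extract all <loc> entries; tag names carry the sitemap XML
    # namespace as a prefix, so match on the local-name suffix.
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if el.tag.endswith("loc")]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

print(parse_sitemap(sitemap))  # ['https://example.com/a', 'https://example.com/b']
```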

Add support to load codebase

  • Thanks, such a handy repo! Loving the user-friendly API. Can't wait to see it support a whole codebase (just like other types of documents) in the future :)
  • Opened on behalf of twitter user ericman65204539, tweet

Add meta data

  • Is there a way to add more metadata on each document? Something like a document ID, and get it back in the response?
  • opened on behalf of discord user ikinnrot, message link

Add tests

  • need to setup tests so that contributing to the repo becomes easier and faster

feature request: Add New Format "Image"

Embedchain will parse uploaded images, extract text information and embed.
Ex, Screenshot of a book chapter.

The parser package should be configurable, the default should be opensource.
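A configurable-parser design could look like the following sketch. `register_image_parser` and `extract_text` are hypothetical names, not embedchain API, and the default parser is a stub where an open-source OCR engine (e.g. Tesseract) would plug in.

```python
# Registry of pluggable image-to-text parsers (hypothetical design).
_image_parsers = {}

def register_image_parser(name, fn):
    """Register a parser: a callable taking image bytes, returning text."""
    _image_parsers[name] = fn

def extract_text(image_bytes, parser="default"):
    """Run the chosen parser over the raw image bytes."""
    return _image_parsers[parser](image_bytes)

# Default: a stub that a real open-source OCR function would replace.
register_image_parser("default", lambda img: "<no OCR engine configured>")

# Plugging in a custom parser (toy example: just uppercase the bytes).
register_image_parser("upper", lambda img: img.decode().upper())
print(extract_text(b"chapter one", parser="upper"))  # CHAPTER ONE
```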

Issue with get_openai_answer

max_tokens parameter being set to 1000 is an issue. With having multiple sources (with long urls) and larger webpages, this is quickly eaten up. When the token amount is exceeded no warning is given except from openAI.

openai.error.RateLimitError: The server had an error while processing your request. Sorry about that!

def get_openai_answer(self, prompt):
    messages = []
    messages.append({
        "role": "user", "content": prompt
    })
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=messages,
        temperature=0,
        max_tokens=1000,
        top_p=1,
    )
    return response["choices"][0]["message"]["content"]
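One way to avoid silently blowing past the budget is a pre-flight cut of the context before it reaches the API. A rough sketch using the common ~4-characters-per-token heuristic for English text; a real implementation would count tokens exactly with a tokenizer (e.g. tiktoken), and `truncate_context` is a hypothetical helper:

```python
def truncate_context(context, max_tokens=1000, chars_per_token=4):
    """Cut the context to a rough character budget derived from the
    token limit (~4 chars/token is a common English-text heuristic)."""
    budget = max_tokens * chars_per_token
    if len(context) <= budget:
        return context
    return context[:budget]

long_context = "word " * 2000               # 10,000 characters
print(len(truncate_context(long_context)))  # 4000
```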

What are the implications of allowing more documents as context?

Let's talk about this method:

def query(self, input_query):
        """
        Queries the vector database based on the given input query.
        Gets relevant doc based on the query and then passes it to an
        LLM as context to get the answer.

        :param input_query: The query to use.
        :return: The answer to the query.
        """
        result = self.collection.query(
            query_texts=[input_query,],
            n_results=1,
        )
        result_formatted = self._format_result(result)
        answer = self.get_answer_from_llm(input_query, result_formatted[0][0].page_content)
        return answer

As far as I can tell (and I'm just reading, not necessarily understanding, correct me if I'm wrong), it will return the single closest document: n_results=1.

What if we have a more granular database, cut into smaller pieces?

E.g. the webpages and documents we added are only a paragraph long. Then it will only return that one paragraph. So let's keep imagining that a user asks a complex question for which the correct answer is stored in more than one document. Then it would only answer part of the question with limited knowledge.

Here's a simple example. Let's say we are in the car business and feed our database information about the Corvette, one page for each generation. Then a user asks "how much horsepower does the current Corvette make and how much did the first one make?". If my understanding is correct, it could not answer that question (for this specific question, ChatGPT knows the answer out of the box, but you get the point).

For these kinds of use cases I'm proposing to allow the retrieval of more than one document, configurable by the user. 1 can stay as the default. These are then all passed as context so an LLM can do its magic and process the information.

The downside I can see is that it will require more tokens, and thus cost more. This is a compromise the user has to make for better results. The max token limit should also be considered, especially in cases where the database contains both short and long text; for this edge case, max tokens should be configurable by the user, and if a limit is set, the tokens of the prompt should be counted and cut off if necessary. Edit: OpenAI has a max tokens parameter that does all of this.

P.S. Why are we prompting with prompt = f"""Use the following pieces of context to answer the query at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. {context} if we just use one piece of context?

I will propose a PR for this.
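The proposal can be sketched as a context builder that takes the top-n retrieved documents and joins them under a budget before prompting. `build_context` is a hypothetical helper, and the Corvette snippets are illustrative data:

```python
def build_context(documents, n_results=1, max_chars=4000):
    """Join up to n_results retrieved documents into one context string,
    stopping once a rough character budget would be exceeded."""
    picked, used = [], 0
    for doc in documents[:n_results]:
        if used + len(doc) > max_chars:
            break
        picked.append(doc)
        used += len(doc)
    return "\n\n".join(picked)

# Documents as ranked by the vector store, closest first.
docs = ["The current Corvette makes 495 hp.", "The first Corvette made 150 hp."]
print(build_context(docs, n_results=2))
```

With n_results=1 this reduces to the current behavior; with a higher value the LLM sees enough context to answer multi-document questions like the Corvette example above.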

[BUG] Chroma DB Duplicate ID Error

This is my code:

import os
os.environ["OPENAI_API_KEY"] = "sk-???"
from embedchain import App
naval_chat_bot = App()
naval_chat_bot.add_local("pdf_file", "docs/masnavi-en.pdf")
print(naval_chat_bot.query("Who is the most powerful man?"))

I get chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates for. Where is the problem?

P.S: This was my second attempt. The first one with a different pdf document was successful.
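A common fix for this class of error is deriving chunk IDs deterministically from the content and deduplicating before inserting, so re-adding the same chunk maps to the same ID instead of raising. `chunk_id` and `dedupe` below are hypothetical helpers, not embedchain functions:

```python
import hashlib

def chunk_id(text, source):
    """Deterministic ID from the chunk text plus its source document."""
    return hashlib.sha256(f"{source}:{text}".encode()).hexdigest()

def dedupe(chunks, source):
    """Drop chunks whose ID was already seen, keeping first occurrences."""
    seen, out = set(), []
    for c in chunks:
        cid = chunk_id(c, source)
        if cid not in seen:
            seen.add(cid)
            out.append((cid, c))
    return out

pages = ["same paragraph", "same paragraph", "other paragraph"]
print(len(dedupe(pages, "docs/masnavi-en.pdf")))  # 2
```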

Feature Request - Add DataFrames (Spark or Pandas) as Sources

Currently, embedchain allows the addition of various types of data sources such as YouTube videos, PDF files, and web pages to be processed and used in the application. This feature request proposes to extend this functionality to include DataFrames, specifically those from the Spark or Pandas libraries, as potential data sources.

DataFrames are a commonly used data structure for handling and manipulating data in Python, especially in data science and machine learning applications. They are particularly effective when dealing with large, structured datasets, which can include text data.

The ability to use DataFrames as a source of data would add a significant amount of flexibility to embedchain, as users could directly input their preprocessed and transformed data into the application. This could be beneficial in scenarios where the data is already available in a DataFrame format, such as when it has been preprocessed or transformed as part of a larger data pipeline.

The implementation of this feature would involve adding a new method to the App class (or modifying the existing .add() method) that accepts a DataFrame and its format (Spark or Pandas) as arguments. The method would then handle the loading of the data from the DataFrame into the application in the appropriate format, ready to be processed and used in the application.

This feature would increase the flexibility and usefulness of embedchain, making it more applicable to a wider range of scenarios and use-cases, and potentially attracting a broader user base. It would also align well with common data science workflows, which often involve the use of DataFrames for data manipulation and analysis.

Please consider adding this feature in a future update of embedchain.
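A loader along these lines could accept rows-of-dicts, which is the shape pandas' `DataFrame.to_dict("records")` produces, keeping this sketch free of a pandas dependency. `rows_to_documents` is a hypothetical name, not proposed API:

```python
def rows_to_documents(rows, id_column=None):
    """Turn DataFrame-like rows (list of dicts) into (id, text) documents
    ready for chunking and embedding."""
    docs = []
    for i, row in enumerate(rows):
        doc_id = str(row.get(id_column, i)) if id_column else str(i)
        text = "\n".join(f"{k}: {v}" for k, v in row.items())
        docs.append((doc_id, text))
    return docs

rows = [
    {"id": "a1", "title": "Q1 report", "body": "Revenue grew 10%."},
    {"id": "a2", "title": "Q2 report", "body": "Revenue grew 12%."},
]
for doc_id, text in rows_to_documents(rows, id_column="id"):
    print(doc_id, "->", text.splitlines()[1])
```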

[Feature Request] Auto-detect data type, make it optional

First off... Great job!!! Simple and tight code. Much appreciate you making/sharing it.

There was one quick suggestion I had: in order to minimize boilerplate code, it would be good to modify the interface to make the file_type argument optional and detect it based on the input content. If the argument is provided, the code would check the file to ensure that it is of the specified type.

This ease-of-life modification should be added early in development to minimize more extensive refactors down the line.

But I wholly understand if you have a different design goal for making this a required input.
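Detection could key off the URL scheme and file extension. A sketch: the returned strings "web_page"/"pdf_file" mirror type names used elsewhere on this page, while `detect_data_type` and the fallback "text" are hypothetical:

```python
from urllib.parse import urlparse

def detect_data_type(source):
    """Heuristic data-type detection from the source string."""
    path = urlparse(source).path.lower()
    if path.endswith(".pdf"):
        return "pdf_file"
    if source.startswith(("http://", "https://")) or path.endswith((".html", ".htm")):
        return "web_page"
    return "text"

print(detect_data_type("https://en.wikipedia.org/wiki/Elon_Musk"))  # web_page
print(detect_data_type("docs/masnavi-en.pdf"))                      # pdf_file
```

Content sniffing (e.g. checking for the %PDF magic bytes) would make this robust for sources whose name carries no extension.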

Not installing

Trying to install using pip3 and it returns this error:

Building wheels for collected packages: hnswlib
  Building wheel for hnswlib (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for hnswlib (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [199 lines of output]
      running bdist_wheel
      running build
      running build_ext
      creating var
      creating var/folders
      creating var/folders/8c
      creating var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn
      creating var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T
      x86_64-apple-darwin13.4.0-clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -isystem /Users/acf/opt/anaconda3/include -D_FORTIFY_SOURCE=2 -isystem /Users/acf/opt/anaconda3/include -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c /var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/tmp4e6jgsj0.cpp -o var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/tmp4e6jgsj0.o -std=c++14
      x86_64-apple-darwin13.4.0-clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -isystem /Users/acf/opt/anaconda3/include -D_FORTIFY_SOURCE=2 -isystem /Users/acf/opt/anaconda3/include -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c /var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/tmpsl27hkck.cpp -o var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/tmpsl27hkck.o -fvisibility=hidden
      building 'hnswlib' extension
      creating build
      creating build/temp.macosx-13-arm64-cpython-311
      creating build/temp.macosx-13-arm64-cpython-311/python_bindings
      x86_64-apple-darwin13.4.0-clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -isystem /Users/acf/opt/anaconda3/include -D_FORTIFY_SOURCE=2 -isystem /Users/acf/opt/anaconda3/include -I/private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include -I/opt/homebrew/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include -I./hnswlib/ -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c ./python_bindings/bindings.cpp -o build/temp.macosx-13-arm64-cpython-311/./python_bindings/bindings.o -O3 -stdlib=libc++ -mmacosx-version-min=10.7 -DVERSION_INFO=\"0.7.0\" -std=c++14 -fvisibility=hidden
      In file included from ./python_bindings/bindings.cpp:6:
      In file included from ./hnswlib/hnswlib.h:199:
      ./hnswlib/hnswalg.h:755:27: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
              for (int i = 0; i < dim; i++) {
                              ~ ^ ~~~
      ./python_bindings/bindings.cpp:102:13: warning: format specifies type 'int' but the argument has type 'pybind11::ssize_t' (aka 'long') [-Wformat]
                  buffer.ndim);
                  ^~~~~~~~~~~
      ./python_bindings/bindings.cpp:126:17: warning: format specifies type 'int' but the argument has type 'pybind11::ssize_t' (aka 'long') [-Wformat]
                      ids_numpy.ndim, feature_rows);
                      ^~~~~~~~~~~~~~
      ./python_bindings/bindings.cpp:126:33: warning: format specifies type 'int' but the argument has type 'size_t' (aka 'unsigned long') [-Wformat]
                      ids_numpy.ndim, feature_rows);
                                      ^~~~~~~~~~~~
      ./python_bindings/bindings.cpp:121:58: warning: comparison of integers of different signs: 'std::__vector_base<long, std::allocator<long>>::value_type' (aka 'long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
              if (!((ids_numpy.ndim == 1 && ids_numpy.shape[0] == feature_rows) ||
                                            ~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
      ./python_bindings/bindings.cpp:383:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:386:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:389:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:392:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:395:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:647:28: warning: unused variable 'data' [-Wunused-variable]
                          float* data = (float*)items.data(row);
                                 ^
      ./python_bindings/bindings.cpp:667:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:670:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:853:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:856:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:876:1: warning: 'pybind11_init' is deprecated: PYBIND11_PLUGIN is deprecated, use PYBIND11_MODULE [-Wdeprecated-declarations]
      PYBIND11_PLUGIN(hnswlib) {
      ^
      /private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include/pybind11/detail/common.h:432:20: note: expanded from macro 'PYBIND11_PLUGIN'
                  return pybind11_init();                                                               \
                         ^
      ./python_bindings/bindings.cpp:876:1: note: 'pybind11_init' has been explicitly marked deprecated here
      /private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include/pybind11/detail/common.h:426:5: note: expanded from macro 'PYBIND11_PLUGIN'
          PYBIND11_DEPRECATED("PYBIND11_PLUGIN is deprecated, use PYBIND11_MODULE")                     \
          ^
      /private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include/pybind11/detail/common.h:194:43: note: expanded from macro 'PYBIND11_DEPRECATED'
      #    define PYBIND11_DEPRECATED(reason) [[deprecated(reason)]]
                                                ^
      In file included from ./python_bindings/bindings.cpp:6:
      In file included from ./hnswlib/hnswlib.h:199:
      ./hnswlib/hnswalg.h:95:11: warning: field 'link_list_locks_' will be initialized after field 'label_op_locks_' [-Wreorder-ctor]
              : link_list_locks_(max_elements),
                ^
      ./python_bindings/bindings.cpp:488:39: note: in instantiation of member function 'hnswlib::HierarchicalNSW<float>::HierarchicalNSW' requested here
                  new_index->appr_alg = new hnswlib::HierarchicalNSW<dist_t>(
                                            ^
      ./python_bindings/bindings.cpp:880:38: note: in instantiation of member function 'Index<float>::createFromParams' requested here
              .def(py::init(&Index<float>::createFromParams), py::arg("params"))
                                           ^
      ./python_bindings/bindings.cpp:667:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:892:28: note: in instantiation of member function 'Index<float>::knnQuery_return_numpy' requested here
                  &Index<float>::knnQuery_return_numpy,
                                 ^
      ./python_bindings/bindings.cpp:670:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:619:22: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
                  if (rows <= num_threads * 4) {
                      ~~~~ ^  ~~~~~~~~~~~~~~~
      ./python_bindings/bindings.cpp:257:22: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
              if (features != dim)
                  ~~~~~~~~ ^  ~~~
      ./python_bindings/bindings.cpp:898:28: note: in instantiation of member function 'Index<float>::addItems' requested here
                  &Index<float>::addItems,
                                 ^
      ./python_bindings/bindings.cpp:261:18: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
              if (rows <= num_threads * 4) {
                  ~~~~ ^  ~~~~~~~~~~~~~~~
      In file included from ./python_bindings/bindings.cpp:6:
      In file included from ./hnswlib/hnswlib.h:199:
      ./hnswlib/hnswalg.h:755:27: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
              for (int i = 0; i < dim; i++) {
                              ~ ^ ~~~
      ./python_bindings/bindings.cpp:323:47: note: in instantiation of function template specialization 'hnswlib::HierarchicalNSW<float>::getDataByLabel<float>' requested here
                  data.push_back(appr_alg->template getDataByLabel<data_t>(id));
                                                    ^
      ./python_bindings/bindings.cpp:903:49: note: in instantiation of member function 'Index<float>::getDataReturnList' requested here
              .def("get_items", &Index<float, float>::getDataReturnList, py::arg("ids") = py::none())
                                                      ^
      ./python_bindings/bindings.cpp:383:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:467:27: note: in instantiation of member function 'Index<float>::getAnnData' requested here
              auto ann_params = getAnnData();
                                ^
      ./python_bindings/bindings.cpp:945:43: note: in instantiation of member function 'Index<float>::getIndexParams' requested here
                      return py::make_tuple(ind.getIndexParams()); /* Return dict (wrapped in a tuple) that fully encodes state of the Index object */
                                                ^
      ./python_bindings/bindings.cpp:386:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:389:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:392:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:395:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      In file included from ./python_bindings/bindings.cpp:6:
      In file included from ./hnswlib/hnswlib.h:198:
      ./hnswlib/bruteforce.h:105:27: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
              for (int i = 0; i < k; i++) {
                              ~ ^ ~
      ./hnswlib/bruteforce.h:59:5: note: in instantiation of member function 'hnswlib::BruteforceSearch<float>::searchKnn' requested here
          ~BruteforceSearch() {
          ^
      ./python_bindings/bindings.cpp:748:13: note: in instantiation of member function 'hnswlib::BruteforceSearch<float>::~BruteforceSearch' requested here
                  delete alg;
                  ^
      /Users/acf/opt/anaconda3/bin/../include/c++/v1/memory:1397:5: note: in instantiation of member function 'BFIndex<float>::~BFIndex' requested here
          delete __ptr;
          ^
      /Users/acf/opt/anaconda3/bin/../include/c++/v1/memory:1658:7: note: in instantiation of member function 'std::default_delete<BFIndex<float>>::operator()' requested here
            __ptr_.second()(__tmp);
            ^
      /Users/acf/opt/anaconda3/bin/../include/c++/v1/memory:1612:19: note: in instantiation of member function 'std::unique_ptr<BFIndex<float>>::reset' requested here
        ~unique_ptr() { reset(); }
                        ^
      /private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include/pybind11/pybind11.h:1872:40: note: in instantiation of member function 'std::unique_ptr<BFIndex<float>>::~unique_ptr' requested here
                  v_h.holder<holder_type>().~holder_type();
                                             ^
      /private/var/folders/8c/dnq_8d0j6b10xklrxyqdt1fh0000gn/T/pip-build-env-8s3c61cb/overlay/lib/python3.11/site-packages/pybind11/include/pybind11/pybind11.h:1535:26: note: in instantiation of member function 'pybind11::class_<BFIndex<float>>::dealloc' requested here
              record.dealloc = dealloc;
                               ^
      ./python_bindings/bindings.cpp:957:9: note: in instantiation of function template specialization 'pybind11::class_<BFIndex<float>>::class_<>' requested here
              py::class_<BFIndex<float>>(m, "BFIndex")
              ^
      In file included from ./python_bindings/bindings.cpp:6:
      In file included from ./hnswlib/hnswlib.h:198:
      ./hnswlib/bruteforce.h:113:27: warning: comparison of integers of different signs: 'int' and 'const size_t' (aka 'const unsigned long') [-Wsign-compare]
              for (int i = k; i < cur_element_count; i++) {
                              ~ ^ ~~~~~~~~~~~~~~~~~
      ./python_bindings/bindings.cpp:853:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:960:44: note: in instantiation of member function 'BFIndex<float>::knnQuery_return_numpy' requested here
              .def("knn_query", &BFIndex<float>::knnQuery_return_numpy, py::arg("data"), py::arg("k") = 1, py::arg("filter") = py::none())
                                                 ^
      ./python_bindings/bindings.cpp:856:13: warning: cannot delete expression with pointer-to-'void' type 'void *' [-Wdelete-incomplete]
                  delete[] f;
                  ^        ~
      ./python_bindings/bindings.cpp:778:22: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
              if (features != dim)
                  ~~~~~~~~ ^  ~~~
      ./python_bindings/bindings.cpp:961:44: note: in instantiation of member function 'BFIndex<float>::addItems' requested here
              .def("add_items", &BFIndex<float>::addItems, py::arg("data"), py::arg("ids") = py::none())
                                                 ^
      In file included from ./python_bindings/bindings.cpp:6:
      ./hnswlib/hnswlib.h:80:13: warning: unused function 'AVX512Capable' [-Wunused-function]
      static bool AVX512Capable() {
                  ^
      34 warnings generated.
      creating build/lib.macosx-13-arm64-cpython-311
      x86_64-apple-darwin13.4.0-clang++ -bundle -undefined dynamic_lookup -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -Wl,-pie -Wl,-headerpad_max_install_names -Wl,-dead_strip_dylibs -Wl,-rpath,/Users/acf/opt/anaconda3/lib -L/Users/acf/opt/anaconda3/lib -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -isystem /Users/acf/opt/anaconda3/include -D_FORTIFY_SOURCE=2 -isystem /Users/acf/opt/anaconda3/include build/temp.macosx-13-arm64-cpython-311/./python_bindings/bindings.o -o build/lib.macosx-13-arm64-cpython-311/hnswlib.cpython-311-darwin.so -stdlib=libc++ -mmacosx-version-min=10.7
      ld: warning: -pie being ignored. It is only used when linking a main executable
      ld: unsupported tapi file type '!tapi-tbd' in YAML file '/Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/lib/libSystem.tbd' for architecture x86_64
      clang-12: error: linker command failed with exit code 1 (use -v to see invocation)
      error: command '/Users/acf/opt/anaconda3/bin/x86_64-apple-darwin13.4.0-clang++' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for hnswlib
Failed to build hnswlib
ERROR: Could not build wheels for hnswlib, which is required to install pyproject.toml-based projects

Add Huggingface embeddings

I would appreciate it if you added Hugging Face embeddings, because they would be free to use, in contrast to OpenAI's embeddings, which I believe use the ada model. Something along these lines would be great:

```python
from langchain.embeddings import HuggingFaceEmbeddings

embeddings_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
```

Although I must admit that I do not know the difference between OpenAI's model and this one when it comes to embeddings. If anyone knows, please let me know what those differences are.

Add new format: SQL database

Specifically, I'm working with Snowflake, but I would love to be able to select a table, or a set of tables, as a data source from my data warehouse.
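As a sketch of what such a loader might look like (this is an assumption about the design, not embedchain's actual API): each row of a table becomes a text chunk that can then be embedded. Snowflake would use its own connector, but the shape of the loader is the same; `sqlite3` keeps the example runnable.

```python
import sqlite3

def table_to_chunks(conn, table):
    # Turn every row of `table` into a "col=value, ..." text chunk
    # suitable for embedding.
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    return [", ".join(f"{c}={v}" for c, v in zip(cols, row)) for row in cur]

# Demo with an in-memory database standing in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
chunks = table_to_chunks(conn, "users")
```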

Insert Local File instead of link

How do I train the model with my local files? Suppose I have a PDF in the root directory and I want to add it like `mygpt.add("pdf_file", "book.pdf")`. Is that possible?
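Under the hood, a local loader only needs to read the file and split it into chunks before embedding. A minimal sketch with a plain-text file (a PDF loader would swap in a PDF parser such as pypdf; the chunk size and function name here are illustrative):

```python
import os
import tempfile

def load_local_text(path: str, chunk_size: int = 500) -> list[str]:
    # Read a local file and split it into fixed-size chunks for embedding.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Demo: a throwaway local file standing in for "book.pdf".
path = os.path.join(tempfile.mkdtemp(), "book.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("x" * 1200)
chunks = load_local_text(path)
```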

Feature Request - Integrate Azure's OpenAI API as an Option

Currently, embedchain is designed to use OpenAI's API for creating embeddings and leveraging the power of GPT-3 for generating answers in the context of chatbots. This feature request proposes to include the option of using Azure's OpenAI API as an alternative.

Azure, a comprehensive suite of cloud services offered by Microsoft, also provides an implementation of OpenAI API. Integration with Azure's OpenAI API would provide a choice to the users to select between OpenAI's original API and Azure's version based on their specific requirements and preferences.
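For reference, a minimal sketch of what the switch looks like with the pre-1.0 `openai` Python SDK; the endpoint, deployment name, and API version below are placeholders, not working values:

```python
import openai

# Azure's OpenAI service uses the same SDK but different connection
# settings, and addresses models by deployment name rather than model name.
openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"
openai.api_version = "2023-05-15"  # placeholder API version
openai.api_key = "<AZURE_OPENAI_KEY>"

response = openai.ChatCompletion.create(
    engine="<deployment-name>",  # instead of model="gpt-3.5-turbo"
    messages=[{"role": "user", "content": "Hello"}],
)
```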

Reset the database

  • It would also be nice if there were a method to reset the database. I don't know much about Chroma, but I'm sure you can just delete the db folder.
  • This issue is opened on behalf of Discord user cachho, message link
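A minimal sketch of such a reset, assuming the Chroma data lives in a local directory (the "db" folder mentioned above is an assumption about where embedchain persists it): deleting that directory wipes all stored embeddings.

```python
import os
import shutil
import tempfile

def reset_db(db_dir: str) -> None:
    # Remove the vector store's on-disk directory, if it exists.
    if os.path.isdir(db_dir):
        shutil.rmtree(db_dir)

db_dir = tempfile.mkdtemp()  # stand-in for the real db folder
reset_db(db_dir)
```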

Add support for caching

How does the framework handle caching? Does it embed everything again and add it to the database each time you run the script, or does it know that a given data source is already embedded and in the database, and therefore avoid incurring that expense?

Note: This issue is opened on behalf of Discord user bodech, message link
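One common way to get this behavior, as a sketch (assuming the vector DB can be asked which source ids it already holds): hash the source content to a stable id and skip the embedding step when the id is already present.

```python
import hashlib

def source_id(content: str) -> str:
    # Deterministic id: the same source content always hashes to the
    # same id, so re-runs of the script can detect it.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

stored = set()  # stand-in for ids already present in the vector DB

def add_source(content: str) -> bool:
    sid = source_id(content)
    if sid in stored:
        return False  # already embedded; skip the re-embedding expense
    stored.add(sid)   # embedding + DB insert would happen here
    return True
```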

Feature Request: Parameters and OpenAI model

Parameters to specify OpenAI model and settings.

For example, I'm subclassing App and overriding the model this way for testing:

```python
import openai

def get_openai_answer(self, prompt):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model="gpt-4-0613",
        messages=messages,
        temperature=0.25,
        max_tokens=1000,
        top_p=1,
    )
    return response["choices"][0]["message"]["content"]
```

It would be awesome to have a few parameters when querying for temperature, max_tokens, and top_p as well. Or should they be set globally / in the environment? I'm not sure what's best, but I'm happy to create a PR.
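One possible shape for this, purely as a sketch: a small config object passed at query time, with defaults matching OpenAI's. The class and field names here are illustrative, not embedchain's actual API.

```python
from dataclasses import dataclass

@dataclass
class QueryConfig:
    # Defaults a caller can override per query.
    model: str = "gpt-3.5-turbo"
    temperature: float = 0.25
    max_tokens: int = 1000
    top_p: float = 1.0

# Override only what you need; everything else keeps its default.
config = QueryConfig(model="gpt-4-0613", temperature=0.1)
```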
