
llama-hub's Introduction

LlamaHub 🦙

Caution

This repo has since been archived and is read-only. With the launch of LlamaIndex v0.10, we are deprecating this llama_hub repo - all integrations (data loaders, tools) and packs are now in the core llama-index Python repository. LlamaHub will continue to exist. We are revamping llamahub.ai to point to all integrations/packs/datasets available in the llama-index repo.

Original creator: Jesse Zhang (GH: emptycrown, Twitter: @thejessezhang), who courteously donated the repo to LlamaIndex!

👥 Contributing

Interested in contributing? Skip over to our Contribution Section below for more details.

This is a simple library of all the data loaders / readers / tools / llama-packs / llama-datasets that have been created by the community. The goal is to make it extremely easy to connect large language models to a large variety of knowledge sources. These are general-purpose utilities that are meant to be used in LlamaIndex, LangChain, and more!

Loaders and readers allow you to easily ingest data for search and retrieval by a large language model, while tools allow the models to both read from and write to third-party data services and sources. Ultimately, this allows you to create your own customized data agent that intelligently works with you and your data to unlock the full capabilities of large language models.

For a variety of examples of data agents, see the notebooks directory. You can find example Jupyter notebooks for creating data agents that load and parse data from Google Docs, SQL databases, Notion, and Slack; manage your Google Calendar and Gmail inbox; or read and use OpenAPI specs.

For an easier way to browse the integrations available, check out the website here: https://llamahub.ai/.


Usage (Use llama-hub as PyPI package)

These general-purpose loaders are designed to be used as a way to load data into LlamaIndex and/or subsequently used in LangChain.

Installation

pip install llama-hub

LlamaIndex

from llama_index import VectorStoreIndex
from llama_hub.google_docs import GoogleDocsReader

gdoc_ids = ['1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec']
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)
index = VectorStoreIndex.from_documents(documents)
index.as_query_engine().query('Where did the author go to school?')

LlamaIndex Data Agent

from llama_index.agent import OpenAIAgent
import openai
openai.api_key = 'sk-api-key'

from llama_hub.tools.google_calendar import GoogleCalendarToolSpec
tool_spec = GoogleCalendarToolSpec()

agent = OpenAIAgent.from_tools(tool_spec.to_tool_list())
agent.chat('what is the first thing on my calendar today')
agent.chat("Please create an event for tomorrow at 4pm to review pull requests")

For a variety of examples of creating and using data agents, see the notebooks directory.

LangChain

Note: Make sure you change the description of the Tool to match your use case.

from llama_index import VectorStoreIndex
from llama_hub.google_docs import GoogleDocsReader
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# load documents
gdoc_ids = ['1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec']
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)
langchain_documents = [d.to_langchain_format() for d in documents]

# initialize sample QA chain
llm = OpenAI(temperature=0)
qa_chain = load_qa_chain(llm)
question="<query here>"
answer = qa_chain.run(input_documents=langchain_documents, question=question)

Loader Usage (Use download_loader from LlamaIndex)

You can also use the loaders with download_loader from LlamaIndex in a single line of code.

For example, see the code snippets below using the Google Docs Loader.

from llama_index import VectorStoreIndex, download_loader

GoogleDocsReader = download_loader('GoogleDocsReader')

gdoc_ids = ['1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec']
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)
index = VectorStoreIndex.from_documents(documents)
index.as_query_engine().query('Where did the author go to school?')

Llama-Pack Usage

Llama-packs can be downloaded using the llamaindex-cli tool that comes with llama-index:

llamaindex-cli download-llamapack ZephyrQueryEnginePack --download-dir ./zephyr_pack

Or with the download_llama_pack function directly:

from llama_index.llama_pack import download_llama_pack

# download and install dependencies
LlavaCompletionPack = download_llama_pack(
  "LlavaCompletionPack", "./llava_pack"
)
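
Once downloaded, a llama-pack is just a Python class; most packs expose a run() method. As a rough sketch (the constructor argument below is hypothetical; each pack's README documents its real signature):

# hypothetical usage -- constructor arguments vary by pack and are assumed here
llava_pack = LlavaCompletionPack(image_url="https://path/to/your/image.jpg")
response = llava_pack.run("Describe this image")
print(response)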

Llama-Dataset Usage

(NOTE: in what follows we present the pattern for producing a RAG benchmark with the RagEvaluatorPack over a LabelledRagDataset. However, there are also other types of llama-datasets, such as the LabelledEvaluatorDataset, with corresponding llama-packs for producing benchmarks on their respective tasks. They all follow a similar usage pattern. Please refer to their READMEs to learn more about each type of llama-dataset.)

The primary use of a llama-dataset is evaluating the performance of a RAG system. In particular, it serves as a new test set (in traditional machine-learning speak) to build a RAG system over, predict on, and subsequently evaluate by comparing the predicted responses against the reference responses. To perform the evaluation, the recommended usage pattern involves applying the RagEvaluatorPack. We recommend reading the docs of the "Evaluation" module for more information on all of our llama-datasets.

from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack
from llama_index import VectorStoreIndex

# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset(
  "PaulGrahamEssayDataset", "./data"
)

# build basic RAG system
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# evaluate using the RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
  "RagEvaluatorPack", "./rag_evaluator_pack"
)
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset,
    query_engine=query_engine
)
benchmark_df = rag_evaluator_pack.run()  # async arun() supported as well

Llama-datasets can also be downloaded directly using llamaindex-cli, which comes installed with the llama-index python package:

llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data

After downloading them with llamaindex-cli, you can inspect the dataset and its source files (stored in a source_files directory) and then load them into Python:

from llama_index import SimpleDirectoryReader
from llama_index.llama_dataset import LabelledRagDataset

rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(
    input_dir="./data/source_files"
).load_data()
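
To take a quick look at the labelled examples themselves, you can view the dataset as a pandas DataFrame (a small sketch, assuming the to_pandas helper on LabelledRagDataset):

# inspect the query / reference-answer pairs
rag_dataset.to_pandas().head()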

How to add a loader/tool/llama-pack

Adding a loader/tool/llama-pack simply requires forking this repo and making a Pull Request. The Llama Hub website will update automatically when a new llama-hub release is made. However, please keep in mind the following guidelines when making your PR.

Step 0: Setup virtual environment, install Poetry and dependencies

Create a new Python virtual environment. The command below creates an environment in .venv, and activates it:

python -m venv .venv
source .venv/bin/activate

If you are on Windows, use the following command to activate your virtual environment:

.venv\scripts\activate

Install poetry:

pip install poetry

Install the required dependencies (this will also install llama_index):

poetry install

This will create an editable install of llama-hub in your venv.

Step 1: Create a new directory

For loaders, create a new directory in llama_hub; for tools, create a directory in llama_hub/tools; and for llama-packs, create a directory in llama_hub/llama_packs. It can be nested within another directory, but name it something unique, because the directory name will become the identifier for your loader (e.g. google_docs). Inside your new directory, create an __init__.py file specifying the module's public interface with __all__, a base.py file that will contain your loader implementation, and, if needed, a requirements.txt file listing your loader's package dependencies. Those packages will automatically be installed when your loader is used, so there is no need to worry about that.

If you'd like, you can create the new directory and files by running the following script in the llama_hub directory. Just remember to put your dependencies into a requirements.txt file.

./add_loader.sh [NAME_OF_NEW_DIRECTORY]
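
For reference, here is a minimal sketch of what those files might contain (the directory, class, and method bodies are placeholders, and the imports assume the llama-index versions this repo targeted):

# llama_hub/my_service/__init__.py  (hypothetical loader directory)
from llama_hub.my_service.base import MyServiceReader

__all__ = ["MyServiceReader"]

# llama_hub/my_service/base.py
from typing import List

from llama_index.readers.base import BaseReader
from llama_index.schema import Document


class MyServiceReader(BaseReader):
    """Toy reader that wraps a list of raw strings."""

    def load_data(self, texts: List[str]) -> List[Document]:
        # a real loader would call an external API or parse files here
        return [Document(text=t) for t in texts]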

Step 2: Write your README

Inside your new directory, create a README.md that mirrors that of the existing ones. It should have a summary of what your loader or tool does, its inputs, and how it is used in the context of LlamaIndex and LangChain.

Step 3: Add your loader to the library.json file

Finally, add your loader to the llama_hub/library.json file (or for the equivalent library.json under tools/ or llama-packs/) so that it may be used by others. As is exemplified by the current file, add the class name of your loader or tool, along with its ID, author, etc. This file is referenced by the Llama Hub website and the download function within LlamaIndex.

Step 4: Make a Pull Request!

Create a PR against the main branch. We typically review PRs within a day. To help expedite the process, it may be helpful to provide screenshots (either in the PR or in the README directly) showing your data loader or tool in action!

How to add a llama-dataset

Similar to the process of adding a tool / loader / llama-pack, adding a llama-dataset also requires forking this repo and making a Pull Request. However, for a llama-dataset, only its metadata is checked into this repo. The actual dataset and its source files are instead checked into a separate GitHub repository, the llama-datasets repository. You will need to fork and clone that repo in addition to forking and cloning this one.

Please ensure that when you clone the llama-datasets repository, you set the environment variable GIT_LFS_SKIP_SMUDGE before calling the git clone command:

# for bash
GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:<your-github-user-name>/llama-datasets.git  # for ssh
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https

# for windows, it's done in two commands
set GIT_LFS_SKIP_SMUDGE=1
git clone git@github.com:<your-github-user-name>/llama-datasets.git  # for ssh

set GIT_LFS_SKIP_SMUDGE=1
git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https

The high-level steps for adding a llama-dataset are as follows:

  1. Create a LabelledRagDataset (the initial class of llama-dataset made available on llama-hub)
  2. Generate a baseline result with a RAG system of your own choosing on the LabelledRagDataset
  3. Prepare the dataset's metadata (card.json and README.md)
  4. Submit a Pull Request to this repo to check in the metadata
  5. Submit a Pull Request to the llama-datasets repository to check in the LabelledRagDataset and the source files

To assist with the submission process, we have prepared a submission template notebook that walks you through the above-listed steps. We highly recommend that you use this template notebook.

(NOTE: you can use the above process for submitting any of our other supported types of llama-datasets such as the LabelledEvaluatorDataset.)

Running tests

python3.9 -m venv .venv
source .venv/bin/activate 
pip3 install -r test_requirements.txt

poetry run make test

Changelog

If you want to track the latest version updates / see which loaders are added to each release, take a look at our full changelog here!

FAQ

How do I test my loader before it's merged?

There is an argument called loader_hub_url in download_loader that defaults to the main branch of this repo. You can set it to your branch or fork to test your new loader.
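
For example (a sketch; the URL is a placeholder for your own fork and branch):

from llama_index import download_loader

# point download_loader at your fork/branch instead of the main repo
GoogleDocsReader = download_loader(
    "GoogleDocsReader",
    loader_hub_url="https://raw.githubusercontent.com/<your-username>/llama-hub/<your-branch>/llama_hub",
)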

Should I create a PR against LlamaHub or the LlamaIndex repo directly?

If you have a data loader PR, by default let's try to create it against LlamaHub! We will make exceptions in certain cases (for instance, if we think the data loader should be core to the LlamaIndex repo).

For all other PRs relevant to LlamaIndex, please create them directly against the LlamaIndex repo.

How can I get a verified badge on LlamaHub?

We have just started offering badges to our contributors. At the moment, we're focused on our early adopters and official partners, but we're gradually opening up badge consideration to all submissions. If you're interested in being considered, please review the criteria below, and if everything aligns, feel free to contact us via the community Discord.

We are still refining our criteria but here are some aspects we consider:

Quality

  • Code quality, as illustrated by the use of coding standards and style guidelines.
  • Code readability and proper documentation.

Usability

  • The module is self-contained, with no external links or libraries, and is easy to run.
  • The module should not break any existing unit tests.

Safety

  • Safety considerations, such as proper input validation, avoiding SQL injection, and secure handling of user data.

Community Engagement & Feedback

  • The module's usefulness to the library's users as gauged by the number of likes, downloads, etc.
  • Positive feedback from module users.

Note:

  • It's possible that we decide to award a badge to a subset of your submissions based on the above criteria.
  • Being a regular contributor doesn't guarantee a badge; we will still look at each submission individually.

Other questions?

Feel free to hop into the community Discord or tag the official Twitter account!


llama-hub's Issues

Notion Loader

ModuleNotFoundError: No module named 'langchain.chains.prompt_selector' when running the code
from llama_index import download_loader

Making llama-hub a Git submodule

Can this repo be a git submodule of the llama-index repo? That way the dependency is more explicit and one can easily update the submodule to point to the most recent commit. I think the way things are now, if a client downloads the Gmail reader, for example, and it then gets updated here, the client will not receive those updates until they explicitly pass the refresh_cache parameter.
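
A minimal sketch of the refresh_cache workaround mentioned above:

from llama_index import download_loader

# bypass the local cache so the latest loader code is fetched from llama-hub
GmailReader = download_loader("GmailReader", refresh_cache=True)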

Github Repo Errors, KeyError

For some repositories (I can't find a consistent pattern between public/private or org/user) I get key errors.

For example, when pulling from kubernetes/website, I get

<llama_index.readers.llamahub_modules.github_repo.base.GithubRepositoryReader object at 0x7f79f271be50>
current path: 
Traceback (most recent call last):
  File "/home/sam/git/mcsh/test2/repo.py", line 21, in <module>
    docs = loader.load_data(branch="main")
  File "/home/sam/.local/lib/python3.10/site-packages/llama_index/readers/llamahub_modules/github_repo/base.py", line 312, in load_data
    return self._load_data_from_branch(branch)
  File "/home/sam/.local/lib/python3.10/site-packages/llama_index/readers/llamahub_modules/github_repo/base.py", line 277, in _load_data_from_branch
    blobs_and_paths = self._loop.run_until_complete(
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/sam/.local/lib/python3.10/site-packages/llama_index/readers/llamahub_modules/github_repo/base.py", line 348, in _recurse_tree
    tree_data: GitTreeResponseModel = await self._github_client.get_tree(
  File "/home/sam/.local/lib/python3.10/site-packages/llama_index/readers/llamahub_modules/github_repo/github_client.py", line 365, in get_tree
    return GitTreeResponseModel.from_json(
  File "/home/sam/.local/lib/python3.10/site-packages/dataclasses_json/api.py", line 65, in from_json
    return cls.from_dict(kvs, infer_missing=infer_missing)
  File "/home/sam/.local/lib/python3.10/site-packages/dataclasses_json/api.py", line 72, in from_dict
    return _decode_dataclass(cls, kvs, infer_missing)
  File "/home/sam/.local/lib/python3.10/site-packages/dataclasses_json/core.py", line 201, in _decode_dataclass
    init_kwargs[field.name] = _decode_generic(field_type,
  File "/home/sam/.local/lib/python3.10/site-packages/dataclasses_json/core.py", line 263, in _decode_generic
    res = _get_type_cons(type_)(xs)
  File "/home/sam/.local/lib/python3.10/site-packages/dataclasses_json/core.py", line 317, in <genexpr>
    items = (_decode_dataclass(type_arg, x, infer_missing)
  File "/home/sam/.local/lib/python3.10/site-packages/dataclasses_json/core.py", line 159, in _decode_dataclass
    field_value = kvs[field.name]
KeyError: 'url'

For a private org, I get

<llama_index.readers.llamahub_modules.github_repo.base.GithubRepositoryReader object at 0x7f1b7701fe50>
Traceback (most recent call last):
  File "/home/sam/git/mcsh/test2/repo.py", line 21, in <module>
    docs = loader.load_data(branch="main")
  File "/home/sam/.local/lib/python3.10/site-packages/llama_index/readers/llamahub_modules/github_repo/base.py", line 312, in load_data
    return self._load_data_from_branch(branch)
  File "/home/sam/.local/lib/python3.10/site-packages/llama_index/readers/llamahub_modules/github_repo/base.py", line 272, in _load_data_from_branch
    branch_data: GitBranchResponseModel = self._loop.run_until_complete(
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/sam/.local/lib/python3.10/site-packages/llama_index/readers/llamahub_modules/github_repo/github_client.py", line 340, in get_branch
    return GitBranchResponseModel.from_json(
  File "/home/sam/.local/lib/python3.10/site-packages/dataclasses_json/api.py", line 65, in from_json
    return cls.from_dict(kvs, infer_missing=infer_missing)
  File "/home/sam/.local/lib/python3.10/site-packages/dataclasses_json/api.py", line 72, in from_dict
    return _decode_dataclass(cls, kvs, infer_missing)
  File "/home/sam/.local/lib/python3.10/site-packages/dataclasses_json/core.py", line 159, in _decode_dataclass
    field_value = kvs[field.name]
KeyError: 'commit'

And when I reference that private org's commit hash instead of the branch, I get

Traceback (most recent call last):
  File "/home/sam/git/mcsh/test2/repo.py", line 21, in <module>
    docs = loader.load_data(commit="cec867380358c611a78eaf58a0c282062b7d877f")
TypeError: GithubRepositoryReader.load_data() got an unexpected keyword argument 'commit'

Works on other repos though, including mine

Ideas for better twitter data

I've found that just getting individual tweets can lack context. I was thinking about making something that gets tweets in the following format:

TWEET: do you like coffee?
USERNAME REPLY: yes
###

Where USERNAME is the twitter user data has been scraped on.

This breaks down a little for longer reply chains:

TWEET: do you like coffee?
USERNAME REPLY: yes
REPLY: why?
USERNAME REPLY: good drink
USERNAME REPLY: and doubles as a weapon when hot
###

and very long threads:

USERNAME: a thread on how many tweets to include in a thread 1/1000

Threads don't need to have this sort of context (which can be very long) so they can still be individual documents but replies do.

Simple solution:
For any given tweet, check if there's a reply and if yes write to a .txt in the first format shown.

Problems:

  • can only get one reply
  • can't differentiate from a thread

Any ideas?
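
For what it's worth, a small sketch of the simple solution described above (the tweet structure here is hypothetical):

def format_tweet_with_replies(tweet, username):
    # tweet is a hypothetical dict: {"text": str, "replies": [{"user": str, "text": str}, ...]}
    lines = [f"TWEET: {tweet['text']}"]
    for reply in tweet.get("replies", []):
        prefix = f"{username} REPLY" if reply["user"] == username else "REPLY"
        lines.append(f"{prefix}: {reply['text']}")
    lines.append("###")
    return "\n".join(lines)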

Elasticsearch: Current extra_info content breaks document indexing

Currently, the extra_info field contains the entire ElasticSearch document JSON object.

Document(text=value, extra_info=hit["_source"], embedding=embedding)

This ends up with something like this:
text: my content...
extra_info: {'id': '168613', 'groupId': '10719', 'publishDate': '19700101000000', 'language': 'English', 'content': 'my content...'}

However, this logic is causing an error, see run-llama/llama_index#748

TBH I have no idea how and where this should be fixed.

Bug in Github Loader - RuntimeError: This event loop is already running.

@ahmetkca
I'm attempting to get GithubRepositoryReader working using the docs, but seem to be hitting this asyncio issue.
This is running locally in a Jupyter notebook.

llama_index.__version__ 
#'0.4.25'

Any ideas?

See below code to replicate.

from llama_index import download_loader
from llama_index.readers.llamahub_modules.github_repo import GithubRepositoryReader, GithubClient

github_client = GithubClient(os.getenv('GITHUB_TOKEN'))

loader = GithubRepositoryReader(
    github_client,
    owner =                  "jerryjliu",
    repo =                   "llama_index",
    filter_directories =     (["gpt_index", "docs"], GithubRepositoryReader.FilterType.INCLUDE),
    filter_file_extensions = ([".py"], GithubRepositoryReader.FilterType.INCLUDE),
    verbose =                True,
    concurrent_requests =    10,
)

docs_branch = loader.load_data(branch="main")

RuntimeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 docs_branch = loader.load_data(branch="main")

llama_index/readers/llamahub_modules/github_repo/base.py:312, in GithubRepositoryReader.load_data(self, commit_sha, branch)
309 return self._load_data_from_commit(commit_sha)
311 if branch is not None:
--> 312 return self._load_data_from_branch(branch)
314 raise ValueError("You must specify one of commit or branch.")

llama_index/readers/llamahub_modules/github_repo/base.py:272, in GithubRepositoryReader._load_data_from_branch(self, branch)
262 def _load_data_from_branch(self, branch: str) -> List[Document]:
263 """
264 Load data from a branch.
265
(...)
270 :return: list of documents
271 """
--> 272 branch_data: GitBranchResponseModel = self._loop.run_until_complete(
273 self._github_client.get_branch(self._owner, self._repo, branch)
274 )
276 tree_sha = branch_data.commit.commit.tree.sha
277 blobs_and_paths = self._loop.run_until_complete(
278 self._recurse_tree(tree_sha)
279 )

asyncio/base_events.py:625, in BaseEventLoop.run_until_complete(self, future)
614 """Run until the Future is done.
615 
616 If the argument is a coroutine, it is wrapped in a Task.

(...)
622 Return the Future's result, or raise its exception.
623 """
624 self._check_closed()
--> 625 self._check_running()
627 new_task = not futures.isfuture(future)
628 future = tasks.ensure_future(future, loop=self)

asyncio/base_events.py:584, in BaseEventLoop._check_running(self)
582 def _check_running(self):
583     if self.is_running():

--> 584 raise RuntimeError('This event loop is already running')
585 if events._get_running_loop() is not None:
586 raise RuntimeError(
587 'Cannot run the event loop while another loop is running')

RuntimeError: This event loop is already running
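
A common workaround for nested event loops in notebook environments (not mentioned in this thread, so only a suggestion) is nest_asyncio:

# patch the already-running Jupyter event loop so run_until_complete can be nested
import nest_asyncio
nest_asyncio.apply()

docs_branch = loader.load_data(branch="main")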

Hardcoded filenames and oauth port

At least in the two loaders I tried (gmail and gdocs), there are hardcoded filename strings and, even worse, hardcoded port numbers (and 8080 is quite a common port). I suggest making those values, especially the port, configurable.

SimpleDirectoryReader issue

In addition to #81, importing SimpleDirectoryReader from llama_index and then trying to parse certain files throughout a directory as shown in this page gives the following error: 'str' object has no attribute 'parser_config_set'. Any ideas? TIA!

S3 loader not working in AWS EC2 environment

Hello everyone, I'm facing an error when I try to run the llama-index download_loader function, running on AWS Linux with Python 3.8. With some research I found that this problem is often related to Python trying to open a folder as a file.

Note: this error happens just from calling the S3Reader function from download_loader.
---> S3Reader = download_loader("S3Reader") <--- just that line reproduces this error.

Could someone help me with this? Thanks.


PDF Loader Issue

AttributeError: module 'PyPDF2' has no attribute 'PdfReader'

Traceback

File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "/Users/serge/Downloads/ClassGPT-main/app/01_❓_Ask.py", line 52, in <module>
    res = query_gpt(chosen_class, chosen_pdf, query)
File "/Users/serge/Downloads/ClassGPT-main/app/utils.py", line 51, in query_gpt
    documents = loader.load_data(pdf_tmp_path)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/llama_index/readers/llamahub_modules/file/pdf/base.py", line 22, in load_data
    pdf = PyPDF2.PdfReader(fp)

citations or references

I have been trying to wrap my head around getting references for responses from readers that load data from multiple links (say we scraped 30 links): I mean getting the exact link a response came from, not just displaying all the links we scraped. Any idea how?

RemoteDepthReader fails

    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 695, in send
    adapter = self.get_adapter(url=request.url)
  File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 792, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for "['https://ocw.mit.edu/courses/5-05-principles-of-inorganic-chemistry-iii-spring-2005/pages/syllabus/']"

This is using the example on the loader page

Loader idea: Integration with Segment

Request

Since Segment already aggregates data from a bunch of sources, being able to load data from there directly into Documents would be great.

How to increase the length of the output ?

I am using this code to document some code files:

import os
from llama_index import SimpleDirectoryReader
from llama_index import GPTSimpleVectorIndex

os.environ['OPENAI_API_KEY'] = "sk-..."  
documents = SimpleDirectoryReader('', ['c:/harbour/src/vm/arrays.c']).load_data()
index = GPTSimpleVectorIndex( documents )
response = index.query("list, explain and document all the functions and parameters for each function provided in such file")
print( response )

The printed response stops before finishing. How can I increase the output length?

Elasticsearch: Set doc_id based on a specific ES document field

It would be handy to have the ability to map doc_id to a specific ES document field and pass it to the Document constructor. Currently, it is not passed so it is generated automatically.

Document(text=value, extra_info=hit["_source"], embedding=embedding)

When the same document is fetched next time, this auto-generated ID differs so it is impossible to decide if the content is the same or if the Llama index needs to be updated.
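
A minimal sketch of the suggestion (the "id" field and the doc_id_field knob are hypothetical):

# hypothetical: let the caller pick which ES field becomes the stable doc_id
doc_id_field = "id"
document = Document(
    text=value,
    doc_id=hit["_source"][doc_id_field],
    extra_info=hit["_source"],
    embedding=embedding,
)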

RDFReader still failing in llama_index

Although this works with gpt-index==0.4.15, it is still failing with llama-index==0.4.15.

from pathlib import Path
from llama_index import download_loader

RDFReader = download_loader("RDFReader")

loader = RDFReader()
documents = loader.load_data(file=Path('./example.nt'))

Stack trace:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 7
      4 RDFReader = download_loader("RDFReader")
      6 loader = RDFReader()
----> 7 documents = loader.load_data(file=Path('./example.nt'))

File ~/projects/python/miniconda/lib/python3.10/site-packages/llama_index/readers/llamahub_modules/file/rdf/base.py:51, in RDFReader.load_data(self, file, extra_info)
     47 """Parse file."""
     49 lang = extra_info["lang"] if extra_info is not None else "en"
---> 51 self.g_local = Graph()
     52 self.g_local.parse(file)
     54 self.g_global = Graph()

NameError: name 'Graph' is not defined

Loader for Semantic Mapping of Hierarchical Medical Terminologies using Large Language Models

I want to use a large language model to perform a semantic mapping between two medical terminologies that have hierarchical structures. Semantic mapping refers to the process of finding relationships between concepts in different terminologies based on their meaning.

Let us consider the task of mapping between SNOMED-CT and ICD10 as an example.

SNOMED CT stands for "Systematized Nomenclature of Medicine – Clinical Terms." It is a healthcare terminology that uses a hierarchical data structure based on concepts, descriptions, and relationships. Concepts are the building blocks of SNOMED CT and represent clinical ideas or entities. Each concept has a unique numeric identifier and belongs to a specific hierarchy, such as body structure, clinical finding, procedure, or substance. Each concept can have multiple descriptions. Relationships define connections between concepts, creating a rich semantic network.

SNOMED CT files are typically distributed in RF2 (Release Format 2) format. RF2 files are structured as tab-delimited text files, with each file representing a specific type of data, such as concepts, descriptions, or relationships.

Below is a sample set of files to give an idea about the format

Concept File:

id	effectiveTime	active	moduleId	definitionStatusId

123456789	20210131	1	900000000000207008	900000000000074008
234567890	20210131	1	900000000000207008	900000000000074008

Description file:

id	effectiveTime	active	moduleId	conceptId	languageCode	typeId	term	caseSignificanceId

111111111	20210131	1	900000000000207008	123456789	en	900000000000003001	Sample term 1	900000000000020002
222222222	20210131	1	900000000000207008	234567890	en	900000000000003001	Sample term 2	900000000000020002

Relationship file:

id	effectiveTime	active	moduleId	sourceId	destinationId	relationshipGroup	typeId	characteristicTypeId	modifierId

333333333	20210131	1	900000000000207008	123456789	234567890	0	116680003	900000

ICD10 stands for International Classification of Diseases, 10th Revision. It is a medical classification system developed by the World Health Organization. ICD-10 is divided into 21 chapters, each corresponding to a specific group of conditions or diseases based on their etiology, body system, or purpose. Chapters contain blocks. Blocks are usually based on etiology, anatomical site, or clinical presentation. Each block contains several three-character categories. Categories can be further divided into subcategories.

ICD10 files can be downloaded in XML format. Below is a sample file to give you an idea of the format.

<?xml version="1.0" encoding="UTF-8"?>
<ICD10>
  <chapter code="I">
    <title>Diseases of the circulatory system</title>
    <block code="I10-I15">
      <title>Hypertensive diseases</title>
      <item code="I10">
        <description>Essential (primary) hypertension</description>
      </item>
      <item code="I11">
        <description>Hypertensive heart disease</description>
      </item>
    </block>
    <block code="I20-I25">
      <title>Ischemic heart diseases</title>
      <item code="I20">
        <description>Angina pectoris</description>
      </item>
      <item code="I21">
        <description>Acute myocardial infarction</description>
      </item>
    </block>
  </chapter>
</ICD10>

Mapping between SNOMED CT and ICD-10 can be complex due to differences in granularity, structure, and purpose. My goal is to use a large language model for this mapping. I would greatly appreciate your help with the following.

  1. What would be the preferred document structure to capture the hierarchy? For making the mapping, the model might need to traverse different nodes of the hierarchy. ( Hierarchy as relation data in the case of RF2)
  2. Is there a conversational format you could suggest to convert terminologies into text documents?
  3. What would be the best llama index for the task?

no attribute 'run' error in SimpleWebPageReader

I'm trying to follow the steps in https://llamahub.ai/l/web-simple_web, but I encountered an AttributeError: 'RequestsWrapper' object has no attribute 'run'.
https://github.com/emptycrown/llama-hub/blob/c7e4dc3b1e0df2f91b9d0c875c8deacd196d5a6d/loader_hub/web/simple_web/base.py#L39

I investigated the source code in the langchain repo and found out that the run method was removed in the latest version (v0.0.101).
langchain-ai/langchain@82baecc#diff-b140ae2f5bd8d71c563513083522371721735183a2afb04ebec66f6f593f49e7

We should use get instead of run.
https://github.com/hwchase17/langchain/blob/82baecc89297582e81b3a4c1ff49feee03ba711f/langchain/requests.py#L21
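
A sketch of the change (variable names are illustrative, not the loader's actual code):

from langchain.requests import RequestsWrapper

requests_wrapper = RequestsWrapper()
url = "https://example.com"
# RequestsWrapper.run(url) was removed in langchain v0.0.101; use get() instead
html = requests_wrapper.get(url)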

Example For KnowledgeBaseWebReader Needs To Be Updated

When Using Example:
https://llamahub.ai/l/web-knowledge_base

I get error:
Traceback (most recent call last):
File "/home/scott/dev/openai/wso2-docs/wso2-apim-docs.py", line 11, in
loader = KnowledgeBaseWebReader()
TypeError: KnowledgeBaseWebReader.init() missing 3 required positional arguments: 'root_url', 'link_selectors', and 'article_path'

I see that the parameters from load_data have been moved to the KnowledgeBaseWebReader init:
loader_hub/web/knowledge_base/base.py

I changed it to be like this and I see it's running to a point but I'm not getting output yet.
I also see that the original url wasn't working since they changed their site so that might be why.

import sys
sys.path.append('../') # add the parent directory to the module search path
from openaikey import *

from llama_index import GPTSimpleVectorIndex, download_loader
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
from langchain.chains.conversation.memory import ConversationBufferMemory

KnowledgeBaseWebReader = download_loader("KnowledgeBaseWebReader")
loader = KnowledgeBaseWebReader(
    root_url='https://www.intercom.com/help/en/?on_pageview_event=nav&on_pageview_event=nav',
    link_selectors=['.article-list a', '.article-list a'],
    article_path='/articles',
    title_selector='.article-title',
    subtitle_selector='.article-subtitle',
    body_selector='.article-body'
)
documents = loader.load_data()
index = GPTSimpleVectorIndex(documents)

tools = [
    Tool(
        name="Website Index",
        func=lambda q: index.query(q),
        description=f"Useful when you want answer questions about a product that has a public knowledge base.",
    ),
]
llm = OpenAI(temperature=0)
memory = ConversationBufferMemory(memory_key="chat_history")
agent_chain = initialize_agent(
    tools, llm, agent="zero-shot-react-description", memory=memory
)

output = agent_chain.run(input="What languages does Intercom support?")

Does not load PDFs.

I saw from the readme that this loader will call the MP3 loader, and immediately tested whether it could do the same with a PDF.

Didn't work; I will dig into whether that functionality is here or in the downstream loaders.

Missing URL reference for S3 loader

The last commit has an issue in the S3 connector; I'm not able to connect to the S3 bucket. @claudenm please check line 71 of the loader, I think it should be accessing the attribute s3_endpoint_url and not s3_url...

File "/home/vflow/.local/lib/python3.9/site-packages/llama_index/readers/llamahub_modules/s3/base.py", line 71, in load_data
s3_client = session.client("s3", endpoint_url=self.s3_url)
AttributeError: 'S3Reader' object has no attribute 's3_url'
'S3Reader' object has no attribute 's3_url' [file '/home/vflow/.local/lib/python3.9/site-packages/llama_index/readers/llamahub_modules/s3/base.py', line 71]
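
In other words, the suggested fix amounts to (sketch):

# llama_hub/s3/base.py, line 71
# before (raises AttributeError): s3_client = session.client("s3", endpoint_url=self.s3_url)
s3_client = session.client("s3", endpoint_url=self.s3_endpoint_url)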

Google Docs connector error - OSError: [Errno 98] Address already in use

I'm trying to connect Google Colab to a Google Doc using the llama-index & openai libraries.

Code I'm running:

from llama_index import GPTSimpleVectorIndex, download_loader
OPENAI_API_KEY = "abc"

GoogleDocsReader = download_loader('GoogleDocsReader')

gdoc_ids = ['abc']
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)
index = GPTSimpleVectorIndex(documents)

I uploaded the credentials.json file as recommended in the doc in:
/content/credentials.json

When I'm running the script I get the following error:

---------------------------------------------------------------------------
OSError Traceback (most recent call last)
in
17 gdoc_ids = ['abc']
18 loader = GoogleDocsReader()
---> 19 documents = loader.load_data(document_ids=gdoc_ids)
20 index = GPTSimpleVectorIndex(documents)

8 frames
/usr/lib/python3.9/socketserver.py in server_bind(self)
464 if self.allow_reuse_address:
465 self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
--> 466 self.socket.bind(self.server_address)
467 self.server_address = self.socket.getsockname()
468

OSError: [Errno 98] Address already in use

What should I do ?

SlackReader is not skipping main messages before earliest_date but is skipping messages inside threads

Description
When using the SlackReader with a specified earliest_date, main messages that were posted before the earliest_date are still being returned, but messages inside threads are being skipped correctly. Here's an example of the code I used:

earliest_date = datetime(2023, 3, 26)
loader = SlackReader(
    slack_token,
    earliest_date=earliest_date,
)

I expected that only messages posted after the earliest_date would be returned, but this is not the case. However, when I added a message to a thread of an older main message (posted on March 22), both the newer thread message and the older main message were returned, as expected.

BilibiliTranscriptReader not working properly

I've tried the given example url and some other urls, they all raise the same warning like this:

UserWarning: No subtitles found for video: https://www.bilibili.com/video/BV1Km4y1z723/. Return Empty transcript.
  warnings.warn(
[Document(text='', doc_id='9ea3b653-2696-468c-8586-045882f39530', embedding=None, doc_hash='dc937b59892604f5a86ac96936cd7ff09e25f18ae6b758e8014a24c7fa039e91', extra_info=None)]

Seems that the bilibili_api is not working properly? @AlexZhangji

WikipediaReader = download_loader("WikipediaReader") error

PptxReader = download_loader("PptxReader")
C:\ProgramData\Anaconda3\lib\site-packages\llama_index\readers\llamahub_modules/file/pptx/requirements.txt
Traceback (most recent call last):

File "", line 1, in
PptxReader = download_loader("PptxReader")

File "C:\ProgramData\Anaconda3\lib\site-packages\llama_index\readers\download.py", line 192, in download_loader
pkg_resources.require([str(r) for r in requirements])

File "C:\ProgramData\Anaconda3\lib\site-packages\pkg_resources\__init__.py", line 956, in require
needed = self.resolve(parse_requirements(requirements))

File "C:\ProgramData\Anaconda3\lib\site-packages\pkg_resources\__init__.py", line 815, in resolve
dist = self._resolve_dist(

File "C:\ProgramData\Anaconda3\lib\site-packages\pkg_resources\__init__.py", line 844, in _resolve_dist
env = Environment(self.entries)

File "C:\ProgramData\Anaconda3\lib\site-packages\pkg_resources\__init__.py", line 1044, in __init__
self.scan(search_path)

File "C:\ProgramData\Anaconda3\lib\site-packages\pkg_resources\__init__.py", line 1077, in scan
self.add(dist)

File "C:\ProgramData\Anaconda3\lib\site-packages\pkg_resources\__init__.py", line 1096, in add
dists.sort(key=operator.attrgetter('hashcmp'), reverse=True)

File "C:\ProgramData\Anaconda3\lib\site-packages\pkg_resources\__init__.py", line 2631, in hashcmp
self.parsed_version,

File "C:\ProgramData\Anaconda3\lib\site-packages\pkg_resources\__init__.py", line 2685, in parsed_version
raise packaging.version.InvalidVersion(f"{str(ex)} {info}") from None

InvalidVersion: Invalid version: 'c3986' (package: -)

Any ideas? It does appear the __pycache__ directory and a base.cpython-38 file are not created, unlike for other readers.

Loader Issue

I'm trying to follow the steps in https://llamahub.ai/l/file-unstructured and while running SimpleDirectoryReader = download_loader("SimpleDirectoryReader"), I get the following error No such file or directory: '/Users/some_username/opt/anaconda3/lib/python3.9/site-packages/llama_index/readers/llamahub_modules/file/base.py' Any ideas what I'm doing wrong and suggestions to fix it? Thank you in advance!

Substack scraper not working

Following the example given here https://llamahub.ai/l/web-beautiful_soup_web, when I try to use the _substack_reader method to parse a substack:

from llama_index import download_loader

BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")

loader = BeautifulSoupWebReader()
documents = loader.load_data(urls=['https://ijeomaoluo.substack.com/p/healing-isnt-easy'], custom_hostname="substack.com")

I get the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".modules/web-beautiful_soup_web.py", line 136, in load_data
    data, metadata = self.website_extractor[hostname](soup, url)
TypeError: _substack_reader() takes 1 positional argument but 2 were given

This looks like a regression introduced in this commit:
23ae492
The fix would just be changing the method signature to take 2 arguments instead of 1.
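
That is, something along these lines (a sketch; the body is a placeholder, not the loader's actual extraction logic):

# before: def _substack_reader(soup): ...
# after -- accept the url that load_data now passes in:
def _substack_reader(soup, url):
    extra_info = {"URL": url}  # the extra positional argument can simply be recorded
    text = soup.get_text()     # placeholder extraction logic
    return text, extra_info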

Discord Loader: Loop already running

When running the discord loader example (I am using my own discord_token and my own channel_id):

from llama_index import download_loader
import os

DiscordReader = download_loader('DiscordReader')

os.environ['DISCORD_TOKEN'] = 'O...s'

channel_ids = [6...0]  # Replace with your channel_id
reader = DiscordReader()
documents = reader.load_data(channel_ids=channel_ids)

I get a "Loop already running" runtime error:

File c:\Users\TheDr\Anaconda3\envs\python310\lib\site-packages\llama_index\readers\llamahub_modules/discord/base.py:128, in DiscordReader.load_data(self, channel_ids, limit, oldest_first)
    123     if not isinstance(channel_id, int):
    124         raise ValueError(
    125             f"Channel id {channel_id} must be an integer, "
    126             f"not {type(channel_id)}."
    127         )
--> 128     channel_content = self._read_channel(
    129         channel_id, limit=limit, oldest_first=oldest_first
    130     )
    131     results.append(
    132         Document(channel_content, extra_info={"channel": channel_id})
    133     )
    134 return results

File c:\Users\TheDr\Anaconda3\envs\python310\lib\site-packages\llama_index\readers\llamahub_modules/discord/base.py:96, in DiscordReader._read_channel(self, channel_id, limit, oldest_first)
     92 def _read_channel(
     93     self, channel_id: int, limit: Optional[int] = None, oldest_first: bool = True
     94 ) -> str:
     95     """Read channel."""
---> 96     result = asyncio.get_event_loop().run_until_complete(
     97         read_channel(
     98             self.discord_token, channel_id, limit=limit, oldest_first=oldest_first
     99         )
    100     )
    101     return result
File c:\Users\TheDr\Anaconda3\envs\python310\lib\asyncio\base_events.py:617, in BaseEventLoop.run_until_complete(self, future)
    606 """Run until the Future is done.
    607 
    608 If the argument is a coroutine, it is wrapped in a Task.
   (...)
    614 Return the Future's result, or raise its exception.
    615 """
    616 self._check_closed()
--> 617 self._check_running()
    619 new_task = not futures.isfuture(future)
    620 future = tasks.ensure_future(future, loop=self)

File c:\Users\TheDr\Anaconda3\envs\python310\lib\asyncio\base_events.py:577, in BaseEventLoop._check_running(self)
    575 def _check_running(self):
    576     if self.is_running():
--> 577         raise RuntimeError('This event loop is already running')
    578     if events._get_running_loop() is not None:
    579         raise RuntimeError(
    580             'Cannot run the event loop while another loop is running')

RuntimeError: This event loop is already running

No module named "modules" when trying to use github_repo

@ahmetkca I'm following the doc to use github_repo

download_loader("GithubRepositoryReader")
from modules.github_repo import GithubClient, GithubRepositoryReader

But I got an error:

ModuleNotFoundError: No module named 'modules'

What did I miss? Could be a silly question since I'm new to Python. Thanks!
