ask-my-pdf's Introduction

Ask my PDF

Thank you for your interest in my application. Please be aware that this is only a proof-of-concept system and may contain bugs or unfinished features. If you like this app, you can ❤️ follow me on Twitter for news and updates.

Ask my PDF - a question answering system built on top of GPT-3

🎲 The primary use case for this app is helping users answer questions about board game rules based on the instruction manual. While the app can be used for other tasks, helping users with board game rules is particularly meaningful to me since I'm an avid board game fan myself. Additionally, this use case is relatively harmless, even when the model hallucinates.

๐ŸŒ The app can be accessed on the Streamlit Community Cloud at https://ask-my-pdf.streamlit.app/. ๐Ÿ”‘ However, to use the app, you will need your own OpenAI's API key.

📄 The app implements the following academic papers:

Installation

  1. Clone the repo:

    git clone https://github.com/mobarski/ask-my-pdf

  2. Install dependencies:

    pip install -r ask-my-pdf/requirements.txt

  3. Run the app:

    cd ask-my-pdf/src

    ./run.sh    (Linux/macOS)
    run.bat     (Windows)

High-level documentation

RALM + HyDE

RALM + HyDE + context

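The HyDE step in the flow above replaces the raw question with a model-generated hypothetical answer before retrieval. A minimal sketch of that idea, where `embed()` and `generate()` are stand-ins for real embedding/completion API calls (illustration only, not the app's actual code):

```python
# Minimal sketch of the RALM + HyDE retrieval flow.
# embed() and generate() are stand-ins for real API calls.
import math
import re
from collections import Counter

def embed(text):
    # Stand-in embedding: bag-of-words term frequencies.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def generate(prompt):
    # Stand-in LLM call: HyDE asks the model for a *hypothetical* answer,
    # which is embedded in place of the short, vague question.
    return "a player wins the game by scoring the most victory points"

def hyde_retrieve(question, fragments, k=1):
    hypothetical = generate("Answer this question: " + question)
    qvec = embed(hypothetical)
    return sorted(fragments, key=lambda f: cosine(qvec, embed(f)), reverse=True)[:k]

fragments = [
    "Setup: shuffle the deck and deal five cards to each player.",
    "The player with the most victory points wins the game.",
    "Components: 100 cards, 4 tokens, 1 rulebook.",
]
best = hyde_retrieve("How do you win?", fragments)
```

The hypothetical answer shares far more vocabulary with the relevant rulebook fragment than the bare question does, which is why retrieval quality improves.
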
Environment variables used for configuration

General configuration:
  • STORAGE_SALT - cryptographic salt used when deriving the user/folder name and encryption key from the API key; hexadecimal notation, 2-16 characters

  • STORAGE_MODE - index storage mode: S3, LOCAL, DICT (default)

  • STATS_MODE - usage stats storage mode: REDIS, DICT (default)

  • FEEDBACK_MODE - user feedback storage mode: REDIS, NONE (default)

  • CACHE_MODE - embeddings cache mode: S3, DISK, NONE (default)

Local filesystem configuration (storage / cache):
  • STORAGE_PATH - directory path for index storage

  • CACHE_PATH - directory path for embeddings cache

S3 configuration (storage / cache):
  • S3_REGION - region code

  • S3_BUCKET - bucket name (storage)

  • S3_SECRET - secret key

  • S3_KEY - access key

  • S3_URL - URL

  • S3_PREFIX - object name prefix

  • S3_CACHE_BUCKET - bucket name (cache)

  • S3_CACHE_PREFIX - object name prefix (cache)

Redis configuration (for persistent usage statistics / user feedback):
  • REDIS_URL - Redis DB URL (redis[s]://:password@host:port/[db])

Community version related options:
  • OPENAI_KEY - API key used for the default user

  • COMMUNITY_DAILY_USD - default user's daily budget

  • COMMUNITY_USER - default user's code
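
As an example, a local, self-hosted run might be configured like this before launching run.sh (the values below are illustrative, not repository defaults):

```shell
# Example configuration for a local run with on-disk storage and cache.
# All values are illustrative.
export STORAGE_MODE=LOCAL
export STORAGE_PATH=./storage
export CACHE_MODE=DISK
export CACHE_PATH=./cache
export STATS_MODE=DICT
export FEEDBACK_MODE=NONE
export STORAGE_SALT=a1b2c3d4
```
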

ask-my-pdf's People

Contributors

carelfdewaal, mobarski, qixiaobo

ask-my-pdf's Issues

AttributeError: module 'tiktoken' has no attribute 'encoding_for_model'

Getting the following error when launching the app

File "C:\Users\xxx\Anaconda3\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "C:\Users\xxx\ask-my-pdf\src\gui.py", line 20, in <module>
    import model
File "C:\Users\xxx\ask-my-pdf\src\model.py", line 12, in <module>
    import ai
File "C:\Users\xxx\ask-my-pdf\src\ai.py", line 39, in <module>
    tokenizer_model = openai.model('text-davinci-003')
File "C:\Users\xxx\Anaconda3\lib\site-packages\ai_bricks\api\openai.py", line 30, in model
    return _class(name, **kwargs)
File "C:\Users\xxx\Anaconda3\lib\site-packages\ai_bricks\api\openai.py", line 57, in __init__
    self.encoder = tiktoken.encoding_for_model(name)

Multiple files

Question:
Is it possible to load multiple files at a time?
This way I could ask a question and it could search all the resource documents to compile an answer.

ClientError: An error occurred (InvalidArgument) when calling the ListObjects operation: Unknown

I got the following error:

ClientError: An error occurred (InvalidArgument) when calling the ListObjects operation: Unknown
Traceback:
File "/usr/local/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "/Users/canalescl/personal/replit/ask-my-pdf/src/gui.py", line 248, in <module>
    ui_pdf_file()
File "/Users/canalescl/personal/replit/ask-my-pdf/src/gui.py", line 91, in ui_pdf_file
    filenames += ss['storage'].list()
File "/Users/canalescl/personal/replit/ask-my-pdf/src/storage.py", line 46, in list
    return [self.decode(name) for name in self._list()]
File "/Users/canalescl/personal/replit/ask-my-pdf/src/storage.py", line 184, in _list
    resp = self.s3.list_objects(
File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)

A fresh clone fails after "Enter your OpenAI API key" and clicking Enter


This model's maximum context length is 8191 tokens, however you requested 13831 tokens (13831 in your prompt; 0 for the completion).

InvalidRequestError: This model's maximum context length is 8191 tokens, however you requested 13831 tokens (13831 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
Traceback:
File "/usr/local/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 561, in _run_script
    self._session_state.on_script_will_rerun(rerun_data.widget_states)
File "/usr/local/lib/python3.9/site-packages/streamlit/runtime/state/safe_session_state.py", line 68, in on_script_will_rerun
    self._state.on_script_will_rerun(latest_widget_states)
File "/usr/local/lib/python3.9/site-packages/streamlit/runtime/state/session_state.py", line 474, in on_script_will_rerun
    self._call_callbacks()
File "/usr/local/lib/python3.9/site-packages/streamlit/runtime/state/session_state.py", line 487, in _call_callbacks
    self._new_widget_state.call_callback(wid)
File "/usr/local/lib/python3.9/site-packages/streamlit/runtime/state/session_state.py", line 242, in call_callback
    callback(*args, **kwargs)
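
This error typically means too much retrieved context was packed into the prompt. Splitting the source text into token-bounded chunks before embedding avoids it; a minimal sketch, using whitespace tokens as a rough stand-in for the model's tokenizer (the real app uses tiktoken):

```python
def chunk_text(text, max_tokens=1000):
    # Split on whitespace as a rough stand-in for a real tokenizer,
    # so each chunk stays well under the model's context window.
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks

chunks = chunk_text("word " * 2500, max_tokens=1000)
```

In practice the chunk budget must also leave room for the question, the prompt template, and the completion, not just the retrieved fragments.
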

Question

Is it necessary to use an OpenAI API key, or do you have any advice on creating the API from scratch?

Persistence across sessions

Would it be possible to recover stored vector indexes across sessions (same API key), at least within 90 days?

Questions and answers

Would it be possible to read the document and then get GPT to generate X questions and answers based on the text?
This could be used to train an AI bot to help answer questions.

My scenario would be training an AI to explain a company-specific topic. Say you got it to read all internal documents on a certain subject, generated over many years. If it then generated thousands of questions and answers, you could train a bot to help users.

Where should I add the .env file?

Should I add it at the repository root?
Also, what does this section specify, please?
Thank you so much.
"Redis configuration (for persistent usage statistics / user feedback)"

ai_bricks.api

I have tried everything and I cannot install this module. Any pointers/tips?

from ai_bricks.api import openai

This line of code doesn't work at all.


Stream Responses like chatGPT

Is there any way to stream text by segmenting the fragments from model.query? The loading times to render the entire text block for larger PDFs are a bit too long.
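
One common pattern for this (not currently implemented in the app) is to accumulate fragments as they arrive and re-render after each one. A minimal sketch with a stand-in token stream and a hypothetical `render` callback; in Streamlit, `render` would typically update an `st.empty()` placeholder:

```python
def fake_stream(answer, size=8):
    # Stand-in for a streaming completion API: yields small fragments.
    for i in range(0, len(answer), size):
        yield answer[i:i + size]

def stream_answer(answer, render):
    # Accumulate fragments and re-render after each one, so the user
    # sees text appear incrementally instead of waiting for the full block.
    shown = ""
    for fragment in fake_stream(answer):
        shown += fragment
        render(shown)
    return shown

parts = []
final = stream_answer("The rulebook says each player draws two cards.", parts.append)
```
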

Deprecation Error

DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.

Have you tried pdfplumber instead?

Storage/cache mode does not work when set to LOCAL/DISK

Hi,
I have no problem running your demo, but something went wrong on my side when setting these parameters in run.sh:
STORAGE_MODE=LOCAL and CACHE_MODE=DISK.
No data is saved under the cache/storage folders on disk.
I have the same problem with REDIS, but that may be linked to the issues above.
Any ideas?
Thank you
