langchain-ask-pdf's Introduction

Langchain Ask PDF (Tutorial)

You can find the step-by-step video tutorial for building this application on YouTube.

This is a Python application that allows you to load a PDF and ask questions about it using natural language. The application uses an LLM to generate a response about your PDF. The LLM will not answer questions unrelated to the document.

How it works

The application reads the PDF and splits the text into smaller chunks that can then be fed into an LLM. It uses OpenAI embeddings to create vector representations of the chunks. The application then finds the chunks that are semantically similar to the question that the user asked and feeds those chunks to the LLM to generate a response.

The application uses Streamlit to create the GUI and Langchain to deal with the LLM.
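
For reference, here is a minimal sketch of that flow, using the same classes that appear in the code further down this page (the file path and question are placeholders, and the OPENAI_API_KEY environment variable is assumed to be set):

from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# read the PDF and concatenate the text of all pages
reader = PdfReader("example.pdf")  # placeholder path
text = "".join(page.extract_text() or "" for page in reader.pages)

# split the text into overlapping chunks
chunks = CharacterTextSplitter(
    separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len
).split_text(text)

# embed the chunks and build a FAISS index (the knowledge base)
knowledge_base = FAISS.from_texts(chunks, OpenAIEmbeddings())

# retrieve the chunks most similar to the question and ask the LLM
question = "What is this document about?"  # placeholder question
docs = knowledge_base.similarity_search(question)
answer = load_qa_chain(OpenAI(), chain_type="stuff").run(
    input_documents=docs, question=question
)
print(answer)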

Installation

To install, clone this repository and install the requirements:

pip install -r requirements.txt

You will also need to add your OpenAI API key to the .env file.
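
For example, a minimal .env file in the project root (the value is a placeholder):

OPENAI_API_KEY=your-openai-api-key-here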

Usage

To use the application, run the app.py file with the streamlit CLI (after having installed streamlit):

streamlit run app.py

Contributing

This repository is for educational purposes only and is not intended to receive further contributions. It is intended as support material for the YouTube tutorial that shows how to build the project.

langchain-ask-pdf's People

Contributors

alejandro-ao, andy3278, axsddlr

langchain-ask-pdf's Issues

ModuleNotFoundError: No module named 'altair.vegalite.v4'

While running a Streamlit app from the command line, I encountered an error. Here are the details:

Environment:

Python version: 3.11
Streamlit version: not provided
Operating System: Windows

Command used to run the application:
E:\AI STUFF\PDF GPT\langchain-ask-pdf>streamlit run app.py

Error message:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Program Files\Python311\Scripts\streamlit.exe\__main__.py", line 4, in <module>
  File "C:\Program Files\Python311\Lib\site-packages\streamlit\__init__.py", line 55, in <module>
    from streamlit.delta_generator import DeltaGenerator as _DeltaGenerator
  File "C:\Program Files\Python311\Lib\site-packages\streamlit\delta_generator.py", line 43, in <module>
    from streamlit.elements.arrow_altair import ArrowAltairMixin
  File "C:\Program Files\Python311\Lib\site-packages\streamlit\elements\arrow_altair.py", line 36, in <module>
    from altair.vegalite.v4.api import Chart
ModuleNotFoundError: No module named 'altair.vegalite.v4'

It seems the module 'altair.vegalite.v4' is missing. However, I believe all dependencies were correctly installed.

Could you please help investigate and resolve this issue?

Labels: bug, help wanted, good first issue

Did not find openai_api_key,

I'm running this on Windows 10.
I added the API key in the .env file but I keep getting this error:

ValidationError: 1 validation error for OpenAIEmbeddings
__root__
  Did not find openai_api_key, please add an environment variable OPENAI_API_KEY which contains it, or pass openai_api_key as a named parameter. (type=value_error)

Traceback:
File "C:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "C:\Users\DELL\Desktop\Projects\langchain-ask-pdf-main\app.py", line 55, in <module>
    main()
File "C:\Users\DELL\Desktop\Projects\langchain-ask-pdf-main\app.py", line 37, in main
    embeddings = OpenAIEmbeddings()
    ^^^^^^^^^^^^^^^^^^
File "pydantic\main.py", line 341, in pydantic.main.BaseModel.__init__

Getting the error below:

ValidationError: 1 validation error for OpenAIEmbeddings
__root__
  Did not find openai_api_key, please add an environment variable OPENAI_API_KEY which contains it, or pass openai_api_key as a named parameter. (type=value_error)
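
As the error message itself says, the key has to reach the process either via the OPENAI_API_KEY environment variable or as a named parameter. A minimal sketch of both options, assuming python-dotenv (which is in the project's requirements) and a .env file next to app.py:

import os
from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings

load_dotenv()  # load variables from the .env file into the environment

# option 1: rely on the OPENAI_API_KEY environment variable
embeddings = OpenAIEmbeddings()

# option 2: pass the key explicitly as a named parameter
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))

If the error persists, it is worth checking that the .env file sits in the directory from which streamlit run app.py is executed and that the variable name is spelled OPENAI_API_KEY.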

Get a longer answer

Hello,

Is it possible to get longer answers?
Because the answers are too vague.

Question: can Langchain draw on a larger knowledge base so that it can provide more relevant answers? Let me explain: I'd like the model to go beyond a simple summary of the content and offer its own understanding, so that its answer can guide us in understanding the document. Of course, LLMs can hallucinate, so we need to be aware of this.

Thanks.

Feature: Autoload a PDF file by URL

Autoload the PDF instead of requiring a local file upload.

With this feature, we remove the file input.

Explanation:

  • The BytesIO class (from the io module) and the PdfReader class from the PyPDF2 library are imported to handle the PDF file.
  • The CharacterTextSplitter, OpenAIEmbeddings, FAISS, load_qa_chain, OpenAI, and get_openai_callback components are imported from the langchain library. They are used to build the question-answering system.
  • The OpenAI API key is expected to be available as the OPENAI_API_KEY environment variable.
  • The st.set_page_config and st.header functions from the streamlit library are used to set the title and header of the web app.
  • The PDF file is downloaded from a URL with requests and its contents are stored in a BytesIO object.
  • The text content of the PDF is extracted page by page with the extract_text method and concatenated into a single string.
  • The CharacterTextSplitter class is used to split the text into smaller chunks. These chunks are used to build a knowledge base for the question-answering system.
  • The OpenAIEmbeddings class is used to generate embeddings for the text chunks. These embeddings are used to perform similarity searches when answering questions.
  • The st.text_input function is used to prompt the user to ask a question about the PDF file.
  • If the user enters a question, a similarity search is performed with the knowledge_base.similarity_search method. A question-answering chain is created with load_qa_chain, and the retrieved documents are passed to it.
  • The run method of the question-answering chain is called with the input documents and the user question as arguments. The result is stored in the response variable.
  • The result is displayed with the st.write function.

The code: app.py

from io import BytesIO
import requests
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback

def main():
    st.set_page_config(page_title="Ask your PDF")
    st.header("Ask your PDF 💬")
    
    # load the PDF file
    url = 'https://www.example.com/example.pdf'
    response = requests.get(url)
    pdf = BytesIO(response.content)
    
    # extract the text
    pdf_reader = PdfReader(pdf)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()

    # split into chunks
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
      
    # create embeddings
    embeddings = OpenAIEmbeddings()
    knowledge_base = FAISS.from_texts(chunks, embeddings)
      
    # show user input
    user_question = st.text_input("Ask a question about your PDF:")
    if user_question:
        docs = knowledge_base.similarity_search(user_question)
        
        llm = OpenAI()
        chain = load_qa_chain(llm, chain_type="stuff")
        with get_openai_callback() as cb:
            response = chain.run(input_documents=docs, question=user_question)
            print(cb)
           
        st.write(response)
    

if __name__ == '__main__':
    main()

Run & Test

streamlit run .\app.py

What color is the sky? - How to add custom templates to ConversationalRetrievalChain?

This is actually important to add. If you ask about an unrelated topic, like "What color is the sky", it will still answer based on the document...

I tried combine_docs_chain_kwargs={"prompt": prompt} with no success. Also, I can't make the code work with LLMChain, like:

chat_prompt = ChatPromptTemplate.from_messages(
    [system_message_prompt, human_message_prompt]
)

chain = LLMChain(llm=chat, prompt=chat_prompt)
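
For reference, here is a minimal sketch of how a custom prompt is typically passed through combine_docs_chain_kwargs in langchain 0.0.x. This is not a confirmed fix for the issue above, and the placeholder texts stand in for the real PDF chunks; the prompt for the default "stuff" combine-docs chain needs context and question input variables:

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.vectorstores import FAISS

# placeholder chunks; in the real app these come from the PDF text splitter
knowledge_base = FAISS.from_texts(
    ["placeholder PDF chunk 1", "placeholder PDF chunk 2"],
    OpenAIEmbeddings(),
)

# custom prompt that tells the model to refuse off-topic questions
qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer only from the context below. If the question is not about the "
        "document, say that you don't know.\n\n{context}\n\nQuestion: {question}"
    ),
)

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(),
    retriever=knowledge_base.as_retriever(),
    combine_docs_chain_kwargs={"prompt": qa_prompt},
)

result = chain({"question": "What color is the sky?", "chat_history": []})
print(result["answer"])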

Deploy `langchain-ask-pdf` as APIs locally/on cloud using `langchain-serve`

Repo - langchain-serve.

  • Exposes APIs from function definitions locally as well as on the cloud.
  • Very few code changes are needed; the ease of development remains the same as local.
  • Supports both REST & WebSocket endpoints.
  • Serverless/autoscaling endpoints with automatic TLS certs on the cloud.
  • Real-time streaming, human-in-the-loop support - which is crucial for chatbots.
  • We can extend the simple existing app pdf-qna on langchain-serve.

Disclaimer: I'm the primary author of langchain-serve.

hi

Hi. Please help me: how can I create a custom model from many PDFs in the Persian language? Thank you.

requirements.txt modification needed

Thank you, Alejandro!
I got the chatbot to work on my Windows 11 PC with the following requirements.txt:

langchain==0.0.166
PyPDF2==3.0.1
python-dotenv==1.0.0
streamlit==1.18.1
faiss-cpu==1.7.4
altair<5
openapi==1.1.0
tiktoken

AttributeError: module 'openai' has no attribute 'error'

File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
exec(code, module.dict)
File "/Users/computer/Desktop/multi_doc/langchain-ask-pdf/app.py", line 55, in
main()
File "/Users/computer/Desktop/multi_doc/langchain-ask-pdf/app.py", line 38, in main
knowledge_base = FAISS.from_texts(chunks, embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/vectorstores/faiss.py", line 384, in from_texts
embeddings = embedding.embed_documents(texts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 234, in embed_documents
return self._get_len_safe_embeddings(texts, engine=self.deployment)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 175, in _get_len_safe_embeddings
response = embed_with_retry(
^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 57, in embed_with_retry
retry_decorator = _create_retry_decorator(embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 45, in _create_retry_decorator
retry_if_exception_type(openai.error.Timeout)

Bug: PDF images vs text

ERROR

[screenshot of the error attached in the original issue]

Reproduce

  1. Upload a PDF that is image-based (e.g. a scanned employee handbook).
  2. The error will show.

Possible Solution

Wrap the PDF reading in try/except; if the PDF cannot be read, give the user an error message instead of a traceback.
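
A rough sketch of that suggestion, assuming the PyPDF2-based extraction used in app.py (the helper name extract_pdf_text is made up for illustration, and the exact exception types are an assumption):

import streamlit as st
from PyPDF2 import PdfReader

def extract_pdf_text(pdf_file):
    # return the extracted text, or None (after showing an error) if the PDF is unusable
    try:
        pdf_reader = PdfReader(pdf_file)
        text = "".join(page.extract_text() or "" for page in pdf_reader.pages)
    except Exception:
        st.error("Could not read this PDF. Please check that the file is a valid PDF.")
        return None
    if not text.strip():
        st.error("No selectable text was found. This PDF appears to be image-based (scanned); run OCR on it first.")
        return None
    return text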

getting module not found

Even though I installed the module, it shows that it's not found.
streamlit run app.py
Traceback (most recent call last):
File "/home/thakuradi/.local/bin/streamlit", line 5, in
from streamlit.web.cli import main
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/init.py", line 55, in
from streamlit.delta_generator import DeltaGenerator as _DeltaGenerator
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/delta_generator.py", line 36, in
from streamlit import config, cursor, env_util, logger, runtime, type_util, util
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/cursor.py", line 18, in
from streamlit.runtime.scriptrunner import get_script_run_ctx
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/init.py", line 16, in
from streamlit.runtime.runtime import Runtime as Runtime
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/runtime.py", line 29, in
from streamlit.runtime.app_session import AppSession
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/app_session.py", line 35, in
from streamlit.runtime import caching, legacy_caching
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/caching/init.py", line 21, in
from streamlit.runtime.caching.cache_data_api import (
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_data_api.py", line 37, in
from streamlit.runtime.caching import cache_utils
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 31, in
from streamlit import type_util
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/type_util.py", line 42, in
from pandas import DataFrame, Index, MultiIndex, Series
File "/home/thakuradi/.local/lib/python3.11/site-packages/pandas/init.py", line 22, in
from pandas.compat import is_numpy_dev as _is_numpy_dev # pyright: ignore # noqa:F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/thakuradi/.local/lib/python3.11/site-packages/pandas/compat/init.py", line 24, in
import pandas.compat.compressors
File "/home/thakuradi/.local/lib/python3.11/site-packages/pandas/compat/compressors.py", line 7, in
import bz2
File "/usr/local/lib/python3.11/bz2.py", line 17, in
from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'
