langchain-ask-pdf's Introduction

Langchain Ask PDF (Tutorial)

You can find the step-by-step video tutorial for building this application on YouTube.

This is a Python application that allows you to load a PDF and ask questions about it using natural language. The application uses an LLM to generate a response about your PDF. The LLM will not answer questions unrelated to the document.

How it works

The application reads the PDF and splits the text into smaller chunks that can then be fed into an LLM. It uses OpenAI embeddings to create vector representations of the chunks. The application then finds the chunks that are semantically similar to the question that the user asked and feeds those chunks to the LLM to generate a response.

The application uses Streamlit to create the GUI and Langchain to deal with the LLM.
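
For reference, here is a minimal sketch of that flow, using the same classes that appear in the code further down this page (the file path and question are placeholders, and the OPENAI_API_KEY environment variable is assumed to be set):

from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

# read the PDF and concatenate the text of all pages
reader = PdfReader("example.pdf")  # placeholder path
text = "".join(page.extract_text() or "" for page in reader.pages)

# split the text into overlapping chunks
chunks = CharacterTextSplitter(
    separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len
).split_text(text)

# embed the chunks and build a FAISS index (the knowledge base)
knowledge_base = FAISS.from_texts(chunks, OpenAIEmbeddings())

# retrieve the chunks most similar to the question and ask the LLM
question = "What is this document about?"  # placeholder question
docs = knowledge_base.similarity_search(question)
answer = load_qa_chain(OpenAI(), chain_type="stuff").run(
    input_documents=docs, question=question
)
print(answer)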

Installation

To install, clone this repository and install the requirements:

pip install -r requirements.txt

You will also need to add your OpenAI API key to the .env file.
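
For example, a minimal .env file in the project root (the value is a placeholder):

OPENAI_API_KEY=your-openai-api-key-here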

Usage

To use the application, run the app.py file with the streamlit CLI (after having installed streamlit):

streamlit run app.py

Contributing

This repository is for educational purposes only and is not intended to receive further contributions. It is intended as support material for the YouTube tutorial that shows how to build the project.

langchain-ask-pdf's People

Contributors

alejandro-ao, andy3278, axsddlr

langchain-ask-pdf's Issues

ModuleNotFoundError: No module named 'altair.vegalite.v4'

While running a Streamlit app from the command line, I encountered an error. Here are the details:

Environment:

Python version: 3.11
Streamlit version: not provided
Operating System: Windows

Command used to run the application:
E:\AI STUFF\PDF GPT\langchain-ask-pdf>streamlit run app.py

Error message:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Program Files\Python311\Scripts\streamlit.exe\__main__.py", line 4, in <module>
  File "C:\Program Files\Python311\Lib\site-packages\streamlit\__init__.py", line 55, in <module>
    from streamlit.delta_generator import DeltaGenerator as _DeltaGenerator
  File "C:\Program Files\Python311\Lib\site-packages\streamlit\delta_generator.py", line 43, in <module>
    from streamlit.elements.arrow_altair import ArrowAltairMixin
  File "C:\Program Files\Python311\Lib\site-packages\streamlit\elements\arrow_altair.py", line 36, in <module>
    from altair.vegalite.v4.api import Chart
ModuleNotFoundError: No module named 'altair.vegalite.v4'

It seems the module 'altair.vegalite.v4' is missing. However, I believe all dependencies were correctly installed.

Could you please help investigate and resolve this issue?

Labels: bug, help wanted, good first issue

Did not find openai_api_key,

I'm running this on Windows 10.
I added the API key in the .env file but I keep getting this error:

ValidationError: 1 validation error for OpenAIEmbeddings
__root__
  Did not find openai_api_key, please add an environment variable OPENAI_API_KEY which contains it, or pass openai_api_key as a named parameter. (type=value_error)

Traceback:
File "C:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "C:\Users\DELL\Desktop\Projects\langchain-ask-pdf-main\app.py", line 55, in <module>
    main()
File "C:\Users\DELL\Desktop\Projects\langchain-ask-pdf-main\app.py", line 37, in main
    embeddings = OpenAIEmbeddings()
    ^^^^^^^^^^^^^^^^^^
File "pydantic\main.py", line 341, in pydantic.main.BaseModel.__init__

Getting the error below:

ValidationError: 1 validation error for OpenAIEmbeddings
__root__
  Did not find openai_api_key, please add an environment variable OPENAI_API_KEY which contains it, or pass openai_api_key as a named parameter. (type=value_error)
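
As the error message itself says, the key has to reach the process either via the OPENAI_API_KEY environment variable or as a named parameter. A minimal sketch of both options, assuming python-dotenv (which is in the project's requirements) and a .env file next to app.py:

import os
from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings

load_dotenv()  # load variables from the .env file into the environment

# option 1: rely on the OPENAI_API_KEY environment variable
embeddings = OpenAIEmbeddings()

# option 2: pass the key explicitly as a named parameter
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))

If the error persists, it is worth checking that the .env file sits in the directory from which streamlit run app.py is executed and that the variable name is spelled OPENAI_API_KEY.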

Get a longer answer

Hello,

Is it possible to get longer answers?
Because the answers are too vague.

Question: can Langchain draw on a larger knowledge base so that it can provide more relevant answers? Let me explain: I'd like the model to go beyond a simple summary of the content and offer its own understanding, so that its answer can guide us in understanding the document. Of course, LLMs can hallucinate, so we need to be aware of this.

Thanks.

Feature: Autoload a PDF file by URL

Autoload the PDF instead of requiring a local file upload.

With this feature, we remove the file input.

Explanation:

  • The BytesIO class (from the io module) and the PdfReader class from the PyPDF2 library are imported to handle the PDF file.
  • The CharacterTextSplitter, OpenAIEmbeddings, FAISS, load_qa_chain, OpenAI, and get_openai_callback components are imported from the langchain library. They are used to build the question-answering system.
  • The OpenAI API key is expected to be available as the OPENAI_API_KEY environment variable.
  • The st.set_page_config and st.header functions from the streamlit library are used to set the title and header of the web app.
  • The PDF file is downloaded from a URL with requests and its contents are stored in a BytesIO object.
  • The text content of the PDF is extracted page by page with the extract_text method and concatenated into a single string.
  • The CharacterTextSplitter class is used to split the text into smaller chunks. These chunks are used to build a knowledge base for the question-answering system.
  • The OpenAIEmbeddings class is used to generate embeddings for the text chunks. These embeddings are used to perform similarity searches when answering questions.
  • The st.text_input function is used to prompt the user to ask a question about the PDF file.
  • If the user enters a question, a similarity search is performed with the knowledge_base.similarity_search method. A question-answering chain is created with load_qa_chain, and the retrieved documents are passed to it.
  • The run method of the question-answering chain is called with the input documents and the user question as arguments. The result is stored in the response variable.
  • The result is displayed with the st.write function.

The code: app.py

from io import BytesIO
import requests
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback

def main():
    st.set_page_config(page_title="Ask your PDF")
    st.header("Ask your PDF 💬")
    
    # load the PDF file
    url = 'https://www.example.com/example.pdf'
    response = requests.get(url)
    pdf = BytesIO(response.content)
    
    # extract the text
    pdf_reader = PdfReader(pdf)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()

    # split into chunks
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
      
    # create embeddings
    embeddings = OpenAIEmbeddings()
    knowledge_base = FAISS.from_texts(chunks, embeddings)
      
    # show user input
    user_question = st.text_input("Ask a question about your PDF:")
    if user_question:
        docs = knowledge_base.similarity_search(user_question)
        
        llm = OpenAI()
        chain = load_qa_chain(llm, chain_type="stuff")
        with get_openai_callback() as cb:
            response = chain.run(input_documents=docs, question=user_question)
            print(cb)
           
        st.write(response)
    

if __name__ == '__main__':
    main()

Run & Test

streamlit run .\app.py

What color is the sky? - How to add custom templates to ConversationalRetrievalChain?

This is actually important to add. If you ask about an unrelated topic, like "What color is the sky", it will still answer based on the document...

I tried combine_docs_chain_kwargs={"prompt": prompt} with no success. Also, I can't make the code work with LLMChain, like:

chat_prompt = ChatPromptTemplate.from_messages(
    [system_message_prompt, human_message_prompt]
)

chain = LLMChain(llm=chat, prompt=chat_prompt)
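
For reference, here is a minimal sketch of how a custom prompt is typically passed through combine_docs_chain_kwargs in langchain 0.0.x. This is not a confirmed fix for the issue above, and the placeholder texts stand in for the real PDF chunks; the prompt for the default "stuff" combine-docs chain needs context and question input variables:

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.vectorstores import FAISS

# placeholder chunks; in the real app these come from the PDF text splitter
knowledge_base = FAISS.from_texts(
    ["placeholder PDF chunk 1", "placeholder PDF chunk 2"],
    OpenAIEmbeddings(),
)

# custom prompt that tells the model to refuse off-topic questions
qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer only from the context below. If the question is not about the "
        "document, say that you don't know.\n\n{context}\n\nQuestion: {question}"
    ),
)

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(),
    retriever=knowledge_base.as_retriever(),
    combine_docs_chain_kwargs={"prompt": qa_prompt},
)

result = chain({"question": "What color is the sky?", "chat_history": []})
print(result["answer"])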

Deploy `langchain-ask-pdf` as APIs locally/on cloud using `langchain-serve`

Repo - langchain-serve.

  • Exposes APIs from function definitions locally as well as on the cloud.
  • Very few code changes are needed; the ease of development remains the same as local.
  • Supports both REST & WebSocket endpoints.
  • Serverless/autoscaling endpoints with automatic TLS certs on the cloud.
  • Real-time streaming, human-in-the-loop support - which is crucial for chatbots.
  • We can extend the simple existing app pdf-qna on langchain-serve.

Disclaimer: I'm the primary author of langchain-serve.

hi

Hi. Please help me: how can I create a custom model from many PDFs in the Persian language? Thank you.

requirements.txt modification needed

Thank you, Alejandro!
I got the chatbot to work on my Windows 11 PC with the following requirements.txt:

langchain==0.0.166
PyPDF2==3.0.1
python-dotenv==1.0.0
streamlit==1.18.1
faiss-cpu==1.7.4
altair<5
openapi==1.1.0
tiktoken

AttributeError: module 'openai' has no attribute 'error'

File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
exec(code, module.dict)
File "/Users/computer/Desktop/multi_doc/langchain-ask-pdf/app.py", line 55, in
main()
File "/Users/computer/Desktop/multi_doc/langchain-ask-pdf/app.py", line 38, in main
knowledge_base = FAISS.from_texts(chunks, embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/vectorstores/faiss.py", line 384, in from_texts
embeddings = embedding.embed_documents(texts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 234, in embed_documents
return self._get_len_safe_embeddings(texts, engine=self.deployment)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 175, in _get_len_safe_embeddings
response = embed_with_retry(
^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 57, in embed_with_retry
retry_decorator = _create_retry_decorator(embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 45, in _create_retry_decorator
retry_if_exception_type(openai.error.Timeout)

Bug: PDF images vs text

ERROR

[screenshot of the error attached in the original issue]

Reproduce

  1. Upload a PDF that is image-based (e.g. a scanned employee handbook).
  2. The error will show.

Possible Solution

Wrap the PDF reading in try/except; if the PDF cannot be read, give the user an error message instead of a traceback.
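
A rough sketch of that suggestion, assuming the PyPDF2-based extraction used in app.py (the helper name extract_pdf_text is made up for illustration, and the exact exception types are an assumption):

import streamlit as st
from PyPDF2 import PdfReader

def extract_pdf_text(pdf_file):
    # return the extracted text, or None (after showing an error) if the PDF is unusable
    try:
        pdf_reader = PdfReader(pdf_file)
        text = "".join(page.extract_text() or "" for page in pdf_reader.pages)
    except Exception:
        st.error("Could not read this PDF. Please check that the file is a valid PDF.")
        return None
    if not text.strip():
        st.error("No selectable text was found. This PDF appears to be image-based (scanned); run OCR on it first.")
        return None
    return text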

getting module not found

Even though I installed the module, it shows that it's not found.
streamlit run app.py
Traceback (most recent call last):
File "/home/thakuradi/.local/bin/streamlit", line 5, in
from streamlit.web.cli import main
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/init.py", line 55, in
from streamlit.delta_generator import DeltaGenerator as _DeltaGenerator
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/delta_generator.py", line 36, in
from streamlit import config, cursor, env_util, logger, runtime, type_util, util
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/cursor.py", line 18, in
from streamlit.runtime.scriptrunner import get_script_run_ctx
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/init.py", line 16, in
from streamlit.runtime.runtime import Runtime as Runtime
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/runtime.py", line 29, in
from streamlit.runtime.app_session import AppSession
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/app_session.py", line 35, in
from streamlit.runtime import caching, legacy_caching
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/caching/init.py", line 21, in
from streamlit.runtime.caching.cache_data_api import (
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_data_api.py", line 37, in
from streamlit.runtime.caching import cache_utils
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 31, in
from streamlit import type_util
File "/home/thakuradi/.local/lib/python3.11/site-packages/streamlit/type_util.py", line 42, in
from pandas import DataFrame, Index, MultiIndex, Series
File "/home/thakuradi/.local/lib/python3.11/site-packages/pandas/init.py", line 22, in
from pandas.compat import is_numpy_dev as _is_numpy_dev # pyright: ignore # noqa:F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/thakuradi/.local/lib/python3.11/site-packages/pandas/compat/init.py", line 24, in
import pandas.compat.compressors
File "/home/thakuradi/.local/lib/python3.11/site-packages/pandas/compat/compressors.py", line 7, in
import bz2
File "/usr/local/lib/python3.11/bz2.py", line 17, in
from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'
