
llmsherpa's People

Contributors

ansukla, jsv4, kiran-nlmatics, lebigot, moshewe, sonalshad


llmsherpa's Issues

APIConnectionError: Connection error.


LocalProtocolError                        Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/httpcore/_exceptions.py in map_exceptions(map)
      9     try:
---> 10         yield
     11     except Exception as exc:  # noqa: PIE786

106 frames

LocalProtocolError: Illegal header value b'Bearer '

(The same LocalProtocolError: Illegal header value b'Bearer ' is re-raised and re-wrapped through several retry layers; the repeated tracebacks are omitted here. Finally:)

APIConnectionError                        Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/openai/_base_client.py in _request(self, cast_to, options, remaining_retries, stream, stream_cls)
    903             )
    904
--> 905         raise APIConnectionError(request=request) from err
    906
    907         log.debug(

APIConnectionError: Connection error.
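
For context: httpx/httpcore raises `Illegal header value b'Bearer '` when the Authorization header is built from an empty API key, so the OpenAI client never reaches the network. A minimal sketch of the usual fix, assuming the v1 `openai` client:

```python
import os
from openai import OpenAI

# The empty "Bearer " header means OPENAI_API_KEY was unset or empty.
# Set it before the client is constructed (the key shown is a placeholder).
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, not a real key
client = OpenAI()  # reads OPENAI_API_KEY from the environment
```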

Connection Error

I am using the latest llmsherpa to chunk a PDF, but I always get this SSLCertVerificationError. I am using Python 3.12 with this simple code:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

It looks like a known issue that could be resolved by disabling the SSL check, but I could not find any way to handle it, as the connections are made inside LayoutPDFReader with no handle to disable SSL verification. Please guide me.

Error Details:
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)

MaxRetryError: HTTPSConnectionPool(host='readers.llmsherpa.com', port=443): Max retries exceeded with url: /api/document/developer/parseDocument?renderFormat=all (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')))
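
One workaround sketch (an assumption on my part, not an official knob of the library): point Python's SSL stack at the certifi CA bundle before LayoutPDFReader opens its connection. This often resolves "unable to get local issuer certificate" on macOS or conda installs.

```python
import os
import certifi

# Point Python's default SSL context at certifi's CA bundle before
# LayoutPDFReader opens its urllib3 connection.
os.environ["SSL_CERT_FILE"] = certifi.where()
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()

from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # any PDF path or URL
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)
```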

Source Citations

Trying to sort out how best to set up llmsherpa to cooperate with a llama-index query engine system where I need to retrieve source citations. I was having a hard time when using it as a loader; I'm wondering if there's a way to implement it as a parser/sentence splitter.

Would appreciate any help you can offer - the results after swapping out PDFMiner were undeniable: an instant 80% boost in accuracy and understanding.
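
One way to get citations is to skip the loader and build llama-index Documents from llmsherpa chunks directly, carrying page metadata along. A sketch, assuming llmsherpa blocks expose page_idx and a recent llama-index import layout (paths vary by version); the file name is illustrative:

```python
from llama_index.core import Document, VectorStoreIndex

docs = [
    Document(
        text=chunk.to_context_text(),
        metadata={"page": chunk.page_idx, "file_name": "my.pdf"},  # illustrative metadata
    )
    for chunk in doc.chunks()  # doc from LayoutPDFReader.read_pdf
]
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()  # responses carry source_nodes with this metadata
```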

Nodes and llama index

When using the llama-index example:

from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))

are the chunks converted to nodes? I'm trying to use Pinecone, but it requires documents or nodes for ingestion, and this happens after the index is created.
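
For what it's worth, index.insert() parses each Document into nodes internally. To hand a Pinecone-backed index nodes up front instead, one sketch, assuming recent llama-index import paths:

```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.schema import TextNode

# Build nodes directly from llmsherpa chunks.
nodes = [TextNode(text=chunk.to_context_text()) for chunk in doc.chunks()]

# With a Pinecone-backed vector store (construction omitted here):
# storage_context = StorageContext.from_defaults(vector_store=pinecone_vector_store)
# index = VectorStoreIndex(nodes, storage_context=storage_context)
index = VectorStoreIndex(nodes)  # in-memory variant for illustration
```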

KeyError: result

Hi, I am facing the same error. I have 5 PDF files; the code works for 4 of them but not the fifth. I tried running it a few times like @jalkestrup, but it still throws the error:
KeyError: 'result'
I would really appreciate any support from the authors/community on this!

P.S. Reopening since the issue wasn't resolved. :)
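
A diagnostic sketch for this class of failure (it relies on the private _parse_pdf helper and the (name, bytes, mime) tuple that read_pdf builds internally, so it may break between versions): inspect the parser's raw reply instead of hitting the KeyError.

```python
from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)

with open("failing.pdf", "rb") as f:  # "failing.pdf" is illustrative
    pdf_file = ("failing.pdf", f.read(), "application/pdf")

parser_response = pdf_reader._parse_pdf(pdf_file)  # private helper; may change
print(parser_response.status)
print(parser_response.data[:500].decode("utf-8", errors="replace"))  # the server's actual error
```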

LayoutPDFReader_Demo.ipynb test: error confirmation request, plus a request for code to translate the fine-tuning section via GPT

I am reporting a malfunction found while testing LayoutPDFReader_Demo.ipynb.

1. PDF download from an external URL fails

pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"

```
UnboundLocalError Traceback (most recent call last)
in <cell line: 6>()
4 pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
5 pdf_reader = LayoutPDFReader(llmsherpa_api_url)
----> 6 doc = pdf_reader.read_pdf(pdf_url)
```

2. Reading the file locally (succeeds)
Downloaded pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" and uploaded "1910.13461.pdf" to the "downloads" folder.

3. Question: what code should I use to ask GPT to translate the fine-tuning section text into another language? (A sketch follows the snippet below.)

```
from IPython.core.display import display, HTML
selected_section = None
# find a section in the document by title
for section in doc.sections():
    if section.title == '3 Fine-tuning BART':
        selected_section = section
        break
HTML(selected_section.to_html(include_children=True, recurse=True))
```
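
For question 3, a sketch reusing the notebook's OpenAI().complete pattern (Korean is an illustrative target language; the import path varies across llama-index versions):

```python
from llama_index.llms import OpenAI  # newer versions: llama_index.llms.openai

context = selected_section.to_text(include_children=True, recurse=True)
resp = OpenAI().complete(f"Translate the following text into Korean:\n{context}")
print(resp.text)
```

Note that the error in item 4 below is the same empty "Bearer " header discussed in the first issue above: set OPENAI_API_KEY before creating the client.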

4. Custom summary of this text using a prompt (error):

resp = OpenAI().complete(f"read this text and answer question: {question}:\n{context}")

LocalProtocolError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/httpcore/_exceptions.py in map_exceptions(map)
9 try:
---> 10 yield
11 except Exception as exc: # noqa: PIE786

I also tried asking GPT about the bug, but couldn't find a suitable fix, so I'm leaving my question here.

PDF size limit - apologies if I caused problems

Apologies if I caused any problems recently.
I uploaded a huge, complicated PDF file (over 200 MB) to test the module.
Is there a limit on file size that we need to observe to avoid causing issues?

How to add custom parser API URL

Looking at the documentation, to parse a PDF one has to pass a parser URL, which is basically an API that does all the magic (chunks, sections, paragraphs, etc.).
I was wondering where this API is hosted and whether we can self-host it for use in a local environment.

In the doc here it is written: "Use customer url for your private instance here".
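
The parser backend has since been open-sourced as nlm-ingestor (the ghcr.io/nlmatics/nlm-ingestor image referenced elsewhere in these issues), so it can be self-hosted. A sketch, with the docker commands in comments:

```python
# Start the parser locally first, e.g.:
#   docker pull ghcr.io/nlmatics/nlm-ingestor:latest
#   docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest
# then point LayoutPDFReader at the local instance over plain HTTP:
from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
```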

Feature Request - Splitting Bounding Boxes Across Pages

I am not sure if this is being handled. During my testing I found that the box coordinates do not take page intersections into account.

I am writing to propose a feature enhancement for your project, specifically regarding the current handling of bounding boxes (bbox) in the context of PDF generation.

  1. Currently, a single bbox is produced for a chunk of text that straddles a page boundary. While this approach is effective, I would like to suggest an enhancement that splits the bbox across the pages it spans, providing more granularity in representing the layout of text across pages.

  2. The second feature is to add the PDF page width and height (mediabox, cropbox, or rect) for every page in the API response. This would make the bbox much more usable for adding an annotation/highlight layer.

(screenshot: bbox intersection)

Timeout Error

When I try to load a PDF that is 541 pages long (~9.5 MB), I get the following error message:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I assume it's due to the large file size? I don't have the same issue loading smaller files.
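
A client-side workaround sketch, assuming the failure is the server dropping very large uploads: split the PDF with pypdf and parse the parts separately.

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("big.pdf")  # illustrative 541-page file
part_size = 100  # pages per part; tune to taste

for start in range(0, len(reader.pages), part_size):
    writer = PdfWriter()
    for i in range(start, min(start + part_size, len(reader.pages))):
        writer.add_page(reader.pages[i])
    with open(f"part_{start // part_size}.pdf", "wb") as out:
        writer.write(out)
# parse each part_*.pdf with LayoutPDFReader and merge the chunks afterwards
```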

Bug in API function: Repeated content in HTML when trying to convert PDF to HTML.

First of all, I would like to appreciate the great work, you have done to convert PDF to well tagged HTML pages. Many Thanks for this contribution.

The issue I faced is that I am getting repeated pages while converting PDF to HTML.
To recreate the issue, use the following code.

Use this file to get code with indentations.
pdf_to_html_llmsherpa.txt

Actual code --

def convert_pdf_to_html(pdf_file, output_html):
    try:
        llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
        pdf_reader = LayoutPDFReader(llmsherpa_api_url)
        doc = pdf_reader.read_pdf(pdf_file)
        print(doc.to_html())

        # Write to a html file
        with open(output_html, 'w', encoding='utf-8') as html_file:
            html_file.write(f'{doc.to_html()}')

        print(f"Conversion successful. HTML file saved to {output_html}")
    except Exception as e:
        print(f"Error during conversion: {e}")

pdf_file_path = 'pdf_upload/AbanPearlPteLtd310322.pdf'
output_html_path = 'pdf_upload/AbanPearlPteLtd310322_modified_2.html'

convert_pdf_to_html(pdf_file_path, output_html_path)

AbanPearlPteLtd310322.pdf

How can I check 'block_class'?

First of all, I want to say thanks for your great work.
I was surprised that this API can recognize Korean.
Very impressive.

Everything is perfect except for a specific form of table.
I want to remove this type of table from the LayoutPDFReader result,
but I cannot isolate the specific tables that LayoutPDFReader does not recognize properly.

I noticed that I can detect them with 'block_class',
so I want to know what 'block_class' is and how to check it.

Thanks, and sorry for my poor English.
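
For reference, 'block_class' comes from the parser's raw JSON, where (as far as I can tell) it names the font/style class assigned to a block. Each llmsherpa block keeps that raw JSON on block_json, so a small inspection sketch:

```python
# Print the tag, block_class, and a text preview for every chunk, so the
# classes used by the misrecognized tables can be identified and filtered.
for chunk in doc.chunks():  # doc from LayoutPDFReader.read_pdf
    block_class = chunk.block_json.get("block_class")
    print(chunk.tag, block_class, chunk.to_text()[:60])
```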

Failed to establish a new connection: [Errno 111] Connection refused

I used the test_llmsherpa_api.ipynb file but got a connection error.

WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8910>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8a90>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8c70>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes
---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/urllib3/connection.py](https://localhost:8080/#) in _new_conn(self)
    202         try:
--> 203             sock = connection.create_connection(
    204                 (self._dns_host, self.port),

20 frames
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

NewConnectionError                        Traceback (most recent call last)
NewConnectionError: <urllib3.connection.HTTPConnection object at 0x78a6541b8e20>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

MaxRetryError                             Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py](https://localhost:8080/#) in increment(self, method, url, response, error, _pool, _stacktrace)
    513         if new_retry.is_exhausted():
    514             reason = error or ResponseError(cause)
--> 515             raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    516 
    517         log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPConnectionPool(host='localhost', port=5001): Max retries exceeded with url: /api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x78a6541b8e20>: Failed to establish a new connection: [Errno 111] Connection refused'))


Skips first few lines in PDF.

I'm running the server in docker:

image: ghcr.io/nlmatics/nlm-ingestor:latest

I've only tested with one 300-page PDF, and it seems to skip the first couple of lines. It doesn't seem like a major issue, but it makes me wonder if anything else is being skipped. The behavior is the same whether I convert to text, use sections, or convert to HTML.

What might be the cause?

list of pdfs as input

pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"

Does it support a list of PDFs, or only a single one? (A loop sketch follows.)
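
read_pdf takes a single path or URL per call; a plain loop handles a list, as in this sketch:

```python
pdf_urls = [
    "https://arxiv.org/pdf/1910.13461.pdf",
    # ...more paths or URLs
]
docs = [pdf_reader.read_pdf(u) for u in pdf_urls]  # pdf_reader: a LayoutPDFReader
```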

Add more metadata info - page label and filename?

I'm trying to use it as a PDF reader for llama-index, which usually also includes details like the page label with each document. Is there any way to add that info too? How would I go about customizing it to do that myself?

Parsing bytes-like objects directly

Hi,

I've been experimenting with llmsherpa in a small Streamlit app to which I upload PDFs, and I found that I can't use the uploaded files (which are bytes-like objects) directly.

I have code to handle that directly, though I wonder if this feature is needed by anyone else... If there's interest I can open a PR.
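
For reference, read_pdf already accepts raw bytes through its contents= parameter alongside a nominal file name (see the JSON parse issue further down), which covers the Streamlit case. A sketch:

```python
import streamlit as st
from llmsherpa.readers import LayoutPDFReader

pdf_reader = LayoutPDFReader("https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    # pass the upload's bytes directly; the name serves only as a label
    doc = pdf_reader.read_pdf(uploaded.name, contents=uploaded.getvalue())
```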

InternalServerError: Error code: 503

from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
query_engine = index.as_query_engine()

InternalServerError Traceback (most recent call last)
Input In [17], in <cell line: 5>()
4 index = VectorStoreIndex([])
5 for chunk in doc.chunks():
----> 6 index.insert(Document(text=chunk.to_context_text(), extra_info={'embed_model':'text-embedding-V2'}))
7 query_engine = index.as_query_engine()

InternalServerError: Error code: 503 - {'error': {'message': 'There are no available channels for model text embedding ada-002 under the current group VIP (request id: 20240108173434543639222FMD9UnZh)', 'type': 'new_api_error'}}

Question: how can I change the embedding model parameter?
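
If the 503 means this endpoint lacks the default text-embedding-ada-002 model, one sketch is to configure a different embedding model before building the index (import paths and the Settings object vary across llama-index versions; the model name is illustrative):

```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Use an embedding model the endpoint actually serves.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
```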

Bug in load_data when using full path

This code would fail:

full_path = 'C:\\temp\\A\\test.pdf'
documents = pdf_loader.load_data(full_path)

However, if relative path is given it works fine.

It looks like the issue is in file_reader.py:63
is_url = urlparse(path_or_url).scheme != ""

In case of full path the scheme will be the letter of the drive (C in this case) which would make it treat it as a URL instead of a path.
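
A sketch of the fix implied above: treat a single-letter scheme (a Windows drive letter) as a file path rather than a URL.

```python
from urllib.parse import urlparse

def is_url(path_or_url: str) -> bool:
    scheme = urlparse(path_or_url).scheme
    return len(scheme) > 1  # 'c' from 'C:\\temp\\...' is a drive letter, not a scheme

assert not is_url('C:\\temp\\A\\test.pdf')
assert is_url('https://arxiv.org/pdf/1910.13461.pdf')
```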

pdf problem?

I scanned a document, and the PDF allows copying and pasting the text, but I get this error with LayoutPDFReader:

     39 parser_response = self._parse_pdf(pdf_file)
     40 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 41 blocks = response_json['return_dict']['result']['blocks']
     42 return Document(blocks)

KeyError: 'result'

Is it because of the format? Is there anything I could do to turn the PDF into a more readable format for your API?

Not able to get all the subsection names inside a section

Hi, I am using the attached PDF for testing. There is no whitespace between the subsection title and the subsection content, and the reader is not able to extract all the subsection titles present within a section. I tried a different PDF where whitespace is present, and it worked pretty well. Could you please guide me on how to extract a specific subsection title along with its corresponding content?
RWXcE3.pdf.pdf

just want to get the paragraph information

The project's ability to obtain text paragraph information from PDF files is exciting. If I just want to get the paragraph information inside the PDF, without using an LLM, can I do it without calling the API? Or can you recommend some related techniques and repositories?

when trying to load multiple documents with joblib, get error cannot pickle

I am trying to parallelize ingestion of multiple, locally-stored PDFs, in my vectorstore.

PicklingError: Could not pickle the task to send it to the workers.

Is this because of the API call to an external server for every PDF I am loading with llmsherpa?
What would be a workaround? Making this async (and if so, how)?

I think this is important for production.

thank you
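
Since the work is network-bound (one API call per PDF), a thread pool sidesteps the pickling requirement entirely; a workaround sketch:

```python
from concurrent.futures import ThreadPoolExecutor
from llmsherpa.readers import LayoutPDFReader

pdf_reader = LayoutPDFReader("https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all")
pdf_paths = ["a.pdf", "b.pdf", "c.pdf"]  # illustrative local paths

# Threads share the reader; nothing has to be pickled, unlike joblib's
# default process backend.
with ThreadPoolExecutor(max_workers=4) as pool:
    docs = list(pool.map(pdf_reader.read_pdf, pdf_paths))
```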

timeout error?

I am getting this error while reading a local PDF (quite a big one, though). FWIW, the Unstructured API managed it well.


File /opt/homebrew/lib/python3.11/site-packages/llmsherpa/readers/file_reader.py:41, in LayoutPDFReader.read_pdf(self, path_or_url)
     39 parser_response = self._parse_pdf(pdf_file)
     40 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 41 blocks = response_json['return_dict']['result']['blocks']
     42 return Document(blocks)

KeyError: 'result'

Azure MarketPlace cannot provide stable service

We have recently deployed the LLMSherpa service through Azure Marketplace and are experiencing an intermittent issue. Specifically, the API suddenly stops responding to any requests, failing to return results. This issue persists even after restarting both the Azure machine and the service itself. However, we have noticed that the service spontaneously resumes normal operation the following day. Additionally, we are unable to log into the deployed machine to check if the service is functioning normally, and we cannot access any logs.

Has anyone else encountered similar issues with services deployed via Azure Marketplace? If so, could you please share any insights or solutions to this problem? Additionally, we are interested in knowing if there are alternative, more stable API services available that we could consider.

Thank you for your assistance.

Bug in API function: Incorrect behavior with repeated sections.

The issue arises when extracting HTML content from a document using the .to_html() method after reading a PDF with

doc = pdf_reader.read_pdf(pdf_url)
doc.to_html(include_children=True, recurse=True)

When iterating through the sections, the loop processes both parent and child sections, causing repeated content in the HTML output and unintended duplication.

Here is the relevant code:

    def to_html(self):
        """
        Returns html for the document by iterating through all the sections
        """
        html_str = "<html>"
        for section in self.sections():
            html_str = html_str + section.to_html(include_children=True, recurse=True)
        html_str = html_str + "</html>"
        return html_str
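
One possible fix sketch, assuming top-level sections carry level 0 in the parser output (an assumption, not verified against the library): render only top-level sections and let recurse=True bring in the children, so nothing is emitted twice.

```python
def to_html(self):
    """
    Returns html for the document by iterating only top-level sections;
    recurse=True then renders each nested section exactly once.
    """
    html_str = "<html>"
    for section in self.sections():
        if section.level == 0:  # assumption: nested sections have level > 0
            html_str = html_str + section.to_html(include_children=True, recurse=True)
    html_str = html_str + "</html>"
    return html_str
```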

Add async API

It would be great if read_pdf supported an async variant so we could await the result. This would allow us to easily perform concurrent work while waiting the multiple seconds the API can take to respond.
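
Until then, a wrapper sketch that offloads the blocking call to a worker thread (Python 3.9+):

```python
import asyncio
from llmsherpa.readers import LayoutPDFReader

pdf_reader = LayoutPDFReader("https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all")

async def read_pdf_async(path_or_url):
    # run the blocking read_pdf in a worker thread so the event loop stays free
    return await asyncio.to_thread(pdf_reader.read_pdf, path_or_url)

async def main():
    paths = ["a.pdf", "b.pdf"]  # illustrative
    return await asyncio.gather(*(read_pdf_async(p) for p in paths))
```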

llmsherpa url error

Hi! I'm trying to parse a pdf like the example:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

But I'm receiving the error:
MaxRetryError: HTTPSConnectionPool(host='arxiv.org', port=443): Max retries exceeded with url: /pdf/1910.13461.pdf (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x000001BEE59F6410>: Failed to resolve 'arxiv.org' ([Errno 11002] getaddrinfo failed)"))

I've also tried with local files, so I think my problem is related to the API. Does anyone know how to solve this?

Feature Request - API Call Parameters to set chunk minimum and maximum length.

Hello, this looks great so far. I would appreciate the ability to include parameters in the API call that specify both a minimum and a maximum character length for the resulting chunks (a client-side sketch follows this list):

  • Minimum chunk size would look across the resulting chunk objects and do simple concatenation until they exceed some value for total text length.
  • Maximum chunk size would split chunk text into multiple segments while preserving the title/section smart labeling you already do.
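
Until the API supports this, a client-side sketch of the minimum-size half (maximum-size splitting omitted for brevity):

```python
def merge_small_chunks(chunks, min_chars=512):
    """Concatenate consecutive chunk texts until each merged piece
    reaches min_chars."""
    merged, buf = [], ""
    for chunk in chunks:
        buf = (buf + "\n" + chunk.to_context_text()).strip()
        if len(buf) >= min_chars:
            merged.append(buf)
            buf = ""
    if buf:  # flush the remainder
        merged.append(buf)
    return merged

texts = merge_small_chunks(doc.chunks())  # doc from LayoutPDFReader.read_pdf
```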

ImportError: cannot import name 'AsyncAzureOpenAI' from 'openai'

This error was generated using your code example.

Traceback (most recent call last):
File "C:\Users\x\wc-chat-pdf-py-willis\layout-pdf-reader.py", line 2, in <module>
from llama_index.readers.schema.base import Document
File "C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\llama_index\__init__.py", line 17, in <module>
from llama_index.embeddings.langchain import LangchainEmbedding
File "C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\llama_index\embeddings\__init__.py", line 7, in <module>
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
File "C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\llama_index\embeddings\azure_openai.py", line 3, in <module>
from openai import AsyncAzureOpenAI, AzureOpenAI
ImportError: cannot import name 'AsyncAzureOpenAI' from 'openai' (C:\Users\x\AppData\Local\Programs\Python\Python311\Lib\site-packages\openai\__init__.py)

Missing Urllib3 Dependency

Still experimenting with the library, but it looks great so far. It looks like urllib3 is missing from the required dependencies. I opened a PR #1 to add this into setup.py

Getting a JSON parse exception when calling LayoutPDFReader.read_pdf() with a None path and the `contents=` parameter.

Steps taken:

from llmsherpa.readers import LayoutPDFReader
from pathlib import Path

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
parser = LayoutPDFReader(llmsherpa_api_url)
path = Path('yp') / 'tests' / 'content' / 'Ambrx EX-2.1.pdf'
with open(path, 'rb') as f:
    content = f.read()
parser.read_pdf(None, content)

resulting stack trace:

Traceback (most recent call last):
  File "/home/mboyd/.pycharm_helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
           ^^^^^^
  File "<input>", line 1, in <module>
  File "/home/mboyd/.virtualenvs/yp-demo/lib/python3.12/site-packages/llmsherpa/readers/file_reader.py", line 72, in read_pdf
    response_json = json.loads(parser_response.data.decode("utf-8"))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

However, parser.read_pdf('', contents=content) DOES successfully parse, because an empty string evaluates to false and is handled cleanly in _parse_pdf(), unlike None. None would be the normal Pythonic way of specifying no value, however.

LayoutPDFReader._parse_pdf returns error when pdf contains empty pages

I tried processing a PDF file using the LayoutPDFReader.read_pdf() method, but got a KeyError for response_json['return_dict']['result']['blocks'], since the response did not contain results because there was an error. (On a side note: it would be nice to have a specific error in this case instead of a KeyError, clearly stating that the file could not be processed and why.)

I split my PDF into pages and processed each page separately to understand what the issue was. It turns out the error occurred every time an empty page was processed. I am not sure whether this is the case for empty pages in all types of PDFs or just some (there are small differences between text PDFs depending on how they were created). It only occurred on one of the PDFs I was processing, but it was also the only PDF with empty pages...

Better: do not fail processing of a whole document if it has one empty page; simply skip that page.

Langchain integration

Hi, thanks for this amazing lib. Are there any plans for LangChain or OpenAI Assistants API integration?
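
Pending any official integration, a manual bridge sketch (it assumes the langchain_core Document class and an already-parsed llmsherpa doc):

```python
from langchain_core.documents import Document as LCDocument

lc_docs = [
    LCDocument(page_content=chunk.to_context_text())
    for chunk in doc.chunks()
]
# lc_docs can now go into any LangChain vector store or retriever
```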

Inquiry about Open-Sourcing the Entire Project

Hello, @ansukla

I've taken a keen interest in your project and am considering its integration into our product. However, we have concerns about potential issues with the provided API and would ideally prefer to deploy it locally to ensure stability and performance.

With this in mind, I'd like to inquire if there are any plans to open-source the entire project. If so, could you provide an estimated timeline for when this might be available?

Your guidance on this would be greatly appreciated.

Thank you for your time and consideration!

Best regards,

keyerror: result

Running test script in colab:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "dagpenge_LH_merged.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

returns

KeyError Traceback (most recent call last)
in <cell line: 6>()
4 pdf_url = "dagpenge_LH_merged.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
5 pdf_reader = LayoutPDFReader(llmsherpa_api_url)
----> 6 doc = pdf_reader.read_pdf(pdf_url)

/usr/local/lib/python3.10/dist-packages/llmsherpa/readers/file_reader.py in read_pdf(self, path_or_url)
39 parser_response = self._parse_pdf(pdf_file)
40 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 41 blocks = response_json['return_dict']['result']['blocks']
42 return Document(blocks)
43 # def read_file(file_path):

KeyError: 'result'

I often get this error when trying to run the demo script. It also occurred yesterday, but rerunning it a few times "solved" the issue then. It does not now.


Can I get coordinates for each chunk?

First of all, thank you for implementing such a wonderful library.

I tried using it and got some good results.

I thought it would be better if there was coordinate information for each chunk, but do you have any plans to implement it in the future?

I think it would be technically difficult to obtain data that includes coordinate information...

error loading a document

When I load a document of about 800 pages, the connection times out. Is this a common occurrence with huge files?

Parse nodes on a para-point level

Hi,

I'm trying to parse a document which has a lot of points, each of which in turn has sub-points. The goal is to split the text point-wise and parse the points as llama-index nodes.
For Example,
I would like to have this as a single node:

(screenshot: the point with its sub-points, desired as a single node)

However, when I parse and iterate through the chunks (doc.chunks()), the hierarchy for points and sub-points isn't being assigned.

All these chunks are independent and have no relationship with each other other than with the section heading:

(screenshot: the resulting independent chunks)

Based on my understanding, we can probably try the following (a sketch of option 1 appears after this list):

  1. Manually assign the parent node (para) to the 4 sub-points (lists).
  2. Parse the document into nodes at the section level and then use sentence splitters via the llama-index API (might not be optimal).

Kindly let me know if there's any alternatives for this.

Thanks!
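
A sketch of option 1, assuming chunks expose tag and parent as in the reader's block model (the llama-index import path varies by version): group each list item with its parent paragraph so a point and its sub-points become one node.

```python
from llama_index.core.schema import TextNode

nodes, seen = [], set()
for chunk in doc.chunks():
    # climb to the parent paragraph for list items; keep other chunks as-is
    parent = chunk.parent if chunk.tag == "list_item" else chunk
    if id(parent) in seen:
        continue  # this point and its sub-points were already emitted
    seen.add(id(parent))
    nodes.append(TextNode(text=parent.to_text(include_children=True, recurse=True)))
```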

JSONDecodeError

self.pdf_reader = readers.LayoutPDFReader(
    self.api_url,
)
document = self.pdf_reader.read_pdf(final_location)

returns

Traceback (most recent call last):
File "", line 1, in
File "", line 21, in read_document
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/site-packages/llmsherpa/readers/file_reader.py", line 72, in read_pdf
response_json = json.loads(parser_response.data.decode("utf-8"))
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/joncheng/opt/miniconda3/envs/owlbear/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Looks like there's an issue with the handling of JSON decoding errors: the raw decoder error is surfaced instead of a meaningful message about the server's reply.

I just provided a local file path.

read_pdf fails on specific pdf locally, not through hosted api

PDF in question:
JTR.pdf

This API call works great:

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "JTR.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

This local call fails:

llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
pdf_url = "JTR.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

The local version is running from the latest Docker build. Other PDFs work fine. Is there a way to get a better error message? Currently I'm receiving: KeyError: 'return_dict'

I noticed there are other issues open around this error but did not find any matching this case where it works on one and not the other.

I appreciate your time and any insight. Thanks!
