nlm-ingestor's People

Contributors

ansukla, erjanmx, ianschmitz, jamesvillarrubia, jinkjonks, kiran-nlmatics, mgl, michaelfeil, moveyor, pashpashpash


nlm-ingestor's Issues

HTML AND XML INGESTOR

Do you have an example of loading an XML/HTML file? I'm having trouble figuring it out. Great product, though!
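For reference, a minimal sketch of what I've been trying: posting an HTML file straight to the parseDocument endpoint, assuming the same 'file' multipart field that llmsherpa's PDF reader uses (this is a guess, not documented usage):

import requests

url = "http://localhost:5001/api/parseDocument?renderFormat=all"
with open("sample.html", "rb") as f:  # hypothetical local HTML file
    resp = requests.post(url, files={"file": ("sample.html", f, "text/html")})
print(resp.json())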

KeyError: 'return_dict'

My code:

from llmsherpa.readers import LayoutPDFReader
import os, sys
directory_path = "/data/pdf_test/llmsherpa"
sys.path.insert(0, directory_path)
llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all&useNewIndentParser=true"
pdf_url = "https://solutions.weblite.ca/pdfocrx/scansmpl.pdf"
do_ocr = True
if do_ocr:
    llmsherpa_api_url = llmsherpa_api_url + "&applyOcr=yes"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)
print(doc.to_html())

The error:
Traceback (most recent call last):
  File "/data/pdf_test/t.py", line 13, in <module>
    doc = pdf_reader.read_pdf(pdf_url)
  File "/root/anaconda3/envs/nlm/lib/python3.9/site-packages/llmsherpa/readers/file_reader.py", line 73, in read_pdf
    blocks = response_json['return_dict']['result']['blocks']
KeyError: 'return_dict'
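To see why 'return_dict' is missing, the raw response can be printed before indexing into it; a debugging sketch, assuming the same endpoint as above (on failure the service appears to return a status/reason payload instead of a 'return_dict' key):

import json
import urllib3

llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes"

http = urllib3.PoolManager()
with open("scansmpl.pdf", "rb") as f:  # hypothetical local copy of the PDF
    resp = http.request(
        "POST",
        llmsherpa_api_url,
        fields={"file": ("scansmpl.pdf", f.read(), "application/pdf")},
    )

payload = json.loads(resp.data.decode("utf-8"))
# When ingestion fails, the payload looks like {"status": "fail", "reason": "..."}
# rather than containing 'return_dict', which is what triggers the KeyError above.
print(payload)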

Is it recommended to use the new indent parser?

Hi! I'm looking to use the nlm-ingestor + llmsherpa to ingest PDFs.
I saw that there is an option to use a different algorithm via the useNewIndentParser flag. What is the difference from the old parser?
Is it recommended for use in a production app? Is it still experimental or a WIP?

Thanks!

nlm-ingestor is SUPER SLOW

As mentioned here: #37

Chunking even small PDFs (<20 pages) takes longer than 30 seconds! This is a huge problem in any production environment. Why is this happening?

Suggestions for Fast Production Server

I have instantiated my own nlm-ingestor API service on a dedicated 8 GB Linode instance (for testing purposes) using the provided Docker container.

I have some questions regarding building a fast production server for parsing PDFs. I have code based on the getting started example provided, which sends this file:

https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10q/uber_10q_march_2022.pdf

to the nlm-ingestor API to be parsed and to retrieve the chunks. This process, just for the above file, takes ~30 seconds. This is indeed faster than some other options, but for my use case I need to bring that time down to ~10 seconds. Are there any guidelines or suggestions for improving the speed of the PDF parsing service?

Local only use

Hello,

Thank you for this nice library.

Is there a way to use the nlm-ingestor without an internet connection? It seems to download a tokenizer from OpenAI.
I get the following error:

  File "/opt/conda/envs/DWS-CPU/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_ingestor/ingestion_daemon/__main__.py", line 8, in <module>
    from nlm_ingestor.ingestor import ingestor_api
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_ingestor/ingestor/__init__.py", line 3, in <module>
    from nlm_utils.utils import generate_version
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/__init__.py", line 4, in <module>
    import nlm_utils.model_client
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/model_client/__init__.py", line 1, in <module>
    from .classification import ClassificationClient
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/model_client/classification.py", line 10, in <module>
    from nlm_utils.model_client.flan_t5_client import FlanT5Client
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/model_client/flan_t5_client.py", line 7, in <module>
    from nlm_utils.utils.answer_type import answer_type_map
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/utils/__init__.py", line 8, in <module>
    from .utils import ensure_bool
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/utils/utils.py", line 4, in <module>
    oai_tokenizer = tiktoken.get_encoding(
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/tiktoken/registry.py", line 73, in get_encoding
    enc = Encoding(**constructor())
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/tiktoken_ext/openai_public.py", line 72, in cl100k_base
    mergeable_ranks = load_tiktoken_bpe(
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/tiktoken/load.py", line 147, in load_tiktoken_bpe
    contents = read_file_cached(tiktoken_bpe_file, expected_hash)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/tiktoken/load.py", line 64, in read_file_cached
    contents = read_file(blobpath)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/tiktoken/load.py", line 25, in read_file
    resp = requests.get(blobpath)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe6ded15d30>: Failed to establish a new connection: [Errno -2] Name or service not known'))
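A possible offline workaround, based on tiktoken's cache directory support: fetch the encoding once on a machine with internet access into a directory pointed to by TIKTOKEN_CACHE_DIR, copy that directory to the offline host, and set the same variable there before starting the ingestor (the path below is just an example):

import os

# Run this once on a machine WITH internet access; the directory path is just an example.
os.environ["TIKTOKEN_CACHE_DIR"] = "/opt/tiktoken_cache"

import tiktoken

# Downloads cl100k_base.tiktoken into TIKTOKEN_CACHE_DIR. Copy that directory to the
# offline host and set the same environment variable there before starting nlm-ingestor.
tiktoken.get_encoding("cl100k_base")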

Thank you,
Maxime.

IndexError: list index out of range

Seeing the following error for one of my PDFs:

127.0.0.1 - - [13/Feb/2024 14:58:59] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [13/Feb/2024 14:58:59] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:main:Parsing document: 3f367d70-dccc-47ce-a17d-c6689fcb88d2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:Parsing application/pdf at /tmp/tmpt5dhdwuq.pdf with name 3f367d70-dccc-47ce-a17d-c6689fcb88d2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:using pdf parser
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': False, 'parse_pages': (), 'apply_ocr': False}
INFO:nlm_ingestor.ingestor.pdf_ingestor:Parsing PDF
INFO:nlm_ingestor.ingestor.pdf_ingestor:PDF Parsing finished in 45.7038ms on workspace
processing page: 1 Number of p_tags.... 2
processing page: 4 Number of p_tags.... 4
group buf still has: 1 •
processing blocks in page: 4
error uploading file, stacktrace: Traceback (most recent call last):
  File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
    title_page_fonts = top_pages_info(parsed_doc)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 254, in top_pages_info
    temp, title_candidates = retrieve_title_candidates(i)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 237, in retrieve_title_candidates
    for freq in sorted_freq[list(sorted_freq.keys())[key_idx]]:
IndexError: list index out of range

Sorry I'm unable to share the file. Updating the condition in pdf_ingestor.py line 35 to check if len(sorted_freq) is greater than key_idx instead of 0 has allowed me to get past this, but it's not clear to me if that's the best fix or not.
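For illustration, a minimal sketch of the guard described above (names follow retrieve_title_candidates in pdf_ingestor.py; this is not the project's actual patch):

def safe_title_freqs(sorted_freq: dict, key_idx: int) -> list:
    # Only index into sorted_freq when key_idx is a valid position;
    # return an empty list otherwise instead of raising IndexError.
    keys = list(sorted_freq.keys())
    if key_idx >= len(keys):
        return []
    return sorted_freq[keys[key_idx]]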

Missing arm64/v8 architecture

When trying to pull on a M1 MacBook i get the following error:

$ docker pull ghcr.io/nlmatics/nlm-ingestor:latest
latest: Pulling from nlmatics/nlm-ingestor
no matching manifest for linux/arm64/v8 in the manifest list entries

Looking at the action that builds and deploys the image, it looks like platforms isn't included:

with:
  context: .
  push: ${{ github.event_name != 'pull_request' }}
  tags: ${{ steps.meta.outputs.tags }}
  labels: ${{ steps.meta.outputs.labels }}
  cache-from: type=gha
  cache-to: type=gha,mode=max

However it was set above when setting up docker buildx:

platforms: linux/amd64,linux/arm64,linux/arm64/v8,windows/amd64,linux/arm/v7

I'm not entirely sure what platforms does in the context of the setup-buildx-action, but looking at the docs for multi-platform build it looks like if QEMU was setup as an action prior to build, we could simply specify the platforms as part of the build-push-action.

Would you accept a PR to fix this and deploy a multi-platform image? If so what are the platforms you would like to target?

Recommendation for production server

The documentation says that the provided server is good for a development environment. Do you have any examples or suggestions on how to run this in a production environment?

Related to the previous question, I saw that the provided server enqueues the requests one after another.
We have to index a lot of documents, and we plan to do that in parallel. Do you have any recommendations to serve ~100 concurrent requests?

Thanks!

Health endpoint

I deployed this as an ECS service, but ECS needs a way to check the health of the service (to determine if the deployment was successful). Generally, you can do something like this:

{
    "Command": [
        "CMD-SHELL",
        "curl -f http://localhost:5001/ || exit 1"
    ],
    "Interval": 30,
    "Retries": 5,
    "StartPeriod": 20,
    "Timeout": 5
}

but this does not work since the only defined endpoint is a POST to /api/parseDocument that requires an input file. Is there any way of doing this? A GET to / returns 404. What about adding an endpoint to / that just returns 200 to check service status?
Thanks!
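For illustration, a minimal sketch of the kind of health route being requested, assuming the ingestion daemon is a Flask/werkzeug app (the names here are illustrative, not the project's actual code):

from flask import Flask

app = Flask(__name__)

@app.route("/", methods=["GET"])
def health():
    # Return 200 with a small JSON body so ECS/ALB health checks can succeed.
    return {"status": "ok"}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001)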

Encoding error with non-ASCII character.

There is some sort of encoding error with '½'

Happy to submit a PR if someone can point me in the right direction for this conversion.

nlm-ingestor-1  | testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': False, 'parse_pages': (), 'apply_ocr': False}
.....

processing blocks in page:  317
nlm-ingestor-1  | processing blocks in page:  318
nlm-ingestor-1  | ERROR:root:could not convert string to float: '½'
nlm-ingestor-1  | ERROR:root:could not convert string to float: '½'
nlm-ingestor-1  | ERROR:root:could not convert string to float: '½'
nlm-ingestor-1  | ERROR:__main__:error uploading file, stacktrace: Traceback (most recent call last):
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
nlm-ingestor-1  |     ingest_status, return_dict = ingestor_api.ingest_document(
nlm-ingestor-1  |                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
nlm-ingestor-1  |     pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
nlm-ingestor-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
nlm-ingestor-1  |     blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
nlm-ingestor-1  |                                                                             ^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
nlm-ingestor-1  |     parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
nlm-ingestor-1  |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
nlm-ingestor-1  |     self.parse(pages)
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 562, in parse
nlm-ingestor-1  |     self.json_dict = block_renderer.BlockRenderer(self).render_json()
nlm-ingestor-1  |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 347, in render_json
nlm-ingestor-1  |     table_block["left"],
nlm-ingestor-1  |     ~~~~~~~~~~~^^^^^^^^
nlm-ingestor-1  | KeyError: 'left'
nlm-ingestor-1  | Traceback (most recent call last):
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
nlm-ingestor-1  |     ingest_status, return_dict = ingestor_api.ingest_document(
nlm-ingestor-1  |                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
nlm-ingestor-1  |     pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
nlm-ingestor-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
nlm-ingestor-1  |     blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
nlm-ingestor-1  |                                                                             ^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
nlm-ingestor-1  |     parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
nlm-ingestor-1  |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
nlm-ingestor-1  |     self.parse(pages)
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 562, in parse
nlm-ingestor-1  |     self.json_dict = block_renderer.BlockRenderer(self).render_json()
nlm-ingestor-1  |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
nlm-ingestor-1  |   File "/app/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 347, in render_json
nlm-ingestor-1  |     table_block["left"],
nlm-ingestor-1  |     ~~~~~~~~~~~^^^^^^^^
nlm-ingestor-1  | KeyError: 'left'
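For the '½' conversion itself, Python's unicodedata module can evaluate vulgar-fraction characters; a possible direction for the conversion mentioned above (a sketch, not the project's existing code):

import unicodedata

def to_float(token: str) -> float:
    # float() rejects vulgar fractions such as '½', but unicodedata knows their
    # numeric value; note unicodedata.numeric() only accepts a single character.
    try:
        return float(token)
    except ValueError:
        return unicodedata.numeric(token)

print(to_float("½"))  # 0.5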

Docker pull issue

Hi,
I am using this command: docker pull ghcr.io/nlmatics/nlm-ingestor:latest
and I am getting this error:
Error response from daemon: Get "https://ghcr.io/v2/": EOF

thanks

Query: How would it integrate with other LLM APIs?

Hello,

I am looking for a PDF parser/extractor that prepares data for an LLM so it can give me values from the document. Is that possible with this project?

PDF -> nlm-ingestor -> PDF extracted o/p -> LLM

Thanks.
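A rough sketch of that pipeline using llmsherpa's LayoutPDFReader against a local nlm-ingestor, assuming the reader's chunks()/to_context_text() helpers; ask_llm is a placeholder for whatever LLM API you use, and invoice.pdf is a hypothetical input:

from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all"
reader = LayoutPDFReader(llmsherpa_api_url)
doc = reader.read_pdf("invoice.pdf")  # hypothetical input file

# Concatenate layout-aware chunks into one context string for the LLM prompt.
context = "\n".join(chunk.to_context_text() for chunk in doc.chunks())

def ask_llm(prompt: str) -> str:
    # Placeholder: call whichever LLM API you use (OpenAI, Anthropic, a local model, ...).
    raise NotImplementedError

answer = ask_llm(f"Using the document below, extract the values you need.\n\n{context}")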

box_style not being taken into account

I have a PDF page where there is a block, call it 'E', a paragraph positioned at the bottom of the page. Block E is aligned with the others, but it has a different background colour.

BoxStyle(top=532.92,

The top item on the page has 'box_style': BoxStyle(top=77.0, and there are several further paragraphs: A, B, C, D.

Visually the PDF shows as

A
B
C
D
E

However in nlm-ingestor it's imported as

E
A
B
C
D

If I open it in Acrobat, hit Ctrl+A, and copy-paste into Notepad, it also shows as
E
A
B
C
D

If I export from Acrobat as .txt, .docx, etc., it is however exported correctly as

A
B
C
D
E

Is there a way to get this to import based on the visual positioning of the elements? (I don't really know how PDFs are structured, but I guess this is something like
<div id="e" style="top:500" />
<div id="a" style="top:50" />
<div id="b" style="top:150" />
<div id="c" style="top:250" />
<div id="d" style="top:350" />
)

KeyError: 'left'

Seeing the following error for one of my PDFs:

127.0.0.1 - - [13/Feb/2024 14:05:43] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [13/Feb/2024 14:05:43] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:main:Parsing document: 6dbf73f5-9d13-4d29-b330-898e98d755c2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:Parsing application/pdf at /tmp/tmp4n1214e0.pdf with name 6dbf73f5-9d13-4d29-b330-898e98d755c2.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:using pdf parser
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': False, 'parse_pages': (), 'apply_ocr': False}
INFO:nlm_ingestor.ingestor.pdf_ingestor:Parsing PDF
INFO:nlm_ingestor.ingestor.pdf_ingestor:PDF Parsing finished in 109.6956ms on workspace
processing page: 0 Number of p_tags.... 141
processing page: 1 Number of p_tags.... 52
processing blocks in page: 1
error uploading file, stacktrace: Traceback (most recent call last):
  File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 562, in parse
    self.json_dict = block_renderer.BlockRenderer(self).render_json()
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 347, in render_json
    table_block["left"],
KeyError: 'left'

Sorry I'm unable to share the file. Updating the condition in block_renderer.py line 351 to check if "left" is in table_block has allowed me to get past this, but it's not clear to me if that's the best fix or not.

Fails to deploy as a service on Google Cloud Run

I am pointing the Cloud Run service to a fork of this repo with the Dockerfile to build with continuous deployment, with default service settings except a custom Container Port set to 5001. It successfully builds the image, but the deployment step fails with the following message:

Trigger execution failed: source code could not be built or deployed, find more information in [build logs]
Revision 'nlm-ingestor-00012-ntl' is not ready and cannot serve traffic. 
The user-provided container failed to start and listen on the port defined 
provided by the PORT=5001 environment variable.

Is anyone else having this problem? Are there any settings I need to tweak other than setting the container port to 5001?

Update: I got the same deployment step error when I used the pre-built docker image ghcr.io/nlmatics/nlm-ingestor:latest instead of continuous deployment.

Issue with finding tables and sections

EOL Notice (11).pdf
I have a PDF file, attached above. The parser is not able to recognize the tables inside it, and at the same time it shows only two sections, named:
Product Discontinuance Notice - PDN 23_0061 Rev. -
PDN Title:

I have also tried this with applyOcr=yes, and I have also used useNewIndentParser=true.
If anyone knows why this is happening, do help me out.

UnicodeEncodeError when trying to save as HTML

I am using the Docker image and the simple code below to parse the test.pdf file (An overlooked danger of ketogenic diets: Making the case that ketone bodies induce vascular damage by the same mechanisms as glucose):

from llmsherpa.readers import LayoutPDFReader
from rich import print

llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all"
pdf_url = "./test.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

# It works
print(doc.to_text())

# It breaks with `UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 966: character maps to <undefined>`
with open("./test.html", "w") as f:
    f.write(doc.to_html())

Unfortunately, I am getting an error:

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 966: character maps to <undefined>

but surprisingly, the doc.to_text() works normally.

I am on Windows 11, Python 3.12.1. I am attaching the test.pdf.
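On Windows, open() defaults to the locale codec (cp1252), which cannot encode characters such as U+FB02 (the 'fl' ligature); passing an explicit encoding when writing is the usual fix, e.g. adjusting the write from the snippet above:

# Write the HTML with an explicit UTF-8 encoding instead of the Windows default codec.
with open("./test.html", "w", encoding="utf-8") as f:
    f.write(doc.to_html())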

Disable rules/parenthesized header

I found the PARENTHESIZED_HDR regex is causing problems.

E.g., given the text

You can always check the car's manual if you're stuck.

(The manual should be located in the glove box.)

Otherwise please call for help.

Then the line (the manual) is being marked as a header. I disabled the PARENTHESIZED_HDR regex, because it didn't seem useful, but maybe there could be a config file to disable rules like this one?

How to handle PPT format?

Hello, after converting PPT to PDF and using layoutPDFReader for parsing, the results are not satisfactory. How can I directly perform structural analysis on PPT?

Trivially small chunks returned

llmsherpa (from PDFs parsed using the Docker image) seems to be good at keeping tables in single chunks; however, other than that, it seems to be returning trivially small chunks.
These include:

  • Single characters (like a copyright symbol)
  • Small runs of characters like "******************"
  • Single words
  • Single sentences

Unfortunately, items from bulleted and numbered lists each come across as a separate chunk, rather than as a single chunk with all list items included.

I'd expect related items to be in single chunks, and unrelated items to also be merged into larger chunks (the sweet spot seems to be about 1000 tokens). I don't see a way to tell the algorithm what the average chunk size and overlap should be when there are no heuristics that would otherwise determine valid semantic chunk boundaries.
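One client-side workaround is to merge llmsherpa's chunks into larger windows after parsing; a rough sketch, assuming the chunks()/to_context_text() helpers and using whitespace word counts as a stand-in for real tokenization:

def merge_chunks(doc, target_tokens=1000):
    # Greedily merge consecutive llmsherpa chunks until each merged chunk is
    # roughly target_tokens long (token count approximated by whitespace splitting).
    merged, current, current_len = [], [], 0
    for chunk in doc.chunks():
        text = chunk.to_context_text()
        n = len(text.split())
        if current and current_len + n > target_tokens:
            merged.append("\n".join(current))
            current, current_len = [], 0
        current.append(text)
        current_len += n
    if current:
        merged.append("\n".join(current))
    return merged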

Bug

KeyError Traceback (most recent call last)
in <cell line: 15>()
13 llmsherpa_api_url = llmsherpa_api_url + "&applyOcr=yes"
14 pdf_reader = LayoutPDFReader(llmsherpa_api_url)
---> 15 doc = pdf_reader.read_pdf(pdf_url)

/usr/local/lib/python3.10/dist-packages/llmsherpa/readers/file_reader.py in read_pdf(self, path_or_url, contents)
71 parser_response = self._parse_pdf(pdf_file)
72 response_json = json.loads(parser_response.data.decode("utf-8"))
---> 73 blocks = response_json['return_dict']['result']['blocks']
74 return Document(blocks)

KeyError: 'return_dict'

For this URL:

https://s201.q4cdn.com/262069030/files/doc_financials/2023/ar/Walmart-10K-Reports-Optimized.pdf

Lost pages

pythonlearn.pdf

I used a local docker server to parse the above document, which has 239 pages. However, the ingestor only parsed 158 pages, and the remaining content was discarded. Is this a bug?

Here are the logs:

processing page: 140 Number of p_tags.... 178
processing page: 141 Number of p_tags.... 4
processing page: 142 Number of p_tags.... 251
processing page: 143 Number of p_tags.... 303
processing page: 144 Number of p_tags.... 322
processing page: 145 Number of p_tags.... 287
processing page: 146 Number of p_tags.... 330
processing page: 147 Number of p_tags.... 308
processing page: 148 Number of p_tags.... 265
processing page: 149 Number of p_tags.... 312
processing page: 150 Number of p_tags.... 298
processing page: 151 Number of p_tags.... 346
processing page: 152 Number of p_tags.... 412
processing page: 153 Number of p_tags.... 287
processing page: 154 Number of p_tags.... 193
processing page: 155 Number of p_tags.... 5
processing page: 156 192.168.65.1 - - [18/Apr/2024 14:24:54] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -

ZeroDivisionError: float division by zero

Seeing the following error for one of my PDFs:

127.0.0.1 - - [13/Feb/2024 15:51:32] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [13/Feb/2024 15:51:32] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 200 -
INFO:main:Parsing document: c8fc5a1d-e188-4c12-9b17-8367b29a5fb0.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:Parsing application/pdf at /tmp/tmpnl6e2o_d.pdf with name c8fc5a1d-e188-4c12-9b17-8367b29a5fb0.pdf
INFO:nlm_ingestor.ingestor.ingestor_api:using pdf parser
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': False, 'parse_pages': (), 'apply_ocr': False}
INFO:nlm_ingestor.ingestor.pdf_ingestor:Parsing PDF
INFO:nlm_ingestor.ingestor.pdf_ingestor:PDF Parsing finished in 760.0334ms on workspace
processing page: 0 Number of p_tags.... 2
processing page: 1 Number of p_tags.... 112
processing page: 2 Number of p_tags.... 116
processing page: 3 Number of p_tags.... 120
processing page: 4 Number of p_tags.... 93
processing page: 5 Number of p_tags.... 91
processing page: 6 Number of p_tags.... 106
processing page: 7 Number of p_tags.... 107
processing page: 8 Number of p_tags.... 110
processing page: 9 Number of p_tags.... 95
processing page: 10 Number of p_tags.... 113
processing page: 11 Number of p_tags.... 106
G, GWI, -> Portfolios~
mismatch 2 4
processing page: 12 Number of p_tags.... 50
processing page: 13 Number of p_tags.... 2
processing page: 14 Number of p_tags.... 107
processing page: 15 Number of p_tags.... 216
processing page: 16 Number of p_tags.... 107
processing blocks in page: 2
processing blocks in page: 3
processing blocks in page: 3
processing blocks in page: 4
processing blocks in page: 5
processing blocks in page: 5
processing blocks in page: 6
processing blocks in page: 7
processing blocks in page: 8
processing blocks in page: 8
error uploading file, stacktrace: Traceback (most recent call last):
  File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
    self.organize_and_indent_blocks()
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 2676, in organize_and_indent_blocks
    block_idx, footer_count = self.build_table(block_idx, organized_blocks, table_start_idx,
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3066, in build_table
    block_idx, table_end_idx = self.make_table_with_footers(block_idx, footer_count, footers,
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3232, in make_table_with_footers
    and (prev_box[1]/left > 1.1) # or is_aligned)
ZeroDivisionError: float division by zero

Sorry I'm unable to share the file. Updating left in visual_ingestor.py calculate_block_bounds() to a minimum of 1 has allowed me to get past this, but it's not clear to me if that's the best fix or not.

Docker file available for hosting into lambda as container?

Hi,

I am trying to host the Docker image in Lambda. Is there a Dockerfile available for this? What kind of CPU and memory do we need for such an instance? Our process will run in batch mode and does not need low-latency throughput.

Thanks

API url issues

I'm having trouble using a custom URL. When I use the example given it works fine, but when using my own server I get this issue: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
(.conda) (base) root@LangLlama:/wsl_projects/titan/nlm-ingestor# /root/wsl_projects/titan/nlm-ingestor/.conda/bin/python /root/wsl_projects/titan/nlm-ingestor/customrag.py
Traceback (most recent call last):
  File "/root/wsl_projects/titan/nlm-ingestor/customrag.py", line 47, in <module>
    process_pdfs(pdf_directory)
  File "/root/wsl_projects/titan/nlm-ingestor/customrag.py", line 17, in process_pdfs
    docs = pdf_reader.read_pdf(pdf_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/wsl_projects/titan/nlm-ingestor/.conda/lib/python3.11/site-packages/llmsherpa/readers/file_reader.py", line 73, in read_pdf
    blocks = response_json['return_dict']['result']['blocks']
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'result'
(.conda) (base) root@LangLlama:/wsl_projects/titan/nlm-ingestor#

Connection being reset by peer

I'm running nlm-ingestor in a pipeline where I'm processing thousands of documents in total (~100 in parallel). I created multiple nlm-ingestor services behind a load balancer to distribute the load. But even if I create a lot of services, I randomly get this from llmsherpa:

{"reason":"('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))","status":"fail"}

And this is the stack trace:

...
File "/home/ubuntu/.local/lib/python3.10/site-packages/llmsherpa/readers/file_reader.py", line 74, in read_pdf
    blocks = response_json['return_dict']['result']['blocks']
KeyError: 'return_dict'

So what is happening is that for some reason the nlm-ingestor service drops the connection (maybe there are too many?) and llmsherpa doesn't get a proper response_json with a return_dict value.

Have you encountered this issue? Any idea on how to properly debug what could be happening?
Thanks!
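One way to make the client more tolerant is to wrap llmsherpa's read_pdf in a retry with backoff; a hedged sketch (attempt counts and delays are illustrative, and the broad except is deliberate because resets surface as different exception types):

import time
from llmsherpa.readers import LayoutPDFReader

def read_pdf_with_retry(reader: LayoutPDFReader, path: str, attempts: int = 3, backoff: float = 2.0):
    # Retry read_pdf on transient failures (e.g. connection resets) with exponential backoff.
    for attempt in range(attempts):
        try:
            return reader.read_pdf(path)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))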

&applyOcr=yes - no OCR taking place (skipping image pages)

I'm using &applyOcr=yes, but there's no indication that any OCR is taking place.
I'm getting back the HTML from PDF text ok, but pages that are images of (clear) text from my PDFs are completely skipped.
I'm using the latest docker image from the notebook.
thanks

For anyone hoping to deploy this as a lambda

Dockerfile

# syntax=docker/dockerfile:experimental

FROM python:3.11-bookworm
RUN apt-get update && apt-get -y --no-install-recommends install libgomp1
ENV APP_HOME /app

# install Java
RUN mkdir -p /usr/share/man/man1 && \
    apt-get update -y && \
    apt-get install -y openjdk-17-jre-headless

# install essential packages
RUN apt-get install -y \
    libxml2-dev libxslt-dev \
    build-essential libmagic-dev

# install tesseract
RUN apt-get install -y \
    tesseract-ocr \
    lsb-release \
    && echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/notesalexp.list > /dev/null \
    && apt-get update -oAcquire::AllowInsecureRepositories=true \
    && apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true -y --allow-unauthenticated \
    && apt-get update \
    && apt-get install -y \
    tesseract-ocr libtesseract-dev \
    && wget -P /usr/share/tesseract-ocr/5/tessdata/ https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata

RUN apt-get install unzip -y && \
    apt-get install git -y && \
    apt-get autoremove -y

WORKDIR ${APP_HOME}
COPY ./requirements.txt ./requirements.txt
RUN pip install --upgrade pip setuptools
RUN apt-get install -y libmagic1
RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts
RUN pip install -r requirements.txt

# Set NLTK Data directory environment variable to ensure it uses a known location
RUN mkdir -p /usr/local/share/nltk_data && chmod a+rwx /usr/local/share/nltk_data
ENV NLTK_DATA /usr/local/share/nltk_data

# Download necessary NLTK data using the defined base directory
RUN python -m nltk.downloader -d /usr/local/share/nltk_data stopwords
RUN python -m nltk.downloader -d /usr/local/share/nltk_data punkt
RUN pip install awslambdaric

COPY . ./

ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ]

# Set up the command for the Lambda handler
CMD [ "handler.parse" ]

handler.py

import base64
import json
import tempfile
import os
import traceback
from werkzeug.utils import secure_filename
from nlm_ingestor.ingestor import ingestor_api
from nlm_utils.utils import file_utils
import subprocess
import time
import threading

def parse_document(file_content, filename, render_format="all", use_new_indent_parser=False, apply_ocr=False):
    parse_options = {
        "parse_and_render_only": True,
        "render_format": render_format,
        "use_new_indent_parser": use_new_indent_parser,
        "parse_pages": (),
        "apply_ocr": apply_ocr,
    }

    try:
        # Create a temporary file to save the decoded content
        tempfile_handler, tmp_file_path = tempfile.mkstemp(suffix=os.path.splitext(filename)[1])
        with os.fdopen(tempfile_handler, 'wb') as tmp_file:
            tmp_file.write(file_content)

        # calculate the file properties
        props = file_utils.extract_file_properties(tmp_file_path)
        print(f"Parsing document: {filename}")
        return_dict, _ = ingestor_api.ingest_document(
            filename,
            tmp_file_path,
            props["mimeType"],
            parse_options=parse_options,
        )
        return return_dict or {}

    except Exception as e:
        traceback.print_exc()
        return {"status": "fail", "reason": str(e)}

    finally:
        if os.path.exists(tmp_file_path):
            os.unlink(tmp_file_path)

def read_output(process):
    while True:
        output = process.stdout.readline()
        if output == '':
            break
        print(output.strip())

def start_tika():
    print('see jar', os.path.exists("jars/tika-server-standard-nlm-modified-2.4.1_v6.jar"))
    tika_path = "jars/tika-server-standard-nlm-modified-2.4.1_v6.jar"
    java_path = "/usr/bin/java"  # Use the common path for Java
    process = subprocess.Popen([java_path, "-jar", tika_path],
                               stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    # thread = threading.Thread(target=read_output, args=(process,))
    # thread.start()

    # Main thread can perform other tasks here, or wait for the output thread to finish
    # thread.join()
    print("Tika Server process completed.")

# Call this function early in your Lambda handler

import requests

def test_tika():
    try:
        response = requests.get('http://localhost:9998/tika')
        if response.status_code == 200:
            print("Tika Server is reachable and ready!")
            return True
        else:
            print("Tika Server is not ready. Status Code:", response.status_code)
            return False
    except Exception as e:
        print("Failed to connect to Tika Server:", str(e))
        return False

def parse(event, context):
    print(context)
    if 'body' not in event:
        return {
            "statusCode": 400,
            "body": json.dumps({"message": "No data provided"})
        }
    start_tika()

    working = test_tika()
    while not working:
        time.sleep(3)
        working = test_tika()

    # Decode the file from base64
    file_content = base64.b64decode(event['body'])
    filename = "uploaded_document.pdf"  # This needs to be passed or inferred some way

    # Extract additional parameters
    params = event.get('queryStringParameters', {})
    render_format = params.get('render_format', 'all')
    use_new_indent_parser = params.get('use_new_indent_parser', 'no') == 'yes'
    apply_ocr = params.get('apply_ocr', 'no') == 'yes'

    # Process the document
    result = parse_document(
        file_content, filename, render_format, use_new_indent_parser, apply_ocr
    )

    return {
        "statusCode": 200,
        "return_dict": result
    }
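For reference, a hedged sketch of invoking the deployed function; the function name is a placeholder and the event shape just mirrors what parse() above expects:

import base64
import json

import boto3

lambda_client = boto3.client("lambda")

with open("document.pdf", "rb") as f:  # hypothetical local PDF
    event = {
        "body": base64.b64encode(f.read()).decode("utf-8"),
        "queryStringParameters": {"render_format": "all", "apply_ocr": "no"},
    }

response = lambda_client.invoke(
    FunctionName="nlm-ingestor-parser",  # placeholder function name
    Payload=json.dumps(event).encode("utf-8"),
)
result = json.loads(response["Payload"].read())
print(result.get("return_dict", result))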

How to use ingestors?

The documentation provides an example of using the LayoutPDFReader class to process PDF documents. But it also mentions various ingestors (XML, HTML, text, etc.) without a single example of how to use them or how they connect to LayoutPDFReader. Maybe there's a LayoutTextReader or something similar?

PDF extraction

I have created a PDF from its DOCX version, in which sections and subsections were created with built-in heading styles instead of numbering. The parser is not able to recognize a few subsections inside sections.

TypeError: 'NoneType' object is not subscriptable

Seeing the following error for one of my PDFs with the new indent parser:

127.0.0.1 - - [13/Feb/2024 15:55:29] "POST /api/parseDocument?renderFormat=all&useNewIndentParser=yes HTTP/1.1" 200 -
testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': True, 'parse_pages': (), 'apply_ocr': False}
processing page: 0 Number of p_tags.... 15
processing page: 1 Number of p_tags.... 19
processing page: 2 Number of p_tags.... 19
processing page: 3 Number of p_tags.... 6
processing page: 4 Number of p_tags.... 7
processing page: 5 Number of p_tags.... 12
processing page: 6 Number of p_tags.... 15
processing page: 7 Number of p_tags.... 3
processing page: 8 Number of p_tags.... 14
processing page: 9 Number of p_tags.... 5
processing page: 10 Number of p_tags.... 16
processing page: 11 Number of p_tags.... 5
processing page: 12 Number of p_tags.... 11
processing page: 13 Number of p_tags.... 14
processing page: 14 Number of p_tags.... 11
processing page: 15 Number of p_tags.... 7
processing page: 16 Number of p_tags.... 11
processing page: 17 Number of p_tags.... 12
processing page: 18 Number of p_tags.... 14
processing page: 19 Number of p_tags.... 18
processing page: 20 Number of p_tags.... 39
processing page: 21 Number of p_tags.... 1
processing page: 22 Number of p_tags.... 1
processing page: 23 Number of p_tags.... 1
processing page: 24 Number of p_tags.... 1
processing page: 25 Number of p_tags.... 1
processing page: 26 Number of p_tags.... 1
processing blocks in page: 1
processing blocks in page: 2
processing blocks in page: 3
processing blocks in page: 4
processing blocks in page: 5
processing blocks in page: 6
processing blocks in page: 8
processing blocks in page: 9
processing blocks in page: 10
processing blocks in page: 11
processing blocks in page: 12
processing blocks in page: 13
processing blocks in page: 14
processing blocks in page: 15
processing blocks in page: 16
processing blocks in page: 17
processing blocks in page: 18
processing blocks in page: 19
processing blocks in page: 20
error uploading file, stacktrace: Traceback (most recent call last):
  File "/root/nlm-ingestor/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/pdf_ingestor.py", line 175, in parse_blocks
    indent_parser.indent()
  File "/root/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/new_indent_parser.py", line 254, in indent
    self.indent_leafs()
  File "/root/git/nlm-ingestor/nlm_ingestor/ingestor/visual_ingestor/new_indent_parser.py", line 244, in indent_leafs
    block['level'] = curr_header['level'] + 1
TypeError: 'NoneType' object is not subscriptable

Error when parsing a PDF

I am running the development server using docker on my local machine.

The API url I'm using is:

http://localhost:5010/api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes

When posting my PDF to the server, I receive the following error in logs:

Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
    self.organize_and_indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3053, in organize_and_indent_blocks
    indent.indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 682, in indent_blocks
    indent, level_stack, indent_reason = get_level(class_name)
                                         ^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 321, in get_level
    parent_list_idx = list_indents[l["list_type"]]["parent_list_idx"]
                      ~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: ''

testing.. {'parse_and_render_only': True, 'render_format': 'all', 'use_new_indent_parser': True, 'parse_pages': (), 'apply_ocr': True}
processing page:  0  Number of p_tags....  5
processing page:  1  Number of p_tags....  17
processing page:  2  Number of p_tags....  199
processing page:  3  Number of p_tags....  184
processing page:  4  Number of p_tags....  14
splitting line: 700+ 300+
processing page:  5  Number of p_tags....  28
processing page:  6  Number of p_tags....  43
processing page:  7  Number of p_tags....  46
processing page:  8  Number of p_tags....  33
processing page:  9  Number of p_tags....  15
processing page:  10  Number of p_tags....  24
processing page:  11  Number of p_tags....  68
processing page:  12  Number of p_tags....  28
processing page:  13  Number of p_tags....  40
processing page:  14  Number of p_tags....  42
processing page:  15  Number of p_tags....  47
processing page:  16  Number of p_tags....  47
processing page:  17  Number of p_tags....  59
processing page:  18  Number of p_tags....  39
processing page:  19  Number of p_tags....  42
processing page:  20  Number of p_tags....  49
processing page:  21  Number of p_tags....  49
processing page:  22  Number of p_tags....  55
processing page:  23  Number of p_tags....  34
processing page:  24  Number of p_tags....  20
processing page:  25  Number of p_tags....  42
processing page:  26  Number of p_tags....  49
processing page:  27  Number of p_tags....  40
processing page:  28  Number of p_tags....  47
processing page:  29  Number of p_tags....  34
processing page:  30  Number of p_tags....  21
processing page:  31  Number of p_tags....  42
processing page:  32  Number of p_tags....  47
processing page:  33  Number of p_tags....  48
processing page:  34  Number of p_tags....  20
processing page:  35  Number of p_tags....  34
processing page:  36  Number of p_tags....  16
processing page:  37  Number of p_tags....  31
processing page:  38  Number of p_tags....  34
processing page:  39  Number of p_tags....  35
processing page:  40  Number of p_tags....  46
processing page:  41  Number of p_tags....  50
processing page:  42  Number of p_tags....  45
processing page:  43  Number of p_tags....  39
processing page:  44  Number of p_tags....  48
processing page:  45  Number of p_tags....  47
processing page:  46  Number of p_tags....  41
processing page:  47  Number of p_tags....  44
processing page:  48  Number of p_tags....  44
processing page:  49  Number of p_tags....  46
processing page:  50  Number of p_tags....  47
processing page:  51  Number of p_tags....  21
processing page:  52  Number of p_tags....  40
processing page:  53  Number of p_tags....  18
processing page:  54  Number of p_tags....  10
processing page:  55  Number of p_tags....  4
processing page:  56  Number of p_tags....  39
processing page:  57  Number of p_tags....  92
processing page:  58  Number of p_tags....  162
processing page:  59  Number of p_tags....  150
processing page:  60  Number of p_tags....  21
processing page:  61  Number of p_tags....  33
processing page:  62  Number of p_tags....  7
processing blocks in page:  1
processing blocks in page:  2
processing blocks in page:  3
processing blocks in page:  4
processing blocks in page:  5
processing blocks in page:  6
processing blocks in page:  7
processing blocks in page:  8
processing blocks in page:  9
processing blocks in page:  10
processing blocks in page:  11
processing blocks in page:  12
processing blocks in page:  12
processing blocks in page:  13
processing blocks in page:  14
processing blocks in page:  15
processing blocks in page:  16
processing blocks in page:  17
processing blocks in page:  18
processing blocks in page:  19
processing blocks in page:  19
processing blocks in page:  20
processing blocks in page:  20
processing blocks in page:  21
processing blocks in page:  22
processing blocks in page:  23
processing blocks in page:  24
processing blocks in page:  25
processing blocks in page:  26
processing blocks in page:  26
processing blocks in page:  27
processing blocks in page:  28
processing blocks in page:  29
processing blocks in page:  30
processing blocks in page:  30
processing blocks in page:  31
processing blocks in page:  32
processing blocks in page:  33
processing blocks in page:  34
processing blocks in page:  35
processing blocks in page:  36
processing blocks in page:  37
processing blocks in page:  38
processing blocks in page:  39
processing blocks in page:  40
processing blocks in page:  41
processing blocks in page:  42
processing blocks in page:  43
processing blocks in page:  44
processing blocks in page:  45
processing blocks in page:  46
processing blocks in page:  47
processing blocks in page:  48
processing blocks in page:  49
processing blocks in page:  50
processing blocks in page:  51
processing blocks in page:  53
processing blocks in page:  52
processing blocks in page:  54
processing blocks in page:  55
processing blocks in page:  56
processing blocks in page:  57
processing blocks in page:  58
processing blocks in page:  59
processing blocks in page:  60
processing blocks in page:  61
processing blocks in page:  62
error uploading file, stacktrace:  error uploading file, stacktrace: Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
    self.organize_and_indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3053, in organize_and_indent_blocks
    indent.indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 682, in indent_blocks
    indent, level_stack, indent_reason = get_level(class_name)
                                         ^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 321, in get_level
    parent_list_idx = list_indents[l["list_type"]]["parent_list_idx"]
                      ~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: ''
Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
    ingest_status, return_dict = ingestor_api.ingest_document(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 172, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 551, in parse
    self.organize_and_indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 3053, in organize_and_indent_blocks
    indent.indent_blocks()
  File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 682, in indent_blocks
    indent, level_stack, indent_reason = get_level(class_name)
                                         ^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/indent_parser.py", line 321, in get_level
    parent_list_idx = list_indents[l["list_type"]]["parent_list_idx"]
                      ~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: ''
172.17.0.1 - - [03/Apr/2024 07:54:14] "POST /api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes HTTP/1.1" 500 -

I am unable to share the PDF here due to NDA reasons but I can answer any questions regarding the PDF pages if you have any.

If I understand the trace correctly, page 62 is causing the error, so here's a screenshot of the PDF's page 62:

[screenshot of page 62 omitted]

memory leaks

Have you encountered any memory leaks when using the Docker version?

How to use HTML parser?

I've been playing around with LLMSherpa and the ingestor but am stuck on setting up the HTML parser.

I was able to send the request by modifying the LayoutPDFReader snippets for parsing a PDF file below,

import json
import os

import urllib3
from llmsherpa.readers.layout_reader import Document  # adjust the import path to your llmsherpa version if needed

api_connection = urllib3.PoolManager()  # assumed setup, mirroring llmsherpa's file_reader

def parse_pdf(pdf_file):
    auth_header = {}
    parser_response = api_connection.request("POST", "http://localhost:5010/api/parseDocument?renderFormat=all", fields={'file': pdf_file})
    return parser_response

def read_html(path_or_url, contents=None):
    """
    Reads pdf from a url or path

    Parameters
    ----------
    path_or_url: str
        path or url to the pdf file e.g. https://someexapmple.com/myfile.pdf or /home/user/myfile.pdf
    contents: bytes
        contents of the pdf file. If contents is given, path_or_url is ignored. This is useful when you already have the pdf file contents in memory such as if you are using streamlit or flask.
    """
    file_name = os.path.basename(path_or_url)
    with open(path_or_url, "rb") as f:
        file_data = f.read()
        pdf_file = (file_name, file_data, 'text/html')
    parser_response = parse_pdf(pdf_file)
    response_json = json.loads(parser_response.data.decode("utf-8"))
    blocks = response_json['return_dict']['result']['blocks']
    return Document(blocks)

The server is getting the error:

error uploading file, stacktrace: Traceback (most recent call last):
 File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
   ingest_status, return_dict = ingestor_api.ingest_document(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 47, in ingest_document
   htmli = html_ingestor.HTMLIngestor(doc_location)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/app/nlm_ingestor/ingestor/html_ingestor.py", line 32, in __init__
   self.json_dict = br.render_json()
                    ^^^^^^^^^^^^^^^^
 File "/app/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 259, in render_json
   block["box_style"][1],
   ~~~~~^^^^^^^^^^^^^
KeyError: 'box_style'
Traceback (most recent call last):
 File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 44, in parse_document
   ingest_status, return_dict = ingestor_api.ingest_document(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 47, in ingest_document
   htmli = html_ingestor.HTMLIngestor(doc_location)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/app/nlm_ingestor/ingestor/html_ingestor.py", line 32, in __init__
   self.json_dict = br.render_json()
                    ^^^^^^^^^^^^^^^^
 File "/app/nlm_ingestor/ingestor/visual_ingestor/block_renderer.py", line 259, in render_json
   block["box_style"][1],
   ~~~~~^^^^^^^^^^^^^
KeyError: 'box_style'
172.17.0.1 - - [26/Jan/2024 19:13:34] "POST /api/parseDocument?renderFormat=all HTTP/1.1" 500

My sample HTML file is copy/pasted from "inspect source" on https://github.com/nlmatics/nlm-ingestor. Any thoughts? Is the HTML parser only meant for Tika responses that generate HTML from DOCX and PPTX? Thanks!

Does the docker image come with the modified tika server?

I'm trying to confirm I have the modified tika server when I run the docker image. In the docs it says that one must set up their own modified tika server and set the env appropriately.

I logged into the docker container and confirmed that the env variable is not set:

# env | grep TIKA_SERVER_ENDPOINT

Additionally, I don't see clear instructions on how to set up one's own modified tika server... can someone point me in the right direction for setting up the modified tika server? Or is it already built into the docker image?

Many thanks!

numpy error while parsing

I'm getting the following error when parsing some PDFs, but not with others. Unfortunately I cannot share the files, but I can share some metadata upon request.

nlm-ingestor  | /usr/local/lib/python3.11/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
nlm-ingestor  |   return _methods._mean(a, axis=axis, dtype=dtype,
nlm-ingestor  | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
nlm-ingestor  |   ret = ret.dtype.type(ret / rcount)
nlm-ingestor  | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:269: RuntimeWarning: Degrees of freedom <= 0 for slice
nlm-ingestor  |   ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
nlm-ingestor  | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in divide
nlm-ingestor  |   arrmean = um.true_divide(arrmean, div, out=arrmean,
nlm-ingestor  | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:261: RuntimeWarning: invalid value encountered in scalar divide
nlm-ingestor  |   ret = ret.dtype.type(ret / rcount)

Endpoint: http://nlm-ingestor:5001/api/parseDocument?renderFormat=all called through LLMSherpa library

Any suggestion?

Dependency versions too strict

I think the version requirements in setup.py are a little too strict. For the following packages, only one version is allowed. This caused issues when I was installing nlm-ingestor from a requirements.txt.

install_requires=[
    ...
    "symspellpy==6.7.0",
    "pandas==1.2.4",
    "mistune==2.0.3",
    "lxml==4.9.1",
    ...
]

Error message from pip install -r requirements.txt:

ERROR: Cannot install -r requirements.txt (line 107), -r requirements.txt (line 129), -r requirements.txt (line 25), -r requirements.txt (line 5) and pandas==2.0.3 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested pandas==2.0.3
    altair 5.2.0 depends on pandas>=0.25
    gradio 3.36.1 depends on pandas
    layoutparser 0.3.4 depends on pandas
    nlm-ingestor 0.1.5 depends on pandas==1.2.4
    The user requested pandas==2.0.3
    altair 5.2.0 depends on pandas>=0.25
    gradio 3.36.1 depends on pandas
    layoutparser 0.3.4 depends on pandas
    nlm-ingestor 0.1.4 depends on pandas==1.2.4
    The user requested pandas==2.0.3
    altair 5.2.0 depends on pandas>=0.25
    gradio 3.36.1 depends on pandas
    layoutparser 0.3.4 depends on pandas
    nlm-ingestor 0.1.3 depends on pandas==1.2.4
    The user requested pandas==2.0.3
    altair 5.2.0 depends on pandas>=0.25
    gradio 3.36.1 depends on pandas
    layoutparser 0.3.4 depends on pandas
    nlm-ingestor 0.1.2 depends on pandas==1.2.4
    The user requested pandas==2.0.3
    altair 5.2.0 depends on pandas>=0.25
    gradio 3.36.1 depends on pandas
    layoutparser 0.3.4 depends on pandas
    nlm-ingestor 0.1.1 depends on pandas==1.2.4

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

Is it possible to use a range of versions for these packages, e.g. symspellpy>=6.7.0, pandas>=1.2.4?

Thanks.
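For illustration, the kind of relaxed specifiers being requested, as they might look in setup.py (lower bounds here are examples, not tested combinations):

from setuptools import setup, find_packages

setup(
    name="nlm-ingestor",
    packages=find_packages(),
    install_requires=[
        # Lower-bound pins instead of exact '==' pins.
        "symspellpy>=6.7.0",
        "pandas>=1.2.4",
        "mistune>=2.0.3",
        "lxml>=4.9.1",
    ],
)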
