
amazon-textract-textractor's Introduction

Textractor


Textractor is a Python package created to seamlessly work with Amazon Textract, a document intelligence service offering text recognition, table extraction, form processing, and much more. Whether you are writing a one-off script or building a complex distributed document processing pipeline, Textractor makes it easy to use Textract.

If you are looking for the other amazon-textract-* packages (caller, response parser, overlayer, prettyprinter, helper), you can find them on PyPI.

Installation

Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default this installs a minimal version of Textractor suitable for AWS Lambda execution. The following extras can be used to add features:

  • pandas (pip install "amazon-textract-textractor[pandas]") installs pandas which is used to enable DataFrame and CSV exports.
  • pdf (pip install "amazon-textract-textractor[pdf]") includes pdf2image and enables PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
  • torch (pip install "amazon-textract-textractor[torch]") includes sentence_transformers for better word search and matching. This works on CPU but is noticeably slower than the non-machine-learning-based approaches.
  • dev (pip install "amazon-textract-textractor[dev]") includes all the dependencies above and everything else needed to test the code.

You can pick several extras by separating the labels with commas, like this: pip install "amazon-textract-textractor[pdf,torch]".

Documentation

Generated documentation for the latest released version can be accessed here: aws-samples.github.io/amazon-textract-textractor/

Examples

While a collection of simple examples is presented here, the documentation has a much larger collection of examples and specific case studies that will help you get started.

Setup

These two lines are all you need to use Textract. The Textractor instance can be reused across multiple synchronous and asynchronous requests.

from textractor import Textractor

extractor = Textractor(profile_name="default")

Text recognition

# file_source can be an image, list of images, bytes or S3 path
document = extractor.detect_document_text(file_source="tests/fixtures/single-page-1.png")
print(document.lines)
#[Textractor Test, Document, Page (1), Key - Values, Name of package: Textractor, Date : 08/14/2022, Table 1, Cell 1, Cell 2, Cell 4, Cell 5, Cell 6, Cell 7, Cell 8, Cell 9, Cell 10, Cell 11, Cell 12, Cell 13, Cell 14, Cell 15, Selection Element, Selected Checkbox, Un-Selected Checkbox]

Table extraction

from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
	file_source="tests/fixtures/form.png",
	features=[TextractFeatures.TABLES]
)
# Saves the table to an Excel workbook for further processing
document.tables[0].to_excel("output.xlsx")

Form extraction

from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
	file_source="tests/fixtures/form.png",
	features=[TextractFeatures.FORMS]
)
# Use document.get() to search for a key with fuzzy matching
document.get("email")
# [E-mail Address : [email protected]]

Analyze ID

document = extractor.analyze_id(file_source="tests/fixtures/fake_id.png")
print(document.identity_documents[0].get("FIRST_NAME"))
# 'MARIA'

Receipt processing (Analyze Expense)

document = extractor.analyze_expense(file_source="tests/fixtures/receipt.jpg")
print(document.expense_documents[0].summary_fields.get("TOTAL")[0].text)
# '$1810.46'

If your use case was not covered here or if you are looking for asynchronous usage examples, see our collection of examples.
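For reference, here is a minimal asynchronous sketch. It assumes the start_document_text_detection entry point and an S3 bucket you own for uploads (the file and bucket names are hypothetical); see the documentation for the exact signatures in your version.

from textractor import Textractor

extractor = Textractor(profile_name="default")
# Asynchronous call: the document is uploaded to S3 and a Textract job is started.
document = extractor.start_document_text_detection(
    file_source="tests/fixtures/multi-page.pdf",  # hypothetical fixture
    s3_upload_path="s3://my-bucket/uploads/",     # hypothetical bucket
)
# The returned document is lazy: accessing its properties waits for the job to finish.
print(document.lines)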

CLI

Textractor also comes with the textractor script, which supports calling, printing and overlaying directly in the terminal.

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay TABLES

[overlay example image]

See the documentation for more examples.

Tests

The package comes with tests that call the production Textract APIs. Running the tests will incur charges to your AWS account.
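A minimal invocation sketch, assuming the suite uses pytest and that your default AWS profile has Textract access (both are assumptions, not documented here):

python -m pytest tests/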

Acknowledgements

This library was made possible by the work of Srividhya Radhakrishna (@srividh-r).

Contributing

See CONTRIBUTING.md

Citing

Textractor can be cited using:

@software{amazontextractor,
  author = {Belval, Edouard and Delteil, Thomas and Schade, Martin and Radhakrishna, Srividhya},
  title = {{Amazon Textractor}},
  url = {https://github.com/aws-samples/amazon-textract-textractor},
  version = {1.7.5},
  year = {2024}
}

Or using the CITATION.cff file.

License

This library is licensed under the Apache 2.0 License.

Excavator image by macrovector on Freepik


amazon-textract-textractor's Issues

OutputConfig JSON has None values

Sample:

    "Blocks": [{
        "BlockType": "PAGE",
        "ColumnIndex": null,
        "ColumnSpan": null,
        "Confidence": null,
        "EntityTypes": null,
        "Geometry": {
            "BoundingBox": {

Find phrase on page not working with multipage docs

find_phrase_on_page is not working for multipage documents when the search term is actually two words.
I am able to reproduce the error with a document and the two calls below.
The first one works, but the second fails with an error saying that the page numbers are not equal.

response = document.find_phrase_on_page("Any", page_number=2,area_selection=AreaSelection(
                top_left=t2.TPoint(y=0, x=0),lower_right=t2.TPoint(y=doc_height, x=doc_width),
                page_number=2))

response = document.find_phrase_on_page("Any words", page_number=2,area_selection=AreaSelection(
                top_left=t2.TPoint(y=0, x=0),lower_right=t2.TPoint(y=doc_height, x=doc_width),
                page_number=2))

I was able to narrow it down to the related function and call, but I am not sure how to resolve it.
I would appreciate some guidance.

In tgeofinder.py, TGeoFinder.__find_phrase_on_page fails during this call:

words_to_right = self.get_words_to_the_right(
   anker=TGeoFinder.get_area_selection_for_twords([current_word]), number_of_words_to_return=1)

For that call, page_numbers is not part of the arguments, so the wrong area is returned.

Textract missing most of the text in documents.

I am processing some fairly simple PDFs from S3 using Textract document detection. For most of these documents, the returned JSON contains very little text. For example, the PDF located here returns only 778 words and 127 lines across 96 pages. Note that the PDF is very simple in structure, so I am confused why Textract is less effective than something like PyPDF2.

Please let me know if there is something I am missing here. Also, any data on the types of documents that are suitable or unsuitable for Textract would be helpful.

Many thanks.

Repeated data in medical-insights-entities.csv and medical-insights-phi.json

I am planning to use Comprehend Medical in production in a new biomedical research product we are working on. I used Textractor to process an 1143-page PDF of a single patient's medical records. And it worked - amazing!

However, I noticed that the entire set of extracted entities is repeated in the {}-medical-insights-entities.csv file that is generated for each page of the PDF. The file contents look like this ...

Text,Type,Category,Score,BeginOffset,EndOffset
"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18
68,AGE,PROTECTED_HEALTH_INFORMATION,0.22106069326400757,26,28
Penn Medicine,ADDRESS,PROTECTED_HEALTH_INFORMATION,0.1475761979818344,37,50
11/18/2015,DATE,PROTECTED_HEALTH_INFORMATION,0.9999758005142212,51,61
917356281,ID,PROTECTED_HEALTH_INFORMATION,0.9997710585594177,163,172
MYELOMA,DX_NAME,MEDICAL_CONDITION,0.8934065699577332,173,180

... there are lots of valid entries and then the entire set repeats.

"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18
68,AGE,PROTECTED_HEALTH_INFORMATION,0.22106069326400757,26,28
Penn Medicine,ADDRESS,PROTECTED_HEALTH_INFORMATION,0.1475761979818344,37,50
11/18/2015,DATE,PROTECTED_HEALTH_INFORMATION,0.9999758005142212,51,61
917356281,ID,PROTECTED_HEALTH_INFORMATION,0.9997710585594177,163,172
MYELOMA,DX_NAME,MEDICAL_CONDITION,0.8934065699577332,173,180

... etc. The same pattern occurs in the {}-medical-insights-phi.json file. "Id": 0 through "Id": 48 occur as expected and then they repeat - all within the same enclosing list. This pattern occurs for each of the 1143 pages in "Some random person's" medical record. I have not noticed any other obvious fubars yet.

I would attach the file(s) but (a) it really is PHI and (b) there's A LOT of data.
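Until the duplication is fixed, a possible client-side cleanup sketch, assuming pandas is installed (the file names are hypothetical). Because each row carries BeginOffset and EndOffset, exact-duplicate rows can be dropped safely:

import pandas as pd

# Load one page's entities file and drop the repeated block of rows.
df = pd.read_csv("page-1-medical-insights-entities.csv")  # hypothetical name
df = df.drop_duplicates()
df.to_csv("page-1-medical-insights-entities.dedup.csv", index=False)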

Visualize document.expense_documents

It would be great if we could visualize expense_documents and the associated normalized summary fields directly on the document as well, similar to how we currently visualize KV containers, for example.

[Feature request] Configurable async job poll time in Caller

Textract Caller currently hard-codes the polling interval for checking whether an async analysis job is done to 1 second, here.

This seems like a nice, speedy default for single-threaded calls. However, it becomes the limiting factor for scaling up via local thread parallelism because the default quota on GetDocumentAnalysis API calls is relatively low (reported as 5TPS here or 1TPS here) - as compared to the quota for concurrent running analyses, which is in the hundreds.

So if a user (like me) is doing a PoC with a few hundred PDF documents on S3 and hoping to get them processed promptly without going the whole hog and deploying something like the large scale serverless processing sample, they can achieve a decent speed-up quite easily by using Textract Caller with a multiprocessing.pool.ThreadPool... But, can't increase the pool size beyond about 5 (in my us-east-2 tests at least), because 5 threads times 1/sec = 5TPS on the Get* API.

I'm not saying TextractCaller + thread parallelism is a great/scalable practice, but processing datasets ~5 times faster with a small bit of ThreadPool logic can be quite helpful... And seems like giving flexibility to increase the poll period could quite easily enable going up to ~10-20 requests in parallel.

Of course it would also be great if this library exposed some more built-in functionality for processing batches of docs with concurrency - instead of leaving this to user code.
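For context, a minimal sketch of the ThreadPool pattern described above, assuming call_textract from amazon-textract-caller and documents already uploaded to S3 (the URIs are hypothetical); a configurable poll interval is what would let the pool grow beyond about 5 threads.

from multiprocessing.pool import ThreadPool

from textractcaller import call_textract

# Hypothetical list of documents already in S3.
s3_uris = [f"s3://my-bucket/doc-{i}.pdf" for i in range(100)]

def process(uri):
    # force_async_api starts an async job and polls until it completes;
    # the hard-coded 1-second poll interval caps Get* calls at ~1 TPS per thread.
    return call_textract(input_document=uri, force_async_api=True)

with ThreadPool(5) as pool:
    results = pool.map(process, s3_uris)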

Async AnalyzeExpense does not work

to reproduce:
invoice_multipage.pdf

import textractor
from textractor import Textractor
extractor = Textractor(profile_name="default")
document = extractor.start_expense_analysis(
    file_source='./invoice_multipage.pdf',
    s3_output_path='s3://<BUCKET>',
    s3_upload_path='s3://<BUCKET>'
)

Need an option to save output in UTF-8 encoding to avoid saving as Windows-1252 encoding

It looks like the only way to capture the output of amazon-textract is to redirect it into a file. Such as:

amazon-textract --input-document "s3://somebucket/2022-04-16-0010.jpg" --pretty-print LINES > 2022-04-16-0010.txt

Unfortunately, this is a problem on Windows because the default encoding is Windows 1252, not UTF-8. When trying to analyze the output using other tools, UTF-8 is often required.

Something like this would be very useful:

amazon-textract --input-document "s3://somebucket/2022-04-16-0010.jpg" --pretty-print LINES --output-document 2022-04-16-0010.txt

where the default output is UTF-8.
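In the meantime, a possible PowerShell workaround using standard Out-File behavior (note that Windows PowerShell 5 writes UTF-8 with a BOM):

amazon-textract --input-document "s3://somebucket/2022-04-16-0010.jpg" --pretty-print LINES | Out-File -Encoding utf8 2022-04-16-0010.txt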

S3 folder limited to one 'page' of objects

In textractor.py, we currently seem to hard-code a limit of one page of S3 objects when calling S3Helper.getFileNames() to list the objects in an S3 folder input, even though the underlying helper seems capable of pagination.

It looks from the Boto3 docs like the default page length is 1000 objects, so this would only come up for pretty big jobs, but I worry it might trip some folks up, as I don't think there is any obvious message telling the user that this limit is being applied.

My 2 cents is that the potential detriment to user experience from the hidden limit is worse than the risk of somebody pointing the script at a bigger S3 folder than they expected. Textractor already outputs frequent diagnostics about the number of docs it is processing, and given the serial nature of the queue it is not going to push a massive volume through the APIs before anybody can spot the problem and interrupt it. But maybe others have different perspectives?
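For reference, listing every object regardless of count is a standard boto3 pagination pattern (the bucket and prefix here are hypothetical):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
# Iterates over as many pages as needed; each page holds up to 1000 keys.
for page in paginator.paginate(Bucket="my-bucket", Prefix="docs/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])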

Implement smoke tests

Recent issues such as #121, #122, and #123 showed that our current test suite is inadequate and several edge cases are missing. This issue outlines a plan for implementing smoke tests to identify issues with the core APIs, namely:

  • detect_document_text
  • start_document_text_detection
  • analyze_document
  • start_document_analysis
  • analyze_id
  • analyze_expense
  • start_expense_analysis
  • get_result

Each API should be callable with a Pillow image (lowest common-denominator) and the result should be visualizable.
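A minimal smoke-test sketch for one of these APIs, assuming pytest, the existing fixture image, and a configured AWS profile; this is an illustration of the plan, not the project's actual test suite.

from PIL import Image

from textractor import Textractor

def test_detect_document_text_smoke():
    extractor = Textractor(profile_name="default")
    # A Pillow image is the lowest-common-denominator input mentioned above.
    image = Image.open("tests/fixtures/single-page-1.png")
    document = extractor.detect_document_text(file_source=image)
    # The result should be non-empty and visualizable.
    assert len(document.lines) > 0
    document.visualize()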

duplicates in checkboxes with same value

I have an issue with checkboxes; let me explain the problem.
My form has checkboxes with the same values, for example: 1) Gender [ ] Male [ ] Female and 2) Sex [ ] Male [ ] Female.
I am getting results like "Male selected" and "Female unselected", i.e. only the checkboxes with their status. But I want to extract each checkbox with its status paired with its question or key, such as Gender or Sex.

Please respond ASAP, thank you.

amazon-textract helper not working on Windows 10

After run

python -m pip install amazon-textract-helper

It creates a file named "amazon-textract" at %LOCALAPPDATA%\Programs\Python\Python38\Scripts.

Note that it is named "amazon-textract", not "amazon-textract.py", so Windows 10 does not know how to execute it and asks with the usual popup window "How do you want to open this file?"

The following command works ok on power shell.

python $env:LOCALAPPDATA\Programs\Python\Python38\Scripts\amazon-textract --help

Modification to the Document entity from response is not captured when using .to_trp2()

The conversion to trp2 is based on the initial response. This does not capture any modifications made to the entities, such as OCR post-processing, correction, or deletion of entities. A proper converter needs to be implemented to make the library usable for post-processing in-place modifications.

This would allow workflows along the lines of:

document.pages[1].key_values = {key: value + '_edited' for key, value in document.pages[1].key_values.items()}
document.export("document.json")

We also need to add utility functions such as:

  • Merging tables
  • Adding new keys
  • Adding new queries output

prettyprinter convert_queries_to_list_trp2 returns wrong list when answers are missing

instead of:

2022-07-08T17:44:41+00:00 AWS_PAYSTUBS Paystub_1_reMars.json 1 PAYSTUB_PERIOD_COMPANY_NAME 1
ANYCOMPANY INC. USA 96 0 0 0 0 0.095238097 0.154919237 0.633219957 0.10866373

returns

2022-07-08T17:44:41+00:00 AWS_PAYSTUBS Paystub_1_reMars.json 1 What is the Pay Period Start Date?
0 0 0 0 0 0 0 0

So, it returns the name and misses the confidence values.

Overlayer broken with DocumentDimensions not subscriptable

Traceback (most recent call last):
File "/Users/schadem/code/github/aws-samples/amazon-textract-textractor/helper/bin/amazon-textract", line 259, in <module>
bounding_box_list = get_bounding_boxes(textract_json=doc,
File "/Users/schadem/code/github/aws-samples/amazon-textract-textractor/overlayer/textractoverlayer/t_overlay.py", line 104, in get_bounding_boxes
page_dimensions = document_dimensions[page_number]
TypeError: 'DocumentDimensions' object is not subscriptable

Access denied StartDocumentTextDetection

Hi,

when running python3 textractor.py --documents s3://inputdatatimo/document.pdf --region us-east-1 --text I get following error message:

botocore.errorfactory.AccessDeniedException: An error occurred (AccessDeniedException) when calling the StartDocumentTextDetection operation:

though I granted (full access) permissions to the IAM user for S3 and Textract.
Are there any further permissions needed to make this run?

Many thanks
schatimo

Text is extracted but not grouped into forms and tables correctly

We're starting an invoice processing project and really like this library, but we're having one interesting issue: The text is all parsed correctly, but then it is not always grouped into forms and tables correctly.

So for example we could have this block in the text file:

SOLD TO:
SHIP TO:
CUSTOMER COMPANY
JANE RECIPIENT
123 SOMEWHERE ST
987 ANOTHER PL
LOS ANGELES CA 90001
SAN FRANCISCO CA 94100
USA
USA

But then it does not show up in a SOLD TO: or SHIP TO: forms or tables.

When it does show up, the confidence level is low (around 37), but the confidence of the text itself is very high:
"BlockType": "LINE",
"Confidence": 99.85072326660156,

Is this a problem of the relative position of the text versus their labels?

Should we try to adjust the forms/tables parsing algorithm? Or should we just work with the text and try to go with the repeating patterns of text, and not worry about forms and tables?

support single page PDF

Textract launched single-page PDF support; it should be supported in the caller.
Currently PDF defaults to async - we don't want to change that. Maybe add a force_sync param to the method.


Cannot visualize expense document

import textractor
from textractor import Textractor
extractor = Textractor(profile_name="default")
document = extractor.analyze_expense(
    file_source='./invoice.png',
)
document.visualize()
[]

Import of variables is failing

The import in the sample code is failing.

from textractcaller import call_textract, Textract_Features
features = [Textract_Features.TABLES]
print(features)

from textractcaller import call_textract, Textract_Features
ImportError: cannot import name 'Textract_Features' from 'textractcaller' 

no test for overlayer

The PDF dimension get_width_height_from_file call in image_tools.py was broken but not detected, as tests were missing.
Adding tests.

Character encoding

Likely a Windows problem:

Total Pages in Document: 77
Traceback (most recent call last):
File "textractorsandbox.py", line 169, in
Textractor().run()
File "textractorsandbox.py", line 148, in run
self.processDocument(ips, i, document)
File "textractorsandbox.py", line 113, in processDocument
opg.run()
File "D:\amazon-textract-textractor\src\og.py", line 111, in run
self._outputWords(page, p)
File "D:\amazon-textract-textractor\src\og.py", line 30, in _outputWords
csvFieldNames, csvData)
File "D:\amazon-textract-textractor\src\helper.py", line 122, in writeCSV
writer.writerow(row)
File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\singan\lib\csv.py", line 155, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\singan\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 37: character maps to <undefined>
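A possible fix sketch for the helper's CSV writing (hypothetical function and argument names; the real writeCSV lives in helper.py), forcing UTF-8 regardless of the Windows default code page:

import csv

def write_csv(csv_path, field_names, rows):
    # encoding="utf-8" avoids cp1252 charmap errors for characters like '\u20b9'.
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=field_names)
        writer.writeheader()
        writer.writerows(rows)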

PDF must be in S3 bucket?

When running the following command:

python3 textractor.py --documents document-name.pdf --text --forms --tables

I get an error Exception: PDF must be in S3 bucket. Can it not be run on a local document?

Full error message below:

Traceback (most recent call last):
File "textractor.py", line 164, in
Textractor().run()
File "textractor.py", line 143, in run
self.processDocument(ips, i, document)
File "textractor.py", line 98, in processDocument
dp = DocumentProcessor(ips["bucketName"], document, ips["awsRegion"], ips["text"], ips["forms"], ips["tables"])
File "/mnt/c/Users/Username/Documents/textractor/tdp.py", line 218, in init
raise Exception("PDF must be in S3 bucket.")
Exception: PDF must be in S3 bucket.
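A simple workaround sketch until local PDFs are supported: upload the PDF to S3 first with the standard boto3 call (the bucket name is hypothetical), then point the script at the S3 URI.

import boto3

# Upload the local PDF, then run:
# python3 textractor.py --documents s3://mybucket/document-name.pdf --text --forms --tables
boto3.client("s3").upload_file("document-name.pdf", "mybucket", "document-name.pdf")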

Queries cannot be visualized

doc.queries.visualize() returns an error

document.queries.visualize()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-36-09589854c2af> in <module>
----> 1 document.queries.visualize()

~/anaconda3/lib/python3.7/site-packages/textractor/visualizers/entitylist.py in visualize(self, with_text, with_confidence, font_size_ratio)
    109                 with_text,
    110                 with_confidence,
--> 111                 font_size_ratio,
    112             )
    113 

~/anaconda3/lib/python3.7/site-packages/textractor/visualizers/entitylist.py in _draw_bbox(entities, with_text, with_confidence, font_size_ratio)
    576     for entity in entities:
    577         width, height = image.size
--> 578         overlayer_data = _get_overlayer_data(entity, width, height)
    579         drw.rectangle(
    580             xy=overlayer_data["coords"], outline=overlayer_data["color"], width=2

~/anaconda3/lib/python3.7/site-packages/textractor/visualizers/entitylist.py in _get_overlayer_data(entity, width, height)
    711     data["coords"] = [x, y, x + w, y + h]
    712     data["confidence"] = (
--> 713         entity.confidence if not entity.__class__.__name__ == "Table" else ""
    714     )
    715     data["text_color"] = (0, 0, 0)

AttributeError: 'Query' object has no attribute 'confidence'

I suggest we highlight the results of the queries on the document visualization and index them, e.g. query_1.

QueriesConfig not falsy when empty

This fails:

    queries_config = QueriesConfig(queries=[])
    assert not queries_config.get_dict()

and therefore creates problems with simplifying the call_textract call.
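A possible fix sketch (hypothetical; the actual QueriesConfig internals may differ): return an empty dict when no queries are set, so that the config is falsy and the assertion above passes.

def get_dict(self):
    # An empty dict is falsy, so `assert not queries_config.get_dict()` holds.
    if not self.queries:
        return {}
    return {"Queries": [query.get_dict() for query in self.queries]}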

Cannot visualize AnalyzeID documents

import textractor
from textractor import Textractor
extractor = Textractor(profile_name="default")
document = extractor.analyze_id(
    file_source='./id.png',
)
document.visualize()
[]

_parseDocumentPagesAndBlockMap NoneType Error

Running into an error with specific PDF documents. I believe this is due to _parseDocumentPagesAndBlockMap not detecting blocks.

Traceback (most recent call last):
  File "c:\Users\User\amazon-textract-textractor\textractor.py", line 164, in <module>
    Textractor().run()
  File "c:\Users\User\amazon-textract-textractor\textractor.py", line 143, in run
    self.processDocument(ips, i, document)
  File "c:\Users\User\amazon-textract-textractor\textractor.py", line 109, in processDocument
    ips["forms"], ips["tables"])
  File "c:\Users\User\amazon-textract-textractor\og.py", line 13, in __init__
    self.document = Document(self.response)
  File "c:\Users\User\amazon-textract-textractor\trp.py", line 599, in __init__
    self._parse()
  File "c:\Users\User\amazon-textract-textractor\trp.py", line 631, in _parse
    self._responseDocumentPages, self._blockMap = self._parseDocumentPagesAndBlockMap()
  File "c:\Users\User\amazon-textract-textractor\trp.py", line 614, in _parseDocumentPagesAndBlockMap
    for block in page['Blocks']:
TypeError: 'NoneType' object is not subscriptable

Invalid input: Document or path to a foler or S3 bucket containing documents is required.

Hello - I'm running into errors when trying to process a PDF on my S3 bucket.

First thing I did was reference the ReadMe and tried the following:

py textractor.py --documents s3://bucketname/filename.pdf --text --forms --tables

Error I received:

Invalid input: An error occurred (InvalidToken) when calling the GetBucketLocation operation: The provided token is malformed or otherwise invalid.
Valid format:

  • python3 textractor.py --documents mydoc.jpg --text --forms --tables --region us-east-1
  • python3 textractor.py --documents ./myfolder/ --text --forms --tables
  • python3 textractor.py --document s3://mybucket/mydoc.pdf --text --forms --tables
  • python3 textractor.py --document s3://mybucket/ --text --forms --tables

Traceback (most recent call last):
File "textractor.py", line 164, in <module>
Textractor().run()
File "textractor.py", line 135, in run
totalDocuments = len(ips["documents"])
TypeError: 'NoneType' object is not subscriptable

I noticed there's no "s" when trying to process a file on s3, so I tried this:

py textractor.py --document s3://bucketname/filename.pdf --text --forms --tables

Got this error:

Invalid input: Document or path to a foler or S3 bucket containing documents is required.
Valid format:

  • python3 textractor.py --documents mydoc.jpg --text --forms --tables --region us-east-1
  • python3 textractor.py --documents ./myfolder/ --text --forms --tables
  • python3 textractor.py --document s3://mybucket/mydoc.pdf --text --forms --tables
  • python3 textractor.py --document s3://mybucket/ --text --forms --tables
Traceback (most recent call last):
File "textractor.py", line 164, in <module>
Textractor().run()
File "textractor.py", line 135, in run
totalDocuments = len(ips["documents"])
TypeError: 'NoneType' object is not subscriptable

Additionally... I tried running a test file locally (I chose a JPG and not a PDF since I get another message when trying to process a PDF locally stating I need to upload the PDF to an S3 bucket)

py textractor.py --documents test.jpg --text

And it seemed to work but ended up getting an error:

Textracting Document # 1: test.jpg
Calling Textract...
Traceback (most recent call last):
File "textractor.py", line 164, in
Textractor().run()
File "textractor.py", line 143, in run
self.processDocument(ips, i, document)
File "textractor.py", line 99, in processDocument
response = dp.run()
File "C:\Users\CamiLaPorte\Desktop\Checkbox-Textract\amazon-textract-textractor\src\tdp.py", line 232, in run
response = ip.run()
File "C:\Users\CamiLaPorte\Desktop\Checkbox-Textract\amazon-textract-textractor\src\tdp.py", line 79, in run
response = self._callTextract()
File "C:\Users\CamiLaPorte\Desktop\Checkbox-Textract\amazon-textract-textractor\src\tdp.py", line 42, in _callTextract
response = textract.detect_document_text(Document={'Bytes': imageBytes})
File "C:\Users\CamiLaPorte\Anaconda3\lib\site-packages\botocore\client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "C:\Users\CamiLaPorte\Anaconda3\lib\site-packages\botocore\client.py", line 661, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (UnrecognizedClientException) when calling the DetectDocumentText operation: The security token included in the request is invalid

I've used textractor code multiple times in the last year and have never had this issue. It's only been with a recent download of the repo and code that I'm experiencing this. There have been no changes to my machine and I'm running the same environments (I've even tried multiple environments as well.)

Please let me know if you need more information. Thanks!

amazon-textract helper not working with --stdin

When using --stdin, I receive:

Traceback (most recent call last):
File "/Users/schadem/.pyenv/versions/3.9.6/bin/amazon-textract", line 193, in <module>
if len(input_document) > 7 and input_document.lower().startswith("s3://"):
TypeError: object of type 'NoneType' has no len()
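A possible fix sketch (hypothetical variable names): populate input_document from standard input before the length check, so it is never None when --stdin is used.

import sys

# Read the document bytes from stdin instead of leaving input_document as None.
if args.stdin:
    input_document = sys.stdin.buffer.read()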

[Feature request] Output destination option

Thanks for the nice utility! However, my working directory is now an absolute mess 😂

It would be really helpful if something like an --output CLI option was available where we could specify the destination for the output files. Ideally you could supply an S3 URI, but even being able to give a folder would be a big help!

(Maybe I'm just missing some obvious way to do this? Using Textractor to process many tens of documents in batch for a quick PoC)

Can not force sync

When sending in a document, it is not clear whether async or sync is called.
We can force async, but not sync.
Accidentally sending a multi-page document when expecting sync calls can lead to timeouts and errors.

File name with space

If the file name contains a space, the document is not processed. The loop continues with "IN PROGRESS" indefinitely. I have tested this in two environments, same behavior.

Translation JSON response

When we use --translate, we get the translation for each page, but the consolidated JSON response (-response.json) is not translated. How can we generate the translation in the final JSON as well?

[Feature request] Preserve folder structure

When applying textractor to a local folder or S3 prefix with an inner folder structure, it would be really useful if output files were also mapped to the same folder structure - rather than flattened out by filename only.

For example processing a parent folder containing invoices/ABC123.pdf and purchase-orders/DEF456.pdf currently seems to generate ABC123-response.json and DEF456-response.json - which then need to be mapped back to their categories afterwards.

page number is overwritten in function find_phrase_in_lines

The page number is overwritten if you pass it to the function, because it is reset within the for loop.
Plus, the page number is not considered as a search criterion.

Source code snippet from line 1091ff:

def find_phrase_in_lines(
        self, phrase: str, min_textdistance=0.6, page_number: int = 1
    ) -> List[TWord]:
        """
        phrase = words seperated by space char
        """
        # first check if we already did find this phrase and stored it in the DB
        # TODO: Problem: it will not find Current: when the phrase has current and there are other current values in the document without :
        if not phrase:
            raise ValueError(f"no valid phrase: '{phrase}")
        phrase_words = phrase.split(" ")
        if len(phrase_words) < 1:
            raise ValueError(f"no valid phrase: '{phrase}")
        # TODO: check for page_number impl
        found_phrases: "list[TWord]" = self.ocrdb.select_text(
            textract_doc_uuid=self.textract_doc_uuid,
            text=make_alphanum_and_lower_for_non_numbers(phrase),
        )
        print("after ocrdb.select_text")
        if found_phrases:
            print("phrases found")
            return found_phrases

        alphanum_regex = re.compile(r"[\W_]+")
        # find phrase (words that follow each other) in trp lines
        for page in self.doc.pages:
            page_number = 1
            for line in page.lines:
......
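A possible fix sketch (hypothetical): respect the caller's page_number instead of resetting it inside the loop, and skip pages that do not match.

# Inside find_phrase_in_lines, replacing the loop shown above:
for page_idx, page in enumerate(self.doc.pages, start=1):
    if page_idx != page_number:
        continue  # only search the requested page
    for line in page.lines:
        # ... existing phrase-matching logic continues here ...
        pass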
