
amazon-textract-textractor's Introduction

Textractor


Textractor is a Python package created to seamlessly work with Amazon Textract, a document intelligence service offering text recognition, table extraction, form processing, and much more. Whether you are writing a one-off script or building a complex distributed document processing pipeline, Textractor makes it easy to use Textract.

If you are looking for the other amazon-textract-* packages (caller, response parser, overlayer, prettyprinter, helper), you can find them on PyPI.

Installation

Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default this installs a minimal version of Textractor suitable for AWS Lambda execution. The following extras can be used to add features:

  • pandas (pip install "amazon-textract-textractor[pandas]") installs pandas which is used to enable DataFrame and CSV exports.
  • pdf (pip install "amazon-textract-textractor[pdf]") includes pdf2image and enables PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
  • torch (pip install "amazon-textract-textractor[torch]") includes sentence_transformers for better word search and matching. This works on CPU but is noticeably slower than the non-machine-learning-based approaches.
  • dev (pip install "amazon-textract-textractor[dev]") includes all the dependencies above and everything else needed to test the code.

You can pick several extras by separating the labels with commas, like this: pip install "amazon-textract-textractor[pdf,torch]".

Documentation

Generated documentation for the latest released version can be accessed here: aws-samples.github.io/amazon-textract-textractor/

Examples

While a collection of simple examples is presented here, the documentation has a much larger collection of examples and specific case studies that will help you get started.

Setup

These two lines are all you need to use Textract. The Textractor instance can be reused across multiple synchronous and asynchronous requests.

from textractor import Textractor

extractor = Textractor(profile_name="default")

Text recognition

# file_source can be an image, list of images, bytes or S3 path
document = extractor.detect_document_text(file_source="tests/fixtures/single-page-1.png")
print(document.lines)
#[Textractor Test, Document, Page (1), Key - Values, Name of package: Textractor, Date : 08/14/2022, Table 1, Cell 1, Cell 2, Cell 4, Cell 5, Cell 6, Cell 7, Cell 8, Cell 9, Cell 10, Cell 11, Cell 12, Cell 13, Cell 14, Cell 15, Selection Element, Selected Checkbox, Un-Selected Checkbox]

Table extraction

from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
	file_source="tests/fixtures/form.png",
	features=[TextractFeatures.TABLES]
)
# Saves the table to an Excel workbook for further processing
document.tables[0].to_excel("output.xlsx")

Form extraction

from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
	file_source="tests/fixtures/form.png",
	features=[TextractFeatures.FORMS]
)
# Use document.get() to search for a key with fuzzy matching
document.get("email")
# [E-mail Address : [email protected]]

Analyze ID

document = extractor.analyze_id(file_source="tests/fixtures/fake_id.png")
print(document.identity_documents[0].get("FIRST_NAME"))
# 'MARIA'

Receipt processing (Analyze Expense)

document = extractor.analyze_expense(file_source="tests/fixtures/receipt.jpg")
print(document.expense_documents[0].summary_fields.get("TOTAL")[0].text)
# '$1810.46'

If your use case was not covered here or if you are looking for asynchronous usage examples, see our collection of examples.
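For reference, here is a minimal asynchronous sketch. It assumes the start_document_text_detection entry point and an S3 bucket you own for uploads (the file and bucket names are hypothetical); see the documentation for the exact signatures in your version.

from textractor import Textractor

extractor = Textractor(profile_name="default")
# Asynchronous call: the document is uploaded to S3 and a Textract job is started.
document = extractor.start_document_text_detection(
    file_source="tests/fixtures/multi-page.pdf",  # hypothetical fixture
    s3_upload_path="s3://my-bucket/uploads/",     # hypothetical bucket
)
# The returned document is lazy: accessing its properties waits for the job to finish.
print(document.lines)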

CLI

Textractor also comes with the textractor script, which supports calling, printing and overlaying directly in the terminal.

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay TABLES

[overlay example image]

See the documentation for more examples.

Tests

The package comes with tests that call the production Textract APIs. Running the tests will incur charges to your AWS account.
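A minimal invocation sketch, assuming the suite uses pytest and that your default AWS profile has Textract access (both are assumptions, not documented here):

python -m pytest tests/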

Acknowledgements

This library was made possible by the work of Srividhya Radhakrishna (@srividh-r).

Contributing

See CONTRIBUTING.md

Citing

Textractor can be cited using:

@software{amazontextractor,
  author = {Belval, Edouard and Delteil, Thomas and Schade, Martin and Radhakrishna, Srividhya},
  title = {{Amazon Textractor}},
  url = {https://github.com/aws-samples/amazon-textract-textractor},
  version = {1.7.5},
  year = {2024}
}

Or using the CITATION.cff file.

License

This library is licensed under the Apache 2.0 License.

Excavator image by macrovector on Freepik


amazon-textract-textractor's Issues

OutputConfig JSON has None values

Sample:

    "Blocks": [{
        "BlockType": "PAGE",
        "ColumnIndex": null,
        "ColumnSpan": null,
        "Confidence": null,
        "EntityTypes": null,
        "Geometry": {
            "BoundingBox": {

Find phrase on page not working with multipage docs

find_phrase_on_page is not working for multipage documents when the search term is actually two words.
I am able to reproduce the error with a document and the two calls below.
The first one works, but the second fails with an error saying that the page numbers are not equal.

response = document.find_phrase_on_page("Any", page_number=2,area_selection=AreaSelection(
                top_left=t2.TPoint(y=0, x=0),lower_right=t2.TPoint(y=doc_height, x=doc_width),
                page_number=2))

response = document.find_phrase_on_page("Any words", page_number=2,area_selection=AreaSelection(
                top_left=t2.TPoint(y=0, x=0),lower_right=t2.TPoint(y=doc_height, x=doc_width),
                page_number=2))

I was able to narrow it down to the related function and call, but I am not sure how to resolve it.
I would appreciate some guidance.

In tgeofinder.py, TGeoFinder.__find_phrase_on_page fails during this call:

words_to_right = self.get_words_to_the_right(
   anker=TGeoFinder.get_area_selection_for_twords([current_word]), number_of_words_to_return=1)

For that call, page_numbers is not part of the arguments, so the wrong area is returned.

Textract missing most of the text in documents.

I am processing some fairly simple PDFs from S3 using Textract document detection. For most of these documents, the returned JSON contains very little text. For example, the PDF located here returns only 778 words and 127 lines across 96 pages. Note that the PDF is very simple in structure, so I am confused why Textract is less effective than something like PyPDF2.

Please let me know if there is something I am missing here. Also, any data on the types of documents that are suitable or unsuitable for Textract would be helpful.

Many thanks.

Repeated data in medical-insights-entities.csv and medical-insights-phi.json

I am planning to use Comprehend Medical in production in a new biomedical research product we are working on. I used Textractor to process an 1143-page PDF of a single patient's medical records. And it worked - amazing!

However, I noticed that the entire set of extracted entities is repeated in the {}-medical-insights-entities.csv file that is generated for each page of the PDF. The file contents look like this ...

Text,Type,Category,Score,BeginOffset,EndOffset
"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18
68,AGE,PROTECTED_HEALTH_INFORMATION,0.22106069326400757,26,28
Penn Medicine,ADDRESS,PROTECTED_HEALTH_INFORMATION,0.1475761979818344,37,50
11/18/2015,DATE,PROTECTED_HEALTH_INFORMATION,0.9999758005142212,51,61
917356281,ID,PROTECTED_HEALTH_INFORMATION,0.9997710585594177,163,172
MYELOMA,DX_NAME,MEDICAL_CONDITION,0.8934065699577332,173,180

... there are lots of valid entries and then the entire set repeats.

"Some random person",NAME,PROTECTED_HEALTH_INFORMATION,0.9943187832832336,0,18
68,AGE,PROTECTED_HEALTH_INFORMATION,0.22106069326400757,26,28
Penn Medicine,ADDRESS,PROTECTED_HEALTH_INFORMATION,0.1475761979818344,37,50
11/18/2015,DATE,PROTECTED_HEALTH_INFORMATION,0.9999758005142212,51,61
917356281,ID,PROTECTED_HEALTH_INFORMATION,0.9997710585594177,163,172
MYELOMA,DX_NAME,MEDICAL_CONDITION,0.8934065699577332,173,180

... etc. The same pattern occurs in the {}-medical-insights-phi.json file. "Id": 0 through "Id": 48 occur as expected and then they repeat - all within the same enclosing list. This pattern occurs for each of the 1143 pages in "Some random person's" medical record. I have not noticed any other obvious fubars yet.

I would attach the file(s) but (a) it really is PHI and (b) there's A LOT of data.
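Until the duplication is fixed, a possible client-side cleanup sketch, assuming pandas is installed (the file names are hypothetical). Because each row carries BeginOffset and EndOffset, exact-duplicate rows can be dropped safely:

import pandas as pd

# Load one page's entities file and drop the repeated block of rows.
df = pd.read_csv("page-1-medical-insights-entities.csv")  # hypothetical name
df = df.drop_duplicates()
df.to_csv("page-1-medical-insights-entities.dedup.csv", index=False)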

Visualize document.expense_documents

It would be great if we could visualize expense_documents and the associated normalized summary fields directly on the document as well, similar to how we currently visualize KV containers, for example.

[Feature request] Configurable async job poll time in Caller

Textract Caller currently hard-codes the polling interval for checking whether an async analysis job is done to 1 second, here.

This seems like a nice, speedy default for single-threaded calls. However, it becomes the limiting factor for scaling up via local thread parallelism because the default quota on GetDocumentAnalysis API calls is relatively low (reported as 5TPS here or 1TPS here) - as compared to the quota for concurrent running analyses, which is in the hundreds.

So if a user (like me) is doing a PoC with a few hundred PDF documents on S3 and hoping to get them processed promptly without going the whole hog and deploying something like the large scale serverless processing sample, they can achieve a decent speed-up quite easily by using Textract Caller with a multiprocessing.pool.ThreadPool... But, can't increase the pool size beyond about 5 (in my us-east-2 tests at least), because 5 threads times 1/sec = 5TPS on the Get* API.

I'm not saying TextractCaller + thread parallelism is a great/scalable practice, but processing datasets ~5 times faster with a small bit of ThreadPool logic can be quite helpful... And seems like giving flexibility to increase the poll period could quite easily enable going up to ~10-20 requests in parallel.

Of course it would also be great if this library exposed some more built-in functionality for processing batches of docs with concurrency - instead of leaving this to user code.
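For context, a minimal sketch of the ThreadPool pattern described above, assuming call_textract from amazon-textract-caller and documents already uploaded to S3 (the URIs are hypothetical); a configurable poll interval is what would let the pool grow beyond about 5 threads.

from multiprocessing.pool import ThreadPool

from textractcaller import call_textract

# Hypothetical list of documents already in S3.
s3_uris = [f"s3://my-bucket/doc-{i}.pdf" for i in range(100)]

def process(uri):
    # force_async_api starts an async job and polls until it completes;
    # the hard-coded 1-second poll interval caps Get* calls at ~1 TPS per thread.
    return call_textract(input_document=uri, force_async_api=True)

with ThreadPool(5) as pool:
    results = pool.map(process, s3_uris)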

Async AnalyzeExpense does not work

to reproduce:
invoice_multipage.pdf

import textractor
from textractor import Textractor
extractor = Textractor(profile_name="default")
document = extractor.start_expense_analysis(
    file_source='./invoice_multipage.pdf',
    s3_output_path='s3://<BUCKET>',
    s3_upload_path='s3://<BUCKET>'
)

Need an option to save output in UTF-8 encoding to avoid saving as Windows-1252 encoding

It looks like the only way to capture the output of amazon-textract is to redirect it into a file. Such as:

amazon-textract --input-document "s3://somebucket/2022-04-16-0010.jpg" --pretty-print LINES > 2022-04-16-0010.txt

Unfortunately, this is a problem on Windows because the default encoding is Windows 1252, not UTF-8. When trying to analyze the output using other tools, UTF-8 is often required.

Something like this would be very useful:

amazon-textract --input-document "s3://somebucket/2022-04-16-0010.jpg" --pretty-print LINES --output-document 2022-04-16-0010.txt

where the default output is UTF-8.
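In the meantime, a possible PowerShell workaround using standard Out-File behavior (note that Windows PowerShell 5 writes UTF-8 with a BOM):

amazon-textract --input-document "s3://somebucket/2022-04-16-0010.jpg" --pretty-print LINES | Out-File -Encoding utf8 2022-04-16-0010.txt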

S3 folder limited to one 'page' of objects

In textractor.py, we currently seem to hard-code a limit of one page of S3 objects when calling S3Helper.getFileNames() to list the objects in an S3 folder input, even though the underlying helper seems capable of pagination.

It looks from the Boto3 docs like the default page length is 1000 objects, so this would only come up for pretty big jobs, but I worry it might trip some folks up, as I don't think there is any obvious message telling the user that this limit is being applied.

My 2 cents is that the potential detriment to user experience from the hidden limit is worse than the risk of somebody pointing the script at a bigger S3 folder than they expected. Textractor already outputs frequent diagnostics about the number of docs it is processing, and given the serial nature of the queue it is not going to push a massive volume through the APIs before anybody can spot the problem and interrupt it. But maybe others have different perspectives?
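For reference, listing every object regardless of count is a standard boto3 pagination pattern (the bucket and prefix here are hypothetical):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
# Iterates over as many pages as needed; each page holds up to 1000 keys.
for page in paginator.paginate(Bucket="my-bucket", Prefix="docs/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])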

Implement smoke tests

Recent issues such as #121, #122, and #123 showed that our current test suite is inadequate and several edge cases are missing. This issue outlines a plan for implementing smoke tests to identify issues with the core APIs, namely:

  • detect_document_text
  • start_document_text_detection
  • analyze_document
  • start_document_analysis
  • analyze_id
  • analyze_expense
  • start_expense_analysis
  • get_result

Each API should be callable with a Pillow image (lowest common-denominator) and the result should be visualizable.
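A minimal smoke-test sketch for one of these APIs, assuming pytest, the existing fixture image, and a configured AWS profile; this is an illustration of the plan, not the project's actual test suite.

from PIL import Image

from textractor import Textractor

def test_detect_document_text_smoke():
    extractor = Textractor(profile_name="default")
    # A Pillow image is the lowest-common-denominator input mentioned above.
    image = Image.open("tests/fixtures/single-page-1.png")
    document = extractor.detect_document_text(file_source=image)
    # The result should be non-empty and visualizable.
    assert len(document.lines) > 0
    document.visualize()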

duplicates in checkboxes with same value

I have an issue with checkboxes; let me explain the problem.
My form has checkboxes with the same values, for example: 1) Gender [ ] Male [ ] Female and 2) Sex [ ] Male [ ] Female.
I am getting results like "Male selected" and "Female unselected", i.e. only the checkboxes with their status. But I want to extract each checkbox with its status paired with its question or key, such as Gender or Sex.

Please respond ASAP, thank you.

amazon-textract helper not working on Windows 10

After run

python -m pip install amazon-textract-helper

It creates a file named "amazon-textract" at %LOCALAPPDATA%\Programs\Python\Python38\Scripts.

Note that it is named "amazon-textract", not "amazon-textract.py", so Windows 10 does not know how to execute it and asks with the usual popup window "How do you want to open this file?"

The following command works ok on power shell.

python $env:LOCALAPPDATA\Programs\Python\Python38\Scripts\amazon-textract --help

Modification to the Document entity from response is not captured when using .to_trp2()

The conversion to trp2 is based on the initial response. This does not capture any modifications made to the entities, such as OCR post-processing, correction, or deletion of entities. A proper converter needs to be implemented to make the library usable for post-processing in-place modifications.

This would allow workflows along the lines of:

document.pages[1].key_values = {key: value + '_edited' for key, value in document.pages[1].key_values.items()}
document.export("document.json")

We also need to add utility functions such as:

  • Merging tables
  • Adding new keys
  • Adding new queries output

prettyprinter convert_queries_to_list_trp2 returns wrong list when answers are missing

instead of:

2022-07-08T17:44:41+00:00 AWS_PAYSTUBS Paystub_1_reMars.json 1 PAYSTUB_PERIOD_COMPANY_NAME 1
ANYCOMPANY INC. USA 96 0 0 0 0 0.095238097 0.154919237 0.633219957 0.10866373

returns

2022-07-08T17:44:41+00:00 AWS_PAYSTUBS Paystub_1_reMars.json 1 What is the Pay Period Start Date?
0 0 0 0 0 0 0 0

So, it returns the name and misses the confidence values.

Overlayer broken with DocumentDimensions not subscriptable

Traceback (most recent call last):
File "/Users/schadem/code/github/aws-samples/amazon-textract-textractor/helper/bin/amazon-textract", line 259, in <module>
bounding_box_list = get_bounding_boxes(textract_json=doc,
File "/Users/schadem/code/github/aws-samples/amazon-textract-textractor/overlayer/textractoverlayer/t_overlay.py", line 104, in get_bounding_boxes
page_dimensions = document_dimensions[page_number]
TypeError: 'DocumentDimensions' object is not subscriptable

Access denied StartDocumentTextDetection

Hi,

when running python3 textractor.py --documents s3://inputdatatimo/document.pdf --region us-east-1 --text I get following error message:

botocore.errorfactory.AccessDeniedException: An error occurred (AccessDeniedException) when calling the StartDocumentTextDetection operation:

though I granted (full access) permissions to the IAM user for S3 and Textract.
Are there any further permissions needed to make this run?

Many thanks
schatimo

Text is extracted but not grouped into forms and tables correctly

We're starting an invoice processing project and really like this library, but we're having one interesting issue: The text is all parsed correctly, but then it is not always grouped into forms and tables correctly.

So for example we could have this block in the text file:

SOLD TO:
SHIP TO:
CUSTOMER COMPANY
JANE RECIPIENT
123 SOMEWHERE ST
987 ANOTHER PL
LOS ANGELES CA 90001
SAN FRANCISCO CA 94100
USA
USA

But then it does not show up in a SOLD TO: or SHIP TO: forms or tables.

When it does show up, the confidence level is low (around 37), but the confidence of the text itself is very high:
"BlockType": "LINE",
"Confidence": 99.85072326660156,

Is this a problem of the relative position of the text versus their labels?

Should we try to adjust the forms/tables parsing algorithm? Or should we just work with the text and try to go with the repeating patterns of text, and not worry about forms and tables?

support single page PDF

Textract launched single-page PDF support; it should be supported in the caller.
Currently PDF defaults to async - we don't want to change that. Maybe add a force_sync param to the method.


Cannot visualize expense document

import textractor
from textractor import Textractor
extractor = Textractor(profile_name="default")
document = extractor.analyze_expense(
    file_source='./invoice.png',
)
document.visualize()
[]

Import of variables is failing

The import in the sample code is failing.

from textractcaller import call_textract, Textract_Features
features = [Textract_Features.TABLES]
print(features)

from textractcaller import call_textract, Textract_Features
ImportError: cannot import name 'Textract_Features' from 'textractcaller' 

no test for overlayer

The PDF dimension get_width_height_from_file call in image_tools.py was broken but not detected, as tests were missing.
Adding tests.

Character encoding

Likely a Windows problem:

Total Pages in Document: 77
Traceback (most recent call last):
File "textractorsandbox.py", line 169, in
Textractor().run()
File "textractorsandbox.py", line 148, in run
self.processDocument(ips, i, document)
File "textractorsandbox.py", line 113, in processDocument
opg.run()
File "D:\amazon-textract-textractor\src\og.py", line 111, in run
self._outputWords(page, p)
File "D:\amazon-textract-textractor\src\og.py", line 30, in _outputWords
csvFieldNames, csvData)
File "D:\amazon-textract-textractor\src\helper.py", line 122, in writeCSV
writer.writerow(row)
File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\singan\lib\csv.py", line 155, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\singan\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 37: character maps to <undefined>
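A possible fix sketch for the helper's CSV writing (hypothetical function and argument names; the real writeCSV lives in helper.py), forcing UTF-8 regardless of the Windows default code page:

import csv

def write_csv(csv_path, field_names, rows):
    # encoding="utf-8" avoids cp1252 charmap errors for characters like '\u20b9'.
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=field_names)
        writer.writeheader()
        writer.writerows(rows)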

PDF must be in S3 bucket?

When running the following command:

python3 textractor.py --documents document-name.pdf --text --forms --tables

I get an error Exception: PDF must be in S3 bucket. Can it not be run on a local document?

Full error message below:

Traceback (most recent call last):
File "textractor.py", line 164, in
Textractor().run()
File "textractor.py", line 143, in run
self.processDocument(ips, i, document)
File "textractor.py", line 98, in processDocument
dp = DocumentProcessor(ips["bucketName"], document, ips["awsRegion"], ips["text"], ips["forms"], ips["tables"])
File "/mnt/c/Users/Username/Documents/textractor/tdp.py", line 218, in init
raise Exception("PDF must be in S3 bucket.")
Exception: PDF must be in S3 bucket.
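A simple workaround sketch until local PDFs are supported: upload the PDF to S3 first with the standard boto3 call (the bucket name is hypothetical), then point the script at the S3 URI.

import boto3

# Upload the local PDF, then run:
# python3 textractor.py --documents s3://mybucket/document-name.pdf --text --forms --tables
boto3.client("s3").upload_file("document-name.pdf", "mybucket", "document-name.pdf")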

Queries cannot be visualized

doc.queries.visualize() returns an error

document.queries.visualize()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-36-09589854c2af> in <module>
----> 1 document.queries.visualize()

~/anaconda3/lib/python3.7/site-packages/textractor/visualizers/entitylist.py in visualize(self, with_text, with_confidence, font_size_ratio)
    109                 with_text,
    110                 with_confidence,
--> 111                 font_size_ratio,
    112             )
    113 

~/anaconda3/lib/python3.7/site-packages/textractor/visualizers/entitylist.py in _draw_bbox(entities, with_text, with_confidence, font_size_ratio)
    576     for entity in entities:
    577         width, height = image.size
--> 578         overlayer_data = _get_overlayer_data(entity, width, height)
    579         drw.rectangle(
    580             xy=overlayer_data["coords"], outline=overlayer_data["color"], width=2

~/anaconda3/lib/python3.7/site-packages/textractor/visualizers/entitylist.py in _get_overlayer_data(entity, width, height)
    711     data["coords"] = [x, y, x + w, y + h]
    712     data["confidence"] = (
--> 713         entity.confidence if not entity.__class__.__name__ == "Table" else ""
    714     )
    715     data["text_color"] = (0, 0, 0)

AttributeError: 'Query' object has no attribute 'confidence'

I suggest we highlight the results of the queries on the document visualization and index them, e.g. query_1.

QueriesConfig not falsy when empty

This fails:

    queries_config = QueriesConfig(queries=[])
    assert not queries_config.get_dict()

and therefore creates problems with simplifying the call_textract call.
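A possible fix sketch (hypothetical; the actual QueriesConfig internals may differ): return an empty dict when no queries are set, so that the config is falsy and the assertion above passes.

def get_dict(self):
    # An empty dict is falsy, so `assert not queries_config.get_dict()` holds.
    if not self.queries:
        return {}
    return {"Queries": [query.get_dict() for query in self.queries]}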

Cannot visualize AnalyzeID documents

import textractor
from textractor import Textractor
extractor = Textractor(profile_name="default")
document = extractor.analyze_id(
    file_source='./id.png',
)
document.visualize()
[]

_parseDocumentPagesAndBlockMap NoneType Error

Running into an error with specific PDF documents. I believe this is due to _parseDocumentPagesAndBlockMap not detecting blocks.

Traceback (most recent call last):
  File "c:\Users\User\amazon-textract-textractor\textractor.py", line 164, in <module>
    Textractor().run()
  File "c:\Users\User\amazon-textract-textractor\textractor.py", line 143, in run
    self.processDocument(ips, i, document)
  File "c:\Users\User\amazon-textract-textractor\textractor.py", line 109, in processDocument
    ips["forms"], ips["tables"])
  File "c:\Users\User\amazon-textract-textractor\og.py", line 13, in __init__
    self.document = Document(self.response)
  File "c:\Users\User\amazon-textract-textractor\trp.py", line 599, in __init__
    self._parse()
  File "c:\Users\User\amazon-textract-textractor\trp.py", line 631, in _parse
    self._responseDocumentPages, self._blockMap = self._parseDocumentPagesAndBlockMap()
  File "c:\Users\User\amazon-textract-textractor\trp.py", line 614, in _parseDocumentPagesAndBlockMap
    for block in page['Blocks']:
TypeError: 'NoneType' object is not subscriptable

Invalid input: Document or path to a foler or S3 bucket containing documents is required.

Hello - I'm running into errors when trying to process a PDF on my S3 bucket.

First thing I did was reference the ReadMe and tried the following:

py textractor.py --documents s3://bucketname/filename.pdf --text --forms --tables

Error I received:

Invalid input: An error occurred (InvalidToken) when calling the GetBucketLocation operation: The provided token is malformed or otherwise invalid.
Valid format:

  • python3 textractor.py --documents mydoc.jpg --text --forms --tables --region us-east-1
  • python3 textractor.py --documents ./myfolder/ --text --forms --tables
  • python3 textractor.py --document s3://mybucket/mydoc.pdf --text --forms --tables
  • python3 textractor.py --document s3://mybucket/ --text --forms --tables

Traceback (most recent call last):
File "textractor.py", line 164, in <module>
Textractor().run()
File "textractor.py", line 135, in run
totalDocuments = len(ips["documents"])
TypeError: 'NoneType' object is not subscriptable

I noticed there's no "s" when trying to process a file on s3, so I tried this:

py textractor.py --document s3://bucketname/filename.pdf --text --forms --tables

Got this error:

Invalid input: Document or path to a foler or S3 bucket containing documents is required.
Valid format:

  • python3 textractor.py --documents mydoc.jpg --text --forms --tables --region us-east-1
  • python3 textractor.py --documents ./myfolder/ --text --forms --tables
  • python3 textractor.py --document s3://mybucket/mydoc.pdf --text --forms --tables
  • python3 textractor.py --document s3://mybucket/ --text --forms --tables
Traceback (most recent call last):
File "textractor.py", line 164, in <module>
Textractor().run()
File "textractor.py", line 135, in run
totalDocuments = len(ips["documents"])
TypeError: 'NoneType' object is not subscriptable

Additionally... I tried running a test file locally (I chose a JPG and not a PDF since I get another message when trying to process a PDF locally stating I need to upload the PDF to an S3 bucket)

py textractor.py --documents test.jpg --text

And it seemed to work but ended up getting an error:

Textracting Document # 1: test.jpg
Calling Textract...
Traceback (most recent call last):
File "textractor.py", line 164, in
Textractor().run()
File "textractor.py", line 143, in run
self.processDocument(ips, i, document)
File "textractor.py", line 99, in processDocument
response = dp.run()
File "C:\Users\CamiLaPorte\Desktop\Checkbox-Textract\amazon-textract-textractor\src\tdp.py", line 232, in run
response = ip.run()
File "C:\Users\CamiLaPorte\Desktop\Checkbox-Textract\amazon-textract-textractor\src\tdp.py", line 79, in run
response = self._callTextract()
File "C:\Users\CamiLaPorte\Desktop\Checkbox-Textract\amazon-textract-textractor\src\tdp.py", line 42, in _callTextract
response = textract.detect_document_text(Document={'Bytes': imageBytes})
File "C:\Users\CamiLaPorte\Anaconda3\lib\site-packages\botocore\client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "C:\Users\CamiLaPorte\Anaconda3\lib\site-packages\botocore\client.py", line 661, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (UnrecognizedClientException) when calling the DetectDocumentText operation: The security token included in the request is invalid

I've used textractor code multiple times in the last year and have never had this issue. It's only been with a recent download of the repo and code that I'm experiencing this. There have been no changes to my machine and I'm running the same environments (I've even tried multiple environments as well.)

Please let me know if you need more information. Thanks!

amazon-textract helper not working with --stdin

When using --stdin, I receive:

Traceback (most recent call last):
File "/Users/schadem/.pyenv/versions/3.9.6/bin/amazon-textract", line 193, in <module>
if len(input_document) > 7 and input_document.lower().startswith("s3://"):
TypeError: object of type 'NoneType' has no len()
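A possible fix sketch (hypothetical variable names): populate input_document from standard input before the length check, so it is never None when --stdin is used.

import sys

# Read the document bytes from stdin instead of leaving input_document as None.
if args.stdin:
    input_document = sys.stdin.buffer.read()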

[Feature request] Output destination option

Thanks for the nice utility! However, my working directory is now an absolute mess 😂

It would be really helpful if something like an --output CLI option was available where we could specify the destination for the output files. Ideally you could supply an S3 URI, but even being able to give a folder would be a big help!

(Maybe I'm just missing some obvious way to do this? Using Textractor to process many tens of documents in batch for a quick PoC)

Can not force sync

When sending in a document, it is not clear whether async or sync is called.
We can force async, but not sync.
Accidentally sending a multi-page document when expecting sync calls can lead to timeouts and errors.

File name with space

If the file name contains a space, the document is not processed. The loop continues with "IN PROGRESS" indefinitely. I have tested this in two environments, same behavior.

Translation JSON response

When we use --translate, we get the translation for each page, but the consolidated JSON response (-response.json) is not translated. How can we generate the translation in the final JSON as well?

[Feature request] Preserve folder structure

When applying textractor to a local folder or S3 prefix with an inner folder structure, it would be really useful if output files were also mapped to the same folder structure - rather than flattened out by filename only.

For example processing a parent folder containing invoices/ABC123.pdf and purchase-orders/DEF456.pdf currently seems to generate ABC123-response.json and DEF456-response.json - which then need to be mapped back to their categories afterwards.

page number is overwritten in function find_phrase_in_lines

The page number is overwritten if you pass it to the function, because it is reset within the for loop.
Plus, the page number is not considered as a search criterion.

Source code snippet from line 1091ff:

def find_phrase_in_lines(
        self, phrase: str, min_textdistance=0.6, page_number: int = 1
    ) -> List[TWord]:
        """
        phrase = words seperated by space char
        """
        # first check if we already did find this phrase and stored it in the DB
        # TODO: Problem: it will not find Current: when the phrase has current and there are other current values in the document without :
        if not phrase:
            raise ValueError(f"no valid phrase: '{phrase}")
        phrase_words = phrase.split(" ")
        if len(phrase_words) < 1:
            raise ValueError(f"no valid phrase: '{phrase}")
        # TODO: check for page_number impl
        found_phrases: "list[TWord]" = self.ocrdb.select_text(
            textract_doc_uuid=self.textract_doc_uuid,
            text=make_alphanum_and_lower_for_non_numbers(phrase),
        )
        print("after ocrdb.select_text")
        if found_phrases:
            print("phrases found")
            return found_phrases

        alphanum_regex = re.compile(r"[\W_]+")
        # find phrase (words that follow each other) in trp lines
        for page in self.doc.pages:
            page_number = 1
            for line in page.lines:
......
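A possible fix sketch (hypothetical): respect the caller's page_number instead of resetting it inside the loop, and skip pages that do not match.

# Inside find_phrase_in_lines, replacing the loop shown above:
for page_idx, page in enumerate(self.doc.pages, start=1):
    if page_idx != page_number:
        continue  # only search the requested page
    for line in page.lines:
        # ... existing phrase-matching logic continues here ...
        pass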
