Comments (7)
Thanks @darwaishx! It would be great to add this to the README to clarify it for future users.
from amazon-textract-textractor.
The Textract async API expects the PDF as an S3 object.
The example below shows how to take a PDF on local disk, extract the individual images, and process them.
https://github.com/aws-samples/amazon-textract-searchable-pdf/blob/master/src/SearchablePDF/src/main/java/DemoPdfFromLocalPdf.java
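For readers who would rather go the S3 route described above, here is a minimal Python sketch, not taken from the linked Java sample. It assumes boto3-style clients are passed in; `upload_and_start_job` and `build_document_location` are hypothetical helper names, and the bucket and key in the usage comment are placeholders.

```python
from typing import Any, Dict


def build_document_location(bucket: str, key: str) -> Dict[str, Any]:
    """Build the DocumentLocation argument that the async Textract calls take."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def upload_and_start_job(s3_client, textract_client, local_path: str,
                         bucket: str, key: str) -> str:
    """Upload a local PDF to S3, then start an async text-detection job on it."""
    # The async calls take an S3 reference rather than local bytes,
    # so the file has to be in a bucket first.
    s3_client.upload_file(local_path, bucket, key)
    response = textract_client.start_document_text_detection(
        DocumentLocation=build_document_location(bucket, key)
    )
    # The JobId is what get_document_text_detection needs to fetch results later.
    return response["JobId"]


# Usage (needs AWS credentials and an existing bucket; names are placeholders):
#   import boto3
#   job_id = upload_and_start_job(
#       boto3.client("s3"),
#       boto3.client("textract", region_name="us-east-1"),
#       "Invoice_INV300351.pdf", "my-example-bucket", "uploads/Invoice_INV300351.pdf",
#   )
```

The clients are passed in rather than created inside the helper so the function can be exercised with stand-ins and without credentials.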
Yeah @darwaishx that README is pretty confusing...at least it got me on this same issue as well.
There are a couple sections in particular:
- python3 textractor.py --documents [file|folder|S3Object|S3Folder] --text --forms --tables --region [AWSRegion] --insights --medical-insights --translate [LanguageCode]
- Argument: --documents. Description: Name of the document or local folder/S3 bucket
which seem to reference (or at least mistakenly imply) the ability for this script to accept local files directly.
It would probably help to distinguish more clearly between 'documents' and 'pdf'.
Beautifully written samples and textractor local tools
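Ran into this as well; sending the PDF as bytes works instead:
import pickle
import boto3
# call_textract is assumed here to come from the amazon-textract-caller package
from textractcaller.t_call import call_textract

file_name = 'Invoice_INV300351.pdf'
client = boto3.client('textract', region_name='us-east-1')
# Read the local PDF into memory and pass the raw bytes to Textract
with open(file_name, "rb") as sample_file:
    b = bytearray(sample_file.read())
response = call_textract(input_document=b, boto3_textract_client=client)
# Pickle the response so it can be inspected later without calling Textract again
with open(f'response_{file_name.split(".")[0]}.pk', 'wb') as out_file:
    pickle.dump(response, out_file)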
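To check what came back, the pickled response can be loaded again and its text lines pulled out. This is only a sketch: `load_response` and `extract_lines` are hypothetical helper names, and it assumes the response is the usual Textract dict with a `Blocks` list whose `LINE` blocks carry a `Text` field.

```python
import pickle
from typing import Any, Dict, List


def load_response(path: str) -> Dict[str, Any]:
    """Load a Textract response that was previously saved with pickle.dump."""
    with open(path, "rb") as f:
        return pickle.load(f)


def extract_lines(response: Dict[str, Any]) -> List[str]:
    """Return the text of every LINE block, in the order Textract returned them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Usage, with the file written by the snippet above:
#   for line in extract_lines(load_response("response_Invoice_INV300351.pk")):
#       print(line)
```

Only unpickle files you wrote yourself, since pickle can execute arbitrary code on load.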
Thx @grantrosse, can you create a new ticket? In 2019 (the original ticket date), Textract did indeed not support passing in a PDF from the local filesystem. However, that has changed now and we should update that behavior.