doeextractor's Introduction

doeextractor's People

Contributors

aldnav

doeextractor's Issues

Textract UnsupportedDocumentException while using PNG for Bytes

  • doeextractor version: 0.1.0
  • Python version: Python 3.9.13
  • Operating System: Mac (ARM)

Description

Calling Textract with the document passed as Bytes raises the error below, even though the code follows the AWS sample code:
botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

What I Did

$ doeextractor extract '/Users/aldnav/pro/retailprices/reports/2022-05-18/petro_min_2022-may-10.pdf'
Saved 11 pages to /Users/aldnav/pro/doeextractor/output/petro_min_2022-may-10
Analyzing...
9.090909090909092%
Traceback (most recent call last):
  File "/Users/aldnav/.virtualenvs/doeextractor-cyiER_AY/bin/doeextractor", line 33, in <module>
    sys.exit(load_entry_point('doeextractor', 'console_scripts', 'doeextractor')())
  File "/Users/aldnav/.virtualenvs/doeextractor-cyiER_AY/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/aldnav/.virtualenvs/doeextractor-cyiER_AY/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/aldnav/.virtualenvs/doeextractor-cyiER_AY/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/aldnav/.virtualenvs/doeextractor-cyiER_AY/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/aldnav/.virtualenvs/doeextractor-cyiER_AY/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/aldnav/pro/doeextractor/doeextractor/cli.py", line 140, in extract
    do_textract(
  File "/Users/aldnav/pro/doeextractor/doeextractor/textractor.py", line 142, in extract
    csv_results = get_table_csv_results(input_file_path)
  File "/Users/aldnav/pro/doeextractor/doeextractor/textractor.py", line 52, in get_table_csv_results
    response = client.analyze_document(
  File "/Users/aldnav/.virtualenvs/doeextractor-cyiER_AY/lib/python3.9/site-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/aldnav/.virtualenvs/doeextractor-cyiER_AY/lib/python3.9/site-packages/botocore/client.py", line 911, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format
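
One way to narrow this down is to check what the bytes actually contain before they are sent. The sketch below is only an assumption about a useful debugging step, not the project's code: `sniff_format` and `analyze_if_supported` are hypothetical names, and the format list (PNG, JPEG, PDF, TIFF) is what I understand the synchronous Textract operations to accept. It inspects the leading magic bytes, so a page image that was base64-encoded or saved in another format is caught locally instead of by the service.

```python
from typing import Optional

# Leading "magic" bytes of the formats assumed to be accepted by Textract.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"%PDF-", "pdf"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
)


def sniff_format(data: bytes) -> Optional[str]:
    """Return the detected format name, or None if no signature matches."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None


def analyze_if_supported(path: str) -> dict:
    """Hypothetical wrapper: validate the file locally, then call Textract."""
    with open(path, "rb") as fh:
        data = fh.read()  # raw bytes, not base64-encoded
    if sniff_format(data) is None:
        raise ValueError(f"{path}: unrecognized format, first bytes {data[:8]!r}")
    import boto3  # imported lazily so the check above works without AWS set up

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": data}, FeatureTypes=["TABLES"]
    )
```

If `sniff_format` returns None for one of the saved pages, the problem is likely in how the page image was written or encoded rather than in the Textract call itself.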

Parse extracted tables from Textract

Description

Parse Textract results in the same way as Tabula results.
The parser also needs to handle nuances such as the uncategorized tokens below.

$ doeextractor parse samples/petro_min_2022-may-10.csv
Parse extracted tables
Skip first cell: True
Header: ['AREA ', 'PRODUCT ', 'PETRON ', 'SHELL ', 'CALTEX ', 'PHOENIX ', 'FLYING V ', 'SEAOIL ', 'JETTI ', 'MY GAS ', 'INDEPENDENT ', 'OVERALL DANCE ', 'COMMON DDICE ', 'AVERAGE DDICE ', '']
Uncategorized token: region xiii
Uncategorized token: ron 91 diesel
Uncategorized token: none none
Uncategorized token: - -
Uncategorized token: - n.a
Uncategorized token: barmm
Uncategorized token: region xi
Uncategorized token: city
Uncategorized token: tagum
Uncategorized token: region xii
Uncategorized token: general santos
Uncategorized token: city
Uncategorized token: koronadal
Uncategorized token: city
Uncategorized token: region x
Uncategorized token: cagayan de oro
Uncategorized token: city
Uncategorized token: kerosene ron 97
Uncategorized token: - -
Uncategorized token: - -
Uncategorized token: - -
Uncategorized token: - -
Uncategorized token: none none
Uncategorized token: liquid
Uncategorized token: fuels price
Uncategorized token: range
Uncategorized token: range
Uncategorized token: price
Uncategorized token: price
Uncategorized token: region ix
Uncategorized token: isabela
Uncategorized token: ron 95 ron 91
Uncategorized token: - 81.85 81.85
Uncategorized token: - 81.85 81.85
Uncategorized token: none 81.85
Uncategorized token: none 81.85
Uncategorized token: city
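
Many of the uncategorized tokens above are several values merged into one cell, such as "ron 91 diesel" or "- 81.85 81.85". A minimal sketch of one way to split and label them follows. The category names, the regular expression, and the `categorize` function are all hypothetical and are not taken from the existing parser; place names such as "tagum" are deliberately left as "unknown".

```python
import re
from typing import List, Tuple

# Each alternative must be delimited by whitespace or the cell boundary.
TOKEN_RE = re.compile(
    r"(?<!\S)(?:"
    r"(?P<region>region\s+[ivx]+|barmm)"
    r"|(?P<product>ron\s+\d+|diesel|kerosene)"
    r"|(?P<price>\d+\.\d+)"
    r"|(?P<missing>none|n\.a\.?|-)"
    r")(?!\S)"
)


def categorize(cell: str) -> List[Tuple[str, str]]:
    """Split a cell into (category, text) pairs, keeping leftovers as 'unknown'."""
    cell = cell.strip().lower()
    parts = []
    pos = 0
    for match in TOKEN_RE.finditer(cell):
        leftover = cell[pos:match.start()].strip()
        if leftover:
            parts.append(("unknown", leftover))
        parts.append((match.lastgroup, match.group()))
        pos = match.end()
    leftover = cell[pos:].strip()
    if leftover:
        parts.append(("unknown", leftover))
    return parts


print(categorize("ron 91 diesel"))   # [('product', 'ron 91'), ('product', 'diesel')]
print(categorize("- 81.85 81.85"))   # [('missing', '-'), ('price', '81.85'), ('price', '81.85')]
```

Area names split across cells ("general santos" followed by "city") would still need a separate merge step, which this sketch does not attempt.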

Document the requirements

Also write the docs in Markdown, then use MkDocs to generate documentation that is ready for Read the Docs.
