Process PDF invoices with Amazon Textract

PDF - The most machine-readable document format ever! Right? 🙈

"Extracting text from PDF files is not a simple operation. PDF was never meant to be a format to read data from: its purpose is to provide an accurate way of reproducing documents and make them portable to any system." (from "How to read PDF files with RPA Framework")

Still, it is possible to read and extract invoice data from PDF documents automatically and save the data to an Excel file. No more manual copying and pasting!

This robot processes randomly generated PDF invoices with Amazon Textract and saves the extracted invoice data in an Excel file.

[Image: Example PDF invoice]

[Image: Example Excel file]

Tasks

The robot provides three tasks:

  • Create Invoices
  • Process PDF invoices with Amazon Textract
  • Delete Files From Amazon S3 Bucket

Create Invoices

  • Generates random PDF invoices and uploads them to an Amazon S3 bucket.
  • Saves the generated PDF invoices to the output directory for debugging purposes.

Process PDF invoices with Amazon Textract

  • Reads the invoices from the Amazon S3 bucket.
  • Processes the invoices with Amazon Textract.
  • Saves the extracted invoice data in an Excel file in the output directory.
  • Finally, deletes the PDF invoices from the Amazon S3 bucket.
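
The extraction step can be illustrated with a minimal Python sketch. It is not the robot's own code: it assumes the invoices were analyzed with form extraction enabled, so that the Textract response contains `KEY_VALUE_SET` blocks, and it only handles plain word values. The function names and the sample response are made up for illustration.

```python
def _child_text(block, blocks_by_id):
    """Join the text of all WORD children of a block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Return a {key text: value text} dict from a Textract-style response."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    result = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        key = _child_text(block, blocks_by_id)
        value_parts = []
        # A KEY block points at its VALUE block(s) through a VALUE relationship.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(
                        _child_text(blocks_by_id[value_id], blocks_by_id)
                    )
        result[key] = " ".join(part for part in value_parts if part)
    return result


# Hand-written sample data (not real Textract output), for illustration only.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "number:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "12345"},
    ]
}

print(extract_key_values(SAMPLE_RESPONSE))  # {'Invoice number:': '12345'}
```

Each resulting dictionary could then become one row of the Excel file, with the keys as column headers.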

Delete Files From Amazon S3 Bucket

  • A utility task for deleting the PDF invoices from the Amazon S3 bucket.
  • Can be executed separately when you want to empty the Amazon S3 bucket.
  • Called by the Process PDF invoices with Amazon Textract task in the teardown phase.
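
As a rough illustration of what emptying the bucket involves, here is a Python sketch written against a boto3-style S3 client. It is an assumption about one possible approach, not the robot's actual implementation; the function name `empty_bucket` and the bucket name are hypothetical. The client is passed in as a parameter so the function can be tried with a stand-in object.

```python
def empty_bucket(s3_client, bucket_name):
    """Delete every object in the bucket and return how many were deleted.

    `s3_client` is expected to behave like a boto3 S3 client, providing
    `get_paginator("list_objects_v2")` and `delete_objects`.
    """
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name):
        # Pages of an empty bucket have no "Contents" entry.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not keys:
            continue
        # A listing page holds at most 1000 keys, which is also the
        # per-request limit of delete_objects.
        s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
        deleted += len(keys)
    return deleted
```

With boto3 installed and credentials configured, a call might look like `empty_bucket(boto3.client("s3"), "my-invoice-bucket")`, where the bucket name is a placeholder.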

Prerequisites

Amazon API key and key ID with access to Amazon S3 and Amazon Textract

The robot requires access to the Amazon S3 and Amazon Textract services. It needs an API key, a key ID, and the AWS region. See the Amazon Textract Developer Guide for details.

Store the API key, key ID, and the AWS region in Robocorp Vault

Set up Robocorp Vault either locally or in Control Room.

For a local run, use the following configuration:

/Users/username/vault.json:

{
  "aws": {
    "AWS_KEY": "aws-key",
    "AWS_KEY_ID": "aws-key-id",
    "AWS_REGION": "us-east-1"
  }
}

devdata/env.json:

{
  "RPA_SECRET_MANAGER": "RPA.Robocorp.Vault.FileSecrets",
  "RPA_SECRET_FILE": "/Users/username/vault.json"
}
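
To check a local setup before running the robot, a small standard-library Python sketch can follow the same chain the two files describe: `env.json` points to the vault file, and the vault file holds the `aws` secret. The helper name `load_aws_secret` is hypothetical; inside the robot the secret would normally be read through the Robocorp Vault library rather than code like this.

```python
import json
from pathlib import Path

# Keys the "aws" vault entry is expected to contain, per the configuration above.
REQUIRED_KEYS = ("AWS_KEY", "AWS_KEY_ID", "AWS_REGION")


def load_aws_secret(env_path, secret_name="aws"):
    """Read env.json, follow RPA_SECRET_FILE to the vault file, return the secret."""
    env = json.loads(Path(env_path).read_text(encoding="utf-8"))
    vault = json.loads(Path(env["RPA_SECRET_FILE"]).read_text(encoding="utf-8"))
    secret = vault[secret_name]
    # Treat missing and empty values the same way: both would break the AWS calls.
    missing = [key for key in REQUIRED_KEYS if not secret.get(key)]
    if missing:
        raise KeyError(
            f"Vault secret '{secret_name}' is missing: {', '.join(missing)}"
        )
    return secret
```

Calling `load_aws_secret("devdata/env.json")` from the robot's root directory returns the dictionary with the three values, or raises a `KeyError` that names whatever is missing.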

For a Control Room run, create a new vault entry in Control Room.

  • Enter aws as the name.
  • Provide values for the AWS_KEY, AWS_KEY_ID, and AWS_REGION keys.

Running

  1. Run the Create Invoices task to create the PDF invoices.

  2. Run the Process PDF invoices with Amazon Textract task to process the PDF invoices and to generate the Excel file with the data extracted from the invoices.

Optional: Run the Delete Files From Amazon S3 Bucket task if you want to delete the PDF invoices from the Amazon S3 bucket (the Process PDF invoices with Amazon Textract task does this automatically in the teardown phase).

When running in Control Room, add the Create Invoices and Process PDF invoices with Amazon Textract tasks as process steps.
