
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

Home Page: https://aws.amazon.com/solutions/implementations/enhanced-document-understanding-on-aws/

License: Apache License 2.0

Languages: JavaScript 48.39%, TypeScript 31.80%, Java 9.66%, Python 8.77%, Shell 0.71%, CSS 0.58%, Makefile 0.06%, HTML 0.03%, Dockerfile 0.01%
Topics: document-analysis, document-processing

enhanced-document-understanding-on-aws's Introduction

Enhanced Document Understanding on AWS

Organizations across industries are increasingly required to process large volumes of semi-structured and unstructured documents with greater accuracy and speed. They need a document processing system that ingests and analyzes documents, extracts their content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data. Many industries have stringent compliance requirements to redact personally identifiable information (PII) and protected health information (PHI) from documents.

In most cases, organizations process documents manually to extract information and insights. This approach can be time consuming, expensive, and difficult to scale. Organizations need to rapidly extract insights from documents, and they can benefit from a smart document processing system as a foundation for automating business processes that rely on manual inputs and interventions. To help meet these needs, the Enhanced Document Understanding on AWS solution:

  • Automates document ingestion process to improve operational efficiency and reduce cost
  • Ingests and analyzes document files at scale using artificial intelligence (AI) and machine learning (ML)
  • Extracts text from documents
  • Identifies structural data (such as a single word, a line, a table, or individual cells within a table; see the Textract sketch after this list)
  • Extracts critical information (such as entities)
  • Creates smart search indexes from the data
  • Detects and redacts PII and PHI to generate a redacted version of the original document
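
The text and structure extraction described above maps onto Amazon Textract's AnalyzeDocument API. As a point of reference, here is a minimal sketch using the AWS SDK for JavaScript v3; the bucket and key are placeholders, and the solution itself drives Textract from its text-extract Lambda function rather than in this exact form:

    import { TextractClient, AnalyzeDocumentCommand } from "@aws-sdk/client-textract";

    const textract = new TextractClient({});

    // Analyze a document already stored in S3, requesting table and form structure.
    // The bucket and key below are placeholders for illustration only.
    const response = await textract.send(
        new AnalyzeDocumentCommand({
            Document: { S3Object: { Bucket: "my-document-bucket", Name: "case-id/document.png" } },
            FeatureTypes: ["TABLES", "FORMS"],
        })
    );

    // Each Block describes a page element: a WORD, LINE, TABLE, CELL, and so on.
    for (const block of response.Blocks ?? []) {
        console.log(block.BlockType, block.Text ?? "");
    }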

You can use each of these features standalone or configure the solution as a unique composition of workflow orchestration based on your use case. This solution deploys an AWS CloudFormation template that provides the capability to configure workflows for various use cases. The solution allows users to define custom workflows that include the types of documents required for their workflows and the type of processing each document can be subjected to. The solution also provides a web user interface (UI) for users to upload documents. Once the documents are uploaded, a backend workflow orchestrates AWS managed AI services to process documents at scale. For a detailed solution implementation guide, refer to Enhanced Document Understanding on AWS.
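
For a sense of what one of those managed AI service calls looks like, the sketch below detects PII with Amazon Comprehend and masks it by character offsets (AWS SDK for JavaScript v3). This is an illustrative assumption about the style of call involved, not the solution's actual redaction code, which operates on whole documents:

    import { ComprehendClient, DetectPiiEntitiesCommand } from "@aws-sdk/client-comprehend";

    const comprehend = new ComprehendClient({});
    const text = "John Doe's phone number is 555-0100."; // placeholder input

    const { Entities } = await comprehend.send(
        new DetectPiiEntitiesCommand({ Text: text, LanguageCode: "en" })
    );

    // Mask each detected PII span in place; iterate in reverse so earlier offsets stay valid.
    let redacted = text;
    for (const entity of [...(Entities ?? [])].reverse()) {
        const { BeginOffset, EndOffset } = entity;
        if (BeginOffset === undefined || EndOffset === undefined) continue;
        redacted =
            redacted.slice(0, BeginOffset) +
            "*".repeat(EndOffset - BeginOffset) +
            redacted.slice(EndOffset);
    }
    console.log(redacted); // PII spans replaced with asterisks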


Architecture Overview

Deploying this solution with the default parameters deploys the following components in your AWS account.

Architecture diagram (see images/architecture.png in the repository)

The high-level process flow for the solution components deployed with the CloudFormation template is as follows:

  1. The user navigates the browser to an Amazon CloudFront URL.
  2. The UI prompts the user for authentication, which the solution validates using Amazon Cognito.
  3. The UI interacts with the REST endpoint deployed on Amazon API Gateway.
  4. The user creates a case that the solution stores in the Case management store Amazon DynamoDB table.
  5. The user requests a signed Amazon Simple Storage Service (Amazon S3) URL to upload documents to an S3 bucket.
  6. Amazon S3 generates an s3:PutObject event on the default Amazon EventBridge event bus.
  7. The s3:PutObject event invokes the workflow orchestrator AWS Lambda function. This function uses the configuration stored in the Configuration for orchestrating workflows DynamoDB table to determine the workflows to be called.
  8. The workflow orchestrator Lambda function creates an event and sends it to the custom event bus.
  9. The custom event bus invokes one of the three AWS Step Functions state machine workflows based on the event definition (a CDK sketch of this routing follows the list).
  10. The workflow completes and publishes an event to the custom EventBridge event bus.
  11. The custom EventBridge event bus invokes the workflow orchestrator Lambda function. This function uses the configuration stored in the Configuration for orchestrating workflows DynamoDB table to determine whether the sequence is complete or if the sequence requires another workflow:
      a. The solution updates the Case management store DynamoDB table.
      b. If the sequence is not complete, the solution returns to step 8 for the next state machine workflow.
  12. The workflow orchestrator Lambda function writes metadata from the processed information to an Amazon Kendra index. This index provides the ability to perform machine learning powered search.
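
Steps 8 through 11 hinge on rules on the custom EventBridge bus routing orchestrator events to the right state machine. Below is a minimal CDK sketch of that wiring; the construct IDs, event source, and detail-type values are assumptions for illustration and do not reflect the solution's actual event schema:

    import { EventBus, Rule } from "aws-cdk-lib/aws-events";
    import { SfnStateMachine } from "aws-cdk-lib/aws-events-targets";
    import { StateMachine } from "aws-cdk-lib/aws-stepfunctions";

    // Inside a Stack or Construct, with textractWorkflow an existing state machine:
    declare const textractWorkflow: StateMachine;

    const bus = new EventBus(this, "WorkflowBus"); // the custom event bus

    // Route orchestrator events with a matching detail type to the state machine.
    new Rule(this, "TextractWorkflowRule", {
        eventBus: bus,
        eventPattern: {
            source: ["workflow-orchestrator"], // assumed value
            detailType: ["textract"],          // assumed value
        },
        targets: [new SfnStateMachine(textractWorkflow)],
    });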

Note: The deployment to Amazon Kendra is optional. If it is not deployed, the search feature is not available.
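
When the index is deployed, the search feature amounts to queries against it. A minimal sketch with the AWS SDK for JavaScript v3 (the index ID is a placeholder):

    import { KendraClient, QueryCommand } from "@aws-sdk/client-kendra";

    const kendra = new KendraClient({});

    const { ResultItems } = await kendra.send(
        new QueryCommand({
            IndexId: "00000000-0000-0000-0000-000000000000", // placeholder index ID
            QueryText: "insurance claim for water damage",
        })
    );

    // Print the title and excerpt of each matching document.
    for (const item of ResultItems ?? []) {
        console.log(item.DocumentTitle?.Text, item.DocumentExcerpt?.Text);
    }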

Deployment

Note: You can also test the UI project locally by deploying the API endpoints and the rest of the infrastructure. To do so, follow either of the two options below, and then refer to the UI project for details.

There are two options for deployment into your AWS account:

1. Using cdk deploy

To build and deploy locally, configure the AWS CLI with your AWS credentials, or export them in the CLI terminal environment; if the credentials are invalid or expired, running cdk deploy produces an error. If the target account and Region have never been bootstrapped for the CDK, you may also need to run cdk bootstrap once before deploying.

After cloning the repo from GitHub, complete the following steps:

  cd <project-directory>/source/infrastructure   # move into the CDK project
  npm install                                    # install dependencies
  npm run build                                  # compile the TypeScript infrastructure code
  cdk deploy                                     # synthesize and deploy the stacks

2. Using CloudFormation templates

To deploy using the CloudFormation templates, follow the instructions for creating a custom build (see Creating a custom build below).

Source code

Project directory structure

├── email-templates                   [email templates]
├── sample-documents                  [sample documents for different industry verticals]
├── images                            [images used for markdown files]
│   ├── architecture.png
│   └── ui-components.png
├── infrastructure                    [CDK infrastructure]
│   ├── bin
│   ├── cdk.json
│   ├── cdk.out
│   ├── coverage
│   ├── jest.config.js
│   ├── lib
│   ├── node_modules
│   ├── package-lock.json
│   ├── package.json
│   ├── test
│   └── tsconfig.json
├── lambda                            [Lambda functions for the application]
│   ├── create-presigned-url
│   ├── custom-resource
│   ├── entity-detection
│   ├── fetch-records
│   ├── get-inferences
│   ├── layers
│   ├── redact-content
│   ├── search
│   ├── send-notification
│   ├── text-extract
│   ├── upload-document
│   └── workflow-orchestrator
├── pre-build-jars.sh                 [pre-builds libraries required for the CDK infrastructure project]
├── run-all-tests.sh                  [shell script that can run unit tests for the entire project]
├── ui                                [Web App project for UI]
│   ├── README.md
│   ├── build
│   ├── node_modules
│   ├── package-lock.json
│   ├── package.json
│   ├── public
│   ├── src
│   └── tsconfig.json
└── workflow-config                   [provides out-of-the-box workflow configurations]
    ├── default.json
    ├── entity-detection.json
    ├── redaction.json
    ├── textract-analyze-doc.json
    ├── textract-to-entity-medical.json
    ├── textract-to-entity-pii.json
    ├── textract-to-entity-standard.json
    ├── textract-to-entity.json
    └── textract.json

Creating a custom build

1. Clone the repository

Run the following command:

git clone https://github.com/aws-solutions/<repository_name>

2. Build the solution for deployment

  1. Install the dependencies:

     cd <rootDir>/source/infrastructure
     npm install

  2. (Optional) Run the unit tests:

     Note: To run the unit tests, Docker must be installed and running, and valid AWS credentials must be configured.

     cd <rootDir>/source
     chmod +x ./run-all-tests.sh
     ./run-all-tests.sh

  3. Configure the environment variables used by the build script. The build command in the next step also references $SOLUTION_NAME and $CF_TEMPLATE_BUCKET_NAME, so export those as well (example values shown; see the parameter details below):

     export DIST_OUTPUT_BUCKET=my-bucket-name
     export VERSION=my-version
     export SOLUTION_NAME=my-solution-name                # example value
     export CF_TEMPLATE_BUCKET_NAME=my-template-bucket    # example value

  4. Build the distributable:

     cd <rootDir>/deployment
     chmod +x ./build-s3-dist.sh
     ./build-s3-dist.sh $DIST_OUTPUT_BUCKET $SOLUTION_NAME $VERSION $CF_TEMPLATE_BUCKET_NAME

Parameter details:

$DIST_OUTPUT_BUCKET - The global name of the distribution. The AWS Region is appended to this global name (example: my-bucket-name-us-east-1) to form the regional bucket name. The Lambda artifacts must be uploaded to the regional bucket for the CloudFormation template to pick them up during deployment.

$SOLUTION_NAME - The name of this solution (example: document-understanding-solution)
$VERSION - The version number of the change
$CF_TEMPLATE_BUCKET_NAME - The name of the S3 bucket where the CloudFormation templates should be uploaded

When you create and use buckets, we recommend that you:

  • Use randomized names or a UUID as part of your bucket naming strategy.
  • Ensure that buckets aren't public.
  • Verify bucket ownership prior to uploading templates or code artifacts.
  5. Deploy the distributable to an Amazon S3 bucket in your account.

Note: You must have the AWS CLI installed.

aws s3 cp ./global-s3-assets/ s3://my-bucket-name-<aws_region>/document-understanding-solution/<my-version>/ --recursive --acl bucket-owner-full-control --profile aws-cred-profile-name
aws s3 cp ./regional-s3-assets/ s3://my-bucket-name-<aws_region>/document-understanding-solution/<my-version>/ --recursive --acl bucket-owner-full-control --profile aws-cred-profile-name

Anonymized data collection

This solution collects anonymized operational metrics to help AWS improve the quality and features of the solution. For more information, including how to disable this capability, please see the implementation guide.


Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

enhanced-document-understanding-on-aws's People

Contributors

amazon-auto, jamesnixon-aws, knihit, mukitmomin, tabdunabi


enhanced-document-understanding-on-aws's Issues

How to opt out of the deployment to Amazon Kendra

Is your feature request related to a problem? Please describe.
The documentation states that the deployment to Amazon Kendra is optional, and that if it is not deployed, the search feature is not available.

How do we do it?

Describe the feature you'd like
Is there any configuration to opt out of Kendra?

Thanks in advance for the consideration; hoping for a prompt reply.

Document and case deletion

Is your feature request related to a problem? Please describe.
A case cannot be deleted; likewise, a document cannot be removed once it has been uploaded to a case.

Describe the feature you'd like
Be able to delete a case, as well as delete a document from a case.

Additional context
Would it be enough to remove the entry in DynamoDB and the files in S3 to accomplish a clean delete operation? Thank you.

Bucket Not Found and Key Missing Errors in Custom Build Process

Describe the bug

It's probably not a bug; more likely I did not understand the documentation clearly, and that is what's causing this problem. I'm not a cloud developer, but I'm trying to make a few modifications to deploy a specific application. Right now the code has no modifications at all; I am just trying to deploy a custom build following the steps in the docs, but I am getting errors (bucket not found, key missing) while trying to set the buckets for the custom build.

To Reproduce

Basically, I created two S3 buckets for DIST_OUTPUT_BUCKET and CF_TEMPLATE_BUCKET_NAME. The name format for DIST_OUTPUT_BUCKET included the region, as described in the parameter details, like this: my-bucket-us-east-1. CF_TEMPLATE_BUCKET_NAME had a random bucket name without the region in the name itself.

If I understood correctly, the global-s3-assets must be uploaded to CF_TEMPLATE_BUCKET_NAME, and the regional-s3-assets to DIST_OUTPUT_BUCKET. I did that after building the distributable with these buckets. However, when deploying on CloudFormation, I get an error saying a bucket was not found.

If I keep the bucket named my-bucket-us-east-1 but give the name without the region part, my-bucket, to the variable DIST_OUTPUT_BUCKET, the bucket is no longer missing. So I am assuming we should pass the bucket name without the region suffix to the variable, but still create the bucket with the region in its name, right?

Anyway, I still get an error saying a key is missing from the S3 bucket; I'm not sure which one. Just to make sure, I uploaded all global and regional assets to both buckets, but the same error persists.

Expected behavior

Custom build with no modifications to be deployed.

Please complete the following information about the solution:

  • Version: v1.0.3

  • Region: us-east-1

  • Was the solution modified from the version published on this repository? No

  • If the answer to the previous question was yes, are the changes available on GitHub?

  • Have you checked your service quotas for the services this solution uses?

  • Were there any errors in the CloudWatch Logs? Nothing in the logs

Additional context

I am using Cloud9 (Amazon Linux) and deploying directly on CloudFormation through the AWS interface.

Issues deploying a custom build - cdk deploy -> Error: Failed to bundle asset DocUnderstanding/WebApp

Describe the bug
Issues deploying the solution with a custom build. The cdk deploy command fails to bundle assets.

To Reproduce
I have installed all the dependencies; however, my CDK version is 2.103.1, and I have not been able to find v2.36.0 as listed on the GitHub repo.
I run cdk deploy as instructed, but I hit an error:

Error: Failed to bundle asset DocUnderstanding/WebApp/S3UI/UI/Stage, bundle output is located at /home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/cdk.out/asset.8f26c22c27562c32d4e256a2e7b480dcd2cfc46b20eba02a128f3a19796324ae-error: Error: docker exited with status 1
--> Command: docker run --rm --security-opt "no-new-privileges:true" --network host -u root -v "/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/ui:/asset-input:delegated" -v "/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/cdk.out/asset.8f26c22c27562c32d4e256a2e7b480dcd2cfc46b20eba02a128f3a19796324ae:/asset-output:delegated" -w "/asset-input" "public.ecr.aws/sam/build-nodejs18.x" bash -c "echo \"local bundling failed for /home/ec2-user/environment/enhanced-document-understanding-on-aws/source/ui and hence building with Docker image\" && npm install && npm run build && rm -fr /asset-input/node_modules && npm ci --omit=dev && mkdir -p /asset-output/ && cp -au /asset-input/* /asset-output/ && rm -fr /asset-output/.coverage"
    at AssetStaging.bundle (/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/node_modules/aws-cdk-lib/core/lib/asset-staging.js:2:619)
    at AssetStaging.stageByBundling (/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/node_modules/aws-cdk-lib/core/lib/asset-staging.js:1:5297)
    at stageThisAsset (/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/node_modules/aws-cdk-lib/core/lib/asset-staging.js:1:2728)
    at Cache.obtain (/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/node_modules/aws-cdk-lib/core/lib/private/cache.js:1:242)
    at new AssetStaging (/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/node_modules/aws-cdk-lib/core/lib/asset-staging.js:1:3125)
    at new Asset (/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/node_modules/aws-cdk-lib/aws-s3-assets/lib/asset.js:1:1080)
    at new UIAssets (/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/lib/s3web/ui-asset.ts:81:26)
    at new UIInfrastructure (/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/lib/ui-infrastructure.ts:64:30)
    at new DusStack (/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/lib/dus-stack.ts:312:33)
    at Object.<anonymous> (/home/ec2-user/environment/enhanced-document-understanding-on-aws/source/infrastructure/bin/dus.ts:44:13)

Subprocess exited with error 1

Expected behavior
CDK should deploy the infrastructure to run the solution. However, the documentation doesn't indicate how to customize the values available in the CloudFormation deployment: user email, indexer option, etc.

Please complete the following information about the solution:

  • Version: [e.g. v1.0.0]

To get the version of the solution, you can look at the description of the created CloudFormation stack. For example, "(SO0281) - Enhanced Document Understanding on AWS. Version v1.0.0".

  • Region: us-east-1
  • Was the solution modified from the version published on this repository? no
  • If the answer to the previous question was yes, are the changes available on GitHub? no
  • Have you checked your service quotas (https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) for the services this solution uses? Yes, not an issue there.
  • Were there any errors in the CloudWatch Logs? never got that far

Screenshots
If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context
I have tried other avenues, like the script to generate the CloudFormation templates, but I hit another issue. I have only been able to deploy the solution from the console, using the default CloudFormation template that is linked from the implementation guide.

Thanks.

Some documents fail entity detection due to repeating words

Describe the bug
When the line containing an entity also includes some of the entity's words immediately before the actual entity, entity detection fails.

To Reproduce
Upload a document with the above conditions to a case with an entity detection workflow

Expected behavior
Detection succeeds, or on failure we do not fail the whole workflow.

Please complete the following information about the solution:

  • Version: [e.g. v1.0.5]

Unable to parse command line options: Unrecognized option: --no-transfer-progress

Describe the bug
Unable to parse command line options: Unrecognized option: --no-transfer-progress
Subprocess exited with error 1

To Reproduce
cdk synth


Current directory is: /home/ec2-user/workspaces/enhanced-document-understanding-on-aws/source/lambda/layers/custom-java-sdk-config. Running build

Unable to parse command line options: Unrecognized option: --no-transfer-progress

usage: mvn [options] [<goal(s)>] [<phase(s)>]

Expected behavior
A clear and concise description of what you expected to happen.

Please complete the following information about the solution:

  • Version: [e.g. v1.0.0]

To get the version of the solution, you can look at the description of the created CloudFormation stack. For example, "(SO0281) - Enhanced Document Understanding on AWS. Version v1.0.0".

  • Region: ca-central-1

I am using Cloud9 as the IDE.

uname -a
Linux ip-10-2-35-238.ca-central-1.compute.internal 4.14.322-244.536.amzn2.x86_64 #1 SMP Wed Aug 16 04:58:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Case max number of documents and processing workflow

Is your feature request related to a problem? Please describe.
Hello, currently you can control the number of documents you can have in a case through the workflow-config JSON object, set at deployment time. However, you are stuck with this maximum number of documents, and they are only processed for Textract features once all of them are uploaded.

Describe the feature you'd like
I think it is a common situation to have a case type that doesn't always have the same number of documents. For instance, insurance claim cases can sometimes have 10 invoices, 5 receipts, etc., while the next claim has 1 receipt and 1 invoice. It would be really nice if you could set a maximum number of documents of each type (as it is currently), but kick off the Textract processing as soon as each document is uploaded.

Payload for upload document

Hi, I am trying to hit the /document API with the expected payload I found using Inspect in the UI, but I am not able to figure out how to send a file using Postman for a custom non-UI build.
{
  "caseId": "sm****:b6485fe3-c6ef-4f01-86a4-f57d96e77adf",
  "caseName": "XYZ",
  "documentType": "generic",
  "fileExtension": ".pdf",
  "fileName": "9159.pdf",
  "tagging": "userIdsm****",
  "userId": "sm****"
}

But I am not sure how to pass the file: should it be binary or form data, and if it is form data, what is its key?
