- Create a Google Cloud Organization.
- Create a project in your Organization.
- Install the gcloud CLI.
- Run `gcloud auth login`.
- Run `gcloud auth application-default login`.
- Install `terraform`.
- Follow this guide to configure the Document AI Warehouse web application.
- Take note of the Service Account Email.
- Go to the Document AI Warehouse web application.
- Go to "Admin" -> "Schemas". Click "Add new".
- Name the schema "US Patent", paste the contents of `schema.json` into the JSON area, and then click "Done".
- Go to "Admin" -> "Access". Click "Add new".
- Add the Service Account with type "User" and access "Document Admin".
To deploy the infrastructure, follow the steps below:

- `cd` into the `infra/deployment` folder.
- Comment out the entire contents of the `backend.tf` file.
- Create a `terraform.tfvars` file and set the values of the variables.
- Run `terraform init`.
- Run `terraform apply` and type `yes`.
- Uncomment the contents of the `backend.tf` file and set the `bucket` attribute to the value of the `tfstate_bucket` output.
- Run `terraform init` and type `yes`.
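The re-enabled `backend.tf` can be sketched as below. This is a minimal example: the actual file in `infra/deployment` may declare more attributes, and the bucket name shown is only a placeholder for your own `tfstate_bucket` output:

```hcl
terraform {
  backend "gcs" {
    # Replace with the value of the `tfstate_bucket` output
    # from the first `terraform apply`.
    bucket = "<tfstate_bucket output value>"
  }
}
```

Re-running `terraform init` after this change prompts to migrate the local state into the bucket, which is why it asks for `yes`.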
The US Patent Parser is a Document AI Custom Document Extractor. To train it, follow the steps below:
- Go to Key Management -> take note of the location of the `public-doc-ai-keyring`.
- Go to Cloud Storage -> click "Create" to create a GCS bucket -> You can name it `<some random prefix>-us-patent-parser-v1-0-0-initial-data-import` -> For the location, select the same region as the `public-doc-ai-keyring` -> Click "Continue" until the "Choose how to protect object data" section -> open the "Data encryption" accordion, click "Customer-managed encryption key (CMEK)" and select `doc-ai-key` as the encryption key -> click "Create". Now click "Upload Folder" and upload the US patents labeled data folder.
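The bucket naming and creation can also be scripted with the gcloud CLI instead of the console. A minimal sketch: the `LOCATION` value below is an assumption (use the actual region of `public-doc-ai-keyring`), `<project_id>` is left as a placeholder for your project, and the final `gcloud` command is printed rather than executed so you can review it first:

```shell
# Build a globally unique bucket name with a random 8-character prefix.
PREFIX=$(LC_ALL=C tr -dc 'a-z0-9' </dev/urandom | head -c 8)
BUCKET="${PREFIX}-us-patent-parser-v1-0-0-initial-data-import"

# Assumed values -- replace with your keyring's region and project ID.
LOCATION="us-central1"
KMS_KEY="projects/<project_id>/locations/${LOCATION}/keyRings/public-doc-ai-keyring/cryptoKeys/doc-ai-key"

# Printed rather than executed: review the command, then run it yourself.
echo "gcloud storage buckets create gs://${BUCKET}" \
     "--location=${LOCATION}" \
     "--default-encryption-key=${KMS_KEY}"
```

After creating the bucket, the labeled-data folder still has to be uploaded (e.g. via "Upload Folder" in the console as described above).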
- Now go to Document AI -> My Processors -> click the `us-patent-parser` processor -> Train.
- Click "Show Advanced Options" -> click "I'll specify my own location" -> select the `<project_id>-us-patent-parser-dataset` bucket. Wait for the dataset configuration to finish.
- Click the "Import Documents" button -> click "Browse" -> select the bucket you imported the US patents labeled data to and select the `labeled` folder -> in the "Data split" dropdown on the right, select `Auto-split` -> click "Import". Wait for the import to finish.
- Click "Edit Schema", enable all the labels, set the labels according to the table below, and then click "Save":

| Name | Data type | Occurrence |
| --- | --- | --- |
| applicant_line_1 | Plain text | Required once |
| application_number | Number | Required once |
| class_international | Plain text | Required once |
| class_us | Plain text | Required once |
| filing_date | Datetime | Required once |
| inventor_line_1 | Plain text | Required once |
| issuer | Plain text | Required once |
| patent_number | Number | Required once |
| publication_date | Datetime | Required once |
| title_line_1 | Plain text | Required once |

- Go back to the "Train" tab and click "Train New Version". You can name the version `v1-0-0`, then click "Start Training". Wait for the training to finish: it can take more than 1 hour.
- Check the processor's F1 score: it should be above `0.9` for all labels.
- Go to the "Manage Versions" tab -> click the three dots on the right of the model version -> click "Deploy version" and wait for it to finish. It can take more than 10 minutes.
- Click the three dots again and click "Set as default".
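Promoting the trained version can also be done outside the console. A hedged sketch, assuming the Document AI v1 REST API's `setDefaultProcessorVersion` method; all the IDs below are placeholders you must substitute, and the `curl` call is guarded so the snippet is safe to paste as-is:

```shell
PROJECT_ID="<project_id>"      # placeholders -- substitute your real IDs
LOCATION="us"                  # the processor's region
PROCESSOR_ID="<processor_id>"
VERSION_ID="<version_id>"      # the trained v1-0-0 version

PROCESSOR="projects/${PROJECT_ID}/locations/${LOCATION}/processors/${PROCESSOR_ID}"
URL="https://${LOCATION}-documentai.googleapis.com/v1/${PROCESSOR}:setDefaultProcessorVersion"

echo "POST ${URL}"

# Guarded so nothing is sent by accident; set RUN=1 to actually call the API.
if [ "${RUN:-0}" = "1" ]; then
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d "{\"defaultProcessorVersion\": \"${PROCESSOR}/processorVersions/${VERSION_ID}\"}" \
    "${URL}"
fi
```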