- Create a Google Cloud Organization.
- Create a project in your Organization.
- Install the gcloud CLI.
- Run `gcloud auth login`.
- Run `gcloud auth application-default login`.
- Install `terraform`.
- Follow this guide to configure the Document AI Warehouse web application.
- Take note of the Service Account Email.
- Go to the Document AI Warehouse web application.
- Go to "Admin" -> "Schemas". Click "Add new".
- Name the schema "US Patent", paste the contents of `schema.json` into the JSON area, and then click "Done".
- Go to "Admin" -> "Access". Click "Add new".
- Add the Service Account with type "User" and access "Document Admin".
To deploy the infrastructure, follow the steps below:

- `cd` into the `infra/deployment` folder.
- Comment out the entire contents of the `backend.tf` file.
- Create a `terraform.tfvars` file and set the values of the variables.
- Run `terraform init`.
- Run `terraform apply` and type `yes`.
- Uncomment the contents of the `backend.tf` file and set the `bucket` attribute to the value of the `tfstate_bucket` output.
- Run `terraform init` and type `yes`.
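The re-enabled `backend.tf` can be sketched as below. This is a minimal example: the actual file in `infra/deployment` may declare more attributes, and the bucket name shown is only a placeholder for your own `tfstate_bucket` output:

```hcl
terraform {
  backend "gcs" {
    # Replace with the value of the `tfstate_bucket` output
    # from the first `terraform apply`.
    bucket = "<tfstate_bucket output value>"
  }
}
```

Re-running `terraform init` after this change prompts to migrate the local state into the bucket, which is why it asks for `yes`.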
The US Patent Parser is a Document AI Custom Document Extractor. To train it, follow the steps below:
- Go to Key Management -> take note of the location of the `public-doc-ai-keyring`.
- Go to Cloud Storage -> click "Create" to create a GCS bucket -> You can name it `<some random prefix>-us-patent-parser-v1-0-0-initial-data-import` -> For the location, select the same region as the `public-doc-ai-keyring` -> Click "Continue" until the "Choose how to protect object data" section -> open the "Data encryption" accordion, click "Customer-managed encryption key (CMEK)" and select `doc-ai-key` as the encryption key -> click "Create". Now click "Upload Folder" and upload the US patents labeled data folder.
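The bucket naming and creation can also be scripted with the gcloud CLI instead of the console. A minimal sketch: the `LOCATION` value below is an assumption (use the actual region of `public-doc-ai-keyring`), `<project_id>` is left as a placeholder for your project, and the final `gcloud` command is printed rather than executed so you can review it first:

```shell
# Build a globally unique bucket name with a random 8-character prefix.
PREFIX=$(LC_ALL=C tr -dc 'a-z0-9' </dev/urandom | head -c 8)
BUCKET="${PREFIX}-us-patent-parser-v1-0-0-initial-data-import"

# Assumed values -- replace with your keyring's region and project ID.
LOCATION="us-central1"
KMS_KEY="projects/<project_id>/locations/${LOCATION}/keyRings/public-doc-ai-keyring/cryptoKeys/doc-ai-key"

# Printed rather than executed: review the command, then run it yourself.
echo "gcloud storage buckets create gs://${BUCKET}" \
     "--location=${LOCATION}" \
     "--default-encryption-key=${KMS_KEY}"
```

After creating the bucket, the labeled-data folder still has to be uploaded (e.g. via "Upload Folder" in the console as described above).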
- Now go to Document AI -> My Processors -> click the `us-patent-parser` processor -> Train.
- Click "Show Advanced Options" -> click "I'll specify my own location" -> select the `<project_id>-us-patent-parser-dataset` bucket. Wait for the dataset configuration to finish.
- Click the "Import Documents" button -> click "Browse" -> select the bucket you imported the US patents labeled data to and select the `labeled` folder -> in the "Data split" dropdown on the right, select `Auto-split` -> click "Import". Wait for the import to finish.
- Click "Edit Schema", enable all the labels, set the labels according to the table below, and then click "Save":

| Name | Data type | Occurrence |
| --- | --- | --- |
| applicant_line_1 | Plain text | Required once |
| application_number | Number | Required once |
| class_international | Plain text | Required once |
| class_us | Plain text | Required once |
| filing_date | Datetime | Required once |
| inventor_line_1 | Plain text | Required once |
| issuer | Plain text | Required once |
| patent_number | Number | Required once |
| publication_date | Datetime | Required once |
| title_line_1 | Plain text | Required once |

- Go back to the "Train" tab and click "Train New Version". You can name the version `v1-0-0`, then click "Start Training". Wait for the training to finish: it can take more than 1 hour.
- Check the processor's F1 score: it should be above `0.9` for all labels.
- Go to the "Manage Versions" tab -> click the three dots on the right of the model version -> click "Deploy version" and wait for it to finish. It can take more than 10 minutes.
- Click the three dots again and click "Set as default".
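Promoting the trained version can also be done outside the console. A hedged sketch, assuming the Document AI v1 REST API's `setDefaultProcessorVersion` method; all the IDs below are placeholders you must substitute, and the `curl` call is guarded so the snippet is safe to paste as-is:

```shell
PROJECT_ID="<project_id>"      # placeholders -- substitute your real IDs
LOCATION="us"                  # the processor's region
PROCESSOR_ID="<processor_id>"
VERSION_ID="<version_id>"      # the trained v1-0-0 version

PROCESSOR="projects/${PROJECT_ID}/locations/${LOCATION}/processors/${PROCESSOR_ID}"
URL="https://${LOCATION}-documentai.googleapis.com/v1/${PROCESSOR}:setDefaultProcessorVersion"

echo "POST ${URL}"

# Guarded so nothing is sent by accident; set RUN=1 to actually call the API.
if [ "${RUN:-0}" = "1" ]; then
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d "{\"defaultProcessorVersion\": \"${PROCESSOR}/processorVersions/${VERSION_ID}\"}" \
    "${URL}"
fi
```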