Google Cloud Document AI Warehouse demo

How it works

Deployment

Pre-Requisites

  1. Create a Google Cloud Organization.
  2. Create a project in your Organization.
  3. Install the gcloud CLI.
  4. Run gcloud auth login.
  5. Run gcloud auth application-default login.
  6. Install Terraform.
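
Before moving on, it can help to confirm that both command-line tools from the list above are actually on your PATH. The following is a minimal sketch, not part of this repository: the function name `missing_tools` is hypothetical, and it only checks that the executables can be found, not that you are logged in.

```python
import shutil

# Executables named in the pre-requisites above.
REQUIRED_TOOLS = ("gcloud", "terraform")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    the real executables being installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("gcloud and terraform are both on PATH.")
```

Authentication still has to be done by hand with gcloud auth login and gcloud auth application-default login.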

Configure the Document AI Warehouse web application

  1. Follow this guide to configure the Document AI Warehouse web application.
  2. Take note of the Service Account Email.
  3. Go to the Document AI Warehouse web application.
  4. Go to "Admin" -> "Schemas". Click "Add new".
  5. Name the schema "US Patent", copy and paste the schema.json into the JSON area and then click "Done".
  6. Go to "Admin" -> "Access". Click "Add new".
  7. Add the Service Account with type "User" and access "Document Admin".
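
Since the schema is pasted by hand into the JSON area, it may be worth checking that the text is well-formed JSON first. This is a minimal sketch under assumptions: the helper name `check_schema_text` is hypothetical, the path to schema.json depends on where you cloned the repository, and the check says nothing about whether Document AI Warehouse accepts the schema's fields.

```python
import json


def check_schema_text(text):
    """Parse `text` as JSON and return it re-serialized with indentation.

    Raises json.JSONDecodeError (a subclass of ValueError) if the
    text is not well-formed JSON.
    """
    return json.dumps(json.loads(text), indent=2)


# Example usage (the path is an assumption, adjust it to your checkout):
#
#     with open("schema.json", encoding="utf-8") as f:
#         print(check_schema_text(f.read()))
```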

Deployment

To deploy the infrastructure, follow the steps below:

  1. cd into the infra/deployment folder.
  2. Comment out the entire contents of the backend.tf file.
  3. Create a terraform.tfvars file and set the values of the variables.
  4. Run terraform init.
  5. Run terraform apply and type yes.
  6. Uncomment the contents of the backend.tf file and set the bucket attribute to the value of the tfstate_bucket output.
  7. Run terraform init again and type yes when asked whether to copy the existing state to the new backend.
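
The comment and uncomment steps for backend.tf can be done in any editor. As an illustration only, here is a minimal sketch of the same toggle in code. The helper names `comment_out` and `uncomment` are hypothetical, and the sketch relies on Terraform treating lines that start with `#` as comments.

```python
def comment_out(text):
    """Prefix every non-blank line with '# ' so Terraform ignores it."""
    out = []
    for line in text.splitlines(keepends=True):
        out.append("# " + line if line.strip() else line)
    return "".join(out)


def uncomment(text):
    """Remove one leading '# ' (or bare '#') from each line that has one."""
    out = []
    for line in text.splitlines(keepends=True):
        if line.startswith("# "):
            line = line[2:]
        elif line.startswith("#"):
            line = line[1:]
        out.append(line)
    return "".join(out)


# Example usage, run from the infra/deployment folder:
#
#     from pathlib import Path
#     backend = Path("backend.tf")
#     backend.write_text(comment_out(backend.read_text()))
```

Note that `uncomment` strips one level of `#` from every commented line, so it is only a clean inverse when applied to a file that `comment_out` produced. After uncommenting, the bucket attribute still has to be set by hand to the value of the tfstate_bucket output.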

Train US Patent Parser Custom Document Extractor

The US Patent Parser is a Document AI Custom Document Extractor. To train it, follow the steps below:

  1. Go to Key Management -> take note of the location of the public-doc-ai-keyring.

  2. Go to Cloud Storage -> click "Create" to create a GCS bucket -> You can name it <some random prefix>-us-patent-parser-v1-0-0-initial-data-import -> For the location, select the same region as the public-doc-ai-keyring -> Click "Continue" until the "Choose how to protect object data" section -> open the "Data encryption" accordion, click "Customer-managed encryption key (CMEK)" and select the doc-ai-key as the encryption key -> click "Create". Now click "Upload Folder" and upload the US patents labeled data folder.

  3. Now go to Document AI -> My Processors -> Click the us-patent-parser processor -> Train.

  4. Click "Show Advanced Options" -> click "I'll specify my own location" -> select the <project_id>-us-patent-parser-dataset bucket. Wait for the dataset configuration to finish.

  5. Click the "Import Documents" button -> click "Browse" -> select the bucket you imported the US patents labeled data to and select the labeled folder -> In the "Data split" dropdown on the right, select Auto-split -> click "Import". Wait for the import to finish.

  6. Click "Edit Schema", enable all the labels, set the labels according to the table below, and then click "Save":

    Name                 Data type   Occurrence
    applicant_line_1     Plain text  Required once
    application_number   Number      Required once
    class_international  Plain text  Required once
    class_us             Plain text  Required once
    filing_date          Datetime    Required once
    inventor_line_1      Plain text  Required once
    issuer               Plain text  Required once
    patent_number        Number      Required once
    publication_date     Datetime    Required once
    title_line_1         Plain text  Required once

  7. Go back to the "Train" tab and click "Train New Version". You can name the version v1-0-0. Click "Start Training" and wait for the training to finish, which can take more than 1 hour.

  8. Check the processor's F1 score: it should show more than 0.9 for all labels.

  9. Go to the "Manage Versions" tab -> click the three dots on the right of the model version -> click "Deploy version" and wait for the deployment to finish, which can take more than 10 minutes.

  10. Click the three dots again and click "Set as default".
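
The F1 check in step 8 can be written down as a small helper. This is a minimal sketch, assuming you copy the per-label F1 scores from the processor's evaluation page by hand. The names `LABEL_SCHEMA` and `labels_below_threshold` are hypothetical, and the label names and types are taken from the table in step 6.

```python
# Label schema from step 6: name -> (data type, occurrence).
LABEL_SCHEMA = {
    "applicant_line_1": ("Plain text", "Required once"),
    "application_number": ("Number", "Required once"),
    "class_international": ("Plain text", "Required once"),
    "class_us": ("Plain text", "Required once"),
    "filing_date": ("Datetime", "Required once"),
    "inventor_line_1": ("Plain text", "Required once"),
    "issuer": ("Plain text", "Required once"),
    "patent_number": ("Number", "Required once"),
    "publication_date": ("Datetime", "Required once"),
    "title_line_1": ("Plain text", "Required once"),
}

# Step 8 expects every label to score more than 0.9.
F1_THRESHOLD = 0.9


def labels_below_threshold(f1_by_label, threshold=F1_THRESHOLD):
    """Return the schema labels whose F1 is missing or not above `threshold`."""
    return sorted(
        name
        for name in LABEL_SCHEMA
        if f1_by_label.get(name, 0.0) <= threshold
    )
```

An empty result means every label in the schema cleared the threshold. Any label listed is a candidate for more labeled examples before deploying the version.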

Contributors

marcusmonteirodesouza
