
Dorothy AI Patent Classifier

This GitHub repository contains the code for the Dorothy AI Patent Classifier.


Data generation and preprocessing

Step 1: We generate our dataset from all granted patents up to September 2019; the dataset contains 4,363,544 patents in total. To regenerate it, run:

$ sbatch /pylon5/sez3a3p/yyn1228/json_process_jobs/json_process_sin_*.job

or manually sbatch each job from

/pylon5/sez3a3p/yyn1228/json_process_jobs/json_process_sin_a.job

to

/pylon5/sez3a3p/yyn1228/json_process_jobs/json_process_sin_h.job

The extracted dataset is stored in /pylon5/sez3a3p/yyn1228/data/json_reparse; this path is defined in database_reparse.py.

Step 2: We parse the cpc field into the labels we need (section, class, subclass, etc.), convert the text into a list of tokens, and split the data into train, valid, and test sets in an 8:1:1 ratio. This step also removes all punctuation and converts all uppercase letters to lowercase. Run data_preprocess/text_preprocess.py, for example:

$ python3 -u data_preprocess/text_preprocess.py \
/pylon5/sez3a3p/yyn1228/data/json_reparse \
/pylon5/sez3a3p/yyn1228/data/all_data
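The actual logic lives in data_preprocess/text_preprocess.py; as a minimal sketch of the cleaning and splitting described above (the function names and details here are illustrative, not the script's API):

import string

def tokenize(text):
    # Illustrative only: lowercase, strip punctuation, and split on
    # whitespace, mirroring the cleaning described above.
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

def split_8_1_1(records):
    # 8:1:1 train/valid/test split; the real script may shuffle first.
    n = len(records)
    return (records[:int(n * 0.8)],
            records[int(n * 0.8):int(n * 0.9)],
            records[int(n * 0.9):])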

Step 3: We further preprocess the data into a format that the machine learning libraries can use. This is done by running data_preprocess/create_training_data.py. Note that the file takes 6 arguments:

  • input directory
  • output directory
  • text field: 'title', 'abstraction', 'claims', 'brief_summary' ('description' is too large to include)
  • level name: 'section', 'class', 'subclass', 'main_group', 'subgroup'
  • whether to remove stop words: True means remove stop words
  • whether to follow fasttext format: True means FastText format, False means Tencent format

For example:

$ python3 -u data_preprocess/create_training_data.py \
/pylon5/sez3a3p/yyn1228/data/all_data \
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext_group \
brief_summary main_group false true
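For reference, FastText's supervised-learning format puts one example per line, with each label prefixed by __label__ followed by the tokens; an invented line might look like:

__label__A01B __label__B25G a hand tool comprising an adjustable handle and a blade

The Tencent (NeuralClassifier) format is JSON-based instead; see the NeuralClassifier README for details.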

Data location and summary

Processed data after Step 1: 91 files, most of which contain 50,000 patents each.

/pylon5/sez3a3p/yyn1228/data/json_reparse

Processed data after Step 2, which includes three files: train.json, valid.json, and test.json.

/pylon5/sez3a3p/yyn1228/data/all_data

Smaller datasets for valid and test, created by shuffling the valid.json and test.json above and taking the first 60,000 records. These data have the following fields:

  • all_labels: all true labels at the lowest subgroup level
  • title, abstraction, claims, brief_summary, description: text split into list of tokens for various cpc text fields

/pylon5/sez3a3p/yyn1228/data/all_data_small
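As an illustration, a record in these files has roughly the following shape (all values here are invented):

{
  "all_labels": ["G06F16/35", "G06N3/08"],
  "title": ["patent", "classification", "system"],
  "abstraction": ["a", "system", "for", "classifying", "patents"],
  "claims": ["what", "is", "claimed", "is"],
  "brief_summary": ["the", "present", "invention", "relates", "to"],
  "description": ["in", "one", "embodiment"]
}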

Processed data after Step 3 for the brief summary at the subclass level, in Tencent's format. Note that these data do not include stop words.

/pylon5/sez3a3p/yyn1228/data/all_summary_nonstop

Processed data after Step 3 for the brief summary at all levels, in FastText's format. Note that these data include stop words.

/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext_section
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext_class
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext (this is subclass)
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext_group
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext_subgroup

Smaller datasets initially used for testing purposes. Note that these data were generated by legacy code and may not be easily reproduced.

/pylon5/sez3a3p/yyn1228/data/summary_only
/pylon5/sez3a3p/yyn1228/data/summary_only_fasttext
/pylon5/sez3a3p/yyn1228/data/summary_only_nonstop

Machine learning models

This section describes how we use various libraries to train machine learning models. All models are trained using the brief summary text field.

FastText

We use Facebook's FastText library to train the well-known FastText model. This method first converts words into word embeddings and then averages the word embeddings to create a document embedding. Note that this ignores the order of the words; to keep some order information, it adds 2-grams to the vocabulary. Because this model is relatively simple and Facebook uses many tricks to speed up training, training can be done on CPUs instead of GPUs in a couple of hours. To account for the hierarchical information, we borrow an idea from HFT-CNN: we first train the section level and pass its word embeddings to the next level as pretrained word embeddings.
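On PSC the training is driven by the .job scripts below, but the core calls with the fasttext Python package look roughly like this (paths and hyperparameter values are placeholders, not the ones we used):

import fasttext

# Train the section-level model first; wordNgrams=2 adds 2-grams to the
# vocabulary to keep some word-order information.
section_model = fasttext.train_supervised(
    input='train_section.txt',  # placeholder path, FastText format
    dim=100,
    wordNgrams=2,
)

# Dump the learned word vectors in .vec format so the next level can load
# them as pretrained embeddings (the HFT-CNN-style hand-off described above).
with open('section.vec', 'w') as f:
    words = section_model.get_words()
    f.write(f'{len(words)} {section_model.get_dimension()}\n')
    for w in words:
        vec = ' '.join(str(x) for x in section_model.get_word_vector(w))
        f.write(f'{w} {vec}\n')

# Train the class-level model on top of the section-level embeddings.
class_model = fasttext.train_supervised(
    input='train_class.txt',  # placeholder path
    dim=100,  # must match the pretrained vector dimension
    wordNgrams=2,
    pretrainedVectors='section.vec',
)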

To train FastText on PSC, first run the training job for the section level:

$ sbatch model/FastText/summary_all_section/train_fasttext.job

Then save the word embeddings by running

$ sbatch model/FastText/summary_all_section/bin_to_vec.job

And then do the same thing for the class, subclass, group, and subgroup levels. To change the hyperparameters, just edit the train.py file in the corresponding folder.

Tencent's NeuralClassifier

We use Tencent's NeuralClassifier library to train the classic CNN/RNN/RCNN text classification models. The library accounts for the hierarchical structure by adding a loss term computed on the label tree, which encourages leaves that are close in the tree to have similar losses. Note that the library supports many models, but we only tried the classic CNN/RNN/RCNN models. We edited some code to allow using an existing vocabulary.
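We have not reproduced NeuralClassifier's exact loss here, but the general idea of a label-tree penalty can be sketched as recursive regularization: each child label's classifier weights are pulled toward its parent's. A minimal PyTorch sketch (names are ours, not the library's):

import torch

def hierarchy_penalty(label_weights, parent_of):
    # label_weights: (num_labels, dim) tensor of per-label classifier weights.
    # parent_of: dict mapping a child label index to its parent label index.
    # Penalizing the distance between each child's weights and its parent's
    # pulls labels that are close in the tree toward similar classifiers.
    penalty = torch.tensor(0.0)
    for child, parent in parent_of.items():
        penalty = penalty + torch.sum((label_weights[child] - label_weights[parent]) ** 2)
    return penalty

# total_loss = cross_entropy + lambda_h * hierarchy_penalty(W, parent_of)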

A detailed README on how to train the model using NeuralClassifier is saved here: README.md. All models are saved in the "/pylon5/sez3a3p/yyn1228/Dorothy-Ymir/model/NeuralClassifier/output/xxx/checkpoint_dir_cpc" folders on PSC.

Because there are many hyperparameters to tune, we include a summary of all the models we trained with their corresponding hyperparameters: tencent models

HFT-CNN

We also use the HFT-CNN library to train another model. The idea is to train a CNN model at each level, which passes its word embeddings and early-layer parameters to the CNN model at the next level. We added some code to support multi-GPU training. Follow the README.md here to train the model. The subclass-level model is saved in the "/pylon5/sez3a3p/yyn1228/Dorothy-Ymir/model/HFT-CNN/CNN/CNN/PARAMS/" folders on PSC.
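The hand-off between levels can be pictured as copying the lower layers of the trained section-level CNN into the class-level CNN before fine-tuning. A hedged PyTorch sketch (the architecture and sizes are ours, not HFT-CNN's exact code):

import torch
import torch.nn as nn

class LevelCNN(nn.Module):
    # A minimal text CNN: embedding -> 1D conv -> per-level classifier head.
    def __init__(self, vocab_size, emb_dim, num_labels):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1)
        self.fc = nn.Linear(128, num_labels)

    def forward(self, x):
        h = self.embedding(x).transpose(1, 2)          # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(h)).max(dim=2).values  # max-pool over time
        return self.fc(h)

section_model = LevelCNN(vocab_size=50000, emb_dim=100, num_labels=9)
# ... train section_model on section-level labels ...

# Initialize the class-level model from the section-level early layers and
# fine-tune; only the classifier head starts fresh.
class_model = LevelCNN(vocab_size=50000, emb_dim=100, num_labels=128)
class_model.embedding.load_state_dict(section_model.embedding.state_dict())
class_model.conv.load_state_dict(section_model.conv.state_dict())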

Evaluation

The detailed evaluation is saved in notebooks/prob_evaluate.ipynb, which also includes methods to ensemble different models. A summary of the model results is below; the best recall at n ≈ 5 is 91.6%.
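Precision and recall at n in the tables below are computed over the top-n predicted labels per patent; a minimal sketch of the metric (not the notebook's exact code):

def precision_recall_at_n(predicted, true_labels, n):
    # predicted: labels sorted by descending model confidence for one patent.
    # true_labels: set of gold CPC labels for that patent.
    top_n = predicted[:n]
    hits = sum(1 for label in top_n if label in true_labels)
    return hits / n, hits / len(true_labels)

# The reported numbers are these values averaged over all test patents.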

To see how the model works on other text fields, we also evaluate it using the title, abstract, and claims, even though it was trained on the brief summary. Note that we have not evaluated with the description because it would take too much storage, but it is worth trying the first 1,000 tokens of the description. Also note that these evaluations use only the FastText model.

| Text Field                | Precision @ 1 | Recall @ 1 | Precision @ 5 | Recall @ 5 |
| ------------------------- | ------------- | ---------- | ------------- | ---------- |
| Title                     | 0.107         | 0.568      | 0.098         | 0.603      |
| Abstract                  | 0.675         | 0.401      | 0.190         | 0.709      |
| Claims                    | 0.699         | 0.379      | 0.247         | 0.710      |
| Title + Abstract + Claims | 0.749         | 0.403      | 0.251         | 0.755      |
| Brief Summary             | 0.851         | 0.453      | 0.216         | 0.856      |

We also train FastText models at all 5 levels. Note that only the FastText model is feasible for the group and subgroup levels because there are too many labels: at the subclass level there are 666 labels and it already takes hours to train a non-FastText model; at the subgroup level there are about 200,000 labels, so the same models would take weeks to train. For group and subgroup we use FastText's "hierarchical softmax loss", a trick developed by Facebook that significantly shortens training time at a small cost in performance.

| Level    | Precision @ 1 | Recall @ 1 | Precision @ 5 | Recall @ 5 |
| -------- | ------------- | ---------- | ------------- | ---------- |
| Section  | 0.921         | 0.623      | 0.271         | 0.992      |
| Class    | 0.886         | 0.535      | 0.257         | 0.929      |
| Subclass | 0.851         | 0.453      | 0.216         | 0.856      |

For the group level:

| Level | Recall @ 1 | Recall @ 10 | Recall @ 100 |
| ----- | ---------- | ----------- | ------------ |
| Group | 0.220      | 0.661       | 0.912        |

For the subgroup level:

| Level    | Recall @ 1 | Recall @ 10 | Recall @ 100 | Recall @ 1000 |
| -------- | ---------- | ----------- | ------------ | ------------- |
| Subgroup | 0.054      | 0.208       | 0.468        | 0.750         |
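As noted above, the group and subgroup models use hierarchical softmax. With the fasttext package this is a one-argument change (the path is a placeholder):

import fasttext

# loss='hs' swaps the full softmax for a Huffman-tree hierarchical softmax,
# trading a little accuracy for a large speedup with ~200,000 labels.
model = fasttext.train_supervised(
    input='train_subgroup.txt',  # placeholder path
    loss='hs',
    wordNgrams=2,
)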

Visualization

The notebook at notebooks/visualization.ipynb includes visualizations of patent embeddings and word embeddings. The patent embedding figure is the one used in the presentation. The word embedding figure does not show very clear clusters, because the word lists and categories we chose are too general. The notebook includes all the code to generate the word embeddings, and the word lists can be changed easily: if more representative word lists and categories are found, change the word lists and rerun the word embedding part of the notebook to regenerate the visualization.
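Plots like these typically project the high-dimensional embeddings to 2-D with t-SNE or PCA; a minimal sketch of that step (we do not claim this is the notebook's exact code, and the paths are placeholders):

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# embeddings: (num_points, dim) array of patent or word vectors;
# labels: one category per point, used only for coloring.
embeddings = np.load('embeddings.npy')                # placeholder path
labels = np.load('labels.npy', allow_pickle=True)     # placeholder path

points = TSNE(n_components=2).fit_transform(embeddings)
for category in set(labels):
    mask = labels == category
    plt.scatter(points[mask, 0], points[mask, 1], label=str(category), s=5)
plt.legend()
plt.show()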

Web app

To get an intuitive feel for the results, we built a web app that predicts the corresponding CPC codes for any given text in real time and generates a tree plot. Note that the web app is on the visualization branch, while the model implementation is on the master branch.

The backend of our web app is Django, the frontend is built with React, and the project is deployed on AWS. The user can type in any text describing a technology, and the predicted CPC codes are rendered as a tree within seconds.

For the backend, we load the model in models.py when the first prediction is made, so subsequent predictions do not have to reload the large model file. The predictions also need to be structured and parsed into a format the frontend can use; this is done by treebuilders.py and views.py. The backend also ranks the predictions by the confidence scores from our model so the frontend can render them. In total, the data we return to the frontend is:

res = {'tree': tree, 'ordered_labels': ordered_labels}

where tree holds the parsed predictions in a tree structure, and ordered_labels holds the predicted labels ranked by their confidence scores.
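CPC codes are hierarchical (a section letter, then class digits, then a subclass letter, and so on), so the tree can be built by splitting each predicted code into its prefixes. A simplified sketch of what treebuilders.py does (our own simplification, not the file's API):

def build_tree(codes):
    # Build a nested dict keyed by CPC prefixes, e.g. 'G06F' nests under
    # 'G06', which nests under 'G'. Prefix lengths 1/3/4 correspond to the
    # section/class/subclass levels.
    tree = {}
    for code in codes:
        node = tree
        for end in (1, 3, 4):
            node = node.setdefault(code[:end], {})
    return tree

# build_tree(['G06F', 'G06N', 'H04L']) ->
# {'G': {'G06': {'G06F': {}, 'G06N': {}}}, 'H': {'H04': {'H04L': {}}}}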

On the frontend, we use these two fields to render a tree chart, and we provide an adjustment bar that changes the number of leaf nodes in the tree for analysis.

The implementation is well documented, so further integration should be straightforward.

Other

  • notebooks/CPC_Preliminary_Data_Analysis.ipynb: this notebook includes some preliminary data analysis of the CPC MCF data (e.g. average number of labels, duplicate issues, number of labels at each level, etc.)
  • notebooks/CPC_Text_Data.ipynb: this notebook has some preliminary data analysis of the CPC text data (e.g. average number of tokens of each text field)
  • notebooks/evaluate.ipynb: this notebook has some old evaluation methods (e.g. macro and micro F1, precision and recall at different percentiles, etc.)

