
patents-public-data's Introduction

Patent analysis using the Google Patents Public Datasets on BigQuery

The contents of this repository are not an official Google product.

Google Patents Public Datasets is a collection of compatible BigQuery database tables from government, research and private companies for conducting statistical analysis of patent data. The data is available to be queried with SQL through BigQuery, joined with private datasets you upload, and exported and processed using many other compatible analysis tools. This repository is a centralized source for examples which use the data.
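
For readers new to BigQuery, here is a minimal sketch (an illustration only, not an official example) of running such a SQL query from Python, assuming the google-cloud-bigquery package is installed and application-default credentials point at a project with BigQuery enabled:

# Minimal sketch: count publications per country in the public dataset.
# Assumes `pip install google-cloud-bigquery` and default credentials.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT country_code, COUNT(*) AS n
FROM `patents-public-data.patents.publications`
GROUP BY country_code
ORDER BY n DESC
LIMIT 10
"""
for row in client.query(sql).result():
    print(row.country_code, row.n)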

Currently the repo contains three examples:

  1. Patent Landscaping: A demo of an automated process of finding patents related to a particular topic given an initial seed set of patents. Based on the paper by Dave Feltenberger and Aaron Abood, Automated Patent Landscaping.

  2. Claim Text Extraction: A demo of interacting with patent claim text data using BigQuery and python.

  3. Claim Breadth Model: A machine learning method for estimating patent claim breadth using data from BigQuery.

Other helpful resources from the community:

  1. Replicable Patent Indicators (paper)

patents-public-data's People

Contributors

alekssadowski95, c-ruttkies, feltenberger, kant, mtdersvan, ostegm, robert-srebrovic, skalt, terrencecopney, vihangm, wetherbeei


patents-public-data's Issues

Kicking the tyres

If the project team and @wetherbeei might consider it useful, I could do some per-document comparisons of chemistry extraction, e.g. vs. SureChEMBL and/or IBM. I realise this is a work in progress, but has it advanced to the point where this would be appropriate?

Load Inference Data in LandscapeNotebook.ipynb

Hello, I receive the following error when I try to run the code immediately following "Load Inference Data":


OverflowError Traceback (most recent call last)
in <module>
2
3 subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = \
----> 4 expander.sample_for_inference(td, 0.2)

~/patents-public-data/models/landscaping/expansion.py in sample_for_inference(self, train_data_util, sample_frac)
535 pickle.dump(
536 (subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot),
--> 537 outfile)
538 else:
539 print('Loading inference data from filesystem at {}'.format(inference_data_path))

OverflowError: cannot serialize a bytes object larger than 4 GiB

I tried to add "protocol=4" to the pickle.dump call (see here) but still receive the same error.
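
For reference, pickle protocol 4 (Python 3.4+) is what lifts the 4 GiB framing limit, so a dump along the following lines should serialize the tuple; if the error persists after editing, one possibility is that the edited expansion.py is not the copy the notebook actually imports (a kernel restart or module reload may be needed). A minimal sketch, with inference_data and inference_data_path standing in for the objects used inside sample_for_inference:

import pickle

# Sketch only: protocol 4 removes the 4 GiB limit of older pickle protocols.
def dump_large(inference_data, inference_data_path):
    with open(inference_data_path, 'wb') as outfile:
        pickle.dump(inference_data, outfile, protocol=4)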

patent landscaping pruning

Hello,

It looks like the patent landscape notebook doesn't get all the way through the L1 and L2 expansion and pruning described in the paper. Based on the paper alone, I'm not sure how to reproduce these steps for my own seed set. Do you have additional documentation available?

Thanks!

BERT-Base

This BERT model is actually a BERT-Large model with 24 hidden layers, etc. Is there a corresponding BERT-Base model trained on patent data?

Sklearn 1.1.1 Issue

I can't get this to work with the latest version of scikit-learn (1.1.1).
It's something related to the NearestNeighbors class in the latest version of the library.

Here is the error, from the cell "#@markdown Run the clustering algorithm, calculate cluster characteristics and visualize":

--> 294 nbrs = NearestNeighbors(n_neighbors + 1, metric=metric).fit(x)
295 _, indices = nbrs.kneighbors(x)
297 indices = nbrs.kneighbors(x)

TypeError: NearestNeighbors.__init__() takes 1 positional argument but 2 positional arguments (and 1 keyword-only argument) were given

This is called in this section of code

# Step 1.
# Add 1 since NearestNeighbors returns itself as a nearest point.
nbrs = NearestNeighbors(n_neighbors + 1, metric=metric).fit(x)
_, indices = nbrs.kneighbors(x)

Can't seem to get past this one.
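
For what it's worth, recent scikit-learn releases made the NearestNeighbors constructor arguments keyword-only, which is exactly what the TypeError above reports. A minimal sketch of the keyword-argument form (x is a stand-in for the embedding matrix used in the notebook):

import numpy as np
from sklearn.neighbors import NearestNeighbors

x = np.random.rand(100, 8)         # stand-in for the real embedding matrix
n_neighbors, metric = 5, 'cosine'  # example values

# Pass n_neighbors by keyword; +1 because each point returns itself first.
nbrs = NearestNeighbors(n_neighbors=n_neighbors + 1, metric=metric).fit(x)
_, indices = nbrs.kneighbors(x)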

error in word2vec

Hi,
Congratulations on your awesome work!
I'm trying to run it, but I've got an error I could not figure out.
Trying to run the block "Download Embedding Model if Necessary" from the patent landscape notebook, I got the following error message:


~...\patents-public-data-master\patents-public-data-master\models\landscaping\word2vec.py in download_w2v_model(self, landscape_bucket, model_name)
38
39 def download_w2v_model(self, landscape_bucket, model_name):
---> 40 from google.cloud import storage
41
42 """

ImportError: cannot import name 'storage'


It seems there's no storage module in the google.cloud package. In a Python interactive shell:

import google.cloud as g
help(g)

Help on package google.cloud in google:

shows:

NAME
google.cloud

PACKAGE CONTENTS
_helpers
_http
_testing
bigquery (package)
client
environment_vars
exceptions
iam
obsolete
operation

FILE
(built-in)


I believe I've followed the steps carefully. Am I doing something wrong?

Thanks a lot!
Wagner
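
A note on the ImportError above: the Cloud Storage client ships as its own package (google-cloud-storage), separate from google-cloud-bigquery, so it is quite possibly just missing from the environment. A minimal check, assuming pip access in that environment:

# Sketch: install the standalone client first, e.g. `pip install google-cloud-storage`.
from google.cloud import storage  # raises the ImportError above if the package is absent

client = storage.Client()                    # needs application-default credentials
bucket = client.bucket('patent_landscapes')  # bucket name used by the notebook
print(bucket.name)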

Expiration date

I'm new here so I apologize in advance if I'm deviating from any normal protocol.

I've been exploring the public data sets provided by Google and the USPTO and can't find an API that enables me to access the expiration dates of all US patents. I have a scraping tool that enables me to scrape all patents into a database and then filter on the expiration date field, ascending or descending. That tool doesn't meet my ongoing needs, however, due to 1) the time it takes to scrape all the data and 2) the fact that it doesn't account for updates to the data.

Has anybody had luck pulling expiration dates from any of the data sets?

lack of "vocab"

We finished setting up the environment according to the instructions. Since we could not download the files from the Google Cloud Storage bucket, we created the .model and .npy files ourselves. But we still fail to run "LandscapeNotebook.ipynb": it shows an error at "w2v_runtime = word2vec5_9m.restore_runtime()" because "load_vocab_mappings" fails, at line 476 in "restore_runtime".
So I would like to ask where I can get the "vocab" file. Looking forward to your answer, thanks!

Error when trying to reproduce

I have been trying to recreate the patent landscape code in a Jupyter notebook. Everything runs perfectly until I get to loading the inference data. I get "EOFError: Ran out of input" at line 541 of expansion.py.
Loading inference data from filesystem at data\video_codec\landscape_inference_data.pkl

EOFError Traceback (most recent call last)
in <module>()
----> 1 subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = expander.sample_for_inference(td, 0.2)

~\Documents\Python Scripts\expansion.py in sample_for_inference(self, train_data_util, sample_frac)
539 print('Loading inference data from filesystem at {}'.format(inference_data_path))
540 with open(inference_data_path, 'rb') as infile:
--> 541 inference_data_deserialized = pickle.load(infile)
542
543 subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = \

EOFError: Ran out of input

Appreciate the help. I can put this all on my git if that would help
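
One plausible cause, not confirmed in this thread: a truncated landscape_inference_data.pkl left behind by an earlier failed dump (see the 4 GiB OverflowError issue above), which pickle.load then reads to end-of-file. A sketch that clears the stale cache so sample_for_inference rebuilds it, using the path shown in the traceback:

import os

# Sketch only: remove a possibly-truncated cache file so expansion.py
# regenerates it instead of trying to load it.
stale = os.path.join('data', 'video_codec', 'landscape_inference_data.pkl')
if os.path.exists(stale):
    os.remove(stale)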

Build the Deep Neural Network in LandscapeNotebook.ipynb

Hello,

I'm getting the following error and have been unable to troubleshoot it so far:


ImportError Traceback (most recent call last)
in <module>
----> 1 import model
2 import importlib
3 importlib.reload(model)
4
5 model = model.LandscapeModel(td, 'data', seed_name)

~/patents-public-data/models/landscaping/model.py in <module>
19 from keras.models import Sequential, Model
20 from keras.layers import Dense, Input, Embedding, BatchNormalization, ELU, Concatenate
---> 21 from keras.layers import LSTM, Conv1D, MaxPooling1D, Merge
22 from keras.layers.merge import concatenate
23 from keras.layers.core import Dropout

ImportError: cannot import name 'Merge'

Thanks!
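
For context, the Merge layer was removed in Keras 2.x; the Concatenate/concatenate APIs that model.py already imports cover the same use case. A hedged sketch of the current idiom (not the repo's official fix, just the modern Keras equivalent):

from keras.layers import Input, Dense, concatenate
from keras.models import Model

# Sketch: concatenate two branches the Keras 2.x way instead of using Merge.
a = Input(shape=(16,))
b = Input(shape=(8,))
merged = concatenate([a, b])
out = Dense(1, activation='sigmoid')(merged)
model = Model(inputs=[a, b], outputs=out)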

Possible inconsistency between model tokenizer and white-paper example of 'prothesis'.

When tokenizing 'prothesis' I expect it to map to a single numerical token; however, it is mapped to two, and the original BERT tokenizer, for comparison, does the same thing. To my understanding this tokenizer should tokenize 'prothesis' into a single token, since this is stated as an example in the white paper accompanying this repo. Additionally, I don't see the word 'prothesis' in the .txt vocabulary file provided through this repo. I am curious whether there was a change between what is in this repo and the white paper, or whether this is something new and unexpected?
[screenshot: Tokenizer_issue]

Note: I should mention I am converting this TensorFlow implementation and its supporting files to PyTorch, using Hugging Face.

Better understanding Google Patents Research Dataset

The Google Patents Research dataset (on BigQuery) has a table called annotations, which has useful information for research. The schema of the annotations table comprises the following columns:

  1. publication_number
  2. ocid
  3. preferred_name
  4. domain
  5. source
  6. confidence
  7. character_offset_start
  8. character_offset_end
  9. inchi
  10. inchi_key
  11. smiles
  12. conf_bucket

Most of the fields are understandable through their names and values. However, I am having a hard time figuring out what character_offset_start and character_offset_end mean here. Intuitively, they should refer to character offsets in the original patent documents (column 1), but since patents are mostly PDFs it is not very clear how to make use of them. If anyone has any idea, please share.

Thanks in advance

error when trying to reproduce

when I tried to reproduce this model and input:
from word2vec import Word2Vec

word2vec5_9m = Word2Vec('5.9m')
w2v_runtime = word2vec5_9m.restore_runtime()

I got error information:

OutOfRangeError Traceback (most recent call last)
~\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1326 try:
-> 1327 return fn(*args)
1328 except errors.OpError as e:

~\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1311 return self._call_tf_sessionrun(
-> 1312 options, feed_dict, fetch_list, target_list, run_metadata)
1313

~\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1419 self._session, options, feed_dict, fetch_list, target_list,
-> 1420 status, run_metadata)
1421

~\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
515 compat.as_text(c_api.TF_Message(self.status.status)),
--> 516 c_api.TF_GetCode(self.status.status))
517 # Delete the underlying status object from memory otherwise it stays alive

OutOfRangeError: Read fewer bytes than requested
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

OutOfRangeError Traceback (most recent call last)
in <module>()
2
3 word2vec5_9m = Word2Vec('5.9m')
----> 4 w2v_runtime = word2vec5_9m.restore_runtime()

~\patents-public-data\models\landscaping\word2vec.py in restore_runtime(self)
481 with tf.Session(graph=w2v_graph.train_graph, config=GPU_MEM_CONFIG) as sess:
482 saver = tf.train.Saver()
--> 483 saver.restore(sess, tf.train.latest_checkpoint(self.checkpoints_path))
484 embedding_weights, normed_embedding_weights = \
485 sess.run([w2v_graph.embedding, w2v_graph.normalized_embedding])

~\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\training\saver.py in restore(self, sess, save_path)
1773 else:
1774 sess.run(self.saver_def.restore_op_name,
-> 1775 {self.saver_def.filename_tensor_name: save_path})
1776
1777 @staticmethod

~\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
903 try:
904 result = self._run(None, fetches, feed_dict, options_ptr,
--> 905 run_metadata_ptr)
906 if run_metadata:
907 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1138 if final_fetches or final_targets or (handle and feed_dict_tensor):
1139 results = self._do_run(handle, final_targets, final_fetches,
-> 1140 feed_dict_tensor, options, run_metadata)
1141 else:
1142 results = []

~\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1319 if handle is None:
1320 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1321 run_metadata)
1322 else:
1323 return self._do_call(_prun_fn, handle, feeds, fetches)

~\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1338 except KeyError:
1339 pass
-> 1340 raise type(e)(node_def, op, message)
1341
1342 def _extend_graph(self):

OutOfRangeError: Read fewer bytes than requested
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
File "C:\Users\xyz\Miniconda3\envs\py35\lib\runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\runpy.py", line 85, in run_code
exec(code, run_globals)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\ipykernel_main
.py", line 3, in
app.launch_new_instance()
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\traitlets\config\application.py", line 658, in launch_instance
app.start()
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\ipykernel\kernelapp.py", line 486, in start
self.io_loop.start()
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tornado\platform\asyncio.py", line 127, in start
self.asyncio_loop.run_forever()
File "C:\Users\xyz\Miniconda3\envs\py35\lib\asyncio\base_events.py", line 421, in run_forever
self._run_once()
File "C:\Users\xyz\Miniconda3\envs\py35\lib\asyncio\base_events.py", line 1425, in _run_once
handle._run()
File "C:\Users\xyz\Miniconda3\envs\py35\lib\asyncio\events.py", line 127, in _run
self._callback(*self._args)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tornado\platform\asyncio.py", line 117, in _handle_events
handler_func(fileobj, events)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tornado\stack_context.py", line 276, in null_wrapper
return fn(*args, **kwargs)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\zmq\eventloop\zmqstream.py", line 450, in _handle_events
self._handle_recv()
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\zmq\eventloop\zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\zmq\eventloop\zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tornado\stack_context.py", line 276, in null_wrapper
return fn(*args, **kwargs)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\ipykernel\kernelbase.py", line 233, in dispatch_shell
handler(stream, idents, msg)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\ipykernel\ipkernel.py", line 208, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\ipykernel\zmqshell.py", line 537, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\IPython\core\interactiveshell.py", line 2662, in run_cell
raw_cell, store_history, silent, shell_futures)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\IPython\core\interactiveshell.py", line 2785, in _run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\IPython\core\interactiveshell.py", line 2903, in run_ast_nodes
if self.run_code(code, result):
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 4, in
w2v_runtime = word2vec5_9m.restore_runtime()
File "C:\Users\xyz\patents-public-data\models\landscaping\word2vec.py", line 482, in restore_runtime
saver = tf.train.Saver()
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\training\saver.py", line 1311, in init
self.build()
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\training\saver.py", line 1320, in build
self._build(self._filename, build_save=True, build_restore=True)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\training\saver.py", line 1357, in _build
build_save=build_save, build_restore=build_restore)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\training\saver.py", line 809, in _build_internal
restore_sequentially, reshape)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\training\saver.py", line 448, in _AddRestoreOps
restore_sequentially)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\training\saver.py", line 860, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1457, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\framework\ops.py", line 3290, in create_op
op_def=op_def)
File "C:\Users\xyz\Miniconda3\envs\py35\lib\site-packages\tensorflow\python\framework\ops.py", line 1654, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): Read fewer bytes than requested
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

I don't know what the problem is, since I followed all your instructions!

ResourceExhaustedError while running Document_representation_from_BERT

I have been trying to run Document_representation_from_BERT on a local machine with enough memory, i.e. 8 GB of RAM.
All the other TF functions run without this error in other notebooks on my local machine.

But when I load the Patent_BERT model, i.e.

model = tf.compat.v2.saved_model.load(export_dir=MODEL_DIR, tags=['serve'])
model = model.signatures['serving_default']

it also gives a similar error at:

docs_embeddings = []
for _, row in df.iterrows():
    inputs = get_bert_token_input(row['claims'])
    response = model(**inputs)
    avg_embeddings = pooling(
        tf.reshape(response['encoder_layer'], shape=[1, -1, 1024]))
    docs_embeddings.append(avg_embeddings.numpy()[0])

Please help me get this working. I have already spent a lot of time trying to solve the issue, but to no avail.

expansion.py load_training_data_from_pubs

Hello,

I think this is a formatting question for the BigQuery command: how would I also download the inventor and assignee fields? I must be getting the formatting wrong; I keep getting messages that the inventor and assignee (or inventor_harmonized and assignee_harmonized) columns are not available.

Thanks!
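
One possible explanation, offered tentatively: in patents-public-data.patents.publications, inventor_harmonized and assignee_harmonized are repeated record fields, so they cannot be selected as flat columns the way scalar fields are; they have to be aggregated into arrays (or UNNESTed). A hedged sketch, with the field handling assumed from that table's schema rather than taken from expansion.py:

from google.cloud import bigquery

# Hedged sketch: pull harmonized inventor/assignee names as arrays.
sql = """
SELECT
  publication_number,
  ARRAY(SELECT name FROM UNNEST(inventor_harmonized)) AS inventors,
  ARRAY(SELECT name FROM UNNEST(assignee_harmonized)) AS assignees
FROM `patents-public-data.patents.publications`
WHERE publication_number = 'US-9824690-B2'
"""
for row in bigquery.Client().query(sql).result():
    print(row.publication_number, row.inventors, row.assignees)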

Missing embedding

Patent US-7398565-B1 had an embedding before, but it seems to have vanished recently (< 3 months ago).
There might be others.

SELECT publication_number, embedding_v1
FROM `patents-public-data.google_patents_research.publications`
WHERE
publication_number in ('US-7398565-B1')
LIMIT 1000

Maybe they can be restored.

BERT for Patents: unable to access hidden layers

Hi there,

on the BERT for Patents "saved model", how can we access the hidden layers?

The response signature only contains "encoder_layer" (the final layer). But we would like to access all 24 layers.

Do we need to use the "checkpoint"? If so, how can we load this?

Thanks in advance!

Decreasing number of annotations in google patents research in recent batches

Could someone help me understand why the number of annotations in Google Patents Research is dropping in recent batches?

202208 version: 59,089,580,018 rows
202212 version: 60,246,963,593 rows
202304 version: 63,130,241,301 rows
202307 version: 48,064,657,811 rows
current version: 41,000,981,833 rows

The data size seems to increase before 202307 but decreases greatly afterwards. Is this due to model or coverage changes, or some other reason? And which version of the data should I rely on for analysis? Is the most recent version the most trustworthy? Thanks!

embedding model is not found// Automated Patent Landscaping

Hello, good day!

Thank you for your work.

Unfortunately, the embedding model cannot be found.
Is there a new storage location for the data?

error message:

NotFound: 404 GET https://storage.googleapis.com/download/storage/v1/b/patent_landscapes/o/models%2F5.9m%2Fcheckpoints%2Fcheckpoint?alt=media: The specified bucket does not exist.: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

Empty Tables in the Dataset

I'm not sure whether I've done something wrong, but some tables in the dataset on BigQuery are showing up as empty for me.

See attached screenshot:

[screenshot: Screen Shot 2021-09-29 at 5.57.02 PM]

[landscape] what are the costs of running the model?

I'm interested in reproducing the work in a Docker container, but would rather not if it means I will incur excessive costs in:

  • processing time
  • local memory
  • BigQuery TB processed

What are the orders of magnitude of the costs of running the model?

Generating new Document Embeddings

From what I can glean, it seems that you are generating patent embeddings directly. Is this correct? And if so, are you then forced to rerun the model every time new patents are published?

Is it possible to generate new embeddings using unpublished patent data like an abstract, etc?

Unable to access or download Word2Vec Embeddings for Patent Landscaping

Hello,
I ran into trouble when trying to download the Word2Vec embeddings from Google Cloud Storage; it produced the following error:
The billing account for the owning project is disabled in state absent: ('Request failed with status code', 403, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

I tried to access the bucket patent_landscapes and it showed the following message:
Additional permissions required to list objects in this bucket. Ask a bucket owner to grant you 'storage.objects.list' permission

There were no visible objects in the bucket. I hope someone can help me with the code or with a copy of the Word2Vec embeddings.

Thank you in advance!

Large decrease in number of OCID annotations

We have observed a large decrease in the number of OCID annotations available in the recent Google Patents public data. We specifically consume the OCIDs associated with patents, so I will focus on that here. It appears that a large number of patents that used to be annotated with the OCIDs of specific entities (in our case genes) are no longer annotated with those OCIDs.

To give one specific example, if we take STAT1/ENSG00000115415 OCID:102100019657 and application US-201816499393-A, the previous release had 32 OCIDs associated with this application:

OCIDs

102100004941 102100002816 102100017157 102100004159 102100016295 102100020485 102100017509 102100019667 102100019388 102100008658 102100018913 102100015895 102100017329 102100019517 102100009637 102100000197 102100008614 102100005617 102100016662 102100009641 102100017996 102100019657 102100003514 102100017933 102100009664 102100019816 102100015722 102100017932 102100019099 102100012464 102100010255 102100002212

The most recent release doesn't have any, missing the STAT1 annotation completely even though it's clearly in the text of the patent. Further, if we count unique patents annotated with the STAT1 OCID over time:

[chart: unique patents annotated with the STAT1 OCID, by release]

In the most recent public data there appear to be half as many publications with STAT1 annotations. Is there any specific reason for this?

Could be related to #88

confidence>1 for annotations in google patent research

In the current Google Patents Research annotations, confidence (and the corresponding conf_bucket) is sometimes >1 (>1000). Is there some numerical or formatting error behind this? If not, how should I interpret a confidence score >1? Thanks!

Here are some samples from BigQuery with confidence > 1:

sampleid | publication_number  | confidence | conf_bucket
1        | US-6818775-B2       | 1.05       | 1050
2        | AU-2017219004-B2    | 1.0999999  | 1099
3        | US-2016046635-A1    | 1.1999998  | 1199
4        | US-9643971-B2       | 1.1999998  | 1199
5        | ES-2438576-T3       | 1.05       | 1050
6        | JP-WO2007013641-A1  | 1.05       | 1050

Dataset lacking cited_by data even though it's available on the website

There are a lot of patents that don't have any cited_by data in patents-public-data.google_patents_research.publications even though the data is available on the website. Has anyone else tried to get that data and found a solution to this?

Here is an example:
Google Patents website
Document with publication number 'US-9824690-B2' does have 80 cited by entries on the website
https://patents.google.com/patent/US9824690B2/en?q=makerbot&assignee=Makerbot+Industries%2c+Llc


Google Cloud BigQuery
But there are none in the patents-public-data.google_patents_research.publications dataset.

SELECT publication_number, cited_by
FROM `patents-public-data.google_patents_research.publications` 
WHERE publication_number = 'US-9824690-B2'
LIMIT 50


Can anyone explain this?

Unable to use the Patent-BERT

This is more of a question than a bug report; I'm not sure where to ask.
I want to use the BERT embeddings for some patents, so I followed the standard TensorFlow tutorial for using BERT (classify_text_with_bert.ipynb). I downloaded and unzipped the saved model and got something like this, very similar to the BERT model structure in the tutorial:

Archive:  saved_model.zip
   creating: temp_dir/rawout/
   creating: temp_dir/rawout/variables/
  inflating: temp_dir/rawout/variables/variables.data-00000-of-00001  
  inflating: temp_dir/rawout/variables/variables.index  
  inflating: temp_dir/rawout/saved_model.pb  
   creating: temp_dir/rawout/assets.extra/
  inflating: temp_dir/rawout/assets.extra/vocab.txt

Then, instead of this line from the tutorial:
bert_model = hub.KerasLayer('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1')
I passed:
bert_model = hub.KerasLayer('/content/temp_dir/rawout')
It threw this error:

ValueError: Importing a SavedModel with tf.saved_model.load requires a 'tags=' argument if there is more than one MetaGraph. Got 'tags=None', but there are 2 MetaGraphs in the SavedModel with tag sets [['serve'], ['serve', 'tpu']]. Pass a 'tags=' argument to load this SavedModel.

So I tried
bert_model = hub.KerasLayer('/content/temp_dir/rawout', tags=['serve'])
Then I got this error:
ValueError: Signature name has to be specified for non-callable saved models (if not legacy TF1 Hub format).

Can someone please tell me the correct way to use this model locally?
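
One workaround that may help, mirroring how the Document_representation_from_BERT issue above loads the same artifact: skip hub.KerasLayer and load the SavedModel directly, selecting the 'serve' MetaGraph and the 'serving_default' signature. A sketch, assuming a TF 2.x runtime:

import tensorflow as tf

# Sketch: load the unzipped SavedModel directly instead of via hub.KerasLayer.
MODEL_DIR = '/content/temp_dir/rawout'
loaded = tf.saved_model.load(MODEL_DIR, tags=['serve'])
bert = loaded.signatures['serving_default']
print(bert.structured_input_signature)  # inspect the inputs the signature expects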

context tokens

Thanks for sharing your great work!

I am trying to use the special "context" tokens, and I just want to make sure that it is okay to simply add the token at the head of each part, e.g. adding [claim] before the text of the first claim?

Also, I'm not sure about the other two tokens, [invention] and [summary]. Will they both work for the description? Or are they for very specific parts of a patent?

Many thanks in advance!

API limit exceeded: Unable to return a row that exceeds the API limits. To retrieve the row, export the table.

Hello,

First of all, it is really an exciting project and paper. Thank you!!

I tried to adapt the landscaping method to my project, but when I got to the "Load Inference Data" step, it showed the "API limit exceeded: Unable to return a row that exceeds the API limits. To retrieve the row, export the table" error. It seems the only workaround is to "export the table" (please see: https://stackoverflow.com/questions/37547711/bigquery-api-limit-exceeded-error).
What does this mean? Which table should be exported?
In addition, I tried to split the L1 expansion patents into several smaller batches that could be fed into the model for prediction, but that seems to be beyond my ability... Would you please give me some hints, or tell me how to put all the L1 expansion patents into the model? Thank you very much.

All the best,
Syuan-Yi
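
For context, "export the table" in that error refers to a BigQuery export job to Cloud Storage (the per-row size limit applies when rows are fetched back through the BigQuery API), so one option is to export the intermediate table the notebook builds and read the exported files from GCS instead. A sketch with hypothetical project, dataset, table, and bucket names:

from google.cloud import bigquery

# Sketch only: all names below are placeholders, not ones used by the notebook.
client = bigquery.Client(project='my-project')
source_table = 'my-project.my_dataset.l1_expansion'
destination_uris = 'gs://my-bucket/l1_expansion-*.csv'
client.extract_table(source_table, destination_uris).result()  # waits for the export job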

Error in word2vec: model from Google Cloud Storage was not downloaded

Hello, good day!

Thank you for your work. I'm trying to run the automated patent landscape notebook, but I've got an error I could not figure out.

Trying to run the block "Download Embedding Model if Necessary" from the patent landscape notebook, I got the following error message:


InvalidResponse Traceback (most recent call last)
C:\Anaconda\envs\new-environment\lib\site-packages\google\cloud\storage\client.py in download_blob_to_file(self, blob_or_uri, file_obj, start, end, raw_download, if_generation_match, if_generation_not_match, if_metageneration_match, if_metageneration_not_match, timeout, checksum)
718 try:
--> 719 blob_or_uri._do_download(
720 transport,

C:\Anaconda\envs\new-environment\lib\site-packages\google\cloud\storage\blob.py in _do_download(self, transport, file_obj, download_url, headers, start, end, raw_download, timeout, checksum)
960 )
--> 961 response = download.consume(transport, timeout=timeout)
962 self._extract_headers_from_download(response)

C:\Anaconda\envs\new-environment\lib\site-packages\google\resumable_media\requests\download.py in consume(self, transport, timeout)
167
--> 168 self._process_response(result)
169

C:\Anaconda\envs\new-environment\lib\site-packages\google\resumable_media\_download.py in _process_response(self, response)
184 self._finished = True
--> 185 _helpers.require_status_code(
186 response, _ACCEPTABLE_STATUS_CODES, self._get_status_code

C:\Anaconda\envs\new-environment\lib\site-packages\google\resumable_media\_helpers.py in require_status_code(response, status_codes, get_status_code, callback)
98 callback()
---> 99 raise common.InvalidResponse(
100 response,

InvalidResponse: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

During handling of the above exception, another exception occurred:

NotFound Traceback (most recent call last)
in <module>
3 model_name = '5.9m'
4 model_download = W2VModelDownload(bq_project)
----> 5 model_download.download_w2v_model('patent_landscapes', model_name)
6 print('Done downloading model {}!'.format(model_name))

~\patents-public-data\models\landscaping\word2vec.py in download_w2v_model(self, landscape_bucket, model_name)
54 bucket = client.bucket('patent_landscapes')
55 blob = bucket.blob(checkpoint_list_file)
---> 56 checkpoints = blob.download_as_string(client=client).decode()
57 checkpoint_file = 'n/a'
58

C:\Anaconda\envs\new-environment\lib\site-packages\google\cloud\storage\blob.py in download_as_string(self, client, start, end, raw_download, if_generation_match, if_generation_not_match, if_metageneration_match, if_metageneration_not_match, timeout)
1385 stacklevel=1,
1386 )
-> 1387 return self.download_as_bytes(
1388 client=client,
1389 start=start,

C:\Anaconda\envs\new-environment\lib\site-packages\google\cloud\storage\blob.py in download_as_bytes(self, client, start, end, raw_download, if_generation_match, if_generation_not_match, if_metageneration_match, if_metageneration_not_match, timeout, checksum)
1294 client = self._require_client(client)
1295 string_buffer = BytesIO()
-> 1296 client.download_blob_to_file(
1297 self,
1298 string_buffer,

C:\Anaconda\envs\new-environment\lib\site-packages\google\cloud\storage\client.py in download_blob_to_file(self, blob_or_uri, file_obj, start, end, raw_download, if_generation_match, if_generation_not_match, if_metageneration_match, if_metageneration_not_match, timeout, checksum)
729 )
730 except resumable_media.InvalidResponse as exc:
--> 731 _raise_from_invalid_response(exc)
732
733 def list_blobs(

C:\Anaconda\envs\new-environment\lib\site-packages\google\cloud\storage\blob.py in _raise_from_invalid_response(error)
4059 )
4060
-> 4061 raise exceptions.from_http_status(response.status_code, message, response=response)
4062
4063

NotFound: 404 GET https://storage.googleapis.com/download/storage/v1/b/patent_landscapes/o/models%2F5.9m%2Fcheckpoints%2Fcheckpoint?alt=media: Not Found: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)


The problem may be with the patent_landscapes bucket in Google Cloud Storage. I couldn't access it manually. The message is:

Sorry, the server was not able to fulfil your request.

Thank you in advance!

Description and claims are missing for JP patents data.

When I queried patent records with country_code "JP" as below, description_localized.text and claims_localized.text were always NULL.

SELECT * FROM `patents-public-data.patents.publications` 
WHERE country_code = "JP"

Only abstract_localized is available, but this is not enough for my use case.
I'm not sure whether this is an expected issue, because Google Patents search can include both descriptions and claims in its results without any problem.

Can you help me if I'm missing something?

SureChEMBL

I noticed that the SureChEMBL database in BigQuery was last updated in Nov 2018. Are there plans to continue updating the SureChEMBL database within BigQuery? Are there ways to do chemistry-related searches (fingerprint similarity, SMARTS) in BigQuery?

claim_text_extraction.ipynb df = pd.read_csv('./data/20k_G_and_H_publication_numbers.csv') workaround

I had a hard time getting this bit of code to work:

df = pd.read_csv('./data/20k_G_and_H_publication_numbers.csv')

I went into JupyterLab, copied the file path of 20k_G_and_H_publication_numbers.csv,
and then pasted the file path.
For me, the command looked like this:

df=pd.read_csv('GCS/20k_G_and_H_publication_numbers.csv')

I don't know whether I'm doing that ^^ right, but it didn't load the dataframe.
I found this Google Cloud Storage bucket workaround on Stack Overflow:

Store the file in a GCS bucket.

1. Upload your file to GCS.

2. In your notebook, type the following code, replacing the bucket and file names accordingly:

import pandas as pd
from google.cloud import storage
from io import BytesIO
client = storage.Client()
bucket_name = "your-bucket"
file_name = "your_file.csv"
bucket = client.get_bucket(bucket_name)
blob = bucket.get_blob(file_name)
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content))
print(df)

Credit to OP

^^ this workaround worked for me
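
A shorter variant that may also work, assuming the gcsfs package is installed (pandas then resolves gs:// URLs itself; the bucket name below is a placeholder):

import pandas as pd

# Sketch: read the object straight from GCS; requires `pip install gcsfs`.
df = pd.read_csv('gs://your-bucket/20k_G_and_H_publication_numbers.csv')
print(df.head())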

Error with shape of arguments for Google Patents saved model prediction

While running the following line of code:
response, inputs, masked_text = bert_predictor.predict(texts)
the following error comes up regarding the shape of a tensor. I have followed the Jupyter notebook as-is.
(TF version: 1.15.2)

----------------------Error--------------------------

ValueError Traceback (most recent call last)
in <module>()
----> 1 response, inputs, masked_text = bert_predictor.predict(texts)
2 train_inputs = response['cls_token']
3 train_labels = tf.convert_to_tensor(classes)

3 frames
in predict(self, texts, mlm_ids, context_tokens)
140 input_mask=tf.convert_to_tensor(inputs['input_mask'], dtype=tf.int64),
141 input_ids=tf.convert_to_tensor(inputs['input_ids'], dtype=tf.int64),
--> 142 mlm_positions=tf.convert_to_tensor(inputs['mlm_ids'], dtype=tf.int64),
143 )
144

/tensorflow-1.15.2/python3.6/tensorflow_core/python/eager/function.py in call(self, *args, **kwargs)
1079 TypeError: For invalid positional/keyword argument combinations.
1080 """
-> 1081 return self._call_impl(args, kwargs)
1082
1083 def _call_impl(self, args, kwargs, cancellation_manager=None):

/tensorflow-1.15.2/python3.6/tensorflow_core/python/eager/function.py in _call_impl(self, args, kwargs, cancellation_manager)
1119 raise TypeError("Keyword arguments {} unknown. Expected {}.".format(
1120 list(kwargs.keys()), list(self._arg_keywords)))
-> 1121 return self._call_flat(args, self.captured_inputs, cancellation_manager)
1122
1123 def _filtered_call(self, args, kwargs):

/tensorflow-1.15.2/python3.6/tensorflow_core/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1208 arg_name, arg,
1209 self._func_graph.inputs[i].shape,
-> 1210 arg.shape))
1211 elif (self._signature is not None and
1212 isinstance(self._signature[i], tensor_spec.TensorSpec)):

ValueError: The argument 'mlm_positions' (value Tensor("Const_20:0", shape=(0,), dtype=int64)) is not compatible with the shape this function was traced with. Expected shape (?, 45), but got shape (0,).

If you called get_concrete_function, you may need to pass a tf.TensorSpec(..., shape=...) with a less specific shape, having None on axes which can vary.

Linking proteins and humangenes annotation preferred name to identifier

I'm looking at annotations in the proteins and humangenes domains in the google_patent_research.annotations table. It appears that the annotations themselves are normalized to their preferred name, but I was wondering whether there is any way to link the preferred name of a gene annotation to some kind of unique identifier, such as HGNC, that can be used to ground these annotations?

I see that there is a huge amount of information in the ebi_chembl section, with seemingly promising table names, but I haven't spotted a useful connection by looking through the schemas.
