parse-uspto-xml's Introduction

Parse USPTO

Step 1: Download XML patents: https://bulkdata.uspto.gov/

Download under the header:

Patent Application Full Text Data (No Images) (MAR 15, 2001 - PRESENT)
Contains the full text of each patent application (non-provisional utility and plant)
published weekly (Thursdays) from March 15, 2001 to present (excludes images/drawings).
Subset of the Patent Application Full Text Data with Embedded TIFF Images.

The current script(s) only work on version 4.0 or higher of the XML (2005 - Present).

The scripts also skip all documents that contain DNA sequences.

Step 2: Extract the zipped file to obtain the *.xml file.

Step 3: Install the package locally with:

pip install -e .

Step 4: Individual patents (or directories) can then be parsed with:

python parse_uspto_xml/parse_patent.py <filename.xml> <filename.xml> <directory>

You can edit the filename variable in parse_patent.py to match the unzipped file. Each unzipped weekly file typically contains thousands of patents, all of which can be parsed in one run.

If you pass a directory to parse_patent.py, it will load all the .xml files in that directory.
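
To illustrate why one weekly file yields thousands of patents, here is a minimal sketch of splitting such a file into individual documents. It is not the code in parse_patent.py. It assumes each patent starts with its own `<?xml` declaration line, and that DNA sequence documents can be recognised by a `<sequence-cwu` tag; both are assumptions about the bulk data layout, and the function names are hypothetical.

```python
from typing import Iterable, Iterator

XML_DECLARATION = "<?xml"


def split_patent_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one XML document (as a string) per patent in a concatenated file."""
    current = []
    for line in lines:
        # Assumption: a new XML declaration marks the start of the next patent.
        if line.lstrip().startswith(XML_DECLARATION) and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


def is_sequence_listing(document: str) -> bool:
    """Assumption: DNA sequence documents contain a <sequence-cwu> element."""
    return "<sequence-cwu" in document
```

Passing an open file handle (for example `open("ipa200109.xml", encoding="utf-8")`) to `split_patent_documents` streams the documents one at a time, so the whole weekly file never has to sit in memory at once.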

Download all Files

For 2005 to the present, you can download all the zip files for a given year using a command of the following form:

wget -r -np -l1 -nd -A zip https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/<year>/

This will download all the .zip files off that page, which contain all the patents for the given year (as a zipped .xml file).

If the -nd option is dropped, wget recreates the remote directory structure locally, in the form:

bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/<year>/<filename>.zip

It's recommended to create a folder such as patent/<year> and run the command from inside it to download all the .zip files.
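
The folder-per-year workflow can be scripted. The following is a minimal sketch, not part of this repository: it builds the listing URL for a year, creates patent/<year>, and runs the same wget command from inside that folder. The function names are hypothetical, and wget is assumed to be installed and on the PATH.

```python
import subprocess
from pathlib import Path

BASE_URL = "https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext"


def year_listing_url(year: int) -> str:
    """Return the listing page URL for one year, in the format shown above."""
    return f"{BASE_URL}/{year}/"


def prepare_year_directory(root: Path, year: int) -> Path:
    """Create <root>/patent/<year> if needed and return its path."""
    target = root / "patent" / str(year)
    target.mkdir(parents=True, exist_ok=True)
    return target


def download_year(root: Path, year: int) -> Path:
    """Run the wget command from inside patent/<year> (needs network access)."""
    target = prepare_year_directory(root, year)
    subprocess.run(
        ["wget", "-r", "-np", "-l1", "-nd", "-A", "zip", year_listing_url(year)],
        cwd=target,
        check=True,  # raise if wget exits with an error
    )
    return target
```

Calling `download_year(Path("."), 2019)` would place that year's archives in ./patent/2019.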

Once all the files are downloaded, you can unzip every archive in the directory with:

unzip \*.zip
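
Where the unzip tool is not available, the same step can be done with Python's standard library. This is a minimal sketch with a hypothetical function name; it extracts each archive into the directory that contains it.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_all_archives(directory: Path) -> List[Path]:
    """Extract every .zip file in `directory` in place.

    Returns the paths of the extracted members, in archive order.
    """
    extracted = []
    for archive_path in sorted(directory.glob("*.zip")):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(directory)
            extracted.extend(directory / name for name in archive.namelist())
    return extracted
```

For example, `extract_all_archives(Path("patent/2019"))` would unpack every archive downloaded for that year.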

Storing in Database

In addition, it's possible to save the parsed data in a database. In this repository, we provide documentation in config/README to configure PostgreSQL to store and search the patents.

Storing only the parsed data also saves space: for example, the 734MB file ipa200109.xml takes up 154MB in the database.

In terms of overall size, the XML files are 367GB, the parsed files (in the database) are

Decompressed data size by year, followed by the number of patent documents parsed without errors:

  • 2005 - 11GB - 157,822
  • 2006 - 15GB - 196,485
  • 2007 - 14GB - 182,968
  • 2008 - 14GB - 185,249
  • 2009 - 16GB - 192,045
  • 2010 - 20GB - 244,589
  • 2011 - 21GB - 248,091
  • 2012 - 24GB - 277,264
  • 2013 - 28GB - 303,641
  • 2014 - 31GB - 327,014
  • 2015 - 32GB - 326,969
  • 2016 - 33GB - 334,674
  • 2017 - 36GB - 352,547
  • 2018 - 35GB - 341,104
  • 2019 - 42GB - 392,618

Get patent count by year (in PostgreSQL):

SELECT date_trunc('year', publication_date), count(*)
FROM uspto_patents
GROUP BY date_trunc('year', publication_date)
ORDER BY date_trunc('year', publication_date);

Contributors

lettergram, jgsweets, tomj