patents: Introduction

Parse patent grant and assignment info from USPTO and match with Compustat data. This handles all patent grant formats (dat, pgb, ipgb) and uses sqlite3 as the storage backend.

Additionally, cluster patents by firm name. This uses locality-sensitive hashing as a first pass, then finds the connected components in the graph induced by a Levenshtein distance threshold.
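The two-stage idea can be sketched roughly as follows. This is a minimal illustration, not the code in firm_cluster.py: the function names, the character 3-gram MinHash scheme, the band/row counts, and the distance threshold of 2 are all assumptions chosen for the example.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def shingles(name, k=3):
    """Character k-grams of the space-padded name."""
    padded = f" {name} "
    return {padded[i:i + k] for i in range(max(1, len(padded) - k + 1))}


def minhash(name, num_hashes):
    """MinHash signature; salted CRC32 stands in for a real hash family."""
    grams = shingles(name)
    return [min(zlib.crc32(f"{seed}:{g}".encode()) for g in grams)
            for seed in range(num_hashes)]


def candidate_pairs(names, bands=8, rows=2):
    """First pass: names sharing any LSH band become candidate pairs."""
    buckets = defaultdict(set)
    for idx, name in enumerate(names):
        sig = minhash(name, bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def cluster_names(names, max_dist=2):
    """Second pass: union-find over candidate pairs within max_dist edits."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in candidate_pairs(names):
        if levenshtein(names[i], names[j]) <= max_dist:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, name in enumerate(names):
        groups[find(i)].append(name)
    return list(groups.values())
```

The hashing pass only proposes pairs, so names that never share a band are never compared; the Levenshtein check then decides which proposed pairs become edges. Because clusters are connected components, two names can end up together through a chain of near matches even if they are more than max_dist edits apart.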

I also maintain a simplified repository with only the parsing code, which is kept roughly in sync with this, over at: patents_simple. Note that this can also parse Chinese patent data!

You can also find some higher level analysis code, mostly using pandas, in the patents_analyze repository.

File descriptions

Below is the pipeline that you'll want to follow. There are many small design decisions I've made along the way, and you may want to tweak these to suit your own purposes.

  • Acquiring data
    • fetch_grants.py: fetch patent grant files
    • fetch_assign.py: fetch patent assignment files
    • batch unzip XML files: ls *.zip | xargs -n 1 unzip
  • Parsing raw data files
    • parse_grants.py: parse patent grants (including citations), all data formats
    • parse_assign.py: parse patent assignments
    • parse_maint.py: parse patent maintenance events
    • parse_compustat.py: parse Compustat data
  • Cleaning patent data
    • process_assign.py: flag assignments between the same entity
  • Name matching and firm aggregation
    • firm_cluster.py: match firms by name from all data sources
    • process_cites.py: resolve citations at firm level and find self-cites
    • firm_merge.py: merge all of the above into a firm-year panel
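If xargs or unzip is not available, the batch unzip step in the list above can be approximated in Python with the standard library. This is a sketch only; unzip_all is a hypothetical helper name and is not part of this repository.

```python
import zipfile
from pathlib import Path


def unzip_all(src_dir, dest_dir=None):
    """Extract every *.zip in src_dir; returns the list of member names."""
    src = Path(src_dir)
    # Default to extracting next to the archives, like the shell one-liner.
    dest = Path(dest_dir) if dest_dir is not None else src
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(src.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
            extracted.extend(zf.namelist())
    return extracted
```

For example, unzip_all("data/grant_files") would extract the downloaded grant archives in place.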

Database layout

The parsed data is stored and manipulated with sqlite3 in a single file. I usually keep this file in the store directory. All of the parse commands take a --db argument that specifies the exact file name. The internal layout is:

  • patent: (patnum int, filedate text, grantdate text, class text, ipc text, ipcver text, city text, state text, country text, owner text, claims int, title text, abstract text, gen int)
  • assign: (assignid integer primary key, patnum int, execdate text, recdate text, conveyance text, assignor text, assignee text, assignee_state text, assignee_country text)
  • maint: (patnum int, ever_large int, last_maint int)
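To get a feel for this layout, the sketch below recreates the three tables in a scratch database and counts recorded assignments per patent. The column definitions are copied from the list above; open_store and assignments_per_patent are hypothetical helper names, and the parse scripts may create the tables differently (for example with indexes).

```python
import sqlite3

# Table definitions copied from the layout described above.
SCHEMA = """
create table if not exists patent (
    patnum int, filedate text, grantdate text, class text, ipc text,
    ipcver text, city text, state text, country text, owner text,
    claims int, title text, abstract text, gen int
);
create table if not exists assign (
    assignid integer primary key, patnum int, execdate text, recdate text,
    conveyance text, assignor text, assignee text, assignee_state text,
    assignee_country text
);
create table if not exists maint (patnum int, ever_large int, last_maint int);
"""


def open_store(path=":memory:"):
    """Open (or create) a database file and make sure the tables exist."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con


def assignments_per_patent(con):
    """Return (patnum, number of assignment records) for every patent."""
    query = """
        select p.patnum, count(a.assignid)
        from patent p left join assign a on a.patnum = p.patnum
        group by p.patnum
        order by p.patnum
    """
    return con.execute(query).fetchall()
```

Passing a real path instead of ":memory:" would open an existing database file, in which case the "if not exists" clauses leave already-parsed tables untouched.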

Data sources

The fetch commands use the following layout:

  • data/grant_files: patent grant data from Google/USPTO
  • data/assign_files: patent reassignment data from Google/USPTO
  • data/maint_files: patent maintenance data from Google/USPTO
  • data/compustat_files: Compustat data since 1950 from WRDS
