Git Product home page Git Product logo

regparser-lisp's Introduction

Regulations Parser

This project parses the XML regulations from http://www.federalregister.gov and turns them into structured XML files which conform to the CFPB's RegML schema. Right now, only the "initial" notices are supported; what this means is that the first time an agency issues a regulation, it posts the full new regulation, but subsequent notices only indicate modifications of the previous version, and only that full new version is supported by the parser. Support for notice updates is pending.

Usage

regparser is quicklisp-loadable; I suggest making a symlink from your local-projects directory to wherever you checked out this repo. Using your favorite REPL, do:

(ql:quickload :regulations-parser)

If all goes well, quicklisp will load all the project dependencies compile the project itself. If all does not go well, check to see if your Lisp is using UTF-8 as the default encoding for reading input files. This is usually the case on Linux with SBCL, but on Macs running SBCL, the default seems to be ASCII. You can change this by evaluating

(setf sb-impl::*default-external-format* :utf-8)

assuming you are running SBCL. Other Lisps should have similar facilities.

If everything has loaded properly, you can now run the parser. The entry point to doing so is the parse-reg-to-xml function, which takes as its arguments the file to be parsed (that's whatever you got from the Federal Register), the document number (a unique string used by the FR to identify the notice), and the output file, which should have an .xml extension. Keep in mind that the first argument is not a string but a pathspec, and if you pass a string, you'll probably get a weird error. Your call should look something like this (using the time macro for timing):

(time
 (regparser::parse-reg-to-xml #p"/path/to/regparser-lisp/2011-31715.xml"
 "2011-31715"
 "/path/to/regparser-lisp/out.xml"))

and the result will look something like this:

Evaluation took:
 100.795 seconds of real time
 101.253473 seconds of total run time (92.306040 user, 8.947433 system)
 [ Run times consist of 29.511 seconds GC time, and 71.743 seconds non-GC time. ]
 100.45% CPU
 56 forms interpreted
 158 lambdas converted
 281,582,924,310 processor cycles
 12 page faults
 124,678,227,440 bytes consed

Congrats, you have parsed the first notice of Regulation Z in under two minutes. Feel free to take a look at out.xml and validate it relative to the eregs.xsd schema linked above. It should validate, but there may be a few minor mistakes that I have yet to correct. For the most part it will be valid, though.

Future work

This code parses Regulations Z and E, issued by the CFPB, as test cases. As these were the original regs that the Pythonic version of the regulation parser was validated on, I think that's pretty solid. What this does not do is parse the subsequent updates to those regulations via notices. That is in the works and I hope to add that capability in the near future. I should also note that this parser generally strives for adequacy over perfection. For example, it does not do any kind of clever NLP to pick out definitions; it just looks at a few patterns that indicate that something is being defined here (e.g. "foo means bar"). This suffices for 90% of cases, and the other 10% are more easily fixed in "post-production" than coded into the grammar.

Issues

If something breaks, please leave an issue. The code as presented is functional but obviously not mistake-free by any means. The problem with parsing FR notices is that they tend to have inconsistencies in their XML depending on who did the conversion, which notice, the phase of the moon, etc. Because the FR does not guarantee any sort of consistent representation (unlike RegML, which does), it's impossible to promise that it will compile your specific notice; it may not. However, if it does not, and you know why it failed, I'd appreciate it if you would leave a note with an explanation.

regparser-lisp's People

Contributors

grapesmoker avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.