Issues from freelawproject/opinionated

~47,000 broken opinions

We have an issue with the case.law data: around 46,579 opinions are poorly parsed, overly redacted, or otherwise broken.

Unfortunately, we have identified roughly 1,226 of these as not actually containing an opinion.
They seem to belong to a larger group of missing opinions: overly redacted or otherwise oddball opinions.

Normally these are one-line opinions, bunched up on a page.

This leaves the vast majority, around 44,125 opinions, that were malformed. In many of these the full opinion is simply "Vacated", "Remanded", or "Case Dismissed". But, as is common, there was no clear pattern or established "bad XML" that we could simply reverse.

To identify the missing or hidden opinion text, I trained a Maximum Entropy text classifier using CreateML and keyed 12 categories based on good data we had from the Harvard data set. See the distribution of data below.

The training and testing dataset was generated from a random sample of 650 opinions, extracting all of the fields (excluding sub-tags like br, strong, em, extracted-citation). 650 was the number required for a data set this size to ensure a 99% confidence level and a 5% margin of error.
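As a rough check on the sample size, here is a minimal sketch using Cochran's formula with a finite population correction. It assumes the population is the ~46,579 broken opinions mentioned above and a worst-case proportion of 0.5; the function name and parameters are illustrative, not taken from the project.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence=0.99, margin=0.05, p=0.5):
    """Cochran's sample size with a finite population correction.

    Assumptions: simple random sampling and a worst-case proportion p=0.5.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink it slightly for a finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(required_sample_size(46_579))
```

Under these assumptions the result lands in the mid-600s, the same neighborhood as the 650 used here.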

Once parsed, this produced a training set of roughly 1,000 tags that I deemed to be opinions. That number is larger than 650 because opinions were sometimes spread out over multiple tags, multiple attorney tags, etc. I also slowly added false negatives as I reviewed and improved the training set.

This method was effective but not quite as accurate as I would have liked: roughly 93% validation accuracy.

After feeding bad results back into the training set, I switched to a transfer learning text classifier with dynamic embeddings. This eventually increased the validation accuracy to over 99.1%.

In practice, it was closer to 99.9% accurate when identifying just opinions. I have yet to see a failed opinion in reviewing roughly 1,000 generated HTML files.

With the ML model in hand, it was relatively easy to move all opinion data into the opinions and to identify the actually failing opinions that I mentioned at the start, simply by moving the tags identified as opinion data into the empty opinion tags in the bad case.law data.
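The step described above can be sketched roughly as follows. This is only an illustration in Python: the real model was trained with CreateML, so `toy_classifier`, the `"opinion"` label, and the `(tag, text)` field layout are all hypothetical stand-ins.

```python
from typing import Callable, Iterable, Optional, Tuple

OPINION_LABEL = "opinion"  # hypothetical label name, not the project's actual category


def recover_opinion(
    fields: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
) -> Optional[str]:
    """Collect text from every field the classifier labels as opinion text.

    `fields` is an iterable of (tag_name, text) pairs pulled from one case.
    Returns the joined opinion text, or None when nothing was classified
    as an opinion (a genuinely missing opinion).
    """
    parts = [
        text.strip()
        for _tag, text in fields
        if text.strip() and classify(text) == OPINION_LABEL
    ]
    return "\n".join(parts) if parts else None


def toy_classifier(text: str) -> str:
    """Keyword stand-in for the trained model, for demonstration only."""
    keywords = ("vacated", "remanded", "dismissed")
    return OPINION_LABEL if any(k in text.lower() for k in keywords) else "other"


case_fields = [
    ("attorneys", "John Doe, for appellant."),
    ("summary", "Judgment vacated and case remanded."),
]
print(recover_opinion(case_fields, toy_classifier))
```

Cases where `recover_opinion` returns `None` would be the ones flagged as actually lacking an opinion.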

@mlissner

We still have a good portion, roughly 2% of this final push, that contains no opinions. I added a [NO OPINION] text to these opinions and included them in this push, but I would like your thoughts on this decision.

Of note:
This was trained via CreateML, which is an Apple/Linux-only ecosystem for training and Apple-only for generating results.

Wrong date for Boston, Hoosic, Tunnel & Western Railroad v...

Here's the link: https://www.courtlistener.com/opinion/5625986/boston-hoosic-tunnel-western-railroad-v-troy-boston-railroad/

It's from the Harvard corpus. It should be 1879, not 1819.

A user reported this. I haven't changed our data. Figured either @quevon24 or @flooie would make more sense to take it on, so we could add it here as well.

Thoughts? Is this a useful way to handle errors?

Running list of opinions that need to be fixed manually

This is a running list of opinions that need to be fixed manually and that can't easily be fixed by the fixer I'm building for the merger.

This opinion needs to be fixed.

law.free.cap.se2d.804/474.12648521.json
