opinionated's Issues

~47,000 broken opinions

Roughly 46,579 opinions in the case.law data are poorly parsed, overly redacted, or otherwise broken.

Unfortunately, we have identified roughly 1,226 of these as not actually containing an opinion.
They seem to belong to a larger group of missing opinions: overly redacted or otherwise oddball opinions.

Normally these are one-line opinions, bunched up on a page.

This leaves the vast majority, around 44,125 opinions, that were malformed. In many of these the full opinion is simply "Vacated", "Remanded", or "Case Dismissed". But, as is common, there was no clear pattern or established "bad XML" that we could simply reverse.

To identify the missing or hidden opinion text, I trained a Maximum Entropy text classifier ML model using CreateML, keyed to 12 categories based on good data we had from the Harvard data set. See the distribution of data below.

The training and testing dataset was generated from a random sample of 650 opinions, extracting all of the fields (excluding sub tags like br, strong, em, extracted-citation). 650 is the sample size required for a population this size to give a 99% confidence level and a 5% margin of error.
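As a rough check on that sample size, here is a minimal sketch using Cochran's formula with a finite population correction. It assumes the population is the ~46,579 broken opinions and the most conservative proportion p = 0.5; neither assumption is stated in the issue, and the function name is only illustrative.

```python
import math
from statistics import NormalDist


def required_sample_size(population: int, confidence: float, margin: float, p: float = 0.5) -> int:
    """Cochran's sample size with a finite population correction.

    p = 0.5 is an assumption (the most conservative choice), not a value
    taken from the issue.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Shrink it to account for the finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(required_sample_size(population=46_579, confidence=0.99, margin=0.05))
```

Under these assumptions the result lands close to the 650 quoted above.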

[Image: mlmodeltrainingset]

Once parsed, this produced a training set with roughly 1000 tags that I deemed to be opinions. That number is larger than 650 because opinions were sometimes spread out over multiple tags or multiple attorney tags, etc. I also slowly added false negatives as I reviewed and improved the training set.

This method was effective but not quite as accurate as I would have liked: roughly 93% validation accuracy.
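For readers unfamiliar with the term, a Maximum Entropy text classifier is equivalent to multinomial logistic regression over text features. The sketch below is a toy pure-Python version over bag-of-words counts; it is not the CreateML model, and the training rows and the three labels are made up for illustration (the real model used 12 categories).

```python
import math
from collections import Counter


def featurize(text):
    """Bag-of-words counts over lowercase whitespace tokens."""
    return Counter(text.lower().split())


def predict_proba(weights, feats):
    """Softmax over per-label linear scores."""
    scores = {
        label: sum(w.get(word, 0.0) * count for word, count in feats.items())
        for label, w in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def train_maxent(samples, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    labels = sorted({label for _, label in samples})
    weights = {label: {} for label in labels}
    data = [(featurize(text), label) for text, label in samples]
    for _ in range(epochs):
        for feats, gold in data:
            probs = predict_proba(weights, feats)
            for label in labels:
                # Gradient for each feature: (observed - expected) * count.
                error = (1.0 if label == gold else 0.0) - probs[label]
                for word, count in feats.items():
                    weights[label][word] = weights[label].get(word, 0.0) + lr * error * count
    return weights


def classify(weights, text):
    probs = predict_proba(weights, featurize(text))
    return max(probs, key=probs.get)


# Hypothetical toy rows, not drawn from the case.law or Harvard data.
TOY_SAMPLES = [
    ("judgment vacated and remanded", "opinion"),
    ("case dismissed", "opinion"),
    ("john smith for the appellant", "attorneys"),
    ("jane doe for the respondent", "attorneys"),
    ("smith v. jones", "parties"),
    ("state v. doe", "parties"),
]

if __name__ == "__main__":
    model = train_maxent(TOY_SAMPLES)
    print(classify(model, "appeal dismissed"))
```

Feeding misclassified tags back in, as described above, amounts to appending more labeled rows to the sample list and retraining.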

After feeding the bad results back into the training set, I switched to a Transfer Learning text classifier with Dynamic Embedding. This eventually increased the validation accuracy to over 99.1%.

[Image: Screenshot 2022-10-10 at 4 50 31 PM]

In practice, this was closer to 99.9% accurate when identifying just opinions. I have yet to see a failed opinion while reviewing roughly 1000 generated HTML files.

With the ML model in hand, it was relatively easy to move all opinion data into the opinions and to identify the opinions that are actually failing, which I mentioned at the start: I simply moved the tags identified as opinion data into the empty opinion tags in the bad case.law data.
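A minimal sketch of that repair step, assuming a simplified XML layout in which each case has one `opinion` element next to sibling tags such as `attorneys` or `parties`. The element names, the `fill_empty_opinion` helper, and the `is_opinion_text` callback (a stand-in for the trained classifier) are all hypothetical; the real case.law markup is richer.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def fill_empty_opinion(case_xml: str, is_opinion_text: Callable[[str], bool]) -> str:
    """Move sibling tags classified as opinion text into an empty <opinion> tag.

    is_opinion_text stands in for the trained model; tag names are assumed.
    """
    root = ET.fromstring(case_xml)
    opinion = root.find("opinion")
    # Leave the case alone if there is no opinion tag or it already has text.
    if opinion is None or "".join(opinion.itertext()).strip():
        return case_xml
    for child in list(root):
        if child is opinion:
            continue
        text = "".join(child.itertext()).strip()
        if text and is_opinion_text(text):
            root.remove(child)
            opinion.append(child)
    # Nothing was classified as opinion text: mark the case explicitly.
    if len(opinion) == 0:
        opinion.text = "[NO OPINION]"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<case><attorneys>Vacated and remanded.</attorneys>"
        "<parties>Smith v. Jones</parties><opinion></opinion></case>"
    )
    # A keyword check stands in for the model in this example.
    print(fill_empty_opinion(sample, lambda text: "vacated" in text.lower()))
```

Cases where no tag is classified as opinion text come out with the `[NO OPINION]` marker, which makes them easy to count and review separately.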

@mlissner

We still have a good portion, roughly 2% of this final push, that contains no opinions. I added [NO OPINION] text to these opinions and included them in this push, but I would like your thoughts on this decision.

Of note:
This was trained via CreateML, an ecosystem that is Apple/Linux only for training and Apple only for generating results.
