Education Data Science Practicum Spring 2019
Project Started: 1/30/2019
Updated: 3/23/2019
Download and Run "Midterm_Sample" folder for analysis using a sample that you can run on your local device
- Data collection:
- gain all federal appellate court opinions
- Pre-Analysis:
- made sure vector representation of opinions is possible (works for first circuit)
- Text Cleaning (lots of regex nightmares!):
- need to build metadata, for each opinion:
- authoring judge
- year
- court
- identify documents that contain dissents/concurrences from other judges
- identify documents that are per curiam
- need to build metadata, for each opinion:
- Modeling:
- average document vectors for each judge to get judge vectors
- PCA on judge vectors to look for clusters (for 1st circuit)
Notes:
- Part (3) above is incomplete. JSON raw data's "year" is sparse. I need to extract year from the opinions.
- Once Part (3) is complete, de-mean documents by court and by year.
- How to de-mean by topic? I need topic labels. Ash and Chen won't share metadata with topic labels :(
There are lots of models that can be used to estimate ideal points of judges (i.e. Poole and Rosenthal, Martin-Quinn, Clinton, etc). These models use mostly voting records of the judges. The purpose of this study is to try ideal-point modeling using the text of opinions instead of voting behavior. The subjects of the study are federal judges in the Appellate Courts of the United States (i.e. 1st Circuit Court of Appeals).
Some things to watch out for include era-effects, topic-effects and circuit effects.
- Start with one circuit court
- Learn to implement doc2vec using "fastTextR", "textTinyR"
- The idea is to apply this onto all federal appellate court opinions
Problem:
- I need to extract "year" from the opinions. Where do I get topic? Should I model this?
Now that I know I can implenent vector representations of these documents, need to clean text.
- Identify the judge who wrote the opinion (Metadata not available for all federal appellate court cases)
- Identify dissent, concurrence, per curiam
- Extract year from decisions
- Then, de-mean document vectors by year and circuit THEN average for each judge