madgrades-extractor

This project reads UW Madison grade distribution and course report PDF files (published by the UW Madison Office of the Registrar) and converts them into CSV or SQL dump files.

You will find published, up-to-date datasets at Kaggle.

https://i.imgur.com/9ZrwRMt.png

Conversion

The conversion process for a single term is as follows:

  1. Open DIR report for the term.

    a. Extract table from PDF (using tabula)

    b. Read each row, adding new section per row.

    c. Collate section info as necessary (e.g. 2 instructors for a single section)

    d. Collate courses which appear to be cross-listed (based on similarity between sections offered)

  2. Open grades report for the term.

    a. Extract table from PDF

    b. Read each row and add each section's grade data to the course data added by the DIR report process

Typically all terms are extracted, so this process repeats for each term.
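
As a rough illustration of step 1c, the sketch below merges rows that describe the same section so that a section with two instructors ends up as one entry. The class, field, and method names (SectionCollator, Row, Section, collate) are hypothetical and are not taken from the extractor's source; the grouping key is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of step 1c; not the extractor's actual classes.
class SectionCollator {
    // One extracted table row: a section plus a single instructor.
    static final class Row {
        final String course;
        final String sectionNumber;
        final String instructor;

        Row(String course, String sectionNumber, String instructor) {
            this.course = course;
            this.sectionNumber = sectionNumber;
            this.instructor = instructor;
        }
    }

    // A collated section holding every instructor seen on its rows.
    static final class Section {
        final String course;
        final String sectionNumber;
        final Set<String> instructors = new LinkedHashSet<>();

        Section(String course, String sectionNumber) {
            this.course = course;
            this.sectionNumber = sectionNumber;
        }
    }

    // Groups rows by (course, section number), preserving first-seen order.
    static List<Section> collate(List<Row> rows) {
        Map<String, Section> byKey = new LinkedHashMap<>();
        for (Row row : rows) {
            // Assumption: course plus section number identifies a section within a term.
            String key = row.course + "#" + row.sectionNumber;
            Section section = byKey.computeIfAbsent(
                key, k -> new Section(row.course, row.sectionNumber));
            section.instructors.add(row.instructor);
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("MATH 340", "002", "Instructor A"),
            new Row("MATH 340", "002", "Instructor B"),
            new Row("MATH 340", "003", "Instructor A"));
        for (Section s : collate(rows)) {
            System.out.println(s.course + " " + s.sectionNumber + " " + s.instructors);
        }
    }
}
```

The real extractor may key sections differently; the point is only that rows are merged per section rather than each row creating a new section.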

Command Line Usage

Build it yourself with mvn clean install or grab a release from the releases page.

Usage: <main class> [options]
  Options:
    -d, -download
      Download the PDF reports instead of extracting data
      Default: false
    -e, -exclude
      Comma-separated list of term codes to exclude (ex. -e 1082)
    -f, -format
      The output format
      Default: CSV
      Possible Values: [CSV, MYSQL]
    -l, -list
      Output list of terms to extract
      Default: false
    -out, -o
      Output directory path for exported files (ex. -o ../data)
      Default: ./
    -t, -terms
      Comma-separated list of term codes to run (ex. -t 1082,1072)

Examples:

  • java -jar madgrades-final-1.0-SNAPSHOT.jar: will fetch every term and output files to the current directory
  • java -jar madgrades-final-1.0-SNAPSHOT.jar -t 1082: will fetch just term 1082
  • java -jar madgrades-final-1.0-SNAPSHOT.jar -o ../ -t 1082,1072: will fetch terms 1072 and 1082, output to ../
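
The -t and -e options both take comma-separated term codes. The sketch below shows one plausible way such lists could be parsed and combined. It assumes that an empty -t list means "all terms" and that -e is applied afterwards; the class and method names (TermSelection, parseCodes, select) are hypothetical and do not come from the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of term selection; not the extractor's actual code.
class TermSelection {
    // Parses "1082, 1072" into the set {1082, 1072}; null or blank input gives an empty set.
    static Set<Integer> parseCodes(String csv) {
        Set<Integer> codes = new LinkedHashSet<>();
        if (csv == null || csv.trim().isEmpty()) {
            return codes;
        }
        for (String part : csv.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                codes.add(Integer.parseInt(trimmed));
            }
        }
        return codes;
    }

    // Assumption: an empty include set selects every available term,
    // and excluded terms are dropped afterwards.
    static List<Integer> select(List<Integer> available,
                                Set<Integer> include,
                                Set<Integer> exclude) {
        List<Integer> result = new ArrayList<>();
        for (Integer code : available) {
            boolean wanted = include.isEmpty() || include.contains(code);
            if (wanted && !exclude.contains(code)) {
                result.add(code);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> available = Arrays.asList(1072, 1082, 1092);
        System.out.println(select(available, parseCodes("1082,1072"), parseCodes("")));
        System.out.println(select(available, parseCodes(null), parseCodes("1082")));
    }
}
```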

Relational Diagram

The CSV or SQL dumps are in the format of a collection of relational entities modeled something like this:

diagram

madgrades-extractor's People

Contributors

dependabot[bot], jamesjulich, odysa, thekeenant

madgrades-extractor's Issues

0% A Rate for Spring 2022 Courses

Courses taught in Spring 2022 appear to have a 0.0% A rate, despite the grade report from the registrar saying otherwise.

For instance, MATH 340 taught by Peter Juhasz (Lec 002 and Lec 003) shows 0 As being granted for the Spring 2022 semester. The actual A rate should be ~30%.

Philosophy 101 taught by Henry Southgate in Spring 2022 is another class that is having the same issue.

These are just two classes that I have noticed that have the issue, but it is likely a much more widespread issue. I'm going to run this program on my system to figure out whether this was an issue with the extractor or an issue updating the databases. I'll update this issue thread if I find anything.

Year formatting not up to date

I cloned the project and ran the program with the '-d' argument, and it failed with the following error:

Scraping for subjects and report URLs...
Exception in thread "main" java.lang.NumberFormatException: For input string: "|"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:580)
	at java.lang.Integer.parseInt(Integer.java:615)
	at com.keenant.madgrades.tools.Scrapers.toTermCode(Scrapers.java:101)
	at com.keenant.madgrades.tools.Scrapers.scrapeGradeReports(Scrapers.java:66)
	at com.keenant.madgrades.CommandLineApp.main(CommandLineApp.java:78)

It seems that it is caused by the code on line 101 in Scrapers.java:

int year = Integer.parseInt(termName.split(" ")[1]);

I suspect that the format has changed and the algorithm here needs to be updated.
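
One possible direction for a fix, sketched under assumptions: instead of taking the second space-separated token, search the term name for a four-digit year and report failure explicitly rather than throwing. The new format of the term names is not known here, so the input strings below are hypothetical, and TermYearParser and parseYear are illustrative names rather than existing code in Scrapers.java.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of more tolerant year parsing; not the project's actual fix.
class TermYearParser {
    // Matches the first standalone four-digit number in the term name.
    private static final Pattern YEAR = Pattern.compile("\\b(\\d{4})\\b");

    // Returns the year if one is found, or an empty OptionalInt otherwise,
    // so callers can skip or log unrecognized term names.
    static OptionalInt parseYear(String termName) {
        if (termName == null) {
            return OptionalInt.empty();
        }
        Matcher matcher = YEAR.matcher(termName);
        if (matcher.find()) {
            return OptionalInt.of(Integer.parseInt(matcher.group(1)));
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Assumed example inputs; the real scraped strings may differ.
        System.out.println(parseYear("Fall 2007"));
        System.out.println(parseYear("Fall | 2007"));
        System.out.println(parseYear("|"));
    }
}
```

With this approach a separator token such as "|" no longer reaches Integer.parseInt, and an unexpected format yields an empty result instead of a NumberFormatException.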

CS/ECE 561 is cross listed incorrectly

Reported by a user:

CS/ECE 561 in the system is also cross-listed with N E and Physics, which is not the case, and has the wrong title "Introduction to Charged Particle Accelerators". The correct title should be "Probability and Information Theory in Machine Learning", and the course is only cross-listed between CS and ECE.

The extractor seems to get some data from the reports that points to these being the same course when they clearly are not.

Use latest names for instructors

If an instructor changes their name, we currently do not honor the name change, even though it is reflected in the data published by the university.
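
A minimal sketch of one way to approach this, assuming each instructor has a stable numeric ID across terms and that larger term codes are more recent: keep the name from the most recent term in which the instructor appears. Both assumptions, and the names LatestInstructorNames, Sighting, and latestNames, are hypothetical and not confirmed by the project.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the extractor's actual data model.
class LatestInstructorNames {
    // One appearance of an instructor in a term's report.
    static final class Sighting {
        final int instructorId;
        final String name;
        final int termCode;

        Sighting(int instructorId, String name, int termCode) {
            this.instructorId = instructorId;
            this.name = name;
            this.termCode = termCode;
        }
    }

    // Maps each instructor ID to the name used in the highest term code seen.
    // Assumption: a higher term code means a later term.
    static Map<Integer, String> latestNames(List<Sighting> sightings) {
        Map<Integer, Integer> latestTerm = new HashMap<>();
        Map<Integer, String> names = new LinkedHashMap<>();
        for (Sighting s : sightings) {
            Integer seen = latestTerm.get(s.instructorId);
            if (seen == null || s.termCode >= seen) {
                latestTerm.put(s.instructorId, s.termCode);
                names.put(s.instructorId, s.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<Sighting> sightings = Arrays.asList(
            new Sighting(1, "New Name", 1082),
            new Sighting(1, "Old Name", 1072),
            new Sighting(2, "Other Name", 1072));
        System.out.println(latestNames(sightings));
    }
}
```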
