Parse UW Co-op Package PDFs

Parse out relevant information from co-op student resume package PDFs provided by the University of Waterloo.

NOTE: This program is not perfect, and the results (email address, etc.) will need to be manually inspected for errors and fixed.

Installation

Ensure that Go is installed and set up with a working $GOPATH.

Installation and sanity check:

$ go get github.com/curvegrid/parse-uw-coop-package
$ parse-uw-coop-package -h

If running on macOS, use Homebrew to install the Poppler package, which provides the pdftotext utility (brew install poppler). This program will fall back to ps2ascii if pdftotext is unavailable, but ps2ascii tends to be less reliable.

Usage

This assumes you are an employer of University of Waterloo co-operative education (co-op, interns) students and have a valid Employer login on WaterlooWorks.

  1. Post a job on WaterlooWorks and wait for student applications to become available.
  2. Log in to WaterlooWorks, navigate to the applications list, and click the blue 'Application Options' button near the top of the page to create a custom application bundle with each application as a separate PDF.
  3. Download and unzip the consolidated package.
  4. Install this utility, parse-uw-coop-package.
  5. From the directory where you unzipped the consolidated package of PDFs, run parse-uw-coop-package and pipe the output to a CSV file (e.g., parse-uw-coop-package > applicants.csv). You can tweak the options (try parse-uw-coop-package -h) as required.
  6. Import into your spreadsheet of choice. As noted above, manual cleanup will be required.

Running and Command Line Options

By default, the program searches the current directory for all PDFs whose filenames match a regular expression (-fileregex) and parses the text within for fields specific to UW co-op.

Usage of ./parse-uw-coop-package:
  -averagesRegex string
    	Regex for averages (default "Term Average:\\s*([0-9]{2}\\.*[0-9]*)")
  -concurrency int
    	Number of PDF parsing threads to run in parallel (default 4)
  -coverLetterRegex string
    	Regex for cover letter yes/no (default "[Ss]incerely|[Hh]iring [Mm]anager")
  -emailRegex string
    	Regex for email address (default "[A-Za-z0-9_.-]+\\@[A-Za-z0-9.-]+\\.[A-Za-z0-9]+")
  -fileregex string
    	Regex filter for filenames (default "([A-Za-z-]+) ([A-Za-z-]+) \\(([0-9]+)\\).pdf")
  -githubRegex string
    	Regex for Github (default "github.com/[A-Za-z0-9_.-]+")
  -linkedInRegex string
    	Regex for LinkedIn (default "linkedin.com/in/[A-Za-z0-9_.-]+")
  -pdftoascii string
    	PDF to ASCII converter (default "ps2ascii")
  -worktermEvalRegex string
    	Regex for work term evaluations (default "UNSATISFACTORY|MARGINAL|SATISFACTORY|VERY GOOD|EXCELLENT|OUTSTANDING")
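
The -concurrency option bounds how many PDFs are parsed at once. How the program implements this is not shown here; one common Go pattern for such a bound is a fixed pool of worker goroutines fed from a channel, sketched below. processAll and the stand-in process function are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs process on every file using at most `workers` goroutines
// and returns the results in the same order as the input.
// NOTE: illustrative sketch, not the tool's actual implementation.
func processAll(files []string, workers int, process func(string) string) []string {
	if workers < 1 {
		workers = 1
	}
	results := make([]string, len(files))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handed to exactly one worker, so no lock is needed.
				results[i] = process(files[i])
			}
		}()
	}
	for i := range files {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	files := []string{"a.pdf", "b.pdf", "c.pdf"}
	out := processAll(files, 4, func(name string) string {
		return "parsed " + name // stand-in for the PDF-to-text and regex work
	})
	fmt.Println(out)
}
```

Because each PDF is converted by an external program, a value close to the number of CPU cores is a reasonable starting point for -concurrency.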

Sample Run

$ parse-uw-coop-package 
ID,First name,Last name,Email,Email with name,LinkedIn,Github,Included a cover letter,Work term evaluations,Term averages,Overall average
123456,Able,Baker,[email protected],Able Baker <[email protected]>,,,Yes,"OUTSTANDING,OUTSTANDING,OUTSTANDING,GOOD,OUTSTANDING","72,81,84.5,72,78",73.4
...
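
To see how a row like the one above could be derived, the sketch below applies the default -fileregex, -emailRegex, and -averagesRegex patterns to made-up input. The filename "Able Baker (123456).pdf", the sample text, the helper names, and the mapping of capture groups to first name, last name, and ID are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Default patterns copied from the command line help, with the doubled
// backslashes of the help output reduced to single ones.
var (
	fileRe    = regexp.MustCompile(`([A-Za-z-]+) ([A-Za-z-]+) \(([0-9]+)\).pdf`)
	emailRe   = regexp.MustCompile(`[A-Za-z0-9_.-]+\@[A-Za-z0-9.-]+\.[A-Za-z0-9]+`)
	averageRe = regexp.MustCompile(`Term Average:\s*([0-9]{2}\.*[0-9]*)`)
)

// parseFilename splits a filename into first name, last name, and ID.
// ASSUMPTION: the three capture groups map to those fields in that order.
func parseFilename(name string) (first, last, id string, ok bool) {
	m := fileRe.FindStringSubmatch(name)
	if m == nil {
		return "", "", "", false
	}
	return m[1], m[2], m[3], true
}

// termAverages collects every captured term average found in text.
func termAverages(text string) []string {
	var out []string
	for _, m := range averageRe.FindAllStringSubmatch(text, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	first, last, id, ok := parseFilename("Able Baker (123456).pdf")
	fmt.Println(first, last, id, ok)

	// Made-up text standing in for the converter's output.
	text := "Contact: able@example.com\nTerm Average: 72\nTerm Average: 84.5\n"
	fmt.Println(emailRe.FindString(text))
	fmt.Println(termAverages(text))
}
```

Trying a custom pattern this way against a few converted PDFs is a quick way to check it before passing it to the corresponding option.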

Known Issues and Limitations

  • This has only been tested on macOS.
  • The PDF-to-text converter defaults to ps2ascii or pdftotext, which may not be available on your system. See the command line options to adjust.
  • The PDF-to-text process is not perfect, especially with formatted PDFs. Email addresses seem to be especially problematic, with many of them mangled. For example, we've seen [email protected] turn into .com example je ef@ with ps2ascii, even in what seems like a fairly "standard" formatted PDF. Manual cleanup will be required.

Future enhancements

  • DRY up the whole program
  • Switch from ps2ascii to a native Go PDF-to-text solution
  • Improve the parsing accuracy: better regexes, etc.
  • Direct package download, and integration with tabular info, from WaterlooWorks
  • Keyword extraction

Contributing

Pull requests welcome.

Development

Assuming parse-uw-coop-package was installed as described under Installation, change to the directory where go get downloaded the source:

$ cd $GOPATH/src/github.com/curvegrid/parse-uw-coop-package
$ go build parse-uw-coop-package.go
$ ./parse-uw-coop-package

Note that you will now have two copies of the parse-uw-coop-package binary on your system: the one in $GOPATH/bin installed by go get, and the one just built in $GOPATH/src/github.com/curvegrid/parse-uw-coop-package by go build.

License and Copyright

Licensed under the MIT License. See the LICENSE file for details of the MIT License. Copyright 2018 by Curvegrid Inc.
