biglocalnews / warn-scraper

Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites

Home Page: https://warn-scraper.readthedocs.io

License: Apache License 2.0

Languages: Python 98.32%, Makefile 1.68%
Topics: cli, data-journalism, economics, etl, journalism, labor, news, open-data, python, warn

warn-scraper's People

Contributors

anikasikka, ash1r, chriszs, dependabot[bot], dilcia19, esagara, palewire, shallotly, stucka, ydoc5212, zstumgoren


warn-scraper's Issues

Fix KS outputs

There are still some hard-coded file paths in the KS scraper.
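
A minimal sketch of the general fix, assuming each scraper exposes a top-level function that accepts an output directory (the scrape() signature and the default path below are placeholders, not the repo's actual API):

    import csv
    from pathlib import Path

    def scrape(output_dir="/tmp/warn-data"):
        """Write scraped WARN notices under a caller-supplied directory.

        output_dir replaces the hard-coded path; the default is illustrative.
        """
        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        output_csv = output_dir / "ks.csv"
        rows = [["company", "date", "employees"]]  # stand-in for scraped rows
        with open(output_csv, "w", newline="") as fh:
            csv.writer(fh).writerows(rows)
        return output_csv

The same pattern applies to the OK, DE, AZ, and NE output fixes below.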

Reorganize repo structure

Restructure the repo to move all scraper scripts into warn/scrapers/[state-abbrev].py, e.g.

warn_scraper_al.py -> warn/scrapers/al.py

Also rename the top-level function in each of these modules to scrape, which will allow our new CLI script (scrape_warn.py, #11) to dynamically load and execute each scraper.
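
As a rough sketch of the dynamic loading this enables (module and argument names are assumptions based on the layout proposed above):

    import importlib

    def run_scraper(state, output_dir):
        """Load warn/scrapers/<state>.py and call its scrape() function."""
        module = importlib.import_module(f"warn.scrapers.{state.lower()}")
        return module.scrape(output_dir)

    # e.g. run_scraper("al", "/tmp/warn-data")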

Fix Tennessee Missing Data

Right now, Tennessee's scraper has a few rows of data hard-coded into it, I assume because the scraper didn't catch all of the data.

This scraper should be rewritten to grab all of the data, even if we have to start from scratch, because leaving it this way means we won't know when future data is missing.

Fix OK outputs

There are still some hard-coded file paths in the OK scraper.

Reorganize state scrapers into a new directory

Let's move state scrapers into, e.g., a `scrapers/` dir, and then rename each to just the state abbreviation:

scrapers/ne.py
scrapers/ny.py
etc.

This will set us up nicely to use meta-programming to automate scraper runs through a configurable CLI script (see the discovery sketch below).
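
One way the CLI script could discover which scrapers exist, sketched on the assumption that scrapers/ becomes an importable package (warn.scrapers here mirrors the layout proposed in the restructuring issue above):

    import pkgutil

    import warn.scrapers  # assumes scrapers/ is packaged as warn.scrapers

    def available_scrapers():
        """List the state abbreviations that have a scraper module."""
        return sorted(
            name for _, name, _ in pkgutil.iter_modules(warn.scrapers.__path__)
        )

    # e.g. available_scrapers() -> ["al", "ks", "ne", "ny", ...]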

Fix Michigan Missing Data

Right now, Michigan's scraper has a few rows of data hard-coded into it, I assume because the scraper didn't catch all of the data.

This scraper should be rewritten to grab all of the data, even if we have to start from scratch, because leaving it this way means we won't know when future data is missing.

Re-organize Scrapers like OK

In these scrapers, a maximum record count is hard-coded (in the case of OK it's 550), and the scraper runs through the site page by page, pulling records until it hits that number.
If more records are added and the true count grows beyond that hard-coded maximum, the number has to be updated manually.

Talk to @zstumgoren and figure out a better way to do this programmatically, so that you don't have to check the web pages and update these numbers as you go (a sketch of one possible approach follows the checklist below).

  • re-org AZ
  • re-org DE
  • re-org KS
  • re-org OK
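
One possible programmatic approach, sketched under the assumption that each results page can be requested by offset and that an empty page marks the end (the URL, parameters, and selector below are placeholders, not the real OK endpoint):

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://example.gov/warn"  # placeholder endpoint
    PAGE_SIZE = 50

    def fetch_all_rows():
        """Page through results until a page comes back empty,
        instead of relying on a hard-coded maximum record count."""
        rows, offset = [], 0
        while True:
            response = requests.get(BASE_URL, params={"start": offset})
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            page_rows = soup.select("table tr")  # selector is illustrative
            if not page_rows:
                break
            rows.extend(page_rows)
            offset += PAGE_SIZE
        return rows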

Add updated Regex to Michigan Scraper

Eric and Serdar shared some updated regex that would allow me to parse the data in a cleaner/more generic way.

  • add the regex and run it to see what it doesn't catch (a sketch of this check follows below)
  • 20-minute pomodoro to get a refresher on regex
  • 20-minute pomodoro to tweak the regex for the scraper, so that it can grab all of the data
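
A small sketch of that first item, assuming the raw notice text is available as a list of lines; the pattern below is only a stand-in for the regex Eric and Serdar shared:

    import re

    # Placeholder pattern; the real regex from Eric and Serdar goes here.
    NOTICE_PATTERN = re.compile(r"^(?P<company>.+?)\s{2,}(?P<date>\d{1,2}/\d{1,2}/\d{4})")

    def split_matches(lines):
        """Apply the pattern and report which lines it does not catch."""
        matched, missed = [], []
        for line in lines:
            (matched if NOTICE_PATTERN.search(line) else missed).append(line)
        return matched, missed

    # matched, missed = split_matches(raw_lines); review missed to tweak the regex.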

Integrate Slack alerts

Crib and adapt the code for Slack-based alerts from covid-world-scraper.

  • import slack client
  • add AlertsManager Class
  • add argparse argument for alerts
  • implement logic for alerts
    • the script runs without breaking
    • test alerts are sent and show up on the slack channel

This ticket needs to be split further.
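
A minimal sketch of the alert piece, assuming the slack_sdk package and a bot token in the environment (the channel name, message text, and class shape are placeholders; covid-world-scraper's actual implementation may differ):

    import os

    from slack_sdk import WebClient
    from slack_sdk.errors import SlackApiError

    class AlertsManager:
        """Thin wrapper around the Slack client for scraper status alerts."""

        def __init__(self, channel="#warn-scraper-alerts"):
            self.channel = channel
            self.client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

        def send(self, text):
            try:
                self.client.chat_postMessage(channel=self.channel, text=text)
            except SlackApiError as err:
                print(f"Slack alert failed: {err.response['error']}")

    # AlertsManager().send("OK scraper finished successfully")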

Scrape Ohio Warn Notices

There's already a scraper for Ohio; however, it only pulls 2020 data, because that's what is available in tabular format on the website.

Data for 2019 and earlier lives in PDFs that need to be scraped and joined with the more recent data.
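
One way to approach the PDF years, sketched with pdfplumber (the library choice and file name are assumptions, not something the repo has settled on):

    import pdfplumber

    def rows_from_pdf(path):
        """Pull table rows out of a WARN notice PDF.

        Layouts vary by year, so the extracted rows will still need cleanup
        before they can be joined with the 2020 tabular data.
        """
        rows = []
        with pdfplumber.open(path) as pdf:
            for page in pdf.pages:
                table = page.extract_table()
                if table:
                    rows.extend(table)
        return rows

    # rows = rows_from_pdf("oh_warn_2019.pdf")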

Fix DE outputs

There are still some hard-coded file paths in the DE scraper.

Add logging

Again, use covid-world-scraper as a reference. Log to /home/ubuntu/logs/warn.log but make the path configurable via CLI arg.

  • add argument for cache_dir and make sure the new args work
  • add the code below & integrate it into scrape_warn.py

    import logging
    from pathlib import Path

    # cache_dir and log_file come from the CLI args described above;
    # placeholder values are shown here so the snippet stands on its own.
    cache_dir = "/tmp/warn-cache"
    log_file = "/home/ubuntu/logs/warn.log"

    # Make sure the cache and log directories exist before configuring handlers
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)-12s - %(message)s',
        datefmt='%m-%d %H:%M',
        filename=log_file,
        filemode='a'
    )
    # Echo log records to the console in addition to the log file
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    formatter = logging.Formatter('%(name)-12s - %(message)s')
    console.setFormatter(formatter)
    logging.getLogger('').addHandler(console)
    logger = logging.getLogger(__name__)

  • add logging to one scraper, make sure it works
  • add logging to remaining scrapers

Fix AZ outputs

There are still some hard-coded file paths in the AZ scraper.

Remove logic of alerts manager conditional that runs scrapers

Right now I have a conditional that either runs all scrapers or runs only the scrapers given via CLI arguments.

With the current logic, whether you run all scrapers or a selected subset, Slack reports that everything ran successfully even if some scrapers failed; that is, if the last scraper succeeds, its success message is what gets reported, regardless of how the others fared.

It logs the other messages as well, but I need something more like what Serdar has: an overall summary that tells you which scrapers ran and which did not (a sketch of one approach follows the checklist below).

So the current logic needs to be pulled out of where it is and reworked.

  • extract all necessary information from functions without invoking slack_messages function
  • rewrite slack_messages function with new logic that tells you which scrapers ran, which did not
  • integrate that function with the rest of the program
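
A sketch of that summary logic, assuming each scraper run can be treated as succeeded or failed and that slack_messages is rebuilt around a results dict (all names here are illustrative):

    def run_scrapers(scrapers, output_dir):
        """Run each scraper and record whether it succeeded or failed."""
        results = {}
        for state, scrape_func in scrapers.items():
            try:
                scrape_func(output_dir)
                results[state] = "succeeded"
            except Exception as err:
                results[state] = f"failed: {err}"
        return results

    def slack_messages(results):
        """Build one overall summary instead of per-scraper success messages."""
        ran = [s for s, status in results.items() if status == "succeeded"]
        failed = [s for s, status in results.items() if status != "succeeded"]
        return (
            f"Scrapers succeeded: {', '.join(ran) or 'none'}\n"
            f"Scrapers failed: {', '.join(failed) or 'none'}"
        )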

Fix NE outputs

2 functions in the script still have hard-coded outputs.

Create a single top-level CLI script

We should create a single script, e.g. scrape_warn.py, that is configurable in a few ways:

# Scrape a list of specific states
python scrape_warn.py OK

# Scrape all states and output data to custom dir
python scrape_warn.py ALL --output-dir=/tmp/foo
  • python scrape_warn.py ALL
  • --output-dir=/tmp/foo
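
A minimal argparse sketch of what scrape_warn.py could look like, assuming the dynamic loading proposed in the reorganization issues above (the ALL_STATES list is a partial placeholder):

    import argparse
    import importlib

    ALL_STATES = ["az", "de", "ks", "mi", "ne", "ny", "oh", "ok", "tn"]  # partial, for illustration

    def main():
        parser = argparse.ArgumentParser(description="Scrape WARN notices by state")
        parser.add_argument("states", nargs="+", help="State abbreviations, or ALL")
        parser.add_argument("--output-dir", default="/tmp/warn-data")
        args = parser.parse_args()

        states = ALL_STATES if args.states == ["ALL"] else args.states
        for state in states:
            module = importlib.import_module(f"warn.scrapers.{state.lower()}")
            module.scrape(args.output_dir)

    if __name__ == "__main__":
        main()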

Set up data writing to some other location outside of the project

Currently we can and do write CSVs to the data/ dir, but this creates commits on the repository, which causes the upstream to move ahead of the local code on developer machines.

It would be good to write the data either to a secondary GitHub repo that solely hosts the data or to stream it to the Big Local GCP bucket.
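
If the GCP route is chosen, the upload could be roughly as simple as this sketch using the google-cloud-storage client (the bucket name and credential setup are assumptions):

    from pathlib import Path

    from google.cloud import storage

    def upload_csv(local_path, bucket_name="big-local-warn-data"):
        """Upload a scraped CSV to a GCS bucket instead of committing it to the repo.

        Assumes application-default credentials are configured;
        the bucket name is a placeholder.
        """
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        blob = bucket.blob(f"warn/{Path(local_path).name}")
        blob.upload_from_filename(local_path)

    # upload_csv("/tmp/warn-data/ok.csv")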
