biglocalnews / warn-scraper

Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites

Home Page: https://warn-scraper.readthedocs.io

License: Apache License 2.0

Languages: Python 98.32%, Makefile 1.68%
Topics: cli, data-journalism, economics, etl, journalism, labor, news, open-data, python, warn

warn-scraper's People

Contributors

anikasikka, ash1r, chriszs, dependabot[bot], dilcia19, esagara, palewire, shallotly, stucka, ydoc5212, zstumgoren


warn-scraper's Issues

Fix KS outputs

There are still some hard-coded file paths in the KS scraper.
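
A minimal sketch of the general fix, assuming each scraper exposes a top-level function that accepts an output directory (the scrape() signature and the default path below are placeholders, not the repo's actual API):

    import csv
    from pathlib import Path

    def scrape(output_dir="/tmp/warn-data"):
        """Write scraped WARN notices under a caller-supplied directory.

        output_dir replaces the hard-coded path; the default is illustrative.
        """
        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        output_csv = output_dir / "ks.csv"
        rows = [["company", "date", "employees"]]  # stand-in for scraped rows
        with open(output_csv, "w", newline="") as fh:
            csv.writer(fh).writerows(rows)
        return output_csv

The same pattern applies to the OK, DE, AZ, and NE output fixes below.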

Reorganize repo structure

Restructure the repo to move all scraper scripts into warn/scrapers/[state-abbrev].py, e.g.

warn_scraper_al.py -> warn/scrapers/al.py

Also rename the top-level function in each of these modules to scrape, which will allow our new CLI script (scrape_warn.py, #11) to dynamically load and execute each scraper.
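
As a rough sketch of the dynamic loading this enables (module and argument names are assumptions based on the layout proposed above):

    import importlib

    def run_scraper(state, output_dir):
        """Load warn/scrapers/<state>.py and call its scrape() function."""
        module = importlib.import_module(f"warn.scrapers.{state.lower()}")
        return module.scrape(output_dir)

    # e.g. run_scraper("al", "/tmp/warn-data")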

Fix Tennessee Missing Data

Right now, Tennessee's scraper has a few rows of data hard-coded into it, I assume because the scraper didn't catch all of the data.

This scraper should be rewritten to grab all of the data, even if we have to start from scratch, because leaving it this way means we won't know when future data is missing.

Fix OK outputs

There are still some hard-coded file paths in the OK scraper.

Reorganize state scrapers into a new directory

Let's move state scrapers into, e.g., a `scrapers/` dir, and then rename each to just the state abbreviation:

scrapers/ne.py
scrapers/ny.py
etc.

This will set us up nicely to use meta-programming to automate scraper runs through a configurable CLI script (see the discovery sketch below).
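
One way the CLI script could discover which scrapers exist, sketched on the assumption that scrapers/ becomes an importable package (warn.scrapers here mirrors the layout proposed in the restructuring issue above):

    import pkgutil

    import warn.scrapers  # assumes scrapers/ is packaged as warn.scrapers

    def available_scrapers():
        """List the state abbreviations that have a scraper module."""
        return sorted(
            name for _, name, _ in pkgutil.iter_modules(warn.scrapers.__path__)
        )

    # e.g. available_scrapers() -> ["al", "ks", "ne", "ny", ...]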

Fix Michigan Missing Data

Right now, Michigan's scraper has a few rows of data hard-coded into it, I assume because the scraper didn't catch all of the data.

This scraper should be rewritten to grab all of the data, even if we have to start from scratch, because leaving it this way means we won't know when future data is missing.

Re-organize Scrapers like OK

In these scrapers, a maximum record count is hard-coded (in the case of OK it's 550), and the scraper runs through the site page by page, pulling records until it hits that number.
If more records are added and the true count grows beyond that hard-coded maximum, the number has to be updated manually.

Talk to @zstumgoren and figure out a better way to do this programmatically, so that you don't have to check the web pages and update these numbers as you go (a sketch of one possible approach follows the checklist below).

  • re-org AZ
  • re-org DE
  • re-org KS
  • re-org OK
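
One possible programmatic approach, sketched under the assumption that each results page can be requested by offset and that an empty page marks the end (the URL, parameters, and selector below are placeholders, not the real OK endpoint):

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://example.gov/warn"  # placeholder endpoint
    PAGE_SIZE = 50

    def fetch_all_rows():
        """Page through results until a page comes back empty,
        instead of relying on a hard-coded maximum record count."""
        rows, offset = [], 0
        while True:
            response = requests.get(BASE_URL, params={"start": offset})
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            page_rows = soup.select("table tr")  # selector is illustrative
            if not page_rows:
                break
            rows.extend(page_rows)
            offset += PAGE_SIZE
        return rows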

Add updated Regex to Michigan Scraper

Eric and Serdar shared some updated regex that would allow me to parse the data in a cleaner/more generic way.

  • add the regex and run it to see what it doesn't catch (a sketch of this check follows below)
  • 20-minute pomodoro to get a refresher on regex
  • 20-minute pomodoro to tweak the regex for the scraper, so that it can grab all of the data
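
A small sketch of that first item, assuming the raw notice text is available as a list of lines; the pattern below is only a stand-in for the regex Eric and Serdar shared:

    import re

    # Placeholder pattern; the real regex from Eric and Serdar goes here.
    NOTICE_PATTERN = re.compile(r"^(?P<company>.+?)\s{2,}(?P<date>\d{1,2}/\d{1,2}/\d{4})")

    def split_matches(lines):
        """Apply the pattern and report which lines it does not catch."""
        matched, missed = [], []
        for line in lines:
            (matched if NOTICE_PATTERN.search(line) else missed).append(line)
        return matched, missed

    # matched, missed = split_matches(raw_lines); review missed to tweak the regex.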

Integrate Slack alerts

Crib and adapt the code for Slack-based alerts from covid-world-scraper.

  • import slack client
  • add AlertsManager Class
  • add argparse argument for alerts
  • implement logic for alerts
    • the script runs without breaking
    • test alerts are sent and show up on the slack channel

This ticket needs to be split further.
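
A minimal sketch of the alert piece, assuming the slack_sdk package and a bot token in the environment (the channel name, message text, and class shape are placeholders; covid-world-scraper's actual implementation may differ):

    import os

    from slack_sdk import WebClient
    from slack_sdk.errors import SlackApiError

    class AlertsManager:
        """Thin wrapper around the Slack client for scraper status alerts."""

        def __init__(self, channel="#warn-scraper-alerts"):
            self.channel = channel
            self.client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

        def send(self, text):
            try:
                self.client.chat_postMessage(channel=self.channel, text=text)
            except SlackApiError as err:
                print(f"Slack alert failed: {err.response['error']}")

    # AlertsManager().send("OK scraper finished successfully")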

Scrape Ohio Warn Notices

There's already a scraper for Ohio; however, it only pulls 2020 data, because that's what is available in tabular format on the website.

Data for 2019 and earlier lives in PDFs that need to be scraped and joined with the more recent data.
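
One way to approach the PDF years, sketched with pdfplumber (the library choice and file name are assumptions, not something the repo has settled on):

    import pdfplumber

    def rows_from_pdf(path):
        """Pull table rows out of a WARN notice PDF.

        Layouts vary by year, so the extracted rows will still need cleanup
        before they can be joined with the 2020 tabular data.
        """
        rows = []
        with pdfplumber.open(path) as pdf:
            for page in pdf.pages:
                table = page.extract_table()
                if table:
                    rows.extend(table)
        return rows

    # rows = rows_from_pdf("oh_warn_2019.pdf")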

Fix DE outputs

There are still some hard-coded file paths in the DE scraper.

Add logging

Again, use covid-world-scraper as a reference. Log to /home/ubuntu/logs/warn.log but make the path configurable via CLI arg.

  • add argument for cache_dir and make sure the new args work
  • add the code below & integrate it into scrape_warn.py

    import logging
    from pathlib import Path

    # cache_dir and log_file come from the CLI args described above;
    # placeholder values are shown here so the snippet stands on its own.
    cache_dir = "/tmp/warn-cache"
    log_file = "/home/ubuntu/logs/warn.log"

    # Make sure the cache and log directories exist before configuring handlers
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)-12s - %(message)s',
        datefmt='%m-%d %H:%M',
        filename=log_file,
        filemode='a'
    )
    # Echo log records to the console in addition to the log file
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    formatter = logging.Formatter('%(name)-12s - %(message)s')
    console.setFormatter(formatter)
    logging.getLogger('').addHandler(console)
    logger = logging.getLogger(__name__)

  • add logging to one scraper, make sure it works
  • add logging to remaining scrapers

Fix AZ outputs

There are still some hard-coded file paths in the AZ scraper.

Remove logic of alerts manager conditional that runs scrapers

Right now I have a conditional that either runs all scrapers or runs only the scrapers given via CLI arguments.

With the current logic, whether you run all scrapers or a selected subset, Slack reports that everything ran successfully even if some scrapers failed; that is, if the last scraper succeeds, its success message is what gets reported, regardless of how the others fared.

It logs the other messages as well, but I need something more like what Serdar has: an overall summary that tells you which scrapers ran and which did not (a sketch of one approach follows the checklist below).

So the current logic needs to be pulled out of where it is and reworked.

  • extract all necessary information from functions without invoking slack_messages function
  • rewrite slack_messages function with new logic that tells you which scrapers ran, which did not
  • integrate that function with the rest of the program
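
A sketch of that summary logic, assuming each scraper run can be treated as succeeded or failed and that slack_messages is rebuilt around a results dict (all names here are illustrative):

    def run_scrapers(scrapers, output_dir):
        """Run each scraper and record whether it succeeded or failed."""
        results = {}
        for state, scrape_func in scrapers.items():
            try:
                scrape_func(output_dir)
                results[state] = "succeeded"
            except Exception as err:
                results[state] = f"failed: {err}"
        return results

    def slack_messages(results):
        """Build one overall summary instead of per-scraper success messages."""
        ran = [s for s, status in results.items() if status == "succeeded"]
        failed = [s for s, status in results.items() if status != "succeeded"]
        return (
            f"Scrapers succeeded: {', '.join(ran) or 'none'}\n"
            f"Scrapers failed: {', '.join(failed) or 'none'}"
        )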

Fix NE outputs

2 functions in the script still have hard-coded outputs.

Create a single top-level CLI script

We should create a single script, e.g. scrape_warn.py, that is configurable in a few ways:

# Scrape a list of specific states
python scrape_warn.py OK

# Scrape all states and output data to custom dir
python scrape_warn.py ALL --output-dir=/tmp/foo
  • python scrape_warn.py ALL
  • --output-dir=/tmp/foo
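
A minimal argparse sketch of what scrape_warn.py could look like, assuming the dynamic loading proposed in the reorganization issues above (the ALL_STATES list is a partial placeholder):

    import argparse
    import importlib

    ALL_STATES = ["az", "de", "ks", "mi", "ne", "ny", "oh", "ok", "tn"]  # partial, for illustration

    def main():
        parser = argparse.ArgumentParser(description="Scrape WARN notices by state")
        parser.add_argument("states", nargs="+", help="State abbreviations, or ALL")
        parser.add_argument("--output-dir", default="/tmp/warn-data")
        args = parser.parse_args()

        states = ALL_STATES if args.states == ["ALL"] else args.states
        for state in states:
            module = importlib.import_module(f"warn.scrapers.{state.lower()}")
            module.scrape(args.output_dir)

    if __name__ == "__main__":
        main()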

Set up data writing to some other location outside of the project

Currently we can and do write CSVs to the data/ dir, but this creates commits on the repository, which causes the upstream to move ahead of the local code on developer machines.

It would be good to write the data either to a secondary GitHub repo that solely hosts the data or to stream it to the Big Local GCP bucket.
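
If the GCP route is chosen, the upload could be roughly as simple as this sketch using the google-cloud-storage client (the bucket name and credential setup are assumptions):

    from pathlib import Path

    from google.cloud import storage

    def upload_csv(local_path, bucket_name="big-local-warn-data"):
        """Upload a scraped CSV to a GCS bucket instead of committing it to the repo.

        Assumes application-default credentials are configured;
        the bucket name is a placeholder.
        """
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        blob = bucket.blob(f"warn/{Path(local_path).name}")
        blob.upload_from_filename(local_path)

    # upload_csv("/tmp/warn-data/ok.csv")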
