biglocalnews / warn-transformer

Consolidate, enrich and republish the data gathered by warn-scraper
Home Page: https://warn-transformer.readthedocs.io
License: Apache License 2.0
#231 shows an example of 9 million+ layoffs. Let's check for layoff counts above 999,999, remove them, and throw a warning!
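A minimal sketch of one way to do that check, assuming consolidated rows are dicts with `jobs` and `company` fields (the names are placeholders, not necessarily the transformer's actual schema):

```python
import logging

logger = logging.getLogger(__name__)

MAX_PLAUSIBLE_JOBS = 999_999


def drop_implausible_layoffs(rows):
    """Drop rows reporting more than 999,999 jobs affected and warn about each one."""
    cleaned = []
    for row in rows:
        try:
            jobs = int(row.get("jobs", ""))
        except (TypeError, ValueError):
            jobs = None
        if jobs is not None and jobs > MAX_PLAUSIBLE_JOBS:
            logger.warning(
                "Dropping implausible layoff count %s for %s", jobs, row.get("company")
            )
            continue
        cleaned.append(row)
    return cleaned
```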
We need to reduce the incidence of multiple dates getting shoved into the same field. Let's create a new field in STANDARDIZED_FIELDS and use CT as a pilot state to make sure this doesn't happen.
My suspicion is we'll need to create two date_effective columns: date_layoff and date_closure.
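A rough sketch of the split, assuming the raw CT value can contain up to two slash-formatted dates; the field names and pattern here are illustrative only:

```python
import re

# Matches dates like 1/31/2022 or 01/31/22 (assumption about the CT source format).
DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")


def split_effective_dates(raw_value):
    """Split a raw effective-date string into separate layoff and closure dates."""
    dates = DATE_PATTERN.findall(raw_value or "")
    return {
        "date_layoff": dates[0] if dates else None,
        "date_closure": dates[1] if len(dates) > 1 else None,
    }
```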
@zstumgoren I'm getting a "UnicodeDecodeError" when my standardizing program runs on the VA export file. As you can see, the error happens when my program begins to parse the rows out of the state CSV file, which it has done successfully so far on AK, CT, DC, FL, IN, ME, MO, and NJ. Any ideas what might be causing this bug? Here's the output from my Python program:
(Cody-DellXPS-Uab08hY7) C:\Users\Cody-DellXPS\warn-analysis>python standardize_field_names.py
Processing state ak.csv...
(...)
Processing state va.csv...
Traceback (most recent call last):
File "standardize_field_names.py", line 110, in
main()
File "standardize_field_names.py", line 54, in main
for row_idx, row in enumerate(state_csv):
File "c:\users\cody-dellxps\appdata\local\programs\python\python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 6006: character maps to <undefined>
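The traceback shows the file being read with Windows' default cp1252 codec, which has no mapping for byte 0x90. One common workaround (not necessarily the right fix for this repo) is to open the file with an explicit encoding and fall back to latin-1, which can decode any byte:

```python
import csv


def read_state_csv(path):
    """Read a scraped state CSV, falling back to latin-1 if UTF-8 decoding fails."""
    try:
        with open(path, encoding="utf-8", newline="") as infile:
            return list(csv.reader(infile))
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this read will not raise.
        with open(path, encoding="latin-1", newline="") as infile:
            return list(csv.reader(infile))
```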
As described in biglocalnews/warn-scraper#435, many important fields are missing.
Are they present in the source data?
What's the problem?
| company | postal_code | jobs | hash_id | location | date |
|---|---|---|---|---|---|
| NORTHWEST AIRLINES | OR | 27500 | 50e28021cc053807ecdeaac24862150a3bb0bddc24b25c43d2832a96 | ST. PAUL, MN | 1998-08-11T00:00:00.000Z |
The WI scraper is picking up data from the website that we either 1) don't want to scrape, or 2) don't want in the final analysis data.
This is an example of what is scraped:
and tables like this from the site are the cause:
I'll go ahead and assign myself to this. I think we should drop these rows but integrate the data somehow.
This need was created as a result of the work for #21.
We should create one or more scripts that can take files exported by our scrapers (i.e. the files in the exports/ dir) and produce a single CSV that merges all states. For each state, it should map a subset of fields to some minimal set of standardized fields (e.g. Company Name -> employer, Number of Employees Affected -> number_affected, etc.).
The output of this process should be a single CSV that contains the subset of fields for all states we currently scrape.
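A simplified sketch of that merge, not the project's actual implementation; the per-state column mappings and file layout here are illustrative only:

```python
import csv
from pathlib import Path

STANDARDIZED_FIELDS = ["state", "employer", "number_affected"]

# Map each state's source column names to the standardized names (examples only).
FIELD_MAPS = {
    "ca": {"Company Name": "employer", "Number of Employees Affected": "number_affected"},
    "nj": {"Company": "employer", "Workforce Affected": "number_affected"},
}


def merge_exports(exports_dir, out_path):
    """Merge every state CSV in exports_dir into one standardized CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=STANDARDIZED_FIELDS)
        writer.writeheader()
        for csv_path in sorted(Path(exports_dir).glob("*.csv")):
            state = csv_path.stem
            field_map = FIELD_MAPS.get(state, {})
            with open(csv_path, newline="", encoding="utf-8") as infile:
                for row in csv.DictReader(infile):
                    out_row = {"state": state.upper()}
                    for src, dst in field_map.items():
                        out_row[dst] = row.get(src)
                    writer.writerow(out_row)
```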
Something to consider: we may want to model a wider range of standardized field names on the approach used by https://layoffdata.com/data/
| company | postal_code | jobs | hash_id | location | date | year |
|---|---|---|---|---|---|---|
| Adventist Health St. Helena | CA | 5 | bb9826463869907671fcc3b7b3d94b562af891b7d2994aa4c9fa1658 | Saint Helena | 2008-09-04 | 2008 |
| Dominion/State Line Energy Station | IN | 109 | 4c08e9392e75cdf08e26e4bde78161cb4dfc731ff5e1130aed2ca60a | Hammond | 1202-01-30 | 1202 |
| burlington woods nursing home | NJ | 20 | c2a3baa8fdc69c4af11059e51ed7f1b1ae62b0b6c10c5b1cf2d7b144 | burlington | 3030-08-23 | 3030 |
| Hooper Holmes, Inc. dba Provant Health | RI | 92 | 6410a06432ddae7bc4eee219c914665055867e9f6e00661ee9df60e7 | Warwick | 2108-11-01 | 2108 |
| Nordstrom Providence Place | RI | 181 | 08e170ab6fb5f2e6aaf0186ff83c793a2949476fd7638ae7b1a8e46f | Providence | 2108-10-23 | 2108 |
WARN was enacted in 1988 and there are some states (OR, IL, soon GA) with notices going back to then. I think it's super cool to have that kind of historical perspective where we can. Therefore, I propose that the first valid year should be 1988.
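A minimal validation sketch along those lines, assuming dates have already been normalized to ISO "YYYY-MM-DD" strings; the function and constant names are illustrative:

```python
from datetime import date

FIRST_VALID_YEAR = 1988  # WARN was enacted in 1988


def is_plausible_date(iso_date):
    """Flag dates before 1988 or after the current year as suspect."""
    try:
        year = int(iso_date[:4])
    except (TypeError, ValueError):
        return False
    return FIRST_VALID_YEAR <= year <= date.today().year
```

Against the table above, the 1202, 3030, and 2108 rows would all be flagged, while 2008-09-04 would pass.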
California's location attribution would be a good use case for this. There are two fields where we might find a place. It would be nice if we could start with one, and then fall back to the other if it's null.
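A tiny sketch of that fallback idea; the California column names used here ("City" and "Location") are assumptions, not the scraper's actual headers:

```python
from typing import Optional


def pick_location(row) -> Optional[str]:
    """Use the primary place field, falling back to the secondary one if blank."""
    primary = (row.get("City") or "").strip()
    fallback = (row.get("Location") or "").strip()
    return primary or fallback or None
```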
Currently there are two columns, "Closing" and "Layoff", each marked with yes or no. For standardization purposes, it would be helpful for the crawler to have one column (called something like "Layoff Type") with expected outputs of "Closing" or "Layoff" (see the sketch after these notes).
Do we need to do any standardization with the columns "permanent" or "realignment"? Is this relevant information? I don't think I've noticed it in any other crawlers.
Investigate the relationship between the columns "city/town" and "location city". What is our merging strategy? Should we ignore one column, or is it safe to simply merge all empty/non-empty cells? Edit: drop "location city" and go with the "city/town" column. The question remains of what the purpose of "location city" is, since it doesn't seem to be tied to the business address.
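A sketch of how the two yes/no columns could collapse into one "Layoff Type" value; the column names follow the issue text, the rest is illustrative:

```python
def layoff_type(row):
    """Derive a single layoff type from the state's Closing/Layoff yes-no columns."""
    if (row.get("Closing") or "").strip().lower() == "yes":
        return "Closing"
    if (row.get("Layoff") or "").strip().lower() == "yes":
        return "Layoff"
    return ""
```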
Following the instructions in the README, I set up the project, set the BLN API key, ran make download, and got:
2022-02-23 19:54:12,877 - urllib3.connectionpool - https://api.biglocalnews.org:443 "POST /graphql HTTP/1.1" 200 None
Traceback (most recent call last):
File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/workspaces/warn-transformer/warn_transformer/download.py", line 52, in <module>
run()
File "/workspaces/warn-transformer/warn_transformer/download.py", line 32, in run
p = c.get_project_by_name("WARN Act Notices")
File "/workspaces/warn-transformer/.venv/lib/python3.8/site-packages/bln/client.py", line 435, in get_project_by_name
raise ValueError(f"No project named {name} found")
ValueError: No project named WARN Act Notices found
make: *** [Makefile:76: download] Error 1
I'm able to view the project in the BLN directory and download a file manually. It says here I'm a viewer, but of course I am not in the list of 15 explicitly added users.
Using OpenCorporates API? (https://api.opencorporates.com/)
- Goal is to get standardized layoff vs. closure data from the state (e.g. from closure: "yes" to layoff_type: "closure").
- A secondary goal is to clean up "revised notice" from the company names.
- Regex is hard, and we'll probably clean up company names as part of the "convert to canonical company names" project.
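A rough first pass at that cleanup, assuming the phrase appears as a parenthetical or trailing note; real company-name canonicalization will need more than this:

```python
import re

# Matches "revised" or "revised notice", optionally parenthesized (an assumption
# about how the phrase shows up in the source data).
REVISED_NOTE = re.compile(r"\s*\(?\brevised( notice)?\b\)?\s*", re.IGNORECASE)


def strip_revised_notice(name):
    """Remove 'revised notice' annotations from a company name string."""
    return REVISED_NOTE.sub(" ", name or "").strip(" -,")
```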
Why are there no company names?
Why are several MT rows empty in the output?
The layoff date should be in the output.
Datasets from some states contain different date formats within themselves (e.g. MO, DC) and possibly different conventions for documenting updates.
A thorough date format standardization could be done for all of the states at some point.
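A sketch of one approach to that standardization: try a list of known formats and emit ISO dates. The format list here is illustrative and would grow as each state's data is audited:

```python
from datetime import datetime
from typing import Optional

# Formats seen so far are assumptions; add to this tuple as states are reviewed.
KNOWN_FORMATS = ("%m/%d/%Y", "%m/%d/%y", "%Y-%m-%d", "%B %d, %Y")


def normalize_date(raw) -> Optional[str]:
    """Return an ISO date string, or None if the value matches no known format."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except (ValueError, AttributeError):
            continue
    return None  # leave unparseable values for manual review
```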
standardize_fields.csv could use a pass over all of the states' data.
if you find something suspicious: