
data-covid19-sfbayarea's Introduction

data-covid19-sfbayarea

Processes for sourcing data for the Stop COVID-19 SF Bay Area Pandemic Dashboard. You can find the dashboard’s source code in the sfbrigade/stop-covid19-sfbayarea project on GitHub.

We are looking for feedback! Did you come here looking for a data API? Do you have questions, comments, or concerns? Don't leave yet! Let us know how you are using this project and what you'd like to see implemented. Please leave us your two cents over in Issues under #101 Feedback Mega Thread.

Installation

This project requires Python 3 to run. It was built with version 3.8.6 and may run with other versions, but it takes advantage of assignment expressions, which are only available in Python 3.8+. To install this project, run ./install.sh in your terminal. This sets up the virtual environment and installs all of the dependencies from requirements.txt and requirements-dev.txt. It will not, however, keep the virtual environment active when the script ends. If you want to stay in the virtual environment, you will have to run source env/bin/activate separately after the install script finishes.
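
If you would rather set things up by hand (or install.sh does not work in your environment), the equivalent manual steps are roughly:

$ python3 -m venv env
$ source env/bin/activate
$ pip install -r requirements.txt -r requirements-dev.txt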

Running the scraper

This project includes four separate scraping tools for different purposes: the Legacy CDS Scraper, the County Website Scraper, the County News Scraper, and the Hospitalization Data Scraper. Each is described below.

You can also run each of these tools in Docker. See the “Using Docker” section below.

Legacy CDS Scraper

The Legacy CDS Scraper loads Bay Area county data from the Corona Data Scraper project. Run it by typing into your terminal:

$ ./run_scraper.sh

This takes care of activating the virtual environment and running the actual Python scraping script. If you are managing your virtual environments separately, you can run the Python script directly with:

$ python3 scraper.py

County Website Scraper

The newer county website scraper loads data directly from county data portals or by scraping counties’ public health websites. Running the shell script wrapper will take care of activating the virtual environment for you, or you can run the Python script directly:

# Run the wrapper:
$ ./run_scraper_data.sh

# Or run the script directly if you are managing virtual environments yourself:
$ python3 scraper_data.py

By default, it will output a JSON object with data for all currently supported counties. Use the --help option to see information about additional arguments (the same options also work when running the Python script directly):

$ ./run_scraper_data.sh --help
Usage: scraper_data.py [OPTIONS] [COUNTY]...

  Create a .json with data for one or more counties. Supported counties:
  alameda, san_francisco, solano.

Options:
  --output PATH  write output file to this directory
  --help         Show this message and exit.
  • To scrape a specific county or counties, list the counties you want. For example, to scrape only Alameda and Solano counties:

    $ ./run_scraper_data.sh alameda solano
  • --output specifies a directory to write the output file to instead of printing to your terminal’s STDOUT.

County News Scraper

The news scraper finds official county news, press releases, etc. relevant to COVID-19 and formats it as news feeds. Running the shell script wrapper will take care of activating the virtual environment for you, or you can run the Python script directly:

# Run the wrapper:
$ ./run_scraper_news.sh

# Or run the script directly if you are managing virtual environments yourself:
$ python3 scraper_news.py

By default, it will output a series of JSON Feed-formatted JSON objects, one for each county. Use the --help option to see information about additional arguments (the same options also work when running the Python script directly):

$ ./run_scraper_news.sh --help
Usage: scraper_news.py [OPTIONS] [COUNTY]...

  Create a news feed for one or more counties. Supported counties: alameda,
  contra_costa, marin, napa, san_francisco, san_mateo, santa_clara, solano,
  sonoma.

Options:
  --from CLI_DATE                 Only include news items newer than this
                                  date. Instead of a date, you can specify a
                                  number of days ago, e.g. "14" for 2 weeks
                                  ago.

  --format [json_feed|json_simple|rss]
  --output PATH                   write output file(s) to this directory
  --help                          Show this message and exit.
  • To scrape a specific county or counties, list the counties you want. For example, to scrape only Alameda and Solano counties:

    $ ./run_scraper_news.sh alameda solano
  • --from sets the earliest date/time from which to include news items. It can be a date, like 2020-07-15, a specific time, like 2020-07-15T10:00:00 for 10 am on July 15th, or a number of days before the current time, like 14 for the last 2 weeks.
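
    For example, to include only the last two weeks of Alameda County news:

    $ ./run_scraper_news.sh --from 14 alameda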

  • --format sets the output format. Acceptable values are: json_feed (see the JSON Feed spec), json_simple (a simplified JSON format), or rss (RSS v2). Specify this option multiple times to output multiple formats, e.g.:

    $ ./run_scraper_news.sh --format rss --format json_feed
  • --output specifies a directory to write to instead of your terminal’s STDOUT. Each county and --format combination will create a separate file in the directory. If the directory does not exist, it will be created.
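
These options can be combined. For example, to write RSS and JSON Feed files for San Francisco and Solano counties into a feeds directory:

    $ ./run_scraper_news.sh --format rss --format json_feed --output feeds san_francisco solano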

Hospitalization Data Scraper

The hospitalization data scraper pulls down COVID-19-related hospitalization statistics at the county level from the California Department of Public Health via its CKAN API. To run the scraper, execute the following command in your terminal:

$ ./run_scraper_hospital.sh

By default, this will print time-series data in JSON format to stdout for all nine Bay Area counties, following the structure described in the data model documentation.

Data for all California counties is also available; to select a specific county or list of counties, add them as arguments when running the script. County names should be spelled in lowercase, with underscores replacing spaces:

$ ./run_scraper_hospital.sh alameda los_angeles mendocino

You may also pass an --output flag followed by the path to the directory where you would like the JSON data to be saved. If the directory does not exist, it will be created. The data will be saved as hospital_data.json.
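
For example, the following should save data for Alameda County to out/hospital_data.json:

$ ./run_scraper_hospital.sh --output out alameda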

Using Docker

As an alternative to installing and running the tools normally, you can use Docker to install and run them. This is especially helpful on Windows, where setting up Selenium and the other Linux tools the scraper depends on can be complicated.

  1. Download and install Docker from https://www.docker.com/. (You’ll probably need to create a Docker account as well if you don’t already have one.)

  2. Now run any of the tools by adding their command after ./run_docker.sh. For example, to run the data scraper:

    $ ./run_docker.sh python scraper_data.py

    Under the hood, this builds the Docker container and then runs the specified command in it.

    Docker acts kind of like a virtual machine, and you can also simply get yourself a command prompt inside the Docker container by running ./run_docker.sh with no arguments:

    $ ./run_docker.sh
    # This will output information about the build, and then give you a
    # command prompt:
    root@ca87fa64d822:/app#
    
    # You can now run commands like the data scraper as normal from the prompt:
    root@ca87fa64d822:/app# python scraper_data.py
    root@ca87fa64d822:/app# python scraper_news.py

Data Models

The data models are in JSON format and are located in the data_models directory. For more information, see the data model readme.

Development

We use CircleCI to lint the code and run tests in this repository, but you can (and should!) also run tests locally.

The commands described below should all be run from within the virtual environment you’ve created for this project. If you used install.sh to get set up, you’ll need to activate your virtual environment before running them with the command:

$ source env/bin/activate

If you manage your environments differently (e.g. with Conda or Pyenv-Virtualenv), use whatever method you normally do to set up your environment.

Tests

You can run tests using pytest like so:

# In the root directory of the project:
$ python -m pytest -v .

Some tests run against live websites and can be slow (or worse: they might spam a county's server with requests and get your IP blocked), so they are disabled by default. To run them, set the LIVE_TESTS environment variable. It can be '*' to run live tests against all counties, or a comma-separated list of counties to test.

# Run live tests against all county websites.
$ LIVE_TESTS='*' python -m pytest -v .

# Run live tests against only San Francisco and Sonoma counties.
$ LIVE_TESTS='san_francisco,sonoma' python -m pytest -v .

Linting and Code Conventions

We use Pyflakes for linting. Many editors have support for running it while you type (either built-in or via a plugin), but you can also run it directly from the command line:

# In the root directory of the project:
$ pyflakes .

We also use type annotations throughout the project. To check their validity with Mypy, run:

# In the root directory of the project:
$ mypy .

Reviewing and Merging Pull Requests

  1. PRs that are hotfixes do not require review.

    • Hotfixes repair broken functionality that was previously vetted; they do not add functionality. For these PRs, please feel free to request a review from one or more people.
    • If you are requested to review a hotfix, note that the first priority is to make sure the output is correct. "Get it working first, make it nice later." You do not have to be an expert in the function's history, nor understand every line of the diff. If you can verify whether the output is correct, you are qualified and encouraged to review a hotfix!
    • If no reviewers respond within 2 days, please merge your PR yourself.
    • Examples of hotfixes are:
      1. Fixing broken scrapers
      2. Fixing dependencies - libraries, virtual environments, etc.
      3. Fixing GitHub Actions runs of the scrapers, and fixing CircleCI
  2. PRs that add functionality/features require at least 1 passing review.

    • If you are adding functionality, please explicitly require a review from at least one person.
    • When at least one person has approved the PR, the author of the PR is responsible for merging it in. You must have 1+ approving reviews to merge, but you don't need all requested reviewers to approve.
    • If you are one of the people required for review, please either complete your review within 3 days, or let the PR author know you are unavailable for review.
    • Examples of PRs that add functionality are:
      1. Adding new scrapers
      2. Structural refactors, such as changing the data model, or substantial rewrite of an existing scraper
  3. PRs that update the documentation require at least 1 passing review.

    • Documentation PRs are in the same tier as #2. Please explicitly require a review from at least one person.
    • When at least one person has approved the PR, the author of the PR is responsible for merging it in. You must have 1+ approving reviews to merge, but you don't need all requested reviewers to approve.
    • If you are one of the people required for review, please either complete your review within 3 days, or let the PR author know you are unavailable for review.
    • Examples are:
      1. Updates to the data fetch README
      2. Commenting code
      3. Adding to metadata
  4. Reviewers

    1. Everyone can review #1 hotfixes or #3 documentation. If you want to proactively sign up to be first-string for these reviews, please add your GitHub handle to the list below.

      • @elaguerta
      • @benghancock
    2. Experienced developers with deep knowledge of the project should be tapped for PRs that deal with complicated dependencies, language-specific implementation questions, or structural/architectural concerns. If you want to be first-string for these reviews, please add your GitHub handle to the list below.

      • @Mr0grog
      • @rickpr
      • @ldtcooper
    3. People who have an interest in data, public health, and social science should be tapped for PRs that deal with decisions that affect how data is reported, structured, and provided to the user. If you want to be first-string for these reviews, please add your GitHub handle to the list below.

      • @elaguerta
      • @benghancock
      • @ldtcooper

data-covid19-sfbayarea's People

Contributors

benghancock, cecen0, dependabot[bot], elaguerta, frhino, kengo-sony, kengoy, kwonangela7, mr0grog, rickpr, slim-builder, zappascout


data-covid19-sfbayarea's Issues

Add news scraper for Santa Clara county

News scrapers live in the news directory. You can follow the San Francisco scraper as an example.

Santa Clara County

  1. The COVID-19 homepage has a list of announcements near the bottom we can scrape (no equivalent RSS I can find).

  2. The Public Health Department has a newsroom we can scrape. Can’t find any RSS or Atom feeds for it. :\

  3. The Office of Public Affairs also has a newsroom of the same format with slightly broader coverage. As far as I can tell, though, the Public Health Department one pretty well covers all the coronavirus-related stuff.

  4. There are some SOAP services linked from the COVID-19 page, but they seem to require authentication to access.

Fetch data from existing source

It looks like there are now several data sources that have county-level data, so it may not make sense to manually gather data or even scrape it on a county level. This source looks very promising.

Identify news sources for each county

The news page is currently powered by the data/news.json file, which has so far been manually updated by @kengoy, and only includes information from San Francisco county.

We need to identify useful news sources for all the other Bay Area counties:

  • San Francisco
  • Santa Clara (#64)
  • Alameda (#60)
  • San Mateo (#65)
  • Contra Costa (#63)
  • Marin (#66)
  • Sonoma (#67)
  • Solano (#61)
  • Napa (#62)

…and consider how the news.json might be restructured to split out or tag news by county.

  • Make individual issues for each county

Determine which counties have detailed data

Especially breakdowns by race, ethnicity, gender, and age. For example, Santa Clara County has a dashboard showing cases and deaths by race, ethnicity, gender, age, and deaths by comorbidity. These are not being parsed by the coronadatascraper for this county.

SF county: gaps in timeseries

Describe the bug
Timeseries (cases, deaths, and tests) do not have consecutive daily entries. There are gaps where data is missing, i.e. the counts for that day were 0.

To Reproduce
Steps to reproduce the behavior:

  1. Run sh run_scraper_data.sh san_francisco

Expected behavior
Gaps in the timeseries have datapoint = -1.

Add news scraper for Contra Costa county

News scrapers live in the news directory. You can follow the San Francisco scraper as an example.

Contra Costa County

  1. Has a “news flash” RSS feed with some coronavirus-related items. We could filter it by key terms.

  2. Has a coronavirus-specific “news flash” page (no corresponding RSS). Each item is a paragraph, often with a link or two. It’s not entirely clear how to parse this into a list of news links in a reliable way.

  3. The Health Services Department has a coronavirus updates page we can scrape (no RSS).

  4. The Health Services Department has a newsroom with news and press releases, but is not coronavirus-specific (no RSS).

I think just using (3) is the best bet here.

NewsFeed should always serialize to bytes

The NewsFeed class can currently be serialized to JSON, JSON Feed, or RSS, but each of those serializations results in a different data type:

  • JSON → Dict
  • JSON Feed → Dict
  • RSS → String

That makes handling the output more complex than it should be:

for format_name in format:
    if format_name == 'json_simple':
        data = json.dumps(feed.format_json_simple(), indent=2)
        extension = '.simple.json'
    elif format_name == 'json_feed':
        data = json.dumps(feed.format_json_feed(), indent=2)
        extension = '.json'
    else:
        data = feed.format_rss()
        extension = '.rss'

Instead, they should all serialize to bytes. (Why not strings? RSS, and potentially other future formats, should include an encoding directive on the first line, but we can’t do that if we serialize to a string and don’t know what encoding will later be written to disk.)
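
A minimal sketch of the idea (the helper methods below are illustrative, not the actual NewsFeed API):

import json

class NewsFeed:
    def format_json_simple(self) -> bytes:
        # json.dumps returns str; encode explicitly so every format
        # hands back bytes. (_simple_dict is a hypothetical helper.)
        return json.dumps(self._simple_dict(), indent=2).encode('utf-8')

    def format_rss(self) -> bytes:
        # The encoding named in the XML declaration is now guaranteed
        # to match the bytes we return. (_rss_body is hypothetical.)
        xml = '<?xml version="1.0" encoding="utf-8"?>\n' + self._rss_body()
        return xml.encode('utf-8')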

Data: Update data_scrapers/README.md

A list in no particular order of things to explain and comment on:
Data cleaning:

  1. Add data points with abnormalities like '<10' to metadata
  2. Convert strings to int
  3. Look for timezone info
  4. If cumulative numbers are reported per day, this means the county is not simply adding up the new cases/deaths per day. Look for a disclaimer in the comments or metadata.

Methodology for collapsing race/ethnicity:
From SF:
"In the race/ethnicity data shown below, the "Other” category includes those who identified as Other or with a race/ethnicity that does not fit the choices collected. The “Unknown” includes individuals who did not report a race/ethnicity to their provider, could not be contacted, or declined to answer."
Here's a possible workflow:

  1. Sum over race categories, except for unknown Race.
  2. Sum over Hispanic/Latino (all races) - note that this will double-count individuals who have a known race and are Hispanic/Latino
  3. Set our "unknown" datapoint to count only individuals with both unknown race & unknown ethnicity.
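
A sketch of that workflow with made-up numbers (the category keys are hypothetical):

# Hypothetical marginal tallies for one county; keys are illustrative.
counts = {
    'Asian': 120, 'Black': 80, 'White': 200,
    'Unknown_Race': 40,
    'Latinx_all_races': 90,  # spans all race categories
}

# 1. Sum over known race categories, leaving out unknown race.
known_race_total = sum(
    count for category, count in counts.items()
    if category not in ('Unknown_Race', 'Latinx_all_races'))

# 2. Hispanic/Latino across all races. Note this double-counts anyone
#    who has both a known race and Latinx ethnicity.
latinx_total = counts['Latinx_all_races']

# 3. "Unknown" should count only people with *both* race and ethnicity
#    unknown, which requires cross-tabulated source data rather than
#    the marginal totals above.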

Add news scraper for Sonoma county

News scrapers live in the news directory. You can follow the San Francisco scraper as an example.

Sonoma County

  1. The county home page has 3 recent news items, but they are just a subset of the news page (2) (and not the most recent 3 items, either!) https://sonomacounty.ca.gov/Home/

  2. The county news page has a more complete list: https://sonomacounty.ca.gov/News/

    Some entries are repeated in Spanish; I’m not sure there’s an obvious way to determine which ones are different language versions of the same page. The topics are somewhat broad, going beyond COVID-related stuff.

  3. The county emergency site has a Coronavirus news page: https://socoemergency.org/emergency/novel-coronavirus/latest-news/

  4. The county emergency site has a public health orders page: https://socoemergency.org/emergency/novel-coronavirus/health-orders/

  5. The county emergency site has an RSS feed, which appears to be about any/every page on the site. This also means it has the same pages repeated in English and Spanish in a way that’s hard to put together, like (2) above. https://socoemergency.org/feed/

The fact that (5) appears to show every page on the emergency site as it gets added or updated might be useful, but also seems like not quite the right fit.

(2) is more friendly (e.g. it has the headline “Sonoma County Public Health Officer Amends Shelter in Place Order to Allow Additional Businesses to Reopen” while the emergency site has the headline “Amendment No. 3 to Health Order No. C19-09” for the same news item), but covers a lot of non-COVID stuff.

I’m thinking combining (3) and (4) is probably the way to go here.

Limit news scraper to time range

Over on the front-end side, @kengoy noted that we might want to limit the time range included in the news feeds we produce:

Another consideration is we may want to filter only for the new feeds like for the past 1 or 2 weeks instead of showing all the past feeds.

The current SF scraper just takes everything on the first page of news. Instead, NewsScraper should take a time range or a “from” date to limit what news items get returned.


Update 2020-06-11: Fixed NewsScraper link to point at the class’s new location since the code has been refactored a few times.

Add support for RSS/Atom for news

The news scraper currently outputs a custom, simplified JSON format, but we should really output either RSS or Atom — a standardized format so other tools and feed readers can use our output.

We need to have a CLI argument to choose the format, e.g.:

# RSS
> python3 scraper_news.py --format rss

See #40 about adding another format — JSON Feed — which will be the default.

San Francisco Data Scraper Breaks With Invalid Socrata Query

The San Francisco data scraper appears to be broken. It looks like one of the datasets it pulls from Socrata has changed.

Running python scraper_data.py san_francisco results in the following exception and stack trace:

Traceback (most recent call last):
  File "scraper_data.py", line 37, in <module>
    main()
  File "/Users/rbrackett/.pyenv/versions/data-covid19-sfbayarea/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/rbrackett/.pyenv/versions/data-covid19-sfbayarea/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/rbrackett/.pyenv/versions/data-covid19-sfbayarea/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/rbrackett/.pyenv/versions/data-covid19-sfbayarea/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "scraper_data.py", line 25, in main
    out[county] = data_scrapers.scrapers[county].get_county()
  File "/Users/rbrackett/Dev/sfbrigade/data-covid19-sfbayarea/covid19_sfbayarea/data/san_francisco.py", line 31, in get_county
    out["series"] = get_timeseries(session, RESOURCE_IDS)
  File "/Users/rbrackett/Dev/sfbrigade/data-covid19-sfbayarea/covid19_sfbayarea/data/san_francisco.py", line 81, in get_timeseries
    out_series["cases"] = get_cases_series(session, resource_ids)
  File "/Users/rbrackett/Dev/sfbrigade/data-covid19-sfbayarea/covid19_sfbayarea/data/san_francisco.py", line 93, in get_cases_series
    data = session.resource(resource_id, params=params)
  File "/Users/rbrackett/Dev/sfbrigade/data-covid19-sfbayarea/covid19_sfbayarea/data/utils.py", line 33, in resource
    return self.request(f'{self.resource_url}{resource_id}', **kwargs)
  File "/Users/rbrackett/Dev/sfbrigade/data-covid19-sfbayarea/covid19_sfbayarea/data/utils.py", line 29, in request
    response.raise_for_status()
  File "/Users/rbrackett/.pyenv/versions/data-covid19-sfbayarea/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://data.sfgov.org/resource/tvq9-ec9w?case_disposition=Confirmed&%24select=date%2Csum%28case_count%29+as+cases&%24group=date&%24order=date

Looking at the dataset, there is no longer a date column. Instead, there is now specimen_collection_date.

As a side note, Socrata sends back a [slightly] more informative error message:

{"message":"Invalid SoQL query","errorCode":"query.soql.invalid","data":{}}

We should probably rig up SocrataApi so that it automatically parses that to make a more informative exception.
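
A rough sketch of how that could work (this assumes the requests library, which the traceback shows is in use; the method and attribute names are simplified, not the real SocrataApi internals):

import requests

def request(self, url: str, **kwargs) -> dict:
    response = self.session.get(url, **kwargs)
    if not response.ok:
        try:
            # Socrata error bodies look like:
            # {"message": "Invalid SoQL query", "errorCode": "query.soql.invalid", "data": {}}
            detail = response.json()
            reason = f'{detail.get("errorCode")}: {detail.get("message")}'
        except ValueError:
            reason = response.text
        raise requests.exceptions.HTTPError(
            f'{response.status_code} error for url: {url} ({reason})',
            response=response)
    return response.json()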

New Data: Santa Clara County

Add news scraper for Marin county

News scrapers live in the news directory. You can follow the San Francisco scraper as an example.

Marin County

  1. The county newsroom page looks like a bit of a pain to scrape, but has the most reasonable headlines and useful-news-ness to its appearance: https://www.marincounty.org/main/newsroom

  2. The press-releases page is split up by office/agency, and is probably way too broad in the type of news it’s covering for what we want.

  3. Marin HHS has an updates page with three categories:

    • Status Updates (daily) — this might be a bit much, not sure.
    • Public Health Orders
    • Press Releases

    It probably makes the most sense to mix these together…? No RSS. https://coronavirus.marinhhs.org/updates

(2) is probably no good; (1) or (3) are probably the way to go. Maybe just start with (1).

Build API query to city parking datasets

sfbrigade/stop-covid19-sfbayarea#31

Purpose
Communicate another data point on the effects of shelter-in-place orders (and possibly how well people are adhering to them).

Possible Data Source
https://data.sfgov.org/Transportation/SFMTA-Parking-Meter-Detailed-Revenue-Transactions/imvp-dq3v

Implementation
Website to show parking data, ideally in a map visualization of either the "amount paid" sum or, better, the "number of transactions per block" for the last 24 hours, compared to last month, a year ago, etc.

Suggestion example
None yet; the source visualizations aren't working for me :(

Create directory structure to organize news scrapers

All the code for scraping news sites currently lives in one file: scraper_news.py. That’s fine at the moment, but as we expand to have separate scrapers for each county, one file won’t be manageable.

We need:

  • A directory to hold all the news-scraper-related code, e.g. news/

  • A top-level Python script to serve as an entry point to each of the scrapers, e.g. something like:

    # Run the SF scraper:
    > python3 scrape_news.py san_francisco
    
    # Run the Alameda scraper:
    > python3 scrape_news.py alameda
    
    # Run multiple scrapers:
    > python3 scrape_news.py san_francisco alameda

Build data API

Putting this data in the dashboard is going to be a lot easier if we build an API to let it fetch the data we scrape without having to upload that data to the frontend repo.

Apply data model to old scraper data

Right now, we have a script that fetches death and case data from the Corona Data Scraper. We should apply the new data model to that data so that the frontend can start using it, and so that anyone scraping the harder counties has something to work off of.

Solano County News Summaries Have Missing Spaces

The news summaries for Solano county are sometimes missing spaces between words. See the first added item in this update, for example: https://github.com/sfbrigade/stop-covid19-sfbayarea/pull/237/files

{
  "summary": "At his press conference this afternoon, Governor Newsom, in coordination with the California Department of Public Health, ordered counties that have been on theCounty Monitoring Listfor three consecutive days or more to immediately close bars, brewpubs, breweries and pubs, as well as  indoor operations in certain business sectors for a minimum of three weeks, maybe longer depending on epidemiological indicators.  Solano County \u2013 along with 18 other California Counties \u2013 is on the Governor's list, and will close indoor operations at dine-in restaurants, wineries and tasting rooms, family entertainment centers, indoor movie theatres, indoor zoos and museums and cardrooms effective immediately.  Bars must close regardless if they serve drinks indoors or outdoors."
}

Note the lack of space in theCounty and Listfor around have been on theCounty Monitoring Listfor three consecutive.

To Reproduce

Run ./run_scraper_news.sh solano
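
The summaries are probably being assembled by joining the text of inline HTML elements (links, emphasis, etc.) without separators. A possible fix, assuming the scraper can extract the summary with BeautifulSoup (an assumption about the stack):

from bs4 import BeautifulSoup
import re

def clean_summary(html: str) -> str:
    # Put a space at every element boundary, then collapse whitespace
    # runs, so "on the<a>County Monitoring List</a>for" becomes
    # "on the County Monitoring List for".
    text = BeautifulSoup(html, 'html.parser').get_text(' ')
    return re.sub(r'\s+', ' ', text).strip()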

Revise data model

  1. Consider including active/recovered cases. Solano County and Sonoma County have these.
  2. Research and implement best practices for demographic categories. In particular, we might prefer:
  • Black, replaces African_Amer
  • American_Indian_Alaska_Native, replaces Native_Amer
  • Latinx, replaces Latinx_or_Hispanic

New Data: SF County

Update the SF County data scraper with deaths and cases by gender. If these will not be available via the API, write methods to scrape the dashboards.

Add support for JSON feed for news

The news scraper currently outputs a custom, simplified news format like:

{
  "newsItems": [
    {
      "url": "https://sf.gov/news/expansion-coronavirus-testing-all-essential-workers-sf",
      "text": "Expansion of coronavirus testing for all essential workers in SF",
      "date": "2020-04-23T04:11:56Z"
    },
    {
      "url": "https://sf.gov/news/sfmta-starts-closing-some-sf-streets-promote-safety-and-physical-distancing",
      "text": "SFMTA starts closing some SF streets to promote safety and physical distancing",
      "date": "2020-04-21T21:05:46Z"
    },
    // etc...
}

But we should really output a more standardized format. JSON feed (https://jsonfeed.org/) is a JSON-based alternative to RSS that a lot of news readers support.

Because the front-end will also need to transition to consume the new format, we need to have a CLI argument to choose the format, e.g.:

# JSON Feed
> python3 scraper_news.py --format json

# Current JSON style
> python3 scraper_news.py --format json-simple

The JSON Feed format should be the default if --format is not specified.

News scraper should default to *all* counties

If you don’t specify what counties you want, the news scraper currently defaults to only scraping San Francisco:

if len(counties) == 0:
    # FIXME: this should be COUNTY_NAMES, but we need to fix how the
    # stop-covid19-sfbayarea project uses this first.
    counties = ('san_francisco',)

This is an artifact of how the script was originally set up for use in the frontend repo, and we had to do it for backwards compatibility. However, the frontend now explicitly specifies all counties, so it should be safe to update the default.

Add news scraper for San Mateo county

News scrapers live in the news directory. You can follow the San Francisco scraper as an example.

San Mateo County

  1. The County Manager’s Office hosts press releases that show up under the Coronavirus heading on the county home page: https://cmo.smcgov.org/press-releases

    (Bonus: this actually has an RSS feed! https://cmo.smcgov.org/news/feed)

  2. The county health office has health orders and health officer statements: https://www.smchealth.org/post/health-officer-statements-and-orders-0

  3. The county health office also has “local news”: https://www.smchealth.org/post/local-news-you-need. It’s much less timely and detailed than the CMO office’s press releases. (Actually, it looks like a subset.)

  4. The Joint Information Center (JIC) has a bunch of stuff, but ultimately just shows the CMO press releases (1) for news: https://cmo.smcgov.org/jic

I think just using the RSS feed from the CMO (1) is probably the most sensible here.

Add news scraper for Alameda county

News scrapers live in the news directory. You can follow the San Francisco scraper as an example.

Alameda County

  1. Has a “county news & announcements” RSS feed, but it’s broader than Coronavirus-related news, and has less-than-stellar headlines (e.g. “Alameda County News & Information Update 4/23/20”). We could search the article descriptions for key terms like “covid” or “coronavirus” and filter the feed to entries that match.

  2. Has an “emergency news” RSS feed, but it’s not being used for COVID news (no updates since 2017).

  3. The Public Health Department’s coronavirus page has “situation updates” on the front page that we could scrape.

  4. The Public Health Department also has a coronavirus-specific press releases page we could scrape. It overlaps a little with what’s on the front page “situation updates.”

Not sure which of these is best, or if we should try and combine 1 + 3 + 4 or maybe just 3 + 4. (In both cases, we need to be careful to de-duplicate 3 and 4 since some press releases appear on both pages.)
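
A rough sketch of the keyword filtering and de-duplication (the item fields here are illustrative):

KEY_TERMS = ('covid', 'coronavirus')

def is_relevant(item: dict) -> bool:
    # Match key terms in the title or description, case-insensitively.
    text = (item.get('title', '') + ' ' + item.get('description', '')).lower()
    return any(term in text for term in KEY_TERMS)

def deduplicate(items: list) -> list:
    # Press releases can appear on both (3) and (4); keep only the
    # first occurrence of each URL.
    seen = set()
    unique = []
    for item in items:
        if item['url'] not in seen:
            seen.add(item['url'])
            unique.append(item)
    return unique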

Maintain stable sort order for news items

While checking the latest news scraper output, I noticed that sometimes news items are showing up in different orders from one scrape to the next. This is generally a result of sites that only surface the publish date and not the publish time, and that have more than one news item on a given day. We sort based on publish time:

def append(self, *items: NewsItem) -> None:
    self.items.extend(items)
    self.items.sort(reverse=True, key=lambda item: item.date_published)

The most sensible thing to do here is probably to fall back on other sorting keys when the time is the same, such as title, url, or id (although id is often the same as url for us, since underlying primary keys, etc. are often not surfaced).
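
For example, extending the snippet above to sort on a tuple key makes ties deterministic (title and url are the fallback fields suggested in this issue; adjust to whatever NewsItem actually exposes):

def append(self, *items: NewsItem) -> None:
    self.items.extend(items)
    # When date_published ties, title and URL break the tie the same
    # way on every run.
    self.items.sort(
        reverse=True,
        key=lambda item: (item.date_published, item.title, item.url))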
