Comments (7)
Can I be assigned this? I can add a time limit.
from data-covid19-sfbayarea.
Sure, go for it! 👍
from data-covid19-sfbayarea.
So I am working on adding a time limit to NewsScraper, but I have a few questions.
In the future, will NewsScraper be applied to multiple websites? In this case I think we run into the possibility of each website needing to be scraped differently depending on the setup of the specified website. By this logic it may be better to add date limits to individual methods which handle scraping the dates more specifically. In the meantime, it may be better to add a page limit (increasing the data scraped from 1 page to, say, 10-20).
Essentially, I am proposing to increase the number of pages being scraped rather than doing a date check. In the future for each particular website you can add a time limit more specified to the website being scraped.
from data-covid19-sfbayarea.
> In the future, will NewsScraper be applied to multiple websites? In this case I think we run into the possibility of each website needing to be scraped differently depending on the setup of the specified website.

Hmmm, I think there may be some miscommunication happening here. It already runs against multiple sites: NewsScraper is just the abstract base class that is customized for each site. See:
AlamedaNews
ContraCostaNews
MarinNews
NapaNews
SanFranciscoNews
SanMateoNews
SantaClaraNews
SolanoNews
SonomaNews
What’s needed in NewsScraper is a standard way to pass in and store the date range (i.e. add it to __init__() and get_news()). Then each implementing class needs to use the stored date range when doing its work. I’m sorry if I didn’t describe that clearly enough.
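To make that concrete, here is a rough sketch of what storing the date range on the base class could look like. The names `from_date`, `to_date`, `in_range()`, `fetch_items()`, and `ExampleNews` are all made up for illustration, and the real NewsScraper has more going on than this, so treat it as one possible shape rather than the actual code:

```python
from abc import ABC, abstractmethod
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# (title, published) pairs stand in for whatever item type the real scrapers use.
NewsItem = Tuple[str, datetime]


class NewsScraper(ABC):
    """Simplified stand-in for the real base class (assumption, not the repo code)."""

    def __init__(self, from_date: Optional[datetime] = None,
                 to_date: Optional[datetime] = None) -> None:
        # Store the range once so every subclass can rely on it.
        self.from_date = from_date
        self.to_date = to_date

    def in_range(self, published: datetime) -> bool:
        """Return True if `published` falls inside the stored date range."""
        if self.from_date is not None and published < self.from_date:
            return False
        if self.to_date is not None and published > self.to_date:
            return False
        return True

    @abstractmethod
    def fetch_items(self) -> List[NewsItem]:
        """Site-specific scraping; each county class would implement this."""

    def get_news(self) -> List[NewsItem]:
        # Shared filtering: subclasses only worry about scraping.
        return [item for item in self.fetch_items() if self.in_range(item[1])]


class ExampleNews(NewsScraper):
    """Hypothetical county scraper that returns canned data instead of scraping."""

    def fetch_items(self) -> List[NewsItem]:
        return [
            ("Old update", datetime(2020, 5, 1, tzinfo=timezone.utc)),
            ("Recent update", datetime(2020, 6, 10, tzinfo=timezone.utc)),
        ]


scraper = ExampleNews(from_date=datetime(2020, 6, 1, tzinfo=timezone.utc))
titles = [title for title, _ in scraper.get_news()]
```

With a `from_date` of 2020-06-01, only the June item survives the filter. A site whose pages are sorted by date could also call `in_range()` while scraping and stop early instead of filtering afterwards.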
> In the meantime, it may be better to add a page limit (increasing the data scraped from 1 page to, say, 10-20).

I’m not sure I’m following here. What’s generally needed is to reduce the number of results returned. Scraping more pages from each site (at least for sites where the results are paginated) would only return more results. (I do think following pagination to allow for more results might also be nice in the future, but from a practical perspective, nobody is asking for that.)
> In the future for each particular website you can add a time limit more specified to the website being scraped.

The actual time limit definitely should not have anything to do with what county/website you are generating a news feed for. The goal here is to be able to say “I want a news feed for [county X] that covers [the past Y days].” We can already do the first part:
$ ./run_scraper_news.sh [county_name]
# e.g:
$ ./run_scraper_news.sh san_francisco
And we need to add a way to specify the second part, e.g:
$ ./run_scraper_news.sh san_francisco --since 2020-06-01
# Or maybe in terms of days, so the CLI args don’t have to be different
# each time we run it. So for 2 weeks:
$ ./run_scraper_news.sh san_francisco --since 14
That means parsing the CLI options, creating a way to send them to any instance of NewsScraper, and then ultimately making each implementation of NewsScraper make use of that information when it does its work.
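For the CLI part, something along these lines might work. This is only a sketch using the standard library's `argparse`; I am not assuming anything about how the Python entry point behind `run_scraper_news.sh` currently parses its arguments, and `parse_since()` and `build_parser()` are hypothetical helpers. It accepts either form of `--since` shown above (a date or a number of days):

```python
import argparse
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_since(value: str, now: Optional[datetime] = None) -> datetime:
    """Turn a --since value into a UTC datetime.

    A bare integer is read as "this many days ago"; anything else must be
    a YYYY-MM-DD date.
    """
    now = now or datetime.now(timezone.utc)
    if value.isdigit():
        return now - timedelta(days=int(value))
    try:
        return datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected a number of days or a YYYY-MM-DD date, got {value!r}"
        )


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Generate a county news feed.")
    parser.add_argument("county", help="county name, e.g. san_francisco")
    parser.add_argument(
        "--since",
        type=parse_since,
        default=None,
        help="only include news since this date (YYYY-MM-DD) or this many days ago",
    )
    return parser


# Equivalent to: ./run_scraper_news.sh san_francisco --since 2020-06-01
args = build_parser().parse_args(["san_francisco", "--since", "2020-06-01"])
```

The resulting `args.since` would then be passed to whichever NewsScraper subclass matches `args.county`. Leaving `--since` off gives `None`, which can mean "no limit" so existing invocations keep working.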
from data-covid19-sfbayarea.
@csaoma Just checking in — have you made any progress on this, or do you need any help with it?
from data-covid19-sfbayarea.
Hey! Sorry, I haven't made much progress on this recently due to my busy work schedule and other personal projects. I am going to go ahead and unassign myself. Sorry for the inconvenience!
from data-covid19-sfbayarea.
No worries! Thanks for responding. :)
from data-covid19-sfbayarea.