Comments (7)

csaoma commented on September 28, 2024

Can I be assigned this? I can add a time limit.

Mr0grog commented on September 28, 2024

Sure, go for it! 👍

csaoma commented on September 28, 2024

So I am working on adding a time limit to NewsScraper, but I have a few questions.

In the future, will NewsScraper be applied to multiple websites? If so, I think we run into the possibility that each website will need to be scraped differently depending on how it is set up. By that logic, it may be better to add date limits to the individual methods that handle scraping dates for each specific site. In the meantime, it may be better to add a page limit (increasing the data scraped from 1 page to, say, 10-20 pages).

Essentially, I am proposing to increase the number of pages being scraped rather than doing a date check. In the future, a time limit tailored to each particular website could be added.

Mr0grog commented on September 28, 2024

In the future, will NewsScraper be applied to multiple websites? If so, I think we run into the possibility that each website will need to be scraped differently depending on how it is set up.

Hmmm, I think there may be some miscommunication happening here. It already runs against multiple sites — NewsScraper is just the abstract base class that is customized for each site. See:

What’s needed in NewsScraper is a standard way to pass in and store the date range (i.e. add it to __init__() and get_news()). Then each implementing class needs to use the stored date range when doing its work. I’m sorry if I didn’t describe that clearly enough.
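
As a rough sketch only (the parameter names, the to_date argument, and the List[dict] return type are assumptions, not the repo’s actual code), the base class change could look something like this:

from abc import ABC, abstractmethod
from datetime import date
from typing import List, Optional


class NewsScraper(ABC):
    """Abstract base class; each county site implements its own subclass."""

    def __init__(self, from_date: Optional[date] = None, to_date: Optional[date] = None):
        # Store the requested date range so every subclass can consult it.
        self.from_date = from_date
        self.to_date = to_date

    @abstractmethod
    def get_news(self) -> List[dict]:
        # Site-specific implementations should skip (or stop paginating past)
        # items that fall outside self.from_date / self.to_date.
        ...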

In the meantime, it may be better to add a page limit (increasing the data scraped from 1 page to, say, 10-20 pages).

I’m not sure I’m following here. What’s generally needed is to reduce the number of results returned. Scraping more pages from each site (at least for sites where the results are paginated) would only return more results. (I do think following pagination to allow for more results might also be nice in the future, but from a practical perspective, nobody is asking for that.)

In the future, a time limit tailored to each particular website could be added.

The actual time limit definitely should not have anything to do with what county/website you are generating a news feed for. The goal here is to be able to say “I want a news feed for [county X] that covers [the past Y days].” We can already do the first part:

$ ./run_scraper_news.sh [county_name]
# e.g.:
$ ./run_scraper_news.sh san_francisco

And we need to add a way to specify the second part, e.g.:

$ ./run_scraper_news.sh san_francisco --since 2020-06-01

# Or maybe in terms of days, so the CLI args don’t have to be different
# each time we run it. So for 2 weeks:
$ ./run_scraper_news.sh san_francisco --since 14

That means parsing the CLI options, creating a way to send them to any instance of NewsScraper, and then ultimately making each implementation of NewsScraper make use of that information when it does its work.
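
As a rough sketch of that wiring, assuming an argparse-based Python entry point behind run_scraper_news.sh (the SCRAPERS registry and the commented-out class names are hypothetical, not the project’s actual module layout):

import argparse
from datetime import date, timedelta

# Hypothetical registry of county name -> site-specific NewsScraper subclass;
# the real project's module layout and class names may differ.
SCRAPERS = {
    # "san_francisco": SanFranciscoNewsScraper,
    # "alameda": AlamedaNewsScraper,
}


def main():
    parser = argparse.ArgumentParser(description="Generate a county news feed.")
    parser.add_argument("county", help='County to scrape, e.g. "san_francisco"')
    parser.add_argument("--since", type=int, default=14,
                        help="Only include news from the last N days")
    args = parser.parse_args()

    # Convert the --since day count into a concrete cutoff date and hand it
    # to the scraper, which filters its results accordingly.
    from_date = date.today() - timedelta(days=args.since)
    scraper = SCRAPERS[args.county](from_date=from_date)
    for item in scraper.get_news():
        print(item)


if __name__ == "__main__":
    main()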

Mr0grog commented on September 28, 2024

@csaoma Just checking in — have you made any progress on this, or do you need any help with it?

csaoma commented on September 28, 2024

Hey! Sorry, I haven't made much progress on this recently due to my busy work schedule and other personal projects. I am going to go ahead and unassign myself. Sorry for the inconvenience!

Mr0grog commented on September 28, 2024

No worries! Thanks for responding. :)
