rssgrab's People

Contributors

milost

rssgrab's Issues

Edit grabber

We should be able to edit the definition of a grabber, for example to change the interval at which it is executed. For now, the only editable fields should be:

  • the name
  • the feed URL
  • the interval at which the grabber runs
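Assuming grabbers are stored as plain dicts, a minimal sketch of such a field whitelist (the names `edit_grabber` and `interval_minutes` are illustrative, not part of the project):

```python
# Minimal sketch: restrict edits to a whitelist of editable fields.
EDITABLE_FIELDS = {"name", "feed_url", "interval_minutes"}

def edit_grabber(grabber: dict, changes: dict) -> dict:
    """Return an updated copy of the grabber, applying only editable fields."""
    unknown = set(changes) - EDITABLE_FIELDS
    if unknown:
        raise ValueError(f"fields not editable: {sorted(unknown)}")
    updated = dict(grabber)
    updated.update(changes)
    return updated
```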

Config file

We need a config file where we can specify things like:

  • the type of the database that should be used (MongoDB etc.)
  • the server the grabber should connect to (host, port)
  • the database the grabber should connect to (default: rssgrab)

... and maybe other things.
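One possible shape for such a config file, assuming YAML (all field names are illustrative):

```yaml
# rssgrab.yml -- illustrative field names
database:
  type: mongodb        # database backend to use
  host: localhost      # server the grabber connects to
  port: 27017
  name: rssgrab        # default database name
```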

Grabber stats

It would be interesting to show some simple stats like:

  • How long the grabber has been running.
  • How many articles it has grabbed since it started.
  • How many articles it grabbed during its last execution.
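A minimal sketch of how these counters could be tracked (the `GrabberStats` class and its field names are illustrative):

```python
import time
from dataclasses import dataclass, field

@dataclass
class GrabberStats:
    """Illustrative per-grabber counters."""
    started_at: float = field(default_factory=time.time)
    total_articles: int = 0
    last_run_articles: int = 0

    def record_run(self, article_count: int) -> None:
        """Update the counters after one execution of the grabber."""
        self.last_run_articles = article_count
        self.total_articles += article_count

    def uptime_seconds(self) -> float:
        """How long the grabber has been running."""
        return time.time() - self.started_at
```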

Add new Grabber

It should be possible to define a new Grabber and add it to the system.

  • We should be able to specify how often a grabber gets executed.

Delete grabber

We should be able to delete a specific grabber from the system.

Add execution interval

Every grabber should know how often it should execute itself (every hour, once per day, etc.).
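A minimal sketch of the due-time check, assuming each grabber remembers its last run (the function names are illustrative):

```python
from datetime import datetime, timedelta

def next_run(last_run: datetime, interval: timedelta) -> datetime:
    """Compute when a grabber should execute next."""
    return last_run + interval

def is_due(last_run: datetime, interval: timedelta, now: datetime) -> bool:
    """True if the grabber's interval has elapsed and it should run again."""
    return now >= next_run(last_run, interval)
```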

Execute grabber

During execution, a grabber should fetch its feed and start downloading all relevant articles in the feed. Use the requests package to download the articles. After successfully downloading the content of a page, it should store that content in the database.

  • Download feed
  • Grab articles from feed
  • Store articles in database
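The three steps above could be sketched with the standard library alone (using `urllib` here in place of the requests package, and assuming an RSS 2.0 feed; `execute_grabber` and the `store` callback are illustrative names):

```python
import urllib.request
import xml.etree.ElementTree as ET

def extract_article_urls(feed_xml: str) -> list:
    """Pull the <link> of every <item> out of an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("link") for item in root.iter("item")]

def execute_grabber(feed_url: str, store) -> int:
    """Download the feed, grab each article, hand the content to `store`."""
    with urllib.request.urlopen(feed_url) as resp:          # 1. download feed
        feed_xml = resp.read().decode("utf-8")
    urls = extract_article_urls(feed_xml)
    for url in urls:                                        # 2. grab articles
        with urllib.request.urlopen(url) as resp:
            store(url, resp.read())                         # 3. store in database
    return len(urls)
```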

Database configuration

We should be able to configure the database the grabber writes the data to. We could also do this in a YAML config file.

Start, Stop a grabber

We should be able to start and stop a specific grabber. Here, start and stop do not mean starting or stopping a process, but rather scheduling the grabber for execution or removing it from the schedule.

Pagination support

A downloaded article can span multiple pages. We should be able to get the entire article by paginating through all the relevant pages and storing them too. To do this, there needs to be some mechanism for discovering that an article continues across multiple pages.
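One possible discovery mechanism is to follow `rel="next"` links in the article HTML; a minimal stdlib sketch (the class and function names are illustrative):

```python
from html.parser import HTMLParser

class NextLinkFinder(HTMLParser):
    """Collects the href of an <a rel="next"> or <link rel="next"> element."""
    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("a", "link") and attrs.get("rel") == "next":
            self.next_url = attrs.get("href")

def find_next_page(html: str):
    """Return the URL of the next page of an article, or None."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_url
```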

Caching

During the execution of a grabber (#5), we are only interested in discovering new articles. Instead of saving articles multiple times (because they are mentioned in multiple subsequent feed-grabbing sessions), we only want to identify what is new in a feed that we crawl multiple times.

Therefore, we need a way to remember what was in the feed the last time we crawled it. Using this information, we can easily compute the difference between the two crawls and grab only the articles that are new.

One possibility is to use a simple caching service, such as an in-memory hashmap, Beaker, or Redis, to store all the URLs that were contained in the feed during the previous crawl.

This issue is directly connected to Issue #5.
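A minimal sketch of the set-difference approach with an in-memory cache (the `new_urls` helper is illustrative):

```python
def new_urls(current_feed_urls, cache: set) -> set:
    """Return URLs not seen in the previous crawl, then update the cache.

    `cache` holds the URLs from the previous crawl of this feed; after the
    call it holds the URLs of the current crawl.
    """
    current = set(current_feed_urls)
    fresh = current - cache        # difference between the two crawls
    cache.clear()
    cache.update(current)
    return fresh
```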
