rssgrab's People

Contributors

milost

rssgrab's Issues

Edit grabber

We should be able to edit the definition of a grabber, for example to change the interval at which it is executed. For now, the only editable fields should be:

  • the name
  • the feed URL
  • the interval at which the grabber runs
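Assuming grabbers are stored as plain dicts, a minimal sketch of such a field whitelist (the names `edit_grabber` and `interval_minutes` are illustrative, not part of the project):

```python
# Minimal sketch: restrict edits to a whitelist of editable fields.
EDITABLE_FIELDS = {"name", "feed_url", "interval_minutes"}

def edit_grabber(grabber: dict, changes: dict) -> dict:
    """Return an updated copy of the grabber, applying only editable fields."""
    unknown = set(changes) - EDITABLE_FIELDS
    if unknown:
        raise ValueError(f"fields not editable: {sorted(unknown)}")
    updated = dict(grabber)
    updated.update(changes)
    return updated
```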

Config file

We need a config file where we can specify things like:

  • the type of the database that should be used (MongoDB etc.)
  • the server the grabber should connect to (host, port)
  • the database the grabber should connect to (default: rssgrab)

... and maybe other things.
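One possible shape for such a config file, assuming YAML (all field names are illustrative):

```yaml
# rssgrab.yml -- illustrative field names
database:
  type: mongodb        # database backend to use
  host: localhost      # server the grabber connects to
  port: 27017
  name: rssgrab        # default database name
```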

Grabber stats

It would be interesting to show some simple stats like:

  • How long the grabber has been running.
  • How many articles it has grabbed since it started.
  • How many articles it grabbed during its last execution.
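A minimal sketch of how these counters could be tracked (the `GrabberStats` class and its field names are illustrative):

```python
import time
from dataclasses import dataclass, field

@dataclass
class GrabberStats:
    """Illustrative per-grabber counters."""
    started_at: float = field(default_factory=time.time)
    total_articles: int = 0
    last_run_articles: int = 0

    def record_run(self, article_count: int) -> None:
        """Update the counters after one execution of the grabber."""
        self.last_run_articles = article_count
        self.total_articles += article_count

    def uptime_seconds(self) -> float:
        """How long the grabber has been running."""
        return time.time() - self.started_at
```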

Add new Grabber

It should be possible to define a new Grabber and add it to the system.

  • We should be able to specify how often a grabber gets executed.

Delete grabber

We should be able to delete a specific grabber from the system.

Add execution interval

Every grabber should know how often it should execute itself (every hour, once per day, etc.).
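A minimal sketch of the due-time check, assuming each grabber remembers its last run (the function names are illustrative):

```python
from datetime import datetime, timedelta

def next_run(last_run: datetime, interval: timedelta) -> datetime:
    """Compute when a grabber should execute next."""
    return last_run + interval

def is_due(last_run: datetime, interval: timedelta, now: datetime) -> bool:
    """True if the grabber's interval has elapsed and it should run again."""
    return now >= next_run(last_run, interval)
```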

Execute grabber

During execution, a grabber should fetch its feed and start downloading all relevant articles in the feed. Use the requests package to download the articles. After successfully downloading the content of a page, it should store that content in the database.

  • Download feed
  • Grab articles from feed
  • Store articles in database
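The three steps above could be sketched with the standard library alone (using `urllib` here in place of the requests package, and assuming an RSS 2.0 feed; `execute_grabber` and the `store` callback are illustrative names):

```python
import urllib.request
import xml.etree.ElementTree as ET

def extract_article_urls(feed_xml: str) -> list:
    """Pull the <link> of every <item> out of an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("link") for item in root.iter("item")]

def execute_grabber(feed_url: str, store) -> int:
    """Download the feed, grab each article, hand the content to `store`."""
    with urllib.request.urlopen(feed_url) as resp:          # 1. download feed
        feed_xml = resp.read().decode("utf-8")
    urls = extract_article_urls(feed_xml)
    for url in urls:                                        # 2. grab articles
        with urllib.request.urlopen(url) as resp:
            store(url, resp.read())                         # 3. store in database
    return len(urls)
```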

Database configuration

We should be able to configure the database the grabber writes the data to. We could also do this in a YAML config file.

Start, Stop a grabber

We should be able to start and stop a specific grabber. Here, start and stop do not mean starting or stopping a process, but rather scheduling the grabber for execution or removing it from the schedule.

Pagination support

A downloaded article can span multiple pages. We should be able to get the entire article by paginating through all the relevant pages and storing them too. To do this, there needs to be some mechanism for discovering that an article continues across multiple pages.
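One possible discovery mechanism is to follow `rel="next"` links in the article HTML; a minimal stdlib sketch (the class and function names are illustrative):

```python
from html.parser import HTMLParser

class NextLinkFinder(HTMLParser):
    """Collects the href of an <a rel="next"> or <link rel="next"> element."""
    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("a", "link") and attrs.get("rel") == "next":
            self.next_url = attrs.get("href")

def find_next_page(html: str):
    """Return the URL of the next page of an article, or None."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_url
```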

Caching

During the execution of a grabber (#5), we are only interested in discovering new articles. Instead of saving articles multiple times (because they are mentioned in multiple subsequent feed-grabbing sessions), we only want to identify what is new in a feed that we crawl multiple times.

Therefore, we need a way to remember what was in the feed the last time we crawled it. Using this information, we can easily compute the difference between the two crawls and grab only the articles that are new.

One possibility is to use a simple caching service, such as an in-memory hashmap, Beaker, or Redis, to store all the URLs that were contained in the feed during the previous crawl.

This issue is directly connected to Issue #5.
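A minimal sketch of the set-difference approach with an in-memory cache (the `new_urls` helper is illustrative):

```python
def new_urls(current_feed_urls, cache: set) -> set:
    """Return URLs not seen in the previous crawl, then update the cache.

    `cache` holds the URLs from the previous crawl of this feed; after the
    call it holds the URLs of the current crawl.
    """
    current = set(current_feed_urls)
    fresh = current - cache        # difference between the two crawls
    cache.clear()
    cache.update(current)
    return fresh
```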
