NewsRazor

Description

This application collects news from selected sources and "razors" it down to what interests you.
The results of your search depend on your filters.
You will be asked for your filters when launching the application. You can overwrite these filters on each restart if required.
You can log in with different usernames to prevent others from overwriting your filters.

Supported Pages

If you want to collaborate and add support for further reliable news pages, refer to the Collaboration section.

Installation

Make sure Python is installed.
Open cmd:
> cd "project folder"
> pip install pipreqs
> pipreqs . --> generates ***requirements.txt*** for the Python scripts in the folder
> pip install -r requirements.txt

Usage

Open cmd:
> cd "project folder"
> python .\src\main.py
Follow the instructions in the CLI.

Collaboration

We especially need collaborators to add support for further reliable news pages. We work with feature branches and pull requests, which need to be approved by the core group working on the project.

In order to add support for new pages, you need to do the following:

Add the URL to the urls.txt file
- make sure not to add a slash at the end of the URL, as this would cause issues with the crawler implementation
Ignore 'technical' links
- Run the app. By now, it should also crawl the newly added webpage
- Check the output for 'technical' links which certainly provide no news (e.g. "/contact", "/sitemap", ...). If there are any, add them to ignoreByDefault.txt
- Attention: add the slash at the beginning in this file, to make sure only those specific URLs are ignored!
- Attention: do not add them if they contain buzzwords which could often appear in news!
Add a "crawlXxx.py" file for the logic to search through the specific webpage
- crawlCbsNews.py is a good example of what to implement and how
- implement at least the functions
  - printArticle(url)
  - printRelatedArticles(url)
- inspect the webpage to figure out how the elements need to be addressed
- make sure to search for the 'h1' tag, since we want to display the headlines
Finally, in the spider.py file, hook up the new crawler in the functions at the bottom of the file.
- following the example of "readNewsPage(url)", add:
```
elif url.startswith("https://your.url.com"):
  yourCrawler.printArticle(url)
  yourCrawler.printRelatedArticles(url)
```
Add a link to the new page under Supported Pages
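
As a rough illustration of the steps above, a new "crawlXxx.py" could look something like the sketch below. It is not taken from crawlCbsNews.py: it uses only the Python standard library, and the helper names (parsePage, relatedLinks, fetch) as well as the optional html and ignored parameters are hypothetical additions so the logic can be tried without network access. It also assumes that entries from ignoreByDefault.txt are compared against the exact URL path, which may differ from what the app actually does.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _PageParser(HTMLParser):
    """Collects the text of every <h1> and the href of every <a>."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self.links = []
        self._in_h1 = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            # Collapse whitespace so multi-line headlines print on one line.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headlines.append(text)

    def handle_data(self, data):
        if self._in_h1:
            self._buffer.append(data)


def parsePage(html):
    """Return (headlines, raw hrefs) found in an HTML string."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return parser.headlines, parser.links


def relatedLinks(baseUrl, links, ignored=()):
    """Return absolute same-site links whose path is not in the ignore list."""
    site = urlparse(baseUrl).netloc
    result = []
    for href in links:
        absolute, _ = urldefrag(urljoin(baseUrl, href))
        parsed = urlparse(absolute)
        if parsed.netloc != site or absolute == baseUrl:
            continue  # external link, mailto:, or a link back to this page
        if parsed.path.rstrip("/") in ignored:
            continue  # assumption: ignore entries are exact paths like "/contact"
        if absolute not in result:
            result.append(absolute)
    return result


def fetch(url):
    """Download a page and decode it as UTF-8 (undecodable bytes are replaced)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def printArticle(url, html=None):
    """Print the headlines (<h1>) of the page; html is an optional offline input."""
    headlines, _ = parsePage(html if html is not None else fetch(url))
    for headline in headlines:
        print(headline)


def printRelatedArticles(url, html=None, ignored=()):
    """Print same-site links found on the page, skipping ignored paths."""
    _, links = parsePage(html if html is not None else fetch(url))
    for link in relatedLinks(url, links, ignored):
        print(link)
```

Calling printArticle(url) and printRelatedArticles(url) without the extra arguments downloads the page, which matches the two-function interface that spider.py expects. A real crawler for a specific site will probably need more targeted element selection than "every h1 and every link".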

Example Commit (note: readme update not included!)
