A simple scraper that collects news from CNN by parsing its RSS feed. Written in Python 2.7.
To use it, simply run cnn_scrape.py.
- Make sure you are connected to the Internet.
- Make sure that "feedparser.py" is in the same folder as cnn_scrape.py.
- Make sure you have beautifulsoup4 installed.
feedparser.py parses the XML of the RSS feed.
Beautiful Soup parses the HTML of each news web page.
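
As a rough illustration of how the two libraries divide the work, here is a minimal sketch. It is not the code in cnn_scrape.py: the helper names list_entries and extract_paragraphs, the sample feed, and the sample page are all made up for this example, and the data is inlined so that nothing is downloaded. Real CNN pages will likely need more specific selectors than "every p tag".

```python
# Sketch only: sample data is inlined so nothing is fetched from the network.
# SAMPLE_RSS and SAMPLE_HTML are invented for illustration, not real CNN data.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/first.html</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/second.html</link>
    </item>
  </channel>
</rss>"""

SAMPLE_HTML = """<html><body>
<h1>First headline</h1>
<p>Opening paragraph.</p>
<p>Closing paragraph.</p>
</body></html>"""


def list_entries(rss_source):
    """Return (title, link) pairs for every item in an RSS feed.

    rss_source may be a feed URL or, as here, the XML text itself.
    """
    # Imported inside the function so this file still loads when the
    # dependency is missing; feedparser.py must be importable to call it.
    import feedparser
    feed = feedparser.parse(rss_source)
    return [(entry.title, entry.link) for entry in feed.entries]


def extract_paragraphs(html):
    """Return the stripped text of every <p> element in an HTML page."""
    # Requires beautifulsoup4 (the bs4 package).
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text().strip() for p in soup.find_all("p")]
```

With both dependencies available, list_entries(SAMPLE_RSS) should give one (title, link) pair per item, and extract_paragraphs(SAMPLE_HTML) should give the text of the two paragraphs. In a real run you would pass the feed URL to list_entries and the downloaded page of each link to extract_paragraphs.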
----What you can do:
- Make it more user-friendly by adding instructional output and "try...except" statements.
- Give it a nice UI so that it can be used as an RSS reader.
- Improve the crawler so that it changes its IP address automatically while crawling, so that it won't be banned by the website's server.
- Add information retrieval methods so that it collects only the news that matches a given query.
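
The error handling and IP rotation ideas from the list above could be combined roughly as follows. This is only a sketch under assumptions: it rotates through a list of HTTP proxies (one possible way to change the visible IP address), the proxy addresses are placeholders, and the names open_via_proxy and fetch_with_rotation are invented here. It uses only the standard library and should run on Python 2.7 and Python 3.

```python
import itertools

try:  # Python 3
    from urllib.request import ProxyHandler, build_opener
except ImportError:  # Python 2.7, which the scraper targets
    from urllib2 import ProxyHandler, build_opener

# Hypothetical proxy addresses; replace them with proxies you are allowed to use.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]


def open_via_proxy(url, proxy, timeout=10):
    """Download url through a single HTTP proxy and return the raw bytes."""
    opener = build_opener(ProxyHandler({"http": proxy, "https": proxy}))
    response = opener.open(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


def fetch_with_rotation(url, proxies, max_attempts=3, download=open_via_proxy):
    """Try url through successive proxies; return the page bytes or None.

    download is injectable so the rotation logic can be tried offline.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = itertools.cycle(proxies)  # wraps around when the list is exhausted
    for attempt in range(1, max_attempts + 1):
        proxy = next(pool)
        try:
            return download(url, proxy)
        except (IOError, ValueError) as error:
            # URLError and socket timeouts are IOError subclasses;
            # ValueError covers malformed URLs.
            print("Attempt %d via %s failed: %s" % (attempt, proxy, error))
    print("Giving up on %s after %d attempts." % (url, max_attempts))
    return None
```

Returning None instead of letting the exception escape means the main loop can skip a failed article and continue with the next feed entry. Each failed attempt moves on to the next proxy rather than retrying the same one.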
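
For the query idea, one simple starting point is to score each collected article against the query with TF-IDF and keep only the articles that match. The sketch below is an assumption about how this might look, not part of the scraper: rank_articles and tokenize are hypothetical names, and articles is assumed to be a dict that maps each title to its already extracted text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_articles(query, articles):
    """Return (score, title) pairs for matching articles, best match first.

    articles maps title -> article text. Articles that contain none of
    the query terms are left out.
    """
    doc_tokens = {title: Counter(tokenize(text)) for title, text in articles.items()}
    n_docs = len(doc_tokens)
    query_terms = set(tokenize(query))
    scored = []
    for title, counts in doc_tokens.items():
        total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            # Number of articles that contain this term at least once.
            doc_freq = sum(1 for other in doc_tokens.values() if term in other)
            if doc_freq == 0 or total == 0:
                continue
            tf = counts[term] / float(total)  # term frequency in this article
            idf = math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0  # smoothed
            score += tf * idf
        if score > 0:
            scored.append((score, title))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored
```

Calling rank_articles("election", articles) then gives the titles most relevant to "election" at the front of the list, and an empty list when nothing matches. Stop-word removal and stemming would be natural next refinements.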