Web Crawler Implementation

Requirements:
- Download a given web page from a website
- Using appropriate techniques, extract the hyperlinks found on the downloaded page
- Store the links in a database
- Fetch new links from the database and display them in a UI
- Continue to crawl the newly found links

Notes:
- Use multithreading and event handling where feasible
- The application must compile and run in Visual Studio 2010 or 2012 (the project must include the data store as well as all necessary libraries and resources)
- As a guideline, you should spend a maximum of 8 hours in total developing the application
The code contains the following modules:
- Downloader
  a. Downloads the web page and extracts the links from it using the HtmlAgilityPack DLL
- Crawl WebPage
  a. Holds the information about a crawled page
- Components
  a. Multithreaded component: manages the multiple threads that perform the crawling
  b. Queue component: feeds the links to the crawler threads
- Database
  a. One table called 'crawl'; you can find the create script at 'Pigo\Database\Create_Table.sql'
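The Components module above can be sketched as a producer/consumer pattern; this is an illustration, not the project's actual code, and the names (`LinkQueue`, `CrawlWorker`) are invented. `BlockingCollection` is available from .NET 4, so it works in Visual Studio 2010.

```csharp
// Sketch of the queue-plus-threads idea from the Components module.
// Class and member names are illustrative, not the actual project classes.
using System;
using System.Collections.Concurrent;
using System.Threading;

class CrawlerSketch
{
    // Thread-safe queue feeding links to the worker threads (.NET 4+).
    static readonly BlockingCollection<string> LinkQueue =
        new BlockingCollection<string>();

    static void Main()
    {
        // Start a fixed pool of crawler threads.
        for (int i = 0; i < 4; i++)
        {
            var t = new Thread(CrawlWorker) { IsBackground = true };
            t.Start();
        }

        LinkQueue.Add("http://example.com/");   // seed URL
        Thread.Sleep(1000);                     // let the workers run (demo only)
    }

    static void CrawlWorker()
    {
        // GetConsumingEnumerable blocks until a link is available.
        foreach (string url in LinkQueue.GetConsumingEnumerable())
        {
            Console.WriteLine("Crawling " + url);
            // download the page, extract links, enqueue the new ones...
        }
    }
}
```

In a real crawler the workers would add the extracted links back into the queue, and `CompleteAdding` would be called to shut the pool down cleanly.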
Note: Please change the app settings as below: update the ConnectionString according to your SQL Server setup.

Weaknesses:
- Validations are not properly handled in the code, i.e. validation of the web page content and crawling validations
- Stored procedures are not used, and the table is not indexed
- The HTML parser is not written from scratch

Improvement areas:
- Write an efficient HTML parser of our own
- Crawl high-rank web pages ahead of normal pages
- Write an algorithm for a re-visit crawling policy
- Write a reinforcement machine learning algorithm for focused crawling using some pre-training data
- Add URL caching techniques for web crawling
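The missing-index weakness could be addressed directly in the table script. The snippet below is illustrative only; the real schema lives in 'Pigo\Database\Create_Table.sql', and the column names here are assumptions.

```sql
-- Illustrative only: the actual schema is in Pigo\Database\Create_Table.sql.
-- Column names are assumed; adjust to the real script.
CREATE TABLE crawl (
    Id        INT IDENTITY(1,1) PRIMARY KEY,
    Url       NVARCHAR(450) NOT NULL,   -- 450 keeps the key under SQL Server's 900-byte index limit
    IsCrawled BIT NOT NULL DEFAULT 0
);

-- A unique index on Url both speeds up lookups and enforces the
-- "never crawl the same page twice" rule at the database level.
CREATE UNIQUE INDEX IX_crawl_Url ON crawl (Url);
```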
Crawling strategy:
The downloader is implemented using multithreading and queue techniques. To extract links from a given page I have used the 'HtmlAgilityPack' DLL ('Pigo\Library\HtmlAgilityPack.dll').
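Link extraction with HtmlAgilityPack typically looks like the sketch below; this is a minimal illustration, not the project's actual Downloader code, and error handling is omitted.

```csharp
// Sketch of link extraction with HtmlAgilityPack; error handling omitted.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class LinkExtractor
{
    static List<string> ExtractLinks(string url)
    {
        var links = new List<string>();
        var doc = new HtmlWeb().Load(url);   // download and parse the page

        // SelectNodes returns null when no matching <a href> nodes exist.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (HtmlNode a in anchors)
        {
            // Resolve relative links against the page URL.
            string href = a.GetAttributeValue("href", "");
            Uri absolute;
            if (Uri.TryCreate(new Uri(url), href, out absolute))
                links.Add(absolute.AbsoluteUri);
        }
        return links;
    }
}
```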
For saving links to the database I have used SQL Server 2008. Create a database named 'Crawler' and run the create-table script from 'Pigo\Database\Create_Table.sql'.
I have put a check in the application so that it cannot crawl the same page twice.
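One way to implement such a duplicate check is to make the insert itself conditional, as sketched below. This is an assumption about the approach, not the project's actual code, and the 'crawl'/'Url' names mirror the assumed schema.

```csharp
// Sketch of a "don't crawl the same page twice" check, assuming a
// 'crawl' table with a 'Url' column; names are illustrative.
using System.Data.SqlClient;

class CrawlStore
{
    // Inserts the URL only if it is not already stored; returns true
    // when the link is new and should be queued for crawling.
    static bool TryAddLink(SqlConnection conn, string url)
    {
        const string sql =
            "IF NOT EXISTS (SELECT 1 FROM crawl WHERE Url = @url) " +
            "INSERT INTO crawl (Url) VALUES (@url)";
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@url", url);
            // > 0 only when the INSERT actually ran; duplicates change no rows.
            return cmd.ExecuteNonQuery() > 0;
        }
    }
}
```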
The GUI continuously displays the links that are queued up for crawling.
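The notes call for event handling; one plausible shape for the crawler-to-GUI link is an event raised per queued link, sketched below. All names here are invented for illustration and are not the project's actual classes.

```csharp
// Sketch of the event-handling idea: the crawler raises an event for each
// queued link and the UI subscribes to refresh its list. Names are illustrative.
using System;

class LinkEventArgs : EventArgs
{
    public string Url { get; private set; }
    public LinkEventArgs(string url) { Url = url; }
}

class Crawler
{
    // Raised whenever a new link is queued for crawling.
    public event EventHandler<LinkEventArgs> LinkQueued;

    public void QueueLink(string url)
    {
        // ... enqueue the link for the worker threads ...
        var handler = LinkQueued;           // copy for thread safety
        if (handler != null)
            handler(this, new LinkEventArgs(url));
    }
}

class Demo
{
    static void Main()
    {
        var crawler = new Crawler();
        // In the real GUI the handler would marshal to the UI thread
        // (e.g. Control.Invoke in WinForms) before updating the list.
        crawler.LinkQueued += (s, e) => Console.WriteLine("Queued: " + e.Url);
        crawler.QueueLink("http://example.com/");
    }
}
```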