Web Crawler Implementation

Requirements:
- Download a given web page from a website
- Using appropriate techniques, extract the hyperlinks found on the downloaded page
- Store the links in a database
- Fetch new links from the database and display them in a UI
- Continue to crawl the newly found links

Notes:
- Use multithreading and event handling where feasible
- The application must compile and run in Visual Studio 2010 or 2012 (the project must include the data store as well as all necessary libraries and resources)
- As a guideline, you should spend a maximum of 8 hours in total developing the application
The code contains the following modules:
- Downloader
  a. Downloads the web page and extracts the links from it using the HtmlAgilityPack DLL
- Crawl WebPage
  a. Holds the information about a crawled page
- Components
  a. Multithreaded component: manages the multiple threads that perform the crawling
  b. Queue component: feeds the links to the crawler threads
- Database
  a. One table called 'crawl'; you can find the create script at 'Pigo\Database\Create_Table.sql'
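The Components module above can be sketched as a producer/consumer pattern; this is an illustration, not the project's actual code, and the names (`LinkQueue`, `CrawlWorker`) are invented. `BlockingCollection` is available from .NET 4, so it works in Visual Studio 2010.

```csharp
// Sketch of the queue-plus-threads idea from the Components module.
// Class and member names are illustrative, not the actual project classes.
using System;
using System.Collections.Concurrent;
using System.Threading;

class CrawlerSketch
{
    // Thread-safe queue feeding links to the worker threads (.NET 4+).
    static readonly BlockingCollection<string> LinkQueue =
        new BlockingCollection<string>();

    static void Main()
    {
        // Start a fixed pool of crawler threads.
        for (int i = 0; i < 4; i++)
        {
            var t = new Thread(CrawlWorker) { IsBackground = true };
            t.Start();
        }

        LinkQueue.Add("http://example.com/");   // seed URL
        Thread.Sleep(1000);                     // let the workers run (demo only)
    }

    static void CrawlWorker()
    {
        // GetConsumingEnumerable blocks until a link is available.
        foreach (string url in LinkQueue.GetConsumingEnumerable())
        {
            Console.WriteLine("Crawling " + url);
            // download the page, extract links, enqueue the new ones...
        }
    }
}
```

In a real crawler the workers would add the extracted links back into the queue, and `CompleteAdding` would be called to shut the pool down cleanly.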
Note: Please change the app settings as below: update the ConnectionString according to your SQL Server setup.

Weaknesses:
- Validations are not properly handled in the code, i.e. validation of the web page content and crawling validations
- Stored procedures are not used, and the table is not indexed
- The HTML parser is not written from scratch

Improvement areas:
- Write an efficient HTML parser of our own
- Crawl high-rank web pages ahead of normal pages
- Write an algorithm for a re-visit crawling policy
- Write a reinforcement machine learning algorithm for focused crawling using some pre-training data
- Add URL caching techniques for web crawling
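The missing-index weakness could be addressed directly in the table script. The snippet below is illustrative only; the real schema lives in 'Pigo\Database\Create_Table.sql', and the column names here are assumptions.

```sql
-- Illustrative only: the actual schema is in Pigo\Database\Create_Table.sql.
-- Column names are assumed; adjust to the real script.
CREATE TABLE crawl (
    Id        INT IDENTITY(1,1) PRIMARY KEY,
    Url       NVARCHAR(450) NOT NULL,   -- 450 keeps the key under SQL Server's 900-byte index limit
    IsCrawled BIT NOT NULL DEFAULT 0
);

-- A unique index on Url both speeds up lookups and enforces the
-- "never crawl the same page twice" rule at the database level.
CREATE UNIQUE INDEX IX_crawl_Url ON crawl (Url);
```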
Crawling strategy:
The downloader is implemented using multithreading and queue techniques. To extract links from a given page I have used the 'HtmlAgilityPack' DLL ('Pigo\Library\HtmlAgilityPack.dll').
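Link extraction with HtmlAgilityPack typically looks like the sketch below; this is a minimal illustration, not the project's actual Downloader code, and error handling is omitted.

```csharp
// Sketch of link extraction with HtmlAgilityPack; error handling omitted.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class LinkExtractor
{
    static List<string> ExtractLinks(string url)
    {
        var links = new List<string>();
        var doc = new HtmlWeb().Load(url);   // download and parse the page

        // SelectNodes returns null when no matching <a href> nodes exist.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (HtmlNode a in anchors)
        {
            // Resolve relative links against the page URL.
            string href = a.GetAttributeValue("href", "");
            Uri absolute;
            if (Uri.TryCreate(new Uri(url), href, out absolute))
                links.Add(absolute.AbsoluteUri);
        }
        return links;
    }
}
```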
For saving links to the database I have used SQL Server 2008. Create a database named 'Crawler' and run the create-table script from 'Pigo\Database\Create_Table.sql'.
I have put a check in the application so that it cannot crawl the same page twice.
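One way to implement such a duplicate check is to make the insert itself conditional, as sketched below. This is an assumption about the approach, not the project's actual code, and the 'crawl'/'Url' names mirror the assumed schema.

```csharp
// Sketch of a "don't crawl the same page twice" check, assuming a
// 'crawl' table with a 'Url' column; names are illustrative.
using System.Data.SqlClient;

class CrawlStore
{
    // Inserts the URL only if it is not already stored; returns true
    // when the link is new and should be queued for crawling.
    static bool TryAddLink(SqlConnection conn, string url)
    {
        const string sql =
            "IF NOT EXISTS (SELECT 1 FROM crawl WHERE Url = @url) " +
            "INSERT INTO crawl (Url) VALUES (@url)";
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@url", url);
            // > 0 only when the INSERT actually ran; duplicates change no rows.
            return cmd.ExecuteNonQuery() > 0;
        }
    }
}
```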
The GUI continuously displays the links that are queued up for crawling.
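The notes call for event handling; one plausible shape for the crawler-to-GUI link is an event raised per queued link, sketched below. All names here are invented for illustration and are not the project's actual classes.

```csharp
// Sketch of the event-handling idea: the crawler raises an event for each
// queued link and the UI subscribes to refresh its list. Names are illustrative.
using System;

class LinkEventArgs : EventArgs
{
    public string Url { get; private set; }
    public LinkEventArgs(string url) { Url = url; }
}

class Crawler
{
    // Raised whenever a new link is queued for crawling.
    public event EventHandler<LinkEventArgs> LinkQueued;

    public void QueueLink(string url)
    {
        // ... enqueue the link for the worker threads ...
        var handler = LinkQueued;           // copy for thread safety
        if (handler != null)
            handler(this, new LinkEventArgs(url));
    }
}

class Demo
{
    static void Main()
    {
        var crawler = new Crawler();
        // In the real GUI the handler would marshal to the UI thread
        // (e.g. Control.Invoke in WinForms) before updating the list.
        crawler.LinkQueued += (s, e) => Console.WriteLine("Queued: " + e.Url);
        crawler.QueueLink("http://example.com/");
    }
}
```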