
Web-Crawler

Web Crawler Implementation

  1. Download a given web page from a website
  2. Using appropriate techniques, extract the hyperlinks found on the downloaded page
  3. Store the links in a database
  4. Fetch new links from the database and display them in a UI
  5. Continue to crawl the new links found

Notes:

  1. Use multithreading and event handling where feasible
  2. The application must compile and run in Visual Studio 2010 or 2012 (it must include the data store added to the project, as well as all necessary libraries and resources)
  3. As a guideline, you should spend a maximum of 8 hours in total developing the application

The code contains the following modules:

  1. Downloader: downloads the web page and extracts the links from it using the HtmlAgilityPack DLL
  2. Crawl WebPage: holds the information about a crawled page
  3. Components: a multithreaded component that manages the threads doing the crawling, and a queue component that feeds links to those threads for crawling
  4. Database: one table called ‘crawl’; the create script is at ‘Pigo\Database\Create_Table.sql’
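The split into a Downloader module and a queue-fed pool of worker threads can be sketched as follows. This is an illustrative Python sketch only (the project itself is C#/.NET), and all names here are hypothetical; `fetch` and `extract_links` stand in for the Downloader and the HtmlAgilityPack-based extraction.

```python
import queue
import threading

def make_crawler(worker_count, fetch, extract_links):
    """fetch(url) -> html, extract_links(html) -> list of urls."""
    frontier = queue.Queue()   # queue component: feeds links to the threads
    seen = set()               # so the same page is never crawled twice
    seen_lock = threading.Lock()

    def worker():
        while True:
            url = frontier.get()
            try:
                html = fetch(url)                 # Downloader step
                for link in extract_links(html):  # extraction step
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    frontier.put(link)            # new links re-enter the queue
            finally:
                frontier.task_done()

    for _ in range(worker_count):
        threading.Thread(target=worker, daemon=True).start()

    def crawl(start_url):
        seen.add(start_url)
        frontier.put(start_url)
        frontier.join()        # block until every queued link is processed
    return crawl
```

`Queue.join()` gives a simple termination condition: each link enqueued by a worker counts as outstanding work until its own `task_done()` runs.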

Note: please change the app settings as follows: update the ConnectionString according to your SQL Server setup.
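For reference, in a Visual Studio project of this vintage the connection string normally lives in app.config. A sketch with placeholder values (the entry name `CrawlerDb` and the server details are hypothetical; only the database name ‘Crawler’ comes from the setup described below):

```xml
<configuration>
  <connectionStrings>
    <!-- Placeholder values: point Data Source at your own SQL Server
         instance; 'Crawler' is the database created for this project. -->
    <add name="CrawlerDb"
         connectionString="Data Source=localhost;Initial Catalog=Crawler;Integrated Security=True"
         providerName="System.Data.SqlClient" />
  </connectionStrings>
</configuration>
```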

Weaknesses:

  1. Validations are not properly handled in the code, e.g. validation of web-page content and crawling validations
  2. Stored procedures are not used, and the table is not indexed
  3. The HTML parser is not written from scratch

Improvement Areas:

  1. Write our own efficient HTML parser
  2. Crawl high-ranking web pages ahead of normal pages
  3. Write an algorithm for a re-visit crawling policy
  4. Write a reinforcement machine-learning algorithm for focused crawling using some pre-training data
  5. Add URL caching techniques for web crawling
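The re-visit policy mentioned above could, for example, keep a min-heap of (next-due-time, url) and adapt each page's revisit interval to how often it changes. This is a hypothetical Python sketch, not part of the project; the halving/doubling rule and all parameter values are assumptions for illustration.

```python
import heapq

class RevisitScheduler:
    """Interval halves when a page changed, doubles when it did not,
    clamped to [min_iv, max_iv].  Pages that change often are re-crawled
    sooner."""

    def __init__(self, base=60.0, min_iv=15.0, max_iv=3600.0):
        self.base, self.min_iv, self.max_iv = base, min_iv, max_iv
        self.interval = {}   # url -> current revisit interval (seconds)
        self.heap = []       # (next_due_time, url), a min-heap

    def add(self, url, now):
        self.interval[url] = self.base
        heapq.heappush(self.heap, (now + self.base, url))

    def next_due(self):
        return self.heap[0] if self.heap else None

    def visited(self, url, now, changed):
        # Sketch assumes the popped entry is the page just visited.
        heapq.heappop(self.heap)
        iv = self.interval[url]
        iv = max(self.min_iv, iv / 2) if changed else min(self.max_iv, iv * 2)
        self.interval[url] = iv
        heapq.heappush(self.heap, (now + iv, url))
```

The crawler would pop the heap whenever the top entry's due time has passed, re-fetch that page, and call `visited` with whether its content changed.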
    

Crawling strategy:

The Downloader is implemented using multithreading and queue techniques. To extract the links from a given page, I have used the ‘HtmlAgilityPack’ DLL (‘Pigo\Library\HtmlAgilityPack.dll’).

For saving links to the database I have used SQL Server 2008. Create a database named ‘Crawler’ and run the create-table script from ‘Pigo\Database\Create_Table.sql’.

I have put a check in the application so that it cannot crawl the same page again.
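A duplicate-page check of this kind usually normalises each URL and tests membership in a set of already-seen URLs. A minimal hypothetical sketch (the normalisation here only drops the `#fragment`; the project's actual check may differ):

```python
from urllib.parse import urldefrag

seen = set()

def should_crawl(url):
    # "/p#a" and "/p#b" are the same page, so strip the fragment first.
    normalised, _fragment = urldefrag(url)
    if normalised in seen:
        return False
    seen.add(normalised)
    return True
```

In the project the same guarantee could equally come from the database side, e.g. a unique constraint on the URL column of the ‘crawl’ table.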

The GUI continuously displays the links that are queued up for crawling.

Contributors

piyushkp
