The following web crawler searches for all links under a target domain. It uses Scrapy, an application framework for writing web spiders that crawl the web.
Download and Install
- Navigate to the spider's directory:
  cd webcrawler/spider/
- Run the web crawler:
  python activate.py
- Make sure the list of links is empty:
  [4] Clear Links
- Set up the crawler settings:
  [1] Crawler Settings
- Run the crawler:
  [5] Run!
python activate.py
*************DOMAIN CRAWLER*************
[1] Crawler Settings
[2] View Links
[3] View HTTP Headers
[4] Clear Links
[5] Run!
[6] Exit
Enter Number: 4
*************DOMAIN CRAWLER*************
[1] Crawler Settings
[2] View Links
[3] View HTTP Headers
[4] Clear Links
[5] Run!
[6] Exit
Enter Number: 1
****************************************
Starting Url: https://twitter.com
Target Domain: twitter.com
Download Delay [Default: 0]: 0
*************DOMAIN CRAWLER*************
[1] Crawler Settings
[2] View Links
[3] View HTTP Headers
[4] Clear Links
[5] Run!
[6] Exit
Enter Number: 5
****************************************
//dev.twitter.com
http://m.twitter.com
https://about.twitter.com
https://apps.twitter.com
https://blog.twitter.com
........
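
The sample output mixes a scheme-relative link (`//dev.twitter.com`) with absolute ones. The following is a minimal sketch, using only the Python standard library, of how such links could be resolved against the starting URL and checked against the target domain. The function names `normalize_link` and `in_target_domain` are hypothetical and are not taken from the crawler's source, which may handle this differently (for example through Scrapy's own domain filtering).

```python
from urllib.parse import urljoin, urlparse


def normalize_link(base_url, href):
    """Resolve href against base_url so scheme-relative links become absolute."""
    return urljoin(base_url, href)


def in_target_domain(url, target_domain):
    """Return True if url's host is the target domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    target = target_domain.lower()
    # The leading dot prevents "nottwitter.com" from matching "twitter.com".
    return host == target or host.endswith("." + target)


if __name__ == "__main__":
    start = "https://twitter.com"  # starting URL from the session above
    for href in ["//dev.twitter.com", "http://m.twitter.com", "https://example.com"]:
        url = normalize_link(start, href)
        print(url, in_target_domain(url, "twitter.com"))
```

Matching on the parsed hostname rather than on a substring of the URL avoids accepting look-alike domains or URLs that merely mention the target domain in their path.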
- Highlight changes when running the crawler multiple times
- View HTTP headers
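
Highlighting changes between runs, the first item above, can be thought of as a set comparison between the links found in the previous run and the links found in the current one. The sketch below illustrates that idea only; `diff_links` is a hypothetical helper, and how the crawler actually stores or compares its links is not described here.

```python
def diff_links(previous, current):
    """Return (added, removed) links between two crawl runs, each sorted."""
    prev, curr = set(previous), set(current)
    added = sorted(curr - prev)    # links seen now but not in the earlier run
    removed = sorted(prev - curr)  # links seen earlier but missing now
    return added, removed


if __name__ == "__main__":
    # Illustrative data only, reusing URLs from the sample output.
    first_run = ["https://about.twitter.com", "https://blog.twitter.com"]
    second_run = ["https://blog.twitter.com", "https://apps.twitter.com"]
    added, removed = diff_links(first_run, second_run)
    for url in added:
        print("+", url)
    for url in removed:
        print("-", url)
```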