Java web crawler

Simple Java (1.6) crawler to crawl web pages on one and the same domain. If your page redirects to another domain, that page is not picked up, except if it is the first URL that is tested. Basically you can do this:

  • Crawl from a start point, defining the depth of the crawl and deciding to crawl only a specific path
  • Output all working URLs
  • Output the data to a CSV file, separated into working (200 response code) and non-working URLs
  • Output the data to two text files, one with working URLs and one with non-working ones, each URL on its own line
  • Output URLs whose HTML contains a keyword
  • Experimental support for verifying that assets on a page work

How to crawl

A simple crawl has the following options and will output the crawled URLs to system out. Note that only URLs returning 200 are output by default:

usage: CrawlToSystemOut [-l ] [-np ] [-p ] -u  [-v ] [-rh ]
 -l,--level              how deep the crawl should go, default is 1 [optional]
 -np,--notFollowPath     no URLs on this path will be crawled [optional]
 -p,--followPath         stay on this path when crawling [optional]
 -u,--url                the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify             verify that all links return 200, default is true [optional]
 -rh,--requestHeaders    the request headers in the form header1:value1@header2:value2 [optional]
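
For example, assuming the full jar built in the Examples section below and that CrawlToSystemOut lives in the com.soulgalore.crawler.run package like the other runner classes, a two-level crawl that also sends custom request headers (the header values here are only placeholders) could look like this:

java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToSystemOut -u http://mydomain.com/mypage -l 2 -rh "User-Agent:MyCrawler@Accept-Language:en"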

You can choose to output the crawled list to two plain text files, one with the working URLs and one with the non-working ones:

usage: CrawlToFile [-ef ] [-f ] [-l ] [-np ] [-p ] -u  [-v ] [-ve ] [-rh ]
 -ef,--errorfilename     the name of the error output file, default name is errorurls.txt [optional]
 -f,--filename           the name of the output file, default name is urls.txt [optional]
 -l,--level              how deep the crawl should go, default is 1 [optional]
 -np,--notFollowPath     no URLs on this path will be crawled [optional]
 -p,--followPath         stay on this path when crawling [optional]
 -u,--url                the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify             verify that all links return 200, default is true [optional]
 -ve,--verbose           verbose logging, default is false [optional]
 -rh,--requestHeaders    the request headers in the form header1:value1@header2:value2 [optional]

You can choose to output the result to a CSV file, separating the URLs into working and non-working:

usage: CrawlToCsv [-f ] [-l ] [-np ] [-p ] -u  [-v ] [-rh ]
 -f,--filename           the name of the CSV output file, default name is result.csv [optional]
 -l,--level              how deep the crawl should go, default is 1 [optional]
 -np,--notFollowPath     no URLs on this path will be crawled [optional]
 -p,--followPath         stay on this path when crawling [optional]
 -u,--url                the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify             verify that all links return 200, default is true [optional]
 -rh,--requestHeaders    the request headers in the form header1:value1@header2:value2 [optional]

Crawl and output URLs whose HTML contains a specific keyword:

usage: CrawlToPlainTxtOnlyMatching -k  [-l ] [-np ] [-p ] -u  [-v ] [-rh ]
 -k,--keyword            the keyword to search for in the page [required]
 -l,--level              how deep the crawl should go, default is 1 [optional]
 -np,--notFollowPath     no URLs on this path will be crawled [optional]
 -p,--followPath         stay on this path when crawling [optional]
 -u,--url                the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify             verify that all links return 200, default is true [optional]
 -rh,--requestHeaders    the request headers in the form header1:value1@header2:value2 [optional]
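
For example, assuming the runner lives in the com.soulgalore.crawler.run package like the classes used in the Examples section below, a two-level crawl that only keeps pages whose HTML contains the (here arbitrary) keyword "analytics" could look like this:

java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToPlainTxtOnlyMatching -u http://mydomain.com/mypage -l 2 -k analytics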

Configuration

There are also configuration properties that you can either set in the crawler.properties file or override by passing them as system properties. By default they are configured as follows:

## Override these properties by setting a system property
com.soulgalore.crawler.nrofhttpthreads=5
com.soulgalore.crawler.threadsinworkingpool=5
com.soulgalore.crawler.http.socket.timeout=5000
com.soulgalore.crawler.http.connection.timeout=5000
# Auth like:
# soulislove.com:80:username:password,...
com.soulgalore.crawler.auth=
# Proxy properties, if you are behind a proxy.
## The host in this special format: http:proxy.soulgalore.com:80
com.soulgalore.crawler.proxy=

The location of the crawler.properties file can be set with the system property com.soulgalore.crawler.propertydir.
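
For example, to raise the number of HTTP threads and the socket timeout for a single run (the values here are only illustrative), pass the properties as system properties on the command line:

java -Dcom.soulgalore.crawler.nrofhttpthreads=10 -Dcom.soulgalore.crawler.http.socket.timeout=10000 -jar crawler-1.5.11-full.jar -u http://mydomain.com/mypage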

Examples

Check out the project and compile your own full jar (all dependencies included):

git clone git@github.com:soulgalore/crawler.git
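
The compile step is a plain Maven build; assuming the project's pom is configured to assemble the full jar, something like this should produce it from the checked-out directory:

cd crawler
mvn package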

or add it to Maven, if you want to include the crawler in your project:

<dependency>
 <groupId>com.soulgalore</groupId>
 <artifactId>crawler</artifactId>
 <version>1.5.11</version>
</dependency>

Running from the jar, crawling two levels deep and only fetching URLs that contain "/tagg/":

java -jar crawler-1.5.11-full.jar -u http://soulislove.com -l 2 -p /tagg/

Running from the jar, adding basic auth:

java -jar -Dcom.soulgalore.crawler.auth=soulgalore.com:80:peter:secret crawler-1.5.11-full.jar -u http://soulislove.com

Running from the jar, outputting URLs to a CSV file:

java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToCsv -u http://soulislove.com

Running from the jar, outputting URLs into two text files, workingurls.txt and nonworkingurls.txt:

java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToFile -u http://soulislove.com -f workingurls.txt -ef nonworkingurls.txt

Running from the jar, verifying that assets are OK:

java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlAndVerifyAssets -u http://www.peterhedenskog.com

License

Copyright 2014 Peter Hedenskog

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
