Arachnid

       .....    ..................  ..........................                ..
       ..        ...............  ..  .....................                   ..
                             ...  ..    ....................                ....
                     .            .I.      ................                 ....
                                  .I...       ...........                  .....
    .                     .    .....=....      .........                 .......
                 ..... ..............:I...    . ........               .........
               ...I....................=$.      ........            ............
                 ..+ ..................II..      ...               .............
                ....=..................,+...     ...               .............
                  ...7=.................?...     ....             ..............
                   ....=+...............+...   .....       ............    .....
                   ......?+.............?...    ..     .................... ....
                   .......,??...........$...         ......Z7$OZ,......=,.  ....
                    ........??.........+I..       ......7?7.......+$$Z...  .....
       .               ......ZI7.......$Z...........~7I?=................       
      ..                 .....=7+.....+$=..........=O,...............           
                           ....I$....,IO..........?$.............               
.                           ...=II...I$I..?77O,..I7+..... .                     
                   . .......  ..IO=..ZO=.I$Z8?$.?7I......                       
              ...............  .+ZZ..8DZ$7?+O$$~$I,......                   .   
           ........:=7?ZO+8$Z....?Z7.D88=8DZ++IO7,................              
         ....,?+7?=:.......+IO7Z.++$7$8ODD8Z7ODI~................               
         ..=$=. .............=ZIO7~$ZIODN?87OZI...IIZ+87Z,........             .
       ..=O....................,?I$$77ODD8Z+O7?O$II?....,O?I........          ..
     ..,Z. .     .........OO8$OZZO=IO+7$O7+7Z$?+~..........,I7,.....            
  ....O...     ........O=$D=~..7$:~?=?OD7Z???=.....  .........$I....            
  =Z....       .....~7Z,......:=:.IO=+$I7I+OI$77...    .........Z=....          
              .....I7?........,,....~~7+..+7?+?DO...     .........O....         
            .....+ZI................ ~7:...7+...OZ...     .........I7....       
          .....$77....................:?=..II....IZ...     ...........+$ZZ?.    
         ....+$7............  .............??....?$....      ............  .    
         ...=:........  ...             ...I.....=?.....       ..........       
         ..~?....  .                    ...+......I.....        ...........     
        ..?$..                     .    ..~?......?~.....         .........     
     ....II .                          .. OI..... ZI........       .........    
     ...7..                            ...O,......+I.........       .........   
  . ..=O..                             ...7........$..........       ......     
   .,Z...                              ...Z........7............         .      
  ..,   .                              ...7........8...............             
  ...                                  ...$........O?...............            
   .                                   . =,...  ...~+...............            
                                       ...  .    ...7.................          
                                  .               ..7...................        
               ...                                 .I.......................    
                                                   ..7.  ...................    
                                                        ...................    .
                                                       ... . .............    ..

Overview

Arachnid was built as an alternative to Anemone, a great and powerful Ruby spidering library that unfortunately succumbs to some pretty serious memory bloat on big sites with a ton of pages. Arachnid relies on Bloom filters to store the list of visited URLs, so it stays memory-efficient even across hundreds of thousands of URLs, and the requests are handled by Typhoeus, which is much more lightweight than a threaded Mechanize solution.
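To give a feel for why a Bloom filter keeps memory flat, here is a minimal sketch of one used as a "visited URLs" set. It is not Arachnid's actual implementation: the class name `TinyBloomFilter`, the bit count, the number of hashes, and the use of salted MD5 digests are all assumptions made for illustration.

```ruby
require 'digest'

# Illustrative Bloom filter (hypothetical class, not part of Arachnid).
class TinyBloomFilter
  # size:   number of bits in the filter (assumed value)
  # hashes: number of bit positions set per URL (assumed value)
  def initialize(size = 1_000_000, hashes = 4)
    @size = size
    @hashes = hashes
    @bits = "\0" * ((size + 7) / 8) # one byte holds eight bits
  end

  # Record a URL as visited by setting its bit positions.
  def add(url)
    positions(url).each do |i|
      @bits.setbyte(i / 8, @bits.getbyte(i / 8) | (1 << (i % 8)))
    end
    self
  end

  # false means the URL was definitely never added;
  # true means it was probably added (false positives are possible).
  def include?(url)
    positions(url).all? do |i|
      (@bits.getbyte(i / 8) & (1 << (i % 8))) != 0
    end
  end

  private

  # Derive one bit position per hash by salting an MD5 digest with a seed.
  def positions(url)
    (0...@hashes).map do |seed|
      Digest::MD5.hexdigest("#{seed}:#{url}").to_i(16) % @size
    end
  end
end

visited = TinyBloomFilter.new
visited.add("http://domain.com/")
puts visited.include?("http://domain.com/") # true
```

The filter's memory use is fixed by its bit count no matter how many URLs are added. The trade-off is that a crawler occasionally treats an unseen URL as already visited.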

Additionally, Arachnid can be threaded with a gem such as Threadify so you can crawl multiple domains with multiple threads each. ...you can thread while you thread.
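As a rough sketch of the "one thread per domain" idea, the helper below uses plain Ruby threads in place of Threadify. `crawl_domains` is a hypothetical helper, not part of Arachnid, and the block passed to it stands in for the real crawl.

```ruby
# Hypothetical helper: runs the given block once per domain, each in its own
# thread, then waits for all of them and returns the block results in order.
def crawl_domains(domains, &crawl)
  threads = domains.map do |domain|
    Thread.new(domain) { |d| crawl.call(d) }
  end
  threads.map(&:value) # Thread#value joins the thread and returns its result
end

# Stand-in block so the sketch runs without the gem or network access.
results = crawl_domains(["http://domain-one.com", "http://domain-two.com"]) do |domain|
  "finished #{domain}"
end
puts results

# With Arachnid installed, the block body would instead look something like:
#   Arachnid.new(domain, {:exclude_urls_with_images => true}).crawl({:threads => 2, :max_urls => 1000}) do |response|
#     puts response.effective_url
#   end
```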

TL;DR: Give Arachnid a url and it will crawl every single page that it can find on that domain.

###Requirements

Arachnid was built to run on Ruby 1.9.2. I'll be honest, I haven't really tested it on any other platforms, and probably won't in the near future. If you want to make it compatible with 1.8.7, feel free to fork it and go for it. Otherwise, I'd recommend using Anemone instead, as Arachnid was built by a lazy developer :)

###Installation

    gem install arachnid

###Usage

    require 'arachnid'
    require 'nokogiri'

    Arachnid.new("http://domain.com", {:exclude_urls_with_images => true}).crawl({:threads => 2, :max_urls => 1000}) do |response|

      # "response" is just a Typhoeus response object.
      puts response.effective_url

      # You can retrieve the body of the page with response.body.
      parsed_body = Nokogiri::HTML.parse(response.body)

    end

###Options for Arachnid.new

:split_url_at_hash => true/false - If set to true, Arachnid splits each newly discovered url at the # and stores only the portion before it. This lets you crawl one level deep on pages that use # marks (such as a comments page) without crawling every new url that contains a # (such as specific comment permalinks). :exclude_urls_with_hash must be set to false for this option to work. Defaults to false.

:exclude_urls_with_hash => true/false - Spider will ignore any url that contains a hash (#). Set to true if crawling blogs or other pages that have a lot of # in permalinks. Defaults to false.

:exclude_urls_with_images => true/false - Spider will ignore any url with a common image file extension. Defaults to false.

:proxy_list => Array - Spider will choose one proxy at random for each request. Each entry is a string in the format "ip:port:user:pass" or "ip:port".
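The options above can be pictured with a small sketch. This is not Arachnid's source: `filter_link`, `parse_proxy`, `pick_proxy`, and the `IMAGE_EXTENSIONS` list are hypothetical names and values, written only to mirror the behaviour described in this section.

```ruby
# Assumed list of "common image file extensions"; the real list may differ.
IMAGE_EXTENSIONS = %w[.jpg .jpeg .png .gif .bmp].freeze

# Hypothetical helper: returns the url to queue, or nil if it should be skipped.
def filter_link(url, options = {})
  # :exclude_urls_with_hash drops the url outright, so it wins over splitting.
  return nil if options[:exclude_urls_with_hash] && url.include?('#')

  # :split_url_at_hash keeps only the portion before the first #.
  url = url.split('#', 2).first if options[:split_url_at_hash]
  return nil if url.empty?

  # :exclude_urls_with_images checks the extension, ignoring any query string.
  path = url.split('?', 2).first.downcase
  if options[:exclude_urls_with_images] && IMAGE_EXTENSIONS.any? { |ext| path.end_with?(ext) }
    return nil
  end

  url
end

# Hypothetical helper: parses "ip:port" or "ip:port:user:pass" into a hash.
def parse_proxy(entry)
  ip, port, user, pass = entry.split(':', 4)
  proxy = { :host => ip, :port => Integer(port) }
  proxy[:username] = user if user
  proxy[:password] = pass if pass
  proxy
end

# Chooses one proxy at random, as :proxy_list is described as doing per request.
def pick_proxy(proxy_list)
  parse_proxy(proxy_list.sample)
end

p filter_link("http://domain.com/post#comment-5", :split_url_at_hash => true)
p filter_link("http://domain.com/post#comment-5", :exclude_urls_with_hash => true)
p parse_proxy("10.0.0.1:8080:user:pass")
```

The first call prints the url without its fragment, the second prints nil, and the third prints a hash with the host, port, username and password.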

###Options for .crawl

:threads => (num_threads) - Number of Typhoeus Hydra threads to use when crawling a domain. Out of respect for sites being crawled, keep this number under 10 threads. Defaults to 1.

:max_urls => (num_urls) - Total number of pages to crawl on any domain. Use this when crawling large sites or sites with a ton of tag and category pages, as they'll often have tens of thousands of pages with duplicate content and the crawler will run for way too long. Defaults to unlimited urls.
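To illustrate how a cap like :max_urls bounds a crawl, here is a minimal single-threaded sketch over an in-memory link graph. `bounded_crawl`, `SITE`, and `links_for` are hypothetical stand-ins; Arachnid itself fetches pages through Typhoeus and is described above as tracking visited urls with a Bloom filter rather than a Set.

```ruby
require 'set'

# Hypothetical helper: breadth-first crawl that stops after max_urls pages.
# links_for is any callable returning the urls linked from a page; it stands
# in for fetching and parsing a real page. A nil max_urls means unlimited.
def bounded_crawl(start_url, max_urls, links_for)
  visited = Set.new
  queue = [start_url]
  crawled = []

  until queue.empty?
    break if max_urls && crawled.size >= max_urls

    url = queue.shift
    next if visited.include?(url)

    visited << url
    crawled << url
    links_for.call(url).each { |link| queue << link unless visited.include?(link) }
  end

  crawled
end

# Tiny made-up site used in place of real HTTP requests.
SITE = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => ["/"],
  "/c" => []
}.freeze
links_for = lambda { |url| SITE.fetch(url, []) }

p bounded_crawl("/", 2, links_for)   # ["/", "/a"]
p bounded_crawl("/", nil, links_for) # ["/", "/a", "/b", "/c"]
```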

