SiteWalker.js

A simple web crawler. After each page is crawled, your callback decides which page to crawl next.

How to install

$ npm install site-walker

Usage

var SiteWalker = require("site-walker")
var instance = new SiteWalker("http://someawesome.site.com", function(pageStr){
    // the callback fires each time a page is crawled successfully
    // pageStr contains the crawled page as a string
    // scrape whatever you need from pageStr here
    var nextUrl = "http://someawesome.site.com/page/2" // assume this URL was scraped from the current pageStr
    this.next(nextUrl)
})
instance
.then(function(){
    // fired when the callback supplies no nextUrl
})
.catch(function(reason){
    // fired when a page could not be retrieved
})
instance.crawl() // start crawling
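
The example above hard-codes nextUrl. As a rough sketch of how it might be scraped from pageStr instead, the helper below looks for the first link marked rel="next" and resolves it against the page's base URL. findNextUrl is a hypothetical helper, not part of site-walker, and the rel="next" markup is an assumption about the target site; a real HTML parser would be more robust than these regular expressions.

```javascript
// Hypothetical helper, not part of site-walker.
// Returns the absolute URL of the first <a ... rel="next" ...> link, or null.
function findNextUrl(pageStr, baseUrl) {
  // Find the first anchor tag that carries rel="next" (assumed markup).
  const anchor = /<a\b[^>]*\brel=["']next["'][^>]*>/i.exec(pageStr);
  if (!anchor) return null;
  // Pull the href attribute out of that tag.
  const href = /\bhref=["']([^"']+)["']/i.exec(anchor[0]);
  if (!href) return null;
  // Resolve relative links such as "/page/2" against the base URL.
  return new URL(href[1], baseUrl).href;
}

const samplePage = '<a rel="next" href="/page/2">Next</a>';
console.log(findNextUrl(samplePage, "http://someawesome.site.com"));
// prints "http://someawesome.site.com/page/2"
```

Inside the callback, this.next(nextUrl) would then be called only when findNextUrl returns a non-null value, so the walk ends on the last page.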

You can call this.next(nextUrl) several times during a single callback. If you do, the URLs are crawled in the order they were supplied: the first nextUrl first, then the second, and so on. For example:

    //supplied callback
    function(pageStr){
        // scrape the page here
        this.next(url1);
        this.next(url2);
        if(someConditionIsMet){
            this.next(url3)
        }
    }

The pages will be crawled in this order:

url1 -> url2 -> url1 -> url2

If someConditionIsMet evaluates to true during the callback, the pages will be crawled in this order:

url1 -> url2 -> url3 -> url1 -> url2
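
The ordering described above is what a first-in, first-out queue would produce. The sketch below models that behaviour without any network access. It is an illustration only and not site-walker's actual implementation: simulateWalk, the fake page strings, the "start" URL, and the maxPages cut-off are all assumptions made for the example, and the url3 condition is assumed to hold only on the first page.

```javascript
// Hypothetical model of the queueing described above; NOT site-walker's code.
// "Pages" are fake strings, so nothing is fetched from the network.
function simulateWalk(startUrl, callback, maxPages) {
  const queue = [startUrl]; // URLs waiting to be crawled, first in first out
  const visited = [];       // the order in which URLs were crawled
  const context = {
    next(url) {
      queue.push(url);      // each next() call appends to the back of the queue
    },
  };
  // maxPages stops the simulation, since the callback keeps supplying URLs.
  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift();
    visited.push(url);
    callback.call(context, "fake page for " + url);
  }
  return visited;
}

// Every page supplies url1 and url2.
const order = simulateWalk("start", function (pageStr) {
  this.next("url1");
  this.next("url2");
}, 5);
console.log(order.join(" -> "));
// prints "start -> url1 -> url2 -> url1 -> url2"

// The condition is true only on the first page, which also supplies url3.
const orderWithExtra = simulateWalk("start", function (pageStr) {
  this.next("url1");
  this.next("url2");
  if (pageStr.indexOf("start") !== -1) {
    this.next("url3");
  }
}, 6);
console.log(orderWithExtra.join(" -> "));
// prints "start -> url1 -> url2 -> url3 -> url1 -> url2"
```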

Notes

  • Currently, if any URL fails to be crawled, SiteWalker stops the whole run and rejects, so the catch handler is fired with the reason
  • There is no stop() method. If your callback supplies a nextUrl on every page, SiteWalker will (in theory) run forever
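
Because there is no stop() method, one way to bound a crawl is to stop supplying URLs after a fixed number of pages. The wrapper below is a minimal sketch of that idea. withPageLimit is a hypothetical helper, not part of site-walker, and it assumes the callback only needs this.next from its this context. The fakeWalker object is a stand-in used to show the effect without crawling anything.

```javascript
// Hypothetical helper, not part of site-walker: wraps a crawl callback so it
// stops queueing new URLs once maxPages pages have been processed.
function withPageLimit(maxPages, callback) {
  let pagesSeen = 0;
  return function (pageStr) {
    pagesSeen += 1;
    const walker = this; // whatever object the crawler binds as `this`
    const limited = {
      next(url) {
        // Forward to the real next() only while under the page budget.
        if (pagesSeen < maxPages) {
          walker.next(url);
        }
      },
    };
    return callback.call(limited, pageStr);
  };
}

// Stand-in for the crawler's `this`, so the example runs without a network.
const queued = [];
const fakeWalker = {
  next(url) {
    queued.push(url);
  },
};
const limitedCallback = withPageLimit(2, function (pageStr) {
  this.next("http://someawesome.site.com/page/next");
});
limitedCallback.call(fakeWalker, "page 1"); // under budget: URL is queued
limitedCallback.call(fakeWalker, "page 2"); // budget reached: nothing is queued
console.log(queued.length); // prints 1
```

Under these assumptions, passing withPageLimit(50, yourCallback) as the second argument to new SiteWalker(...) would leave the queue empty after 50 pages, at which point the then handler should fire.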

GitHub

https://github.com/aerios/site-walker
