
spider's Introduction

Spider -- Programmable spidering of web sites with node.js and jQuery

Install

From source:

  git clone git://github.com/mikeal/spider.git
  cd spider
  npm link

API

Creating a Spider

  var spider = require('spider');
  var s = spider();

spider(options)

The options object can have the following fields:

  • maxSockets - Integer giving the maximum number of sockets in the pool. Defaults to 4.
  • userAgent - The User-Agent string sent to the remote server with each request. Defaults to Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7 (a Chrome user-agent string).
  • cache - The cache object to use. Defaults to NoCache; see the code for the details a new cache object must implement.
  • pool - A hash object containing the agents for the requests. If omitted, requests use the global pool, which is capped at maxSockets.

Adding a Route Handler

spider.route(hosts, pattern, cb)

Where the params are:

  • hosts - A string, or an array of strings, giving the host part of the targeted URL(s).
  • pattern - The pattern spider matches against the remainder (pathname + search + hash) of the URL(s).
  • cb - A function of the form function(window, $), where:
    • this - References the object returned by Routes.match, with some extras added by spider. For more info see https://github.com/aaronblohowiak/routes.js
    • window - References the document's window.
    • $ - References the jQuery object.

Queuing a URL for spider to fetch

spider.get(url), where url is the URL to fetch.
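
Putting the pieces above together, a minimal end-to-end sketch (the host, pattern, and selector are illustrative, not a tested recipe):

```javascript
// Minimal end-to-end sketch: create a spider, register a route, queue a URL.
// example.com, the '/articles/*' pattern, and the selector are illustrative.
var spider = require('spider');
var s = spider({ maxSockets: 2 });

s.route('example.com', '/articles/*', function (window, $) {
  // `this` carries the Routes.match result; window and $ come from the
  // fetched document.
  console.log($('h1', window.document).text());
});

s.get('http://example.com/articles/1');
```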

Extending / Replacing the MemoryCache

Currently a MemoryCache replacement must provide the following methods:

  • get(url, cb) - Passes url's body field to the cb callback/continuation if it exists, null otherwise.
    • cb - Must be of the form function(retval) {...}
  • getHeaders(url, cb) - Passes url's headers field to the cb callback/continuation if it exists, null otherwise.
    • cb - Must be of the form function(retval) {...}
  • set(url, headers, body) - Saves url's headers and body in the cache.
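
One way to read that contract, as a self-contained sketch (the constructor name and storage layout are mine; the docs are ambiguous about whether null is returned directly or delivered via the callback, so this version passes it through cb):

```javascript
// Minimal in-memory cache sketch implementing the three documented methods.
// Misses are delivered as null through the callback.
function MemoryCache() {
  this.store = {};
}
MemoryCache.prototype.get = function (url, cb) {
  var entry = this.store[url];
  cb(entry ? entry.body : null);
};
MemoryCache.prototype.getHeaders = function (url, cb) {
  var entry = this.store[url];
  cb(entry ? entry.headers : null);
};
MemoryCache.prototype.set = function (url, headers, body) {
  this.store[url] = { headers: headers, body: body };
};
```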

Setting the verbose/log level

spider.log(level) - Where level is one of the strings "debug", "info", or "error".

spider's People

Contributors

chmac, gtzilla, lucasfcosta, mikeal, twleung, vermiculite, vieiralucas


spider's Issues

Possible to spider AJAX-loaded content?

Hey mikeal, I've been playing with the spider script today and loving the progress. Just wondered: is it possible (or planned) to spider AJAX-loaded data, or to click certain links on a page that trigger an internal state change (for example, a site using sammy.js) and crawl or otherwise poke around the DOM after the new content has loaded?

Even if it's a case of click ... wait 5 seconds ... try the DOM for changes ... get updated data, or else try again in 5 seconds - it would be really cool!

Cheers,

Cannot read property 'parsingMode' of null

Hello,

Seems like spider is failing. I'm using the NYTimes example, and with node 0.12.0 it gives this error:

/AmazingDirectory/node_modules/spider/node_modules/jsdom/lib/jsdom.js:41
  if (options.parsingMode === undefined || options.parsingMode === 'auto') {
             ^
TypeError: Cannot read property 'parsingMode' of null
    at Object.exports.jsdom (/AmazingDirectory/node_modules/spider/node_modules/jsdom/lib/jsdom.js:41:14)
    at Spider._handler (/AmazingDirectory/node_modules/spider/main.js:175:26)
    at Request._callback (/AmazingDirectory/node_modules/spider/main.js:148:12)
    at Request.self.callback (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:344:22)
    at Request.emit (events.js:110:17)
    at Request.<anonymous> (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:1239:14)
    at Request.emit (events.js:129:20)
    at IncomingMessage.<anonymous> (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:1187:12)
    at IncomingMessage.emit (events.js:129:20)
    at _stream_readable.js:908:16

And with node 0.10.36 it throws this:

test1

TypeError: Cannot read property 'parsingMode' of null
    at Object.exports.jsdom (/AmazingDirectory/node_modules/spider/node_modules/jsdom/lib/jsdom.js:41:14)
    at Spider._handler (/AmazingDirectory/node_modules/spider/main.js:175:26)
    at Request._callback (/AmazingDirectory/node_modules/spider/main.js:148:12)
    at Request.self.callback (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:344:22)
    at Request.emit (events.js:98:17)
    at Request.<anonymous> (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:1239:14)
    at Request.emit (events.js:117:20)
    at IncomingMessage.<anonymous> (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:1187:12)
    at IncomingMessage.emit (events.js:117:20)
    at _stream_readable.js:944:16

process out of memory

I am creating a spider to walk the iTunes App Store.

It hits about 80 or so pages, then I get the following message:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

You can find the spider script, and the console log here:
https://gist.github.com/934787

TypeError: Object [ null ] has no method 'createWindow'

When running the memory leak test, I get this error:

TypeError: Object [ null ] has no method 'createWindow'
    at Spider._handler (/var/www/sitemap/spider/main.js:179:27)
    at Request._callback (/var/www/sitemap/spider/main.js:152:12)
    at Request.self.callback (/var/www/sitemap/node_modules/request/request.js:123:22)
    at Request.EventEmitter.emit (events.js:98:17)
    at Request.<anonymous> (/var/www/sitemap/node_modules/request/request.js:1047:14)
    at Request.EventEmitter.emit (events.js:117:20)
    at IncomingMessage.<anonymous> (/var/www/sitemap/node_modules/request/request.js:998:12)
    at IncomingMessage.EventEmitter.emit (events.js:117:20)
    at _stream_readable.js:920:16
    at process._tickCallback (node.js:415:13)

passing the url to route callbacks

Often as I spider, I create directories based on the URL. Having the URL handy in the callback means I don't have to do any nasty backflips. I've hacked spider as follows:

if (jsdom.defaultDocumentFeatures.ProcessExternalResources) {
  $(function () { r.fn.call(r, window, window.$, url); })
} else {
  r.fn.call(r, window, window.$, url);
}

  1. Is there a cleaner way to do this?
  2. Would you like a patch?

Cheers,
GF

TypeError: Parameter 'url' must be a string, not undefined

It seems that spider crashes when it finds bad (or empty?) URLs. Running the memory leak test:

TypeError: Parameter 'url' must be a string, not undefined
    at Url.parse (url.js:107:11)
    at urlParse (url.js:101:5)
    at Url.resolve (url.js:409:29)
    at urlResolve (url.js:405:40)
    at null.<anonymous> (/var/www/sitemap/spider/main.js:186:15)
    at Function.jQuery.extend.each (/var/www/sitemap/spider/jquery.js:641:20)
    at jQuery.fn.jQuery.each (/var/www/sitemap/spider/jquery.js:265:17)
    at window.$.fn.spider (/var/www/sitemap/spider/main.js:183:12)
    at Object.spider.route.route.article.title [as fn] (/var/www/sitemap/go.js:5:10)
    at Spider._handler (/var/www/sitemap/spider/main.js:196:12)

Segmentation Fault

Hey hey,

I tried running the tests for spider, and I consistently get behavior like the following:

josh@pidgey:~/spider$ node tests/test_nytimes.js 
Segmentation fault

Frustratingly, that's all the information I get.

Any ideas?

Documentation is a bit abstract ....

Can you please show a full example of how you use these libraries on a specific site? It's not really clear what should be called in which order; a tangible use case would really help me figure out how to use this library.

i.e., I have a website at foo.com and want to make sure I don't have broken internal links. What does that look like in Spider?
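
A hedged sketch of that broken-link use case, assuming only the route(hosts, pattern, cb) and get(url) API documented above (untested; the selector and logging are illustrative, and the docs do not say whether relative hrefs are resolved, so absolute URLs are safest):

```javascript
// Untested sketch: crawl foo.com, log every internal link found, and queue
// it so spider fetches it in turn. A dead link would surface as a failed
// fetch in spider's logs.
var spider = require('spider');
var s = spider();

s.route('foo.com', '*', function (window, $) {
  $('a[href^="http://foo.com"]', window.document).each(function () {
    var href = $(this).attr('href');
    console.log('queuing', href);
    s.get(href);
  });
});

s.get('http://foo.com/');
```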

Broken on node 0.6.12

[02:03:25] [twl@SE-TEDLEUNG-MB1] ~/node_modules/spider/tests
[5830]> node test_texas.js
The "sys" module is now called "util". It should have a similar interface.

node.js:201
        throw e; // process.nextTick error, or 'error' event on first tick
        ^
TypeError: Object [ null ] has no method 'trigger'
    at /Users/twl/node_modules/spider/main.js:60:16
    at Spider._handler (/Users/twl/node_modules/spider/main.js:189:5)
    at Request._callback (/Users/twl/node_modules/spider/main.js:164:12)
    at Request.callback (/Users/twl/node_modules/request/main.js:119:22)
    at Request.<anonymous> (/Users/twl/node_modules/request/main.js:525:16)
    at Request.emit (events.js:67:17)
    at IncomingMessage.<anonymous> (/Users/twl/node_modules/request/main.js:484:14)
    at IncomingMessage.emit (events.js:88:20)
    at CleartextStream.<anonymous> (http.js:1204:17)
    at CleartextStream.emit (events.js:88:20)
    at Array.1 (tls.js:792:22)
    at EventEmitter._tickCallback (node.js:192:40)
[1] 15007 exit 1 node test_texas.js

Form Filling

Is there a way to use spider to fill forms, for example, on a site that requires a log in?

Action required: Greenkeeper could not be activated 🚨

🚨 You need to enable Continuous Integration on all branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because we are using your CI build statuses to figure out when to notify you about breaking changes.

Since we did not receive a CI status on the greenkeeper/initial branch, we assume that you still need to configure it.

If you have already set up a CI for this repository, you might need to check your configuration. Make sure it will run on all new branches. If you don’t want it to run on every branch, you can whitelist branches starting with greenkeeper/.

We recommend using Travis CI, but Greenkeeper will work with every other CI service as well.

Once you have installed CI on this repository, you’ll need to re-trigger Greenkeeper’s initial Pull Request. To do this, please delete the greenkeeper/initial branch in this repository, and then remove and re-add this repository to the Greenkeeper integration’s whitelist on GitHub. You'll find this list on your repo or organization’s settings page, under Installed GitHub Apps.

you should update the dependencies of cookiejar to 1.3.0

The old version of cookiejar (1.2.0) has a bug in getCookie that crashes when the target web source has an empty string in a cookie. You should update it to 1.3.0 (tested in my own project: 1.2.0 crashed, but 1.3.0 works).

The bug is due to lines 71/72 of cookiejar.js:

pair = parts[i].match(/([^=]+)=((?:.|\n)*)/);

which 1.3.0 changes to the following, allowing the value after "=" to be empty:

pair = parts[i].match(/([^=]+)(?:=((?:.|\n)*))?/);
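
To illustrate the difference between the two patterns on a cookie fragment with no value (the variable names are mine):

```javascript
// The old cookiejar pattern requires "=value", so a bare cookie name
// yields no match at all (null), and the code that indexes into the
// match result then crashes. The new pattern makes "=value" optional.
var oldPattern = /([^=]+)=((?:.|\n)*)/;
var newPattern = /([^=]+)(?:=((?:.|\n)*))?/;

var part = 'sessionid';                 // cookie fragment with no "="
var oldMatch = part.match(oldPattern);  // null
var newMatch = part.match(newPattern);  // name captured, value undefined
```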

Trouble loading jQuery into Google

Here's the code I'm using to test:

var spider = require('spider');
var s = spider();
s.log('debug');
s.route('www.google.com', '/search*', function (window, $) {
  console.log($(window.document.body).text());
});
s.get('http://www.google.com/search?q=carter');

The error seems to occur in the jqueryify function: the first line to get context fails. I'm not really sure how any of this works, but it dies on Script.runInContext(jquery, ctx, filename);

Then the document object is null and it tries to call trigger(), which doesn't exist.

this is the error:
document.trigger(
        ^
TypeError: Object [ null ] has no method 'trigger'
    at /home/ccole/node/spyder/node_modules/spider/main.js:60:16
    at Spider._handler (/home/ccole/node/spyder/node_modules/spider/main.js:189:5)
    at [object Object].callback (/home/ccole/node/spyder/node_modules/spider/main.js:164:12)
    at [object Object].<anonymous> (/home/ccole/node/spyder/node_modules/request/main.js:273:21)
    at [object Object].emit (events.js:64:17)
    at IncomingMessage.<anonymous> (/home/ccole/node/spyder/node_modules/request/main.js:261:54)
    at IncomingMessage.emit (events.js:81:20)
    at HTTPParser.onMessageComplete (http.js:133:23)
    at Socket.onend (http.js:1265:12)
    at Socket._onReadable (net.js:659:26)
    at IOWatcher.onReadable [as callback]

Exception thrown when site is unavailable

When I try to crawl a site that is down, Spider throws an exception:

TypeError: Cannot read property 'statusCode' of undefined
    at Request._callback (spider/main.js:131:15)
    at spider/node_modules/request/main.js:119:22
    at Request.emit (events.js:87:17)
    at ClientRequest.<anonymous> (spider/node_modules/request/main.js:207:10)
    at ClientRequest.emit (events.js:87:17)
    at Socket.<anonymous> (http.js:1265:11)
    at Socket.emit (events.js:87:17)
    at net.js:619:16
    at EventEmitter._tickCallback (node.js:245:11)


test fail

Object [ null ] has no method 'trigger'

Let new people help maintaining?

Hi @mikeal,
From what I see of this project, it's a good mix of function and simplicity. Would it be interesting to have other people join in and help maintain it?

I'm not the best programmer, but I'm good at testing and documenting.

TypeError: Object #<error> has no method 'run'

Hello,

I get this error with test_nytimes.js :

TypeError: Object #<error> has no method 'run'
    at Spider._handler (C:\Users\me\AppData\Roaming\npm\node_modules\spider\main.js:176:12)
    at Request.Spider.get [as _callback] (C:\Users\me\AppData\Roaming\npm\node_modules\spider\main.js:148:12)
    at Request.init.self.callback (C:\Users\me\AppData\Roaming\npm\node_modules\spider\node_modules\request\main.js:120:22)
    at Request.EventEmitter.emit (events.js:91:17)
    at Request.<anonymous> (C:\Users\me\AppData\Roaming\npm\node_modules\spider\node_modules\request\main.js:555:16)
    at Request.EventEmitter.emit (events.js:88:17)
    at IncomingMessage.Request.start.self.req.self.httpModule.request.buffer (C:\Users\me\AppData\Roaming\npm\node_modules\spider\node_modules\request\main.js:517:14)
    at IncomingMessage.EventEmitter.emit (events.js:115:20)
    at IncomingMessage._emitEnd (http.js:366:10)
    at HTTPParser.parserOnMessageComplete [as onMessageComplete] (http.js:149:23)

Regards,

Patrice
