
spider's Introduction

Spider -- Programmable spidering of web sites with node.js and jQuery

Install

From source:

  git clone git://github.com/mikeal/spider.git
  cd spider
  npm link

API

Creating a Spider

  var spider = require('spider');
  var s = spider();

spider(options)

The options object can have the following fields:

  • maxSockets - Integer giving the maximum number of sockets in the pool. Defaults to 4.
  • userAgent - The User-Agent string sent to the remote server with each request. Defaults to Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7 (a Chrome user-agent string).
  • cache - The cache object to use. Defaults to NoCache; see the code for the details a new cache object must implement.
  • pool - A hash object containing the agents for the requests. If omitted, requests use the global pool, which is capped at maxSockets.

Adding a Route Handler

spider.route(hosts, pattern, cb)

Where the params are:

  • hosts - A string, or an array of strings, giving the host part of the targeted URL(s).
  • pattern - The pattern spider matches against the remainder (pathname + search + hash) of the URL(s).
  • cb - A function of the form function(window, $), where:
    • this - References the object returned by Routes.match, with some extras added by spider. For more info see https://github.com/aaronblohowiak/routes.js
    • window - References the document's window.
    • $ - References the jQuery object.

Queuing a URL for spider to fetch

spider.get(url), where url is the URL to fetch.
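
Putting the pieces above together, a minimal end-to-end sketch (the host, pattern, and selector are illustrative, not a tested recipe):

```javascript
// Minimal end-to-end sketch: create a spider, register a route, queue a URL.
// example.com, the '/articles/*' pattern, and the selector are illustrative.
var spider = require('spider');
var s = spider({ maxSockets: 2 });

s.route('example.com', '/articles/*', function (window, $) {
  // `this` carries the Routes.match result; window and $ come from the
  // fetched document.
  console.log($('h1', window.document).text());
});

s.get('http://example.com/articles/1');
```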

Extending / Replacing the MemoryCache

Currently a MemoryCache replacement must provide the following methods:

  • get(url, cb) - Passes url's body field to the cb callback/continuation if it exists, null otherwise.
    • cb - Must be of the form function(retval) {...}
  • getHeaders(url, cb) - Passes url's headers field to the cb callback/continuation if it exists, null otherwise.
    • cb - Must be of the form function(retval) {...}
  • set(url, headers, body) - Saves url's headers and body in the cache.
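
One way to read that contract, as a self-contained sketch (the constructor name and storage layout are mine; the docs are ambiguous about whether null is returned directly or delivered via the callback, so this version passes it through cb):

```javascript
// Minimal in-memory cache sketch implementing the three documented methods.
// Misses are delivered as null through the callback.
function MemoryCache() {
  this.store = {};
}
MemoryCache.prototype.get = function (url, cb) {
  var entry = this.store[url];
  cb(entry ? entry.body : null);
};
MemoryCache.prototype.getHeaders = function (url, cb) {
  var entry = this.store[url];
  cb(entry ? entry.headers : null);
};
MemoryCache.prototype.set = function (url, headers, body) {
  this.store[url] = { headers: headers, body: body };
};
```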

Setting the verbose/log level

spider.log(level) - Where level is one of the strings "debug", "info", or "error".

spider's People

Contributors

chmac, gtzilla, lucasfcosta, mikeal, twleung, vermiculite, vieiralucas


spider's Issues

Possible to spider AJAX-loaded content?

Hey mikeal, I've been playing with the spider script today and loving the progress. Just wondered: is it possible (or planned) to spider AJAX-loaded data, or to click certain links on a page that trigger an internal state change (for example, a site using sammy.js) and crawl or otherwise poke around the DOM after the new content has loaded?

Even if it's a case of click ... wait 5 seconds ... try the DOM for changes ... get updated data, or else try again in 5 seconds - it would be really cool!

Cheers,

Cannot read property 'parsingMode' of null

Hello,

Seems like spider is failing. I'm using the NYTimes example, and with node 0.12.0 it gives this error:

/AmazingDirectory/node_modules/spider/node_modules/jsdom/lib/jsdom.js:41
  if (options.parsingMode === undefined || options.parsingMode === 'auto') {
             ^
TypeError: Cannot read property 'parsingMode' of null
    at Object.exports.jsdom (/AmazingDirectory/node_modules/spider/node_modules/jsdom/lib/jsdom.js:41:14)
    at Spider._handler (/AmazingDirectory/node_modules/spider/main.js:175:26)
    at Request._callback (/AmazingDirectory/node_modules/spider/main.js:148:12)
    at Request.self.callback (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:344:22)
    at Request.emit (events.js:110:17)
    at Request.<anonymous> (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:1239:14)
    at Request.emit (events.js:129:20)
    at IncomingMessage.<anonymous> (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:1187:12)
    at IncomingMessage.emit (events.js:129:20)
    at _stream_readable.js:908:16

And with node 0.10.36 it throws this:

test1

TypeError: Cannot read property 'parsingMode' of null
    at Object.exports.jsdom (/AmazingDirectory/node_modules/spider/node_modules/jsdom/lib/jsdom.js:41:14)
    at Spider._handler (/AmazingDirectory/node_modules/spider/main.js:175:26)
    at Request._callback (/AmazingDirectory/node_modules/spider/main.js:148:12)
    at Request.self.callback (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:344:22)
    at Request.emit (events.js:98:17)
    at Request.<anonymous> (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:1239:14)
    at Request.emit (events.js:117:20)
    at IncomingMessage.<anonymous> (/AmazingDirectory/node_modules/spider/node_modules/request/request.js:1187:12)
    at IncomingMessage.emit (events.js:117:20)
    at _stream_readable.js:944:16

process out of memory

I am creating a spider to walk the iTunes App Store.

It hits about 80 or so pages, then I get the following message:
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

You can find the spider script, and the console log here:
https://gist.github.com/934787

TypeError: Object [ null ] has no method 'createWindow'

When running the memory leak test, I get this error:

TypeError: Object [ null ] has no method 'createWindow'
    at Spider._handler (/var/www/sitemap/spider/main.js:179:27)
    at Request._callback (/var/www/sitemap/spider/main.js:152:12)
    at Request.self.callback (/var/www/sitemap/node_modules/request/request.js:123:22)
    at Request.EventEmitter.emit (events.js:98:17)
    at Request.<anonymous> (/var/www/sitemap/node_modules/request/request.js:1047:14)
    at Request.EventEmitter.emit (events.js:117:20)
    at IncomingMessage.<anonymous> (/var/www/sitemap/node_modules/request/request.js:998:12)
    at IncomingMessage.EventEmitter.emit (events.js:117:20)
    at _stream_readable.js:920:16
    at process._tickCallback (node.js:415:13)

passing the url to route callbacks

Often as I spider, I create directories based on the URL. Having the URL handy in the callback means I don't have to do any nasty backflips. I've hacked spider as follows:

if (jsdom.defaultDocumentFeatures.ProcessExternalResources) {
  $(function () { r.fn.call(r, window, window.$, url); })
} else {
  r.fn.call(r, window, window.$, url);
}

  1. Is there a cleaner way to do this?
  2. Would you like a patch?

Cheers,
GF

TypeError: Parameter 'url' must be a string, not undefined

It seems that spider crashes when it finds bad (or empty?) URLs. Running the memory leak test:

TypeError: Parameter 'url' must be a string, not undefined
    at Url.parse (url.js:107:11)
    at urlParse (url.js:101:5)
    at Url.resolve (url.js:409:29)
    at urlResolve (url.js:405:40)
    at null.<anonymous> (/var/www/sitemap/spider/main.js:186:15)
    at Function.jQuery.extend.each (/var/www/sitemap/spider/jquery.js:641:20)
    at jQuery.fn.jQuery.each (/var/www/sitemap/spider/jquery.js:265:17)
    at window.$.fn.spider (/var/www/sitemap/spider/main.js:183:12)
    at Object.spider.route.route.article.title [as fn] (/var/www/sitemap/go.js:5:10)
    at Spider._handler (/var/www/sitemap/spider/main.js:196:12)

Segmentation Fault

Hey hey,

I tried running the tests for spider, and I consistently get behavior like the following:

josh@pidgey:~/spider$ node tests/test_nytimes.js 
Segmentation fault

Frustratingly, that's all the information I get.

Any ideas?

Documentation is a bit abstract ....

Can you please show a full example of how you use these libraries on a specific site? It's not really clear what should be called in which order; a tangible use case would really help me figure out how to use this library.

i.e., I have a website at foo.com and want to make sure I don't have broken internal links. What does that look like in Spider?
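
A hedged sketch of that broken-link use case, assuming only the route(hosts, pattern, cb) and get(url) API documented above (untested; the selector and logging are illustrative, and the docs do not say whether relative hrefs are resolved, so absolute URLs are safest):

```javascript
// Untested sketch: crawl foo.com, log every internal link found, and queue
// it so spider fetches it in turn. A dead link would surface as a failed
// fetch in spider's logs.
var spider = require('spider');
var s = spider();

s.route('foo.com', '*', function (window, $) {
  $('a[href^="http://foo.com"]', window.document).each(function () {
    var href = $(this).attr('href');
    console.log('queuing', href);
    s.get(href);
  });
});

s.get('http://foo.com/');
```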

Broken on node 0.6.12

[02:03:25] [twl@SE-TEDLEUNG-MB1] ~/node_modules/spider/tests
[5830]> node test_texas.js
The "sys" module is now called "util". It should have a similar interface.

node.js:201
        throw e; // process.nextTick error, or 'error' event on first tick
        ^
TypeError: Object [ null ] has no method 'trigger'
    at /Users/twl/node_modules/spider/main.js:60:16
    at Spider._handler (/Users/twl/node_modules/spider/main.js:189:5)
    at Request._callback (/Users/twl/node_modules/spider/main.js:164:12)
    at Request.callback (/Users/twl/node_modules/request/main.js:119:22)
    at Request.<anonymous> (/Users/twl/node_modules/request/main.js:525:16)
    at Request.emit (events.js:67:17)
    at IncomingMessage.<anonymous> (/Users/twl/node_modules/request/main.js:484:14)
    at IncomingMessage.emit (events.js:88:20)
    at CleartextStream.<anonymous> (http.js:1204:17)
    at CleartextStream.emit (events.js:88:20)
    at Array.1 (tls.js:792:22)
    at EventEmitter._tickCallback (node.js:192:40)
[1] 15007 exit 1 node test_texas.js

Form Filling

Is there a way to use spider to fill forms, for example, on a site that requires a log in?

Action required: Greenkeeper could not be activated 🚨

🚨 You need to enable Continuous Integration on all branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because we are using your CI build statuses to figure out when to notify you about breaking changes.

Since we did not receive a CI status on the greenkeeper/initial branch, we assume that you still need to configure it.

If you have already set up a CI for this repository, you might need to check your configuration. Make sure it will run on all new branches. If you don’t want it to run on every branch, you can whitelist branches starting with greenkeeper/.

We recommend using Travis CI, but Greenkeeper will work with every other CI service as well.

Once you have installed CI on this repository, you’ll need to re-trigger Greenkeeper’s initial Pull Request. To do this, please delete the greenkeeper/initial branch in this repository, and then remove and re-add this repository to the Greenkeeper integration’s whitelist on GitHub. You'll find this list on your repo or organization’s settings page, under Installed GitHub Apps.

you should update the dependencies of cookiejar to 1.3.0

The old version of cookiejar (1.2.0) has a bug in getCookie that crashes when the target web source has an empty string in a cookie. You should update it to 1.3.0 (tested in my own project: 1.2.0 crashed, but 1.3.0 works).

The bug is due to lines 71/72 of cookiejar.js:

pair = parts[i].match(/([^=]+)=((?:.|\n)*)/);

which 1.3.0 changes to the following, allowing the value after "=" to be empty:

pair = parts[i].match(/([^=]+)(?:=((?:.|\n)*))?/);
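
To illustrate the difference between the two patterns on a cookie fragment with no value (the variable names are mine):

```javascript
// The old cookiejar pattern requires "=value", so a bare cookie name
// yields no match at all (null), and the code that indexes into the
// match result then crashes. The new pattern makes "=value" optional.
var oldPattern = /([^=]+)=((?:.|\n)*)/;
var newPattern = /([^=]+)(?:=((?:.|\n)*))?/;

var part = 'sessionid';                 // cookie fragment with no "="
var oldMatch = part.match(oldPattern);  // null
var newMatch = part.match(newPattern);  // name captured, value undefined
```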

Trouble loading jQuery into Google

Here's the code I'm using to test:

var spider = require('spider');
var s = spider();
s.log('debug');
s.route('www.google.com', '/search*', function (window, $) {
  console.log($(window.document.body).text());
});
s.get('http://www.google.com/search?q=carter');

The error seems to occur in the jqueryify function: the first line to get context fails. I'm not really sure how any of this works, but it dies on Script.runInContext(jquery, ctx, filename);

Then the document object is null and it tries to call trigger(), which doesn't exist.

this is the error:
document.trigger(
        ^
TypeError: Object [ null ] has no method 'trigger'
    at /home/ccole/node/spyder/node_modules/spider/main.js:60:16
    at Spider._handler (/home/ccole/node/spyder/node_modules/spider/main.js:189:5)
    at [object Object].callback (/home/ccole/node/spyder/node_modules/spider/main.js:164:12)
    at [object Object].<anonymous> (/home/ccole/node/spyder/node_modules/request/main.js:273:21)
    at [object Object].emit (events.js:64:17)
    at IncomingMessage.<anonymous> (/home/ccole/node/spyder/node_modules/request/main.js:261:54)
    at IncomingMessage.emit (events.js:81:20)
    at HTTPParser.onMessageComplete (http.js:133:23)
    at Socket.onend (http.js:1265:12)
    at Socket._onReadable (net.js:659:26)
    at IOWatcher.onReadable [as callback]

Exception thrown when site is unavailable

When I try to crawl a site that is down, Spider throws an exception:

TypeError: Cannot read property 'statusCode' of undefined
    at Request._callback (spider/main.js:131:15)
    at spider/node_modules/request/main.js:119:22
    at Request.emit (events.js:87:17)
    at ClientRequest.<anonymous> (spider/node_modules/request/main.js:207:10)
    at ClientRequest.emit (events.js:87:17)
    at Socket.<anonymous> (http.js:1265:11)
    at Socket.emit (events.js:87:17)
    at net.js:619:16
    at EventEmitter._tickCallback (node.js:245:11)


test fail

Object [ null ] has no method 'trigger'

Let new people help maintaining?

Hi @mikeal,
From what I see of this project, it's a good mix of function and simplicity. Would it be interesting to have other people join in and help maintain it?

I'm not the best programmer, but I'm good at testing and documenting.

TypeError: Object #<error> has no method 'run'

Hello,

I get this error with test_nytimes.js :

TypeError: Object #<error> has no method 'run'
    at Spider._handler (C:\Users\me\AppData\Roaming\npm\node_modules\spider\main.js:176:12)
    at Request.Spider.get [as _callback] (C:\Users\me\AppData\Roaming\npm\node_modules\spider\main.js:148:12)
    at Request.init.self.callback (C:\Users\me\AppData\Roaming\npm\node_modules\spider\node_modules\request\main.js:120:22)
    at Request.EventEmitter.emit (events.js:91:17)
    at Request.<anonymous> (C:\Users\me\AppData\Roaming\npm\node_modules\spider\node_modules\request\main.js:555:16)
    at Request.EventEmitter.emit (events.js:88:17)
    at IncomingMessage.Request.start.self.req.self.httpModule.request.buffer (C:\Users\me\AppData\Roaming\npm\node_modules\spider\node_modules\request\main.js:517:14)
    at IncomingMessage.EventEmitter.emit (events.js:115:20)
    at IncomingMessage._emitEnd (http.js:366:10)
    at HTTPParser.parserOnMessageComplete [as onMessageComplete] (http.js:149:23)

Regards,

Patrice
