
node-crawler's People

Contributors

authuir, beijingprotohuman, bkw, chloe899, connorweng, cuixiping, darrenqc, digitalfrost, dong-gao, gardsted, jaredmansaakintola, jhurliman, kossidts, lahaxearnaud, mike442144, namuol, nolandg, ozlevka, patrickzxk, paulvalla, petskratt, pzmarzly, racheet, rauno56, suhaibmujahid, sunnyhuang2008, swosko, sylvinus, thomas-hilaire, tzbee


node-crawler's Issues

undefined:undefined<script> failed

No matter which URL I try to visit, I get this error. Is JSDOM going wrong?

[
  {
    "type": "error",
    "message": "Running http://www.longshanks-consulting.com/:undefined:undefined<script> failed.",
    "data": {
      "error": {},
      "filename": "http://www.longshanks-consulting.com/:undefined:undefined<script>"
    }
  }
]

My code : https://gist.github.com/haveaguess/6833379

Problems installing crawler: what do I have to do?

Hi Sylvain,
I'm trying to install your crawler in my Ubuntu T1 instance
node --version: v0.10.12 & npm --version: 1.2.32

but it says:
npm install crawler
npm ERR! Failed to parse json
npm ERR! Unexpected token g
npm ERR! File: /home/ubuntu/package.json
npm ERR! Failed to parse package.json data.
npm ERR! package.json must be actual JSON, not just JavaScript.
npm ERR!
npm ERR! This is not a bug in npm.
npm ERR! Tell the package author to fix their package.json file. JSON.parse

npm ERR! System Linux 3.2.0-40-virtual
npm ERR! command "/home/ubuntu/.nvm/v0.10.12/bin/node" "/home/ubuntu/.nvm/v0.10.12/bin/npm" "install" "crawler"
npm ERR! cwd /home/ubuntu
npm ERR! node -v v0.10.12
npm ERR! npm -v 1.2.32
npm ERR! file /home/ubuntu/package.json
npm ERR! code EJSONPARSE
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /home/ubuntu/npm-debug.log
npm ERR! not ok code 0

Opening the npm-debug.log up:
0 info it worked if it ends with ok
1 verbose cli [ '/home/ubuntu/.nvm/v0.10.12/bin/node',
1 verbose cli '/home/ubuntu/.nvm/v0.10.12/bin/npm',
1 verbose cli 'install',
1 verbose cli 'crawler' ]
2 info using [email protected]
3 info using [email protected]
4 verbose read json /home/ubuntu/package.json
5 error Failed to parse json
5 error Unexpected token g
6 error File: /home/ubuntu/package.json
7 error Failed to parse package.json data.
7 error package.json must be actual JSON, not just JavaScript.
7 error
7 error This is not a bug in npm.
7 error Tell the package author to fix their package.json file. JSON.parse
8 error System Linux 3.2.0-40-virtual
9 error command "/home/ubuntu/.nvm/v0.10.12/bin/node" "/home/ubuntu/.nvm/v0.10.12/bin/npm" "install" "
crawler"
10 error cwd /home/ubuntu
11 error node -v v0.10.12
12 error npm -v 1.2.32
13 error file /home/ubuntu/package.json
14 error code EJSONPARSE
15 verbose exit [ 1, true ]

What do I have to do?

Thank you in advance for your kind help.
Kind regards
Marco

Error when running test/simple.js

Hi,

I just installed Node.js (v0.4.9) and your module, and I got:

http://jamendo.com/
http://tedxparis.com
SyntaxError: Unexpected token ILLEGAL
at Object.javascript (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/languages/javascript.js:17:14)
at Object._eval (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:1195:46)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:43:20
at Object.check (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:235:11)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:251:12
at IncomingMessage. (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:85:11)
at IncomingMessage.emit (events.js:81:20)
at HTTPParser.onMessageComplete (http.js:133:23)
at Client.onData as ondata
at Client._onReadable (net.js:683:27)
TypeError: Cannot read property 'prototype' of undefined
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/browser/index.js:84:16
at String. ([object Context]:19:13786)
at Function.each ([object Context]:12:7964)
at Object.add ([object Context]:19:13498)
at [object Context]:19:19008
at Function.each ([object Context]:12:7964)
at Object.each ([object Context]:12:1155)
at Object.one ([object Context]:19:18984)
at Object.bind ([object Context]:19:18793)
at [object Context]:19:21559
at [object Context]:19:39657
at Object.javascript (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/languages/javascript.js:17:14)
at Object._eval (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:1195:46)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:43:20
at Object.check (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:235:11)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:251:12
at IncomingMessage. (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:85:11)
at IncomingMessage.emit (events.js:81:20)
at HTTPParser.onMessageComplete (http.js:133:23)
at Client.onEnd as onend
at Client._onReadable (net.js:659:26)
at IOWatcher.onReadable as callback
TypeError: Cannot read property 'prototype' of undefined
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/browser/index.js:84:16
at String. ([object Context]:19:13786)
at Function.each ([object Context]:12:7964)
at Object.add ([object Context]:19:13498)
at B ([object Context]:19:21262)
at Object.ready ([object Context]:19:19655)
at [object Context]:1:18
at Object.javascript (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/languages/javascript.js:17:14)
at Object._eval (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:1195:46)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:43:20
at Object.check (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:235:11)
at Object.check (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:239:23)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:251:12
at IncomingMessage. (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:85:11)
at IncomingMessage.emit (events.js:81:20)
at HTTPParser.onMessageComplete (http.js:133:23)
at Client.onEnd as onend
at Client._onReadable (net.js:659:26)
at IOWatcher.onReadable as callback

Could this work with cheerio

Avoiding JSDOM to gain better performance, can cheerio be plugged in instead? It's supposed to be (partially) compatible with jQuery.

Npm install crawler fails

Hello,

I'm trying to install node-crawler using node version 0.4.12 and npm version 1.0.22

Node-crawler requires a 0.4.x version of node, but it seems that a dependency, qunit, requires a node version of >= 0.5

Is it even possible to install this?

npm ERR! error installing [email protected] Error: Unsupported
npm ERR! error installing [email protected]     at checkEngine (/usr/local/lib/node_modules/npm/lib/install.js:570:14)
npm ERR! error installing [email protected]     at Array.0 (/usr/local/lib/node_modules/npm/node_modules/slide/lib/bind-actor.js:15:8)
npm ERR! error installing [email protected]     at LOOP (/usr/local/lib/node_modules/npm/node_modules/slide/lib/chain.js:15:13)
npm ERR! error installing [email protected]     at chain (/usr/local/lib/node_modules/npm/node_modules/slide/lib/chain.js:20:4)
npm ERR! error installing [email protected]     at installOne_ (/usr/local/lib/node_modules/npm/lib/install.js:548:3)
npm ERR! error installing [email protected]     at installOne (/usr/local/lib/node_modules/npm/lib/install.js:488:3)
npm ERR! error installing [email protected]     at /usr/local/lib/node_modules/npm/lib/install.js:425:9
npm ERR! error installing [email protected]     at /usr/local/lib/node_modules/npm/node_modules/slide/lib/async-map.js:54:35
npm ERR! error installing [email protected]     at Array.forEach (native)
npm ERR! error installing [email protected]     at /usr/local/lib/node_modules/npm/node_modules/slide/lib/async-map.js:54:11
npm ERR! Unsupported
npm ERR! Not compatible with your version of node/npm: [email protected]
npm ERR! Required: {"node":">= 0.5.0"}
npm ERR! Actual:   {"npm":"1.0.22","node":"v0.4.12"}
npm ERR! 
npm ERR! System Darwin 11.1.0
npm ERR! command "node" "/usr/local/bin/npm" "install" "crawler"
npm ERR! cwd /Users/thomasbrus/Code/wordcount
npm ERR! node -v v0.4.12
npm ERR! npm -v 1.0.22
The following files couldn't be removed.
Remove them manually and try again

sudo rm -rf "/Users/thomasbrus/node_modules/crawler"

npm ERR! 
npm ERR! Additional logging details can be found in:
npm ERR!     /Users/thomasbrus/Code/wordcount/npm-debug.log
npm not ok

Error: not found: make

I've got this error while trying to install using

npm install git://github.com/daraosn/node-crawler.git --save

Any suggestions?


spawn python [ '/Users/cardosor/.node-gyp/0.6.15/tools/gyp_addon',
'binding.gyp',
'-I/Users/cardosor/Sites/devel/nodejs/photod/node_modules/crawler/node_modules/jsdom/node_modules/contextify/build/config.gypi',
'-f',
'make' ]
ERR! Error: not found: make
at F (/usr/local/lib/node_modules/npm/node_modules/which/which.js:43:28)
at E (/usr/local/lib/node_modules/npm/node_modules/which/which.js:46:29)
at Object.oncomplete (/usr/local/lib/node_modules/npm/node_modules/which/which.js:57:16)
ERR! not ok
npm WARN optional dependency failed, continuing [email protected]

SyntaxError: Unexpected token ILLEGAL

NodeJS v0.4.9

var Crawler = require("crawler").Crawler;
var c = new Crawler({
    "maxConnections":10,
    "callback":function(error,result,$) {

    }
});
c.queue(["http://www.jamendo.com/"]);

http://www.jamendo.com/
SyntaxError: Unexpected token ILLEGAL
at Object.javascript (/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/languages/javascript.js:17:14)
at Object._eval (/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:1195:46)
at /node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:43:20
at Object.check (/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:235:11)
at /node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:251:12
at IncomingMessage. (/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:85:11)
at IncomingMessage.emit (events.js:81:20)
at HTTPParser.onMessageComplete (http.js:133:23)
at Client.onData as ondata
at Client._onReadable (net.js:683:27)

jQuery object undefined on callback

Ran this simple code:

var Crawler = require("crawler").Crawler;

var c = new Crawler({
    "maxConnections":10,
    "callback":function(error,result,$) {
        console.log(result.body);
        console.log($('body').text());
    }
});
c.queue("http://google.com");

result.body shows the HTML, but $('body').text() gives me this error:

TypeError: undefined is not a function
    at Object.Crawler.callback (C:\_besteat\samples\crawler\app.js:17:33)
    at exports.Crawler.self.onContent.jsdom.env.done (C:\_besteat\samples\crawle
r\node_modules\crawler\lib\crawler.js:255:37)
    at exports.env.exports.jsdom.env.scriptComplete (C:\_besteat\samples\crawler
\node_modules\crawler\node_modules\jsdom\lib\jsdom.js:205:39)
    at process.startup.processNextTick.process._tickCallback (node.js:244:9)

I was tracing back the code to this

                    fs.readFile(toQueue.jQueryUrl.replace(/^file\:\/\//,""),"utf-8",function(err,jq) {
                        if (err) {
                            toQueue.callback(err);
                            release(toQueue);
                        } else {
                            try {
                                jsd([jq]);
                            } catch (e) {
                                toQueue.callback(e);
                                release(toQueue);
                            }
                        }
                    });

It looks like toQueue.callback(e); was called because jsd([jq]) failed.

That led me to jsdom. And somewhere between this

window     = exports.html(html, null, options).createWindow()

and this

features   = JSON.parse(JSON.stringify(window.document.implementation._features))

window.document.implementation doesn't exist. I have not gone further to find out why. It was confusing.

Any help appreciated.

TypeError: Cannot read property 'firstChild' of null

I'm creating crawler as:

var c = new Crawler({
    'maxConnections': 1,
    'callback': function(error, result, $) {
        $('td.title + a').each(function(a) {
            var tmp = new blogitem(a.val, a.href);
            news.push(tmp);
        });
        uploadPosts(news);
    }
});

setInterval(function() {
    c.queue([{
        'uri': 'http://127.0.0.1/posts',
        'timeout': 120
    }]);
}, 10000);

When I run the server.js, it gives TypeError: Cannot read property 'firstChild' of null. I've tried just to put sys.puts(response) in the callback, but the result is the same...

Crawler getting really slow after ~ 1000 crawled pages

Hi.
First, I'd like to thank you for creating this amazingly simple crawler :) It's very useful for any kind of research on the web.
While using your module at quite a big scale, I spotted an extreme slowdown after ~1000 crawled pages.
Is there any kind of cache which I can shut off, or is there any other way to prevent this speed reduction?

Really hope there is a way to speed up node-crawler
yours sincerely
Daniel

jQuery undefined in response

Maybe it's just me but I'm getting an undefined jQuery response when running test/simple.js

$("a").each(function(i,a) {
^
TypeError: undefined is not a function

Passing data to Crawler callback when queuing URL

A lot of the time you need to pass local data, associated with the URL being queued, through to the callback. For me this is useful for building a relationship tree across gathered scraps, or micro objects thrown into CouchDB with nano, which I then combine into bigger objects later.

Passing jQuery-gathered data while queuing the URL lets me store attributes such as the kind of link or surrounding taxonomy hints, which in some cases cannot be gathered from the page's own markup, depending on the flow of the website being crawled.

I suggest having data passed to the callback while queuing. Currently I append my data as a GET query string to the URL being queued, and then parse request.uri back in the callback to get the passed data. Even though this gets the job done, it is intrusive (the server admins can see what we are cooking) and overall not elegant.
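A workaround that avoids the query-string hack without any API change is to bind the local data into a per-URL callback with a closure. The sketch below is illustrative only: fakeQueue and queueWithData are hypothetical names, and fakeQueue stands in for crawler.queue() so the example runs without network access:

```javascript
// Stand-in for crawler.queue(): a real crawler would fetch task.uri,
// then invoke task.callback with the response.
function fakeQueue(task) {
    task.callback(null, { uri: task.uri, body: '<html></html>' });
}

// Attach local data to each queued URL by wrapping the callback.
function queueWithData(uri, data) {
    fakeQueue({
        uri: uri,
        callback: function (error, result) {
            // `data` rides along via the closure, invisible to the server
            console.log(result.uri, data.kind);
        }
    });
}

queueWithData('http://example.com/a', { kind: 'article' });
```

The closure keeps the data entirely client-side, so nothing extra appears in server logs, and it works per-URL even when many URLs share one crawler instance.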

ReferenceError: e is not defined in toQueue.callback(e)

I copied the sample code and added the line below to point to a jQuery file:

"jQueryUrl":'/vendor/jquery-1.8.3.min.js',

Then ran it but got this error:

C:\node_modules\crawler\node_modules\crawler\lib\crawler.js:275
toQueue.callback(e);
^
ReferenceError: e is not defined
    at exports.Crawler.self.onContent (C:\node_modules\crawler\node_modules\crawler\lib\crawler.js:275:46)
    at fs.js:117:20
    at Object.oncomplete (fs.js:297:15)

Obviously it's something about reading the jQuery file. What is the right way to fix this?

Unable to access double gzipped file

This code:

var crawler = require('crawler').Crawler;

var c =  new crawler({
        "maxConnections": 10,
        "forceUTF8": true,
        "debug": true,
        "callback": function(err,result,$) {
                console.log(result.body);
        }
});

c.queue("http://marketwatch.com/sitemap-search-index.xml.gz");

Produces this text:

I%&/m{JJ`$@iG#)*eVe]f@흼{{;N'?\fdlJɞ!?~|?"ǿ|{[e^7EGiVbyG|qS"[Y.7wMG]={j|uo\wvvv_<=Ųi4H
p~xwg#y.m^et>V
}ީGw~6zv?}tg{7~G?+GvY{{Dy7~G{7{Odtӟn
vwwo.Y^mRgam5ˮoCB_|m*ѺjᾉwaclHlhA0oVA{'+m,=
;hcf7vV"Ն

The problem is that result.body is a string, and this string no longer has the information needed to pass to gunzip.

More specifically this is what I am attempting to do:

var crawler = require('crawler').Crawler,
                zlib = require('zlib');

var c =  new crawler({
        "maxConnections": 1,
        "debug": true,
        "callback": function(err,result,$) {
                zlib.unzip(new Buffer(result.body), function(err, buffer) {
                        if(err) throw err;
                        console.log(buffer.toString());
                });
        }
});

c.queue("http://marketwatch.com/sitemap-search-index.xml.gz");

But it produces this result:

GET http://marketwatch.com/sitemap-search-index.xml.gz ...
Got http://marketwatch.com/sitemap-search-index.xml.gz (634 bytes)...

/home/sparq/ml/test_crawler.js:9
                        if(err) throw err;
                                      ^
Error: incorrect header check
    at Zlib._binding.onerror (zlib.js:295:17)

encoding: null results in undefined body

I am using this simple example and found a problem:

var Crawler = require("crawler").Crawler;

var c = new Crawler({
    "maxConnections":10,
    "callback":function(error,result,$) {
        console.log(result.body);
        //console.log($('body').text());
    }
});
c.queue("http://google.com");

In this particular line:

var req = request(_.pick.apply(this,[ropts].concat(requestArgs)), function(error,response,body) {

The input JSON was as following:

{ uri: 'http://google.com',
  method: 'GET',
  headers:
   { 'Accept-Encoding': 'gzip',
     'User-Agent': 'node-crawler/0.2.3' },
  encoding: null,
  timeout: 60000 }

As you can see, I have to delete "encoding: null" in order for the body variable to work.

Why is it "null", and how can I fix this for websites like google.com?

The platform I am working on is Windows, btw (if that makes a difference).

Option request.jar is broken

var r = require('crawler/node_modules/request');
var Crawler = require('crawler').Crawler;

var crawler = new Crawler({jar: r.jar()});
crawler.queue({
    uri: 'http://github.com',
    callback: test
}, {
    uri: 'http://gitcities.com',
    callback: test
});

function test(err, res) {
    console.log(res.headers);
}

throws

Object #<Object> has no method 'get'
TypeError: Object #<Object> has no method 'get'
    at Request.jar (/Users/anatoliy/projects/opensource/node-crawler/node_modules/request/index.js:1139:19)
    at Request.init (/Users/anatoliy/projects/opensource/node-crawler/node_modules/request/index.js:227:8)
    at new Request
(skipped rest of stacktrace)

I will send pull request as we need it published asap. Thanks!

Proxy support with authentication

I saw that there was a commit for proxy support that isn't yet on npm. Is there any way to provide authentication for the proxies?

Thanks!

Cannot set property 'body' of undefined

node test/simple.js

http://jamendo.com/
http://tedxparis.com

/home/crawler/node_modules/crawler/lib/crawler.js:74
response.body = body;
^
TypeError: Cannot set property 'body' of undefined
at Object.callback (/home/crawler/node_modules/crawler/lib/crawler.js:74:39)
at [object Object].callback (/home/crawler/node_modules/crawler/lib/crawler.js:70:43)
at [object Object]. (/home/crawler/node_modules/crawler/node_modules/request/main.js:151:67)
at [object Object].emit (events.js:64:17)
at Object._onTimeout (/home/crawler/node_modules/crawler/node_modules/request/main.js:300:19)
at Timer.callback (timers.js:83:39)

Benchmarks

How fast is it? Do any benchmarks exist?

No access to jQuery

Running the tests, I'm able to see the full HTML of the response in result.content, but any attempt to access it via jQuery returns nothing (no errors, etc.). if (jQuery) console.log('loaded') is not triggered either.

Here's the code, with several different variations of my attempts to access it via jQuery:

var Crawler = require("../lib/crawler").Crawler,
    c = new Crawler({
        "maxConnections":10,
        "timeout":120*1000, // seconds
        "debug":false,
        "callback":function(error,result,$) {
            console.log("Got page");
            //console.log(result.content);
            //if (jQuery) console.log('loaded') ;
            //response = $('' + result.content + '');
            console.log($('title', jQuery(result.content)).text());
            //console.log(result.content);
            console.log($('meta[name=title]'), response);
            /*
            $("a").each(function(i,a) {
                console.log(this.href);
                console.log('got a link');
                //c.queue(a.href);
            })*/
        }
    });

c.queue(["http://urbanthesaur.us"]);

undefined is not a function

D:\node\polgastro\node_modules\crawler\lib\crawler.js:296
                          if (callbackError) throw callbackError;
                                                   ^
TypeError: undefined is not a function
    at Object.Crawler.callback (D:\node\polgastro\crawlr.js:10:3)
    at jsdom.env.done (D:\node\polgastro\node_modules\crawler\lib\crawler.js:272:41)
    at D:\node\polgastro\node_modules\crawler\node_modules\jsdom\lib\jsdom.js:252:9
    at process._tickCallback (node.js:415:13)
    at Function.Module.runMain (module.js:499:11)
    at startup (node.js:119:16)
    at node.js:901:3

Assertion failed: req->work_cb, file src\win\threadpool.c, line 46

(I don't speak English well.)

Question 1: running the tests with

npm install && npm test

gives:

Assertion failed: req->work_cb, file src\win\threadpool.c, line 46

How do I solve this error?

os: win7 (32)
nodejs:0.8.9
npm 1.1.61

Question 2: does this output mean npm test succeeded?

Global summary:

Files: 1 | Tests: 12 | Assertions: 31 | Failed: 0 | Passed: 31 | Runtime: 9001

Add support for crawling HTML fragments

If the crawler encounters a page such as http://hypem.com/blog/a/1?ax=1 that contains an HTML fragment and not a complete document, the crawler will crash when trying to appendChild() a jQuery script element on the undefined window.document.body object. One option might be to add support for the HTML5 parser library which should be able to handle fragments: https://github.com/aredridel/html5

A short repro:

var Crawler = require('node-crawler').Crawler;
var crawler = new Crawler({
  maxConnections: 1,
  callback: function(err, res, $) {
    console.log('worked!'); // The app will crash before this point
  }
});

crawler.queue(['http://hypem.com/blog/a/1?ax=1']);

Cannot read property 'timeout' of undefined

Full error:

/home/overconnected/OverConnected/node_modules/crawler/lib/crawler.js:296
                          if (callbackError) throw callbackError;
                                                   ^
TypeError: Cannot read property 'timeout' of undefined
    at /home/overconnected/OverConnected/node_modules/crawler/node_modules/underscore/underscore.js:803:18
    at Array.forEach (native)
    at _.each._.forEach (/home/overconnected/OverConnected/node_modules/crawler/node_modules/underscore/underscore.js:78:11)
    at Function._.defaults (/home/overconnected/OverConnected/node_modules/crawler/node_modules/underscore/underscore.js:800:5)
    at self.queue (/home/overconnected/OverConnected/node_modules/crawler/lib/crawler.js:356:11)
    at Object.<anonymous> (/home/overconnected/OverConnected/server.js:51:18)
    at Function.v.extend.each (http://www.pmlive.com/pharma_appointments/sarah_verhoeff_joins_the_board_at_pan_506016:undefined:undefined<script>:2:14543)
    at v.fn.v.each (http://www.pmlive.com/pharma_appointments/sarah_verhoeff_joins_the_board_at_pan_506016:undefined:undefined<script>:2:11217)
    at String.<anonymous> (/home/overconnected/OverConnected/server.js:25:20)
    at Function.v.extend.each (http://www.pmlive.com/pharma_appointments/sarah_verhoeff_joins_the_board_at_pan_506016:undefined:undefined<script>:2:14543)

TypeError: Cannot set property 'body' of undefined

$ node test/simple.js
http://jamendo.com/
http://tedxparis.com

/[...]/lib/crawler.js:74
response.body = body;
^
TypeError: Cannot set property 'body' of undefined
at Object.callback (/[...]/crawler/lib/crawler.js:74:39)
at Request.callback (/[...]/crawler/lib/crawler.js:70:43)
at Request. (/[...]/request/main.js:154:67)
at Request.emit (events.js:64:17)
at Object._onTimeout (/[...]/crawler/node_modules/request/main.js:320:19)
at Timer.callback (timers.js:83:39)

Detected as scripted

I am always detected as spam or as a scripted client, and must enter a captcha to continue. How can I solve this?

Install as a dependency fails

Hello,

I'm trying to include node-crawler in a project as a dependency.
I include it like that in my package.json and then run "npm install -d"

"dependencies": {
                "crawler": "0.0.2"
}

It is however blocking on qunit install, as the logs mention it:

npm info installOne [email protected]
npm ERR! error installing [email protected] Error: Unsupported
npm ERR! error installing [email protected] at checkEngine (/usr/local/lib/node_modules/npm/lib/install.js:561:14)
npm ERR! error installing [email protected] at nextStep (/usr/local/lib/node_modules/npm/lib/utils/chain.js:54:8)
npm ERR! error installing [email protected] at chain (/usr/local/lib/node_modules/npm/lib/utils/chain.js:27:3)
npm ERR! error installing [email protected] at installOne_ (/usr/local/lib/node_modules/npm/lib/install.js:539:3)
npm ERR! error installing [email protected] at installOne (/usr/local/lib/node_modules/npm/lib/install.js:479:3)
npm ERR! error installing [email protected] at /usr/local/lib/node_modules/npm/lib/install.js:421:9
npm ERR! error installing [email protected] at /usr/local/lib/node_modules/npm/lib/utils/async-map.js:57:35
npm ERR! error installing [email protected] at Array.forEach (native)
npm ERR! error installing [email protected] at /usr/local/lib/node_modules/npm/lib/utils/async-map.js:57:11
npm ERR! error installing [email protected] at Array.forEach (native)
...
npm info rm fail ENOTEMPTY, Directory not empty '/path/to/node_modules/crawler'
npm ERR! Unsupported
npm ERR! Not compatible with your version of node/npm: [email protected]
npm ERR! Required: {"node":">= 0.5.0"}
npm ERR! Actual: {"npm":"1.0.15","node":"v0.4.12"}
npm ERR!
npm ERR! System Darwin 11.1.0
npm ERR! command "node" "/usr/local/bin/npm" "install" "-d"

fatal when response error

When there is an error, response is undefined, and the crawler crashes fatally when trying to assign 'uri' to it.

It could be fixed like this:

                request(q, function(error,response,body) {
                    if (response) {
                        response.uri = q.uri;
                    }
                    onContent(error,response,body,false);
                });

So in the error case, it will still be possible to know which URI caused the error.

Empty response

If the remote side returns an empty response

Got http://some-internet-site.com/ (0 bytes)...

there is an exception:
/home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/jsdom/lib/jsdom.js:369
throw new Error("jsdom.env requires a '" + req + "' argument");
^
Error: jsdom.env requires a 'html' argument
at /home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/jsdom/lib/jsdom.js:369:13
at Array.forEach (native)
at Function.processArguments (/home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/jsdom/lib/jsdom.js:366:12)
at Object.env (/home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/jsdom/lib/jsdom.js:179:29)
at /home/alexio/TRAVEL/graph/node_modules/crawler/lib/crawler.js:119:35
at [object Object].callback (/home/alexio/TRAVEL/graph/node_modules/crawler/lib/crawler.js:179:25)
at [object Object]. (/home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/request/main.js:294:21)
at [object Object].emit (events.js:64:17)
at IncomingMessage. (/home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/request/main.js:281:54)
at IncomingMessage.emit (events.js:81:20)
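One possible guard has the shape below. This is a sketch of a fix, not the project's actual patch; handleBody and runJsdom are hypothetical names standing in for the crawler's response handler and its jsdom.env call:

```javascript
// Skip DOM parsing entirely when the body is empty, since
// jsdom.env throws if it is not given an html argument.
function handleBody(body, runJsdom, callback) {
    if (!body || body.length === 0) {
        // Nothing to parse; report back without invoking jsdom
        return callback(new Error('Empty response body'));
    }
    runJsdom(body, callback);
}
```

With a guard like this, a 0-byte response surfaces as an ordinary error in the user's callback instead of an uncaught exception inside jsdom.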

Upgrade version of Request

The version of Request used by Crawler is a bit old and includes the bug reported here: request/request#417 (which in turn might be masking other issues in Crawler or the code calling Crawler). I manually updated the Request module used by Crawler to the latest (2.25.0) and, anecdotally speaking, it seems to work fine.

Add local file to queue

//Local file was given, skip request
if (toQueue.file) {
    fs.readFile(toQueue.file, function (err, data) {
        if (err) {
            console.error("file reading error: ", err);
            return;
        }
        self.onContent(null, toQueue, {
            body: data
        }, false);
    });
    return;
}

keep available .request from module 'request' in response

Currently, response.request = toQueue.

This clobbers the .request property set by the 'request' module, so the effective URL (and other useful things) is no longer available.

I propose a .deepRequest:

line 112
...
} else {

                    response.content = body;
                    response.deepRequest = response.request; // added
                    response.request = toQueue;

                    if (toQueue.jQuery && toQueue.method!="HEAD") {

...

NPM install error

node-gyp rebuild

CC(target) Release/obj.target/iconv/deps/libiconv/libcharset/lib/localcharset.o
CC(target) Release/obj.target/iconv/deps/libiconv/lib/iconv.o
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:133:
../deps/libiconv/lib/utf7.h:162:13: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+base64count+1)
~ ^ ~~~~~~~~~~~~~~~~~~~
../deps/libiconv/lib/utf7.h:331:11: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count)
~ ^ ~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:209:
../deps/libiconv/lib/jisx0208.h:2381:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0100)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:210:
../deps/libiconv/lib/jisx0212.h:2161:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0460)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:213:
../deps/libiconv/lib/gb2312.h:2539:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0460)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:214:
In file included from ../deps/libiconv/lib/isoir165.h:81:
../deps/libiconv/lib/isoir165ext.h:760:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0200)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:217:
In file included from ../deps/libiconv/lib/cns11643.h:38:
../deps/libiconv/lib/cns11643_inv.h:15373:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0100)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:218:
../deps/libiconv/lib/big5.h:4124:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0100)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:220:
../deps/libiconv/lib/ksc5601.h:2988:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0460)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:235:
In file included from ../deps/libiconv/lib/gb18030.h:186:
../deps/libiconv/lib/gb18030uni.h:185:23: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (i >= 0 && i <= 39419) {
~ ^ ~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:235:
../deps/libiconv/lib/gb18030.h:249:25: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (i >= 0 && i < 0x100000) {
~ ^ ~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:238:
../deps/libiconv/lib/hz.h:39:13: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+2)
~ ^ ~~~~~~~
../deps/libiconv/lib/hz.h:51:17: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+1)
~ ^ ~~~~~~~
../deps/libiconv/lib/hz.h:57:17: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+1)
~ ^ ~~~~~~~
../deps/libiconv/lib/hz.h:65:17: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+1)
~ ^ ~~~~~~~
../deps/libiconv/lib/hz.h:80:11: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+2)
~ ^ ~~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:241:
In file included from ../deps/libiconv/lib/cp950.h:130:
../deps/libiconv/lib/cp950ext.h:39:11: warning: equality comparison with extraneous parentheses [-Wparentheses-equality]
if ((c1 == 0xf9)) {
~~~^~~~~~~
../deps/libiconv/lib/cp950ext.h:39:11: note: remove extraneous parentheses around the comparison to silence this warning
if ((c1 == 0xf9)) {
~ ^ ~
../deps/libiconv/lib/cp950ext.h:39:11: note: use '=' to turn this equality comparison into an assignment
if ((c1 == 0xf9)) {
^~
=
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:242:
In file included from ../deps/libiconv/lib/big5hkscs1999.h:46:
../deps/libiconv/lib/hkscs1999.h:2957:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x02d0)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:243:
In file included from ../deps/libiconv/lib/big5hkscs2001.h:48:
../deps/libiconv/lib/hkscs2001.h:63:11: warning: equality comparison with extraneous parentheses [-Wparentheses-equality]
if ((c1 == 0x8c)) {
~~~^~~~~~~
../deps/libiconv/lib/hkscs2001.h:63:11: note: remove extraneous parentheses around the comparison to silence this warning
if ((c1 == 0x8c)) {
~ ^ ~
../deps/libiconv/lib/hkscs2001.h:63:11: note: use '=' to turn this equality comparison into an assignment
if ((c1 == 0x8c)) {
^~
=
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:245:
In file included from ../deps/libiconv/lib/big5hkscs2008.h:48:
../deps/libiconv/lib/hkscs2008.h:59:11: warning: equality comparison with extraneous parentheses [-Wparentheses-equality]
if ((c1 == 0x87)) {
~~~^~~~~~~
../deps/libiconv/lib/hkscs2008.h:59:11: note: remove extraneous parentheses around the comparison to silence this warning
if ((c1 == 0x87)) {
~ ^ ~
../deps/libiconv/lib/hkscs2008.h:59:11: note: use '=' to turn this equality comparison into an assignment
if ((c1 == 0x87)) {
^~
=
In file included from ../deps/libiconv/lib/iconv.c:136:
In file included from ../deps/libiconv/lib/loops.h:23:
../deps/libiconv/lib/loop_unicode.h:47:28: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(sub_outcount <= outleft)) abort();
~~~~~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:91:32: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(sub_outcount <= outleft)) abort();
~~~~~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:142:28: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(sub_outcount <= outleft)) abort();
~~~~~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:258:22: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(outcount <= outleft)) abort();
~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:418:22: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(outcount <= outleft)) abort();
~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:422:19: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(incount <= inleft)) abort();
~~~~~~~ ^ ~~~~~~
../deps/libiconv/lib/loop_unicode.h:503:24: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(outcount <= outleft)) abort();
~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:519:22: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(outcount <= outleft)) abort();
~~~~~~~~ ^ ~~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:154:
lib/aliases.gperf:779:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:14: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:20: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:26: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:32: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:38: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:44: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:309:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:289:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:208:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:245:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1},
^
lib/aliases.gperf:245:14: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1},
^
lib/aliases.gperf:181:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:324:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:178:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:52:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:312:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:196:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:196:14: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:196:20: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:362:28: warning: static variable 'aliases' is used in an inline function with external linkage [-Wstatic-in-inline]
register int o = aliases[key].name;
^
lib/aliases.gperf:348:1: note: use 'static' to give inline function 'aliases_lookup' internal linkage
__inline
^
static
lib/aliases.gperf:777:27: note: 'aliases' declared here
static const struct alias aliases[] =
^
lib/aliases.gperf:365:44: warning: static variable 'stringpool_contents' is used in an inline function with external linkage
[-Wstatic-in-inline]
register const char *s = o + stringpool;
^
lib/aliases.gperf:775:37: note: expanded from macro 'stringpool'
#define stringpool ((const char *) &stringpool_contents)
                                 ^

lib/aliases.gperf:348:1: note: use 'static' to give inline function 'aliases_lookup' internal linkage
__inline
^
static
lib/aliases.gperf:425:34: note: 'stringpool_contents' declared here
static const struct stringpool_t stringpool_contents =
^
lib/aliases.gperf:368:25: warning: static variable 'aliases' is used in an inline function with external linkage [-Wstatic-in-inline]
return &aliases[key];
^
lib/aliases.gperf:348:1: note: use 'static' to give inline function 'aliases_lookup' internal linkage
__inline
^
static
lib/aliases.gperf:777:27: note: 'aliases' declared here
static const struct alias aliases[] =
^
../deps/libiconv/lib/iconv.c:188:22: warning: static variable 'stringpool2_contents' is used in an inline function with external linkage
[-Wstatic-in-inline]
if (!strcmp(str, stringpool2 + ptr->name))
^
../deps/libiconv/lib/iconv.c:173:38: note: expanded from macro 'stringpool2'
#define stringpool2 ((const char *) &stringpool2_contents)
                                  ^

../deps/libiconv/lib/iconv.c:180:1: note: use 'static' to give inline function 'aliases2_lookup' internal linkage
__inline
^
static
../deps/libiconv/lib/iconv.c:168:35: note: 'stringpool2_contents' declared here
static const struct stringpool2_t stringpool2_contents = {
^
In file included from ../deps/libiconv/lib/iconv.c:238:
../deps/libiconv/lib/iconv_open2.h:84:32: warning: unused variable 'wcd' [-Wunused-variable]
struct wchar_conv_struct * wcd = (struct wchar_conv_struct *) cd;
^
In file included from ../deps/libiconv/lib/iconv.c:294:
../deps/libiconv/lib/iconv_open2.h:84:32: warning: unused variable 'wcd' [-Wunused-variable]
struct wchar_conv_struct * wcd = (struct wchar_conv_struct *) cd;
^
In file included from ../deps/libiconv/lib/iconv.c:136:
In file included from ../deps/libiconv/lib/loops.h:24:
../deps/libiconv/lib/loop_wchar.h:470:15: warning: unused function 'wchar_id_loop_reset' [-Wunused-function]
static size_t wchar_id_loop_reset (iconv_t icd,
^
624 warnings generated.

Detecting charset from <meta>

    // detecting charset from <meta> in the content of a web page
    var data = response.body;

    var meta_charset = "";
    var meta = (data.toString() + "<meta >").match(/<meta ([^>]+)>/g);
    for (var idx = 0; idx < meta.length; idx++) {
        var charset = (meta[idx] + "charset=undefined ").match(/charset=([\w-]+)["' ]/g)[0].replace(/(charset=)|(["' ]$)|-/g, "");
        if (charset != "undefined") {
            meta_charset = charset;
        }
    }
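A slightly more defensive variant of the snippet above, as a sketch: it guards against pages with no <meta> tag at all (where match() returns null) and normalises the result. The function name detectMetaCharset is illustrative:

```javascript
// Scan <meta> tags for a declared charset; return it lowercased,
// or null when the page declares none.
function detectMetaCharset(body) {
  var metas = body.toString().match(/<meta[^>]+>/gi) || [];
  for (var i = 0; i < metas.length; i++) {
    var m = metas[i].match(/charset\s*=\s*["']?([\w-]+)/i);
    if (m) {
      return m[1].toLowerCase();
    }
  }
  return null; // no charset declared
}

console.log(detectMetaCharset('<meta charset="UTF-8">'));
console.log(detectMetaCharset('<p>no meta here</p>'));
```

This handles both the HTML5 form (<meta charset="...">) and the older http-equiv form, since either way the charset= token appears inside the tag.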

install problem

npm install crawler does not work.

I am using Ubuntu 11.10.

npm ERR! Unsupported
npm ERR! Not compatible with your operating system or architecture: [email protected]
npm ERR! Valid OS: linux,macos,win
npm ERR! Valid Arch: x86,ppc,x86_64
npm ERR! Actual OS: linux
npm ERR! Actual Arch: ia32
npm ERR!
npm ERR! System Linux 3.0.0-19-generic
npm ERR! command "node" "/usr/bin/npm" "install" "crawler"
npm ERR! cwd /var/www/nodetest
npm ERR! node -v v0.6.16
npm ERR! npm -v 1.1.19
npm ERR! code EBADPLATFORM
npm ERR! message Unsupported
npm ERR! errno {}
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /var/www/nodetest/npm-debug.log
npm not ok

thanks.

Does not handle gzipped/compressed response

I am trying to crawl a site which uses gzip on most of its servers. The only thing the callback sees is the binary gzipped body. Needless to say, $ does not work at all in the callback.

Error when the response is not HTML

Sometimes a link looks like it will serve HTML but in fact does not.
Currently that causes an error (jsdom).

I propose to add something like this (line 118) — note the conditions must be joined with && (with || the check is always true):

                        var contentType = response.headers['content-type'] || '';
                        if (contentType.indexOf('text/html') < 0
                            && contentType.indexOf('application/xhtml') < 0
                            && contentType.indexOf('application/xml') < 0) {
                            return ???????
                        }
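The check can be pulled out into a small predicate, shown here as a sketch (isParsableContentType is an illustrative name, not part of the crawler's API):

```javascript
// Only hand the body to jsdom when the Content-Type header actually
// announces a markup type the parser can handle.
function isParsableContentType(headers) {
  var contentType = (headers['content-type'] || '').toLowerCase();
  return contentType.indexOf('text/html') >= 0
      || contentType.indexOf('application/xhtml') >= 0
      || contentType.indexOf('application/xml') >= 0;
}

console.log(isParsableContentType({ 'content-type': 'text/html; charset=utf-8' }));
console.log(isParsableContentType({ 'content-type': 'image/png' }));
```

Inverting the predicate gives the skip condition from the snippet above, without the easy-to-miss operator mistake.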
