bda-research / node-crawler
Web Crawler/Spider for NodeJS + server-side jQuery ;-)
License: MIT License
Why doesn't crawler use cheerio (https://github.com/MatthewMueller/cheerio)? It's a cleaner way to use jQuery selectors on the server side. I modified crawler on my machine and it works very well.
No matter which URL I try to visit, I get this error. Is JSDOM going wrong?
[
{
"type": "error",
"message": "Running http://www.longshanks-consulting.com/:undefined:undefined<script> failed.",
"data": {
"error": {},
"filename": "http://www.longshanks-consulting.com/:undefined:undefined<script>"
}
}
]
Hi Sylvain,
I'm trying to install your crawler in my Ubuntu T1 instance
node --version: v0.10.12 & npm --version: 1.2.32
but it says:
npm install crawler
npm ERR! Failed to parse json
npm ERR! Unexpected token g
npm ERR! File: /home/ubuntu/package.json
npm ERR! Failed to parse package.json data.
npm ERR! package.json must be actual JSON, not just JavaScript.
npm ERR!
npm ERR! This is not a bug in npm.
npm ERR! Tell the package author to fix their package.json file. JSON.parse
npm ERR! System Linux 3.2.0-40-virtual
npm ERR! command "/home/ubuntu/.nvm/v0.10.12/bin/node" "/home/ubuntu/.nvm/v0.10.12/bin/npm" "install" "crawler"
npm ERR! cwd /home/ubuntu
npm ERR! node -v v0.10.12
npm ERR! npm -v 1.2.32
npm ERR! file /home/ubuntu/package.json
npm ERR! code EJSONPARSE
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /home/ubuntu/npm-debug.log
npm ERR! not ok code 0
Opening the npm-debug.log up:
0 info it worked if it ends with ok
1 verbose cli [ '/home/ubuntu/.nvm/v0.10.12/bin/node',
1 verbose cli '/home/ubuntu/.nvm/v0.10.12/bin/npm',
1 verbose cli 'install',
1 verbose cli 'crawler' ]
2 info using [email protected]
3 info using [email protected]
4 verbose read json /home/ubuntu/package.json
5 error Failed to parse json
5 error Unexpected token g
6 error File: /home/ubuntu/package.json
7 error Failed to parse package.json data.
7 error package.json must be actual JSON, not just JavaScript.
7 error
7 error This is not a bug in npm.
7 error Tell the package author to fix their package.json file. JSON.parse
8 error System Linux 3.2.0-40-virtual
9 error command "/home/ubuntu/.nvm/v0.10.12/bin/node" "/home/ubuntu/.nvm/v0.10.12/bin/npm" "install" "crawler"
10 error cwd /home/ubuntu
11 error node -v v0.10.12
12 error npm -v 1.2.32
13 error file /home/ubuntu/package.json
14 error code EJSONPARSE
15 verbose exit [ 1, true ]
What do I have to do?
Thank you in advance for your kind help.
Kind regards
Marco
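For what it's worth, the log above points at /home/ubuntu/package.json, i.e. the project's own manifest, not crawler's; the "Unexpected token g" suggests it contains JavaScript or a bare word rather than strict JSON. A minimal valid package.json might look like this (the name and version are placeholders):

```json
{
  "name": "my-crawler-app",
  "version": "0.0.1",
  "dependencies": {
    "crawler": "*"
  }
}
```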
Hi,
I just installed Node.js (v0.4.9) and your module, and I got:
http://jamendo.com/
http://tedxparis.com
SyntaxError: Unexpected token ILLEGAL
at Object.javascript (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/languages/javascript.js:17:14)
at Object._eval (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:1195:46)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:43:20
at Object.check (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:235:11)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:251:12
at IncomingMessage. (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:85:11)
at IncomingMessage.emit (events.js:81:20)
at HTTPParser.onMessageComplete (http.js:133:23)
at Client.onData as ondata
at Client._onReadable (net.js:683:27)
TypeError: Cannot read property 'prototype' of undefined
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/browser/index.js:84:16
at String. ([object Context]:19:13786)
at Function.each ([object Context]:12:7964)
at Object.add ([object Context]:19:13498)
at [object Context]:19:19008
at Function.each ([object Context]:12:7964)
at Object.each ([object Context]:12:1155)
at Object.one ([object Context]:19:18984)
at Object.bind ([object Context]:19:18793)
at [object Context]:19:21559
at [object Context]:19:39657
at Object.javascript (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/languages/javascript.js:17:14)
at Object._eval (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:1195:46)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:43:20
at Object.check (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:235:11)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:251:12
at IncomingMessage. (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:85:11)
at IncomingMessage.emit (events.js:81:20)
at HTTPParser.onMessageComplete (http.js:133:23)
at Client.onEnd as onend
at Client._onReadable (net.js:659:26)
at IOWatcher.onReadable as callback
TypeError: Cannot read property 'prototype' of undefined
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/browser/index.js:84:16
at String. ([object Context]:19:13786)
at Function.each ([object Context]:12:7964)
at Object.add ([object Context]:19:13498)
at B ([object Context]:19:21262)
at Object.ready ([object Context]:19:19655)
at [object Context]:1:18
at Object.javascript (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/languages/javascript.js:17:14)
at Object._eval (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:1195:46)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:43:20
at Object.check (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:235:11)
at Object.check (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:239:23)
at /home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:251:12
at IncomingMessage. (/home/fabien/Projets/crawler/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:85:11)
at IncomingMessage.emit (events.js:81:20)
at HTTPParser.onMessageComplete (http.js:133:23)
at Client.onEnd as onend
at Client._onReadable (net.js:659:26)
at IOWatcher.onReadable as callback
To avoid JSDOM and gain better performance, can cheerio be plugged in instead? It's supposed to be (partially) compatible with jQuery.
I tried to crawl this page:
http://list.jd.com/737-794-870-0-0-0-0-0-0-0-1-1-1-1-1-72-33.html
It seems not to handle 'gbk' encoding well.
It should be very easy to plug in a MongoDB / memcached cache.
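A minimal sketch of what such a pluggable cache interface could look like (the names are illustrative, not part of the crawler API); a MongoDB or memcached backend would implement the same get/set shape:

```javascript
// Illustrative cache interface; a MongoDB/memcached adapter would
// implement the same methods. Backed here by an in-memory Map.
class MemoryCache {
  constructor() {
    this.store = new Map();
  }
  get(uri) {
    return this.store.has(uri) ? this.store.get(uri) : null;
  }
  set(uri, body) {
    this.store.set(uri, body);
  }
}

const cache = new MemoryCache();
cache.set('http://example.com/', '<html>cached</html>');
console.log(cache.get('http://example.com/')); // <html>cached</html>
console.log(cache.get('http://example.com/missing')); // null
```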
Every test (except maybe one, on a reliable URL like google.com?) should hit the local mock server.
Hello,
I'm trying to install node-crawler using node version 0.4.12 and npm version 1.0.22
Node-crawler requires a 0.4.x version of node, but it seems that a dependency, qunit, requires a node version of >= 0.5
Is it even possible to install this?
npm ERR! error installing [email protected] Error: Unsupported
npm ERR! error installing [email protected] at checkEngine (/usr/local/lib/node_modules/npm/lib/install.js:570:14)
npm ERR! error installing [email protected] at Array.0 (/usr/local/lib/node_modules/npm/node_modules/slide/lib/bind-actor.js:15:8)
npm ERR! error installing [email protected] at LOOP (/usr/local/lib/node_modules/npm/node_modules/slide/lib/chain.js:15:13)
npm ERR! error installing [email protected] at chain (/usr/local/lib/node_modules/npm/node_modules/slide/lib/chain.js:20:4)
npm ERR! error installing [email protected] at installOne_ (/usr/local/lib/node_modules/npm/lib/install.js:548:3)
npm ERR! error installing [email protected] at installOne (/usr/local/lib/node_modules/npm/lib/install.js:488:3)
npm ERR! error installing [email protected] at /usr/local/lib/node_modules/npm/lib/install.js:425:9
npm ERR! error installing [email protected] at /usr/local/lib/node_modules/npm/node_modules/slide/lib/async-map.js:54:35
npm ERR! error installing [email protected] at Array.forEach (native)
npm ERR! error installing [email protected] at /usr/local/lib/node_modules/npm/node_modules/slide/lib/async-map.js:54:11
npm ERR! Unsupported
npm ERR! Not compatible with your version of node/npm: [email protected]
npm ERR! Required: {"node":">= 0.5.0"}
npm ERR! Actual: {"npm":"1.0.22","node":"v0.4.12"}
npm ERR!
npm ERR! System Darwin 11.1.0
npm ERR! command "node" "/usr/local/bin/npm" "install" "crawler"
npm ERR! cwd /Users/thomasbrus/Code/wordcount
npm ERR! node -v v0.4.12
npm ERR! npm -v 1.0.22
The following files couldn't be removed.
Remove them manually and try again
sudo rm -rf "/Users/thomasbrus/node_modules/crawler"
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /Users/thomasbrus/Code/wordcount/npm-debug.log
npm not ok
I've got this error while trying to install using
npm install git://github.com/daraosn/node-crawler.git --save
Any suggestions?
spawn python [ '/Users/cardosor/.node-gyp/0.6.15/tools/gyp_addon',
'binding.gyp',
'-I/Users/cardosor/Sites/devel/nodejs/photod/node_modules/crawler/node_modules/jsdom/node_modules/contextify/build/config.gypi',
'-f',
'make' ]
ERR! Error: not found: make
at F (/usr/local/lib/node_modules/npm/node_modules/which/which.js:43:28)
at E (/usr/local/lib/node_modules/npm/node_modules/which/which.js:46:29)
at Object.oncomplete (/usr/local/lib/node_modules/npm/node_modules/which/which.js:57:16)
ERR! not ok
npm WARN optional dependency failed, continuing [email protected]
jQuery ($ parameter) receives incorrect (file://) hrefs
NodeJS v0.4.9
var Crawler = require("crawler").Crawler;
var c = new Crawler({
"maxConnections":10,
"callback":function(error,result,$) {
}
});
c.queue(["http://www.jamendo.com/"]);
http://www.jamendo.com/
SyntaxError: Unexpected token ILLEGAL
at Object.javascript (/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/languages/javascript.js:17:14)
at Object._eval (/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:1195:46)
at /node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:43:20
at Object.check (/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:235:11)
at /node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:251:12
at IncomingMessage. (/node_modules/crawler/node_modules/jsdom/lib/jsdom/level2/html.js:85:11)
at IncomingMessage.emit (events.js:81:20)
at HTTPParser.onMessageComplete (http.js:133:23)
at Client.onData as ondata
at Client._onReadable (net.js:683:27)
Ran this simple code:
var Crawler = require("crawler").Crawler;
var c = new Crawler({
"maxConnections":10,
"callback":function(error,result,$) {
console.log(result.body);
console.log($('body').text());
}
});
c.queue("http://google.com");
result.body would show the HTML, but $('body').text() gave me this error:
TypeError: undefined is not a function
at Object.Crawler.callback (C:\_besteat\samples\crawler\app.js:17:33)
at exports.Crawler.self.onContent.jsdom.env.done (C:\_besteat\samples\crawler\node_modules\crawler\lib\crawler.js:255:37)
at exports.env.exports.jsdom.env.scriptComplete (C:\_besteat\samples\crawler\node_modules\crawler\node_modules\jsdom\lib\jsdom.js:205:39)
at process.startup.processNextTick.process._tickCallback (node.js:244:9)
I was tracing back the code to this
fs.readFile(toQueue.jQueryUrl.replace(/^file\:\/\//,""),"utf-8",function(err,jq) {
if (err) {
toQueue.callback(err);
release(toQueue);
} else {
try {
jsd([jq]);
} catch (e) {
toQueue.callback(e);
release(toQueue);
}
}
});
It looks like toQueue.callback(e) was called because jsd([jq]) failed. That led me to jsdom. Somewhere between this:
window = exports.html(html, null, options).createWindow()
and this:
features = JSON.parse(JSON.stringify(window.document.implementation._features))
window.document.implementation doesn't exist. I haven't gone further to find out why; it was confusing. Any help is appreciated.
I'm creating crawler as:
var c = new Crawler({
'maxConnections':1,
'callback': function(error, result, $){
$('td.title + a').each(function(a){
var tmp = new blogitem(a.val, a.href);
news.push(tmp);
});
uploadPosts(news);
}
});
setInterval(function(){
c.queue([{
'uri': 'http://127.0.0.1/posts',
'timeout': 120
}]);
}, 10000);
When I run server.js, it gives TypeError: Cannot read property 'firstChild' of null. I've tried just putting sys.puts(response) in the callback, but the result is the same...
Hi.
First, I'd like to thank you for creating this amazingly simple crawler :) It's amazingly useful for any kind of research on the web.
While using your module at quite a large scale, I spotted an extreme slowdown after ~1000 crawled pages.
Is there any kind of cache I can turn off, or any other way to prevent this slowdown?
I really hope there is a way to speed up node-crawler.
yours sincerely
Daniel
Maybe it's just me but I'm getting an undefined jQuery response when running test/simple.js
$("a").each(function(i,a) {
^
TypeError: undefined is not a function
A lot of the time, passing local data associated with the queued URL through to the callback is required. For me this is useful for building a relationship tree across gathered scraps, or micro objects thrown into CouchDB with nano, which I then combine into bigger objects later. Passing jQuery-gathered data while queuing the URL lets me store attributes such as the kind of link or surrounding taxonomy hints, which in some cases cannot be gathered from the page's own markup, depending on the flow of the website being crawled. I suggest having data passed to the callback while queuing. Currently I append my data as a GET query string to the URL being queued and then parse request.uri back in the callback to recover the passed data. Even though this gets the job done, it is intrusive (the server admins can see what we are cooking) and overall not elegant.
I copied the sample code and added below to point to jQuery file
"jQueryUrl":'/vendor/jquery-1.8.3.min.js',
Then ran it but got this error:
C:\node_modules\crawler\node_modules\crawler\lib\crawler.js:275
toQueue.callback(e);
^
ReferenceError: e is not defined
    at exports.Crawler.self.onContent (C:\node_modules\crawler\node_modules\crawler\lib\crawler.js:275:46)
    at fs.js:117:20
    at Object.oncomplete (fs.js:297:15)
Obviously it's something about reading the jQuery file. What is the right way to fix this?
This code:
var crawler = require('crawler').Crawler;
var c = new crawler({
"maxConnections": 10,
"forceUTF8": true,
"debug": true,
"callback": function(err,result,$) {
console.log(result.body);
}
});
c.queue("http://marketwatch.com/sitemap-search-index.xml.gz");
Produces this text:
I%&/m{JJ`$@iG#)*eVe]f@흼{{;N'?\fdlJɞ!?~|?"ǿ|{[e^7EGiVbyG|qS"[Y.7wMG]={j|uo\wvvv_<=Ųi4H
p~xwg#y.m^et>V
}ީGw~6zv?}tg{7~G?+GvY{{Dy7~G{7{Odtӟn
vwwo.Y^mRgam5ˮoCB_|m*ѺjᾉwaclHlhA0oVA{'+m,=
;hcf7vV"Ն
The problem is that result.body is a string, and this string no longer has the information needed to be passed to gunzip.
More specifically this is what I am attempting to do:
var crawler = require('crawler').Crawler,
zlib = require('zlib');
var c = new crawler({
"maxConnections": 1,
"debug": true,
"callback": function(err,result,$) {
zlib.unzip(new Buffer(result.body), function(err, buffer) {
if(err) throw err;
console.log(buffer.toString());
});
}
});
c.queue("http://marketwatch.com/sitemap-search-index.xml.gz");
But it produces this result:
GET http://marketwatch.com/sitemap-search-index.xml.gz ...
Got http://marketwatch.com/sitemap-search-index.xml.gz (634 bytes)...
/home/sparq/ml/test_crawler.js:9
if(err) throw err;
^
Error: incorrect header check
at Zlib._binding.onerror (zlib.js:295:17)
I am using this simple example and found a problem:
var Crawler = require("crawler").Crawler;
var c = new Crawler({
"maxConnections":10,
"callback":function(error,result,$) {
console.log(result.body);
//console.log($('body').text());
}
});
c.queue("http://google.com");
In this particular line:
var req = request(_.pick.apply(this,[ropts].concat(requestArgs)), function(error,response,body) {
The input JSON was as following:
{ uri: 'http://google.com',
method: 'GET',
headers:
{ 'Accept-Encoding': 'gzip',
'User-Agent': 'node-crawler/0.2.3' },
encoding: null,
timeout: 60000 }
As you can see, I have to delete "encoding: null" in order for the body variable to work. Why is it "null", and how can this be fixed for websites like google.com? The platform I am working on is Windows, btw (if that makes a difference).
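For context, encoding: null tells the underlying request library to return the body as a raw Buffer instead of decoding it to a string (which matters for gzip and non-UTF-8 pages). A callback can handle both cases defensively; a small sketch:

```javascript
// With "encoding": null the body arrives as a Buffer of raw bytes;
// without it, as a utf8 string. Handle both shapes:
const body = Buffer.from('<html>hello</html>'); // stands in for a Buffer body
const text = Buffer.isBuffer(body) ? body.toString('utf8') : body;
console.log(text); // <html>hello</html>
```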
Hi, is there a way to set the maximum number of links to follow?
var r = require('crawler/node_modules/request');
var Crawler = require('crawler').Crawler;
var crawler = new Crawler({jar: r.jar()});
crawler.queue({
uri: 'http://github.com',
callback: test
}, {
uri: 'http://gitcities.com',
callback: test
});
function test(err, res) {
console.log(res.headers);
}
throws
Object #<Object> has no method 'get'
TypeError: Object #<Object> has no method 'get'
at Request.jar (/Users/anatoliy/projects/opensource/node-crawler/node_modules/request/index.js:1139:19)
at Request.init (/Users/anatoliy/projects/opensource/node-crawler/node_modules/request/index.js:227:8)
at new Request
(skipped rest of stacktrace)
I will send a pull request, as we need it published ASAP. Thanks!
I saw that there was a commit for proxy support that isn't yet published to npm. Is there any way to provide authentication for the proxies?
Thanks!
node test/simple.js
http://jamendo.com/
http://tedxparis.com
/home/crawler/node_modules/crawler/lib/crawler.js:74
response.body = body;
^
TypeError: Cannot set property 'body' of undefined
at Object.callback (/home/crawler/node_modules/crawler/lib/crawler.js:74:39)
at [object Object].callback (/home/crawler/node_modules/crawler/lib/crawler.js:70:43)
at [object Object]. (/home/crawler/node_modules/crawler/node_modules/request/main.js:151:67)
at [object Object].emit (events.js:64:17)
at Object._onTimeout (/home/crawler/node_modules/crawler/node_modules/request/main.js:300:19)
at Timer.callback (timers.js:83:39)
I tried to crawl this page: http://www.books.com.tw/exep/prod/booksfile.php?item=0010000004. It was Big5 but was detected as windows-1252. Whether or not I set forceUTF8, the output was garbled.
There should be an option to set your own User-Agent string.
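Until a dedicated option exists, a per-request User-Agent can likely be set through custom request headers on each queued item (whether crawler forwards headers depends on the version; the helper below is illustrative, not crawler API):

```javascript
// Wrap a URI in a queue item carrying a custom User-Agent header.
function withUserAgent(uri, ua) {
  return { uri, headers: { 'User-Agent': ua } };
}

const item = withUserAgent('http://example.com/', 'MyBot/1.0 (+http://example.com/bot)');
console.log(item.headers['User-Agent']); // MyBot/1.0 (+http://example.com/bot)
```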
How fast is it? Do any benchmarks exist?
Running the tests, I'm able to see the full HTML of the response in result.content, but any attempt to access it via jQuery returns nothing (no errors, etc.); if (jQuery) console.log('loaded') is not triggered either.
Here's the code, with several different variations on my attempts to access it via jQuery:
var Crawler = require("../lib/crawler").Crawler,
    c = new Crawler({
        "maxConnections":10,
        "timeout":120*1000, // 120 seconds
        "debug":false,
        "callback":function(error,result,$) {
            console.log("Got page");
            //console.log(result.content);
            //if (jQuery) console.log('loaded') ;
            //response = $('
        }
    });
c.queue(["http://urbanthesaur.us"]);
D:\node\polgastro\node_modules\crawler\lib\crawler.js:296
if (callbackError) throw callbackError;
^
TypeError: undefined is not a function
at Object.Crawler.callback (D:\node\polgastro\crawlr.js:10:3)
at jsdom.env.done (D:\node\polgastro\node_modules\crawler\lib\crawler.js:272:41)
at D:\node\polgastro\node_modules\crawler\node_modules\jsdom\lib\jsdom.js:252:9
at process._tickCallback (node.js:415:13)
at Function.Module.runMain (module.js:499:11)
at startup (node.js:119:16)
at node.js:901:3
(I don't speak English well.)
Question 1: testing the crawler with
npm install && npm test
fails with:
Assertion failed: req->work_cb, file src\win\threadpool.c, line 46
How do I solve this error?
os: Windows 7 (32-bit)
nodejs: 0.8.9
npm: 1.1.61
Question 2: does the output below mean npm test succeeded?
Global summary:
┏━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Files ┃ Tests ┃ Assertions ┃ Failed ┃ Passed ┃ Runtime ┃
┣━━━━━━━╋━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━╋━━━━━━━━╋━━━━━━━━━┫
┃ 1     ┃ 12    ┃ 31         ┃ 0      ┃ 31     ┃ 9001    ┃
┗━━━━━━━┻━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━┻━━━━━━━━┻━━━━━━━━━┛
If the crawler encounters a page such as http://hypem.com/blog/a/1?ax=1 that contains an HTML fragment and not a complete document, the crawler will crash when trying to appendChild() a jQuery script element on the undefined window.document.body object. One option might be to add support for the HTML5 parser library which should be able to handle fragments: https://github.com/aredridel/html5
A short repro:
var Crawler = require('node-crawler').Crawler;
var crawler = new Crawler({
maxConnections: 1,
callback: function(err, res, $) {
console.log('worked!'); // The app will crash before this point
}
});
crawler.queue(['http://hypem.com/blog/a/1?ax=1']);
Hi,
Does node-crawler obey the robots.txt exclusion standard, described at http://www.robotstxt.org/wc/exclusion.html#robotstxt, and the robots META tag, described at http://www.robotstxt.org/wc/meta-user.html? If not, how can this be achieved with node-crawler?
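Assuming node-crawler leaves robots.txt handling to the caller, a minimal caller-side check could fetch /robots.txt and test each path against the Disallow rules for the '*' user-agent before queuing it. A simplified sketch (it ignores Allow rules and per-bot groups):

```javascript
// Return true if `path` is not covered by a Disallow rule in the '*' group.
function isAllowed(robotsTxt, path) {
  let applies = false;
  const disallows = [];
  for (const line of robotsTxt.split('\n')) {
    const [field, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    const name = field.trim().toLowerCase();
    if (name === 'user-agent') applies = value === '*';
    else if (name === 'disallow' && applies && value) disallows.push(value);
  }
  return !disallows.some((prefix) => path.startsWith(prefix));
}

const robots = 'User-agent: *\nDisallow: /private/';
console.log(isAllowed(robots, '/private/page')); // false
console.log(isAllowed(robots, '/public/page')); // true
```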
Full error:
/home/overconnected/OverConnected/node_modules/crawler/lib/crawler.js:296
if (callbackError) throw callbackError;
^
TypeError: Cannot read property 'timeout' of undefined
at /home/overconnected/OverConnected/node_modules/crawler/node_modules/underscore/underscore.js:803:18
at Array.forEach (native)
at _.each._.forEach (/home/overconnected/OverConnected/node_modules/crawler/node_modules/underscore/underscore.js:78:11)
at Function._.defaults (/home/overconnected/OverConnected/node_modules/crawler/node_modules/underscore/underscore.js:800:5)
at self.queue (/home/overconnected/OverConnected/node_modules/crawler/lib/crawler.js:356:11)
at Object.<anonymous> (/home/overconnected/OverConnected/server.js:51:18)
at Function.v.extend.each (http://www.pmlive.com/pharma_appointments/sarah_verhoeff_joins_the_board_at_pan_506016:undefined:undefined<script>:2:14543)
at v.fn.v.each (http://www.pmlive.com/pharma_appointments/sarah_verhoeff_joins_the_board_at_pan_506016:undefined:undefined<script>:2:11217)
at String.<anonymous> (/home/overconnected/OverConnected/server.js:25:20)
at Function.v.extend.each (http://www.pmlive.com/pharma_appointments/sarah_verhoeff_joins_the_board_at_pan_506016:undefined:undefined<script>:2:14543)
$ node test/simple.js
http://jamendo.com/
http://tedxparis.com
/[...]/lib/crawler.js:74
response.body = body;
^
TypeError: Cannot set property 'body' of undefined
at Object.callback (/[...]/crawler/lib/crawler.js:74:39)
at Request.callback (/[...]/crawler/lib/crawler.js:70:43)
at Request. (/[...]/request/main.js:154:67)
at Request.emit (events.js:64:17)
at Object._onTimeout (/[...]/crawler/node_modules/request/main.js:320:19)
at Timer.callback (timers.js:83:39)
I'm always detected as spam, or as a script, and must enter a captcha to continue. How do I solve that?
Does it lie in the scope of this project to support robots.txt?
for unique hosts
Hello,
I'm trying to include node-crawler in a project as a dependency.
I include it like this in my package.json and then run "npm install -d":
"dependencies": {
"crawler": "0.0.2"
}
It is however blocking on qunit install, as the logs mention it:
npm info installOne [email protected]
npm ERR! error installing [email protected] Error: Unsupported
npm ERR! error installing [email protected] at checkEngine (/usr/local/lib/node_modules/npm/lib/install.js:561:14)
npm ERR! error installing [email protected] at nextStep (/usr/local/lib/node_modules/npm/lib/utils/chain.js:54:8)
npm ERR! error installing [email protected] at chain (/usr/local/lib/node_modules/npm/lib/utils/chain.js:27:3)
npm ERR! error installing [email protected] at installOne_ (/usr/local/lib/node_modules/npm/lib/install.js:539:3)
npm ERR! error installing [email protected] at installOne (/usr/local/lib/node_modules/npm/lib/install.js:479:3)
npm ERR! error installing [email protected] at /usr/local/lib/node_modules/npm/lib/install.js:421:9
npm ERR! error installing [email protected] at /usr/local/lib/node_modules/npm/lib/utils/async-map.js:57:35
npm ERR! error installing [email protected] at Array.forEach (native)
npm ERR! error installing [email protected] at /usr/local/lib/node_modules/npm/lib/utils/async-map.js:57:11
npm ERR! error installing [email protected] at Array.forEach (native)
...
npm info rm fail ENOTEMPTY, Directory not empty '/path/to/node_modules/crawler'
npm ERR! Unsupported
npm ERR! Not compatible with your version of node/npm: [email protected]
npm ERR! Required: {"node":">= 0.5.0"}
npm ERR! Actual: {"npm":"1.0.15","node":"v0.4.12"}
npm ERR!
npm ERR! System Darwin 11.1.0
npm ERR! command "node" "/usr/local/bin/npm" "install" "-d"
After installing crawler it says:
$npm ls
npm ERR! extraneous: [email protected] /home/ubuntu/node_modules/crawler
npm ERR! not ok code 0
What should I do to fix the problem?
Kind regards
Marco
When there is an error, response is undefined, and we get a fatal error from trying to assign 'uri' to it. It could be fixed like this:
request(q, function(error,response,body) {
if (response) {
response.uri = q.uri;
}
onContent(error,response,body,false);
});
So in the error case, it will be possible to know which URI caused the error.
If the remote side returns an empty response:
Got http://some-internet-site.com/ (0 bytes)...
there is an exception:
/home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/jsdom/lib/jsdom.js:369
throw new Error("jsdom.env requires a '" + req + "' argument");
^
Error: jsdom.env requires a 'html' argument
at /home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/jsdom/lib/jsdom.js:369:13
at Array.forEach (native)
at Function.processArguments (/home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/jsdom/lib/jsdom.js:366:12)
at Object.env (/home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/jsdom/lib/jsdom.js:179:29)
at /home/alexio/TRAVEL/graph/node_modules/crawler/lib/crawler.js:119:35
at [object Object].callback (/home/alexio/TRAVEL/graph/node_modules/crawler/lib/crawler.js:179:25)
at [object Object]. (/home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/request/main.js:294:21)
at [object Object].emit (events.js:64:17)
at IncomingMessage. (/home/alexio/TRAVEL/graph/node_modules/crawler/node_modules/request/main.js:281:54)
at IncomingMessage.emit (events.js:81:20)
The version of Request used by Crawler is a bit old and includes the bug reported here: request/request#417 (which in turn might be masking other issues in Crawler or the code calling Crawler). I manually updated the Request module used by Crawler to the latest (2.25.0) and, anecdotally speaking, it seems to work fine.
//Local file was given, skip request
if (toQueue.file) {
fs.readFile( toQueue.file, function (err, data) {
if (err) {
console.error("file reading error: ",err);
return;
};
self.onContent(null,toQueue,{
body:data
},false);
});
return;
}
Currently response.request = toQueue overwrites the .request property set by the 'request' module, so the effective URL and other useful information are no longer available. I propose adding .deepRequest:
line 112
...
} else {
response.content = body;
response.deepRequest = response.request; // added
response.request = toQueue;
if (toQueue.jQuery && toQueue.method!="HEAD") {
...
node-gyp rebuild
CC(target) Release/obj.target/iconv/deps/libiconv/libcharset/lib/localcharset.o
CC(target) Release/obj.target/iconv/deps/libiconv/lib/iconv.o
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:133:
../deps/libiconv/lib/utf7.h:162:13: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+base64count+1)
~ ^ ~~~~~~~~~~~~~~~~~~~
../deps/libiconv/lib/utf7.h:331:11: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count)
~ ^ ~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:209:
../deps/libiconv/lib/jisx0208.h:2381:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0100)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:210:
../deps/libiconv/lib/jisx0212.h:2161:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0460)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:213:
../deps/libiconv/lib/gb2312.h:2539:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0460)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:214:
In file included from ../deps/libiconv/lib/isoir165.h:81:
../deps/libiconv/lib/isoir165ext.h:760:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0200)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:217:
In file included from ../deps/libiconv/lib/cns11643.h:38:
../deps/libiconv/lib/cns11643_inv.h:15373:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0100)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:218:
../deps/libiconv/lib/big5.h:4124:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0100)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:220:
../deps/libiconv/lib/ksc5601.h:2988:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x0460)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:235:
In file included from ../deps/libiconv/lib/gb18030.h:186:
../deps/libiconv/lib/gb18030uni.h:185:23: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (i >= 0 && i <= 39419) {
~ ^ ~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:235:
../deps/libiconv/lib/gb18030.h:249:25: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (i >= 0 && i < 0x100000) {
~ ^ ~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:238:
../deps/libiconv/lib/hz.h:39:13: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+2)
~ ^ ~~~~~~~
../deps/libiconv/lib/hz.h:51:17: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+1)
~ ^ ~~~~~~~
../deps/libiconv/lib/hz.h:57:17: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+1)
~ ^ ~~~~~~~
../deps/libiconv/lib/hz.h:65:17: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+1)
~ ^ ~~~~~~~
../deps/libiconv/lib/hz.h:80:11: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
if (n < count+2)
~ ^ ~~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:241:
In file included from ../deps/libiconv/lib/cp950.h:130:
../deps/libiconv/lib/cp950ext.h:39:11: warning: equality comparison with extraneous parentheses [-Wparentheses-equality]
if ((c1 == 0xf9)) {
~~~^~~~~~~
../deps/libiconv/lib/cp950ext.h:39:11: note: remove extraneous parentheses around the comparison to silence this warning
if ((c1 == 0xf9)) {
~ ^ ~
../deps/libiconv/lib/cp950ext.h:39:11: note: use '=' to turn this equality comparison into an assignment
if ((c1 == 0xf9)) {
^~
=
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:242:
In file included from ../deps/libiconv/lib/big5hkscs1999.h:46:
../deps/libiconv/lib/hkscs1999.h:2957:12: warning: comparison of unsigned expression >= 0 is always true [-Wtautological-compare]
if (wc >= 0x0000 && wc < 0x02d0)
~~ ^ ~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:243:
In file included from ../deps/libiconv/lib/big5hkscs2001.h:48:
../deps/libiconv/lib/hkscs2001.h:63:11: warning: equality comparison with extraneous parentheses [-Wparentheses-equality]
if ((c1 == 0x8c)) {
~~~^~~~~~~
../deps/libiconv/lib/hkscs2001.h:63:11: note: remove extraneous parentheses around the comparison to silence this warning
if ((c1 == 0x8c)) {
~ ^ ~
../deps/libiconv/lib/hkscs2001.h:63:11: note: use '=' to turn this equality comparison into an assignment
if ((c1 == 0x8c)) {
^~
=
In file included from ../deps/libiconv/lib/iconv.c:71:
In file included from ../deps/libiconv/lib/converters.h:245:
In file included from ../deps/libiconv/lib/big5hkscs2008.h:48:
../deps/libiconv/lib/hkscs2008.h:59:11: warning: equality comparison with extraneous parentheses [-Wparentheses-equality]
if ((c1 == 0x87)) {
~~~^~~~~~~
../deps/libiconv/lib/hkscs2008.h:59:11: note: remove extraneous parentheses around the comparison to silence this warning
if ((c1 == 0x87)) {
~ ^ ~
../deps/libiconv/lib/hkscs2008.h:59:11: note: use '=' to turn this equality comparison into an assignment
if ((c1 == 0x87)) {
^~
=
In file included from ../deps/libiconv/lib/iconv.c:136:
In file included from ../deps/libiconv/lib/loops.h:23:
../deps/libiconv/lib/loop_unicode.h:47:28: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(sub_outcount <= outleft)) abort();
~~~~~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:91:32: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(sub_outcount <= outleft)) abort();
~~~~~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:142:28: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(sub_outcount <= outleft)) abort();
~~~~~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:258:22: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(outcount <= outleft)) abort();
~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:418:22: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(outcount <= outleft)) abort();
~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:422:19: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(incount <= inleft)) abort();
~~~~~~~ ^ ~~~~~~
../deps/libiconv/lib/loop_unicode.h:503:24: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(outcount <= outleft)) abort();
~~~~~~~~ ^ ~~~~~~~
../deps/libiconv/lib/loop_unicode.h:519:22: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long')
[-Wsign-compare]
if (!(outcount <= outleft)) abort();
~~~~~~~~ ^ ~~~~~~~
In file included from ../deps/libiconv/lib/iconv.c:154:
lib/aliases.gperf:779:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:14: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:20: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:26: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:32: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:38: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:779:44: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:309:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:289:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:208:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:245:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1},
^
lib/aliases.gperf:245:14: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1},
^
lib/aliases.gperf:181:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:324:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:178:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:52:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:312:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1},
^
lib/aliases.gperf:196:8: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:196:14: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:196:20: warning: missing field 'encoding_index' initializer [-Wmissing-field-initializers]
{-1}, {-1}, {-1}, {-1}, {-1},
^
lib/aliases.gperf:362:28: warning: static variable 'aliases' is used in an inline function with external linkage [-Wstatic-in-inline]
register int o = aliases[key].name;
^
lib/aliases.gperf:348:1: note: use 'static' to give inline function 'aliases_lookup' internal linkage
__inline
^
static
lib/aliases.gperf:777:27: note: 'aliases' declared here
static const struct alias aliases[] =
^
lib/aliases.gperf:365:44: warning: static variable 'stringpool_contents' is used in an inline function with external linkage
[-Wstatic-in-inline]
register const char *s = o + stringpool;
^
lib/aliases.gperf:775:37: note: expanded from macro 'stringpool'
^
lib/aliases.gperf:348:1: note: use 'static' to give inline function 'aliases_lookup' internal linkage
__inline
^
static
lib/aliases.gperf:425:34: note: 'stringpool_contents' declared here
static const struct stringpool_t stringpool_contents =
^
lib/aliases.gperf:368:25: warning: static variable 'aliases' is used in an inline function with external linkage [-Wstatic-in-inline]
return &aliases[key];
^
lib/aliases.gperf:348:1: note: use 'static' to give inline function 'aliases_lookup' internal linkage
__inline
^
static
lib/aliases.gperf:777:27: note: 'aliases' declared here
static const struct alias aliases[] =
^
../deps/libiconv/lib/iconv.c:188:22: warning: static variable 'stringpool2_contents' is used in an inline function with external linkage
[-Wstatic-in-inline]
if (!strcmp(str, stringpool2 + ptr->name))
^
../deps/libiconv/lib/iconv.c:173:38: note: expanded from macro 'stringpool2'
^
../deps/libiconv/lib/iconv.c:180:1: note: use 'static' to give inline function 'aliases2_lookup' internal linkage
__inline
^
static
../deps/libiconv/lib/iconv.c:168:35: note: 'stringpool2_contents' declared here
static const struct stringpool2_t stringpool2_contents = {
^
In file included from ../deps/libiconv/lib/iconv.c:238:
../deps/libiconv/lib/iconv_open2.h:84:32: warning: unused variable 'wcd' [-Wunused-variable]
struct wchar_conv_struct * wcd = (struct wchar_conv_struct *) cd;
^
In file included from ../deps/libiconv/lib/iconv.c:294:
../deps/libiconv/lib/iconv_open2.h:84:32: warning: unused variable 'wcd' [-Wunused-variable]
struct wchar_conv_struct * wcd = (struct wchar_conv_struct *) cd;
^
In file included from ../deps/libiconv/lib/iconv.c:136:
In file included from ../deps/libiconv/lib/loops.h:24:
../deps/libiconv/lib/loop_wchar.h:470:15: warning: unused function 'wchar_id_loop_reset' [-Wunused-function]
static size_t wchar_id_loop_reset (iconv_t icd,
^
624 warnings generated.
// Detect the charset from <meta> tags in the content of the web page.
var data = response.body;
var meta_charset = "";
// Append a dummy tag so match() never returns null, then scan every <meta ...> tag.
var meta = ( data.toString() + "<meta >" ).match( /<meta ([^>]+)>/g );
for ( var idx = 0; idx < meta.length; idx++ ){
    // Pad with "charset=undefined " so the inner match() always succeeds, then
    // strip the "charset=" prefix, the trailing quote/space, and any hyphens
    // (so e.g. "utf-8" becomes "utf8").
    var charset = ( meta[idx] + "charset=undefined " )
        .match( /charset=([\w-]+)["' ]/g )[0]
        .replace( /(charset=)|(["' ]$)|-/g, "" );
    if ( charset !== "undefined" ){
        meta_charset = charset;
    }
}
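The loop above can be collapsed into a single regex. Here is a tidier sketch of the same idea (detectMetaCharset is my name for it, not part of the crawler, and unlike the snippet above it keeps hyphens in the charset name):

```javascript
// Pull the charset out of the first <meta> tag that declares one.
// Handles both <meta charset="..."> and the older
// <meta http-equiv="Content-Type" content="...; charset=..."> form.
// Returns '' when no charset is declared.
function detectMetaCharset(html) {
    var m = String(html).match(/<meta[^>]+charset\s*=\s*["']?([\w-]+)/i);
    return m ? m[1].toLowerCase() : '';
}
```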
npm install crawler does not work for me.
I am using Ubuntu 11.10.
npm ERR! Unsupported
npm ERR! Not compatible with your operating system or architecture: [email protected]
npm ERR! Valid OS: linux,macos,win
npm ERR! Valid Arch: x86,ppc,x86_64
npm ERR! Actual OS: linux
npm ERR! Actual Arch: ia32
npm ERR!
npm ERR! System Linux 3.0.0-19-generic
npm ERR! command "node" "/usr/bin/npm" "install" "crawler"
npm ERR! cwd /var/www/nodetest
npm ERR! node -v v0.6.16
npm ERR! npm -v 1.1.19
npm ERR! code EBADPLATFORM
npm ERR! message Unsupported
npm ERR! errno {}
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /var/www/nodetest/npm-debug.log
npm not ok
thanks.
I am trying to crawl a site that serves gzip-compressed responses from most of its servers. The only thing the callback sees is the raw gzipped binary body, so needless to say $ does not work at all in the callback.
libxml2:
eroh92@b7e99dd
Sometimes a link appears to serve HTML but ultimately does not, and that causes a jsdom error.
I propose adding something like this (around line 118):
var contentType = response.headers['content-type'] || '';
// Skip jsdom only when NONE of the HTML/XML content types match,
// so the checks must be joined with &&, not ||.
if (contentType.indexOf('text/html') < 0
    && contentType.indexOf('application/xhtml') < 0
    && contentType.indexOf('application/xml') < 0) {
    return ???????
}
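The same guard can be expressed as a small predicate, sketched here (isParseableContentType is a hypothetical helper name, not part of the crawler):

```javascript
// Decide whether a response body is worth feeding to an HTML parser,
// based on its Content-Type header. Parsing is skipped only when none
// of the HTML/XML media types appear in the header.
function isParseableContentType(contentType) {
    contentType = contentType || '';
    return contentType.indexOf('text/html') >= 0 ||
           contentType.indexOf('application/xhtml') >= 0 ||
           contentType.indexOf('application/xml') >= 0;
}
```

A predicate like this is also easier to unit-test than an inline condition buried in the request callback.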