node-js-libs / node.io
License: MIT License
As a test, I'm trying to scrape a very simple HTML page that has divs with links in them.
<html>
<body>
<div class="playerNameAndInfo"><a href="#">Will W.</a></div>
<div class="playerNameAndInfo"><a href="#">Frank F.</a></div>
<div class="playerNameAndInfo"><a href="#">James J.</a></div>
<div class="playerNameAndInfo"><a href="#">Billy B.</a></div>
<div class="playerNameAndInfo"><a href="#">Harv H.</a></div>
<div class="playerNameAndInfo"><a href="#">Chris C.</a></div>
</body>
</html>
My node.io job looks like this:
var nodeio = require('node.io');
var methods = {
    input: false,
    run: function() {
        this.getHtml('http://www.willwyatt.com/nio.html', function(err, $) {
            //Handle any request / parsing errors
            if (err) this.exit(err);
            var output = [];
            var playerFirstName = [], playerLastName = [], playerPostion = [], playerTeam = [];
            $('.playerNameAndInfo').each(function(player) {
                thisPlayer = $('a', player.rawtext);
                output.push(thisPlayer.length + "\n");
            });
            this.emit(output);
        });
    }
}
exports.job = new nodeio.Job({timeout:10}, methods);
I expect this to output 6 lines that say '1' since thisPlayer should have been limited to the context of player. Instead, it looks like thisPlayer is ignoring the context called player. What am I doing wrong?
Thanks.
A body isn't empty.
var nodeio = require('/usr/lib/node_modules/node.io');
options = { jsdom: true};
exports.job = new nodeio.Job(options,{
    input: false,
    run: function () {
        this.getHtml('http://127.0.0.4/index.php?r=home/login&language=en',function(err,$,data,headers) {
            console.log($);
        });
    }
});
DEBUG: Running 1 worker..
DEBUG: GET http://127.0.0.4/index.php?r=home/login&language=en (request 98917)
DEBUG: | GET /index.php?r=home/login&language=en HTTP/1.1
DEBUG: | Accept: */*
DEBUG: | Accept-charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
DEBUG: | User-agent: node.io
DEBUG: | Host: 127.0.0.4
DEBUG: 200 http://127.0.0.4/index.php?r=home/login&language=en (response 98917)
DEBUG: | Date: Fri, 02 Sep 2011 21:45:51 GMT
DEBUG: | Server: Apache/2.2.19 (Fedora)
DEBUG: | X-powered-by: PHP/5.3.6
DEBUG: | Set-cookie: lang=en; expires=Sat, 01-Sep-2012 21:45:51 GMT; path=/,PHPSESSID=fi9tlebs2uft88fou0p10gvfc0; path=/
DEBUG: | Expires: Thu, 19 Nov 1981 08:52:00 GMT
DEBUG: | Cache-control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
DEBUG: | Pragma: no-cache
DEBUG: | Content-length: 7099
DEBUG: | Connection: close
DEBUG: | Content-type: text/html; charset=UTF-8
undefined
But if I use
this.getHtml('http://www.google.com',function(err,$,data,headers) {
console.log($);
});
$ is a function and works fine.
What is the recommended way to catch or prevent errors from throwing when scraping using the $ object?
This is the type of error I am getting:
/usr/local/lib/node_modules/node.io/lib/node.io/dom.js:25
throw new Error("No elements matching '" + selector + "'");
^
Error: No elements matching 'div.foo p'
Somewhat related -- if the $ selection only returns one result, .each fails.
Suggestions on how I should best handle these edge cases?
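One way to handle both edge cases is to wrap the selector call in try/catch and normalise the result to an array. This is a minimal sketch, assuming `$` throws an Error whose message starts with "No elements matching" (as in the trace above) and that a single match may come back as a bare element rather than a list. `fakeSelect` and `safeSelect` are hypothetical names; `fakeSelect` only stands in for the real `$`.

```javascript
// Stand-in for node.io's $; it mimics the behaviour described above and is
// NOT the library's implementation.
function fakeSelect(selector) {
  const dom = {
    'div.foo a': [{ text: 'one' }, { text: 'two' }],
    'h1': { text: 'title' }
  };
  if (!(selector in dom)) {
    throw new Error("No elements matching '" + selector + "'");
  }
  return dom[selector];
}

// Always returns an array: [] for no match, [el] for a single element.
function safeSelect($, selector) {
  let result;
  try {
    result = $(selector);
  } catch (err) {
    if (/^No elements matching/.test(err.message)) return [];
    throw err; // unrelated errors should still surface
  }
  return Array.isArray(result) ? result : [result];
}

const links = safeSelect(fakeSelect, 'div.foo a').map(function (el) { return el.text; });
const missing = safeSelect(fakeSelect, 'div.foo p');
const single = safeSelect(fakeSelect, 'h1');
```

Because the helper returns a plain array, the caller would iterate with `forEach` or `map` instead of `.each`.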
My whole problem with screen scraping is that my IP eventually gets blocked. It would be great to be able to create a list of URLs to push work out to (as slaves) or to publish a URL for clients (browsers) to pull work from a queue.
So in practice I could create an account on no.de which runs my server, tell it which slaves are available (IPs or URLs) and have it publish a URL for client-side workers. This would certainly help me work around IP blocking.
Have you seen map-crowd-reduce (node.js project)?
Just a quick question. I am trying to get
throw new Error("No elements matching '" + selector + "'");
^
Error: No elements matching 'html'
at [object Object].$ (/usr/local/lib/node/.npm/node.io/0.2.2-5/package/lib/node.io/dom.js:22:15)...
There seems to be a problem with request.js, handling certain URLs, e.g.
http://investing.businessweek.com/research/stocks/snapshot/snapshot.asp?capId=101648
I'm trying to scrape a list of URLs, but this URL consistently fails with an error 'EBADNAME, Misformatted domain name'
My code is here, (it's forcing this URL so you don't need to supply any input to see the issue)
https://gist.github.com/833888
ncb000gt kindly supplied a dump of the output of urlparse, which seems to get called twice; the second time it runs, the URL gets corrupted:
https://gist.github.com/833921
FWIW, I believe line 21 should be "seen_lines", not "seel_lines".
node v0.4.10
npm 1.0.22
Error (from a local install or -g) is:
[email protected] preinstall /home/will/sources/nfl/node_modules/node.io/node_modules/jsdom/node_modules/contextify
node-waf clean || true; node-waf configure build
sh: node-waf: not found
sh: node-waf: not found
npm ERR! error installing [email protected] Error: [email protected] preinstall: node-waf clean || true; node-waf configure build
npm ERR! error installing [email protected] sh "-c" "node-waf clean || true; node-waf configure build"
failed with 127
npm ERR! error installing [email protected] at ChildProcess. (/usr/lib/node_modules/npm/lib/utils/exec.js:49:20)
npm ERR! error installing [email protected] at ChildProcess.emit (events.js:67:17)
npm ERR! error installing [email protected] at ChildProcess.onexit (child_process.js:192:12)
npm ERR! error installing [email protected] Error: [email protected] preinstall: node-waf clean || true; node-waf configure build
npm ERR! error installing [email protected] sh "-c" "node-waf clean || true; node-waf configure build"
failed with 127
npm ERR! error installing [email protected] at ChildProcess. (/usr/lib/node_modules/npm/lib/utils/exec.js:49:20)
npm ERR! error installing [email protected] at ChildProcess.emit (events.js:67:17)
npm ERR! error installing [email protected] at ChildProcess.onexit (child_process.js:192:12)
npm ERR! error installing [email protected] Error: [email protected] preinstall: node-waf clean || true; node-waf configure build
npm ERR! error installing [email protected] sh "-c" "node-waf clean || true; node-waf configure build"
failed with 127
npm ERR! error installing [email protected] at ChildProcess. (/usr/lib/node_modules/npm/lib/utils/exec.js:49:20)
npm ERR! error installing [email protected] at ChildProcess.emit (events.js:67:17)
npm ERR! error installing [email protected] at ChildProcess.onexit (child_process.js:192:12)
npm ERR! [email protected] preinstall: node-waf clean || true; node-waf configure build
npm ERR! sh "-c" "node-waf clean || true; node-waf configure build"
failed with 127
npm ERR!
npm ERR! Failed at the [email protected] preinstall script.
npm ERR! This is most likely a problem with the contextify package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-waf clean || true; node-waf configure build
npm ERR! You can get their info via:
npm ERR! npm owner ls contextify
npm ERR! There is likely additional logging output above.
npm ERR!
npm ERR! System Linux 2.6.39.1-linode34
npm ERR! command "node" "/usr/bin/npm" "install" "node.io"
npm ERR! cwd /home/will/sources/nfl/node_modules
npm ERR! node -v v0.4.10
npm ERR! npm -v 1.0.22
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /home/will/sources/nfl/node_modules/npm-debug.log
npm not ok
Any ideas? Thanks.
The Content-Length header in request.js is currently set from string.length instead of Buffer.byteLength(string), so it undercounts any body that contains multi-byte characters.
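The difference is easy to see with a body that contains non-ASCII characters. A small sketch; the body text and the header object are made up for illustration and are not taken from request.js:

```javascript
// Hypothetical form body with two non-ASCII characters (written as escapes
// so the example does not depend on the file's encoding).
const body = 'name=Zo\u00eb&city=M\u00fcnchen';

const charCount = body.length;                      // UTF-16 code units
const byteCount = Buffer.byteLength(body, 'utf8');  // bytes actually sent

// The header should carry the byte count, not the character count.
const headers = {
  'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
  'Content-Length': byteCount
};

console.log(charCount, byteCount);
```

Each of the two accented characters takes two bytes in UTF-8, so a server trusting a length based on `body.length` would stop reading two bytes early.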
I am trying to call the get methods one inside the other but it doesn't seem to output anything. I am a noob, am I missing something here?
var nodeio = require('node.io'),
    options = {timeout: 1000, max: 30};
exports.job = new nodeio.Job(options, {
    input: ['CHI'],
    run: function (keyword) {
        outerThis = this;
        this.getHtml('http://www.kayak.com/s/search/air/?l1=ATL&df=us1&ft=ow&d1=6/21/2011&l2=' + keyword, function (err, $, data, headers) {
            var results = '';
            $('.flightresult').each(function (selector) {
                outerThis.getHtml('http://www.google.com/', function (err, $, data, headers) {
                    results = "inside google";
                    outerThis.emit("asdadada"); // DOESN'T EMIT ANYTHING, DOESN'T RUN AT ALL
                });
            });
            this.emit(i++ + '. Results for ATL to ' + keyword);
            this.emit(results);
        });
    }
});
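One possible cause, stated as an assumption rather than a diagnosis: the outer emit calls run before any nested callback has returned, so the thread may already be finished by the time the inner requests complete. A common pattern is to count outstanding callbacks and emit once, when the last one finishes. The sketch below uses plain functions in place of real requests; `makeBarrier` is a hypothetical helper, not part of node.io.

```javascript
// Returns a function to call once per finished request; `done` fires after
// `count` calls. With count === 0, `done` fires immediately.
function makeBarrier(count, done) {
  let remaining = count;
  let fired = false;
  function finish() {
    if (!fired) { fired = true; done(); }
  }
  if (remaining === 0) finish();
  return function oneFinished() {
    remaining -= 1;
    if (remaining <= 0) finish();
  };
}

// Usage with stand-in callbacks instead of real HTTP requests.
const emitted = [];
const results = [];
const oneFinished = makeBarrier(3, function () {
  emitted.push(results.join(','));   // in a job this would be this.emit(...)
});
['a', 'b', 'c'].forEach(function (item) {
  results.push(item);                // pretend the nested request returned
  oneFinished();
});
```

In the job above, the count would be the number of `.flightresult` matches, and the single emit would move into the `done` callback.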
An example of the error:
hornairs@bishop:~/Code/busters (master *)$ cat jobs/hello.coffee
nodeio = require 'node.io'
class Hello extends nodeio.JobClass
input: ['']
run: (num) -> @emit 'Hello World!'
@class = Hello
@job = new Hello()
hornairs@bishop:~/Code/busters (master *)$ node.io --debug jobs/hello.coffee
DEBUG: Compiling jobs/hello.coffee => hello_compiled.js
DEBUG: Running 1 worker..
DEBUG: Writing to STDOUT
Hello World!
OK: Job complete
hornairs@bishop:~/Code/busters (master *)$ node.io --debug -i input.txt jobs/hello.coffee
DEBUG: Compiling jobs/hello.coffee => hello_compiled.js
/usr/local/lib/node_modules/node.io/lib/node.io/processor.js:241
job_obj = job_obj.extend(options.extend.options || {}, options.extend.
^
TypeError: Object #<Hello> has no method 'extend'
at [object Object].processJob (/usr/local/lib/node_modules/node.io/lib/node.io/processor.js:241:27)
at /usr/local/lib/node_modules/node.io/lib/node.io/processor.js:328:30
Seems that the processing code expects all jobs to have an extend method so it can use it here: https://github.com/chriso/node.io/blob/master/lib/node.io/processor.js#L241, but CoffeeScript jobs don't end up with extend.
CoffeeScript jobs extend (using the CoffeeScript class system) the JobClass property, as opposed to calling the Job function. Job.extend is defined in that Job() function (https://github.com/chriso/node.io/blob/master/lib/node.io/job.js#L106). JobClass never gets extend, so the processing code errors when it calls it.
I think extend is in there because of the JobClass getter which returns a new object each time (so the prototype can't be mangled). Each Job() call operates on an effectively different JobClass, so the extend function body closes over the JobClass it's defined on. Maybe the solution to this is to define extend on the prototype and then bind it in the Job() function?
This is similar to #17 I believe, but I couldn't reproduce either.
I'm trying to ask node.io to skip a URL that was piped in from a file because it returns a 404 error. I tried the following, but the code continued past the this.skip() and as a result failed on the subsequent selector processing.
this.getHtml(url, function (err, $) {
    //Handle any request / parsing errors
    if (err) {
        this.skip();
        console.log(err);
        if (err == '404') {
            console.log('skipping');
            this.skip(); //<<----It's not skipping to the next url from stdin
        }
        else {
            this.exit(err);
        }
    }
.....................
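Calling a method such as skip() does not stop the enclosing callback, so any selector code after the error block still runs unless the callback returns. A minimal sketch of the early-return shape, using a fake job object that only records calls; `makeFakeJob` and `handleResponse` are hypothetical, and the '404' string form of the error is taken from the comparison in the snippet above, not from node.io's documentation.

```javascript
// Stand-in job object that records calls; skip(), exit() and emit() here are
// fakes, not node.io's implementations.
function makeFakeJob() {
  const calls = [];
  return {
    calls: calls,
    skip: function () { calls.push('skip'); },
    exit: function (err) { calls.push('exit:' + err); },
    emit: function (value) { calls.push('emit:' + value); }
  };
}

// Callback body with early returns so nothing runs after skip() or exit().
function handleResponse(job, err, parse) {
  if (err) {
    if (String(err) === '404') {
      job.skip();
      return;
    }
    job.exit(err);
    return;
  }
  job.emit(parse());
}

const notFound = makeFakeJob();
handleResponse(notFound, '404', function () { throw new Error('should not parse'); });

const ok = makeFakeJob();
handleResponse(ok, null, function () { return 'parsed'; });
```

Note that the original snippet also calls this.skip() unconditionally before checking the error type; in this sketch skip is called at most once per input.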
I don't get it: how can I use this module inside my own module without having to export a job as a module?
A global install does not work because require('node.io') breaks.
And if that does work, how can I start a job? Is there any undocumented method to start a job?
I'm testing out the save google.html example and it's giving me the following error:
ERROR: Module does not contain job "http://www.google.com/"
Tried both js & coffee versions and ran: node.io -s save "http://www.google.com/" > google.html
node.io: 0.2.9-2
node.js: 0.5.0-pre
node 0.4 on OSX using the hello.coffee example in wiki.
$ node.io-web -p 8080 .node_modules
INFO: Listening on port 8080
Running job "hello.coffee"
/Users/sridharr/.nvm/v0.4.0/lib/node/.npm/node.io/0.2.2-5/package/lib/node.io/processor.js:113
job_obj = job_obj.extend(options.extend.options || {}, options.ext
^
TypeError: Object #<Hello> has no method 'extend'
at /Users/sridharr/.nvm/v0.4.0/lib/node/.npm/node.io/0.2.2-5/package/lib/node.io/processor.js:113:31
at /Users/sridharr/.nvm/v0.4.0/lib/node/.npm/node.io/0.2.2-5/package/lib/node.io/processor.js:154:13
at /Users/sridharr/.nvm/v0.4.0/lib/node/.npm/node.io/0.2.2-5/package/lib/node.io/processor.js:319:53
Is it possible to add a delay between jobs/threads when input is true?
That is, after scraping, reducing, outputting, etc., node.io would wait before scraping again.
running multiple forks on a simple hello example throws an exception
node.io -f 1 hello
/usr/local/lib/node/.npm/node.io/0.2.5-5/package/lib/node.io/processor.js:296
this.status(JSON.stringify(e), 'debug');
^
TypeError: Object # has no method 'status'
at /usr/local/lib/node/.npm/node.io/0.2.5-5/package/lib/node.io/processor.js:296:22
at [object Object].handleMasterMessage (/usr/local/lib/node/.npm/node.io/0.2.5-5/package/lib/node.io/process_worker.js:28:9)
at EventEmitter. (/usr/local/lib/node/.npm/node.io/0.2.5-5/package/lib/node.io/processor.js:235:22)
at EventEmitter.emit (events.js:42:17)
at Socket. (/usr/local/lib/node/.npm/node.io/0.2.5-5/package/lib/node.io/multi_node.js:152:29)
at Socket.emit (events.js:42:17)
at Socket._onReadable (net.js:649:14)
at IOWatcher.onReadable as callback
Running the first example of the wiki to scrape a webpage (equivalent of curl -v http://example.com) I'm getting the following error:
node.io scrape.js "http://perdu.com/"
<html><head><title>Vous Etes Perdu ?</title></head><body><h1>Perdu sur l'Internet ?</h1><h2>Pas de panique, on va vous aider</h2><strong><pre> * <----- vous êtes ici</pre></strong></body></html>
http.js:330
this.socket.destroy(error);
^
TypeError: Cannot call method 'destroy' of null
at ClientRequest.destroy (http.js:330:15)
at /Users/kev/local/lib/node/.npm/node.io/0.2.1-15/package/lib/node.io/request.js:225:25
at IncomingMessage. (/Users/kev/local/lib/node/.npm/node.io/0.2.1-15/package/lib/node.io/request.js:322:13)
at IncomingMessage.emit (events.js:59:20)
at HTTPParser.onMessageComplete (http.js:111:23)
at Socket.ondata (http.js:990:22)
at Socket._onReadable (net.js:623:27)
at IOWatcher.onReadable [as callback] (net.js:156:10)
$ npm list installed
npm info it worked if it ends with ok
npm info using [email protected]
npm info using [email protected]
[email protected] =jashkenas active installed latest remote stable Unfancy JavaScript javascript language coffeescrip
[email protected] =indexzero active installed latest remote stable Add-on for creating *nix daemons
[email protected] =tjholowaychuk active installed latest remote TDD framework, light-weight, fast, CI-friendly
[email protected] =tautologistics active installed latest remote Forgiving HTML/XML/RSS Parser in JS for *both* Node and
[email protected] =cohara87 active installed remote A distributed data scraping and processing framework for node.js
[email protected] =cohara87 installed remote A distributed data scraping and processing framework for node.js data ma
[email protected] =cohara87 installed latest remote A distributed data scraping and processing framework for node.js
[email protected] =caolan active installed latest remote Easy unit testing for node.js and the browser.
[email protected] =isaacs installed remote A package manager for node package manager modules install package.json
[email protected] =isaacs active installed latest remote A package manager for node package manager modules install p
[email protected] =harryf active installed latest remote Adds CSS selector support to htmlparser for scraping activities
[email protected] =cohara87 active installed latest remote Data validation, filtering and sanitization for node.js va
npm ok
On the client side I have:
now.core.on("rv", function() {
jQuery('#messages div:first-child').html('Value replaced.');
});
This is causing:
Uncaught TypeError: Function.prototype.apply: Arguments list has wrong type on now.js:116.
This is a great framework, very easy to use! I have one suggestion: a very important part of scraping (especially with APIs) is not clobbering the server with hundreds of requests at once. Many APIs will ban/block you if you make too many requests in a small time period. For instance, the Wikitravel API requires that you not make more than one request per 30 seconds.
I suggest adding a Job option called "delay", in which you can specify the amount of time to wait before launching the next thread. That's pretty much the only thing that's missing. Thanks!
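Until such an option exists, the spacing could be done outside node.io with plain timers. The sketch below is not a node.io feature; `startOffsets` and `runSpaced` are hypothetical helpers, and the 30-second figure is the one quoted above.

```javascript
// Start offsets (ms) for `count` requests spaced `delayMs` apart.
function startOffsets(count, delayMs) {
  const offsets = [];
  for (let i = 0; i < count; i++) offsets.push(i * delayMs);
  return offsets;
}

// Schedules each task at its offset; returns the timer handles so a caller
// can cancel them with clearTimeout.
function runSpaced(tasks, delayMs) {
  return startOffsets(tasks.length, delayMs).map(function (offset, i) {
    return setTimeout(tasks[i], offset);
  });
}

// Three requests, one every 30 seconds.
const offsets = startOffsets(3, 30000);
```

`runSpaced([fetchA, fetchB, fetchC], 30000)` would then start one (hypothetical) fetch function every 30 seconds. This spaces request starts only; it does not wait for a slow response before launching the next one.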
I'm having a real problem trying to add extra methods to a job. Currently I'm using nodeio.JobProto.prototype to add my methods for my base class, but the problem comes when I try to access said function in the 'run' method: it fails, stating that the method cannot be found.
nodeio.JobProto.prototype.test = function () {
    this.debug("hello world");
};
new nodeio.Job({
    run: function () {
        this.test();
    }
});
I got around this by mixing in my methods in the first run call, but this isn't the best way since the methods reference the same function in all job instances.
In test/request.test.js I get twelve ECONNREFUSED errors (on lines 66, 92, ...).
Is this a known problem? Could be an expresso issue. Assuming it works for you, which versions are you using?
$ git rev-parse HEAD
89d775fb3e263fa67b23d7fbf64fd2ad19186c45
$ node --version
v0.4.11
$ expresso --version
0.8.1
Cheers!
Felix
Thanks for your work on node.io, it's a fantastic tool!
I've just enabled jsdom in order to use jQuery, and now get the following error:
$ node.io nuffield
/usr/local/lib/node_modules/node.io/lib/node.io/dom.js:58
utils.fatal('jQuery is not installed. Run `npm install jquery`');
^
TypeError: Object #<Object> has no method 'fatal'
at [object Object].parseHtml (/usr/local/lib/node_modules/node.io/lib/node.io/dom.js:58:19)
at [object Object].<anonymous> (/usr/local/lib/node_modules/node.io/lib/node.io/request.js:108:18)
at /usr/local/lib/node_modules/node.io/lib/node.io/request.js:217:25
at /usr/local/lib/node_modules/node.io/lib/node.io/request.js:379:17
at IncomingMessage.<anonymous> (/usr/local/lib/node_modules/node.io/lib/node.io/request.js:384:17)
at IncomingMessage.emit (events.js:81:20)
at HTTPParser.onMessageComplete (http.js:133:23)
at Socket.ondata (http.js:1227:22)
at Socket._onReadable (net.js:683:27)
at IOWatcher.onReadable [as callback] (net.js:177:10)
I have run npm install jquery; node.io is installed globally while all the other modules are installed in my home directory.
Installed modules:
$ npm list
/Users/peter
├── [email protected]
├── [email protected]
├── [email protected]
├─┬ [email protected]
│ ├── [email protected]
│ └── [email protected]
├─┬ [email protected] -> /usr/local/lib/node_modules/node.io
│ ├── [email protected]
│ ├── [email protected]
│ └─┬ [email protected]
│ └─┬ [email protected]
│ └── [email protected]
└── [email protected]
and
$ npm list -g
/usr/local/lib
├─┬ [email protected]
│ ├── [email protected]
│ ├── [email protected]
│ └─┬ [email protected]
│ └─┬ [email protected]
│ └── [email protected]
└─┬ [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
└── [email protected]
Please, do you have any idea what is wrong?
> echo "mastercard.com" | node.io pagerank
ERROR: Failed to load job "pagerank". Please check that the job exists and compiles correctly.
From [email protected] on the mailing list:
request.js line 505 should be 'response.setEncoding('binary');' for multibyte chunked responses.
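The underlying problem can be reproduced without any network code: when a multi-byte character is split across two chunks and each chunk is decoded as UTF-8 on its own, the character is corrupted. The sketch below shows that failure, the byte-preserving 'binary' (latin1) route the post suggests, and a different but related technique of buffering raw bytes and decoding once. The chunk boundary is chosen by hand for illustration.

```javascript
const expected = 'h\u00e9llo';                 // the accented e is two bytes in UTF-8
const raw = Buffer.from(expected, 'utf8');
const chunkA = raw.subarray(0, 2);             // 'h' plus the first byte of the accented e
const chunkB = raw.subarray(2);                // its second byte plus 'llo'

// Decoding each chunk separately corrupts the split character.
const naive = chunkA.toString('utf8') + chunkB.toString('utf8');

// 'binary' (an alias of 'latin1') maps each byte to one character, so the
// bytes survive concatenation and can be decoded as UTF-8 afterwards.
const binaryText = chunkA.toString('binary') + chunkB.toString('binary');
const viaBinary = Buffer.from(binaryText, 'binary').toString('utf8');

// Alternative: collect the raw chunks and decode once at the end.
const joined = Buffer.concat([chunkA, chunkB]).toString('utf8');
```

Both byte-preserving routes recover the original text, while the naive per-chunk decode does not.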
Hi there, and congratulations for the good work done so far with Node.io.
I've been trying to re-use some of your clever DOM augmentation methods in my own screen scraping project. I believe these methods could come in handy even outside of node.io, as they at least provide a decent read-only DOM API, which is currently lacking in both Tautologistics's html parser and Soupselect. However, I've noticed that the element's 'text' getter does not work when called with a node that has child elements. For instance, try selecting the header texts from this DOM document:
$('h3',document).each(function(item){
puts(item.text)
});
home/me/mycode/dom.js:173
text = self.filter(text).entityDecode();
^
TypeError: Object #<an Object> has no method 'filter'
In your code, this is on line 225. My guess is that you have not fully implemented this bit yet. I would be happy to help you test this.
Best and thanks for sharing,
Andrea
As you suggested earlier, I tried using the jsdom: true option, but nothing happens. It all works fine when I use jsdom: false. When I use it with jsdom: true, I always get a timeout error.
var nodeio = require('node.io');
var options = { max: 20, timeout: 20, jsdom: true };
exports.job = new nodeio.Job(options, {
    input: false,
    run: function () {
        this.getHtml('http://node.io/', function (err, $, data) {
            if (err) this.exit(err);
            this.emit("jquery object:" + $);
        });
    }
});
I am using
node.io -v 0.3.0
jquery -v 1.5.1
jsdom -v 0.2.0
I didn't have permission to reopen the previous issue, so I'm creating a new one...
I'm talking about the node.js Example 1 here: https://github.com/chriso/node.io/wiki/Scraping
Instead of
} else {
self.emit(data);
}
}
it should be
} else {
self.emit(data);
}
});
Also, this is less important, but you may want a semicolon after self.exit(err) in the same example.
I'm using node.io to run a bit of a web scraper (grabbing game data from nhl.com) and I have a problem wherein node.io exits before all of the getHtml requests are completed. How does it decide when to exit? This uses node 0.2.5.
Hi, can you provide some code that shows how one would use a proxy? I realize that the code is not tested, but I'm not even sure how to include it in the job options.
I'm assuming something like
new nodeio.HttpProxy("http://proxy:port")
is how it's meant to function?
Thanks for this excellent library. If I could get it to work through the proxy, it would be excellent.
When screen scraping, it can be very convenient sometimes, to automatically generate a post request from a form in the html document. To get the point across, here is some code I wrote which does that for me.
submitForm = ($, callback) =>
  body = {}
  for item in $('form select, form input[type!=submit], form input[type=submit]:first')
    if $(item).val()
      body[$(item).attr('name')] = $(item).val()
  url = root + $('form').attr('action').replace('&amp;', '&')
  header = {'Content-Type': 'application/x-www-form-urlencoded'}
  @postHtml url, body, header, callback
Would it be possible to have that feature in node.io?
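For reference, the encoding half of that idea can be sketched in plain Node.js with `URLSearchParams`. Extracting the fields from the HTML is left out; the field names and values below are hypothetical, and `encodeForm` is not a node.io function.

```javascript
// Builds an application/x-www-form-urlencoded body from name/value pairs,
// skipping fields with empty values (as the CoffeeScript above does).
function encodeForm(fields) {
  const params = new URLSearchParams();
  Object.keys(fields).forEach(function (name) {
    if (fields[name]) params.append(name, fields[name]);
  });
  return params.toString();
}

// Hypothetical field values, as they might be read from a login form.
const encoded = encodeForm({ user: 'will w', query: 'a&b', empty: '' });
```

Spaces become `+` and reserved characters such as `&` are percent-encoded, so the resulting string can be sent as the POST body with the `application/x-www-form-urlencoded` content type.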
I noticed a few issues with regards to HTTP redirects-
Thanks!
I want to kick off a script from cron. The script works fine when run manually, but from cron the output I get is this:
/bin/sh: node.io: not found
I'm using plain old user crontab -e to set up the cron job, so it should run as my standard user.
Any ideas?
The command I'm using is something like this:
*/1 * * * * node.io /home/myuname/code/project/import.js >> /home/myuname/code/project/import_log.log 2>&1
I've just started implementing a build tool that needs to scrape some data and return it as JSONP. The problem is that the node.io timeout kills the parent process, which I want to keep alive all the time. Is it possible to exit or tidy up the node.io scrape process without killing the entire node process?
Hello, I'm trying to pass complex objects as my obj.
On the first try, it seems that I can't pass along an object itself, as the job just exits without any error. I then tried to encode my job using various methods (JSON.stringify and querystring.stringify), but the url passed to the "run" function only contained the first character:
exports.job = new nodeio.Job({
    input: function(start, num, callback) {
        setTimeout(function() {
            var body = {
                url: "http://google.com",
                foo: "boo",
                bar: "bar"
            };
            callback(JSON.stringify(body)); // only outputs the first char which is { same with querystring.stringify
        }, 1000);
    },
    run: function(job) {
        console.log(job);
    }
})
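One possible explanation, offered as an assumption about the mechanism rather than a reading of node.io's source: if the input callback expects a list of inputs, a bare string is indexed like a list and handed out one character at a time. The sketch below reproduces that effect with a hypothetical `dispatch` function and shows that wrapping the JSON string in an array delivers it whole.

```javascript
// Hypothetical dispatcher: treats whatever it receives as a list of inputs
// and hands them out by index. This is a guess at the mechanism, not
// node.io's actual code.
function dispatch(inputs, run) {
  for (let i = 0; i < inputs.length; i++) run(inputs[i]);
}

const body = { url: 'http://google.com', foo: 'boo', bar: 'bar' };
const json = JSON.stringify(body);

// Passing the bare string: each "input" is a single character.
const fromString = [];
dispatch(json, function (item) { fromString.push(item); });

// Passing an array with one element: the whole string arrives intact.
const fromArray = [];
dispatch([json], function (item) { fromArray.push(JSON.parse(item)); });
```

If the assumption holds, `callback([JSON.stringify(body)])` in the job above would deliver the full string to run, which could then call JSON.parse on it.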
I'm trying to run the intro job hello.js but when I run this command:
node.io hello
I get this output:
ERROR: Failed to load job "hello". Please check that the job exists and compiles correctly.
When I debug:
DEBUG: {"stack":"Error: Cannot find module 'node.io'\n at Function._resolveFilename (module.js:322:11)\n at Function._load (module.js:267:25)\n at require (module.js:351:19)\n at Object.<anonymous> (/Users/username/Documents/Nodejs/hello.js:1:75)\n at Module._compile (module.js:407:26)\n at Object..js (module.js:413:10)\n at Module.load (module.js:339:31)\n at Function._load (module.js:298:12)\n at require (module.js:351:19)\n at [object Object].loadJob (/opt/local/lib/node_modules/node.io/lib/node.io/processor.js:294:37)","message":"Cannot find module 'node.io'"}
However, echo "mastercard.com" | node.io -s pagerank works just fine, just as node.io query "http://www.reddit.com/" a.title does.
Hi Chris,
I think the Timer reset code in pullInput might sometimes prevent the complete event from being emitted. The reliability seems to be related to the size of the input, the max threads and the response speed for the last few requests.
My limited async debugging skills mean I've filled my local version of node.io with status('blah','debug') calls to see what's happening. It seems that when there's no more input data left to pull there could be requests still running, and if they don't complete within the 300ms before the next check in handle_input() then it's possible that the crucial final check is never made.
By changing line 129 in process_master.js from:
if (completeCheckInterval) {
    clearInterval(completeCheckInterval);
}
to something like:
if (completeCheckInterval && job.input.length > pull) {
    clearInterval(completeCheckInterval);
}
...it works each time, regardless of job.options.max or network conditions. In fact, removing that block is also fine as it only normally takes a few seconds for the last few threads to finish.
I'm using node.io within other node scripts. I run my jobs using nodeio.start(job, options, callback, true) and I'm providing a job.input() to load data from a filename passed in options.args. Not sure whether that would make a difference to either a) the way the input data is collected; or b) the behaviour of complete.
I can try and write some test scripts to recreate the problem I'm having if you think the issue is elsewhere?
I've found node.io to be really useful, thanks for sharing it. Let me know if you need a hand with any documentation, examples etc. I would love to contribute if I can.
Many thanks
Paul
This is fantastic! I currently use a Ruby script to scrape a web database, and there's one thing my script has that I'd love to see here: the ability to validate the uniqueness of a record before writing it to a file or inserting it into a database.
And with regards to the distribution across servers, I'm truly looking forward to this!
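The uniqueness check described above can be approximated in user code with a Set of keys already seen. A minimal sketch; `makeUniqueFilter` is a hypothetical helper, the `id` field is made up, and the approach only covers records seen within a single process (not across workers or previous runs).

```javascript
// Returns a predicate that is true the first time a key is seen.
function makeUniqueFilter(keyOf) {
  const seen = new Set();
  return function isNew(record) {
    const key = keyOf(record);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  };
}

// Hypothetical records keyed by an `id` field.
const records = [
  { id: 1, name: 'a' },
  { id: 2, name: 'b' },
  { id: 1, name: 'a again' }
];
const unique = records.filter(makeUniqueFilter(function (r) { return r.id; }));
```

Only the first record for each key is kept, so the filter could sit just before the write or insert step.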
So:
I'm using .scrape() within another node.js application and the fact that node.io makes a hard kill after job completion is becoming a problem. After I call .scrape() and the job is completed it also kills my node.js server after outputting the job status.
Is there a way around this?
Also: I am aware that node.io is primarily a console application meant to be running on its own; but it's just so sexy!
As a new nodejs user, the global installation of node.io has been confusing, as I've had to install node.io globally and locally.
It appears by default (and design) that global installations are not in the path - this link https://github.com/isaacs/npm/issues/775#issuecomment-1389398 makes it clear this is intended (via http://stackoverflow.com/questions/6159552/problem-running-jobs-with-node-io/6415110#6415110).
I appreciate that you have a different view on the subject, so could you add steps to your installation page to explain what needs changing to the path and why, as the link above implies it's a bad idea to do?
Many thanks!
This happens on every request (something to do with how destroy has changed/moved around)
Also, createClient has been removed from the public API docs
http.js:330
this.socket.destroy(error);
^
TypeError: Cannot call method 'destroy' of null
at ClientRequest.destroy (http.js:330:15)
at /usr/local/lib/node/.npm/node.io/0.2.1-18/package/lib/node.io/request.js:225:25
at IncomingMessage. (/usr/local/lib/node/.npm/node.io/0.2.1-18/package/lib/node.io/request.js:321:13)
at IncomingMessage.emit (events.js:59:20)
at HTTPParser.onMessageComplete (http.js:111:23)
at Client.onData as ondata
at Client._onReadable (net.js:623:27)
at IOWatcher.onReadable as callback
Is it possible to use jQuery in place of the $ object? Or is there any way I can create a jQuery window object with the help of the $ object?
var nodeio = require('node.io'),
    options = {timeout: 10},
    jQuery = require('jquery');
exports.job = new nodeio.Job(options, {
    input: ['hello', 'foobar', 'weather'],
    run: function (keyword) {
        this.getHtml('http://www.google.com/search?q=' + encodeURIComponent(keyword), function (err, $) {
            // SOMEHOW CREATE THE JQUERY OBJECT USING $
            var results = $('#resultStats').text.toLowerCase();
            this.emit(keyword + ' has ' + results);
        });
    }
});
Hi,
Maybe I'm doing something wrong, but I have a simple function that handles a request. It's just a copy and paste of your example. It works fine and prints all the titles. However, after that the console says "OK: Job complete" and the server is stopped without any warning. Am I doing something wrong?
Code:
function index(response, request) {
    require('node.io').scrape(function() {
        this.getHtml('http://www.reddit.com/', function(err, $) {
            var stories = [];
            $('a.title').each(function(title) {
                stories.push(title.text);
            });
            this.emit(stories);
        });
    });
}
Sometimes, in the crawler process, I catch:
TypeError: Object # has no method 'destroy'
at /home/ec2-user/.node_libraries/.npm/node.io/0.1.1-15/package/lib/node.io/request.js:102:21
I made a simple workaround on this line, adding:
if(request && request.destroy) {
...
}
But I believe the underlying problem is more severe.
I tried reddit.coffee from the tutorial and didn't get any output. For reference, here's the source code:
nodeio = require('node.io')
titles = []
scores = []
output = []
class Reddit extends nodeio.JobClass
  input: false
  run: ->
    @getHtml 'http://www.reddit.com/', (err, $, data) =>
      @exit err if err?
      $('a.title').each (a) -> titles.push a.text
      $('div.score.unvoted').each (div) -> scores.push div.rawtext
      @exit 'Title / score mismatch' if scores.length isnt titles.length
      for score, i in scores
        if score is '•' then continue
        @assert(score).isInt()
        output.push '[' + score + '] ' + titles[i]
      @emit output
@class = Reddit
@job = new Reddit({timeout:10})
After a bit of debugging, I realized that something had changed since the tutorial had been written so that div.rawtext was undefined. This was triggering @assert(score).isInt() to fail so that threads would terminate without printing undefined. However, there wasn't any sort of message that the assertion had failed, even when I ran things with the -g flag:
john-laptop :: ~/tmp » node.io -g reddit.coffee 1 ↵
INFO: Compiling /home/john/tmp/reddit.coffee
INFO: Running 1 worker..
DEBUG: GET http://www.reddit.com/
DEBUG: 200 http://www.reddit.com/
OK: Job complete
Is there something I can do to see when an assert fails?
I needed to implement a form login via a POST request, which then redirected to the member page as expected. The problem is that the redirect should be followed with a GET request, but node.io (request.js) reused POST by copying the previous request. I changed this line:
self.doRequest(method, location, null, headers, callback, parse, ++redirects);
to
self.doRequest('GET', location, null, headers, callback, parse, ++redirects);
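The change above always follows a redirect with GET. A slightly narrower rule, sketched below as an alternative rather than as the fix that was applied, switches to GET only where HTTP clients conventionally do so (303, and 301/302 after a POST) and keeps the original method for 307 and 308. `methodAfterRedirect` is a hypothetical helper, not part of request.js.

```javascript
// Picks the HTTP method for the follow-up request after a redirect.
function methodAfterRedirect(statusCode, method) {
  // 303 See Other: follow with GET (HEAD stays HEAD).
  if (statusCode === 303 && method !== 'HEAD') return 'GET';
  // 301/302 after a POST: clients conventionally switch to GET.
  if ((statusCode === 301 || statusCode === 302) && method === 'POST') return 'GET';
  // 307 and 308 (and everything else) keep the original method.
  return method;
}

const afterLogin = methodAfterRedirect(302, 'POST');
const afterTemporary = methodAfterRedirect(307, 'POST');
```

With a rule like this, the form login case in the post (a POST answered with a redirect to the member page) is followed with GET, while a 307 would still resend the POST.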
Error message:
john-laptop :: ~/tmp » node.io hello
node.js:63
throw e;
^
Error: Cannot find module '/home/john/tmp/hello'
at loadModule (node.js:275:15)
at require (node.js:411:14)
at [object Object].loadJob (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/processor.js:298:16)
at Object.start (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/processor.js:152:30)
at /usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/cli.js:124:19
at Object.cli (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/cli.js:279:5)
at Object. (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/bin/node.io:3:20)
at Module._compile (node.js:462:23)
at Module._loadScriptSync (node.js:469:10)
at Module.loadSync (node.js:338:12)
However, this seems to work:
node.io hello.coffee
And oddly enough, this works (using double.coffee from the tutorial):
node.io double
A related issue is that require statements don't seem to work properly. Using double.coffee and quad.coffee from the tutorial,
john-laptop :: ~/tmp » node.io quad.coffee 1 ↵
INFO: Compiling /home/john/tmp/quad.coffee
node.js:275
    throw new Error("Cannot find module '" + request + "'");
    ^
Error: Cannot find module './double'
    at loadModule (node.js:275:15)
    at require (node.js:411:14)
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:12:12)
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:25:4)
    at Module._compile (node.js:462:23)
    at Module._loadScriptSync (node.js:469:10)
    at Module.loadSync (node.js:338:12)
    at loadModule (node.js:283:14)
    at require (node.js:411:14)
    at [object Object].loadJob (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/processor.js:298:16)
If I modify the original quad.coffee so it refers to double.coffee:
john-laptop :: ~/tmp » node.io quad.coffee 1 ↵
INFO: Compiling /home/john/tmp/quad.coffee
/home/john/tmp/double.coffee:3
class Double extends nodeio.JobClass
^^^^^^
SyntaxError: Unexpected identifier
    at Module._compile (node.js:458:37)
    at Module._loadScriptSync (node.js:469:10)
    at Module.loadSync (node.js:338:12)
    at loadModule (node.js:283:14)
    at require (node.js:411:14)
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:12:12)
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:25:4)
    at Module._compile (node.js:462:23)
    at Module._loadScriptSync (node.js:469:10)
    at Module.loadSync (node.js:338:12)
I even tried compiling double.coffee into double.js and referring to it that way:
john-laptop :: ~/tmp » node.io quad.coffee
john-laptop :: ~/tmp » node.io quad.coffee
INFO: Compiling /home/john/tmp/quad.coffee
/home/john/tmp/quad_compiled.js:6
    ctor.prototype = parent.prototype;
    ^
TypeError: Cannot read property 'prototype' of undefined
    at /home/john/tmp/quad_compiled.js:6:28
    at /home/john/tmp/quad_compiled.js:17:5
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:22:3)
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:25:4)
    at Module._compile (node.js:462:23)
    at Module._loadScriptSync (node.js:469:10)
    at Module.loadSync (node.js:338:12)
    at loadModule (node.js:283:14)
    at require (node.js:411:14)
    at [object Object].loadJob (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/processor.js:298:16)
I am using npm 1.0rc8 and node 0.4.5
When I try running the following code:
var nodeio = require('node.io');
exports.job = new nodeio.Job({
    input: false,
    run: function() {
        var self = this;
        this.getHtml('http://google.com/', function(err, $) {
            if (err) { self.exit(err); }
            self.emit($('input'));
        });
    }
});
I get this error:
Error: Cannot find module 'htmlparser'
at Function._resolveFilename (module.js:320:11)
at Function._load (module.js:266:25)
at require (module.js:348:19)
at Object.<anonymous> (/opt/local/lib/node_modules/node.io/vendor/soupselect/lib/soupselect.js:8:16)
at Module._compile (module.js:404:26)
at Object..js (module.js:410:10)
at Module.load (module.js:336:31)
at Function._load (module.js:297:12)
at require (module.js:348:19)
at [object Object].$ (/opt/local/lib/node_modules/node.io/lib/node.io/dom.js:20:22)
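The trace shows the bundled soupselect failing to load htmlparser. A quick way to check whether a dependency can be resolved at all is Node's require.resolve, which throws when the module cannot be found. This is only a minimal sketch: canResolve is a hypothetical helper name, and it checks resolution from the location of the script that runs it, which may differ from what node.io's vendor directory sees.

```javascript
// Returns true if `name` can be resolved from this script's location.
function canResolve(name) {
    try {
        require.resolve(name);
        return true;
    } catch (e) {
        // require.resolve throws when the module is not installed or not on the lookup path.
        return false;
    }
}

console.log('htmlparser resolvable: ' + canResolve('htmlparser'));
```

If this reports false, installing htmlparser where node.io can find it is the likely next step to try.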
There is no way of using .each when there is just a single element matching a class. Take this HTML as an example.
<div class='example'>This is text contained in a div with the class example</div>
<div class='ex2'>This is also text, this time contained in a class named ex2</div>
<div class='example'>Oh look, same class as the first div, example</div>
All fun and games right?
Let's try to get the text out of every div with the class example:
var a = [];
$('div.example').each(function(e) {
    a.push(e.text);
});
this.emit(a);
This should work without a hitch.
Now let's try to get the text out of every div with the class ex2. We have no idea how many elements use this class, so we use each to get them all, even if there is just one:
var z = [];
$('div.ex2').each(function(e) {
    z.push(e.text);
});
this.emit(z);
Awww man! That didn't work at all!
.each only works when the selector matches more than one element. With CSSLint fanatics starting to use classes in place of IDs (#), this might become a larger issue in the future.
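Until the library handles this, one possible workaround is to normalise the selector result before iterating. The sketch below assumes, based only on the behaviour described above, that $() returns a collection with an each method when there are several matches and a bare element when there is one; eachMatch is a hypothetical helper, not part of node.io, and the objects at the bottom are stand-ins for real selector results.

```javascript
// Calls `callback` once per matched element, whether `result`
// is a collection with .each or a single bare element.
function eachMatch(result, callback) {
    if (result && typeof result.each === 'function') {
        result.each(callback);
    } else if (result) {
        callback(result);
    }
}

// Stand-ins for what $() is assumed to return in each case.
var single = { text: 'only one' };
var many = {
    each: function (fn) {
        [{ text: 'a' }, { text: 'b' }].forEach(function (e) { fn(e); });
    }
};

var z = [];
eachMatch(single, function (e) { z.push(e.text); });
eachMatch(many, function (e) { z.push(e.text); });
```

In a job this would be called as eachMatch($('div.ex2'), function(e) { z.push(e.text); }), so the same code path covers one match or many.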