node-js-libs / node.io

License: MIT License

Perl 0.10% JavaScript 90.71% Io 0.11% CoffeeScript 9.08%

node.io's Issues

How to use selector context?

As a test, I'm trying to scrape a very simple HTML page that has divs with links in them.

<html>
    <body>
        <div class="playerNameAndInfo"><a href="#">Will W.</a></div>
        <div class="playerNameAndInfo"><a href="#">Frank F.</a></div>
        <div class="playerNameAndInfo"><a href="#">James J.</a></div>
        <div class="playerNameAndInfo"><a href="#">Billy B.</a></div>
        <div class="playerNameAndInfo"><a href="#">Harv H.</a></div>
        <div class="playerNameAndInfo"><a href="#">Chris C.</a></div>
    </body>
</html>

My node.io job looks like this:

var nodeio = require('node.io');
var methods = {
    input: false,
    run: function() {
        this.getHtml('http://www.willwyatt.com/nio.html', function(err, $) {

            //Handle any request / parsing errors
            if (err) this.exit(err);

            var output = [];

            var playerFirstName = [], playerLastName = [], playerPostion = [], playerTeam = [];

            $('.playerNameAndInfo').each(function(player) {
                thisPlayer = $('a', player.rawtext);
                output.push(thisPlayer.length + "\n");
            });

            this.emit(output);
        });
    }
}

exports.job = new nodeio.Job({timeout:10}, methods);

I expect this to output 6 lines that say '1' since thisPlayer should have been limited to the context of player. Instead, it looks like thisPlayer is ignoring the context called player. What am I doing wrong?

Thanks.

Why can $ be undefined?

The response body isn't empty.

var nodeio = require('/usr/lib/node_modules/node.io');
options = { jsdom: true};
exports.job = new nodeio.Job(options,{
        input: false,
        run: function () {
            this.getHtml('http://127.0.0.4/index.php?r=home/login&language=en',function(err,$,data,headers) {
                console.log($);
            });
        }
});

DEBUG: Running 1 worker..
DEBUG: GET http://127.0.0.4/index.php?r=home/login&language=en (request 98917)
DEBUG:   | GET /index.php?r=home/login&language=en HTTP/1.1
DEBUG:   | Accept: */*
DEBUG:   | Accept-charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
DEBUG:   | User-agent: node.io
DEBUG:   | Host: 127.0.0.4
DEBUG: 200 http://127.0.0.4/index.php?r=home/login&language=en (response 98917)
DEBUG:   | Date: Fri, 02 Sep 2011 21:45:51 GMT
DEBUG:   | Server: Apache/2.2.19 (Fedora)
DEBUG:   | X-powered-by: PHP/5.3.6
DEBUG:   | Set-cookie: lang=en; expires=Sat, 01-Sep-2012 21:45:51 GMT; path=/,PHPSESSID=fi9tlebs2uft88fou0p10gvfc0; path=/
DEBUG:   | Expires: Thu, 19 Nov 1981 08:52:00 GMT
DEBUG:   | Cache-control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
DEBUG:   | Pragma: no-cache
DEBUG:   | Content-length: 7099
DEBUG:   | Connection: close
DEBUG:   | Content-type: text/html; charset=UTF-8
undefined

But if I use

this.getHtml('http://www.google.com',function(err,$,data,headers) {
                console.log($);
            });

$ is a function and works fine.

$ object catching errors

What is the recommended way to catch or prevent errors from throwing when scraping using the $ object?

This is the type of error I am getting:

/usr/local/lib/node_modules/node.io/lib/node.io/dom.js:25
throw new Error("No elements matching '" + selector + "'");
^
Error: No elements matching 'div.foo p'

Somewhat related -- if the $ selection only returns one result, .each fails.

Suggestions on how I should best handle these edge cases?
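
One way to handle both edge cases, sketched under assumptions: the `$` stub below only imitates the behaviour described in this issue (it throws when nothing matches and returns a bare element instead of a list for a single match). `makeFakeSelector` and `selectAll` are hypothetical names, not part of node.io.

```javascript
// Hypothetical stand-in for node.io's `$`, imitating the behaviour reported
// above: throws on no match, returns a single object for exactly one match.
function makeFakeSelector(elements) {
    return function $(selector) {
        var matches = elements[selector] || [];
        if (matches.length === 0) {
            throw new Error("No elements matching '" + selector + "'");
        }
        return matches.length === 1 ? matches[0] : matches;
    };
}

// Hypothetical helper: always returns an array and never throws.
// Note that it swallows every error from `$`; a real job might inspect
// e.message and re-throw anything that is not a "no match" error.
function selectAll($, selector) {
    var result;
    try {
        result = $(selector);
    } catch (e) {
        return [];
    }
    return Array.isArray(result) ? result : [result];
}

var $ = makeFakeSelector({
    'div.bar p': [{ text: 'only one' }],
    'div.baz p': [{ text: 'first' }, { text: 'second' }]
});

console.log(selectAll($, 'div.foo p').length); // 0
console.log(selectAll($, 'div.bar p').length); // 1
console.log(selectAll($, 'div.baz p').length); // 2
```

With a wrapper like this, a plain `forEach` works the same way whether zero, one, or many elements matched.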

Distributing work

My whole problem with screen scraping is that my IP eventually gets blocked. It would be great to be able to create a list of URLs to push work out to (as slaves), or to publish a URL for clients (browsers) to pull work from a queue.

So in practice I could create an account on no.de which runs my server, tell it which slaves are available (IPs or URLs) and have it publish a URL for client-side workers. This would certainly help me work around IP blocking.

Have you seen map-crowd-reduce (node.js project)?

$("html")

Just a quick question. I am trying to get $("html") and while $("body") or $("head") seem to work, $("html") doesn't. Any idea?

    throw new Error("No elements matching '" + selector + "'");
          ^

Error: No elements matching 'html'
at [object Object].$ (/usr/local/lib/node/.npm/node.io/0.2.2-5/package/lib/node.io/dom.js:22:15)...

Problem with request.js

There seems to be a problem with request.js, handling certain URLs, e.g.
http://investing.businessweek.com/research/stocks/snapshot/snapshot.asp?capId=101648

I'm trying to scrape a list of URLs, but this URL consistently fails with an error 'EBADNAME, Misformatted domain name'

My code is here, (it's forcing this URL so you don't need to supply any input to see the issue)
https://gist.github.com/833888

ncb000gt kindly supplied a dump of the output of urlparse, which seems to get called twice; the second time it runs, the URL gets corrupted:
https://gist.github.com/833921

Typo in node.io / builtin / duplicates.coffee

fwiw, believe line 21 should be "seen_lines", not "seel_lines"

installation error

node v0.4.10
npm 1.0.22

Error (from a local install or -g) is:

[email protected] preinstall /home/will/sources/nfl/node_modules/node.io/node_modules/jsdom/node_modules/contextify
node-waf clean || true; node-waf configure build

sh: node-waf: not found
sh: node-waf: not found
npm ERR! error installing [email protected] Error: [email protected] preinstall: node-waf clean || true; node-waf configure build
npm ERR! error installing [email protected] sh "-c" "node-waf clean || true; node-waf configure build" failed with 127
npm ERR! error installing [email protected] at ChildProcess.<anonymous> (/usr/lib/node_modules/npm/lib/utils/exec.js:49:20)
npm ERR! error installing [email protected] at ChildProcess.emit (events.js:67:17)
npm ERR! error installing [email protected] at ChildProcess.onexit (child_process.js:192:12)
npm ERR! error installing [email protected] Error: [email protected] preinstall: node-waf clean || true; node-waf configure build
npm ERR! error installing [email protected] sh "-c" "node-waf clean || true; node-waf configure build" failed with 127
npm ERR! error installing [email protected] at ChildProcess.<anonymous> (/usr/lib/node_modules/npm/lib/utils/exec.js:49:20)
npm ERR! error installing [email protected] at ChildProcess.emit (events.js:67:17)
npm ERR! error installing [email protected] at ChildProcess.onexit (child_process.js:192:12)
npm ERR! error installing [email protected] Error: [email protected] preinstall: node-waf clean || true; node-waf configure build
npm ERR! error installing [email protected] sh "-c" "node-waf clean || true; node-waf configure build" failed with 127
npm ERR! error installing [email protected] at ChildProcess.<anonymous> (/usr/lib/node_modules/npm/lib/utils/exec.js:49:20)
npm ERR! error installing [email protected] at ChildProcess.emit (events.js:67:17)
npm ERR! error installing [email protected] at ChildProcess.onexit (child_process.js:192:12)
npm ERR! [email protected] preinstall: node-waf clean || true; node-waf configure build
npm ERR! sh "-c" "node-waf clean || true; node-waf configure build" failed with 127
npm ERR!
npm ERR! Failed at the [email protected] preinstall script.
npm ERR! This is most likely a problem with the contextify package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node-waf clean || true; node-waf configure build
npm ERR! You can get their info via:
npm ERR! npm owner ls contextify
npm ERR! There is likely additional logging output above.
npm ERR!
npm ERR! System Linux 2.6.39.1-linode34
npm ERR! command "node" "/usr/bin/npm" "install" "node.io"
npm ERR! cwd /home/will/sources/nfl/node_modules
npm ERR! node -v v0.4.10
npm ERR! npm -v 1.0.22
npm ERR!
npm ERR! Additional logging details can be found in:
npm ERR! /home/will/sources/nfl/node_modules/npm-debug.log
npm not ok

Any ideas? Thanks.

Content-Length for utf8

Content-Length header in request.js is currently being set based on string.length instead of Buffer.byteLength(string)
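
A minimal illustration of the difference, not the actual request.js code: `string.length` counts UTF-16 code units, while `Content-Length` must be the number of bytes sent. The helper name `contentLength` is made up for this sketch.

```javascript
// Hypothetical helper: byte length of a request body once encoded as UTF-8.
function contentLength(body) {
    return Buffer.byteLength(body, 'utf8');
}

// "héllo wörld ✓" written with escapes so the file encoding does not matter.
var body = 'h\u00e9llo w\u00f6rld \u2713';

console.log(body.length);         // 13 (UTF-16 code units)
console.log(contentLength(body)); // 17 (bytes: 10 ASCII + 2 + 2 + 3)
```

For pure ASCII bodies the two numbers agree, which is why the bug only shows up with multibyte characters.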

nesting of methods

I am trying to call the get methods one inside the other but it doesn't seem to output anything. I am a noob, am I missing something here?

var nodeio  = require('node.io'),
      options = {timeout: 1000, max: 30},

exports.job = new nodeio.Job(options, {
    input : ['CHI'],
    run: function (keyword) {
        outerThis = this;
        this.getHtml('http://www.kayak.com/s/search/air/?l1=ATL&df=us1&ft=ow&d1=6/21/2011&l2='+keyword, function (err, $, data, headers) {
            var results = '';

            $('.flightresult').each(function(selector){
                outerThis.getHtml('http://www.google.com/', function(err, $, data, headers) {
                    results = "inside google";
                    outerThis.emit("asdadada");   // DOESN'T EMIT ANYTHING, DOESN'T RUN AT ALL
                });
            });

            this.emit(i++ +'. Results for ATL to '+keyword);
            this.emit(results);
        });
    }
});
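
A possible cause, not confirmed in this thread: the outer callback calls `this.emit()` immediately, before any nested callback has run, so if emitting marks the run as finished, the nested emits arrive too late. The sketch below shows the general "wait for all callbacks, then emit once" pattern; `collectAll` and `fakeRequest` are hypothetical and stand in for the nested `getHtml` calls.

```javascript
// Hypothetical helper: run `worker` for every item and call `done` exactly
// once, after every worker callback has fired (or on the first error).
function collectAll(items, worker, done) {
    var results = new Array(items.length);
    var pending = items.length;
    var failed = false;

    if (pending === 0) {
        return done(null, results);
    }

    items.forEach(function (item, index) {
        worker(item, function (err, value) {
            if (failed) return;
            if (err) {
                failed = true;
                return done(err);
            }
            results[index] = value;
            pending -= 1;
            if (pending === 0) {
                done(null, results);
            }
        });
    });
}

// Stand-in for a nested request; setTimeout simulates network latency.
function fakeRequest(url, callback) {
    setTimeout(function () {
        callback(null, 'fetched ' + url);
    }, 10);
}

collectAll(['a', 'b', 'c'], fakeRequest, function (err, results) {
    // In a job, this is where a single emit of the combined results would go.
    console.log(err || results);
});
```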

CoffeeScript jobs can't be used with command line options which require 'extend'

An example of the error:

hornairs@bishop:~/Code/busters (master *)$ cat jobs/hello.coffee 
nodeio = require 'node.io'
class Hello extends nodeio.JobClass
    input: ['']
    run: (num) -> @emit 'Hello World!'

@class = Hello
@job = new Hello()

hornairs@bishop:~/Code/busters (master *)$ node.io --debug jobs/hello.coffee 
DEBUG: Compiling jobs/hello.coffee => hello_compiled.js
DEBUG: Running 1 worker..
DEBUG: Writing to STDOUT
Hello World!
OK: Job complete

hornairs@bishop:~/Code/busters (master *)$ node.io --debug -i input.txt jobs/hello.coffee 
DEBUG: Compiling jobs/hello.coffee => hello_compiled.js

/usr/local/lib/node_modules/node.io/lib/node.io/processor.js:241
        job_obj = job_obj.extend(options.extend.options || {}, options.extend.
                          ^
TypeError: Object #<Hello> has no method 'extend'
    at [object Object].processJob (/usr/local/lib/node_modules/node.io/lib/node.io/processor.js:241:27)
    at /usr/local/lib/node_modules/node.io/lib/node.io/processor.js:328:30

Seems that the processing code expects all jobs to have an extend method so it can use it here: https://github.com/chriso/node.io/blob/master/lib/node.io/processor.js#L241, but CoffeeScript jobs don't end up with extend.

CoffeeScript jobs extend (using the CoffeeScript class system) the JobClass property, as opposed to calling the Job function. Job.extend is defined in that Job() function (https://github.com/chriso/node.io/blob/master/lib/node.io/job.js#L106). JobClass never gets extend, so the processing code errors when it calls it.

I think extend is in there because of the JobClass getter which returns a new object each time (so the prototype can't be mangled). Each Job() call operates on an effectively different JobClass, so the extend function body closes over the JobClass it's defined on. Maybe the solution to this is to define extend on the prototype and then bind it in the Job() function?

This is similar to #17 I believe, but I couldn't reproduce either.

How does skip work?

I'm trying to get node.io to skip a URL that was piped in from a file, since it returns a 404 error. I tried the following, but the code continued past this.skip() and as a result failed on the subsequent selector processing.

this.getHtml(url, function (err, $) {
    //Handle any request / parsing errors
    if (err) {
        this.skip();
        console.log(err);
        if (err == '404') {
            console.log('skipping');
            this.skip(); //<<----It's not skipping to the next url from stdin
        }
        else {
            this.exit(err);
        }
    }
.....................
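
Whatever skip() does inside node.io, a plain function call cannot stop the callback that called it, so the code after the `if (err)` block keeps running unless the callback returns. A sketch of that fix, using a hypothetical stub job (`makeFakeJob`) instead of a real node.io job; the `err == '404'` comparison is copied from the snippet above and is not verified against node.io.

```javascript
// Hypothetical stub that records which job methods were called.
function makeFakeJob() {
    var calls = [];
    return {
        calls: calls,
        skip: function () { calls.push('skip'); },
        exit: function (err) { calls.push('exit'); },
        emit: function (data) { calls.push('emit'); }
    };
}

// Callback body with an explicit return after the error handling.
function handleResponse(job, err, $) {
    if (err) {
        if (err == '404') {
            job.skip();
        } else {
            job.exit(err);
        }
        return; // stop here so the selector code below never runs
    }
    job.emit($('title'));
}

var job = makeFakeJob();
handleResponse(job, '404', undefined);
console.log(job.calls); // [ 'skip' ]
```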

Local installation?

I don't get it, how can I use this module inside my own module without having to export a job as a module??
A global install does not work because require('node.io') breaks.
And if that did work, how would I start a job? Is there an undocumented method to start a job?

Module does not contain job

I'm testing out the save google.html example and it's giving me the following error:

ERROR: Module does not contain job "http://www.google.com/"

Tried both js & coffee versions and ran: node.io -s save "http://www.google.com/" > google.html

node.io: 0.2.9-2
node.js: 0.5.0-pre

TypeError: Object #<Hello> has no method 'extend'

node 0.4 on OSX using the hello.coffee example in wiki.

$ node.io-web -p 8080 .node_modules
INFO: Listening on port 8080

Running job "hello.coffee"

/Users/sridharr/.nvm/v0.4.0/lib/node/.npm/node.io/0.2.2-5/package/lib/node.io/processor.js:113
            job_obj = job_obj.extend(options.extend.options || {}, options.ext
                              ^
TypeError: Object #<Hello> has no method 'extend'
    at /Users/sridharr/.nvm/v0.4.0/lib/node/.npm/node.io/0.2.2-5/package/lib/node.io/processor.js:113:31
    at /Users/sridharr/.nvm/v0.4.0/lib/node/.npm/node.io/0.2.2-5/package/lib/node.io/processor.js:154:13
    at /Users/sridharr/.nvm/v0.4.0/lib/node/.npm/node.io/0.2.2-5/package/lib/node.io/processor.js:319:53

Pause, wait, etc

Is there a possibility to make a delay between jobs/threads when input is true?
Meaning, that after scraping, reducing, outputting, etc the data node.io will wait before scraping again.

node.io -f throws exception

running multiple forks on a simple hello example throws an exception

node.io -f 1 hello

/usr/local/lib/node/.npm/node.io/0.2.5-5/package/lib/node.io/processor.js:296
this.status(JSON.stringify(e), 'debug');
^
TypeError: Object # has no method 'status'
at /usr/local/lib/node/.npm/node.io/0.2.5-5/package/lib/node.io/processor.js:296:22
at [object Object].handleMasterMessage (/usr/local/lib/node/.npm/node.io/0.2.5-5/package/lib/node.io/process_worker.js:28:9)
at EventEmitter.<anonymous> (/usr/local/lib/node/.npm/node.io/0.2.5-5/package/lib/node.io/processor.js:235:22)
at EventEmitter.emit (events.js:42:17)
at Socket.<anonymous> (/usr/local/lib/node/.npm/node.io/0.2.5-5/package/lib/node.io/multi_node.js:152:29)
at Socket.emit (events.js:42:17)
at Socket._onReadable (net.js:649:14)
at IOWatcher.onReadable as callback

http.js error

Running the first example of the wiki to scrape a webpage (equivalent of curl -v http://example.com) I'm getting the following error:

node.io scrape.js "http://perdu.com/"
<html><head><title>Vous Etes Perdu ?</title></head><body><h1>Perdu sur l'Internet ?</h1><h2>Pas de panique, on va vous aider</h2><strong><pre>    * <----- vous êtes ici</pre></strong></body></html>


http.js:330
  this.socket.destroy(error);
              ^
TypeError: Cannot call method 'destroy' of null
    at ClientRequest.destroy (http.js:330:15)
    at /Users/kev/local/lib/node/.npm/node.io/0.2.1-15/package/lib/node.io/request.js:225:25
    at IncomingMessage.<anonymous> (/Users/kev/local/lib/node/.npm/node.io/0.2.1-15/package/lib/node.io/request.js:322:13)
    at IncomingMessage.emit (events.js:59:20)
    at HTTPParser.onMessageComplete (http.js:111:23)
    at Socket.ondata (http.js:990:22)
    at Socket._onReadable (net.js:623:27)
    at IOWatcher.onReadable [as callback] (net.js:156:10)

$ npm list installed
npm info it worked if it ends with ok
npm info using [email protected]
npm info using [email protected]
[email protected]   =jashkenas active installed latest remote stable   Unfancy JavaScript     javascript language coffeescrip
[email protected]          =indexzero active installed latest remote stable   Add-on for creating *nix daemons    
[email protected]        =tjholowaychuk active installed latest remote   TDD framework, light-weight, fast, CI-friendly    
[email protected]      =tautologistics active installed latest remote   Forgiving HTML/XML/RSS Parser in JS for *both* Node and 
[email protected]      =cohara87 active installed remote   A distributed data scraping and processing framework for node.js     
[email protected]      =cohara87 installed remote   A distributed data scraping and processing framework for node.js     data ma
[email protected]      =cohara87 installed latest remote   A distributed data scraping and processing framework for node.js     
[email protected]        =caolan active installed latest remote   Easy unit testing for node.js and the browser.    
[email protected]            =isaacs installed remote   A package manager for node     package manager modules install package.json
[email protected]            =isaacs active installed latest remote   A package manager for node     package manager modules install p
[email protected]      =harryf active installed latest remote   Adds CSS selector support to htmlparser for scraping activities 
[email protected]       =cohara87 active installed latest remote   Data validation, filtering and sanitization for node.js     va
npm ok

now.core.on throws error in chrome

On the client side I have:

now.core.on("rv", function() {
  jQuery('#messages div:first-child').html('Value replaced.');
});

This is causing:
Uncaught TypeError: Function.prototype.apply: Arguments list has wrong type on now.js:116.

Add "delay" option to pause between calls to run()

This is a great framework, very easy to use! I have one suggestion: a very important part of scraping (especially with APIs) is not clobbering the server with hundreds of requests at once. Many APIs will ban/block you if you make too many requests in a small time period. For instance, the Wikitravel API requires that you not make more than one request per 30 seconds.

I suggest adding a Job option called "delay", in which you can specify the amount of time to wait before launching the next thread. That's pretty much the only thing that's missing. Thanks!
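
Until such an option exists, the spacing can be approximated outside the framework. This is only a sketch of the idea, not node.io code: `startOffsets` and `runSpaced` are hypothetical helpers, and they space out the start of each call rather than waiting for the previous one to finish.

```javascript
// Hypothetical helper: start offsets (in ms) for `count` calls spaced
// `delayMs` apart, e.g. 3 calls at 30000 ms -> [0, 30000, 60000].
function startOffsets(count, delayMs) {
    var offsets = [];
    for (var i = 0; i < count; i++) {
        offsets.push(i * delayMs);
    }
    return offsets;
}

// Schedule `worker` once per input, `delayMs` apart.
function runSpaced(inputs, delayMs, worker) {
    var offsets = startOffsets(inputs.length, delayMs);
    inputs.forEach(function (input, i) {
        setTimeout(function () {
            worker(input);
        }, offsets[i]);
    });
}

// Short delay here so the demo finishes quickly; an API that allows one
// request per 30 seconds would use 30000.
runSpaced(['page1', 'page2', 'page3'], 50, function (input) {
    console.log('requesting ' + input);
});
```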

Unable to add extra methods

I'm having a real problem trying to add extra methods to a job. Currently I'm using nodeio.JobProto.prototype to add my methods for my base class, but the problem comes when I try to access said function in the 'run' method: it reports that the method cannot be found.

nodeio.JobProto.prototype.test = function () {
    this.debug("hello world");
};
new nodeio.Job({
    run: function () {
        this.test();
    }
});

I got around this by mixing in my methods in the first run call, but this isn't the best way, since the methods reference the same function in all job instances.

Request Tests Fail

In test/request.test.js I get twelve ECONNREFUSED errors (on lines 66, 92, ...).

Is this a known problem? Could be an expresso issue. Which versions are you using assuming it works for you?

$ git rev-parse HEAD
89d775fb3e263fa67b23d7fbf64fd2ad19186c45
$ node --version
v0.4.11
$ expresso --version
0.8.1

Cheers!
Felix

utils.fatal - Object has no method 'fatal'

Thanks for your work on node.io, it's a fantastic tool!

I've just enabled jsdom in order to use jQuery, and now get the following error:

$ node.io nuffield

/usr/local/lib/node_modules/node.io/lib/node.io/dom.js:58
            utils.fatal('jQuery is not installed. Run `npm install jquery`');
                  ^
TypeError: Object #<Object> has no method 'fatal'
    at [object Object].parseHtml (/usr/local/lib/node_modules/node.io/lib/node.io/dom.js:58:19)
    at [object Object].<anonymous> (/usr/local/lib/node_modules/node.io/lib/node.io/request.js:108:18)
    at /usr/local/lib/node_modules/node.io/lib/node.io/request.js:217:25
    at /usr/local/lib/node_modules/node.io/lib/node.io/request.js:379:17
    at IncomingMessage.<anonymous> (/usr/local/lib/node_modules/node.io/lib/node.io/request.js:384:17)
    at IncomingMessage.emit (events.js:81:20)
    at HTTPParser.onMessageComplete (http.js:133:23)
    at Socket.ondata (http.js:1227:22)
    at Socket._onReadable (net.js:683:27)
    at IOWatcher.onReadable [as callback] (net.js:177:10)

I have run npm install jquery; node.io is installed globally while all the other modules are installed in my home directory.

Installed modules:

$ npm list
/Users/peter
├── [email protected] 
├── [email protected] 
├── [email protected] 
├─┬ [email protected] 
│ ├── [email protected] 
│ └── [email protected] 
├─┬ [email protected] -> /usr/local/lib/node_modules/node.io
│ ├── [email protected] 
│ ├── [email protected] 
│ └─┬ [email protected] 
│   └─┬ [email protected] 
│     └── [email protected] 
└── [email protected]

and

$ npm list -g
/usr/local/lib
├─┬ [email protected] 
│ ├── [email protected] 
│ ├── [email protected] 
│ └─┬ [email protected] 
│   └─┬ [email protected] 
│     └── [email protected] 
└─┬ [email protected] 
  ├── [email protected] 
  ├── [email protected] 
  ├── [email protected] 
  └── [email protected]

Please, do you have any idea what is wrong?

Missing something?

> echo "mastercard.com" | node.io pagerank    
ERROR: Failed to load job "pagerank". Please check that the job exists and compiles correctly.

Multibyte chunked response

From [email protected] on the mailing list:

request.js line 505 should be 'response.setEncoding('binary');'

for multibyte chunked response
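
To illustrate the failure mode this fix targets (a sketch, not a reproduction of request.js): when a multibyte character is split across two chunks and each chunk is decoded as UTF-8 on its own, the character is corrupted. Keeping the raw bytes and decoding once avoids that. The example assumes a modern Node.js where `Buffer.from` and `Buffer.concat` are available.

```javascript
// "é" is two bytes in UTF-8 (0xC3 0xA9); split it across two chunks.
var whole = Buffer.from('caf\u00e9', 'utf8'); // 5 bytes in total
var chunk1 = whole.slice(0, 4);               // ends in the middle of "é"
var chunk2 = whole.slice(4);

// Decoding each chunk separately corrupts the split character.
var perChunk = chunk1.toString('utf8') + chunk2.toString('utf8');

// Concatenating the raw bytes first, then decoding once, preserves it.
var joined = Buffer.concat([chunk1, chunk2]).toString('utf8');

console.log(perChunk === 'caf\u00e9'); // false
console.log(joined === 'caf\u00e9');   // true
```

A one-byte-per-character encoding such as 'binary' sidesteps the problem in a similar way, since no character can straddle a chunk boundary; the string then has to be converted back to the real charset once the response is complete.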

bindToDomElement.text fails when the node has nested children

Hi there, and congratulations for the good work done so far with Node.io.
I've been trying to re-use some of your clever dom augmentation methods in my own screen scraping project. I believe these methods could come in handy even outside of node.io, as they at least provide a decent read only DOM API, which is currently lacking in both Tautologistics's html parser and Soupselect. However, I've noticed that the element's 'text' getter does not work when called with a node that has children elements. For instance, try selecting the header texts from this DOM document:

$('h3',document).each(function(item){   
    puts(item.text)
});

home/me/mycode/dom.js:173
    text = self.filter(text).entityDecode();
                ^
TypeError: Object #<an Object> has no method 'filter'

In your code, this is on line 225. My guess is that you have not fully implemented this bit yet. I would be happy to help you testing this.

Best and thanks for sharing,
Andrea

jquery not working with jsdom true

As you suggested earlier, I tried using jsdom: true option, but nothing happens. It all works fine when I use jsdom:false. When I use it with jsdom: true, I always get a timeout error.

var nodeio = require('node.io');
var options = {
      max: 20,
      timeout: 20, 
      jsdom: true
    };

exports.job = new nodeio.Job(options, {
    input: false,
    run: function () {
        this.getHtml('http://node.io/', function (err, $, data) {
          if(err) this.exit(err);
          this.emit("jquery object:"+ $);
        });
    }
});

I am using

node.io -v 0.3.0
jquery -v 1.5.1
jsdom -v 0.2.0

I didn't have permission to reopen the previous issue, so I'm creating a new one...

Scraping example is missing a parenthesis

I'm talking about the node.js Example 1 here: https://github.com/chriso/node.io/wiki/Scraping

Instead of
} else {
self.emit(data);
}
}

it should be
} else {
self.emit(data);
}
});

Also, this is less important, but you may want a semicolon after self.exit(err) in the same example.

Early exit?

I'm using node.io to run a bit of a web scraper (grabbing game data from nhl.com) and I have a problem wherein node.io exits before all of the getHtml requests are completed. How does it decide when to exit? This uses node 0.2.5.

Working with proxies

Hi, can you provide some code that shows how one would use a proxy? I realize that the code is not tested, but I'm not even sure how to include it in the job options.

I'm assuming something like

new nodeio.HttpProxy("http://proxy:port")

is how it's meant to function?

Thanks for this excellent library. If I could get it to work through the proxy, it would be excellent.

Generate POST request from Form

When screen scraping, it is sometimes very convenient to automatically generate a POST request from a form in the HTML document. To get the point across, here is some code I wrote which does that for me.

    submitForm = ($, callback) =>
        body = {}

        for item in $('form select, form input[type!=submit], form input[type=submit]:first')
            if $(item).val()
                body[$(item).attr('name')] = $(item).val()

        url = root + $('form').attr('action').replace('&amp;', '&')
        header = {'Content-Type': 'application/x-www-form-urlencoded'}

        @postHtml url, body, header, callback

Would it be possible to have that feature in node.io?
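
The encoding step of the idea above can be sketched in plain JavaScript, independent of node.io. `buildFormBody` is a hypothetical helper that takes already-extracted form fields (name/value pairs), skips empty values as the CoffeeScript version does, and produces an `application/x-www-form-urlencoded` body. It assumes a Node.js version where `URLSearchParams` is a global.

```javascript
// Hypothetical helper: turn extracted form fields into a urlencoded body.
function buildFormBody(fields) {
    var params = new URLSearchParams();
    fields.forEach(function (field) {
        // Mirror the original snippet: only send fields that have a value.
        if (field.value) {
            params.append(field.name, field.value);
        }
    });
    return params.toString();
}

var body = buildFormBody([
    { name: 'q', value: 'a b' },
    { name: 'empty', value: '' },
    { name: 'x', value: '1&2' }
]);

console.log(body); // q=a+b&x=1%262
```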

Redirect Issues

I noticed a few issues with regard to HTTP redirects:

Cookies that are set for the initial request are not sent to the server on the redirect requests (when redirects are enabled).
Set-Cookie headers received after automatically following a redirect are not returned with the .get/.getHtml response headers
When setting redirect=0, I am able to catch the redirect by detecting 'err' being set in the callback, however again the headers for the callback are not being set (such as Location).

Thanks!

issue running a node.io script in a cron

I want to kick off a script from cron. The script works fine if run manually, but from cron the output I get is this...

/bin/sh: node.io: not found

I'm using the plain old user crontab -e to set up the cron job, so it should run as my standard user.

Any ideas?

The command I'm using is something like this:
*/1 * * * * node.io /home/myuname/code/project/import.js >> /home/myuname/code/project/import_log.log 2>&1

able to stop timeout exiting the process?

I've just started implementing a build tool that needs to scrape some data and return as a jsonp request. The problem I have is that the node.io timeout kills the parent process which I want to keep live all the time, is it possible to exit or tidy the node.io scrape process without killing the entire node process?

Unable to pass complex objects as my input

Hello, I'm trying to pass complex objects as my obj.

On first try it seems that I can't pass along an "object" itself, as the job just exits without any error. I then tried to encode my job using various methods, JSON.stringify and querystring.stringify, but the value passed to the "run" function only contained the first character:

exports.job = new nodeio.Job({

    input:function(start, num, callback) {

        setTimeout(function() {
            var body = {
                url:"http://google.com",
                foo:"boo",
                bar:"bar"
            };
            callback(JSON.stringify(body)); // only outputs the first char which is { same with querystring.stringify
        }, 1000);

    },
    run:function(job) {
        console.log(job);
    }


})
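
One possible explanation, offered as an assumption rather than confirmed behaviour: if the input callback expects an array of inputs, a bare string is treated as array-like, and only its first element, the character `{`, reaches run(). The sketch below reproduces that effect with a hypothetical consumer (`takeInputs`) and shows that wrapping the string in an array keeps it intact.

```javascript
// Hypothetical consumer that treats whatever it is given as a list of
// inputs and takes the first `num` of them.
function takeInputs(inputs, num) {
    return Array.prototype.slice.call(inputs, 0, num);
}

var body = { url: 'http://google.com', foo: 'boo', bar: 'bar' };
var encoded = JSON.stringify(body);

// A bare string is array-like, so the first "input" is its first character.
console.log(takeInputs(encoded, 1)); // [ '{' ]

// Wrapping the string in an array keeps it intact as a single input.
var wrapped = takeInputs([encoded], 1);
console.log(JSON.parse(wrapped[0]).url); // http://google.com
```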

Failed to load job "hello"

I'm trying to run the intro job hello.js but when I run this command:

node.io hello

I get this output:

 ERROR: Failed to load job "hello". Please check that the job exists and compiles correctly.

When I debug:

DEBUG: {"stack":"Error: Cannot find module 'node.io'\n    at Function._resolveFilename (module.js:322:11)\n    at Function._load (module.js:267:25)\n    at require (module.js:351:19)\n    at Object.<anonymous> (/Users/username/Documents/Nodejs/hello.js:1:75)\n    at Module._compile (module.js:407:26)\n    at Object..js (module.js:413:10)\n    at Module.load (module.js:339:31)\n    at Function._load (module.js:298:12)\n    at require (module.js:351:19)\n    at [object Object].loadJob (/opt/local/lib/node_modules/node.io/lib/node.io/processor.js:294:37)","message":"Cannot find module 'node.io'"}

However, echo "mastercard.com" | node.io -s pagerank works just fine just as node.io query "http://www.reddit.com/" a.title does.

Jobs might not always complete when job.options.max > 1

Hi Chris,

I think the Timer reset code in pullInput might sometimes prevent the complete event from being emitted. The reliability seems to be related to the size of the input, the max threads and the response speed for the last few requests.

My limited async debugging skills mean I've filled my local version of node.io with status('blah','debug') calls to see what's happening. It seems that when there's no more input data left to pull there could be requests still running, and if they don't complete within the 300ms before the next check in handle_input() then it's possible that the crucial final check is never made.

By changing line 129 in process_master.js from:

if (completeCheckInterval) {
    clearInterval(completeCheckInterval);
}

to something like:

if (completeCheckInterval && job.input.length > pull) {
    clearInterval(completeCheckInterval);
}

...it works each time, regardless of job.options.max or network conditions. In fact, removing that block is also fine as it only normally takes a few seconds for the last few threads to finish.

I'm using node.io within other node scripts. I run my jobs using nodeio.start(job, options, callback, true) and I'm providing a job.input() to load data from a filename passed in options.args. Not sure whether that would make a difference to either a) the way the input data is collected; or b) the behaviour of complete.

I can try and write some test scripts to recreate the problem I'm having if you think the issue is elsewhere?

I've found node.io to be really useful, thanks for sharing it. Let me know if you need a hand with any documentation, examples etc. I would love to contribute if I can.

Many thanks
Paul

Uniqueness of results before usage

This is fantastic! Currently I use a Ruby script to scrape a web database, and there's one thing my script has that I'd love to see here: the ability to validate the uniqueness of a record before writing it to a file or inserting it into a database.
And with regards to the distribution across servers, I'm truly looking forward to this!
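
As a rough sketch of what such a uniqueness check could look like in user code (this is not an existing node.io feature as far as this thread shows): keep a set of keys already seen and drop any record whose key repeats. `makeUniqueFilter` and the `id` field are made up for the example, and the set lives in memory, so it only covers a single process.

```javascript
// Hypothetical helper: returns a predicate that lets each key through once.
function makeUniqueFilter(keyOf) {
    var seen = new Set();
    return function isNew(record) {
        var key = keyOf(record);
        if (seen.has(key)) {
            return false;
        }
        seen.add(key);
        return true;
    };
}

var isNew = makeUniqueFilter(function (record) { return record.id; });

var records = [
    { id: 1, name: 'a' },
    { id: 2, name: 'b' },
    { id: 1, name: 'a again' }
];

// Only the first record for each id survives.
var unique = records.filter(isNew);
console.log(unique.length); // 2
```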

Quiet job completion for .scrape()?

So:

I'm using .scrape() within another node.js application and the fact that node.io makes a hard kill after job completion is becoming a problem. After I call .scrape() and the job is completed it also kills my node.js server after outputting the job status.

Is there a way around this?

Also: I am aware that node.io is primarily a console application meant to be running on its own; but it's just so sexy!

Installation documentation

As a new nodejs user, the global installation of node.io has been confusing, as I've had to install node.io globally and locally.

It appears by default (and design) that global installations are not in the path - this link https://github.com/isaacs/npm/issues/775#issuecomment-1389398 makes it clear this is intended (via http://stackoverflow.com/questions/6159552/problem-running-jobs-with-node-io/6415110#6415110).

I appreciate that you have a different view on the subject, so could you add steps to your installation page to explain what needs changing to the path and why, as the link above implies it's a bad idea to do?

Many thanks!

Broken for 0.3.7

This happens on every request (something to do with how destroy has changed/moved around)

Also, createClient has been removed from the public API docs

http.js:330
this.socket.destroy(error);
^
TypeError: Cannot call method 'destroy' of null
at ClientRequest.destroy (http.js:330:15)
at /usr/local/lib/node/.npm/node.io/0.2.1-18/package/lib/node.io/request.js:225:25
at IncomingMessage.<anonymous> (/usr/local/lib/node/.npm/node.io/0.2.1-18/package/lib/node.io/request.js:321:13)
at IncomingMessage.emit (events.js:59:20)
at HTTPParser.onMessageComplete (http.js:111:23)
at Client.onData as ondata
at Client._onReadable (net.js:623:27)
at IOWatcher.onReadable as callback

use jquery for object $

Is it possible to use jQuery in place of the $ object? Or is there any way I can create a jQuery window object with the help of the $ object?

var nodeio = require('node.io'), options = {timeout: 10},
    jQuery = require('jquery');

exports.job = new nodeio.Job(options, {
    input: ['hello', 'foobar', 'weather'],
    run: function (keyword) {
        this.getHtml('http://www.google.com/search?q=' + encodeURIComponent(keyword), function (err, $) {

            // SOMEHOW CREATE THE JQUERY OBJECT USING $

            var results = $('#resultStats').text.toLowerCase();
            this.emit(keyword + ' has ' + results);
        });
    }
});

Scraping stops the server

Hi,
maybe I'm doing something wrong, but I have a simple function that handles a request. It's just a copy and paste of your example. It works fine and prints all the titles. However, after that the console says "OK: Job complete" and the server stops without any warning. Am I doing something wrong?

Code:
function index(response, request){

require('node.io').scrape(function() {
    this.getHtml('http://www.reddit.com/', function(err, $) {
        var stories = [];
        $('a.title').each(function(title) {
            stories.push(title.text);
        });
        this.emit(stories);
    });
});
}

Requests sometimes fail

Sometimes in the crawler process I catch:

TypeError: Object # has no method 'destroy'
at /home/ec2-user/.node_libraries/.npm/node.io/0.1.1-15/package/lib/node.io/request.js:102:21

I made a simple workaround on this line by adding:

if(request && request.destroy) {
...
}

But I believe the underlying problem is more severe.
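The guard above can be written as a small standalone helper. This is only a sketch of the workaround, not a fix for whatever leaves the request without a `destroy` method; `safeDestroy` is a hypothetical name, not something in node.io's request.js.

```javascript
// Hypothetical helper: calls destroy() only when the object actually has it.
// Returns true if destroy() was called, false otherwise.
function safeDestroy(request) {
    if (request && typeof request.destroy === 'function') {
        request.destroy();
        return true;
    }
    return false;
}
```

Returning a boolean makes it possible to log the cases where the method was missing, which may help track down the underlying cause.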

Asserts fail silently?

I tried reddit.coffee from the tutorial and didn't get any output. For reference, here's the source code:

nodeio = require('node.io')

titles = []
scores = []
output = []

class Reddit extends nodeio.JobClass
    input: false
    run: -> 
        @getHtml 'http://www.reddit.com/', (err, $, data) =>
            @exit err if err?

            $('a.title').each (a) -> titles.push a.text
            $('div.score.unvoted').each (div) -> scores.push div.rawtext

            @exit 'Title / score mismatch' if scores.length isnt titles.length

            for score, i in scores
                if score is '•' then continue
                @assert(score).isInt()
                output.push '[' + score + '] ' + titles[i]

            @emit output

@class = Reddit
@job = new Reddit({timeout:10})

After a bit of debugging, I realized that something had changed since the tutorial was written, so that div.rawtext was undefined. This was causing @assert(score).isInt() to fail, so that threads would terminate without printing undefined. However, there wasn't any sort of message that the assertion had failed, even when I ran things with the -g flag:

john-laptop :: ~/tmp » node.io -g reddit.coffee                             1 ↵
INFO: Compiling /home/john/tmp/reddit.coffee
INFO: Running 1 worker..
DEBUG: GET http://www.reddit.com/
DEBUG: 200 http://www.reddit.com/
OK: Job complete

Is there something I can do to see when an assert fails?

Changing 3xx redirect of POST to GET

I needed to implement a form login via a POST request, which then redirected to the member page as expected. The problem is that after the redirect a GET request should be used, but node.io (request.js) used POST because it copies the method of the previous request. I changed this line:

self.doRequest(method, location, null,  headers, callback, parse, ++redirects);

to:

self.doRequest('GET', location, null,  headers, callback, parse, ++redirects);
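Hard-coding 'GET' changes the method for every redirect, including 307 and 308, which are defined to preserve the original method. A more selective choice could look like the sketch below, which follows common HTTP client behaviour; `redirectMethod` is a hypothetical helper and not part of request.js.

```javascript
// Hypothetical helper: picks the method to use when following a redirect.
function redirectMethod(status, method) {
    // 307 and 308 must repeat the request with the same method.
    if (status === 307 || status === 308) return method;
    // 303 switches to GET (HEAD stays HEAD).
    if (status === 303) return method === 'HEAD' ? 'HEAD' : 'GET';
    // 301 and 302: clients conventionally turn a POST into a GET.
    if ((status === 301 || status === 302) && method === 'POST') return 'GET';
    return method;
}
```

The result would be passed as the first argument in place of the hard-coded 'GET', assuming the response status code is available at that point in request.js.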

Distributed processing / scraping

Module loading issues?

Error message:

john-laptop :: ~/tmp » node.io hello

node.js:63
    throw e;
    ^
Error: Cannot find module '/home/john/tmp/hello'
    at loadModule (node.js:275:15)
    at require (node.js:411:14)
    at [object Object].loadJob (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/processor.js:298:16)
    at Object.start (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/processor.js:152:30)
    at /usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/cli.js:124:19
    at Object.cli (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/cli.js:279:5)
    at Object.<anonymous> (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/bin/node.io:3:20)
    at Module._compile (node.js:462:23)
    at Module._loadScriptSync (node.js:469:10)
    at Module.loadSync (node.js:338:12)

However, this seems to work:

node.io hello.coffee

And oddly enough, this works (using double.coffee from the tutorial):

node.io double

A related issue is that require statements don't seem to work properly. Using double.coffee and quad.coffee from the tutorial,

john-laptop :: ~/tmp » node.io quad.coffee                                  1 ↵
INFO: Compiling /home/john/tmp/quad.coffee

node.js:275
        throw new Error("Cannot find module '" + request + "'");
              ^
Error: Cannot find module './double'
    at loadModule (node.js:275:15)
    at require (node.js:411:14)
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:12:12)
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:25:4)
    at Module._compile (node.js:462:23)
    at Module._loadScriptSync (node.js:469:10)
    at Module.loadSync (node.js:338:12)
    at loadModule (node.js:283:14)
    at require (node.js:411:14)
    at [object Object].loadJob (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/processor.js:298:16)

If I modify the original quad.coffee so it refers to double.coffee:

john-laptop :: ~/tmp » node.io quad.coffee                                  1 ↵
INFO: Compiling /home/john/tmp/quad.coffee

/home/john/tmp/double.coffee:3
class Double extends nodeio.JobClass
      ^^^^^^
SyntaxError: Unexpected identifier
    at Module._compile (node.js:458:37)
    at Module._loadScriptSync (node.js:469:10)
    at Module.loadSync (node.js:338:12)
    at loadModule (node.js:283:14)
    at require (node.js:411:14)
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:12:12)
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:25:4)
    at Module._compile (node.js:462:23)
    at Module._loadScriptSync (node.js:469:10)
    at Module.loadSync (node.js:338:12)

I even tried compiling double.coffee into double.js and referring to it that way:

john-laptop :: ~/tmp » node.io quad.coffee
john-laptop :: ~/tmp » node.io quad.coffee           
INFO: Compiling /home/john/tmp/quad.coffee

/home/john/tmp/quad_compiled.js:6
    ctor.prototype = parent.prototype;
                           ^
TypeError: Cannot read property 'prototype' of undefined
    at /home/john/tmp/quad_compiled.js:6:28
    at /home/john/tmp/quad_compiled.js:17:5
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:22:3)
    at Object.<anonymous> (/home/john/tmp/quad_compiled.js:25:4)
    at Module._compile (node.js:462:23)
    at Module._loadScriptSync (node.js:469:10)
    at Module.loadSync (node.js:338:12)
    at loadModule (node.js:283:14)
    at require (node.js:411:14)
    at [object Object].loadJob (/usr/local/lib/node/.npm/node.io/0.1.1-19/package/lib/node.io/processor.js:298:16)

'htmlparser' module not found

I am using npm 1.0rc8 and node 0.4.5

When I try running the following code:

var nodeio = require('node.io');

exports.job = new nodeio.Job({
    input: false,
    run: function() {
        var self = this;

        this.getHtml('http://google.com/', function(err, $) {
            if (err) { self.exit(err) }

            self.emit($('input'));
        })
    }
});

I get this error:

Error: Cannot find module 'htmlparser'
    at Function._resolveFilename (module.js:320:11)
    at Function._load (module.js:266:25)
    at require (module.js:348:19)
    at Object.<anonymous> (/opt/local/lib/node_modules/node.io/vendor/soupselect/lib/soupselect.js:8:16)
    at Module._compile (module.js:404:26)
    at Object..js (module.js:410:10)
    at Module.load (module.js:336:31)
    at Function._load (module.js:297:12)
    at require (module.js:348:19)
    at [object Object].$ (/opt/local/lib/node_modules/node.io/lib/node.io/dom.js:20:22)

.each not working for classes used a single time

There is no way of using .each when there is just a single element matching a class. Take this HTML as an example.

<div class='example'>This is text contained in a div with the class example</div>
<div class='ex2'>This is also text, this time contained in a class named ex2</div>
<div class='example'>Oh look, same class as the first div, example</div>

All fun and games right?

Let's try to get the text out of every div with the class example:

var a = []; $('div.example').each(function(e){ a.push(e.text); }); this.emit(a);

This should work without a hitch.

Now let's try to get the text out of every div with the class ex2. We have no idea how many elements use this class, so we use each to get them all, even if there is just one:

var z = []; $('div.ex2').each(function(e){ z.push(e.text); }); this.emit(z);

Awww man! That didn't work at all!

.each only works for classes used multiple times. With CSSLint fanatics starting to use classes in place of IDs (#), this might become a larger issue in the future.
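One possible workaround is to normalize the selector result before iterating. The sketch below assumes that a single match comes back as a bare element rather than a collection, which this report does not confirm; `toArray` is a hypothetical helper, and the plain object stands in for an element.

```javascript
// Hypothetical helper: wraps a single value in an array and leaves
// arrays untouched, so the caller can always iterate.
function toArray(result) {
    if (result === null || result === undefined) return [];
    return Array.isArray(result) ? result : [result];
}

// Stand-in for a single matched element with a `text` property.
var texts = [];
toArray({text: 'only one'}).forEach(function (e) {
    texts.push(e.text);
});
```

In a job this would wrap the selector call, for example toArray($('div.ex2')), if the selector returns rather than throws when there is exactly one match.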