readabilitysax's Introduction

readabilitySAX

a fast and platform-independent Readability port

About

This is a port of the algorithm used by the Readability bookmarklet to extract relevant pieces of information from websites, using a SAX parser.

The advantage over other ports, e.g. arrix/node-readability, is a smaller memory footprint and much faster execution. In my tests, most pages, even large ones, were finished within 15 ms (on node; see below for more information). It works with Rhino, so it runs on YQL, which may have interesting uses. And it works within a browser.

The Readability extraction algorithm was completely ported, but some adjustments were made:

  • <article> and <section> tags are recognized and gain a higher value

  • If a heading is part of the page's <title>, it is removed (Readability removed any single <h2> and ignored other tags)

  • The classes henry and instapaper-body exist to show algorithms like this one where the content is; readabilitySAX recognizes them and awards additional points

  • Every bit of code taken from the original algorithm was optimized; e.g., the RegExps should now perform faster (they were tuned and use RegExp#test instead of String#match, which avoids forcing the interpreter to build a match array)

  • Some improvements made by GGReadability (an Obj-C port of Readability) were adopted

    • Images receive additional points when their height or width attributes are large; icon-sized images (<= 32px) are skipped
    • Additional classes & ids are checked
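The RegExp#test point above can be illustrated with a simplified class scorer. The pattern below is a shortened stand-in, not the library's actual regex:

```javascript
// Simplified sketch -- re_positive is a stand-in, not readabilitySAX's real pattern.
var re_positive = /article|body|content|entry|main|post|text/;

function classScore(className) {
    // RegExp#test returns a plain boolean, so no match array is allocated;
    // String#match would build an array for every successful match.
    return re_positive.test(className) ? 25 : -25;
}

console.log(classScore("article-content")); // 25
console.log(classScore("sidebar"));         // -25
```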

How To

Install readabilitySAX

npm install readabilitySAX

CLI

A command line interface (CLI) may be installed via

npm install -g readabilitySAX

It's then available via

readability <domain> [<format>]

To get this readme, just run

readability https://github.com/FB55/readabilitySAX

The format argument is optional: either text or html, with text as the default (e.g. readability https://github.com/FB55/readabilitySAX html).

Usage

Node

Just run require("readabilitySAX"). You'll get an object containing three methods:

  • Readability(settings): The readability constructor. It works as a handler for htmlparser2. Read more about it in the wiki!

  • WritableStream(settings, cb): A constructor that unites htmlparser2 and the Readability constructor. It's a writable stream, so simply .write all your data to it. Your callback will be called once .end() was called. Bonus: You can also .pipe data into it!

  • createWritableStream(settings, cb): Returns a new instance of the WritableStream. (It's a simple factory method.)

There are two methods available that are deprecated and will be removed in a future version:

  • get(link, [settings], callback): Fetches a webpage and processes it.

  • process(data): Takes a string, runs readabilitySAX and returns the page.

Please don't use those two methods anymore. Streams are the way interfaces should be built in node, and that's what I want to encourage people to use.

Browsers

I started to implement simplified SAX-"parsers" for Rhino/YQL (using E4X) and the browser (using the DOM) to increase the overall performance on those platforms. The DOM version is inside the /browsers dir.

A demo of how to use readabilitySAX inside a browser may be found at jsFiddle. Some basic example files are inside the /browsers directory.

YQL

A table using E4X-based events is available as the community table readabilitySAX, as well as here.

Parsers (on node)

Most SAX parsers (such as sax.js) fail on documents that are malformed XML, even when they are valid HTML. readabilitySAX should be used with htmlparser2, my fork of the htmlparser module (used by e.g. jsdom), which corrects most faults. It's listed as a dependency, so npm should install it with readabilitySAX.

Performance

Speed

Using a package of 724 pages from CleanEval (their website seems to be down; try googling it), readabilitySAX processed all of them in 5768 ms, an average of 7.97 ms per page.

The benchmark was done using tests/benchmark.js on a MacBook (late 2010) and is probably far from perfect.
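The per-page figure follows directly from the totals:

```javascript
// 724 CleanEval pages processed in 5768 ms
var avg = 5768 / 724;
console.log(avg.toFixed(2) + " ms per page"); // "7.97 ms per page"
```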

Performance is the main goal of this project. The current speed should be good enough to run readabilitySAX on a single-threaded web server with an average number of requests. That's an accomplishment!

Accuracy

The main goal of CleanEval is to evaluate the accuracy of an algorithm.

// TODO

Todo

  • Add documentation & examples
  • Add support for URLs containing hash-bangs (#!)
  • Allow fetching articles with more than one page
  • Don't remove all images inside <a> tags
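For the hash-bang item, one possible approach (a sketch of my own, not the project's plan) is the old AJAX-crawling convention of mapping #! fragments to an _escaped_fragment_ query parameter before fetching:

```javascript
// Sketch: rewrite http://example.com/#!/page
// to http://example.com/?_escaped_fragment_=%2Fpage
function escapeFragment(url) {
    var i = url.indexOf("#!");
    if (i === -1) return url;
    var sep = url.indexOf("?") === -1 ? "?" : "&";
    return url.slice(0, i) + sep + "_escaped_fragment_=" +
           encodeURIComponent(url.slice(i + 2));
}

console.log(escapeFragment("http://example.com/#!/page"));
// -> http://example.com/?_escaped_fragment_=%2Fpage
```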

readabilitysax's People

Contributors

barijaona, bryant1410, dependabot-preview[bot], dependabot[bot], fb55, hamtie, imanel, patrickstankard


readabilitysax's Issues

ontext - cannot read property 'children' of undefined

Hello,

I just "upgraded" to the current version here from the repo and now get an error.

TypeError: Cannot read property 'children' of undefined
    at [object Object].ontext (/media/cryptofs/apps/usr/palm/services/com.codingbees.headlines2.service/lib/readabilitySAX.js:425:22)
    at Parser._parseTags (/media/cryptofs/apps/usr/palm/services/com.codingbees.headlines2.service/lib/Parser.js:204:14)
    at Parser.<anonymous> (/media/cryptofs/apps/usr/palm/services/com.codingbees.headlines2.service/lib/Parser.js:52:7)
    at Parser.parseComplete (/media/cryptofs/apps/usr/palm/services/com.codingbees.headlines2.service/lib/Parser.js:43:7)
    at [object Object].run (ReadabilityProcessor.js:76:12)

It seems to be an issue here:

    Readability.prototype.ontext = function(text){
        this._currentElement.children.push(text);
    };

Command line client returns error

readability https://github.com/FB55/readabilitySAX

/usr/local/lib/node_modules/readabilitySAX/node_modules/minreq/index.js:82
throw Error("Data was already emitted!");
^
Error: Data was already emitted!
at Error (unknown source)
at Request.pipe (/usr/local/lib/node_modules/readabilitySAX/node_modules/minreq/index.js:82:9)
at null. (/usr/local/lib/node_modules/readabilitySAX/node/getURL.js:48:7)
at EventEmitter.emit (events.js:96:17)
at ClientRequest. (/usr/local/lib/node_modules/readabilitySAX/node_modules/minreq/index.js:205:9)
at ClientRequest.g (events.js:192:14)
at ClientRequest.EventEmitter.emit (events.js:96:17)
at HTTPParser.parserOnIncomingClient as onIncoming
at HTTPParser.parserOnHeadersComplete as onHeadersComplete
at CleartextStream.socketOnData as ondata

Add share and nojavascript to unlikelyCandidates

Add share and nojavascript to the following:

    re_negative = /com(?:bx|ment|-)|contact|foot(?:er|note)?|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|share|nojavascript|shopping|tags|tool|widget/,
    re_unlikelyCandidates = /ad-break|agegate|auth?or|bookmark|cat|com(?:bx|ment|munity)|date|disqus|extra|foot|header|ignore|info|links|share|nojavascript|menu|nav|pag(?:er|ination)|popup|related|remark|rss|shoutbox|sidebar|similar|social|sponsor|teaserlist|time|tweet|twitter/,

readabilitySAX doesn't handle web pages which require cookies

A number of web pages which require cookies to be enabled respond with a page containing the message "Cookies must be enabled", which then gets processed by readabilitySAX as if it were the expected content of the page.

It looks like that this happens due to the use of minreq https://github.com/fb55/readabilitySAX/blob/master/lib/getURL.js#L28 which doesn't support cookie by design in order to improve performance.

Does readabilitySAX purposely ignore web pages which require cookies, for performance reasons?
Or are you open to the possibility of replacing minreq with request or any other module that supports cookies?

Bug in DomAsSax

The sequence of callbacks in DOMAsSAX.js should be reversed so that onopentagname fires before onattribute, as follows:

    callbacks.onopentagname(name);

    for(var i = 0, j = attributeNodes.length; i < j; i++){
        callbacks.onattribute(attributeNodes[i].name+'', attributeNodes[i].value);
    }

The current sequence is causing attributes to be put on the wrong node in the parser.

ReDoS vulnerability in readabilitySAX.js

Description

A ReDoS vulnerability is an algorithmic-complexity vulnerability that usually appears in backtracking regex engines, e.g. Python's default regex engine. An attacker can construct malicious input that triggers the engine's worst-case time complexity, causing a denial of service.

This project uses the ReDoS-vulnerable regex (?:<br\/>(?:\s|&nbsp;?)*)+(?=<\/?p), which can be triggered by the PoC below:

const arg = require('arg');
const args = arg(
    {
        '--foo': String
    },
    {
        argv: ['<br/>' + '<br/>'.repeat(24495) + '\x00']
    }
);
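For context on what this pattern normally does: it strips runs of <br/> tags (plus interleaved whitespace and &nbsp; entities) that directly precede a paragraph tag. A safe, small-input illustration with the native engine:

```javascript
var re = /(?:<br\/>(?:\s|&nbsp;?)*)+(?=<\/?p)/g;

var input = "intro<br/><br/> <p>next paragraph</p>";
console.log(input.replace(re, ""));
// -> "intro<p>next paragraph</p>"
```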

How to repair

The cause of this vulnerability is the use of a backtracking regex engine. I recommend using the RE2 regex engine developed by Google; however, it doesn't support the lookaround and backreference extension features, so the original regex needs to be changed and additional code constraints added. Here is my repair solution:

const RE2 = require('re2');
// original: (?:<br\/>(?:\s|&nbsp;?)*)+(?=<\/?p)
function safe_match(node) {
    // RE2 has no lookahead, so capture the <p / </p prefix and put it back.
    // Note the doubled backslashes, required inside a string literal.
    const r1 = new RE2('((?:<br\\/>(?:\\s|&nbsp;?)*)+)(<\\/?p)', 'g');
    return node.replace(r1, '$2');
}

The match semantics of the new regex plus the code constraint above are equivalent to the original regex.

I hope the author can adopt this repair solution and I would be very grateful. Thanks!

Take into account text in divs

It's very useful for some sites to handle this case:

<div class="content">
  Some long text...
  <a href="...">Link</a>
  ...
</div>

Competitors do:
https://github.com/mozilla/readability/blob/master/Readability.js#L677-L688
https://github.com/luin/readability/blob/master/src/helpers.js#L108-L119

I think it should be optional. Something like:

    if(tagName === "p" || tagName === "pre" || tagName === "td");
    else if(tagName === "div"){
        var done = false;
        //check if div should be converted to a p
        for(i = 0, j = divToPElements.length; i < j; i++){
            if(divToPElements[i] in elem.info.tagCount){
                done = true;
                break;
            }
        }

        if(done && this._settings.strayText){
            for(i = 0, j = elem.children.length; i < j; i++) {
                var child = elem.children[i];
                if(typeof child !== 'string')
                    continue;
                var textLength = child.trim().length;
                if(textLength > 24 && elem.parent){
                    elem.isCandidate = elem.parent.isCandidate = true;
                    var commas = re_commas.test(child) ? child.split(re_commas).length - 1 : 0;
                    var addScore = 1 + commas + Math.min( Math.floor( textLength / 100 ), 3);
                    elem.tagScore += addScore;
                    elem.parent.tagScore += addScore / 2;
                }
            }
        }

        if(done)
            return;

        elem.name = "p";
    }
    else return;

What do you think about it?

Cannot read property 'totalScore' of null

On page:

http://www.nytimes.com/adx/bin/adx_click.html?type=goto&opzn&page=homepage.nytimes.com/index.html&pos=HPmodule-RE2&sn2=d1cdc681/5274a2cb&sn1=5e1b0829/1821bb3d&camp=hpmodRE-TFC-01-1494369-cla&ad=8.29.11-TFC-TheView-173x98-HPM&goto=http%3A%2F%2Flivingtheview%2Ecom%2F%3Futm%5Fsource%3DNY%252BTimes%26utm%5Fmedium%3Dhome%252Bpage%252Bmodule%26utm%5Fcampaign%3Dview%2Dlivingrm3

I get following error:

2011-12-06T17:20:52+00:00 app[web.1]: { protocol: 'http:',
2011-12-06T17:20:52+00:00 app[web.1]:   slashes: true,
2011-12-06T17:20:52+00:00 app[web.1]:   host: 'www.nytimes.com',
2011-12-06T17:20:52+00:00 app[web.1]:   hostname: 'www.nytimes.com',
2011-12-06T17:20:52+00:00 app[web.1]:   href: 'http://www.nytimes.com/adx/bin/adx_click.html?type=goto&opzn&page=homepage.nytimes.com/index.html&pos=HPmodule-RE2&sn2=d1cdc681/5274a2cb&sn1=5e1b0829/1821bb3d&camp=hpmodRE-TFC-01-1494369-cla&ad=8.29.11-TFC-TheView-173x98-HPM&goto=http%3A%2F%2Flivingtheview%2Ecom%2F%3Futm%5Fsource%3DNY%252BTimes%26utm%5Fmedium%3Dhome%252Bpage%252Bmodule%26utm%5Fcampaign%3Dview%2Dlivingrm3',
2011-12-06T17:20:52+00:00 app[web.1]:   search: '?type=goto&opzn&page=homepage.nytimes.com/index.html&pos=HPmodule-RE2&sn2=d1cdc681/5274a2cb&sn1=5e1b0829/1821bb3d&camp=hpmodRE-TFC-01-1494369-cla&ad=8.29.11-TFC-TheView-173x98-HPM&goto=http%3A%2F%2Flivingtheview%2Ecom%2F%3Futm%5Fsource%3DNY%252BTimes%26utm%5Fmedium%3Dhome%252Bpage%252Bmodule%26utm%5Fcampaign%3Dview%2Dlivingrm3',
2011-12-06T17:20:52+00:00 app[web.1]:   query: 'type=goto&opzn&page=homepage.nytimes.com/index.html&pos=HPmodule-RE2&sn2=d1cdc681/5274a2cb&sn1=5e1b0829/1821bb3d&camp=hpmodRE-TFC-01-1494369-cla&ad=8.29.11-TFC-TheView-173x98-HPM&goto=http%3A%2F%2Flivingtheview%2Ecom%2F%3Futm%5Fsource%3DNY%252BTimes%26utm%5Fmedium%3Dhome%252Bpage%252Bmodule%26utm%5Fcampaign%3Dview%2Dlivingrm3',
2011-12-06T17:20:52+00:00 app[web.1]:   pathname: '/adx/bin/adx_click.html' }
2011-12-06T17:20:52+00:00 app[web.1]: 
2011-12-06T17:20:52+00:00 app[web.1]: /app/node_modules/readabilitySAX/readabilitySAX.js:611
2011-12-06T17:20:52+00:00 app[web.1]:       score: this._topCandidate.totalScore
2011-12-06T17:20:52+00:00 app[web.1]:                            ^
2011-12-06T17:20:52+00:00 app[web.1]: TypeError: Cannot read property 'totalScore' of null
2011-12-06T17:20:52+00:00 app[web.1]:     at [object Object].getArticle (/app/node_modules/readabilitySAX/readabilitySAX.js:611:28)
2011-12-06T17:20:52+00:00 app[web.1]:     at Request.<anonymous> (/app/node_modules/readabilitySAX/node/index.js:49:26)
2011-12-06T17:20:52+00:00 app[web.1]:     at Request.emit (events.js:64:17)
2011-12-06T17:20:52+00:00 app[web.1]:     at IncomingMessage.<anonymous> (/app/node_modules/readabilitySAX/node_modules/request/main.js:423:16)
2011-12-06T17:20:52+00:00 app[web.1]:     at IncomingMessage.emit (events.js:81:20)
2011-12-06T17:20:52+00:00 app[web.1]:     at HTTPParser.onMessageComplete (http.js:133:23)
2011-12-06T17:20:52+00:00 app[web.1]:     at Socket.ondata (http.js:1213:22)
2011-12-06T17:20:52+00:00 app[web.1]:     at Socket._onReadable (net.js:681:27)
2011-12-06T17:20:52+00:00 app[web.1]:     at IOWatcher.onReadable [as callback] (net.js:177:10)

Happens every time I try

Property '_convertLinks' of object is not a function

I get this error when running readabilitySAX from the version on npm:

    elem.attributes[name] = this._convertLinks(value);
                            ^
TypeError: Property '_convertLinks' of object [object Object] is not a function
    at [object Object].onopentag (node_modules/readabilitySAX/ReadabilitySAX.js:349:33)

And ReadabilitySAX.js is different to the version on GitHub, so I think you might need to push to npm again.

Some self-closing <iframe/> tags break the source

Hi,

I'm currently using the latest version of ReadabilitySAX (1.4.1, from npm). While parsing some webpages I found that in some cases it outputs self-closing tags that aren't rendered well in browsers.

For example:

  • <iframe />: On some pages with an iframe embed, mostly from YouTube, the browser displays only that iframe, because it expects a closing </iframe>. Example Page || Example II. Source of the page

readabilitySAX can be misled by multiple <title> tags

Look at this page http://www.liberation.fr/planete/2015/12/11/la-cop21-bute-sur-la-finance_1420218?xtor=rss-450

The right title is "La COP21 bute sur la finance". The original readability as well as Safari reader mode get it correctly.

Instead, readabilitySAX chooses "Ptit Libé", which is in a <title> tag for a symbol used in the "hamburger menu" at the top left corner of the page.

I don't know if this use of <symbol> and <title> is correct, but the W3C HTML validator does not complain about the extraneous <title> tags.
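The markup pattern described here (simplified and hypothetical, not copied from the page) looks roughly like this; SVG reuses <title> as a description element for accessibility:

```html
<head>
  <title>La COP21 bute sur la finance - Libération</title>
</head>
<body>
  <!-- hamburger-menu icon: the SVG <title> is a description, not the page title -->
  <svg style="display:none">
    <symbol id="ptit-libe"><title>Ptit Libé</title></symbol>
  </svg>
  ...
</body>
```

A SAX-level handler that collects text from every <title> element cannot tell these apart unless it tracks whether it is currently inside an <svg> subtree.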

TypeError on some pages when `searchFurtherPages=true`

Hello,

Just found this great project and have to say thanks for sharing!
I'm using this service in webOS and I get an error when parsing data from e.g. an Engadget article with searchFurtherPages=true.

"errorCode":-9999,"exception":"TypeError: Cannot read property '1' of null
at [object Object]._scanLink (/lib/readabilitySAX.js:268:39)
at [object Object].onclosetag (/lib/readabilitySAX.js:388:8)
at Parser._processCloseTag (/lib/Parser.js:237:16)
at Parser._parseTags (/Parser.js:200:43)
at Parser.<anonymous> (/lib/Parser.js:52:7)
at Parser.parseComplete

Article for test: http://www.engadget.com/2011/12/04/whaaa-the-original-motorola-lapdock-can-now-be-yours-for-50/ (but it looks like all articles from Engadget don't work)

If I set searchFurtherPages=false, nothing is returned as the article.

My data flow: I request the page from a webOS WebService and just send the response to ReadabilitySAX. That works fine for many articles, but not for Engadget ones, and maybe for other sites as well.

Version 1.3.3 still unstable?

Hi FB55,
First of all thank you for the effort you are putting into improving the original readability code.

Am I doing something wrong, or is readabilitySAX not stable? I've tried studying some of your code, e.g. tests/test_output.js, and even very simple code that does nothing, like the snippet below derived from it, fails to execute.

var Readability = require("readabilitySAX"),
    readable = new Readability({
        pageURL: "http://www.fastcompany.com/1811210/state-of-the-union-address-is-ultimate-master-class-in-public-speaking?partner=rss"
    });

The error I get is:

node.js:201
throw e; // process.nextTick error, or 'error' event on first tick
^
TypeError: object is not a function
at Object.CALL_NON_FUNCTION_AS_CONSTRUCTOR (native)
at Object. (/Users/giacecco/Documents/Projects/jour.no/faves/eclipse-workspace/test1/test1.js:2:16)
at Module._compile (module.js:432:26)
at Object..js (module.js:450:10)
at Module.load (module.js:351:31)
at Function._load (module.js:310:12)
at Array.0 (module.js:470:10)
at EventEmitter._tickCallback (node.js:192:40)

Thank you in advance for any suggestion.

G.

Version 1.3.3: dirty characters in the readable output?

Hi FB55, try to run this:

var Readability = require("readabilitySAX").get(
  "http://www.fastcompany.com/1811210/state-of-the-union-address-is-ultimate-master-class-in-public-speaking?partner=rss",
  function(results) {
    console.log(results);
  } 
);

... and you will see that at some point in the text there are lots of garbage characters replacing apparently harmless characters such as single and double quotes.

At first I thought it could be an encoding issue I had not thought of, but by definition the get function is supposed to output HTML free of odd characters, isn't it?

Hope this was useful. I am working with node 0.6.8 on MacOS 10.7.2 .

G.

Feature request: images management

Hi FB55,
I am using readabilitySAX to build a script that makes a few selected web pages readable and sends them to my Kindle for later reading.

At the moment readabilitySAX leaves image references as they are in the original document. It would be nice if you:

a) added a parameter to strip away all images in any case (I realise I could do this myself with a one-line regex replace), or even better...

b) added a feature to download to a specified location all images that are considered relevant to the readable article, replacing the URLs in the readable text with references to the local copies of the same files, so that I can 'package them together' and send them to the Kindle

What do you reckon?

Thanks again for your work on readabilitySAX!

G.

Unwanted elements showing after last change

After commit ec72bae some unwanted elements started showing up. Example (as in earlier issues): http://www.appleinsider.com/articles/12/01/14/reacting_to_apple_at_ces_2012_intels_ultrabooks_to_samsungs_galaxy_note.html

Artifacts:

  • In the starting paragraph there is a strange table that isn't visible on the page. After a small investigation, it looks like there is a table with a script tag that gets trimmed, but the empty table is still present.
  • The link to the next page is no longer removed. It was removed earlier, but now it's moved outside of the div and used as a normal element.

Sidenote: it looks like reverting this commit makes it work again, but images are removed. A fix for that could be setting cleanConditionally to false, but then the table shows up again (in both the HEAD version and the reverted version).

As for images: it looks like it's not the fault of line 471; there we have only one image per div, so the conditions are valid. What is stripping images is both line 468 and lines 472-475.

Explanation:

  • Line 468 removes the img because there are not enough paragraphs on the page (more images than paragraphs). One reason is that changing double br tags to a paragraph is not counted towards the paragraph count. Another is that if an image is the first item in the article, no paragraphs are available yet, so it will be stripped. That's usually not a problem, because in most cases an image before the first paragraph is just a teaser and can safely be removed. So we just need to start counting double br tags as paragraphs.
  • Lines 472-475: I'm not 100% sure how this works, but it looks like the density is wrongly calculated for images inside of divs (though that's just a guess).

Multiple regressions in 1.0.0

Test url:

http://www.appleinsider.com/articles/11/12/14/inside_anobit_why_apple_is_investing_in_flash_ram_technology.html

Result for both 0.6.1 and 1.0.0:

https://gist.github.com/5d1776577a95927e98b3

Regressions:

  • the title is not found
  • the main content is not fetched (it was working in 0.6.1)
  • multiple irrelevant elements are returned as the body
  • \r\n is not converted to \n
  • multiple empty <p> and <div> tags remain
  • a <script> tag is still present

v0.6.0 not working on heroku

If I try to deploy on heroku I get the following error:

2011-12-02T13:43:47+00:00 heroku[web.1]: State changed from created to starting
2011-12-02T13:43:49+00:00 heroku[web.1]: Starting process with command `node web.js`
2011-12-02T13:43:49+00:00 app[web.1]: 
2011-12-02T13:43:49+00:00 app[web.1]: node.js:134
2011-12-02T13:43:49+00:00 app[web.1]:         throw e; // process.nextTick error, or 'error' event on first tick
2011-12-02T13:43:49+00:00 app[web.1]:         ^
2011-12-02T13:43:49+00:00 app[web.1]: Error: Cannot find module '../ReadabilitySAX'
2011-12-02T13:43:49+00:00 app[web.1]:     at Function._resolveFilename (module.js:320:11)
2011-12-02T13:43:49+00:00 app[web.1]:     at Function._load (module.js:266:25)
2011-12-02T13:43:49+00:00 app[web.1]:     at require (module.js:348:19)
2011-12-02T13:43:49+00:00 app[web.1]:     at Object.<anonymous> (/app/node_modules/readabilitySAX/node/index.js:1:81)
2011-12-02T13:43:49+00:00 app[web.1]:     at Module._compile (module.js:404:26)
2011-12-02T13:43:49+00:00 app[web.1]:     at Object..js (module.js:410:10)
2011-12-02T13:43:49+00:00 app[web.1]:     at Module.load (module.js:336:31)
2011-12-02T13:43:49+00:00 app[web.1]:     at Function._load (module.js:297:12)
2011-12-02T13:43:49+00:00 app[web.1]:     at require (module.js:348:19)
2011-12-02T13:43:49+00:00 app[web.1]:     at Object.<anonymous> (/app/web.js:1:82)
2011-12-02T13:43:49+00:00 heroku[web.1]: Process exited
2011-12-02T13:43:50+00:00 heroku[web.1]: State changed from starting to crashed

web.js is a small wrapper around readabilitySAX. Everything works on my local machine (node v0.4.7, npm v1.0.94), but on heroku (with the same versions of node and npm) it fails.

my package.json:

{
  "name": "myapp",
  "version": "0.0.1",
  "dependencies": {
    "readabilitySAX": "0.6.0"
  }
}

No content for meteor page

The url is http://meteor.com/.

This may not be sensible, but I was wondering whether, in the case of this page, it should at least get the early headers (presumably with some indication that the rest of the content can only be accessed by going to the page itself).

Article images need better detection

There are a few cases where Safari Reader does a better job of leaving in article images that are filtered out by readabilitySAX. Here is an example: http://hommemaker.com/2012/08/20/why-the-gays-hate-their-bodies/. Compare the Safari Reader rendering with readabilitySAX's. In this case readabilitySAX should preserve images that are wrapped inside an <a> parent and a <p> grandparent tag. The general rule might be that a single image of sufficient size, with any number of wrapping tags, is a candidate. There is probably a better general rule; that is just my take on it.
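The rule suggested here could be sketched as follows (hypothetical helper, not readabilitySAX's actual code; the 32px cutoff is the icon threshold mentioned in the README):

```javascript
// Hypothetical sketch of the proposed rule -- not readabilitySAX's actual logic.
// An image is a candidate if it is reasonably large, regardless of how many
// wrapper tags (<a>, <p>, ...) sit between it and the scored container.
function isImageCandidate(attributes) {
    var w = parseInt(attributes.width, 10) || 0;
    var h = parseInt(attributes.height, 10) || 0;
    return w > 32 && h > 32; // icon-sized images (<= 32px) are skipped, per the README
}

console.log(isImageCandidate({ width: "640", height: "480" })); // true
console.log(isImageCandidate({ width: "16", height: "16" }));   // false
```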

Parsing remote documents within a browser with XMLHttpRequest restrictions disabled

Hi Felix,

First let me thank you for creating readabilitySAX and making it publicly available. I'm hoping to use it and its implementation of the core Readability algorithm to develop an offline RSS reader application for the BlackBerry PlayBook tablet. However, I am having an issue.

The PlayBook allows developers to disable the same origin security policy for XMLHttpRequest calls. I simulate that during development in the Chrome browser by running it with the option "chrome.exe --disable-web-security". What I'm trying to do is pull down a remote HTML document and pass that to saxParser, and then have it do callbacks to readable like in your original example for use in browsers. Unfortunately, the "article" object I get back always contains:

Object
html: "<h3>This page contains the following errors:</h3><p>error on line 42 at column 10: Opening and ending tag mismatch: link line 0 and head↵</p><h3>Below is a rendering of the page up to the first error.</h3>"
nextPage: ""
score: 1
textLength: 179
title: "There's no such thing as a free launch"

This is what I've modified:

<!doctype html>
<title>DOMasSAX test</title>
<body>
<script src="./DOMasSAX.js"></script>
<script src="../readabilitySAX.js"></script>
<script>
    function data_retrieve( url ) {
        var xmlhttp = new XMLHttpRequest();
        xmlhttp.open("GET", url, false);
        xmlhttp.send();

        return xmlhttp.responseText;
    }

    var parser = new DOMParser();
    var str_doc = data_retrieve('http://hackerjobs.co.uk/blog/2012/4/13/there-s-no-such-thing-as-a-free-launch');
    var xmlDoc = parser.parseFromString(str_doc, "text/xml");

    var readable = new Readability();
    readable.setSkipLevel(3);
    //saxParser(document.childNodes[document.childNodes.length-1], readable);
    saxParser(xmlDoc.childNodes[xmlDoc.childNodes.length-1], readable);
    article = readable.getArticle();
    console.log(article);
    function makeReadable(){
        document.body.innerHTML = "<h1>" + article.title + "</h1>" + article.html;
    }
</script>
</body>

Sometimes gets stuck at the piping stage in the middle of a stream

Have you seen this?

Here is code to reproduce it; you need to run it a few times, as it only works in about 50% of cases on my machine.

node 0.11.13
superagent 0.18.0
readabilitySax 1.6.1

var request = require('superagent')
var readabilitySax = require('readabilitySAX')

var url = 'http://techcrunch.com/2010/11/18/mark-zuckerberg/'

request
    .get(url)
    .buffer(false)
    .on('error', console.log)
    .on('response', function(res) {

        var options = {pageURL: url, type: 'html'}

        res.pipe(readabilitySax.createWritableStream(options, console.log))

    })
    .end()
