Comments (12)

kof avatar kof commented on July 16, 2024

I know you have another request wrapper that probably doesn't trigger this bug. However, I need more features than it offers, and superagent seems to be the best option. I pipe from superagent into lots of different modules and have never seen a problem like this, so there is probably something wrong with WritableStream. You might want to look at the event-stream module, which works pretty much always, for example its .through method. They have probably fixed something like this.

from readabilitysax.

kof avatar kof commented on July 16, 2024

I haven't tried to understand what the reason is. I just took event-stream and replaced WritableStream with it, and now it works perfectly:

var Readability = readabilitySax.Readability
var Parser = htmlParser.Parser
var CollectingHandler = htmlParser.CollectingHandler

var readability = new Readability({pageURL: url, type: 'html'})
var handler = new CollectingHandler(readability)
var parser = new Parser(handler, {lowerCaseTags: true})

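// Pass every chunk through unchanged while also feeding it to the parser.
return es.through(function(data) {
    parser.write(data)
    this.emit('data', data)
}, function() {
    // Retry with skip levels 1 to 3 while the best candidate node
    // has fewer than 250 characters of text.
    for(
        var skipLevel = 1;
        readability._getCandidateNode().info.textLength < 250 && skipLevel < 4;
        skipLevel++
    ){
        readability.setSkipLevel(skipLevel)
        handler.restart()
    }

    // Collapse whitespace, decode HTML entities, and hand the article back.
    var article = readability.getArticle()
    article.html = entities.decodeHTML5(article.html.replace(/\s+/g, ' '))
    article.title = entities.decodeHTML5(article.title)
    article.url = url
    callback(article)
    this.emit('end')
})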

kof avatar kof commented on July 16, 2024

Calling readability._getCandidateNode() is also not nice: you are accessing a private or protected method from the outside.

fb55 avatar fb55 commented on July 16, 2024

The stream interface apparently has a bug. A PR replacing it (e.g. using through2) would be great. Your code should be trivial to port, although entities could be replaced with the decodeEntities option of htmlparser2.

kof avatar kof commented on July 16, 2024

What exactly do you think the bug is?

fb55 avatar fb55 commented on July 16, 2024

The end method currently overwrites the original end method of the stream interface, which could be the root of this issue.
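
To illustrate the kind of failure this could cause, here is a minimal sketch using Node's built-in stream.Writable. It is not the readabilitySAX source: createCountingStream and its callOriginalEnd flag are hypothetical names invented for the example. When the replacement end never calls the original one, the stream is never marked as ended, so 'finish' never fires.

```javascript
const { Writable } = require('stream')

// Hypothetical helper (not part of readabilitySAX): a Writable that counts
// bytes and reports the total when end() is called.
function createCountingStream(callback, callOriginalEnd) {
    let total = 0
    const stream = new Writable({
        write(chunk, encoding, done) {
            total += chunk.length
            done()
        }
    })

    // Keep a reference to the end() inherited from Writable.
    const originalEnd = stream.end

    // Replace end() on the instance. pipe() calls this when the source ends.
    stream.end = function (...args) {
        callback(total)
        if (callOriginalEnd) {
            // Delegating lets the stream finish normally.
            return originalEnd.apply(this, args)
        }
        // Without delegation the stream never enters the "ended" state.
        return this
    }
    return stream
}

const broken = createCountingStream(() => {}, false)
broken.end()
console.log(broken.writableEnded) // false

const fixed = createCountingStream(() => {}, true)
fixed.end()
console.log(fixed.writableEnded) // true
```

Whether this is what actually happens in the reported case is an open question; the sketch only shows why an end override that skips the original is fragile.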

kof avatar kof commented on July 16, 2024

How is the end method related to the fact that streaming stops at some point in the middle?

kof avatar kof commented on July 16, 2024

I found out that this code works without issues:

var request = require('superagent')
var readabilitySax = require('readabilitySAX')

var url = 'http://techcrunch.com/2010/11/18/mark-zuckerberg/'

var options = {pageURL: url, type: 'html'}

request
    .get(url)
    .on('error', console.log)
    .pipe(readabilitySax.createWritableStream(options, console.log))

kof avatar kof commented on July 16, 2024

weird ...

fb55 avatar fb55 commented on July 16, 2024

^^

The stream implementation is buggy nevertheless & should probably be updated. I'm closing this issue though.

kof avatar kof commented on July 16, 2024

I have switched to full buffering. Streams, the way I was using them, were way slower and more fragile.

kof avatar kof commented on July 16, 2024

Also, encoding detection can only be done safely when the full text is available.
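
A rough sketch of what full buffering can look like, assuming nothing beyond Node's built-in Buffer and a source that emits 'data', 'end' and 'error' events. bufferStream is a hypothetical helper name, not an API of superagent or readabilitySAX. The raw chunks are concatenated first and decoded only once, so a multi-byte character split across two chunks is not corrupted.

```javascript
// Hypothetical helper: collect every chunk of a readable source and call
// back once with the complete body as a single Buffer.
function bufferStream(readable, callback) {
    const chunks = []
    let done = false

    // Make sure the callback runs at most once.
    function finish(err, body) {
        if (done) return
        done = true
        callback(err, body)
    }

    readable.on('data', function (chunk) {
        // Normalize string chunks to Buffers so they can be concatenated.
        chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk))
    })
    readable.on('error', function (err) {
        finish(err)
    })
    readable.on('end', function () {
        // The full body is available here, so encoding detection and
        // decoding can safely look at the whole document.
        finish(null, Buffer.concat(chunks))
    })
}
```

Decoding and article extraction would then run on the complete buffer inside the callback; that part is left out because it depends on the parser in use.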
