Comments (7)

nzv8fan commented on July 17, 2024

I've just opened pull request #47 with a change that improves this issue, at least in my testing.

from snacktory.

karussell commented on July 17, 2024

I'm no longer actively developing this library. Most work I do is integrating pull requests (hint hint ;))

rubdottocom commented on July 17, 2024

Yeah, I know :P I'm only asking for a point to start, but well... I'll try to find where I can evolve the library.

incubator commented on July 17, 2024

A quick and dirty solution is to use JSoup to alter the DOM before extractContent runs; overriding fetchAsString is a convenient place to do that. Here's a quick class that restructures articles from Slate.com. A more generalized solution would be ideal, but this works well enough for our current needs.

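import java.io.IOException;
import java.net.MalformedURLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import de.jetwick.snacktory.HtmlFetcher;

public class SlateOverrideFetcher extends HtmlFetcher {

    @Override
    public String fetchAsString(String urlAsString, int timeout)
            throws MalformedURLException, IOException {
        // Fetch the raw HTML, then flatten it before extractContent sees it.
        String result = super.fetchAsString(urlAsString, timeout, true);
        return removeDiv(result);
    }

    /**
     * Remove extraneous div wrappers inside the .content element.
     */
    protected String removeDiv(String html) {
        // Jsoup.parse(String, String) takes a base URI as its second
        // argument, not a charset, so parse the string directly.
        Document doc = Jsoup.parse(html);
        Element content = doc.select(".content").first();
        if (content == null) {
            // Not the expected page layout: return the HTML unchanged.
            return html;
        }

        // Collect the inner HTML of every wrapper div, in document order.
        Elements divs = doc.select(".text, .section, .parbase");
        StringBuilder builder = new StringBuilder();
        for (Element div : divs) {
            builder.append(div.html());
        }

        // Replace the content element's children with the flattened markup.
        content.html(builder.toString());
        return doc.html();
    }

}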

haochun commented on July 17, 2024

I also encountered the same problem. Example: http://www.2cto.com/kf/201310/249427.html. It only extracts part of the text.

haochun commented on July 17, 2024

@rubdottocom Hello, have you resolved this bug yet?

karussell commented on July 17, 2024

Merged. If someone wants to be added as a contributor - let me know via email!
