karussell / snacktory
Readability clone in Java
Not an issue as such, just a few questions.
Why in ArticleTextExtractor.getNodes() do you build a Map, generate a hashCode for each node as the key, and then only return the map values? Wouldn't a Set do the same job?
Very occasionally I'm getting a stack overflow in 1.3-SNAPSHOT, so clearly it is content-specific. Sadly I haven't been able to capture an offending site yet:
java.lang.StackOverflowError
at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:299)
at java.util.HashMap.putVal(HashMap.java:663)
at java.util.HashMap.put(HashMap.java:611)
at org.jsoup.nodes.Attributes.put(Attributes.java:74)
at org.jsoup.nodes.Attributes.put(Attributes.java:51)
at org.jsoup.nodes.TextNode.ensureAttributes(TextNode.java:138)
at org.jsoup.nodes.TextNode.attr(TextNode.java:144)
at de.jetwick.snacktory.OutputFormatter.unlikely(OutputFormatter.java:118)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:130)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
........
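Regarding the Map-vs-Set question above: the de-duplication can be sketched in isolation. This is a hypothetical, self-contained example (class and method names are invented; it is not the actual getNodes() code) showing that a LinkedHashSet yields the same insertion-ordered, duplicate-free result as a LinkedHashMap keyed by hashCode:

```java
import java.util.*;

public class DedupDemo {
    // Keyed map: the hashCode is only used to de-duplicate entries.
    static List<String> viaMap(List<String> items) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (String s : items)
            map.put(s.hashCode(), s);
        return new ArrayList<>(map.values());
    }

    // Same result with a LinkedHashSet: insertion order preserved,
    // duplicates dropped, no intermediate map needed.
    static List<String> viaSet(List<String> items) {
        return new ArrayList<>(new LinkedHashSet<>(items));
    }

    public static void main(String[] args) {
        List<String> in = List.of("a", "b", "a", "c");
        System.out.println(viaMap(in)); // [a, b, c]
        System.out.println(viaSet(in)); // [a, b, c]
    }
}
```

One caveat: keying by hashCode can silently drop distinct nodes whose hash codes collide, whereas a Set compares elements with equals(), so the Set version is arguably safer as well as simpler.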
Hi,
I have tried using snacktory and it works well on webpages that do not contain images. I tried it on one of the newspapers and found that whenever there is an image, snacktory removes the text block close to the image.
Try this URL: http://articles.timesofindia.indiatimes.com/2013-09-17/rest-of-world/42147651_1_tropical-depression-mexico-city-heavy-rains
This is now fixed! But needs a unit test!
From email:
The issue is in Converter.streamToString(). There's a loop that reads HTTP data chunks. Each chunk is converted to a String separately, but a chunk may contain only the first (or second) half of a multi-byte character, resulting in corrupted data. It happens sporadically, depending on timing.
Also, the counting of bytesRead was wrong, so for slow connections there may be a "size exceeded" message with no justification.
What I did to test this problem was to read a Japanese article (URL below) with the browser and save its content somewhere (e.g. to a file). Then run the streamToString() function in a loop (with some delay) and each time compare its output with the expected output in the file. Sometimes I saw dozens of successful tests and then several failures, so the problem is not very persistent, but the errors occurred often enough.
The article I tested on is http://astand.asahi.com/magazine/wrscience/2012022900015.html, and the corruption was almost always visible in the string "300" (see the article), where some junk was displayed instead of the "3".
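For illustration, here is a minimal self-contained sketch (not snacktory's actual Converter code; class and method names are invented) of why per-chunk decoding corrupts multi-byte characters, and the accumulate-then-decode alternative. An InputStreamReader, which keeps decoder state across reads, would also work:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class StreamDecode {
    // BUG pattern: decoding each chunk separately can split a multi-byte
    // UTF-8 character across two chunks, producing replacement characters.
    static String naive(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[2]; // tiny buffer to force splits
        int n;
        while ((n = in.read(buf)) != -1)
            sb.append(new String(buf, 0, n, StandardCharsets.UTF_8)); // corrupts split chars
        return sb.toString();
    }

    // Safe pattern: accumulate all bytes first, then decode exactly once.
    static String safe(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[2];
        int n;
        while ((n = in.read(buf)) != -1)
            bos.write(buf, 0, n);
        return new String(bos.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Japanese text: each kanji is 3 bytes in UTF-8, so 2-byte reads split them.
        byte[] data = "日本語300".getBytes(StandardCharsets.UTF_8);
        System.out.println(safe(new ByteArrayInputStream(data)).equals("日本語300"));  // true
        System.out.println(naive(new ByteArrayInputStream(data)).equals("日本語300")); // false
    }
}
```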
A great feature could be to detect the published date of the web page.
This information is often located somewhere at the top or the bottom of the main text.
When the relevant article content is inside an XML island it isn't returned. See for example this WSJ Japan article http://jp.wsj.com/Finance-Markets/Foreign-Currency-Markets/node_400108 with the following fragment (shortened for clarity):
<p>
<?xml version="1.0" encoding="utf-8"?>
<section xmlns:image="http://ez.no/namespaces/ezpublish3/image/" ...>
<paragraph>(this is the relevant content) イスラエル銀行(**銀行)は景気下支えを目的に過去5カ月間に ...</paragraph>
</section>
</p>
Hi! I'm trying to make it work in an Android project, but when I initialize the fetcher:
HtmlFetcher fetcher = new HtmlFetcher();
a java.lang.NoClassDefFoundError: Failed resolution of: Ljava/beans/Introspector; is thrown.
I read that java.beans is not fully implemented on Android, and I found the Open Beans project http://code.google.com/p/openbeans/ but I don't know how to make it work, or whether there is a simpler way to fix that exception.
Thank you.
What about providing optional extraction directives?
In the majority of cases the extraction algorithm works great. But for some websites it can fail to extract the relevant content. For these websites it could be possible to "help" snacktory focus on a specific part of the page content by providing it a Jsoup selector. For instance, we could have something like:
ArticleTextExtractor extractor = new ArticleTextExtractor();
extractor.setTextSelector("div.article_content");
extractor.setTitleSelector("h2", "first");
String dateRegEx = "xxxx";
extractor.setDateSelector("#published", dateRegEx);
JResult res = extractor.extractContent(rawData);
text = res.getText();
title = res.getTitle();
date = res.getDate();
IIUC, this is not supported at the moment.
Any chance of having it as a parameter in the future?
Is snacktory usable with the latest jsoup version, 1.6.2? If so, it would be great to bump the dependency.
Thanks,
David
hello
sorry, bad action. please delete this issue !
sorry, sorry ...
Hi Peter,
I notice that I can only extract part of the content of many websites. For example, from this site: http://sheldonbrown.com/brandt/patching.html I only get part of the article, starting from "Assuming that a patch was properly".
Do you know if the reason is that the library needs more development to complete the TODO "only top text supported at the moment"?
If so, could you give me some guidelines on how I can work to improve that?
Thank you so much.
We need to check whether we have really already read something ...
Error message for HtmlFetcherIntegrationTest.testHashbang:
"Converter - Couldn't reset stream to re-read with new encoding UTF-8 java.io.IOException: Resetting to invalid mark"
Your software already appears to properly advertise itself in the User-Agent; please don't cause Referer spam by using a fake Referer pointing to this repository.
Here is the example:
https://www.nytimes.com/2017/10/09/business/general-motors-driverless.html
The text is not fully parsed from the beginning. It starts only from:
The efforts have been moving forward in earnest since early last year, when G.M. bought Cruise Automation, a software company based in San Francisco.
...
Hi @karussell, thanks for building and sharing Snacktory!
You said you were looking for someone to take over maintenance and future development?
We've been working hard on our own fork, which adds several features on top of the original Snacktory. We forked it because we needed to change the basic API to fit our requirements, including optimizing the library for Android, decoupling it from optional dependencies such as HttpUrlConnection and log4j, and adding several new features, such as rich-text (HTML) output, preserved links, and more metadata extraction.
Announcing Crux: https://github.com/chimbori/crux
If you are interested, let us know how we can work together for maintenance and future development!
I'm using snacktory with IntelliJ 15 and Gradle. The following was working yesterday, but stopped working today:
repositories {
    maven {
        url "https://github.com/karussell/mvnrepo/raw/master/releases/"
    }
}
dependencies {
    compile('de.jetwick:snacktory:1.2')
}
Getting errors from HtmlFetcher fetcher = new HtmlFetcher();
java.lang.NoClassDefFoundError: Could not initialize class de.jetwick.snacktory.HtmlFetcher
Things I tried in IntelliJ:
Interestingly, if I build a jar with dependencies and run it with java -jar ..., then it does seem to work.
Any ideas what might have gone wrong?
Did you manage to add the dependency with sbt? I get different exceptions when referring to different versions.
Articles from the following properties don't currently work:
m.slashdot.org
arstechnica.com
java.net.ProtocolException: Unexpected status line: �����������������������������������HTTP/1.1 200 OK
Great work btw. I'll keep hunting for more.
Hi, is it possible to preserve/restore paragraphs with the Snacktory engine? Extracted articles are not really readable when joined into one big chunk of text.
It seems that when I do a snacktory pass on Amazon pages (e.g. http://www.amazon.com/Vandaveer-Software-Brick-Buster-Pro/dp/B006T4IJTK),
it extracts hidden data, which is obviously not that relevant from a readability perspective. I was a bit confused when looking at the output, as a simple find via a browser didn't show the same text. When I looked at the source I realized the text was hidden via an inline style.
I realize in some cases this might be beyond the scope of readability (if it's not really approaching it from a full DOM perspective), but it would seem that in some cases (such as here) it should be more obvious that these nodes should be excluded.
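The standard CSS properties that hide content are "display: none" and "visibility: hidden". A minimal, hypothetical check (invented helper, not snacktory's actual code) for either of them in an inline style attribute could look like this:

```java
public class HiddenCheck {
    // Hypothetical helper: treat an element as hidden when its inline
    // style disables rendering. Whitespace is stripped so that variants
    // like "display : none" still match.
    static boolean isHiddenByStyle(String style) {
        if (style == null)
            return false;
        String s = style.replaceAll("\\s+", "").toLowerCase();
        return s.contains("display:none") || s.contains("visibility:hidden");
    }

    public static void main(String[] args) {
        System.out.println(isHiddenByStyle("display: none"));        // true
        System.out.println(isHiddenByStyle("visibility : hidden;")); // true
        System.out.println(isHiddenByStyle("color: red"));           // false
    }
}
```

This only covers inline styles; text hidden via external stylesheets or scripts would need real CSS resolution, which is indeed beyond a DOM-only pass.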
Hello, I'm trying to extract the main text from articles, but I get a string without any newline characters. Is there a way to extract the text while retaining the newlines? Otherwise there is only one single paragraph per article...
Or is there a switch to retain certain HTML tags during extraction, like <a> and <br>?
By the way, thanks for your great work!
Running de.jetwick.snacktory.ArticleTextExtractorTest
2012-09-23 08:22:25,963 [main] WARN de.jetwick.snacktory.Converter - Maxbyte of 500000 exceeded! Maybe html is now broken but try it nevertheless. Url: null
2012-09-23 08:22:26,006 [main] WARN de.jetwick.snacktory.Converter - Maxbyte of 500000 exceeded! Maybe html is now broken but try it nevertheless. Url: null
Tests run: 72, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.646 sec <<< FAILURE!
Failed tests:
testYomiuri(de.jetwick.snacktory.ArticleTextExtractorTest): yomiuri:????????????????????????????????????????????????????????????????????????????????? ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
Tests run: 95, Failures: 1, Errors: 0, Skipped: 0
The issue is on line 111:
assertTrue("yomiuri:" + res.getText(), res.getText().startsWith(" 海津市海津町の国営木曽三川公園で、チューリップが見頃を迎えている。20日までは「チューリップ祭」が開かれており、大勢の人たちが多彩な色や形を鑑賞している=写真="));
The test passes if you remove the leading space here >> startsWith(" 海津
protected String detectCharset(String key, ByteArrayOutputStream bos, BufferedInputStream in, String enc) throws IOException {
byte[] arr = new byte[2048];
How to reproduce:
do a fetchAndExtract of this URL: 'http://www.gazzetta.it/Sport-Invernali/Sci-Alpino/Coppa-Mondo-Sci/26-02-2017/sci-combinata-brignone-ho-sciato-senza-paura-uscire-180995893986.shtml'
Images without width and height attributes are ignored by the determineImageSource method.
I think images without these attributes can be considered as images with width > 50 and height > 50.
Furthermore, width = 50 and height = 50 fall through both branches of the test:
if (height > 50)
    weight += 20;
else if (height < 50)
    weight -= 20;
In my opinion, we should use:
int weight = 0;
int height = 0;
int width = 0;
try {
    height = Integer.parseInt(e.attr("height"));
} catch (Exception ex) {}
if (height == 0 || height >= 50)
    weight += 20;
else if (height < 50)
    weight -= 20;
try {
    width = Integer.parseInt(e.attr("width"));
} catch (Exception ex) {}
if (width == 0 || width >= 50)
    weight += 20;
else if (width < 50)
    weight -= 20;
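The proposal above can be condensed into a runnable sketch (method names are hypothetical; the catch-all parsing mirrors the snippet, so a missing or unparsable attribute defaults to 0 and is treated as "big enough"):

```java
public class ImageWeightSketch {
    // Parse a dimension attribute; missing/invalid values become 0.
    static int parseDim(String v) {
        try {
            return Integer.parseInt(v.trim());
        } catch (Exception ex) {
            return 0;
        }
    }

    // Proposed weighting: absent dimensions (0) or dimensions >= 50
    // score positively; small dimensions score negatively.
    static int imageWeight(String heightAttr, String widthAttr) {
        int weight = 0;
        int height = parseDim(heightAttr);
        int width = parseDim(widthAttr);
        weight += (height == 0 || height >= 50) ? 20 : -20;
        weight += (width == 0 || width >= 50) ? 20 : -20;
        return weight;
    }

    public static void main(String[] args) {
        System.out.println(imageWeight(null, null)); // 40: missing attributes treated as big enough
        System.out.println(imageWeight("30", "100")); // 0: small height cancels wide width
        System.out.println(imageWeight("50", "50"));  // 40: boundary values now count positively
    }
}
```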
Snacktory is not able to extract content from some websites, like quora.com and possibly some others.
quora.com returns 403 for the HEAD request method at this line in the HtmlFetcher class.
Hello,
I am getting an exception when loading URLs whose pages are larger than the fixed 500000 maxBytes limit specified in the Converter class.
Please add a way to modify this value.
Hi!
I'm trying to fetch content from the URLs inside of a Tweet.
When I try it from the official Twitter Android app, Twitter only shares with me a text like "read this tweet from @user at http://twitter.com/status/8341234812634".
So I fetch this URL in the hope of getting the real tweet text with the real URL that I want to fetch.
However, when I do that I receive from Twitter a warning that I must accept the use of cookies: "To bring you Twitter, we and our partners use cookies on our and other websites. Cookies help personalize Twitter content, tailor Twitter Ads, measure their performance and provide you with a better, faster, safer Twitter experience. By using our services, you agree to our Cookie Use. Close".
I tried setting some "user-agent" and "cookie" configuration on the HttpURLConnection before fetching Twitter, without success.
Do you know how I can achieve that?
Here is my current code (somewhat dirty; I plan to send you a fix once it works):
public String fetchAsString(String urlAsString, int timeout, boolean includeSomeGooseOptions)
        throws MalformedURLException, IOException {
    HttpURLConnection hConn = createUrlConnection(urlAsString, timeout, includeSomeGooseOptions);
    hConn.setInstanceFollowRedirects(true);
    // Start "hack"
    hConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
    Log.d("EXTRACT", hConn.getRequestProperty("User-Agent"));
    CookieManager cookieManager = new CookieManager();
    CookieHandler.setDefault(cookieManager);
    HttpCookie cookie = new HttpCookie("lang", "en");
    cookie.setDomain("twitter.com");
    cookie.setPath("/");
    cookie.setVersion(0);
    try {
        cookieManager.getCookieStore().add(new URI("http://twitter.com/"), cookie);
    } catch (URISyntaxException e) {
        e.printStackTrace();
    }
    // End "hack"
    String encoding = hConn.getContentEncoding();
    InputStream is;
    if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
        is = new GZIPInputStream(hConn.getInputStream());
    } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
        is = new InflaterInputStream(hConn.getInputStream(), new Inflater(true));
    } else {
        is = hConn.getInputStream();
    }
    String enc = Converter.extractEncoding(hConn.getContentType());
    String res = createConverter(urlAsString).streamToString(is, enc);
    if (logger.isDebugEnabled())
        logger.debug(res.length() + " FetchAsString:" + urlAsString);
    return res;
}
I am trying to implement it using the code from the Readme, but it just doesn't work. There are no errors, but it doesn't work either.
If I try to Log.d the value returned from JResult, that debug log is also missing from the output. I just don't know what the issue is here.
Here is the example:
https://www.cnbc.com/2017/10/09/amazons-comedies-win-with-critics-while-hulu-is-a-hit-with-audiences.html
https://www.cnbc.com/2017/10/10/opec-calls-on-us-shale-oil-producers-to-accept-shared-responsibility.html
The text is not fully parsed; only the first part of the article is extracted.
I wanted to change "pepole" to "people" at the beginning of this file, but I could not create a pull request.
Cheers!
Hi,
If an HTML page contains something like
aaaa <strong>bbbb </strong>cccc
the result of the replaceTagsWithText method is
aaaa bbbbcccc
The space after bbbb is lost.
But if an HTML page contains something like
aaaa <strong>bbbb</strong> cccc
there is no problem.
This line:
TextNode tn = new TextNode(item.text(), topNode.baseUri());
removes the space.
Regards
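The root cause is that the element's text is returned in normalized (trimmed) form, so the boundary whitespace inside the tag is gone by the time the TextNode is rebuilt. A minimal, jsoup-free sketch of the behavior and one possible fix (both method names are hypothetical):

```java
public class TagTextSketch {
    // Naive replacement: trimming the inner text drops the boundary space,
    // so "aaaa <strong>bbbb </strong>cccc" becomes "aaaa bbbbcccc".
    static String replaceNaive(String before, String inner, String after) {
        return before + inner.trim() + after;
    }

    // Possible fix: re-attach a leading/trailing space if the original
    // inner text had one, preserving the word boundary.
    static String replacePreserving(String before, String inner, String after) {
        String text = inner.trim();
        if (inner.startsWith(" "))
            text = " " + text;
        if (inner.endsWith(" "))
            text = text + " ";
        return before + text + after;
    }

    public static void main(String[] args) {
        System.out.println(replaceNaive("aaaa ", "bbbb ", "cccc"));      // aaaa bbbbcccc
        System.out.println(replacePreserving("aaaa ", "bbbb ", "cccc")); // aaaa bbbb cccc
    }
}
```

In jsoup terms, the non-normalized text is also available (e.g. a TextNode's whole text rather than its trimmed text), which may be a cleaner way to preserve the space.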
This happens whenever you fetch a YouTube link like:
https://www.youtube.com/watch?v=1a6KjDmHbR4
Instead of using the "og:image" from the head, it sets imageUrl to the first image in the body, so for the provided example URL it gets "https://s.ytimg.com/yts/img/pixel-vfl3z5WfW.gif" instead of the og:image, which is the correct one: "https://i.ytimg.com/vi/1a6KjDmHbR4/maxresdefault.jpg"
Is there a workaround for this?