karussell / snacktory
Readability clone in Java
Not an issue as such, just a few questions.
Why in ArticleTextExtractor.getNodes() do you build a Map, generate a hashCode for each node as the key, and then only return the map values? Wouldn't a Set do the same job?
Very occasionally I'm getting a stack overflow in 1.3-SNAPSHOT, so clearly it is content-specific. Sadly I haven't been able to capture an offending site yet:
java.lang.StackOverflowError
at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:299)
at java.util.HashMap.putVal(HashMap.java:663)
at java.util.HashMap.put(HashMap.java:611)
at org.jsoup.nodes.Attributes.put(Attributes.java:74)
at org.jsoup.nodes.Attributes.put(Attributes.java:51)
at org.jsoup.nodes.TextNode.ensureAttributes(TextNode.java:138)
at org.jsoup.nodes.TextNode.attr(TextNode.java:144)
at de.jetwick.snacktory.OutputFormatter.unlikely(OutputFormatter.java:118)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:130)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
........
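Regarding the Map-vs-Set question above: the de-duplication can be sketched in isolation. This is a hypothetical, self-contained example (class and method names are invented; it is not the actual getNodes() code) showing that a LinkedHashSet yields the same insertion-ordered, duplicate-free result as a LinkedHashMap keyed by hashCode:

```java
import java.util.*;

public class DedupDemo {
    // Keyed map: the hashCode is only used to de-duplicate entries.
    static List<String> viaMap(List<String> items) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (String s : items)
            map.put(s.hashCode(), s);
        return new ArrayList<>(map.values());
    }

    // Same result with a LinkedHashSet: insertion order preserved,
    // duplicates dropped, no intermediate map needed.
    static List<String> viaSet(List<String> items) {
        return new ArrayList<>(new LinkedHashSet<>(items));
    }

    public static void main(String[] args) {
        List<String> in = List.of("a", "b", "a", "c");
        System.out.println(viaMap(in)); // [a, b, c]
        System.out.println(viaSet(in)); // [a, b, c]
    }
}
```

One caveat: keying by hashCode can silently drop distinct nodes whose hash codes collide, whereas a Set compares elements with equals(), so the Set version is arguably safer as well as simpler.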
Hi,
I have tried using snacktory and it works well on webpages that do not contain images. I tried it on one of the newspapers and found that whenever there is an image, snacktory removes the text block close to the image.
Try this URL: http://articles.timesofindia.indiatimes.com/2013-09-17/rest-of-world/42147651_1_tropical-depression-mexico-city-heavy-rains
This is now fixed! But needs a unit test!
From email:
The issue is in Converter.streamToString(). There's a loop that reads HTTP data chunks. Each chunk is converted to a String separately, but a chunk may contain only the first (or second) half of a multi-byte character, resulting in corrupted data. It happens sporadically, depending on timing.
Also, the counting of bytesRead was wrong, so for slow connections there may be a "size exceeded" message with no justification.
What I did to test this problem was to read a Japanese article (URL below) with the browser and save its content somewhere (e.g. to a file). Then run the streamToString() function in a loop (with some delay) and each time compare its output with the expected output in the file. Sometimes I saw dozens of successful tests and then several failures, so the problem is not very persistent, but the errors occurred often enough.
The article I tested on is http://astand.asahi.com/magazine/wrscience/2012022900015.html, and the corruption was almost always visible in the string "300" (see the article), where some junk was displayed instead of the "3".
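For illustration, here is a minimal self-contained sketch (not snacktory's actual Converter code; class and method names are invented) of why per-chunk decoding corrupts multi-byte characters, and the accumulate-then-decode alternative. An InputStreamReader, which keeps decoder state across reads, would also work:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class StreamDecode {
    // BUG pattern: decoding each chunk separately can split a multi-byte
    // UTF-8 character across two chunks, producing replacement characters.
    static String naive(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[2]; // tiny buffer to force splits
        int n;
        while ((n = in.read(buf)) != -1)
            sb.append(new String(buf, 0, n, StandardCharsets.UTF_8)); // corrupts split chars
        return sb.toString();
    }

    // Safe pattern: accumulate all bytes first, then decode exactly once.
    static String safe(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[2];
        int n;
        while ((n = in.read(buf)) != -1)
            bos.write(buf, 0, n);
        return new String(bos.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Japanese text: each kanji is 3 bytes in UTF-8, so 2-byte reads split them.
        byte[] data = "日本語300".getBytes(StandardCharsets.UTF_8);
        System.out.println(safe(new ByteArrayInputStream(data)).equals("日本語300"));  // true
        System.out.println(naive(new ByteArrayInputStream(data)).equals("日本語300")); // false
    }
}
```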
A great feature could be to detect the published date of the web page.
This information is often located somewhere at the top or the bottom of the main text.
When the relevant article content is inside an XML island it isn't returned. See for example this WSJ Japan article http://jp.wsj.com/Finance-Markets/Foreign-Currency-Markets/node_400108 with the following fragment (shortened for clarity):
<p>
<?xml version="1.0" encoding="utf-8"?>
<section xmlns:image="http://ez.no/namespaces/ezpublish3/image/" ...>
<paragraph>(this is the relevant content) イスラエル銀行(**銀行)は景気下支えを目的に過去5カ月間に ...</paragraph>
</section>
</p>
Hi! I'm trying to make it work in an Android project, but when I initialize the fetcher:
HtmlFetcher fetcher = new HtmlFetcher();
a java.lang.NoClassDefFoundError: Failed resolution of: Ljava/beans/Introspector; is thrown.
I read that java.beans is not fully implemented on Android, and I found the Open Beans project http://code.google.com/p/openbeans/ but I don't know how to make it work, or whether there is a simpler way to fix that exception.
Thank you.
What about providing optional extraction directives?
In the majority of cases the extraction algorithm works great. But for some websites it can fail to extract the relevant content. For these websites it could be possible to "help" snacktory focus on a specific part of the page content by providing it a Jsoup selector. For instance, we could have something like:
ArticleTextExtractor extractor = new ArticleTextExtractor();
extractor.setTextSelector("div.article_content");
extractor.setTitleSelector("h2", "first");
String dateRegEx = "xxxx";
extractor.setDateSelector("#published", dateRegEx);
JResult res = extractor.extractContent(rawData);
text = res.getText();
title = res.getTitle();
date = res.getDate();
IIUC, this is not supported at the moment.
Any chance of having it as a parameter in the future?
Is snacktory usable with the latest jsoup version, 1.6.2? If so, it would be great to bump the dependency.
Thanks,
David
hello
sorry, bad action. please delete this issue !
sorry, sorry ...
Hi Peter,
I notice that I can only extract part of the content of many websites. For example, from this site: http://sheldonbrown.com/brandt/patching.html I only get part of the article, starting from "Assuming that a patch was properly".
Do you know if the reason is that the library needs more development to complete the TODO "only top text supported at the moment"?
If so, could you give me some guidelines on how I can work to improve that?
Thank you so much.
We need to check whether we have really already read something ...
Error message for HtmlFetcherIntegrationTest.testHashbang:
"Converter - Couldn't reset stream to re-read with new encoding UTF-8 java.io.IOException: Resetting to invalid mark"
Your software already appears to properly advertise itself in the User-Agent; please don't cause Referer spam by using a fake Referer pointing to this repository.
Here is the example:
https://www.nytimes.com/2017/10/09/business/general-motors-driverless.html
The text is not fully parsed from the beginning. It starts only from:
The efforts have been moving forward in earnest since early last year, when G.M. bought Cruise Automation, a software company based in San Francisco.
...
Hi @karussell, thanks for building and sharing Snacktory!
You said you were looking for someone to take over maintenance and future development?
We've been working hard on our own fork, which adds several features on top of the original Snacktory. We forked it because we needed to change the basic API to fit our requirements, including optimizing the library for Android, decoupling it from optional dependencies such as HttpUrlConnection and log4j, and adding several new features, such as rich-text (HTML) output, preserved links, and more metadata extraction.
Announcing Crux: https://github.com/chimbori/crux
If you are interested, let us know how we can work together for maintenance and future development!
I'm using snacktory with IntelliJ 15 and Gradle. The following was working yesterday, but stopped working today:
repositories {
    maven {
        url "https://github.com/karussell/mvnrepo/raw/master/releases/"
    }
}
dependencies {
    compile('de.jetwick:snacktory:1.2')
}
Getting errors from HtmlFetcher fetcher = new HtmlFetcher();
java.lang.NoClassDefFoundError: Could not initialize class de.jetwick.snacktory.HtmlFetcher
Things I tried in IntelliJ:
Interestingly, if I build a jar with dependencies and run it with java -jar ..., then it does seem to work.
Any ideas what might have gone wrong?
Did you manage to add the dependency with sbt? I get different exceptions when referring to different versions.
Articles from the following properties don't currently work:
m.slashdot.org
arstechnica.com
java.net.ProtocolException: Unexpected status line: �����������������������������������HTTP/1.1 200 OK
Great work btw. I'll keep hunting for more.
Hi, is it possible to preserve/restore paragraphs with the Snacktory engine? Extracted articles are not really readable when joined into one big chunk of text.
It seems that when I do a snacktory pass on Amazon pages (e.g. http://www.amazon.com/Vandaveer-Software-Brick-Buster-Pro/dp/B006T4IJTK),
it extracts hidden data, which is obviously not that relevant from a readability perspective. I was a bit confused when looking at the output, as a simple find via a browser didn't show the same text. When I looked at the source I realized the text was hidden via an inline style.
I realize in some cases this might be beyond the scope of readability (if it's not really approaching it from a full DOM perspective), but it would seem that in some cases (such as here) it should be more obvious that these nodes should be excluded.
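The standard CSS properties that hide content are "display: none" and "visibility: hidden". A minimal, hypothetical check (invented helper, not snacktory's actual code) for either of them in an inline style attribute could look like this:

```java
public class HiddenCheck {
    // Hypothetical helper: treat an element as hidden when its inline
    // style disables rendering. Whitespace is stripped so that variants
    // like "display : none" still match.
    static boolean isHiddenByStyle(String style) {
        if (style == null)
            return false;
        String s = style.replaceAll("\\s+", "").toLowerCase();
        return s.contains("display:none") || s.contains("visibility:hidden");
    }

    public static void main(String[] args) {
        System.out.println(isHiddenByStyle("display: none"));        // true
        System.out.println(isHiddenByStyle("visibility : hidden;")); // true
        System.out.println(isHiddenByStyle("color: red"));           // false
    }
}
```

This only covers inline styles; text hidden via external stylesheets or scripts would need real CSS resolution, which is indeed beyond a DOM-only pass.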
Hello, I'm trying to extract the main text from articles, but I get a string without any newline characters. Is there a way to extract the text while retaining the newlines? Otherwise there is only one single paragraph per article...
Or is there a switch to retain certain HTML tags during extraction, like <a> and <br>?
By the way, thanks for your great work!
Running de.jetwick.snacktory.ArticleTextExtractorTest
2012-09-23 08:22:25,963 [main] WARN de.jetwick.snacktory.Converter - Maxbyte of 500000 exceeded! Maybe html is now broken but try it nevertheless. Url: null
2012-09-23 08:22:26,006 [main] WARN de.jetwick.snacktory.Converter - Maxbyte of 500000 exceeded! Maybe html is now broken but try it nevertheless. Url: null
Tests run: 72, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.646 sec <<< FAILURE!
Failed tests:
testYomiuri(de.jetwick.snacktory.ArticleTextExtractorTest): yomiuri:????????????????????????????????????????????????????????????????????????????????? ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
Tests run: 95, Failures: 1, Errors: 0, Skipped: 0
The issue is on line 111:
assertTrue("yomiuri:" + res.getText(), res.getText().startsWith(" 海津市海津町の国営木曽三川公園で、チューリップが見頃を迎えている。20日までは「チューリップ祭」が開かれており、大勢の人たちが多彩な色や形を鑑賞している=写真="));
The test passes if you remove the leading space here >> startsWith(" 海津
protected String detectCharset(String key, ByteArrayOutputStream bos, BufferedInputStream in, String enc) throws IOException {
byte[] arr = new byte[2048];
How to reproduce:
do a fetchAndExtract of this URL: 'http://www.gazzetta.it/Sport-Invernali/Sci-Alpino/Coppa-Mondo-Sci/26-02-2017/sci-combinata-brignone-ho-sciato-senza-paura-uscire-180995893986.shtml'
Images without width and height attributes are ignored by the determineImageSource method.
I think images without these attributes can be considered as images with width > 50 and height > 50.
Furthermore, width = 50 and height = 50 fall through both branches of the test:
if (height > 50)
    weight += 20;
else if (height < 50)
    weight -= 20;
In my opinion, we should use:
int weight = 0;
int height = 0;
int width = 0;
try {
    height = Integer.parseInt(e.attr("height"));
} catch (Exception ex) {}
if (height == 0 || height >= 50)
    weight += 20;
else if (height < 50)
    weight -= 20;
try {
    width = Integer.parseInt(e.attr("width"));
} catch (Exception ex) {}
if (width == 0 || width >= 50)
    weight += 20;
else if (width < 50)
    weight -= 20;
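The proposal above can be condensed into a runnable sketch (method names are hypothetical; the catch-all parsing mirrors the snippet, so a missing or unparsable attribute defaults to 0 and is treated as "big enough"):

```java
public class ImageWeightSketch {
    // Parse a dimension attribute; missing/invalid values become 0.
    static int parseDim(String v) {
        try {
            return Integer.parseInt(v.trim());
        } catch (Exception ex) {
            return 0;
        }
    }

    // Proposed weighting: absent dimensions (0) or dimensions >= 50
    // score positively; small dimensions score negatively.
    static int imageWeight(String heightAttr, String widthAttr) {
        int weight = 0;
        int height = parseDim(heightAttr);
        int width = parseDim(widthAttr);
        weight += (height == 0 || height >= 50) ? 20 : -20;
        weight += (width == 0 || width >= 50) ? 20 : -20;
        return weight;
    }

    public static void main(String[] args) {
        System.out.println(imageWeight(null, null)); // 40: missing attributes treated as big enough
        System.out.println(imageWeight("30", "100")); // 0: small height cancels wide width
        System.out.println(imageWeight("50", "50"));  // 40: boundary values now count positively
    }
}
```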
Snacktory is not able to extract content from some websites, like quora.com and possibly some others.
quora.com returns 403 for the HEAD request method at this line in the HtmlFetcher class.
Hello,
I am getting an exception when loading URLs whose pages are larger than the fixed 500000 maxBytes limit specified in the Converter class.
Please add a way to modify this value.
Hi!
I'm trying to fetch content from the URLs inside of a Tweet.
When I try it from the official Twitter Android app, Twitter only shares with me a text like "read this tweet from @user at http://twitter.com/status/8341234812634".
So I fetch this URL in the hope of getting the real tweet text with the real URL that I want to fetch.
However, when I do that I receive from Twitter a warning that I must accept the use of cookies: "To bring you Twitter, we and our partners use cookies on our and other websites. Cookies help personalize Twitter content, tailor Twitter Ads, measure their performance and provide you with a better, faster, safer Twitter experience. By using our services, you agree to our Cookie Use. Close".
I tried setting some "user-agent" and "cookie" configuration on the HttpURLConnection before fetching Twitter, without success.
Do you know how I can achieve that?
Here is my current code (somewhat dirty; I plan to send you a fix once it works):
public String fetchAsString(String urlAsString, int timeout, boolean includeSomeGooseOptions)
        throws MalformedURLException, IOException {
    HttpURLConnection hConn = createUrlConnection(urlAsString, timeout, includeSomeGooseOptions);
    hConn.setInstanceFollowRedirects(true);
    // Start "hack"
    hConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
    Log.d("EXTRACT", hConn.getRequestProperty("User-Agent"));
    CookieManager cookieManager = new CookieManager();
    CookieHandler.setDefault(cookieManager);
    HttpCookie cookie = new HttpCookie("lang", "en");
    cookie.setDomain("twitter.com");
    cookie.setPath("/");
    cookie.setVersion(0);
    try {
        cookieManager.getCookieStore().add(new URI("http://twitter.com/"), cookie);
    } catch (URISyntaxException e) {
        e.printStackTrace();
    }
    // End "hack"
    String encoding = hConn.getContentEncoding();
    InputStream is;
    if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
        is = new GZIPInputStream(hConn.getInputStream());
    } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
        is = new InflaterInputStream(hConn.getInputStream(), new Inflater(true));
    } else {
        is = hConn.getInputStream();
    }
    String enc = Converter.extractEncoding(hConn.getContentType());
    String res = createConverter(urlAsString).streamToString(is, enc);
    if (logger.isDebugEnabled())
        logger.debug(res.length() + " FetchAsString:" + urlAsString);
    return res;
}
I am trying to implement it using the code from the Readme, but it just doesn't work. There are no errors, but it doesn't work either.
If I try to Log.d the value returned from JResult, that debug log is also missing from the output. I just don't know what the issue is here.
Here is the example:
https://www.cnbc.com/2017/10/09/amazons-comedies-win-with-critics-while-hulu-is-a-hit-with-audiences.html
https://www.cnbc.com/2017/10/10/opec-calls-on-us-shale-oil-producers-to-accept-shared-responsibility.html
The text is not fully parsed; only the first part of the article is extracted.
I wanted to change "pepole" to "people" at the beginning of this file, but I could not create a pull request.
Cheers!
Hi,
If an HTML page contains something like
aaaa <strong>bbbb </strong>cccc
the result of the replaceTagsWithText method is
aaaa bbbbcccc
The space after bbbb is lost.
But if an HTML page contains something like
aaaa <strong>bbbb</strong> cccc
there is no problem.
This line:
TextNode tn = new TextNode(item.text(), topNode.baseUri());
removes the space.
Regards
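The root cause is that the element's text is returned in normalized (trimmed) form, so the boundary whitespace inside the tag is gone by the time the TextNode is rebuilt. A minimal, jsoup-free sketch of the behavior and one possible fix (both method names are hypothetical):

```java
public class TagTextSketch {
    // Naive replacement: trimming the inner text drops the boundary space,
    // so "aaaa <strong>bbbb </strong>cccc" becomes "aaaa bbbbcccc".
    static String replaceNaive(String before, String inner, String after) {
        return before + inner.trim() + after;
    }

    // Possible fix: re-attach a leading/trailing space if the original
    // inner text had one, preserving the word boundary.
    static String replacePreserving(String before, String inner, String after) {
        String text = inner.trim();
        if (inner.startsWith(" "))
            text = " " + text;
        if (inner.endsWith(" "))
            text = text + " ";
        return before + text + after;
    }

    public static void main(String[] args) {
        System.out.println(replaceNaive("aaaa ", "bbbb ", "cccc"));      // aaaa bbbbcccc
        System.out.println(replacePreserving("aaaa ", "bbbb ", "cccc")); // aaaa bbbb cccc
    }
}
```

In jsoup terms, the non-normalized text is also available (e.g. a TextNode's whole text rather than its trimmed text), which may be a cleaner way to preserve the space.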
This happens whenever you fetch a YouTube link like:
https://www.youtube.com/watch?v=1a6KjDmHbR4
Instead of using the "og:image" from the head, it sets imageUrl to the first image in the body, so for the provided example URL it gets "https://s.ytimg.com/yts/img/pixel-vfl3z5WfW.gif" instead of the og:image, which is the correct one: "https://i.ytimg.com/vi/1a6KjDmHbR4/maxresdefault.jpg"
Is there a workaround for this?