Comments (7)
I've just opened pull request #47 with a change that improves this issue, at least in my testing.
from snacktory.
I'm no longer actively developing this library. Most work I do is integrating pull requests (hint hint ;))
Yeah, I know :P I was only asking for a starting point, but well... I'll try to find where I can improve the library.
A quick-and-dirty solution is to use jsoup to alter the DOM before extractContent runs; overriding fetchAsString is the place to do that. Here's a quick class that restructures articles from Slate.com. A more generalized solution would be ideal, but this works well enough for our current needs.
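    import java.io.IOException;
    import java.net.MalformedURLException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import de.jetwick.snacktory.HtmlFetcher;

    public class SlateOverrideFetcher extends HtmlFetcher {

        @Override
        public String fetchAsString(String urlAsString, int timeout)
                throws MalformedURLException, IOException {
            String result = super.fetchAsString(urlAsString, timeout, true);
            return removeDiv(result);
        }

        /**
         * Flatten the extraneous div wrappers inside the .content section.
         */
        protected String removeDiv(String html) {
            // Jsoup.parse(String, String) takes a base URI, not a charset,
            // so parse the already-decoded string directly.
            Document doc = Jsoup.parse(html);
            Element content = doc.select(".content").first();
            if (content == null) {
                // Not the expected Slate layout; leave the page untouched.
                return html;
            }
            // Collect the inner HTML of every wrapper and put it straight into .content.
            StringBuilder builder = new StringBuilder();
            for (Element div : doc.select(".text, .section, .parbase")) {
                builder.append(div.html());
            }
            content.html(builder.toString());
            return doc.html();
        }
    }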
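As for what a more generalized version might look like, here is a rough sketch rather than anything that exists in snacktory: a small registry that picks an HTML clean-up step by host name. `HostPreprocessors`, `register`, and `apply` are hypothetical names made up for illustration, and the clean-up step in `main` is only a stand-in for something like `removeDiv`.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical registry (not part of snacktory): maps a host name to an HTML clean-up step. */
final class HostPreprocessors {
    private final Map<String, UnaryOperator<String>> byHost = new HashMap<>();

    /** Register a clean-up step for one host, e.g. "slate.com". */
    void register(String host, UnaryOperator<String> step) {
        byHost.put(normalize(host), step);
    }

    /** Apply the step registered for the URL's host, or return the HTML unchanged. */
    String apply(String url, String html) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return html;
        }
        UnaryOperator<String> step = byHost.get(normalize(host));
        return step == null ? html : step.apply(html);
    }

    /** Lower-case the host and drop a leading "www." so both forms match. */
    private static String normalize(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }

    public static void main(String[] args) {
        HostPreprocessors registry = new HostPreprocessors();
        // Stand-in step; a real one would call something like removeDiv(html).
        registry.register("slate.com", html -> html.replace("<div class=\"parbase\">", "<div>"));
        System.out.println(registry.apply("http://www.slate.com/a.html", "<div class=\"parbase\">x</div>"));
        System.out.println(registry.apply("http://example.com/a.html", "<div class=\"parbase\">x</div>"));
    }
}
```

With something like this, a single fetcher subclass could pass the result of `super.fetchAsString(...)` through `apply(urlAsString, html)` instead of needing one subclass per site. Pages from hosts with no registered step come back unchanged.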
I also encountered the same problem. Example: http://www.2cto.com/kf/201310/249427.html (it only extracts part of the text).
@rubdottocom hello, have you resolved this bug yet?
Merged. If someone wants to be added as a contributor - let me know via email!