Comments (5)
It looks like the issue is the KeepLargestBlockFilter, which rejects every block
except the largest. Taking this filter out of the library should return
results closer to http://boilerpipe-web.appspot.com/, but it looks like that will
also let through some non-article content that
http://boilerpipe-web.appspot.com/ rejects.
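To make the effect of such a filter concrete, here is a minimal sketch of the "keep only the largest block" idea. It is not boilerpipe's actual implementation: the Block class, its numWords() helper, and the keepLargestOnly method are hypothetical stand-ins invented for illustration, and the assumption that "largest" means "most words" is mine.

```java
import java.util.ArrayList;
import java.util.List;

public class KeepLargestBlockSketch {

    /** Hypothetical stand-in for a text block; NOT boilerpipe's own block type. */
    static final class Block {
        final String text;
        boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }

        /** Word count by whitespace splitting (an assumption about how size is measured). */
        int numWords() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    /**
     * Marks every content block except the largest one as non-content.
     * On a tie, the first of the largest blocks is kept.
     * Returns true if any block's label changed.
     */
    static boolean keepLargestOnly(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            if (b.isContent && (largest == null || b.numWords() > largest.numWords())) {
                largest = b;
            }
        }
        boolean changed = false;
        for (Block b : blocks) {
            if (b.isContent && b != largest) {
                b.isContent = false;
                changed = true;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("Short intro paragraph.", true));
        blocks.add(new Block("The main body of the article with many more words in it.", true));
        blocks.add(new Block("Related links", false));

        keepLargestOnly(blocks);
        // Print each block with its final content flag.
        for (Block b : blocks) {
            System.out.println(b.isContent + "\t" + b.text);
        }
    }
}
```

In this sketch the short intro paragraph is dropped even though it started out labelled as content, which is the kind of difference I would expect to see between a pipeline with this filter and one without it.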
Is there anywhere I can see which filters are being used by the article
extractor behind http://boilerpipe-web.appspot.com/?
Original comment by [email protected]
on 30 Mar 2012 at 4:46
from boilerpipe.
I'm also unable to get the same results using the HTMLHighlighter in extraction
mode. The web API (http://boilerpipe-web.appspot.com) clearly states that:
"This Web Application probably uses a more recent version than the released
versions in the Boilerpipe Google Code page. You might thus get slightly
different (hopefully better) results."
In the web demo, are you using the HTMLHighlighter to extract HTML, or have you
added a different approach to boilerpipe-core? If changes have been made to
core, I'd be happy to contribute to the open source version. Hints would be
appreciated =)
Original comment by [email protected]
on 19 Oct 2012 at 5:49
I ran into a similar issue.
With the URL "http://www.hokkaido-np.co.jp/news/donai/424760.html",
ArticleExtractor, and "Plain Text" output, the library code did not produce
the same results as http://boilerpipe-web.appspot.com/
Original comment by [email protected]
on 6 Dec 2012 at 12:02
Building from source will fix this issue.
Original comment by [email protected]
on 24 Jan 2013 at 8:33
I built from SVN, but the Japanese URL below still gives bad results. The same
URL works great in the web app. Code snippet below:
URL url = new URL("http://d.hatena.ne.jp/mkusunok/20130817/p1");
String text = ArticleExtractor.INSTANCE.getText(url);
Original comment by [email protected]
on 7 Oct 2013 at 7:24