Comments (2)
Thanks for reporting.
This seems to be caused by a bug in NekoHTML 1.9.13.
The corresponding stack trace points at
"org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)".
The problem seems to go away after an update to NekoHTML 1.9.15.
Could you please confirm this?
Before upgrading boilerpipe to NekoHTML 1.9.15, I will have to perform some
extra checks, especially to ensure we don't get any regressions in terms of
extraction quality.
Best,
Christian
Original comment by ckkohl79
on 14 May 2012 at 4:44
- Changed state: Started
- Added labels: OpSys-All
from boilerpipe.
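A regression check of the kind mentioned above could be sketched roughly as follows. This is only an illustration and not boilerpipe's actual test procedure: it assumes the text extracted with the old and the new NekoHTML version has already been collected into two maps keyed by document name, and the class and method names (`ExtractionRegressionCheck`, `changedDocuments`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: compares extraction output from two parser versions.
public class ExtractionRegressionCheck {

    /**
     * Returns the names of documents whose extracted text differs between
     * the baseline run and the candidate run, ignoring whitespace-only changes.
     */
    public static List<String> changedDocuments(Map<String, String> baseline,
                                                Map<String, String> candidate) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> entry : baseline.entrySet()) {
            String before = normalize(entry.getValue());
            String after = normalize(candidate.get(entry.getKey()));
            if (!before.equals(after)) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }

    // Collapse runs of whitespace so cosmetic differences are not flagged;
    // a document missing from one run (null) is treated as empty output.
    private static String normalize(String text) {
        return text == null ? "" : text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for real extraction results.
        Map<String, String> baseline = new LinkedHashMap<>();
        baseline.put("page-1.html", "Main article text.");
        baseline.put("page-2.html", "Another   article.");

        Map<String, String> candidate = new LinkedHashMap<>();
        candidate.put("page-1.html", "Main article text.");
        candidate.put("page-2.html", "Another article. Footer links");

        System.out.println(changedDocuments(baseline, candidate)); // prints [page-2.html]
    }
}
```

Any document flagged this way would still need a manual look, since a difference in output can be an improvement as well as a regression.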
Thanks for the quick response.
As you've stated, the problem has gone away with NekoHTML 1.9.15.
Below is the list of changes in NekoHTML since version 1.9.13 (which was
released on 2 Sept 2009):
- Version 1.9.15 (3 Aug 2011)
Avoid using a synchronized structure (here java.util.Properties) to store built-in entities that are loaded at startup (#3001745), change INS to inline element, change BUTTON to inline element, don't parse body of IFRAME, add new feature http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe to allow empty IFRAME tags (default is false), make detected encoding available as Locator2.getEncoding() (#3381270).
- Version 1.9.14 (2 Feb 2010)
Don't parse body of NOFRAMES (fixes StackOverflowError reported in #2854697), TABLE can have multiple THEAD, TBODY and TFOOT (patch provided by Ahmed Ashour, #2893796), trim encoding found in meta tag (#2904817), fix ArrayIndexOutOfBoundsException on empty attribute when using feature normalize-attrs (#2838901), recognize tags even if the > of the opening tag is missing (#2886227), only end TABLE can close a table (#2913095), fix StackOverflowError when parsing document fragment (#2911449), fix NullPointerException occurring with the insert-namespaces feature (#2942363).
I'm not entirely sure, but I would guess that these changes do not affect
boilerpipe's extraction quality.
Looking forward to hearing about the result of your regression tests.
Regards,
Gural
Original comment by gural.vu...@gmail.com
on 14 May 2012 at 7:16