
StackOverflowError when page includes another <body> part in <noframes> (boilerpipe issue, status: open)

GoogleCodeExporter commented on July 17, 2024
StackOverflowError when page includes another &lt;body&gt; part in &lt;noframes&gt; <p>from boilerpipe.</p></section> </section> </article> <article> <h2 class="h2">Comments (2)</h2> <section class="issue-comment"> <section id="105056563" class="issue-head"> <img class="issue-avatar" src="https://avatars.githubusercontent.com/u/9614759?s=30&amp;u=cc9d321ce8a017c405d02d97701bd06d77b5b30c&amp;v=4" alt="GoogleCodeExporter avatar" /> <a class="issue-username" href="/GoogleCodeExporter">GoogleCodeExporter</a> <span class="issue-time"> commented on July 17, 2024 </span> </section> <section class="markdown markdown-js p-5"><div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="Thanks for reporting. This seems to be caused by a bug in NekoHTML 1.9.13. The corresponding stacktrace points at &quot;org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)&quot;. The problem seems to go away after an update to NekoHTML 1.9.15. Could you please confirm this? Before upgrading boilerpipe to NekoHTML 1.9.15, I will have to perform some extra checks, especially to ensure we don't get any regressions in terms of extraction quality. Best, Christian"><pre class="notranslate"><code class="notranslate">Thanks for reporting.

This seems to be caused by a bug in NekoHTML 1.9.13.
The corresponding stacktrace points at "org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)".

The problem seems to go away after an update to NekoHTML 1.9.15.
Could you please confirm this?

Before upgrading boilerpipe to NekoHTML 1.9.15, I will have to perform some extra checks, especially to ensure we don't get any regressions in terms of extraction quality.
Best, Christian </code></pre></div> <p dir="auto">Original comment by <code class="notranslate">ckkohl79</code> on 14 May 2012 at 4:44</p> <ul dir="auto"> <li>Changed state: <strong>Started</strong></li> <li>Added labels: <strong>OpSys-All</strong></li> </ul></section> </section> <section class="issue-comment"> <section id="105056564" class="issue-head"> <img class="issue-avatar" src="https://avatars.githubusercontent.com/u/9614759?s=30&amp;u=cc9d321ce8a017c405d02d97701bd06d77b5b30c&amp;v=4" alt="GoogleCodeExporter avatar" /> <a class="issue-username" href="/GoogleCodeExporter">GoogleCodeExporter</a> <span class="issue-time"> commented on July 17, 2024 </span> </section> <section class="markdown markdown-js p-5"><div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clipboard-copy-content="Thanks for the quick response. As you've stated, the problem has gone away with NekoHTML 1.9.15. Below is the list of changes in NekoHTML since ver. 1.9.13 (which was released on 2 Sept 2009): - Version 1.9.15 (3 Aug 2011) Avoid using a synchronized structure (here java.util.Properties) to store built-in entities that are loaded at startup (#3001745), change INS to inline element, change BUTTON to inline element, don't parse body of IFRAME, add new feature http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe to allow empty IFRAME tags (default is false), make detected encoding available as Locator2.getEncoding() (#3381270).
- Version 1.9.14 (2 Feb 2010) Don't parse body of NOFRAMES (fixes StackOverflowError reported in #2854697), TABLE can have multiple THEAD, TBODY and TFOOT (patch provided by Ahmed Ashour, #2893796), trim encoding found in meta tag (#2904817), fix ArrayIndexOutOfBoundException on empty attribute when using feature normalize-attrs (#2838901), recognize tags even if the &gt; of the opening tag is missing (#2886227), only end TABLE can close a table (#2913095), fix StackOverflowError when parsing document fragment (#2911449), fix NullPointerException occurring with the insert-namespaces feature (#2942363). I'm not entirely sure, but I guess these changes do not affect BoilerPipe's extraction quality. Looking forward to hearing about the result of your regression tests. Regards, Gural"><pre class="notranslate"><code class="notranslate">Thanks for the quick response.

As you've stated, the problem has gone away with NekoHTML 1.9.15.

Below is the list of changes in NekoHTML since ver. 1.9.13 (which was released on 2 Sept 2009):

- Version 1.9.15 (3 Aug 2011)
Avoid using a synchronized structure (here java.util.Properties) to store built-in entities that are loaded at startup (#3001745), change INS to inline element, change BUTTON to inline element, don't parse body of IFRAME, add new feature http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe to allow empty IFRAME tags (default is false), make detected encoding available as Locator2.getEncoding() (#3381270).

- Version 1.9.14 (2 Feb 2010)
Don't parse body of NOFRAMES (fixes StackOverflowError reported in #2854697), TABLE can have multiple THEAD, TBODY and TFOOT (patch provided by Ahmed Ashour, #2893796), trim encoding found in meta tag (#2904817), fix ArrayIndexOutOfBoundException on empty attribute when using feature normalize-attrs (#2838901), recognize tags even if the &gt; of the opening tag is missing (#2886227), only end TABLE can close a table (#2913095), fix StackOverflowError when parsing document fragment (#2911449), fix NullPointerException occurring with the insert-namespaces feature (#2942363).

I'm not entirely sure, but I guess these changes do not affect BoilerPipe's extraction quality.

Looking forward to hearing about the result of your regression tests.

Regards, Gural </code></pre></div> <p dir="auto">Original comment by <code class="notranslate">gural.vu...@gmail.com</code> on 14 May 2012 at 7:16</p></section> </section> </article> <section> <h2 class="h2">Related Issues (20)</h2> <div class="issue"> <ul> <li> <a href="/tilaklodha/boilerpipe/issues/65">BoilerplateBlockFilter ignores labelToKeep</a> </li> <li> <a href="/tilaklodha/boilerpipe/issues/66">[deleted issue]</a> </li> <li> <a href="/tilaklodha/boilerpipe/issues/67">Program does not terminate for badly formatted/syntactically incorrect HTML input</a> </li> <li> <a href="/tilaklodha/boilerpipe/issues/68">How to use boilerpipe to get some text with a hyperlink from the web page?</a> <span class="text-red-600 text-xs font-normal py-0.5 px-1 border border-red-600 rounded-md">HOT 1</span> </li> <li> <a href="/tilaklodha/boilerpipe/issues/69">Incomplete extraction of text with special characters </a> </li> <li> <a href="/tilaklodha/boilerpipe/issues/70">Server returned HTTP response code: 403 for URL (SOLVED) please use this codeline.</a> <span class="text-red-600 text-xs font-normal py-0.5 px-1 border border-red-600 rounded-md">HOT 2</span> </li> <li> <a 
href="/tilaklodha/boilerpipe/issues/71">Limit the parsing depth of the html parsing to avoid out of memory situations</a> <span class="text-red-600 text-xs font-normal py-0.5 px-1 border border-red-600 rounded-md">HOT 1</span> </li> <li> <a href="/tilaklodha/boilerpipe/issues/72">Extract article from non-english text</a> <span class="text-red-600 text-xs font-normal py-0.5 px-1 border border-red-600 rounded-md">HOT 1</span> </li> <li> <a href="/tilaklodha/boilerpipe/issues/73">Missing Maven 1.2.0</a> </li> <li> <a href="/tilaklodha/boilerpipe/issues/74">Xerces for andorid jar file needed</a> <span class="text-red-600 text-xs font-normal py-0.5 px-1 border border-red-600 rounded-md">HOT 2</span> </li> <li> <a href="/tilaklodha/boilerpipe/issues/75">its not working for a news site</a> <span class="text-red-600 text-xs font-normal py-0.5 px-1 border border-red-600 rounded-md">HOT 1</span> </li> <li> <a href="/tilaklodha/boilerpipe/issues/76">Incomplete extraction of article </a> </li> <li> <a href="/tilaklodha/boilerpipe/issues/77">Fail to extract main content on some page, get footnote instead </a> </li> <li> <a href="/tilaklodha/boilerpipe/issues/78">IllegalArgumentException for many web pages</a> </li> <li> <a href="/tilaklodha/boilerpipe/issues/79">Missing ImageExtractor in downloabale 1.2 jar file</a> </li> <li> <a href="/tilaklodha/boilerpipe/issues/80">Performance issues with UnicodeTokenizer</a> </li> <li> <a href="/tilaklodha/boilerpipe/issues/81">Boilerpipe is conflicting with CyberNeko library</a> <span class="text-red-600 text-xs font-normal py-0.5 px-1 border border-red-600 rounded-md">HOT 1</span> </li> <li> <a href="/tilaklodha/boilerpipe/issues/82"> Unsupported content type: null</a> <span class="text-red-600 text-xs font-normal py-0.5 px-1 border border-red-600 rounded-md">HOT 1</span> </li> <li> <a href="/tilaklodha/boilerpipe/issues/83">Different result when using Web Api and the source api?</a> </li> <li> <a 
href="/tilaklodha/boilerpipe/issues/84">How to debug the result?</a> </li> </ul> </div> </section> </main> </div> </div> <!-- footer --> <footer class="sizeing text-xs text-center p-5"> <div>Friends: <a class="hover:underline" target="_blank" href="https://www.chanpinqingbaoju.com">ProductDiscover</a> </div> Copyright © 2024 Git Product <!-- & <span class="block md:inline">Data Power by github.com</span> --> ❤️ <a class="hover:underline block md:inline" href="mailto:cs.victor.edison@gmail.com">Mail to me</a> </footer> </body> </html>