tilaklodha / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

boilerpipe's Issues

Precursory header tags missing

What steps will reproduce the problem?
1. Use the HTMLHighlighter to extract the relevant html-code from a page:
   final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
   final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
   System.out.println(hh.process(url, extractor));
2. Try to parse this page: http://www.golem.de/1102/81290.html

What is the expected output? What do you see instead?
This should be the output:
<H2>
Daniel Domscheit-Berg
</H2>
<H1>
Wikileaks-Aussteiger haben Unterlagen mitgenommen
</H1>
...

But actually I get this:
Daniel Domscheit-Berg
</H2>
Wikileaks-Aussteiger haben Unterlagen mitgenommen
</H1>
...

What version of the product are you using? On what operating system?
- boilerpipe 1.1.0 binary
- OS: Suse

Is it possible to generate exactly the output that the Web API produces? Other
tags, such as <TABLE> and <TD>, also seem to be missing.

Original issue reported on code.google.com by [email protected] on 10 Feb 2011 at 9:07

Boilerpipe fails (but the Web API edition does not)

What steps will reproduce the problem?
1. curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T

What is the expected output? What do you see instead?
at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
    at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
    at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
    at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
    at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
    at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
    at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
    at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
    at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
    at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)

What version of the product are you using? On what operating system?
1.1.0, Mac Os

Please provide any additional information below.
https://issues.apache.org/jira/browse/TIKA-676

Original issue reported on code.google.com by gabriele%[email protected] on 18 Jun 2011 at 12:52

Add 'getInstance' accessor for ImageExtractor

Hi, it would be convenient to add an accessor:

public static ImageExtractor getInstance() {
    return INSTANCE;
}


This helps when I'm running BP with JRuby.

Further: is there a way to contribute? If this were on GitHub I would already
have made a pull request :)

PS: kudos for a great build experience! I had no problem whatsoever running my
own build.

Original issue reported on code.google.com by [email protected] on 15 Jan 2012 at 2:10

Tags missing in output html

The HTMLHighlighter process seems to omit opening header tags (<H1>, <H2>,
etc.) but includes the closing tags (</H2>), so titles don't stand out in the
output document.


What version of the product are you using? On what operating system?

boilerpipe 1.1.0 ubuntu 10.10
Please provide any additional information below.

Original issue reported on code.google.com by *[email protected] on 6 Jul 2011 at 2:11

Ability to keep inline HTML in extracted content

It would be very useful to have an option to keep inline HTML, such as links,
formatting or images, inside the block of HTML that boilerpipe selects.

Original issue reported on code.google.com by tom%[email protected] on 24 Jan 2010 at 4:36

Bad xml format in html output from Web API

• What steps will reproduce the problem?
Get an html or htmlFragment from any page

• What is the expected output? What do you see instead?
The output has an XML declaration, but instead of a valid HTML/XML structure
there are extra tags that break the XML:

<?xml version="1.0" encoding="utf-8" ?>
<meta …/>
<base … />
<html>
  <body>
    ...
  </body>
</html>

Inside <html>, the <style> element comes directly after the opening tag instead of inside a <head>.

The correct output would be:

<?xml version="1.0" encoding="utf-8" ?>
<html>
  <head>
    <meta …/>
    <base … />
    <style>...</style>
  </head>
  <body>
    ...
  </body>
</html>

• What version of the product are you using? On what operating system?

The Web API http://boilerpipe-web.appspot.com/extract

And thanks for this great *GREAT* tool!!!

--
François

Original issue reported on code.google.com by [email protected] on 3 Dec 2011 at 4:13

Encoding problem? – Strange garbage introduced

import java.net.URL;
import de.l3s.boilerpipe.extractors.DefaultExtractor;

public class Oneliner
{
  public static void main(final String[] args) throws Exception
  {
    final URL url = new URL("http://a2zmacau.com/1284/ao-man-long-tells-macau-court-he-did-receive-bribes/");

    // This can also be done in one line:
    System.out.println(DefaultExtractor.INSTANCE.getText(url));
  }
}

gives

The former secretary for transport and public works, Ao Man Long who took a 
cool US$100 
million in bribes and now is serving a 27-year jail sentence for serious 
corruption charges, 
admitted yesterday to having received money from companies including Seng Meng 
Fai.
Ao was a witness in his family’s trial and rejected claims that his relatives 
and wife had 
knowledge of what the former secretary was doing. Ao also told the court his 
family did what he 
asked without ever questioning him or the activities involving grand sums of 
money and offshore 
accounts.
The court repeatedly heard how the former secretary’s family trusted Ao and 
his decisions.
However, Ao confessed to receiving large sums of money, but said it had not 
been in the way 
described in the indictment against him.
The payments were made in increments for services provided to those companies, 
however they 
did not affect the outcome or the process of the public tenders and winning 
bidders, the court 
heard.
The court also heard that Ho Meng Fai had made payments to bank accounts under 
Ao’s family 
members’ names, but were managed by the former secretary. The money was not 
related to 
bribery nor was it related to corruption, Ao told the court.
The money was â€œsimplyâ€� for services Ecoline, one of Ao’s shell 
companies, had carried 
out, the former secretary said, adding that for the Macau Dome, Ho Meng Fai had 
sought 
services from Ecoline to contact a projects concession company from the 
mainland.
The court heard that this was an example of the types of services Ecoline 
carried out.
Ao also said that this time, unlike previously, he was telling the truth. But 
he was unable to 
itemise all the works where such services and payments were made, saying that 
the prosecution 
would have to ask the deceased Lee Se Chong, who had all the companies’ 
contacts.
The court also heard that Ao had only had access to Ecoline in 2006 after 
theÂ  manager Lee Se 
Chong died.
Related Websites
Leave a reply
Search For Macau Hotels

Please notice "after theÂ  manager". The HTML of this part is very simple,

 <p>... Ecoline in 2006 after the  manager Lee Se Chong died.</p>

but contains two consecutive spaces.

Hope this helps to improve your tool, which looks quite good.

Kaspar

Original issue reported on code.google.com by [email protected] on 7 Jan 2010 at 6:50
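
Editorial note: the stray "Â" before the space is what one would expect if a UTF-8 encoded non-breaking space (U+00A0, bytes C2 A0) were decoded as Latin-1. Whether that is what happens inside boilerpipe for this page is an assumption; the sketch below only reproduces the symptom with the JDK. The class name `MojibakeDemo` is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Encodes text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A non-breaking space followed by a regular space, as assumed for the page above.
        String original = "after the\u00A0 manager";
        // The single NBSP becomes two characters: U+00C2 ("Â") followed by U+00A0.
        System.out.println(misdecode(original));
    }
}
```

If this is the cause, decoding the fetched bytes with the charset the page actually declares should make the extra character disappear.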

Add 1.2.0 release to maven repository

It would be nice to have the 1.2.0 release in the Maven repository
http://boilerpipe.googlecode.com/svn/repo/ and even more helpful if it were also
available in Maven Central (1.1.0 is available in both).

Thanks!

Original issue reported on code.google.com by [email protected] on 7 Jul 2011 at 1:15

boilerpipe crash

What steps will reproduce the problem?
1. Try to extract this URL:
http://sourceforge.net/projects/xampp/files/XAMPP%20Windows/1.7.4/xampp-win32-1.7.4-VC6-installer.exe/download
I used ArticleExtractor.
It prints the following warning a few times:
Warning: SAX input contains nested A elements -- You have probably hit a bug in 
your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML 
externally and feed it to boilerpipe again. Trying to recover somehow...
and then crashes with an OutOfMemoryError.

I'm using version 1.2.0. I have tested on Windows and on Ubuntu as well.

Original issue reported on code.google.com by [email protected] on 29 Jul 2011 at 1:27

Code for Google app-engine?

It would be useful to have the code for deploying boilerpipe on Google app 
engine.
Could you distribute the code you use for http://boilerpipe-web.appspot.com/ ?

Original issue reported on code.google.com by [email protected] on 21 Mar 2011 at 9:24

Unconventional operator used for boolean logic

Just a small suggestion to help others with reading the code.
I opened ArticleExtractor and saw the following:
  return TerminatingBlocksFinder.INSTANCE.process(doc)
      | new DocumentTitleMatchClassifier(doc.getTitle()).process(doc)
      | NumWordsRulesClassifier.INSTANCE.process(doc)
      | IgnoreBlocksAfterContentFilter.DEFAULT_INSTANCE.process(doc)
      | BlockProximityFusion.MAX_DISTANCE_1.process(doc)
      | BoilerplateBlockFilter.INSTANCE.process(doc)
      | BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
      | KeepLargestFulltextBlockFilter.INSTANCE.process(doc)
      | ExpandTitleToContentFilter.INSTANCE.process(doc);

This was very confusing to me. The | operator in Java is usually reserved for
bitwise operations, and it appears that a boolean OR is intended here, for
which || is typically used. I was surprised this even compiles, though it turns
out to be valid: on boolean operands, | yields the same value as || but always
evaluates both operands. It would really help readability to replace the | with
|| throughout, since that is the standard Java convention.

Original issue reported on code.google.com by [email protected] on 21 Nov 2010 at 7:37
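
Editorial note: on boolean operands, `|` and `||` differ in one respect that matters for a chain like the one quoted: `||` stops at the first operand that is true, while `|` always evaluates every operand. Presumably each `process(doc)` call in the chain has to run regardless of what the earlier ones returned; that reading of the author's intent is an assumption. The sketch below uses stand-in steps (the names `step`, `first` and `second` are hypothetical) to show which calls actually run.

```java
import java.util.ArrayList;
import java.util.List;

final class OrOperatorDemo {
    private static final List<String> calls = new ArrayList<>();

    /** Stand-in for a filter's process(doc): records the call, returns whether it "changed" anything. */
    private static boolean step(String name, boolean changed) {
        calls.add(name);
        return changed;
    }

    /** Non-short-circuit OR: both steps run even though the first returns true. */
    static List<String> runWithSingleBar() {
        calls.clear();
        boolean changed = step("first", true) | step("second", false);
        return new ArrayList<>(calls);
    }

    /** Short-circuit OR: the second step is skipped because the first returned true. */
    static List<String> runWithDoubleBar() {
        calls.clear();
        boolean changed = step("first", true) || step("second", false);
        return new ArrayList<>(calls);
    }

    public static void main(String[] args) {
        System.out.println("|  ran: " + runWithSingleBar());
        System.out.println("|| ran: " + runWithDoubleBar());
    }
}
```

So a mechanical replacement of `|` with `||` would change behaviour whenever an early step returns true. Collecting the results with `changed |= ...` on separate lines is one way to keep the semantics while making the intent more obvious.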

Links on boilerpipe homepage are broken

The extractor links on the homepage (http://boilerpipe-web.appspot.com/) are 
broken.  I think they should be changed from .html to .java

The project looks really cool!  I'm looking forward to checking it out.  Thanks 
for making it public.

Original issue reported on code.google.com by [email protected] on 21 Nov 2010 at 5:56

INSTALL.txt in src directory

Consider adding a one-line INSTALL.txt file to the root of the src directory:

"See http://code.google.com/p/boilerpipe/wiki/QuickStart for installation
instructions."

Original issue reported on code.google.com by [email protected] on 26 Mar 2010 at 6:11

Title empty when parsing with TagSoup

What steps will reproduce the problem?

I am using TagSoup for parsing HTML documents:

URL url = new URL("http://www.bbc.co.uk/news/uk-12038847");
Parser parser = new Parser();  // org.ccil.cowan.tagsoup.Parser
BoilerpipeHTMLContentHandler handler = new BoilerpipeHTMLContentHandler();
parser.setContentHandler(handler);
InputSource is = HTMLFetcher.fetch(url).toInputSource();
parser.parse(is);
// Read the title only after parsing has finished.
System.out.println("T: " + handler.toTextDocument().getTitle());


What is the expected output? What do you see instead?

With the example document from the BBC you should get
"BBC News - Snow disrupts travel across northern Europe"
Instead it is null.


What version of the product are you using? On what operating system?

Trunk

Please provide any additional information below.

The problem can be fixed if I change the
BoilerpipeHTMLContentHandler.characters method
and move the flushBlock() invocation from the beginning of the method to its end
(see attached patch). Since I have no idea why this helps, I am not sure whether
it breaks other things.

Original issue reported on code.google.com by [email protected] on 20 Dec 2010 at 8:02

Attachments:

title.diff

DocumentTitleMatchClassifier should include the « and • characters

I have run across a few news articles that use these characters.

The following articles use the « character (\u00AB):
http://philadelphia.cbslocal.com/2012/02/06/report-1-in-5-children-exposed-to-secondhand-smoke-in-cars/
http://blog.mediaglobal.org/?p=448

I haven't seen too many of them but it looks like the first part is always the 
title.  It might be safe to assume that parts[0] is the title after performing 
the split.

The following article uses the • character (\u2022):
http://ictsd.org/i/news/biores/128000/

Original issue reported on code.google.com by [email protected] on 22 Mar 2012 at 6:05
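
Editorial note: the suggestion above can be sketched as a split on a separator pattern that includes « (U+00AB) and • (U+2022), taking parts[0] as the title candidate. The pattern, the class name `TitleSplitDemo` and the sample titles in the usage are illustrative assumptions; this is not the code of DocumentTitleMatchClassifier.

```java
import java.util.regex.Pattern;

final class TitleSplitDemo {
    // Hypothetical separator set: |, « (U+00AB), » (U+00BB), • (U+2022), or a spaced hyphen.
    private static final Pattern SEPARATORS =
        Pattern.compile("\\s*[|\\u00AB\\u00BB\\u2022]\\s*|\\s+-\\s+");

    /** Returns the part of the page title before the first separator. */
    static String firstPart(String title) {
        String[] parts = SEPARATORS.split(title.trim());
        // split() drops trailing empty strings, so a title made only of separators yields no parts.
        return parts.length == 0 ? "" : parts[0];
    }

    public static void main(String[] args) {
        System.out.println(firstPart("Some headline \u00AB Some Site"));
        System.out.println(firstPart("Another headline \u2022 Another Site"));
    }
}
```

Taking parts[0] is only safe for sites that put the article title first, as the report observes for these examples.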

Support HTML5 elements

Now that HTML5 becomes more pervasive on the web, it might be worth considering 
additional parsing support in places, one example being the recently added 
image extractor. HTML5 includes <figure> and <figcaption> for adding semantics 
to images, especially the figcaption element is of interest since the text 
could be used to determine image relevancy in relation to the extracted 
document text.

Original issue reported on code.google.com by [email protected] on 18 Oct 2011 at 9:03

Page not being parsed correctly; <li> appears to be the issue

What steps will reproduce the problem?
Running ArticleExtractor on http://www.seomoz.org/ugc/link-building-management

What is the expected output? What do you see instead?
Expect to see the full article, instead it starts from the last <li> within the 
content of the article, causing a large portion of the article to be stripped.

What version of the product are you using? On what operating system?
Using the appspot version

Please provide any additional information below.
This is not an issue with the default extractor, however the default extractor 
includes comments.

Original issue reported on code.google.com by [email protected] on 7 Jun 2011 at 1:50

hybrid extractor?

Christian,

We have a corpus that is a mixture of news articles and other web pages, some 
of which contain tables.  The ArticleExtractor has trouble with many of these 
other pages.  Is there a hybrid extractor that detects when it would be better 
to run KeepEverythingExtractor and when better to run ArticleExtractor?

Perhaps we should just use KeepEverything for now...?

Thanks!
jrf

Original issue reported on code.google.com by [email protected] on 27 Apr 2012 at 3:08

Patch for /trunk/boilerpipe-core/src/main/de/l3s/boilerpipe/filters/heuristics/DocumentTitleMatchClassifier.java

Break after tagging a TextBlock as a candidate title; there is no need to
continue checking the rest of the potential titles for the current TextBlock.

Original issue reported on code.google.com by [email protected] on 20 Mar 2012 at 8:08

Merged into: #41

Attachments:

DocumentTitleMatchClassifier.java.patch

Encoding problem (input is interpreted as Latin-1)

What steps will reproduce the problem?
1. Apply boilerpipe-1.1.0 (ArticleExtractor) to a file without an explicit
'charset=' meta declaration, e.g.
http://www.slobodnadalmacija.hr/Zadar/tabid/73/articleType/ArticleView/articleId/140666/Default.aspx

What is the expected output? What do you see instead?
Expected: When no further information is available from the input, non-ASCII
characters are read and written as UTF-8, the most general and most widely
used character set.
Instead: Non-ASCII characters are misinterpreted as Latin-1 when read and then
written as UTF-8.

What version of the product are you using? On what operating system?
boilerpipe 1.1.0 on Ubuntu Linux 10.04 (locale: en_US.utf8)

Please provide any additional information below.
The problem seems to be corrected in the version of the web interface (cf. URL 
above). So it should be an easy thing to handle.

Original issue reported on code.google.com by [email protected] on 14 Jun 2011 at 4:14
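
Editorial note: the behaviour the reporter expects (use the declared charset if there is one, otherwise fall back to UTF-8) can be sketched with the JDK alone. The class `CharsetFallbackDemo`, the 4096-byte sniffing window and the regular expression are assumptions made for illustration; they do not describe how boilerpipe's HTMLFetcher works.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetFallbackDemo {
    private static final Pattern META_CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the markup, or UTF-8 if none is declared or it is unknown. */
    static Charset detect(byte[] html) {
        // ISO-8859-1 maps every byte to exactly one char, so an ASCII "charset=" declaration survives.
        int len = Math.min(html.length, 4096);
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: fall through to the default.
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the raw bytes with the detected (or default) charset. */
    static String decode(byte[] html) {
        return new String(html, detect(html));
    }

    public static void main(String[] args) {
        byte[] page = "<html><body>\u010Cakavski</body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(page) + ": " + decode(page));
    }
}
```

A fuller version would also look at the HTTP Content-Type header and a byte order mark before falling back to the default.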

Errors deploying to Android

I'm trying to get Boilerpipe set up on Android. I'm using Eclipse Indigo and 
can build my project.

As a test I am simply trying this:
<code>
String response="";
  try {
    response = ArticleExtractor.INSTANCE.getText(new URL("http://www.guardian.co.uk/technology/2012/apr/17/walled-gardens-facebook-apple-censors"));
  } catch (Exception e) {
    e.printStackTrace();
  }  
</code>

When I deploy it as an Android application, I get a whole bunch of errors,
all looking a little like this:

Dx warning: Ignoring InnerClasses attribute for an anonymous inner class
(org.apache.html.dom.SecuritySupport$1) that doesn't come with an
associated EnclosingMethod attribute. This class was probably produced by a
compiler that did not target the modern .class file format. The recommended
solution is to recompile the class from source, using an up-to-date compiler
and without specifying any "-target" type options. The consequence of ignoring
this warning is that reflective operations on this class will incorrectly
indicate that it is *not* an inner class.

When I take out the ArticleExtractor line I don't get any errors and can
deploy. I wasn't sure whether the problem is with Xerces, but I can deploy an
Android app with the exact same XercesImpl jar file, not using Boilerpipe, and
the app runs fine. So the build seems to take issue with Xerces in one case and
not in the other (if that makes sense).

Original issue reported on code.google.com by [email protected] on 18 Apr 2012 at 9:07

Ignore FORM tags in HTMLHighlighter

When using HTMLHighlighter, boilerpipe sometimes keeps artifacts coming from
FORM and LABEL tags.

This can easily be prevented by adding a new ignorable element to the
TAG_ACTIONS map in HTMLHighlighter.java:

TAG_ACTIONS.put("FORM", TA_IGNORABLE_ELEMENT);

Original issue reported on code.google.com by [email protected] on 24 Mar 2012 at 6:40

Article Image

Hello,

I have come across your API and it seems really impressive.

Is there a way to parse the src URL of the main image in an Article?
If not yet, do you plan to include that in your API as well?

Kind Regards,
Manos

Original issue reported on code.google.com by [email protected] on 7 May 2012 at 3:59

Outputs html instead of plain text for certain urls

To reproduce the problem
1. Apply ArticleExtractor to 
http://fahadbangladesh.blogspot.com/feeds/posts/default?orderby=updated
2. Same problem happens in DefaultExtractor and CanolaExtractor

What is the expected output? What do you see instead?
The expected output is pure text.  But I get html. I've attached the output of 
ArticleExtractor for the same url.

What version of the product are you using? On what operating system?
I'm using 1.2.0 version on lmde (based on Debian Testing Rolling distribution)

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 27 Aug 2011 at 4:02

Attachments:

sample_err_out

Feature request: Run boilerpipe as a command line tool

Is it possible to use Boilerpipe as a CLI app together with other bash 
commands, e.g. extract the text of an entire website with a command like: 

wget -p http://mysite.com | boilerplate -options > file#.html

Original issue reported on code.google.com by [email protected] on 22 Mar 2011 at 3:26

Missing Maven dependency

I'm new to Maven, so forgive me if I'm wrong, but I think boilerpipe needs to 
declare Neko as a dependency.  I had to add the following to my project, but I 
think it should be in the boilerpipe pom.xml instead:

  <dependencies>
    <dependency>
        <groupId>net.sourceforge.nekohtml</groupId>
        <artifactId>nekohtml</artifactId>
        <version>1.9.14</version>
    </dependency>
  </dependencies>

Original issue reported on code.google.com by [email protected] on 21 Nov 2010 at 9:41

UTF characters are not handled correctly

The following test case fails:

ArticleExtractor extractor = ArticleExtractor.INSTANCE;
TextDocument textDoc = new BoilerpipeSAXInput(
    HTMLFetcher.fetch(new URL("http://de.wikipedia.org/wiki/Barack_Obama")).toInputSource()
).getTextDocument();
assertEquals("Barack Obama – Wikipedia", textDoc.getTitle());

The attached patch fixes the issue.

Original issue reported on code.google.com by [email protected] on 26 Jul 2011 at 7:13

Attachments:

utf8.patch

Better support for non-English pages

I'm looking for a solution to parse non-English pages, for which Boilerpipe
gives varying results. Here are a couple of examples where boilerpipe misses
the main portion of the text (tested with
http://boilerpipe-web.appspot.com/ on 2011-01-06):

* http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik - picks up some teasers instead
* http://www.sydsvenskan.se/malmo/article1346121/I-natt-bargas-det---forhoppningsvis.html - picks up the comment section
* http://www.dn.se/sthlm/tva-raddade-ur-malarvak - all sorts of content from around the article
* http://www.expressen.se/nyheter/1.2280178/smhi-utfardar-klass-2-varning - picks up the comment section

I also see minor artifacts from non-content sections throughout the extracted 
text:
* http://hd.se/skane/2011/01/06/mangder-med-sno-over-skane/ - "Skriv ut" is a link to print the article. "Bildmaterial" is a header from the sidebar
* http://www.dn.se/sthlm/misstankt-brott-bakom-ung-mans-dod - "Dela med andra" is a header from the sidebar with sharing links
* http://www.expressen.se/noje/1.2280351/lotta-engberg-lamnar-bingolotto - Misses main header and teaser

I know it's hard to get all the above URLs right without site-specific code,
but I also know it's possible. I've run all of the URLs above through
readability.js, and it parses all of them without any artifacts. Maybe it's
readability's reliance on class names (which are generally in English even on
foreign-language sites) that makes it cope better. The problem is that
readability.js is a mess to run server-side and has not undergone the rigorous
testing boilerpipe has, so I would much rather see boilerpipe succeed than
switch to readability.js.

Thanks for your hard work.

Original issue reported on code.google.com by EmilStenstrom on 6 Jan 2011 at 2:43

ImageExtractor doesn't detect alternative images for Object plugins

The new ImageExtractor does not detect <img/> tags placed as alternative
content inside <object/> tags (commonly used by Flash video players).

It's quite a common practice to embed a video player like:

<object type="application/x-shockwave-flash">
        <param name="movie" value='my.swf'/>
        <param name="quality" value="high"/>
        <param name="allowScriptAccess" value="always"/>
        <param name="allowFullScreen" value="true"/>
        <param name="wmode" value="opaque"/>
        <img src='1328528982826.jpg' alt='yes an alt' title='and a title'/>
        <p>some alternative content</p>
    </object>


What is the expected output? What do you see instead?
These images should be detected as well.

To detect these images, it might be enough to comment out the following line
in ImageExtractor.java:
//TAG_ACTIONS.put("OBJECT", TA_IGNORABLE_ELEMENT);

Original issue reported on code.google.com by [email protected] on 6 Feb 2012 at 3:28

StackOverflowError when page includes another <body> part in <noframes>

What steps will reproduce the problem?
- ArticleExtractor cannot process a web page that has two <body> parts (like the
attached page); it fails with a java.lang.StackOverflowError.

What is the expected output? What do you see instead?
- "noframes" part is for browsers that do not support frames, so boilerpipe 
should not take this part into consideration.

What version of the product are you using? On what operating system?
- boilerpipe 1.2.0 on Linux/Windows

Original issue reported on code.google.com by [email protected] on 14 May 2012 at 2:56

Attachments:

clueweb09-en0000-20-02277.html

timeout and fallback strategy for boilerpipe

I don't see a news group or other forum for asking questions like this, so 
please forgive me making this an issue ticket.

Is there a best practice example for managing boilerpipe with a timeout and 
falling back to a series of less sophisticated extractors?  

For example, when boilerpipe's ArticleExtractor says:
Warning: SAX input contains nested A elements -- You have probably hit a bug in 
your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML 
externally and feed it to boilerpipe again. Trying to recover somehow...

and hits an infinite loop, I need to kill it and hammer the text in another way.

Should I just run it inside a thread and kill the thread after allotted time 
passes?  Or does boilerpipe have tools for doing this kind of thing for me?

What sequence of extractors would you recommend?

Thanks!

John

Original issue reported on code.google.com by [email protected] on 6 Feb 2012 at 5:17
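
Editorial note: one way to get a timeout with a fallback chain, using only java.util.concurrent, is sketched below. Extractors are modelled as plain `Function<String, String>` values so the example runs without boilerpipe on the classpath; the class `TimedFallbackDemo` and that modelling are assumptions, and nothing here is an existing boilerpipe facility.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

final class TimedFallbackDemo {
    /**
     * Tries each extractor in order and returns the first result that arrives within the
     * time limit. Returns null if every extractor fails or times out.
     */
    static String extract(String html, List<Function<String, String>> extractors, long timeoutMillis) {
        // Daemon threads let the JVM exit even if a stuck extractor ignores interruption.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "extractor");
            t.setDaemon(true);
            return t;
        });
        try {
            for (Function<String, String> extractor : extractors) {
                Callable<String> task = () -> extractor.apply(html);
                Future<String> result = pool.submit(task);
                try {
                    return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // Interrupts the worker; a tight loop that never checks the flag keeps running.
                    result.cancel(true);
                } catch (ExecutionException e) {
                    // The extractor threw an exception; move on to the next one.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
            return null;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In real use, each function would wrap one extractor call, ordered from most to least sophisticated. Note that Java cannot forcibly kill a thread stuck in an infinite loop, so running the extraction in a separate process is the more robust option when hangs are frequent.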

IDN <-> ACE Domain Names

Hi,

this module is incredibly good, but it cannot handle domain names with
(German) umlauts (Ä, Ö, Ü, ...). Any ideas how to deal with this problem?

Thanks,
Felix.

Original issue reported on code.google.com by [email protected] on 21 Jan 2010 at 2:32

Documentation - How to output an HTML extract fragment instead of text?

Fantastic tool. I've been wondering how to output an HTML extract fragment
instead of text, similar to what the appspot app does.

Original issue reported on code.google.com by [email protected] on 20 Nov 2011 at 3:47

Add clone method to TextBlock

Since the TextBlocks can be modified and merged - it would be useful to be able 
to clone them e.g. for testing a different Extractor without having to reparse 
the HTML.

Original issue reported on code.google.com by [email protected] on 13 Oct 2010 at 8:25

Exclude Script tags

Everything seems OK except that the extractor often includes a lot of
JavaScript code from pages on any site, including the one in the demo code. I
think this can be prevented by removing <script> tags in the SAX parsing stage.

Google Analytics tracker code is extracted as content on many web sites.

You could improve this using Readable's algorithm: http://readable-app.appspot.com/

Original issue reported on code.google.com by [email protected] on 23 Aug 2010 at 7:31
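
Editorial note: the idea of dropping <script> content at the SAX stage can be sketched with a handler that keeps a depth counter and discards character data while inside an ignorable element. This is a minimal sketch on well-formed input using the JDK's XML parser; the class `ScriptSkippingHandler` is made up for illustration and is not boilerpipe's handler, which works on tag-soup HTML through a different parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ScriptSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int ignoreDepth = 0; // greater than 0 while inside <script> or <style>

    private static boolean isIgnorable(String name) {
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isIgnorable(qName)) ignoreDepth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isIgnorable(qName)) ignoreDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep character data only when no ignorable element is open.
        if (ignoreDepth == 0) text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }

    /** Parses well-formed XHTML and returns its text without script or style content. */
    static String extractText(String xhtml) {
        try {
            ScriptSkippingHandler handler = new ScriptSkippingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

For example, `ScriptSkippingHandler.extractText("<html><body><p>Hello </p><script>var x = 1;</script><p>World</p></body></html>")` keeps the two paragraphs and drops the script body.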

Patch for /trunk/boilerpipe-core/src/main/de/l3s/boilerpipe/sax/DefaultTagActionMap.java

This is part 3 of a patch related to problems with title parsing.

http://code.google.com/p/boilerpipe/issues/detail?id=38

Original issue reported on code.google.com by [email protected] on 15 Mar 2012 at 2:40

Merged into: #41

Attachments:

DefaultTagActionMap.java.patch

Feature Request - api to return character offsets of non-boilerplate text

The Highlighter returns the non-boilerplate text. Is there a way to return the 
character offsets of the non-boilerplate text in the original HTML? That would 
be very useful for me.

Currently, the tool is quite useful as a pre-processor that you pass HTML into 
and get back clean plaintext, which you can then pass to an indexing pipeline. 

I need to take this a step further and be able to mark up an HTML page with
"interesting terms", i.e. terms that I find in my controlled vocabulary. So I
figured that I could use boilerpipe in this manner:

1) pass the HTML to the boilerpipe highlighter
2) find the non-boilerplate text in the HTML (i.e. character offsets, begin and
end of blocks).
3) pass each of these blocks into my application that finds matches in my 
controlled vocabulary and record character offsets.
4) return the original HTML page decorated with the annotations from my 
controlled vocabulary (using offsets found in 2 and 3 to compute the positions 
to decorate).

Currently the closest I can get to this is via the highlighter. But I don't see
a way to get the character positions from the highlighted text.

Any pointers, suggestions, or a new API to do this would be greatly appreciated.

I am using boilerpipe-1.1.0.

Thanks very much,
Sujit

Original issue reported on code.google.com by [email protected] on 19 Jun 2011 at 9:25

Ignore FORM tags in HTMLHighlighter

When using HTMLHighlighter, boilerpipe sometimes keeps artifacts coming from
FORM and LABEL tags.

This can easily be prevented by adding a new ignorable element to the
TAG_ACTIONS map in HTMLHighlighter.java:

TAG_ACTIONS.put("FORM", TA_IGNORABLE_ELEMENT);

Original issue reported on code.google.com by [email protected] on 24 Mar 2012 at 6:40

Merged into: #44

Patch for /trunk/boilerpipe-core/src/main/de/l3s/boilerpipe/sax/BoilerpipeHTMLContentHandler.java

This is part 1 of a 2 part fix for problems with title detection.

Currently setTitle() is sometimes called many times per file, so the class
ends up reporting no title even though there is one: it erases the value after
setting it.

The problem lies in the way the title is detected, using lastStartTag. If
characters() is called again before the next start tag, the title can be
overwritten.

Original issue reported on code.google.com by [email protected] on 15 Mar 2012 at 2:33

Merged into: #41

Attachments:

BoilerpipeHTMLContentHandler.java.patch

Can you fix or promote the bug fix of NekoHTML (#2909310) ?

What steps will reproduce the problem?
1. DefaultExtractor.getText(text);

What is the expected output? What do you see instead?
Caused by: de.l3s.boilerpipe.BoilerpipeProcessingException: 
org.xml.sax.SAXException: SAX input contains nested A elements -- You have 
probably hit a bug in NekoHTML (#2909310). Please clean the HTML externally and 
feed it to boilerpipe again
    at de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument(BoilerpipeSAXInput.java:54)
    at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:72)
    at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:125)



What version of the product are you using? On what operating system?
1.0.3  Ubuntu,

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 8 Sep 2010 at 8:50

Title detection: Treat non-breaking space as whitespace

Character.isWhitespace(char) does not consider the non-breaking space
character (U+00A0, decimal 160) to be whitespace. As a result, whitespace is
not trimmed correctly when a non-breaking space is involved. This can cause
DocumentTitleMatchClassifier to miss a title match, as well as other
whitespace-related problems.

The following article uses the character in the title and a few other places:
http://espn.go.com/dallas/nfl/story/_/id/7560381/do-anthony-spencer

Original issue reported on code.google.com by [email protected] on 20 Mar 2012 at 2:58
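
Editorial note: the JDK behaviour described above is easy to check, and a trim that also strips non-breaking spaces takes only a few lines. The class `NbspTrimDemo` and its helper `isBlank` are hypothetical names; the sketch is not the content of the attached patch.

```java
final class NbspTrimDemo {
    /** True for Java whitespace and for Unicode space separators such as U+00A0. */
    static boolean isBlank(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    /** Like String.trim(), but also removes leading and trailing non-breaking spaces. */
    static String trim(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isBlank(s.charAt(start))) start++;
        while (end > start && isBlank(s.charAt(end - 1))) end--;
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(Character.isWhitespace('\u00A0')); // false
        System.out.println(Character.isSpaceChar('\u00A0'));  // true
        System.out.println("[" + trim("\u00A0 Title\u00A0") + "]");
    }
}
```

Character.isWhitespace deliberately excludes the non-breaking spaces U+00A0, U+2007 and U+202F, which is why combining it with Character.isSpaceChar covers them.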

Attachments:

BoilerpipeHTMLContentHandler.java.patch

Patch for /trunk/boilerpipe-core/src/main/de/l3s/boilerpipe/sax/CommonTagActions.java

This is part 2 of a patch related to problems with title parsing.

http://code.google.com/p/boilerpipe/issues/detail?id=38

Original issue reported on code.google.com by [email protected] on 15 Mar 2012 at 2:38

Merged into: #41

Attachments:

CommonTagActions.java.patch

Description of different extractors?

I see that you recently added the canola extractor. Is this extractor better 
for general web text?

Could you provide a high-level summary of the different extractors, and the 
type of pages they work best on? This would be very useful documentation.

Original issue reported on code.google.com by [email protected] on 21 Feb 2011 at 8:13

boilerpipe-web: Charset encoding problem

https://boilerpipe-web.appspot.com/extract?url=http://habr.ru&extractor=ArticleExtractor&output=html

Encoding problem?

Original issue reported on code.google.com by [email protected] on 23 Nov 2010 at 7:03

2 to 3 minutes taken for some URLs

What steps will reproduce the problem?
1.Modified the demo code
2.Compile with following command

javac -cp boilerpipe-1.0.4.jar;lib/nekohtml-1.9.13.jar;lib/xerces-2.9.1.jar Oneliner.java

3.Run with following command

java -cp .;boilerpipe-1.0.4.jar;lib/nekohtml-1.9.13.jar;lib/xerces-2.9.1.jar Oneliner

What is the expected output? What do you see instead?
I am satisfied with the output, but the processing time is not acceptable.

What version of the product are you using? On what operating system?
boilerpipe-1.0.4 under Windows XP 

Please provide any additional information below.
I have attached the modified source code

Original issue reported on code.google.com by [email protected] on 11 May 2010 at 2:47

Attachments:

Oneliner.java

DefaultExtractor.INSTANCE.getText(html): Removes leading special character when it is encoded as a numeric character reference

What steps will reproduce the problem?

DefaultExtractor.INSTANCE.getText(html):

When "html" contains a word whose leading special character is written as a
numeric character reference, such as "Überprüfung"  -> &#220;berpr&#252;fung,

getText() returns only berpr&#252;fung 
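
One possible workaround, sketched here under assumptions, is to decode decimal character references before handing the HTML to the extractor. The class NumericEntityDecoder is hypothetical and not part of boilerpipe, it ignores hexadecimal forms such as &#xDC;, and whether pre-decoding actually avoids the behaviour reported here has not been verified.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class NumericEntityDecoder {
    // Matches decimal references such as &#220; (hex references are not handled).
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    private NumericEntityDecoder() {}

    public static String decode(String html) {
        Matcher m = DECIMAL_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group(); // leave out-of-range references untouched
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#220;berpr&#252;fung"));
    }
}
```

The decoded string could then be passed to DefaultExtractor.INSTANCE.getText(...) in place of the raw input.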


What version of the product are you using? On what operating system?
Version 1.0.2 on Linux

Original issue reported on code.google.com by [email protected] on 4 Jan 2010 at 5:18

Library does not produce same results as http://boilerpipe-web.appspot.com/

1) Go to http://boilerpipe-web.appspot.com/
2) Type in http://arstechnica.com/ as the URL.
3) Use article extractor and HTML (extract fragment)
4) See a nice list of articles on that page

Compare to:
1) Download latest boilerpipe svn.
2) Use the following code:

        final URL url = new URL("http://arstechnica.com/");
        final ArticleExtractor articleExtractor = ArticleExtractor.INSTANCE;
        final HTMLHighlighter htmlHighlighter = HTMLHighlighter.newExtractingInstance();
        final String xhtml = htmlHighlighter.process(url, articleExtractor);
3) xhtml only contains 1 article.

Are there settings that need to be changed? Or is there a code update that 
hasn't been checked in?

Original issue reported on code.google.com by [email protected] on 30 Mar 2012 at 2:50

Word counting code does not account for & being a special HTML symbol

What steps will reproduce the problem?
1. make the method de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.isWord 
public
2. in UnicodeTokenizer.java import static that method
3. add the following main method to UnicodeTokenizer.java : 

    public static void main(String[] args) {
        String html = "A few years later, in 1823, another Knickerbocker, Clement C. Moore, offered his own riff on Irving&rsquo;s version of St. Nicholas. Moore&rsquo;s instantly popular poem &ldquo;A Visit from Saint Nicholas&rdquo; introduced the slightly cloying, but instantly and sensationally popular, symbol of the season&mdash;a &ldquo;chubby and plump...right jolly old elf.&rdquo; (There are those who contend that an author named Henry Livingston Jr. penned the poem, but that&rsquo;s another story altogether.)"; 
        final String[] tokens = UnicodeTokenizer.tokenize(html);
        for( String s : tokens ){
            if( isWord(s) ){
                System.out.println("isWord: "+s);
            } else {
                System.out.println("!isWord: "+s);
            }
        }
    }

What is the expected output? What do you see instead?

That html is from 
http://www.smithsonianmag.com/arts-culture/A-Mischevious-St-Nick-from-the-American-Art-Museum.html 

It uses &rsquo;, as in "Irving&rsquo;s version of St. Nicholas. Moore&rsquo;s 
instantly". The logic used by boilerpipe does not account for that, and the 
program above outputs: 

isWord: Irving
!isWord: &
isWord: rsquo;s
isWord: version
isWord: of
isWord: St.
isWord: Nicholas.
isWord: Moore
!isWord: &
isWord: rsquo;s
isWord: instantly

This shows that "Irving's" and "Moore's" are each broken up and counted as two 
words where they should be one.
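
The expected behaviour can be sketched by decoding named entities before splitting into tokens. This is only an illustration: EntityAwareTokens is a hypothetical class, the entity table is a small subset chosen to cover the sample text, and plain whitespace splitting stands in for boilerpipe's UnicodeTokenizer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public final class EntityAwareTokens {
    // Illustrative subset only; a real decoder would cover the full HTML entity table.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&rsquo;", "\u2019");
        ENTITIES.put("&ldquo;", "\u201C");
        ENTITIES.put("&rdquo;", "\u201D");
        ENTITIES.put("&mdash;", "\u2014");
        ENTITIES.put("&amp;", "&"); // decoded last so "&amp;rsquo;" is not decoded twice
    }

    private EntityAwareTokens() {}

    /** Replaces the known named entities with the characters they stand for. */
    public static String decode(String text) {
        String result = text;
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    /** Decodes entities first, then splits on runs of whitespace. */
    public static String[] tokenize(String text) {
        return decode(text).trim().split("\\s+");
    }

    public static void main(String[] args) {
        for (String token : tokenize("Irving&rsquo;s version of St. Nicholas.")) {
            System.out.println(token);
        }
    }
}
```

With the entity decoded first, "Irving&rsquo;s" stays a single token, so a word counter would see one word instead of two.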

Original issue reported on code.google.com by [email protected] on 22 Jan 2012 at 10:36

Possible improvement to TerminatingBlocksFinder

The following block of code:
final String text = tb.getText().trim();
if (text.startsWith("Comments")
  || N_COMMENTS.matcher(text).find()
  || text.contains("What you think...")
  || text.contains("add your comment")
  || text.contains("Add your comment")
  || text.contains("Add Your Comment")
  || text.contains("Add Comment")
  || text.contains("Reader views")
  || text.contains("Have your say")
  || text.contains("Have Your Say")
  || text.contains("Reader Comments")
  || text.equals("Thanks for your comments - this feedback is now closed")
  || text.startsWith("© Reuters")
  || text.startsWith("Please rate this")

Might be rewritten as:
final String text = tb.getText().trim().toLowerCase();
if (text.startsWith("comments")
  || N_COMMENTS.matcher(text).find()
  || text.contains("what you think...")
  || text.contains("add your comment")
  || text.contains("add comment")
  || text.contains("reader views")
  || text.contains("have your say")
  || text.contains("reader comments")
  || text.equals("thanks for your comments - this feedback is now closed")
  || text.startsWith("© reuters")
  || text.startsWith("please rate this")


It would catch more cases this way and be easier to maintain.

Also, I saw the Washington Post use "Post a Comment", so it could be good to 
add that one as well.
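
A self-contained sketch of the lower-cased variant, with the phrases moved into lists, might look like the following. TerminatingTextMatcher is a hypothetical class, and the N_COMMENTS pattern shown is a stand-in because the real pattern is not quoted in this report. One point to watch: once the text is lower-cased, N_COMMENTS must either be written in lower case or compiled case-insensitively, otherwise it stops matching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public final class TerminatingTextMatcher {
    // Hypothetical stand-in for boilerpipe's N_COMMENTS; the real pattern may differ.
    private static final Pattern N_COMMENTS =
            Pattern.compile("^[0-9]+ comments", Pattern.CASE_INSENSITIVE);

    // All phrases are stored in lower case and compared against lower-cased text.
    private static final List<String> STARTS_WITH = Arrays.asList(
            "comments", "\u00A9 reuters", "please rate this");

    private static final List<String> CONTAINS = Arrays.asList(
            "what you think...", "add your comment", "add comment",
            "post a comment", "reader views", "have your say", "reader comments");

    private static final String FEEDBACK_CLOSED =
            "thanks for your comments - this feedback is now closed";

    private TerminatingTextMatcher() {}

    public static boolean isTerminating(String blockText) {
        // Locale.ROOT avoids locale-specific case mappings (e.g. the Turkish dotless i).
        String text = blockText.trim().toLowerCase(Locale.ROOT);
        if (text.equals(FEEDBACK_CLOSED) || N_COMMENTS.matcher(text).find()) {
            return true;
        }
        for (String prefix : STARTS_WITH) {
            if (text.startsWith(prefix)) {
                return true;
            }
        }
        for (String phrase : CONTAINS) {
            if (text.contains(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isTerminating("Post a Comment"));
        System.out.println(isTerminating("The article continues here."));
    }
}
```

Keeping the phrases in lists means a new marker such as "post a comment" is a one-line addition rather than another clause in the condition.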

Original issue reported on code.google.com by [email protected] on 21 Nov 2010 at 8:15

Included nekohtml 1.9.9 missing LostText class

Boilerpipe 1.1.0 contains a modified version of nekohtml 1.9.9

It seems that this modified version of nekohtml is broken in that it references 
the class LostText but does not include it.

The unmodified release of nekohtml 1.9.9 does not reference or include this 
class and the latest release, 1.9.14, both references and includes it.

This is an issue when using boilerpipe in a project that also uses nekohtml.

Original issue reported on code.google.com by [email protected] on 11 May 2011 at 1:37
