tilaklodha / boilerpipe
Automatically exported from code.google.com/p/boilerpipe
What steps will reproduce the problem?
1. Use the HTMLHighlighter to extract the relevant html-code from a page:
final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
System.out.println(hh.process(url, extractor));
2. Try to parse this page: http://www.golem.de/1102/81290.html
What is the expected output? What do you see instead?
This should be the output:
<H2>
Daniel Domscheit-Berg
</H2>
<H1>
Wikileaks-Aussteiger haben Unterlagen mitgenommen
</H1>
...
But actually I get this:
Daniel Domscheit-Berg
</H2>
Wikileaks-Aussteiger haben Unterlagen mitgenommen
</H1>
...
What version of the product are you using? On what operating system?
- boilerpipe 1.1.0 binary
- OS: Suse
Is it possible to generate exactly the output which the Web API produces? There
are even other tags which seem to be missing like <TABLE> and <TD>.
Original issue reported on code.google.com by [email protected]
on 10 Feb 2011 at 9:07
What steps will reproduce the problem?
1. curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
What is the expected output? What do you see instead?
at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
What version of the product are you using? On what operating system?
1.1.0, Mac OS
Please provide any additional information below.
https://issues.apache.org/jira/browse/TIKA-676
Original issue reported on code.google.com by gabriele%[email protected]
on 18 Jun 2011 at 12:52
Hi, it would be convenient to add an accessor:
public static ImageExtractor getInstance() {
return INSTANCE;
}
This helps when I'm running BP with JRuby.
Further, is there a way to contribute? If this were on GitHub I would already
have made a pull request :)
PS: kudos for a great build experience! I had no problem whatsoever running my
own build.
Original issue reported on code.google.com by [email protected]
on 15 Jan 2012 at 2:10
The problem is that the HTMLHighlighter process seems to omit opening header
tags (<H1>, <H2>, etc.) but includes the closing tags (</H2>), so titles don't
stand out in the output document.
What version of the product are you using? On what operating system?
boilerpipe 1.1.0 ubuntu 10.10
Please provide any additional information below.
Original issue reported on code.google.com by *[email protected]
on 6 Jul 2011 at 2:11
It would be very useful to have an option to keep inline HTML, such as links,
formatting or images, inside the block of HTML which boilerpipe selects.
Original issue reported on code.google.com by tom%[email protected]
on 24 Jan 2010 at 4:36
• What steps will reproduce the problem?
Get an html or htmlFragment from any page
• What is the expected output? What do you see instead?
The output has an XML declaration, but instead of a valid HTML/XML structure
there are extra tags that break the XML:
<?xml version="1.0" encoding="utf-8" ?>
<meta …/>
<base … />
<html>
<body>
...
</body>
</html>
Also, inside <html> the style comes directly after the <html> tag and not in a <head>.
The correct output would be:
<?xml version="1.0" encoding="utf-8" ?>
<html>
<head>
<meta …/>
<base … />
<style>...</style>
</head>
<body>
...
</body>
</html>
• What version of the product are you using? On what operating system?
The Web API http://boilerpipe-web.appspot.com/extract
And thanks for this great *GREAT* tool!!!
--
François
Original issue reported on code.google.com by [email protected]
on 3 Dec 2011 at 4:13
import java.net.URL;
import de.l3s.boilerpipe.extractors.DefaultExtractor;

public class Oneliner
{
    public static void main(final String[] args) throws Exception
    {
        final URL url = new URL("http://a2zmacau.com/1284/ao-man-long-tells-macau-court-he-did-receive-bribes/");
        // This can also be done in one line:
        System.out.println(DefaultExtractor.INSTANCE.getText(url));
    }
}
gives
The former secretary for transport and public works, Ao Man Long who took a
cool US$100
million in bribes and now is serving a 27-year jail sentence for serious
corruption charges,
admitted yesterday to having received money from companies including Seng Meng
Fai.
Ao was a witness in his family’s trial and rejected claims that his relatives
and wife had
knowledge of what the former secretary was doing. Ao also told the court his
family did what he
asked without ever questioning him or the activities involving grand sums of
money and offshore
accounts.
The court repeatedly heard how the former secretary’s family trusted Ao and
his decisions.
However, Ao confessed to receiving large sums of money, but said it had not
been in the way
described in the indictment against him.
The payments were made in increments for services provided to those companies,
however they
did not affect the outcome or the process of the public tenders and winning
bidders, the court
heard.
The court also heard that Ho Meng Fai had made payments to bank accounts under
Ao’s family
members’ names, but were managed by the former secretary. The money was not
related to
bribery nor was it related to corruption, Ao told the court.
The money was “simply” for services Ecoline, one of Ao’s shell
companies, had carried
out, the former secretary said, adding that for the Macau Dome, Ho Meng Fai had
sought
services from Ecoline to contact a projects concession company from the
mainland.
The court heard that this was an example of the types of services Ecoline
carried out.
Ao also said that this time, unlike previously, he was telling the truth. But
he was unable to
itemise all the works where such services and payments were made, saying that
the prosecution
would have to ask the deceased Lee Se Chong, who had all the companies’
contacts.
The court also heard that Ao had only had access to Ecoline in 2006 after
the manager Lee Se
Chong died.
Related Websites
Leave a reply
Search For Macau Hotels
Please notice "after the manager". The HTML of this part is very simple,
<p>... Ecoline in 2006 after the manager Lee Se Chong died.</p>
but contains two consecutive spaces.
Hope this helps to improve your tool, which looks quite good.
Kaspar
Original issue reported on code.google.com by [email protected]
on 7 Jan 2010 at 6:50
It would be nice to have the 1.2.0 release in the Maven repository
http://boilerpipe.googlecode.com/svn/repo/ and even more helpful if it were also
available in Maven Central (1.1.0 is available in both).
Thanks!
Original issue reported on code.google.com by [email protected]
on 7 Jul 2011 at 1:15
What steps will reproduce the problem?
1. Try to extract this URL:
http://sourceforge.net/projects/xampp/files/XAMPP%20Windows/1.7.4/xampp-win32-1.7.4-VC6-installer.exe/download
I used ArticleExtractor.
It prints the following warning a few times:
Warning: SAX input contains nested A elements -- You have probably hit a bug in
your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML
externally and feed it to boilerpipe again. Trying to recover somehow...
and then crashes with an OutOfMemoryError.
I'm using version 1.2.0. I have tested on Windows and on Ubuntu as well.
Original issue reported on code.google.com by [email protected]
on 29 Jul 2011 at 1:27
It would be useful to have the code for deploying boilerpipe on Google App
Engine.
Could you distribute the code you use for http://boilerpipe-web.appspot.com/ ?
Original issue reported on code.google.com by [email protected]
on 21 Mar 2011 at 9:24
Just a small suggestion to help others with reading the code.
I opened ArticleExtractor and saw the following:
return TerminatingBlocksFinder.INSTANCE.process(doc)
| new DocumentTitleMatchClassifier(doc.getTitle()).process(doc)
| NumWordsRulesClassifier.INSTANCE.process(doc)
| IgnoreBlocksAfterContentFilter.DEFAULT_INSTANCE.process(doc)
| BlockProximityFusion.MAX_DISTANCE_1.process(doc)
| BoilerplateBlockFilter.INSTANCE.process(doc)
| BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
| KeepLargestFulltextBlockFilter.INSTANCE.process(doc)
| ExpandTitleToContentFilter.INSTANCE.process(doc);
This was very confusing to me. The | operator in Java is usually reserved for
bitwise operations, and it appears that a boolean OR is being done here, for
which || is typically used. I was surprised this even compiles, though it turns
out to be valid and to yield the same boolean result. It would really help
readability to replace the | with || throughout, since that is the standard
Java convention.
Original issue reported on code.google.com by [email protected]
on 21 Nov 2010 at 7:37
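A note on the | versus || request above: on boolean operands, | is Java's non-short-circuit OR, so every operand is evaluated, while || stops at the first true operand. The sketch below uses a hypothetical step() method as a stand-in for a filter's process(doc) call; it is not boilerpipe code. Assuming each filter in the chain is meant to run for its effect on the document, switching to || would skip the later filters as soon as one reports a change.

```java
import java.util.ArrayList;
import java.util.List;

final class OrOperatorDemo {
    // Names of the hypothetical filter steps that actually ran, in order.
    static final List<String> ran = new ArrayList<>();

    // Stand-in for a filter's process(doc): records that it ran and
    // reports whether it "changed" the document.
    static boolean step(String name, boolean changed) {
        ran.add(name);
        return changed;
    }

    // Non-short-circuit OR: all three steps run.
    static boolean withSingleBar() {
        ran.clear();
        return step("first", true) | step("second", false) | step("third", true);
    }

    // Short-circuit OR: evaluation stops after the first step returns true.
    static boolean withDoubleBar() {
        ran.clear();
        return step("first", true) || step("second", false) || step("third", true);
    }

    public static void main(String[] args) {
        System.out.println(withSingleBar() + " " + ran);
        System.out.println(withDoubleBar() + " " + ran);
    }
}
```

Both variants return the same boolean, which is why the code compiles and looks equivalent; only the set of steps that ran differs.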
The extractor links on the homepage (http://boilerpipe-web.appspot.com/) are
broken. I think they should be changed from .html to .java
The project looks really cool! I'm looking forward to checking it out. Thanks
for making it public.
Original issue reported on code.google.com by [email protected]
on 21 Nov 2010 at 5:56
Consider adding a one-line INSTALL.txt file to the root of the src directory:
"See http://code.google.com/p/boilerpipe/wiki/QuickStart for installation
instructions."
Original issue reported on code.google.com by [email protected]
on 26 Mar 2010 at 6:11
What steps will reproduce the problem?
I am using TagSoup for parsing HTML documents:
URL url = new URL("http://www.bbc.co.uk/news/uk-12038847");
Parser parser = new Parser();
BoilerpipeHTMLContentHandler handler = new BoilerpipeHTMLContentHandler();
parser.setContentHandler(handler);
InputSource is = HTMLFetcher.fetch(url).toInputSource();
parser.parse(is);
// Read the title only after parsing has finished.
System.out.println("T: " + handler.toTextDocument().getTitle());
What is the expected output? What do you see instead?
With the example document from the BBC you should get
"BBC News - Snow disrupts travel across northern Europe"
Instead it is null.
What version of the product are you using? On what operating system?
Trunk
Please provide any additional information below.
The problem can be fixed if I change the
BoilerpipeHTMLContentHandler.characters method
and move the flushBlock() invocation from the beginning of the method to its end (see
attached patch). Since I have no idea why this helps, I am not sure whether that
breaks other things.
Original issue reported on code.google.com by [email protected]
on 20 Dec 2010 at 8:02
Attachments:
I have run across a few news articles that use these characters.
The following articles use the « character (\u00AB):
http://philadelphia.cbslocal.com/2012/02/06/report-1-in-5-children-exposed-to-secondhand-smoke-in-cars/
http://blog.mediaglobal.org/?p=448
I haven't seen too many of them but it looks like the first part is always the
title. It might be safe to assume that parts[0] is the title after performing
the split.
The following article uses the • character (\u2022):
http://ictsd.org/i/news/biores/128000/
Original issue reported on code.google.com by [email protected]
on 22 Mar 2012 at 6:05
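The split-and-take-parts[0] idea from the report above could look roughly like the sketch below. It is only an illustration: the class name, the separator set (the two reported characters plus U+00BB and '|') and the sample title are assumptions, and it does not reproduce how boilerpipe builds its title candidates.

```java
import java.util.regex.Pattern;

final class TitleSeparatorDemo {
    // Assumed separator set: U+00AB and U+2022 from the report, plus U+00BB
    // and '|', each with optional surrounding whitespace.
    static final Pattern SEPARATORS = Pattern.compile("\\s*[\u00AB\u00BB\u2022|]\\s*");

    // Splits a document title into its separator-delimited parts.
    static String[] parts(String title) {
        return SEPARATORS.split(title.trim());
    }

    // Returns the first part, following the report's guess that it is the
    // actual title; returns "" when the title consists only of separators.
    static String firstPart(String title) {
        String[] p = parts(title);
        return p.length == 0 ? "" : p[0];
    }

    public static void main(String[] args) {
        // Hypothetical title, not taken from the linked pages.
        System.out.println(firstPart("Secondhand smoke in cars \u00AB Example News"));
    }
}
```

Whether parts[0] really is the title on every site is the reporter's observation from a few pages, so a real change would probably still compare each part against the page content.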
Now that HTML5 is becoming more pervasive on the web, it might be worth considering
additional parsing support in places, one example being the recently added
image extractor. HTML5 includes <figure> and <figcaption> for adding semantics
to images. The figcaption element is of particular interest, since its text
could be used to determine image relevancy in relation to the extracted
document text.
Original issue reported on code.google.com by [email protected]
on 18 Oct 2011 at 9:03
What steps will reproduce the problem?
Running ArticleExtractor on http://www.seomoz.org/ugc/link-building-management
What is the expected output? What do you see instead?
Expect to see the full article, instead it starts from the last <li> within the
content of the article, causing a large portion of the article to be stripped.
What version of the product are you using? On what operating system?
Using the appspot version
Please provide any additional information below.
This is not an issue with the default extractor, however the default extractor
includes comments.
Original issue reported on code.google.com by [email protected]
on 7 Jun 2011 at 1:50
Christian,
We have a corpus that is a mixture of news articles and other web pages, some
of which contain tables. The ArticleExtractor has trouble with many of these
other pages. Is there a hybrid extractor that detects when it would be better
to run KeepEverythingExtractor and when better to run ArticleExtractor?
Perhaps we should just use KeepEverything for now...?
Thanks!
jrf
Original issue reported on code.google.com by [email protected]
on 27 Apr 2012 at 3:08
Break after tagging a TextBlock as a candidate title; there is no need to continue
checking the rest of the potential titles for the current TextBlock.
Original issue reported on code.google.com by [email protected]
on 20 Mar 2012 at 8:08
Attachments:
What steps will reproduce the problem?
1. Apply boilerpipe-1.1.0 (ArticleExtractor) to a file without explicit
'charset=' meta (e.g.
http://www.slobodnadalmacija.hr/Zadar/tabid/73/articleType/ArticleView/articleId/140666/Default.aspx).
What is the expected output? What do you see instead?
Expected: When no further information is available from the input, non-ASCII
chars are read and written as UTF-8, being the most general and most widely
used character set.
Instead: Non-ASCII chars are misinterpreted as Latin-1 while reading in and
then written as UTF-8.
What version of the product are you using? On what operating system?
boilerpipe 1.1.0 on Ubuntu Linux 10.04 (locale: en_US.utf8)
Please provide any additional information below.
The problem seems to be corrected in the version of the web interface (cf. URL
above). So it should be an easy thing to handle.
Original issue reported on code.google.com by [email protected]
on 14 Jun 2011 at 4:14
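The charset report above can be illustrated with plain JDK classes. The sketch below is one possible fallback policy, not what boilerpipe or its web interface actually does: try a strict UTF-8 decode first and fall back to Latin-1 only when the bytes are not valid UTF-8. The class and method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CharsetFallbackDemo {
    // Decodes as UTF-8 when the bytes are well-formed UTF-8,
    // otherwise falls back to ISO-8859-1 (Latin-1).
    static String decodePreferUtf8(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        String original = "\u0160ibenik"; // starts with S-caron (U+0160)
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // The reported behaviour: UTF-8 bytes read as Latin-1 give two
        // wrong characters in place of the single non-ASCII one.
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread.length() + " vs " + original.length());

        System.out.println(decodePreferUtf8(utf8).equals(original));
    }
}
```

A strict decode is used because the lenient String constructor silently substitutes replacement characters instead of signalling that the guess was wrong.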
I'm trying to get Boilerpipe set up on Android. I'm using Eclipse Indigo and
can build my project.
As a test I am simply trying this:
<code>
String response="";
try {
response = ArticleExtractor.INSTANCE.getText(new URL("http://www.guardian.co.uk/technology/2012/apr/17/walled-gardens-facebook-apple-censors"));
} catch (Exception e) {
e.printStackTrace();
}
</code>
When I deploy it as an Android application, I get a whole bunch of errors,
all looking a little like this:
Dx warning: Ignoring InnerClasses attribute for an anonymous inner class
(org.apache.html.dom.SecuritySupport$1) that doesn't come with an
associated EnclosingMethod attribute. This class was probably produced by a
compiler that did not target the modern .class file format. The recommended
solution is to recompile the class from source, using an up-to-date compiler
and without specifying any "-target" type options. The consequence of ignoring
this warning is that reflective operations on this class will incorrectly
indicate that it is *not* an inner class.
When I take out the ArticleExtractor line I don't get any errors and can
deploy. I wasn't sure whether the problem is with Xerces, but I can deploy
an Android app with the exact same XercesImpl jar file, not using boilerpipe,
and the app runs fine. So the build seems to take issue with Xerces in one
instance and not the other (if that makes sense).
Original issue reported on code.google.com by [email protected]
on 18 Apr 2012 at 9:07
When using HTMLHighlighter, boilerpipe sometimes keeps artifacts
coming from FORM and LABEL tags.
This can easily be prevented by adding a new ignorable element to the TAG_ACTIONS
map in HTMLHighlighter.java:
TAG_ACTIONS.put("FORM", TA_IGNORABLE_ELEMENT);
Original issue reported on code.google.com by [email protected]
on 24 Mar 2012 at 6:40
Hello,
I have come across your API and it seems really impressive.
Is there a way to parse the src URL of the main image in an Article?
If not yet, do you plan to include that in your API as well?
Kind Regards,
Manos
Original issue reported on code.google.com by [email protected]
on 7 May 2012 at 3:59
To reproduce the problem
1. Apply ArticleExtractor to
http://fahadbangladesh.blogspot.com/feeds/posts/default?orderby=updated
2. Same problem happens in DefaultExtractor and CanolaExtractor
What is the expected output? What do you see instead?
The expected output is pure text. But I get html. I've attached the output of
ArticleExtractor for the same url.
What version of the product are you using? On what operating system?
I'm using 1.2.0 version on lmde (based on Debian Testing Rolling distribution)
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 27 Aug 2011 at 4:02
Attachments:
Is it possible to use Boilerpipe as a CLI app together with other bash
commands, e.g. extract the text of an entire website with a command like:
wget -p http://mysite.com | boilerplate -options > file#.html
Original issue reported on code.google.com by [email protected]
on 22 Mar 2011 at 3:26
I'm new to Maven, so forgive me if I'm wrong, but I think boilerpipe needs to
declare Neko as a dependency. I had to add the following to my project, but I
think it should be in the boilerpipe pom.xml instead:
<dependencies>
<dependency>
<groupId>net.sourceforge.nekohtml</groupId>
<artifactId>nekohtml</artifactId>
<version>1.9.14</version>
</dependency>
</dependencies>
Original issue reported on code.google.com by [email protected]
on 21 Nov 2010 at 9:41
The following test case fails:
ArticleExtractor extractor = ArticleExtractor.INSTANCE;
TextDocument textDoc = new BoilerpipeSAXInput(HTMLFetcher.fetch(new URL("http://de.wikipedia.org/wiki/Barack_Obama")).toInputSource()).getTextDocument();
assertEquals("Barack Obama – Wikipedia", textDoc.getTitle());
The attached patch fixes the issue.
Original issue reported on code.google.com by [email protected]
on 26 Jul 2011 at 7:13
Attachments:
I'm looking for a solution to parse pages that are non-english, which seems to
give varying results with Boilerpipe. Here are a couple of examples where
boilerpipe misses the main portion of text (tested with
http://boilerpipe-web.appspot.com/ - 2011-01-06):
* http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik - picks up some teasers instead
* http://www.sydsvenskan.se/malmo/article1346121/I-natt-bargas-det---forhoppningsvis.html - picks up the comment section
* http://www.dn.se/sthlm/tva-raddade-ur-malarvak - all sorts of content from around the article
* http://www.expressen.se/nyheter/1.2280178/smhi-utfardar-klass-2-varning - picks up the comment section
I also see minor artifacts from non-content sections throughout the extracted
text:
* http://hd.se/skane/2011/01/06/mangder-med-sno-over-skane/ - "Skriv ut" is a
link to print the article. "Bildmaterial" is a header from the sidebar
* http://www.dn.se/sthlm/misstankt-brott-bakom-ung-mans-dod - "Dela med andra"
is a header from the sidebar with sharing links
* http://www.expressen.se/noje/1.2280351/lotta-engberg-lamnar-bingolotto -
Misses main header and teaser
I know it's hard to get all the above URLs right without site-specific code,
but I also know it's possible. I've run all of the URLs above through
readability.js, and it parses all of them without any artifacts. Maybe it's
Readability's reliance on class names (which generally are in English even on
foreign-language sites) that makes it cope better. The problem is that readability.js
is a mess to run server-side and has not undergone the rigorous testing
boilerpipe has, so I would much rather see boilerpipe succeed than switch to
readability.js.
Thanks for your hard work.
Original issue reported on code.google.com by EmilStenstrom
on 6 Jan 2011 at 2:43
When using the new ImageExtractor, <img/> tags placed as alternative content in
<object /> tags (normally used in video players using Flash) are not
detected.
It's quite a common practice to embed a video player like:
<object type="application/x-shockwave-flash">
<param name="movie" value='my.swf'/>
<param name="quality" value="high"/>
<param name="allowScriptAccess" value="always"/>
<param name="allowFullScreen" value="true"/>
<param name="wmode" value="opaque"/>
<img src='1328528982826.jpg' alt='yes an alt' title='and a title'/>
<p>some alternative content</p>
</object>
What is the expected output? What do you see instead?
These images should be detected as well.
To detect these images you might only need to comment out the line:
//TAG_ACTIONS.put("OBJECT", TA_IGNORABLE_ELEMENT);
from within ImageExtractor.java
Original issue reported on code.google.com by [email protected]
on 6 Feb 2012 at 3:28
What steps will reproduce the problem?
- ArticleExtractor cannot process a web page having two <body> parts (like the
attached page) and fails with "java.lang.StackOverflowError".
What is the expected output? What do you see instead?
- "noframes" part is for browsers that do not support frames, so boilerpipe
should not take this part into consideration.
What version of the product are you using? On what operating system?
- boilerpipe 1.2.0 on Linux/Windows
Original issue reported on code.google.com by [email protected]
on 14 May 2012 at 2:56
Attachments:
I don't see a news group or other forum for asking questions like this, so
please forgive me making this an issue ticket.
Is there a best practice example for managing boilerpipe with a timeout and
falling back to a series of less sophisticated extractors?
For example, when boilerpipe's ArticleExtractor says:
Warning: SAX input contains nested A elements -- You have probably hit a bug in
your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML
externally and feed it to boilerpipe again. Trying to recover somehow...
and hits an infinite loop, I need to kill it and hammer the text in another way.
Should I just run it inside a thread and kill the thread after allotted time
passes? Or does boilerpipe have tools for doing this kind of thing for me?
What sequence of extractors would you recommend?
Thanks!
John
Original issue reported on code.google.com by [email protected]
on 6 Feb 2012 at 5:17
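One way to approach the timeout-and-fallback question above, using only java.util.concurrent, is sketched below. Everything here is an assumption for illustration: the extractors are modelled as plain Function<String, String> values (HTML in, text out) instead of boilerpipe classes, and the method name and timeout are made up. Note that Future.cancel(true) only interrupts the worker; code stuck in a loop that never checks the interrupt flag keeps running, so daemon threads are used to keep a stuck worker from blocking JVM exit. Hard isolation would need a separate process.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

final class ExtractorFallbackDemo {
    // Tries each extractor in order; moves on when one times out or throws.
    // Returns null if none of them produced a result.
    static String extractWithFallback(String html,
                                      List<Function<String, String>> extractors,
                                      long timeoutMillis) {
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        try {
            for (Function<String, String> extractor : extractors) {
                Callable<String> task = () -> extractor.apply(html);
                Future<String> future = pool.submit(task);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true); // best effort: interrupts the worker
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
            return null;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A caller would pass the extractors from most to least sophisticated, for example wrappers around ArticleExtractor, then DefaultExtractor, then KeepEverythingExtractor; that ordering is a suggestion, not a recommendation from the boilerpipe author.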
Hi,
this module is incredibly good, but it cannot handle domain names with
(German) "Umlaute" (Ä, Ö, Ü, ...). Any ideas how to deal with this problem?
Thanks,
Felix.
Original issue reported on code.google.com by [email protected]
on 21 Jan 2010 at 2:32
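For the umlaut-domain question above, a possible workaround on the caller's side is to convert the host name to its ASCII (Punycode) form with the JDK's java.net.IDN before building the URL that is handed to the extractor. This is a sketch under that assumption; the class, the method and the example host are hypothetical, and it says nothing about how boilerpipe fetches pages internally.

```java
import java.net.IDN;

final class IdnHostDemo {
    // Converts only the host to its ASCII-compatible "xn--" form;
    // scheme and path are kept verbatim.
    static String toAsciiUrl(String scheme, String unicodeHost, String pathAndQuery) {
        return scheme + "://" + IDN.toASCII(unicodeHost) + pathAndQuery;
    }

    public static void main(String[] args) {
        // Hypothetical host with an umlaut (u-umlaut is U+00FC).
        System.out.println(toAsciiUrl("http", "m\u00FCller.de", "/index.html"));
    }
}
```

The resulting string can be passed to new URL(...) like any other ASCII URL.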
Fantastic tool. I've been wondering how to output an extracted HTML fragment instead of
text, similar to what the appspot app does.
Original issue reported on code.google.com by [email protected]
on 20 Nov 2011 at 3:47
Since the TextBlocks can be modified and merged - it would be useful to be able
to clone them e.g. for testing a different Extractor without having to reparse
the HTML.
Original issue reported on code.google.com by [email protected]
on 13 Oct 2010 at 8:25
It seems everything is OK except that the extractor often includes a lot of
JavaScript code from any site, including the one in the demo code. I think this
can be prevented by removing <script> tags in the SAX parsing stage.
Google Analytics tracker code is extracted as content on many web sites.
You could improve this using Readable's algorithm: http://readable-app.appspot.com/
Original issue reported on code.google.com by [email protected]
on 23 Aug 2010 at 7:31
This is part 3 of a patch related to problems with title parsing.
http://code.google.com/p/boilerpipe/issues/detail?id=38
Original issue reported on code.google.com by [email protected]
on 15 Mar 2012 at 2:40
Attachments:
The Highlighter returns the non-boilerplate text. Is there a way to return the
character offsets of the non-boilerplate text in the original HTML? That would
be very useful for me.
Currently, the tool is quite useful as a pre-processor that you pass HTML into
and get back clean plaintext, which you can then pass to an indexing pipeline.
I need to take this a step further and be able to mark up an HTML page with
"interesting terms", i.e. terms that I find in my controlled vocabulary. So I
figured that I could use boilerpipe in this manner:
1) pass the HTML to boilerpipe's HTML highlighter
2) find non-boilerplate text in the HTML (ie character offsets, begin and end
blocks).
3) pass each of these blocks into my application that finds matches in my
controlled vocabulary and record character offsets.
4) return the original HTML page decorated with the annotations from my
controlled vocabulary (using offsets found in 2 and 3 to compute the positions
to decorate).
Currently the closest I can get to this is via the highlighter. But I don't see
a way to get the character positions from the highlighted text.
Any pointers, suggestions, or a new API to do this would be greatly appreciated.
I am using boilerpipe-1.1.0.
Thanks very much,
Sujit
Original issue reported on code.google.com by [email protected]
on 19 Jun 2011 at 9:25
This is part 1 of a 2 part fix for problems with title detection.
Currently setTitle() is sometimes called many times per file, resulting in
the class thinking there is no title when there actually is one: the class
erased the value after setting it.
The problem lies in the way the title is detected, using lastStartTag. If
characters() is called again before the next start tag, the title can be overwritten.
Original issue reported on code.google.com by [email protected]
on 15 Mar 2012 at 2:33
Attachments:
What steps will reproduce the problem?
1. DefaultExtractor.getText(text);
What is the expected output? What do you see instead?
Caused by: de.l3s.boilerpipe.BoilerpipeProcessingException:
org.xml.sax.SAXException: SAX input contains nested A elements -- You have
probably hit a bug in NekoHTML (#2909310). Please clean the HTML externally and
feed it to boilerpipe again
at de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument(BoilerpipeSAXInput.java:54)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:72)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:125)
What version of the product are you using? On what operating system?
1.0.3 Ubuntu,
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 8 Sep 2010 at 8:50
Character.isWhitespace(char) does not consider the non-breaking space
character (160) to be whitespace. As a result, whitespace is not correctly
trimmed when the non-breaking space character is involved. This can cause
DocumentTitleMatchClassifier to miss a title match, as well as other
whitespace-related problems.
The following article uses the character in the title and a few other places:
http://espn.go.com/dallas/nfl/story/_/id/7560381/do-anthony-spencer
Original issue reported on code.google.com by [email protected]
on 20 Mar 2012 at 2:58
Attachments:
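The non-breaking-space behaviour described above is easy to check with the JDK alone. In the sketch below, trimUnicode is a hypothetical helper (it is not the attached patch) that also strips characters for which Character.isSpaceChar returns true, which includes U+00A0.

```java
final class NbspDemo {
    static final char NBSP = '\u00A0'; // decimal 160

    // True for Java whitespace and for Unicode space separators such as NBSP.
    static boolean isSpace(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    // Like String.trim(), but also removes leading and trailing NBSP.
    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isSpace(s.charAt(start))) {
            start++;
        }
        while (end > start && isSpace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String title = NBSP + " Some title" + NBSP;
        System.out.println(Character.isWhitespace(NBSP)); // NBSP is excluded by isWhitespace
        System.out.println(title.trim().length());        // trim() leaves the NBSPs in place
        System.out.println(trimUnicode(title));
    }
}
```

String.trim() only removes characters up to U+0020, so a title padded with U+00A0 would still compare unequal to the same title without it.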
This is part 2 of a patch related to problems with title parsing.
http://code.google.com/p/boilerpipe/issues/detail?id=38
Original issue reported on code.google.com by [email protected]
on 15 Mar 2012 at 2:38
Attachments:
I see that you recently added the canola extractor. Is this extractor better
for general web text?
Could you provide a high-level summary of the different extractors, and the
type of pages they work best on? This would be very useful documentation.
Original issue reported on code.google.com by [email protected]
on 21 Feb 2011 at 8:13
https://boilerpipe-web.appspot.com/extract?url=http://habr.ru&extractor=ArticleExtractor&output=html
Encoding problem?
Original issue reported on code.google.com by [email protected]
on 23 Nov 2010 at 7:03
What steps will reproduce the problem?
1. Modified the demo code
2. Compile with the following command
javac -cp boilerpipe-1.0.4.jar;lib/nekohtml-1.9.13.jar;lib/xerces-2.9.1.jar Oneliner.java
3. Run with the following command
java -cp .;boilerpipe-1.0.4.jar;lib/nekohtml-1.9.13.jar;lib/xerces-2.9.1.jar Oneliner
What is the expected output? What do you see instead?
I am satisfied with the output, but the running time is not acceptable.
What version of the product are you using? On what operating system?
boilerpipe-1.0.4 under Windows XP
Please provide any additional information below.
I have attached the modified source code
Original issue reported on code.google.com by [email protected]
on 11 May 2010 at 2:47
Attachments:
What steps will reproduce the problem?
DefaultExtractor.INSTANCE.getText(html):
When "html" contains a word with leading special char which is coded in
ascii like "Überprüfung" -> Überprüfung
getText() returns only berprüfung
What version of the product are you using? On what operating system?
Version 1.0.2 on Linux
Original issue reported on code.google.com by [email protected]
on 4 Jan 2010 at 5:18
1) Go to http://boilerpipe-web.appspot.com/
2) Type in http://arstechnica.com/ as the URL.
3) Use article extractor and HTML (extract fragment)
4) See a nice list of articles on that page
Compare to:
1) Download latest boilerpipe svn.
2) Use the following code:
final URL url = new URL("http://arstechnica.com/");
final ArticleExtractor articleExtractor = ArticleExtractor.INSTANCE;
final HTMLHighlighter htmlHighlighter = HTMLHighlighter.newExtractingInstance();
final String xhtml = htmlHighlighter.process(url, articleExtractor);
3) xhtml only contains 1 article.
Are there settings that need to be changed? Or is there a code update that
hasn't been checked in?
Original issue reported on code.google.com by [email protected]
on 30 Mar 2012 at 2:50
What steps will reproduce the problem?
1. make the method de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.isWord
public
2. in UnicodeTokenizer.java import static that method
3. add the following main method to UnicodeTokenizer.java :
public static void main(String[] args) {
    String html = "A few years later, in 1823, another Knickerbocker, Clement C. Moore, offered his own riff on Irving&rsquo;s version of St. Nicholas. Moore&rsquo;s instantly popular poem “A Visit from Saint Nicholas” introduced the slightly cloying, but instantly and sensationally popular, symbol of the season—a “chubby and plump...right jolly old elf.” (There are those who contend that an author named Henry Livingston Jr. penned the poem, but that&rsquo;s another story altogether.)";
    final String[] tokens = UnicodeTokenizer.tokenize(html);
    for (String s : tokens) {
        if (isWord(s)) {
            System.out.println("isWord: " + s);
        } else {
            System.out.println("!isWord: " + s);
        }
    }
}
What is the expected output? What do you see instead?
That html is from
http://www.smithsonianmag.com/arts-culture/A-Mischevious-St-Nick-from-the-Americ
an-Art-Museum.html
The page uses &rsquo; for apostrophes, as in "Irving&rsquo;s version of St.
Nicholas. Moore&rsquo;s instantly". Boilerpipe's tokenizing logic does not
account for that, and the program above outputs:
isWord: Irving
!isWord: &
isWord: rsquo;s
isWord: version
isWord: of
isWord: St.
isWord: Nicholas.
isWord: Moore
!isWord: &
isWord: rsquo;s
isWord: instantly
which shows that it is breaking up "Irving's" and "Moore's" into two words
where they are one.
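One possible workaround, sketched under assumptions, is to decode character entities before the text reaches the tokenizer. The class name `EntityAwareTokens`, the small entity table, and the whitespace split are all hypothetical stand-ins chosen for illustration; they are not Boilerpipe's `UnicodeTokenizer` or its parser, and a real fix would more likely rely on the HTML parser's own entity handling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EntityAwareTokens {
    // Hypothetical, deliberately tiny entity table; a real HTML parser knows many more.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&rsquo;", "\u2019");
        ENTITIES.put("&lsquo;", "\u2018");
        ENTITIES.put("&ldquo;", "\u201C");
        ENTITIES.put("&rdquo;", "\u201D");
        ENTITIES.put("&mdash;", "\u2014");
        // Decoded last so that "&amp;rsquo;" stays the literal text "&rsquo;".
        ENTITIES.put("&amp;", "&");
    }

    /** Replaces the known entities with the characters they stand for. */
    static String decodeEntities(String s) {
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }

    /** Whitespace split used as a stand-in tokenizer (an assumption, not Boilerpipe's). */
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        for (String t : decodeEntities(s).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "Irving&rsquo;s version of St. Nicholas. Moore&rsquo;s instantly";
        for (String token : tokenize(html)) {
            System.out.println(token);
        }
    }
}
```

With the entities decoded first, "Irving&rsquo;s" and "Moore&rsquo;s" each reach the tokenizer as a single whitespace-delimited token rather than being split at the "&".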
Original issue reported on code.google.com by [email protected]
on 22 Jan 2012 at 10:36
The following block of code:
final String text = tb.getText().trim();
if (text.startsWith("Comments")
    || N_COMMENTS.matcher(text).find()
    || text.contains("What you think...")
    || text.contains("add your comment")
    || text.contains("Add your comment")
    || text.contains("Add Your Comment")
    || text.contains("Add Comment")
    || text.contains("Reader views")
    || text.contains("Have your say")
    || text.contains("Have Your Say")
    || text.contains("Reader Comments")
    || text.equals("Thanks for your comments - this feedback is now closed")
    || text.startsWith("© Reuters")
    || text.startsWith("Please rate this")
Might be rewritten as:
final String text = tb.getText().trim().toLowerCase();
if (text.startsWith("comments")
    || N_COMMENTS.matcher(text).find()
    || text.contains("what you think...")
    || text.contains("add your comment")
    || text.contains("add comment")
    || text.contains("reader views")
    || text.contains("have your say")
    || text.contains("reader comments")
    || text.equals("thanks for your comments - this feedback is now closed")
    || text.startsWith("© reuters")
    || text.startsWith("please rate this")
It would catch more cases this way and be easier to maintain.
Also, I saw the Washington Post use "Post a Comment", so it could be good to
add that one as well.
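The suggestion above could also be taken one step further by keeping the phrases in lists, so that adding "post a comment" is a one-line change. This is only a sketch: the class name `CommentMarkerCheck` and the method `looksLikeCommentMarker` are hypothetical, and the `N_COMMENTS` pattern is left out because its definition is not shown in the issue.

```java
import java.util.List;
import java.util.Locale;

class CommentMarkerCheck {
    // Phrases from the issue, stored in lower case, plus the suggested "post a comment".
    private static final List<String> CONTAINS = List.of(
        "what you think...", "add your comment", "add comment", "reader views",
        "have your say", "reader comments", "post a comment");
    private static final List<String> STARTS_WITH = List.of(
        "comments", "\u00a9 reuters", "please rate this");
    private static final String EXACT =
        "thanks for your comments - this feedback is now closed";

    /** Returns true if the block text matches one of the known phrases, ignoring case. */
    static boolean looksLikeCommentMarker(String raw) {
        // Locale.ROOT keeps lower-casing independent of the default locale.
        String text = raw.trim().toLowerCase(Locale.ROOT);
        if (text.equals(EXACT)) {
            return true;
        }
        for (String prefix : STARTS_WITH) {
            if (text.startsWith(prefix)) {
                return true;
            }
        }
        for (String phrase : CONTAINS) {
            if (text.contains(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCommentMarker("Post a Comment"));
        System.out.println(looksLikeCommentMarker("An ordinary paragraph of article text."));
    }
}
```

Note that lower-casing also loosens the prefix checks: a block that starts with "comments" in any capitalization now matches, which may or may not be desirable for a given site.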
Original issue reported on code.google.com by [email protected]
on 21 Nov 2010 at 8:15
Boilerpipe 1.1.0 contains a modified version of nekohtml 1.9.9.
It seems that this modified version of nekohtml is broken in that it references
the class LostText but does not include it.
The unmodified release of nekohtml 1.9.9 does not reference or include this
class and the latest release, 1.9.14, both references and includes it.
This is an issue when using boilerpipe in a project that also uses nekohtml.
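To check which copy of nekohtml a project actually loads, a small diagnostic like the following may help. It is a sketch using only standard JDK reflection; the class name `ClassOriginCheck` is hypothetical, and the fully qualified names `org.cyberneko.html.LostText` and `org.cyberneko.html.HTMLTagBalancer` are assumptions about nekohtml's package layout that should be verified against the jar in use.

```java
import java.security.CodeSource;

class ClassOriginCheck {
    /** Returns where a class was loaded from, or null if it cannot be loaded. */
    static String describeOrigin(String className) {
        try {
            // 'false' avoids running static initializers; we only want to locate the class.
            Class<?> c = Class.forName(className, false,
                    ClassOriginCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself usually have no code source.
            return src == null ? "(no code source, likely the JDK)"
                               : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Assumed class names; adjust them to match the nekohtml jar being inspected.
        String[] names = {
            "org.cyberneko.html.LostText",
            "org.cyberneko.html.HTMLTagBalancer"
        };
        for (String name : names) {
            String origin = describeOrigin(name);
            System.out.println(name + " -> " + (origin == null ? "NOT FOUND" : origin));
        }
    }
}
```

If one class resolves to the boilerpipe jar and the other is not found or resolves to a different jar, that points to the mixed-version situation described in this report.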
Original issue reported on code.google.com by [email protected]
on 11 May 2011 at 1:37