grangier / python-goose Goto Github PK

Html Content / Article Extractor, web scrapping lib in Python

License: Apache License 2.0

Python 3.99% HTML 96.01%

python-goose's Introduction

Python-Goose - Article Extractor

Intro

Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a scala project.

This is a complete rewrite in Python. The aim of the software is to take any news article or article-type web page and not only extract what is the main body of the article but also all meta data and most probable image candidate.

Goose will try to extract the following information:

Main text of an article
Main image of article
Any YouTube/Vimeo movies embedded in article
Meta Description
Meta tags

The Python version was rewritten by:

Xavier Grangier

Licensing

If you find Goose useful or have issues please drop me a line. I'd love to hear how you're using it or what features should be improved.

Goose is licensed by Gravity.com under the Apache 2.0 license; see the LICENSE file for more details.

Setup

mkvirtualenv --no-site-packages goose
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install

Take it for a spin

>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg

Configuration

There are two ways to pass configuration to goose. The first one is to pass goose a Configuration() object. The second one is to pass a configuration dict.

For instance, if you want to change the userAgent used by Goose just pass:

>>> g = Goose({'browser_user_agent': 'Mozilla'})

Switching parsers : Goose can now be used with lxml html parser or lxml soup parser. By default the html parser is used. If you want to use the soup parser pass it in the configuration dict :

>>> g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})

Goose is now language aware

For example, scraping a Spanish content page with correct meta language tags:

>>> from goose import Goose
>>> url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Las listas de espera se agravan'
>>> article.cleaned_text[:150]
u'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay m\xe1s ciudad'

Some pages don't have correct meta language tags, you can force it using configuration :

>>> from goose import Goose
>>> url = 'http://www.elmundo.es/elmundo/2012/10/28/espana/1351388909.html'
>>> g = Goose({'use_meta_language': False, 'target_language':'es'})
>>> article = g.extract(url=url)
>>> article.cleaned_text[:150]
u'Importante golpe a la banda terrorista ETA en Francia. La Guardia Civil ha detenido en un hotel de Macon, a 70 kil\xf3metros de Lyon, a Izaskun Lesaka y '

Passing {'use_meta_language': False, 'target_language':'es'} will forcibly select Spanish.

Video extraction

>>> import goose
>>> url = 'http://www.liberation.fr/politiques/2013/08/12/journee-de-jeux-pour-ayrault-dans-les-jardins-de-matignon_924350'
>>> g = goose.Goose({'target_language':'fr'})
>>> article = g.extract(url=url)
>>> article.movies
[<goose.videos.videos.Video object at 0x25f60d0>]
>>> article.movies[0].src
'http://sa.kewego.com/embed/vp/?language_code=fr&playerKey=1764a824c13c&configKey=dcc707ec373f&suffix=&sig=9bc77afb496s&autostart=false'
>>> article.movies[0].embed_code
'<iframe src="http://sa.kewego.com/embed/vp/?language_code=fr&amp;playerKey=1764a824c13c&amp;configKey=dcc707ec373f&amp;suffix=&amp;sig=9bc77afb496s&amp;autostart=false" frameborder="0" scrolling="no" width="476" height="357"/>'
>>> article.movies[0].embed_type
'iframe'
>>> article.movies[0].width
'476'
>>> article.movies[0].height
'357'

Goose in Chinese

Some users want to use Goose for Chinese content. Chinese word segmentation is way more difficult to deal with than occidental languages. Chinese needs a dedicated StopWord analyser that need to be passed to the config object.

>>> from goose import Goose
>>> from goose.text import StopWordsChinese
>>> url  = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
香港行政长官梁振英在各方压力下就其大宅的违章建筑（僭建）问题到立法会接受质询，并向香港民众道歉。

梁振英在星期二（12月10日）的答问大会开始之际在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的意图和动机。

一些亲北京阵营议员欢迎梁振英道歉，且认为应能获得香港民众接受，但这些议员也质问梁振英有

Goose in Arabic

In order to use Goose in Arabic you have to use the StopWordsArabic class.

>>> from goose import Goose
>>> from goose.text import StopWordsArabic
>>> url = 'http://arabic.cnn.com/2013/middle_east/8/3/syria.clashes/index.html'
>>> g = Goose({'stopwords_class': StopWordsArabic})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
دمشق، سوريا (CNN) -- أكدت جهات سورية معارضة أن فصائل مسلحة معارضة لنظام الرئيس بشار الأسد وعلى صلة بـ"الجيش الحر" تمكنت من السيطرة على مستودعات للأسل

Goose in Korean

In order to use Goose in Korean you have to use the StopWordsKorean class.

>>> from goose import Goose
>>> from goose.text import StopWordsKorean
>>> url='http://news.donga.com/3/all/20131023/58406128/1'
>>> g = Goose({'stopwords_class':StopWordsKorean})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
경기도 용인에 자리 잡은 민간 시험인증 전문기업 ㈜디지털이엠씨(www.digitalemc.com). 
14년째 세계 각국의 통신·안전·전파 규격 시험과 인증 한 우물만 파고 있는 이 회사 박채규 대표가 만나기로 한 주인공이다. 
그는 전기전자·무선통신·자동차 전장품 분야에

Known issues

There are some issues with unicode URLs.
Cookie handling : Some websites need cookie handling. At the moment the only work around is to use the raw_html extraction. For instance:

>>> import urllib2 >>> import goose >>> url = "http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp" >>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor()) >>> response = opener.open(url) >>> raw_html = response.read() >>> g = goose.Goose() >>> a = g.extract(raw_html=raw_html) >>> a.cleaned_text u'CAIRO u2014 For a moment, at least, American and European diplomats trying to defuse the volatile standoff in Egypt thought they had a breakthrough.nnAs t'

TODO

Video html5 tag extraction

python-goose's People

Contributors

Stargazers

Watchers

Forkers

anthonynystrom litso toddwilson mpuig evan0 alex-dow muggot palyan yoyossy aurora1625 happyfeng fanfannothing angelzou fudong1127 bearstech shaoneng netconstructor scraping-xx bitwjg alfredyuan halfnhav4 mozii danielmagnussons inman x100up hbprotoss codelurker samcrosoft faustow tbkraf08 ddesign84 stupied4ever bigdata-tools big-data yuanshanxiaoni aviks djrondon daqing15 kvarkson twubs psilva261 pansuo nimast pombredanne dokalanyi nopper ewencp uservidya hellobin tazjel malmros weewillie adamar subblime wuschel zoeyyoung rfoo gth158a harikt rebeling robbestad tgallant lemonhall waytai bocode dreamfrog jimmy0000 bgruszka shabeermothi markosski frnsys zhoubug imclab ofshellohicy priyankt68 tdomhan bahrunnur iyangming nubela zhefeng igivefirst sjain07 antlypls jeffnappi topcaoagui ricky-wilson zee724 metricle mulinfro ceablecui wodow beiyexertz franklin521 luisyang joshuag quintos silentninja johnconnelly75 wyrover ageitgey

python-goose's Issues

Fail to extract text from CNN

Hello, I tried to extract the text of a certain page of CNN. Here is the link "http://edition.cnn.com/2013/09/04/travel/recycled-planes/index.html?eref=edition" and here is my code

g = Goose() article = g.extract(url="http://edition.cnn.com/2013/09/04/travel/recycled-planes/index.html?eref=edition") article.cleaned_text

The result is this:

'''It's a plane! It's a boat! It's a house!

If only all airplane seats were this comfy

"We went for the 'raw machine' look this time"

For a certain type of bachelor pad

A new way to serve

Plane on the outside, party on the inside

From up in the air to underwater''''

As I can see they are few of the images' labels.

IOError: cannot identify image file

I try to use goose in python 2.7 on windows, but the IOError("cannot identify image file") raised in PIL\Image.py.
How could I resolve this problem? Thanks.

Add indonesian stopword file

Got a UnicodeEncodeError when the image url is accented

I was parsing this link:

http://www.slate.fr/story/64063/tapie-mougeotte-la-provence

Here is the complete traceback:

UnicodeEncodeError('ascii', u'http://www.slate.fr/sites/default/files/photos/Capture d\u2019e\u0301cran 2012-10-29 a\u0300 13.28.45.png', 56, 57, 'ordinal not in range(128)')

Stacktrace (most recent call last):

File "semantism/process.py", line 112, in index_url
link_extractor.extract()

File "semantism/link.py", line 159, in extract
getattr(self, method_name)(self.response.content, url)

File "semantism/link.py", line 123, in extract_text_html
article = self.goose.extractContent(url=url, rawHTML=page_content)

File "goose/Goose.py", line 52, in extractContent
return self.sendToActor(cc)

File "goose/Goose.py", line 59, in sendToActor
article = crawler.crawl(crawlCandiate)

File "goose/Crawler.py", line 92, in crawl
article.topImage = imageExtractor.getBestImage(article.rawDoc, article.topNode)

File "goose/images/UpgradedImageExtractor.py", line 85, in getBestImage
image = self.checkForLargeImages(topNode, 0, 0)

File "goose/images/UpgradedImageExtractor.py", line 116, in checkForLargeImages
goodImages = self.getImageCandidates(node)

File "goose/images/UpgradedImageExtractor.py", line 260, in getImageCandidates
goodImages = self.findImagesThatPassByteSizeTest(filteredImages)

File "goose/images/UpgradedImageExtractor.py", line 276, in findImagesThatPassByteSizeTest
locallyStoredImage = self.getLocallyStoredImage(imgSrc)

File "goose/images/UpgradedImageExtractor.py", line 339, in getLocallyStoredImage
self.linkhash, imageSrc, self.config)

File "goose/images/ImageUtils.py", line 51, in storeImageToLocalFile
image = self.readExistingFileInfo(linkhash, imageSrc, config)

File "goose/images/ImageUtils.py", line 77, in readExistingFileInfo
localImageName = self.getLocalFileName(linkhash, imageSrc, config)

File "goose/images/ImageUtils.py", line 104, in getLocalFileName
imageHash = hashlib.md5(imageSrc).hexdigest()

Goose not able to extract content from popular websites

Please advise where do I need to make the changes to make sure these sites work : mashable.com, usatoday.com, politicalwire.com

http://stackoverflow.com/questions/21397893/python-goose-not-able-to-extract-mashable-usatoday-politicalwire-articles

Thanks

Goose crashes if no url is provied

>>> from goose import Goose
>>> g = Goose()
>>> g.extract(raw_html='<p>bla</p>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "goose/__init__.py", line 53, in extract
    return self.crawl(cc)
  File "goose/__init__.py", line 60, in crawl
    article = crawler.crawl(crawl_candiate)
  File "goose/crawler.py", line 53, in crawl
    parse_candidate = URLHelper.getCleanedUrl(crawl_candidate.url)
  File "goose/utils/__init__.py", line 94, in getCleanedUrl
    if '#!' in urlToCrawl else urlToCrawl
TypeError: argument of type 'NoneType' is not iterable
>>>

Correctly delete tmp file in /tmp/goosetmp

Image stored files are not cleanly deleted in /tmp/gosetmp.

I'll work on this one later.

broken travis image

Not processing images - can we skip the creation of a local storage path

We're not using Goose to process any images, but despite this it requires the existence of a local storage path, and complains if it can't write to the server's filesystem. I have to make sure /tmp/goose is always available and writable, and occasionally get errors from Goose complaining that it can't write some file there.

Is there a way to turn this feature off completely and never have Goose write to the local filesystem, even if it disables image processing?

Why not use the Pillow ? That's the "friendly" PIL fork

Multiprocessing does not work on Goose().extract(raw_html=...)

This was tested with python's multiprocessing library.

I checked the code, everything hangs at the line prior to Goose().extract(...)
not at join() or run(), etc.

This example is specific but the problem is general. Has anyone gotten multiprocessing to work with
goose's extraction period? If so a brief explanation of how or a tiny code snippet sample would be great thanks.

Or at least an explanation of why my code blow is incorrect would be great.!

Sorry if i'm doing something blatantly wrong. This is my first time
doing multiprocessing in python.

from multiprocessing import Process, Queue
from multiprocessing import cpu_count as num_cores
import pickle
import codecs
from goose import Goose

class Processor(Process):

    def __init__(self, queue, html):
        super(Processor, self).__init__()
        self.queue = queue
        self.html = html

    def ret(self):
        g = Goose()
        article = g.extract(raw_html=self.html)  # THE CODE HANGS HERE
        pickle.dump(article, codecs.open(str(id(article))+'.txt', 'wb'))
        return str(id(article))

    def run(self):
        self.queue.put(self.ret())

processes = []

if __name__ == '__main__':
     for i in range(0, num_cores()):
         q = Queue()
         html = ... 
         p = Processor(q, html)
         processes.append((p, q))
         print 'appending', (p, q)
         p.start()

     for val in processes:
         val[0].join()
         id_ = val[1].get()
         article = pickle.load( codecs.open(id_+'.txt', 'rb'))
         print article.cleaned_text

Program with goose can't run with .exe make in pyinstaller

Traceback (most recent call last):
File "", line 7, in
File "C:\Users\user\Desktop\build\pyi.win32\testgoose11\out00-PYZ.pyz\goose.G
oose", line 52, in extractContent
File "C:\Users\user\Desktop\build\pyi.win32\testgoose11\out00-PYZ.pyz\goose.G
oose", line 59, in sendToActor
File "C:\Users\user\Desktop\build\pyi.win32\testgoose11\out00-PYZ.pyz\goose.C
rawler", line 86, in crawl
File "C:\Users\user\Desktop\build\pyi.win32\testgoose11\out00-PYZ.pyz\goose.e
xtractors", line 245, in calculateBestNodeBasedOnClustering
File "C:\Users\user\Desktop\build\pyi.win32\testgoose11\out00-PYZ.pyz\goose.t
ext", line 97, in init
File "C:\Users\user\Desktop\build\pyi.win32\testgoose11\out00-PYZ.pyz\goose.u
tils", line 76, in loadResourceFile
IOError: Couldn't open file C:\Users\user\Desktop\dist\testgoose11.exe?175104\g
oose/resources/text/stopwords-en.txt

article.cleaned_text is null

Not getting cleaned_text

>>> article = g.extract("http://vocamus.net/dave/?p=1602")
>>> article.cleaned_text

Arabic support

Hello,

I tried the library with an arabic article URL but the cleaned_text wasn't extracted at all.

Example:

from goose import Goose
url = 'http://www.alrai.com/article/599211.html'
g = Goose()
article = g.extract(url=url)
article.title
u'\u0627\u0644\u0642\u0627\u0626\u062f \u0627\u0644\u0623\u0639\u0644\u0649 \u064a\u0632\u0648\u0631 \u0627\u0644\u0648\u0627\u062c\u0647\u0629 \u0627\u0644\u0634\u0645\u0627\u0644\u064a\u0629 \u0627\u0644\u0634\u0631\u0642\u064a\u0629'
article.cleaned_text
u''

Since you don't have the stop words list for arabic, i couldn't set the 'target_language' to 'ar' because an error would take place.

Please advise.

article - don't pass article all the way around

Goose doesn't handle cookies

Some website requires cookie handeling. This is the case for the NYT. It doesn't play well with 4d1ccaf

Does not work on AWS

Goose cannot write to a directory /tmp directory. It completely shuts down my instance.

I've been able to run it by changing permissions on AWS but I was wondering if there was a less hacky way to do it in Goose.

setup.py doesn't install resources folder in Egg

As title. Raises error:
IOError: Couldn't open file [path to virtualenv]/lib/python2.7/site-packages/goose-0.0.1-py2.7.egg/goose/resources/text/stopwords-en.txt

Also, nice idea. Goose is great, but Scala? ;)

Doesnt seem to work for other language articles like spanish.Is there any solution for this?

WindowsError: [Error 32] The process cannot access the file because it is being used by another process

I am using Goose on Windows Platform.

Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.

from goose import Goose
Goose()
Traceback (most recent call last):
File "", line 1, in
File "d:\Program Files (x86)\python273\lib\site-packages\goose_extractor-1.0.8
-py2.7.egg\goose__init__.py", line 38, in init
self.initialize()
File "d:\Program Files (x86)\python273\lib\site-packages\goose_extractor-1.0.8
-py2.7.egg\goose__init__.py", line 82, in initialize
os.remove(path)
WindowsError: [Error 32] The process cannot access the file because it is being
used by another process: 'c:\users\danyang\appdata\local\temp\goose\tmpj2
avys'

Text chunks get duplicated

'... Her choice? James Franco. (“I think he’s really cool.”) ...' appears twice in the text although it appears only once on he original page:

http://www.vanityfair.com/online/oscars/2013/10/britney-spears-breaking-bad-miley-cyrus-fifty-shades-of-grey

Goose fails in extracting articles from Gizmodo and NY Times.

I tried Goose to extract articles in Gizmodo and NY Times but it just returned a blank string or a faulty extraction.

Provide a method to extract with raw html

Hello,
First of all thanks for this very good job, I just wanted to make a little suggestion it could be great to propose a way to submit directly raw html and let the user manage his own preferences (Cookies, key, user agents proxy...).
I'm using python-requests on the top of this and trying to use goose just to extract the article. I will make a pull request if I found out how to do that.

Portuguese stopword file

Text extraction in Portuguese doesn't play well du to a missing stopword file #60

publish_date is always null

Out of all the links I've tried, the publish_date is always null.

local_storage_path: $TEMPDIR/goose isn't guaranteed to be available

I've hit a problem with running Goose twice in separate users: both expect to write to /tmp/goose (specifically, the value of self.config.local_storage_path defaults to tempfile.gettempdir() + "goose")

Obviously, I should be able to change the configuration options, but this still seems like a bug.

Goose Date Extraction

article.publish_date always returns None. I did not find a single article for which it worked. What is the fix to this problem? Thanks

Add support for UL, OL, PRE, other non-P elements

I’m testing Goose and finding that elements other than paragraphs are unavailable in cleaned_text or top_node.

What I did:

url = 'https://github.com/grangier/python-goose'
g = Goose()
a = g.extract(url=url)

I expected to find the single-item list with “Xavier Grangier” and all the code samples in the output, but they were not there. I would be interested to see an additional property in the output, something like source_node that made the non-cleaned element tree of the original content available.

korean support

Switching to beautifulsoup4 for Python 3 support?

Are there any plans to change the beautifulsoup dependency to beautifulsoup4 for Python 3 support? Or are there other factors as well before this will be py3 compatible?

Does goose supports summary the html content?

Hi, I use goose to test some prototype ideas. It's very convenient to extract text from html. I find it support extracting meta description of a web page. But some html files don't have meta description tag. Can goose help me to do summary of the html content? Thanks!

Add bounds check to index access

Correction of some errors for python-goose

I created some changes in python-goose project.
Some of them are changes in algorithms, but others are correction of errors.
Here is list of fixes:
muggot@9b07707
muggot@a0949f0
muggot@5f29a29
muggot@7542fbb
muggot@7420632
muggot@430116f
muggot@64a80c7
muggot@69e5149
muggot@97d94ba
muggot@7279ebc
muggot@b9a4516
muggot@b903708
muggot@922e227
Please, review the commits.
If you find some of them useful, then I can create new fork with selected changes
and pull request.

Extract HTML instead of Plaintext

I'm working on a project that requires that I preserve the formatting of the extracted content. I was wondering if it's possible to get the topmost DOM node that contains the extracted text.

Thanks for the great port btw.

<br> tags are mostly ignored

Hi,

(Unfortunately) some sites rely on br tags for newlines, an example is:

http://allnewlyrics.com/only-one-lyrics-pj-morton-ft-stevie-wonder.html

The newlines are almost completely ignored...

Is it difficult to solve that? I already tried to understand the crawler. So far I've seen that text included inside br tags gets preserved. On the other hand in goose/outputformatters.py self.remove_fewwords_paragraphs(article) removes single
tags. Also the order of the br node seems to be changed at an early stage of the process. I wonder which point that is...

Cheers, Philip

cannot identify image file

url = "http://manc.it/13L8Jcx"
article = g.extract(url=url)
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/init.py", line 53, in extract
return self.crawl(cc)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/init.py", line 60, in crawl
article = crawler.crawl(crawl_candiate)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/crawler.py", line 98, in crawl
article.top_image = image_extractor.get_best_image(article.raw_doc, article.top_node)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/images/extractors.py", line 88, in get_best_image
image = self.check_large_images(topNode, 0, 0)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/images/extractors.py", line 120, in check_large_images
good_images = self.get_image_candidates(node)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/images/extractors.py", line 264, in get_image_candidates
good_images = self.get_images_bytesize_match(filtered_images)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/images/extractors.py", line 280, in get_images_bytesize_match
local_image = self.get_local_image(src)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/images/extractors.py", line 343, in get_local_image
self.link_hash, src, self.config)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/images/utils.py", line 59, in store_image
image = self.write_localfile(data, link_hash, src, config)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/images/utils.py", line 101, in write_localfile
return self.read_localfile(link_hash, src, config)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/images/utils.py", line 81, in read_localfile
image_details = self.get_image_dimensions(identify, local_image_name)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/images/utils.py", line 36, in get_image_dimensions
image = Image.open(path)
File "/usr/local/lib/python2.7/dist-packages/PIL/Image.py", line 2008, in open
raise IOError("cannot identify image file")
IOError: cannot identify image file
g.config.local_storage_path
'/tmp/goose'

Add other language support

It would be perfect if goose would have support of other wide languages such as Russian German Greek etc.
So I have found project with prepared stop words under GPL v3 licence (https://code.google.com/p/stop-words/). There are over 20 languages.

Goose fails to return a proper title

Hi,

I was trying to parse a feed generated by tumblr at http://www.bonjourmadame.fr/
and stumbled upon a parsing failure at http://www.bonjourmadame.fr/post/50484319059/charlotte-3

extract = g.extract(url='http://www.bonjourmadame.fr/post/50484319059/charlotte-3')
extract.title gives a really bad output. It looks like the problem comes from this line:

    self.doc = lxml.html.fromstring(html)

where lxml defaults to strict/stupid parsing.

Replacing this line with the fromstring from beautifulsoup (from lxml.html.soupparser import fromstring) appears to fix the issue.

Call to html.xpath('string(/head/title)') then returns "Bonjour Madame • Charlotte <3" which is the appropriate output.

Could the html parser be made switchable ?

Italian stop words

cannot deal with traditional chinese content

example url : http://violetiva.pixnet.net/blog/post/28652839-%5B%E9%A3%9F%E8%A8%98%5D-%E5%8F%B0%E5%8C%97%E2%80%A7%E6%9D%8F%E5%AD%90%E8%B1%AC%E6%8E%92-

content will be sucessfully extracted using viewtex.org api
e.g., http://viewtext.org/article?url=http://violetiva.pixnet.net/blog/post/28652839-%5B%E9%A3%9F%E8%A8%98%5D-%E5%8F%B0%E5%8C%97%E2%80%A7%E6%%209D%8F%E5%AD%90%E8%B1%AC%E6%8E%92-

but get nothing through goose

article.topNode will be none and
article.cleanedArticleText also none , too.

Stopwords-hi does not exist

For some websites, when I run goose, I get the following error:

Couldn't open file /home/ubuntu/crsq-virtualenv/local/lib/python2.7/site-packages/goose_extractor-1.0.6-py2.7.egg/goose/resources/text/stopwords-hi.txt

What can be the reason?

lxml.html doesn't play well with html5

Parser class uses lxml.html.fromstring on some document the whole article is stripped out the document produced, for exemple : http://www.lefigaro.fr/conjoncture/2013/04/05/20002-20130405ARTFIG00473-montebourg-envisage-des-privatisations-partielles.php

In this case goose extractor doesn't return any top_node (obviously) as it does'nt exist.

How to get article with all images

Hi, I tied to parse articles which has some images, but goose returns only main image. How to get all article images?

ValueError: Unicode strings with encoding declaration are not supported

Traceback:

url = "http://www.academyshop.co.uk/index.php?route=product/category&path=181_461"
at = g.extract(url=url)
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/init.py", line 53, in extract
return self.crawl(cc)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/init.py", line 60, in crawl
article = crawler.crawl(crawl_candiate)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/crawler.py", line 63, in crawl
doc = self.get_document(raw_html)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/crawler.py", line 135, in get_document
doc = self.parser.fromstring(raw_html)
File "/usr/local/lib/python2.7/dist-packages/goose_extractor-1.0.2-py2.7.egg/goose/parsers.py", line 54, in fromstring
self.doc = lxml.html.fromstring(html)
File "/usr/lib/python2.7/dist-packages/lxml/html/init.py", line 634, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, *_kw)
File "/usr/lib/python2.7/dist-packages/lxml/html/init.py", line 532, in document_fromstring
value = etree.fromstring(html, parser, *_kw)
File "lxml.etree.pyx", line 2756, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54726)
File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82754)
ValueError: Unicode strings with encoding declaration are not supported.

check_large_images removes width and height information

From images/extractors.py

if scored_images:
                highscore_image = sorted(scored_images.items(),
                                        key=lambda x: x[1], reverse=True)[0][0]
                main_image = Image()
                main_image.src = highscore_image.src
                main_image.extraction_type = "bigimage"
                main_image.confidence_score = 100 / len(scored_images) \
                                    if len(scored_images) > 0 else 0
                return main_image

Any reason we don't copy image and height information over to the returned image?
Or just update and return highscore_image?

Atli.

Bad case for image extraction

Hi, I use goose to extract images from a Chinese news site. Some news articles dont't have images. But goose gives me one from the sidebar of the page.

For example:
url: http://news.xinhuanet.com/fortune/2013-10/10/c_125507992.htm

And Goose give me this image: http://news.xinhuanet.com/fortune/titlepic/117523073_title1n.jpg

This image is at the right site of this page.

How can I fix it!

Thanks for your help.

Resourse issue when build from setup.py

workspace/python-goose-1.0.0$ sudo python setup.py build
running build
running build_py
running egg_info
writing requirements to goose.egg-info/requires.txt
writing goose.egg-info/PKG-INFO
writing top-level names to goose.egg-info/top_level.txtwarning: no files found matching '' under directory 'goose/resources/parser'
warning: no files found matching '' under directory 'goose/resources/statichtm
writing dependency_links to goose.egg-info/dependency_links.txt
reading manifest file 'goose.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '' under directory 'goose/resources/parser'
warning: no files found matching '' under directory 'goose/resources/statichtml'
writing manifest file 'goose.egg-info/SOURCES.txt'

It's result that goose can't be used, this is the error i get
Traceback (most recent call last):
File "/home/angel/workspace/accounting_tfm/vector.py", line 6, in
class read_file:
File "/home/angel/workspace/accounting_tfm/vector.py", line 7, in read_file
from goose import Goose
File "/home/angel/workspace/python-goose/dist/goose-1.0.0-py2.7.egg/goose/init.py", line 25, in
File "/home/angel/workspace/python-goose/dist/goose-1.0.0-py2.7.egg/goose/configuration.py", line 23, in
File "/home/angel/workspace/python-goose/dist/goose-1.0.0-py2.7.egg/goose/text.py", line 26, in
File "/home/angel/workspace/python-goose/dist/goose-1.0.0-py2.7.egg/goose/utils/init.py", line 29, in
ImportError: No module named 'urlparse'

Do we need to download images to file?

Aren't we only looking for width,height and number of bytes?

A head request gives you the content-length and something like https://gist.github.com/atlithorn/6155288 for width and height could remove the need for /tmp access.

If you are interested I'll create a pull request.