Comments (11)

grangier commented on September 22, 2024

Hello,

Please provide a full traceback of the error.

from python-goose.

bitwjg commented on September 22, 2024

The full traceback is as follows:

Traceback (most recent call last):
File "C:\Users\v-jingaw\workspace\ArticleExtractor\src\msra\km\jingang\demo.py", line 11, in <module>
article = g.extractContent(url)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\Goose.py", line 52, in extractContent
return self.sendToActor(cc)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\Goose.py", line 59, in sendToActor
article = crawler.crawl(crawlCandiate)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\Crawler.py", line 93, in crawl
article.topImage = imageExtractor.getBestImage(article.rawDoc, article.topNode)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 86, in getBestImage
image = self.checkForLargeImages(topNode, 0, 0)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 134, in checkForLargeImages
depthObj.parentDepth, depthObj.siblingDepth)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 134, in checkForLargeImages
depthObj.parentDepth, depthObj.siblingDepth)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 117, in checkForLargeImages
goodImages = self.getImageCandidates(node)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 261, in getImageCandidates
goodImages = self.findImagesThatPassByteSizeTest(filteredImages)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 277, in findImagesThatPassByteSizeTest
locallyStoredImage = self.getLocallyStoredImage(imgSrc)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 340, in getLocallyStoredImage
self.linkhash, imageSrc, self.config)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\ImageUtils.py", line 59, in storeImageToLocalFile
image = self.writeEntityContentsToDisk(data, linkhash, imageSrc, config)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\ImageUtils.py", line 101, in writeEntityContentsToDisk
return self.readExistingFileInfo(linkhash, imageSrc, config)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\ImageUtils.py", line 81, in readExistingFileInfo
imageDetails = self.getImageDimensions(identify, localImageName)
File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\ImageUtils.py", line 36, in getImageDimensions
image = Image.open(filePath)
File "C:\Python27\lib\site-packages\PIL\Image.py", line 1980, in open
raise IOError("cannot identify image file")
IOError: cannot identify image file

grangier commented on September 22, 2024

Hello,

The latest commit, e365e70, should resolve the issue. Could you pull the new head and test again? Warning: the API has changed following the merge of the camelcaseless branch. Refer to the README for the changes.

Thanks

bitwjg commented on September 22, 2024

Hi, I tried the new version, but unfortunately it still doesn't work. The traceback is attached below. Have I missed something?

Traceback (most recent call last):
File "C:\Users\v-jingaw\workspace\ArticleExtractor\src\msra\km\jingang\demo.py", line 11, in <module>
article = g.extract(url=url)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\__init__.py", line 53, in extract
return self.crawl(cc)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\__init__.py", line 60, in crawl
article = crawler.crawl(crawl_candiate)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\crawler.py", line 93, in crawl
article.top_image = image_extractor.get_best_image(article.raw_doc, article.top_node)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\extractors.py", line 86, in get_best_image
image = self.check_large_images(topNode, 0, 0)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\extractors.py", line 118, in check_large_images
good_images = self.get_image_candidates(node)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\extractors.py", line 262, in get_image_candidates
good_images = self.get_images_bytesize_match(filtered_images)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\extractors.py", line 278, in get_images_bytesize_match
local_image = self.get_local_image(src)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\extractors.py", line 341, in get_local_image
self.link_hash, src, self.config)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\utils.py", line 59, in store_image
image = self.write_localfile(data, link_hash, src, config)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\utils.py", line 101, in write_localfile
return self.read_localfile(link_hash, src, config)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\utils.py", line 81, in read_localfile
image_details = self.get_image_dimensions(identify, local_image_name)
File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\utils.py", line 36, in get_image_dimensions
image = Image.open(path)
File "C:\Python27\lib\site-packages\PIL\Image.py", line 1980, in open
raise IOError("cannot identify image file")
IOError: cannot identify image file

grangier commented on September 22, 2024

Did you pass a local_storage_path in the Configuration object? For instance:
g = Goose({'local_storage_path': 'C:'})
...

If this doesn't work, please provide the value of g.config.local_storage_path and the URL you are trying to extract.
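
As an aside on the suggestion above: one way to be sure the storage path exists and is writable on any platform is to let Python create it. The following is only a sketch; the helper name make_storage_config and the directory prefix are illustrative and not part of goose, and the Goose(...) call is left commented out because it depends on the installed version.

```python
import os
import tempfile


def make_storage_config():
    """Build a config dict whose storage path is a fresh, writable directory.

    make_storage_config is a hypothetical helper, not a goose function.
    """
    # mkdtemp creates the directory and returns its absolute path.
    storage_dir = tempfile.mkdtemp(prefix="goose_images_")
    return {"local_storage_path": storage_dir}


config = make_storage_config()
print(config["local_storage_path"])

# The dict would then be passed as in the thread (not executed here):
# g = goose.Goose(config)
```

Printing the path also answers the request for the value that ends up in local_storage_path.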

bitwjg commented on September 22, 2024

Hi, I tried a list of URLs.
The demo fails with the URL http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html
However, when I tried a URL whose page has no images (such as the Wikipedia page http://en.wikipedia.org/wiki/Aharon_Barak), the demo worked and extracted the content successfully.

grangier commented on September 22, 2024

Did you change the local_storage_path?

grangier commented on September 22, 2024

It works well here:

>>> import goose
>>> url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
>>> g = goose.Goose({'local_storage_path': '/home/'})
>>> article = g.extract(url=url)
>>> article.title
u'Las listas de espera se agravan'
>>> article.cleaned_text[:150]
u'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay m\xe1s ciudad'
>>> article.top_image.src
'http://ep01.epimg.net/sociedad/imagenes/2012/10/27/actualidad/1351332873_157836_1351354920_noticia_normal.jpg'

Please make sure that you pass a valid path as local_storage_path: goose.Goose({'local_storage_path': '/home/'}).

Otherwise, please check your PIL setup. It must include JPG and PNG support.
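
Before digging into the PIL setup, it can help to check whether the file written to the storage path is an image at all, since "cannot identify image file" is raised both when a decoder is missing and when the file is not a recognizable image (for example an empty file or an HTML error page). The sketch below uses only the standard library and compares the first bytes against well-known format signatures; the names SIGNATURES, sniff_image_type and sniff_file are illustrative, not goose or PIL functions.

```python
# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
}


def sniff_image_type(data):
    """Return a format name if data starts with a known signature, else None."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def sniff_file(path):
    """Read the first bytes of the file at path and sniff its image type."""
    with open(path, "rb") as fh:
        return sniff_image_type(fh.read(16))


print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 12))
print(sniff_image_type(b"<html><body>Not found</body></html>"))
```

If sniff_file returns a format name for the stored file but PIL still refuses to open it, the PIL installation is the more likely culprit. If it returns None, the download or the write to disk deserves a look first.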

bitwjg commented on September 22, 2024

Sorry for the late reply.
I have reinstalled goose on my computer and the error still occurs.
I think this problem is caused by PIL.
Could you tell me how to set up PIL to support all the image formats?

Wang Jingang(王金刚)
Ph.D. Candidate at
Lab of High Volume Language Information Processing & Cloud Computing
School of Computer Science
Beijing Institute of Technology
Beijing 100081
P.R China

grangier commented on September 22, 2024

Sorry, I have no idea how to set up PIL on Windows. I'll close the issue, as the problem seems to be on your side.

ventouris commented on September 22, 2024

Hello,

I had the same problem with the IOError. I tried to include a storage path like this:

g = Goose({'local_storage_path': "C:\Users\mycomputer\Desktop\beta\test"})

But now I get another error. Here is the traceback:

Traceback (most recent call last):
File "C:/Users/mycomputer/Desktop/beta/test.py", line 17, in <module>
g = Goose({'local_storage_path': ""C:\Users\mycomputer\Desktop\beta\test"})
File "C:\Python27\lib\site-packages\goose_extractor-1.0.2-py2.7.egg\goose\__init__.py", line 36, in __init__
self.extend_config()
File "C:\Python27\lib\site-packages\goose_extractor-1.0.2-py2.7.egg\goose\__init__.py", line 44, in extend_config
setattr(config, k, v)
AttributeError: can't set attribute

Any suggestion?
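
A side note on the path literal above, independent of the AttributeError (which is raised by setattr inside extend_config and is not explained here): in an ordinary Python string, "\b" and "\t" are escape sequences for backspace and tab, so a Windows path containing \beta\test does not survive as written. A raw string keeps the backslashes. The following is a small sketch using a shortened path fragment for the comparison.

```python
# In a normal string literal, "\b" is a backspace and "\t" is a tab.
plain = "Desktop\beta\test"

# The r prefix makes a raw string: backslashes are kept literally.
raw = r"Desktop\beta\test"

print(repr(plain))
print(repr(raw))

# The full path from the comment above, written as a raw string.
storage_path = r"C:\Users\mycomputer\Desktop\beta\test"
print(storage_path)
```

Forward slashes, as in "C:/Users/mycomputer/Desktop/beta/test", are another way to avoid the escaping problem on Windows.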
