Comments (11)
Hello,
Please provide a full traceback of the error.
from python-goose.
The full traceback is as follows:
Traceback (most recent call last):
  File "C:\Users\v-jingaw\workspace\ArticleExtractor\src\msra\km\jingang\demo.py", line 11, in <module>
    article = g.extractContent(url)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\Goose.py", line 52, in extractContent
    return self.sendToActor(cc)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\Goose.py", line 59, in sendToActor
    article = crawler.crawl(crawlCandiate)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\Crawler.py", line 93, in crawl
    article.topImage = imageExtractor.getBestImage(article.rawDoc, article.topNode)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 86, in getBestImage
    image = self.checkForLargeImages(topNode, 0, 0)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 134, in checkForLargeImages
    depthObj.parentDepth, depthObj.siblingDepth)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 134, in checkForLargeImages
    depthObj.parentDepth, depthObj.siblingDepth)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 117, in checkForLargeImages
    goodImages = self.getImageCandidates(node)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 261, in getImageCandidates
    goodImages = self.findImagesThatPassByteSizeTest(filteredImages)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 277, in findImagesThatPassByteSizeTest
    locallyStoredImage = self.getLocallyStoredImage(imgSrc)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\UpgradedImageExtractor.py", line 340, in getLocallyStoredImage
    self.linkhash, imageSrc, self.config)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\ImageUtils.py", line 59, in storeImageToLocalFile
    image = self.writeEntityContentsToDisk(data, linkhash, imageSrc, config)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\ImageUtils.py", line 101, in writeEntityContentsToDisk
    return self.readExistingFileInfo(linkhash, imageSrc, config)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\ImageUtils.py", line 81, in readExistingFileInfo
    imageDetails = self.getImageDimensions(identify, localImageName)
  File "C:\Python27\lib\site-packages\goose-0.0.1-py2.7.egg\goose\images\ImageUtils.py", line 36, in getImageDimensions
    image = Image.open(filePath)
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 1980, in open
    raise IOError("cannot identify image file")
IOError: cannot identify image file
from python-goose.
Hello,
The latest commit, e365e70, should resolve the issue. Could you pull the new head and test it again? Warning: the API has changed due to the merge of the camelcaseless branch. Refer to the README to see the changes.
Thanks
from python-goose.
Hi, I have tried the new version, but unfortunately it still doesn't work. The traceback is below. Have I missed something?
Traceback (most recent call last):
  File "C:\Users\v-jingaw\workspace\ArticleExtractor\src\msra\km\jingang\demo.py", line 11, in <module>
    article = g.extract(url=url)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\__init__.py", line 53, in extract
    return self.crawl(cc)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\__init__.py", line 60, in crawl
    article = crawler.crawl(crawl_candiate)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\crawler.py", line 93, in crawl
    article.top_image = image_extractor.get_best_image(article.raw_doc, article.top_node)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\extractors.py", line 86, in get_best_image
    image = self.check_large_images(topNode, 0, 0)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\extractors.py", line 118, in check_large_images
    good_images = self.get_image_candidates(node)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\extractors.py", line 262, in get_image_candidates
    good_images = self.get_images_bytesize_match(filtered_images)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\extractors.py", line 278, in get_images_bytesize_match
    local_image = self.get_local_image(src)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\extractors.py", line 341, in get_local_image
    self.link_hash, src, self.config)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\utils.py", line 59, in store_image
    image = self.write_localfile(data, link_hash, src, config)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\utils.py", line 101, in write_localfile
    return self.read_localfile(link_hash, src, config)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\utils.py", line 81, in read_localfile
    image_details = self.get_image_dimensions(identify, local_image_name)
  File "C:\Python27\lib\site-packages\goose-1.0.0-py2.7.egg\goose\images\utils.py", line 36, in get_image_dimensions
    image = Image.open(path)
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 1980, in open
    raise IOError("cannot identify image file")
IOError: cannot identify image file
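Since the traceback ends in PIL failing to identify a file that goose stored locally, it may help to check, without PIL, whether that file looks like an image at all. The following is only a rough sketch: `sniff_image_type` is a hypothetical helper, not part of goose or PIL, and it recognises just the well-known leading bytes of JPEG, PNG and GIF files.

```python
def sniff_image_type(data):
    """Guess an image type from the first bytes of a file.

    Returns 'jpeg', 'png', 'gif', or None when no known signature matches.
    Hypothetical helper for diagnosis only; not part of goose or PIL.
    """
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    return None


def sniff_image_file(path):
    """Read the first few bytes of the file at `path` and sniff its type."""
    with open(path, "rb") as handle:
        return sniff_image_type(handle.read(16))
```

If such a check returns None for the stored file, the download itself is suspect (for example an HTML error page or an empty file). If it returns a type, the PIL installation is the more likely culprit.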
from python-goose.
Did you pass a local_storage_path in the Configuration object? For instance:
g = Goose({'local_storage_path': 'C:'})
...
If this doesn't work, please provide the value of g.config.local_storage_path and the URL you are trying to extract.
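Before passing a directory as local_storage_path, it can be worth confirming that the directory exists and that a file can be created in it. The sketch below uses only the standard library; `is_usable_storage_dir` is a hypothetical name, and the check is an assumption about what a "valid path" needs to mean here, not something goose itself performs.

```python
import os
import tempfile


def is_usable_storage_dir(path):
    """Return True if `path` is an existing directory we can write a file into.

    Hypothetical helper; goose does not provide this check.
    """
    if not os.path.isdir(path):
        return False
    try:
        # Creating a throwaway file is a direct test of write access;
        # the file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True
```

For example, `is_usable_storage_dir(g.config.local_storage_path)` could be printed alongside the value requested above.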
from python-goose.
Hi, I tried a list of URLs.
The demo fails with the URL http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html
However, when I tried a URL whose page has no images (such as the Wikipedia page http://en.wikipedia.org/wiki/Aharon_Barak), the demo worked and extracted the content successfully.
from python-goose.
Did you change the local_storage_path?
from python-goose.
Works well here:
>>> import goose
>>> url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
>>> g = goose.Goose({'local_storage_path': '/home/'})
>>> article = g.extract(url=url)
>>> article.title
u'Las listas de espera se agravan'
>>> article.cleaned_text[:150]
u'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay m\xe1s ciudad'
>>> article.top_image.src
'http://ep01.epimg.net/sociedad/imagenes/2012/10/27/actualidad/1351332873_157836_1351354920_noticia_normal.jpg'
Please make sure that you pass a valid path as local_storage_path, for example goose.Goose({'local_storage_path': '/home/'}).
Otherwise, please check your PIL setup. It must be built with JPEG and PNG support.
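One way to test whether a PIL installation can handle JPEG and PNG is to write a tiny image to memory in each format and read it back. This is a minimal sketch, assuming PIL (or its fork Pillow) is importable as `PIL`; `check_pil_formats` is a hypothetical helper name, and a False result only says the round trip failed, not why.

```python
import io


def check_pil_formats(formats=("JPEG", "PNG")):
    """Return a dict mapping each format name to True/False.

    True means a small image could be encoded and decoded in that format.
    If PIL cannot be imported at all, every format is reported as False.
    """
    try:
        from PIL import Image
    except ImportError:
        return dict((fmt, False) for fmt in formats)

    results = {}
    for fmt in formats:
        try:
            buf = io.BytesIO()
            # Encode a 4x4 black image, then decode it again.
            Image.new("RGB", (4, 4)).save(buf, format=fmt)
            buf.seek(0)
            Image.open(buf).load()
            results[fmt] = True
        except Exception:
            # A missing encoder or decoder surfaces as an exception here.
            results[fmt] = False
    return results


if __name__ == "__main__":
    print(check_pil_formats())
```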
from python-goose.
Sorry for the late reply.
I have reinstalled goose on my computer and the error still occurs.
I think this problem is caused by PIL.
Could you tell me how to set up PIL to support all the image formats?
On Wed, Apr 3, 2013 at 2:40 PM, Xavier Grangier wrote:
> Works well here: [...]
> Please be sure that you pass a valid path to the local_storage_path: goose.Goose({'local_storage_path': '/home/'})
Wang Jingang(王金刚)
Ph.D. Candidate at
Lab of High Volume Language Information Processing & Cloud Computing
School of Computer Science
Beijing Institute of Technology
Beijing 100081
P.R China
from python-goose.
Sorry, I have no idea how to set up PIL on Windows. I'll close the issue, as the problem seems to be on your side.
from python-goose.
Hello,
I had the same problem with the IOError. I tried to set a storage path like this:
g = Goose({'local_storage_path': "C:\Users\mycomputer\Desktop\beta\test"})
But now I get a different error. Here is the traceback:
Traceback (most recent call last):
  File "C:/Users/mycomputer/Desktop/beta/test.py", line 17, in <module>
    g = Goose({'local_storage_path': "C:\Users\mycomputer\Desktop\beta\test"})
  File "C:\Python27\lib\site-packages\goose_extractor-1.0.2-py2.7.egg\goose\__init__.py", line 36, in __init__
    self.extend_config()
  File "C:\Python27\lib\site-packages\goose_extractor-1.0.2-py2.7.egg\goose\__init__.py", line 44, in extend_config
    setattr(config, k, v)
AttributeError: can't set attribute
Any suggestions?
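Separately from the AttributeError, which the traceback alone does not explain, the path literal used above has a pitfall of its own: in an ordinary Python string, `\b` and `\t` are escape sequences (backspace and tab), so a Windows path containing `\beta\test` does not hold the characters it appears to. The short sketch below illustrates this with a shortened, made-up path fragment and shows a raw string as one way around it.

```python
# In a normal string literal, \b is a backspace and \t is a tab.
plain = "Desktop\beta\test"

# A raw string keeps every backslash as a literal character.
raw = r"Desktop\beta\test"

print(repr(plain))  # shows the control characters that replaced "\b" and "\t"
print(repr(raw))    # shows the two backslashes intact
```

Forward slashes, or doubling each backslash, are other common ways to write Windows paths safely in Python source.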
from python-goose.