Comments (6)
Is it possible to change the html_parser configuration for unicode encoding through Goose?
from python-goose.
@PriyeshV I don't get your last comment. Goose only deals with unicode. It seems to be an error in the lxml HTML parser, because the soup parser works.
That said, Goose is not suitable for extracting this kind of web page. It's built to extract articles, and your page doesn't have any article.
Thanks grangier. I too figured out that the error was in the lxml HTML parser. I wanted to know whether there is any way to pass arguments to the lxml HTML parser when invoking Goose.
@PriyeshV No. Use raw_html extraction if you want to preprocess the content.
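For anyone unsure what raw_html preprocessing might look like, here is a minimal sketch. The thread does not show the traceback, so this assumes (and it is only an assumption) that lxml is rejecting a unicode string that starts with an XML encoding declaration. `preprocess_html` and `extract_from_raw` are hypothetical helper names, not part of Goose; the only Goose call used is `extract(raw_html=...)`.

```python
import re

# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
_XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>", re.IGNORECASE)


def preprocess_html(raw_bytes, encoding="utf-8"):
    """Decode the page to text and drop a leading XML declaration.

    Hypothetical helper: the encoding default is an assumption, so adjust
    it to whatever the server actually sends.
    """
    text = raw_bytes.decode(encoding, errors="replace")
    return _XML_DECL.sub("", text, count=1).lstrip()


def extract_from_raw(html_text):
    """Hand already-preprocessed HTML to Goose instead of a URL."""
    # Imported lazily so preprocess_html can be used and tested without Goose.
    from goose import Goose
    return Goose().extract(raw_html=html_text)
```

You would fetch the page yourself (for example with urllib2 or requests), pass the bytes through `preprocess_html`, and then call `extract_from_raw` on the result. If the exception has a different cause, the preprocessing step would need to change accordingly.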
Thanks again @grangier
Hello! Sorry to drag this up again, but I'm a little less confident than PriyeshV that I know how to handle raw_html preprocessing... I'm having the exact same exception, and I was wondering what I'm supposed to do about it?
(Love Goose, by the way. Thanks so much!)
Do I need to convert the HTML to an encoding that won't bother lxml? You say it works with the soup parser. Is there a way to use that instead? I guess I'm just a little unsure of how to proceed. Thanks so much!
--- Edit -----
Oh, the page I'm trying to extract article text from is here: http://arxiv.org/abs/1401.4454
I was hoping to end up with just the abstract extracted.
Thanks.
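On the soup parser question: a possible sketch, assuming the `parser_class` configuration key and its `'lxml'` / `'soup'` values behave as I remember them from the python-goose configuration module (worth checking against your installed version). `goose_config` and `extract_text` are hypothetical helpers written for illustration only.

```python
# Parser names assumed from python-goose's configuration; verify locally.
KNOWN_PARSERS = ("lxml", "soup")


def goose_config(parser_class="soup"):
    """Build a Goose config dict that selects the HTML parser."""
    if parser_class not in KNOWN_PARSERS:
        raise ValueError("unknown parser_class: %r" % (parser_class,))
    return {"parser_class": parser_class}


def extract_text(url, parser_class="soup"):
    """Extract cleaned text from a URL using the chosen parser."""
    # Imported lazily so goose_config works without Goose installed.
    from goose import Goose
    article = Goose(goose_config(parser_class)).extract(url=url)
    return article.cleaned_text
```

Keep in mind the earlier caveat in this thread: Goose is built for article pages, so even with a different parser it may not isolate an abstract cleanly.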