timbertson / python-readability

[abandoned] python port of arc90's readability bookmarklet

python-readability's Introduction

This code is under the Apache License 2.0.  http://www.apache.org/licenses/LICENSE-2.0

This is a Python port of a Ruby port of arc90's Readability project:

http://lab.arc90.com/experiments/readability/

Given an HTML document, it pulls out the main body text and cleans it up.

Ruby port by starrhorne and iterationlabs
Python port by gfxmonk

This port uses BeautifulSoup for HTML parsing. That means it can be
a little slow, but it will work on Google App Engine (unlike libxml-based
libraries).


**note**: I don't currently have any plans for using or improving this
library, and it's far from perfect (slow, and almost certainly buggy).
So if you do something cool with it or have a better tool that does
the same job, please let me know and I can link to it from here.

If you're looking for alternatives / forks, here's the list so far:
 - http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
 - https://github.com/buriy/python-readability

python-readability's Issues

investigate

http://www.carrefour.com/cdc/group/current-news/colombia---opening-of-the-65-amp--66th-carrefour.html

readability gives the content; python-readability gives a hidden table.

lxml error

No handlers could be found for logger "readability.readability"
Traceback (most recent call last):
  File "readability_parse.py", line 73, in <module>
    page_content = readability_about(html_path)
  File "readability_parse.py", line 26, in readability_about
    page_content = Document(html_str).summary()
  File "build/bdist.linux-x86_64/egg/readability/readability.py", line 195, in summary
  File "build/bdist.linux-x86_64/egg/readability/readability.py", line 147, in summary
  File "build/bdist.linux-x86_64/egg/readability/readability.py", line 105, in _html
  File "build/bdist.linux-x86_64/egg/readability/readability.py", line 109, in _parse
  File "build/bdist.linux-x86_64/egg/readability/htmls.py", line 21, in build_doc
  File "/home/work/anaconda2/lib/python2.7/site-packages/lxml/html/__init__.py", line 614, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3103, in lxml.etree.fromstring (src/lxml/lxml.etree.c:70569)
  File "parser.pxi", line 1828, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:106403)
  File "parser.pxi", line 1716, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:105194)
  File "parser.pxi", line 1086, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:99876)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94350)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95786)
  File "parser.pxi", line 631, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95065)
readability.readability.Unparseable: None

bug in scoring

I think I found a bug in the way the scoring is done.
For example, for an article page from CNN:
DEBUG:root:Candidate p#cnnContentContainer.cnn_storyarea with score 163.5
DEBUG:root:Candidate p#.cnn_contentarea with score 138.0
DEBUG:root:Candidate p#cnnContainer. with score 118.5
DEBUG:root:Candidate body#. with score 113.5
DEBUG:root:Candidate p#.cnn_strycntntlft with score 111.0

All five of those candidates are nested inside one another (body#->p.*), so the result ends up showing too much text that is not needed.

An idea would be to remove child nodes from the parent before calculating the score.
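
The idea above could look roughly like the sketch below. It is not the library's scoring code: `Node`, `full_text_length` and `exclusive_score` are hypothetical names made up for illustration, and plain text length stands in for the real score. The point is only that each candidate stops getting credit for text that belongs to a nested candidate.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # compare nodes by identity, not by content
class Node:
    """Hypothetical stand-in for a parsed HTML element."""
    name: str
    own_text: str = ""
    children: List["Node"] = field(default_factory=list)


def full_text_length(node):
    """Naive score: all text under the node, nested candidates included."""
    return len(node.own_text) + sum(full_text_length(c) for c in node.children)


def exclusive_score(node, candidates):
    """Score that skips any subtree rooted at another candidate."""
    total = len(node.own_text)
    for child in node.children:
        if child in candidates:
            continue  # this text is credited to the nested candidate only
        total += exclusive_score(child, candidates)
    return total


# Three nested candidates, loosely mirroring the body -> container -> story case.
story = Node("div#story", own_text="x" * 100)
container = Node("div#container", own_text="ads ads", children=[story])
body = Node("body", own_text="nav", children=[container])
candidates = [body, container, story]

naive = {n.name: full_text_length(n) for n in candidates}
exclusive = {n.name: exclusive_score(n, candidates) for n in candidates}
best = max(candidates, key=lambda n: exclusive_score(n, candidates))

print(naive)      # outer nodes inherit all of the story text
print(exclusive)  # each node is scored on its own text only
print(best.name)
```

With the naive score every ancestor of the story node scores at least as high as the story itself; with the exclusive score the innermost content node wins.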

Polish characters

If a site contains Polish (and probably any other non-ASCII) characters, the script removes them from the output. Input:

test 123 zażółć gęślą jaźń tęst

Output:

test 123 za gl ja tst
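
The output above is exactly what you get when non-ASCII characters are dropped during an ASCII conversion. The sketch below reproduces the symptom with Python 3's standard `str.encode`; whether this is the actual cause inside the library is an assumption, not something the report confirms.

```python
text = "test 123 zażółć gęślą jaźń tęst"

# Lossy path: anything outside ASCII is silently discarded.
lossy = text.encode("ascii", "ignore").decode("ascii")
print(lossy)  # test 123 za gl ja tst

# Lossless path: round-trip through UTF-8 keeps every character.
preserved = text.encode("utf-8").decode("utf-8")
print(preserved == text)  # True
```

If something like this is the cause, the fix would be to decode the input with its real encoding and keep it as Unicode text throughout, rather than coercing it to ASCII.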

issue when compiling dragnet

vagrant@vagrant-ubuntu-trusty-64:/vagrant/dragnet$ sudo make test
nosetests --exe --cover-package=dragnet --with-coverage --cover-branches -v --cover-erase
nose.plugins.cover: ERROR: Coverage not available: unable to import coverage module
Failure: ImportError (No module named readability) ... ERROR
Failure: ImportError (No module named readability) ... ERROR
