timbertson / python-readability
[abandoned] python port of arc90's readability bookmarklet
This code is under the Apache License 2.0.
http://www.apache.org/licenses/LICENSE-2.0

This is a python port of a ruby port of arc90's readability project
http://lab.arc90.com/experiments/readability/

Given a html document, it pulls out the main body text and cleans it up.

Ruby port by starrhorne and iterationlabs
Python port by gfxmonk

This port uses BeautifulSoup for the HTML parsing. That means it can be a little slow, but will work on Google App Engine (unlike libxml-based libraries)

**note**: I don't currently have any plans for using or improving this library, and it's far from perfect (slow, and almost certainly buggy). So if you do something cool with it or have a better tool that does the same job, please let me know and I can link to it from here.

If you're looking for alternatives / forks, here's the list so far:

- http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
- https://github.com/buriy/python-readability
readability gives the content; python-readability gives a hidden table.
No handlers could be found for logger "readability.readability"
Traceback (most recent call last):
File "readability_parse.py", line 73, in <module>
page_content = readability_about(html_path)
File "readability_parse.py", line 26, in readability_about
page_content = Document(html_str).summary()
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 195, in summary
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 147, in summary
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 105, in _html
File "build/bdist.linux-x86_64/egg/readability/readability.py", line 109, in _parse
File "build/bdist.linux-x86_64/egg/readability/htmls.py", line 21, in build_doc
File "/home/work/anaconda2/lib/python2.7/site-packages/lxml/html/__init__.py", line 614, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 3103, in lxml.etree.fromstring (src/lxml/lxml.etree.c:70569)
File "parser.pxi", line 1828, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:106403)
File "parser.pxi", line 1716, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:105194)
File "parser.pxi", line 1086, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:99876)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95786)
File "parser.pxi", line 631, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95065)
readability.readability.Unparseable: None
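The first line of the output above is Python 2's warning that the `readability.readability` logger has no handler, so whatever the library logged before raising `Unparseable` was lost. A minimal sketch, using only the standard `logging` module, of configuring a handler before calling the library so those messages become visible. The format string is an arbitrary choice, and this only surfaces diagnostics; it does not fix the parse failure itself.

```python
import logging

# Attach a stream handler to the root logger. Child loggers such as
# "readability.readability" propagate their records to it by default.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s:%(name)s:%(message)s",
)

# Same logger name as in the warning above.
log = logging.getLogger("readability.readability")
log.debug("logging is configured")
```

With this in place, running the failing script again should print the library's own log records ahead of the traceback.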
I guess I found a bug in the way the scoring is done.
For example, an article page from CNN:
DEBUG:root:Candidate p#cnnContentContainer.cnn_storyarea with score 163.5
DEBUG:root:Candidate p#.cnn_contentarea with score 138.0
DEBUG:root:Candidate p#cnnContainer. with score 118.5
DEBUG:root:Candidate body#. with score 113.5
DEBUG:root:Candidate p#.cnn_strycntntlft with score 111.0
All 5 of those candidates are nested inside each other (body#->p.*). As a result, the output contains too much text that is not needed.
An idea would be to remove child nodes from the parent before calculating the score.
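The idea can be illustrated with a toy model. This is a sketch only, not the library's scoring code: the `Node` class, the two scoring functions, and the use of plain text length as a stand-in for the real score are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML element."""
    name: str
    own_text: str = ""  # text directly inside this node, excluding children
    children: List["Node"] = field(default_factory=list)


def iter_nodes(root: Node) -> Iterator[Node]:
    """Yield root and every descendant, depth first."""
    yield root
    for child in root.children:
        yield from iter_nodes(child)


def inclusive_score(node: Node) -> int:
    """Count all text in the subtree, so ancestors accumulate child text."""
    return len(node.own_text) + sum(inclusive_score(c) for c in node.children)


def exclusive_score(node: Node) -> int:
    """Count only text owned by the node itself (children removed first)."""
    return len(node.own_text)


def best_candidate(root: Node, score: Callable[[Node], int]) -> Node:
    """Return the highest-scoring node under the given scoring function."""
    return max(iter_nodes(root), key=score)


# A nested layout loosely shaped like the log above: body > container > story.
story = Node("story", own_text="x" * 500)
sidebar = Node("sidebar", own_text="y" * 80)
container = Node("container", own_text="z" * 20, children=[story, sidebar])
footer = Node("footer", own_text="f" * 40)
body = Node("body", children=[container, footer])

print(best_candidate(body, inclusive_score).name)
print(best_candidate(body, exclusive_score).name)
```

With inclusive scoring the outermost wrapper wins because it inherits every descendant's text; with child text excluded, the node that actually holds the article text wins.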
If a site contains Polish (and probably any other non-ASCII) characters, the script removes them from the output. Input:
test 123 zażółć gęślą jaźń tęst
Output:
test 123 za gl ja tst
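The output above is exactly what results when every non-ASCII character is dropped from the input. The sketch below reproduces that symptom with the standard library and shows one possible workaround: decoding the raw bytes explicitly before handing text to a parser. Whether the library actually loses the characters this way is not established by the report; `to_text` is a hypothetical helper, and UTF-8 is an assumed page encoding.

```python
text = "test 123 zażółć gęślą jaźń tęst"

# Dropping everything outside ASCII reproduces the reported output.
stripped = text.encode("ascii", "ignore").decode("ascii")
print(stripped)


def to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode page bytes explicitly instead of
    relying on a default (often ASCII) codec somewhere downstream."""
    return raw.decode(encoding, errors="replace")


# Round trip: bytes as they might arrive from a UTF-8 encoded page.
raw = text.encode("utf-8")
print(to_text(raw))
```

If the real page declares a different encoding, that value would need to be passed in place of the assumed default.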
vagrant@vagrant-ubuntu-trusty-64:/vagrant/dragnet$ sudo make test
nosetests --exe --cover-package=dragnet --with-coverage --cover-branches -v --cover-erase
nose.plugins.cover: ERROR: Coverage not available: unable to import coverage module
Failure: ImportError (No module named readability) ... ERROR
Failure: ImportError (No module named readability) ... ERROR