The script should run successfully.
$ pip install scrapy pandas readability-lxml
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Collecting scrapy
Using cached Scrapy-1.8.0-py2.py3-none-any.whl (238 kB)
Collecting pandas
Using cached pandas-0.24.2-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (16.7 MB)
Processing /Users/redacted/Library/Caches/pip/wheels/cf/80/76/f6eaec8f1622db6af7ceaeef9e4481e9dc766ccfc16b1cbd0b/readability_lxml-0.8.1-py2-none-any.whl
Collecting queuelib>=1.4.2
Using cached queuelib-1.6.1-py2.py3-none-any.whl (12 kB)
Collecting cryptography>=2.0
Using cached cryptography-3.3.2-cp27-cp27m-macosx_10_10_x86_64.whl (1.8 MB)
Collecting w3lib>=1.17.0
Using cached w3lib-1.22.0-py2.py3-none-any.whl (20 kB)
Collecting zope.interface>=4.1.3
Using cached zope.interface-5.4.0-cp27-cp27m-macosx_10_14_x86_64.whl (208 kB)
Collecting pyOpenSSL>=16.2.0
Using cached pyOpenSSL-20.0.1-py2.py3-none-any.whl (54 kB)
Processing /Users/redacted/Library/Caches/pip/wheels/50/41/57/228635c140878de06d942d072c9924afa56a86bb8fc2d319a4/Protego-0.1.16-py2-none-any.whl
Collecting six>=1.10.0
Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting parsel>=1.5.0
Using cached parsel-1.6.0-py2.py3-none-any.whl (13 kB)
Collecting service-identity>=16.0.0
Using cached service_identity-21.1.0-py2.py3-none-any.whl (12 kB)
Collecting Twisted>=16.0.0; python_version == "2.7"
Using cached Twisted-20.3.0-cp27-cp27m-macosx_10_6_intel.whl (3.2 MB)
Collecting lxml>=3.5.0
Using cached lxml-4.6.3-cp27-cp27m-macosx_10_9_x86_64.whl (4.5 MB)
Processing /Users/redacted/Library/Caches/pip/wheels/35/5f/0f/474144aca7e2624be7670cdd9c6eca4979713cee237d16b464/PyDispatcher-2.0.5-py2-none-any.whl
Collecting cssselect>=0.9.1
Using cached cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Collecting pytz>=2011k
Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Collecting numpy>=1.12.0
Using cached numpy-1.16.6-cp27-cp27m-macosx_10_9_x86_64.whl (13.9 MB)
Collecting python-dateutil>=2.5.0
Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting chardet
Using cached chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Collecting cffi>=1.12
Using cached cffi-1.14.5-cp27-cp27m-macosx_10_9_x86_64.whl (175 kB)
Collecting enum34; python_version < "3"
Using cached enum34-1.1.10-py2-none-any.whl (11 kB)
Collecting ipaddress; python_version < "3"
Using cached ipaddress-1.0.23-py2.py3-none-any.whl (18 kB)
Requirement already satisfied: setuptools in /Users/redacted/.ve/censorednews-headlines/lib/python2.7/site-packages (from zope.interface>=4.1.3->scrapy) (44.1.1)
Processing /Users/redacted/Library/Caches/pip/wheels/c2/ea/a3/25af52265fad6418a74df0b8d9ca8b89e0b3735dbd4d0d3794/functools32-3.2.3.post2-py2-none-any.whl
Collecting pyasn1
Using cached pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
Collecting pyasn1-modules
Using cached pyasn1_modules-0.2.8-py2.py3-none-any.whl (155 kB)
Collecting attrs>=19.1.0
Using cached attrs-21.2.0-py2.py3-none-any.whl (53 kB)
Collecting Automat>=0.3.0
Using cached Automat-20.2.0-py2.py3-none-any.whl (31 kB)
Collecting incremental>=16.10.1
Using cached incremental-21.3.0-py2.py3-none-any.whl (15 kB)
Collecting hyperlink>=17.1.1
Using cached hyperlink-21.0.0-py2.py3-none-any.whl (74 kB)
Processing /Users/redacted/Library/Caches/pip/wheels/f5/8c/e2/f0cea19d340270166bbfd4a2e9d8b8c132e26ef7e1376a0890/PyHamcrest-1.10.1-py2-none-any.whl
Collecting constantly>=15.1
Using cached constantly-15.1.0-py2.py3-none-any.whl (7.9 kB)
Collecting pycparser
Using cached pycparser-2.20-py2.py3-none-any.whl (112 kB)
Collecting idna>=2.5
Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting typing; python_version < "3.5"
Using cached typing-3.10.0.0-py2-none-any.whl (26 kB)
Installing collected packages: queuelib, pycparser, cffi, enum34, six, ipaddress, cryptography, w3lib, zope.interface, pyOpenSSL, protego, lxml, functools32, cssselect, parsel, pyasn1, pyasn1-modules, attrs, service-identity, Automat, incremental, idna, typing, hyperlink, PyHamcrest, constantly, Twisted, PyDispatcher, scrapy, pytz, numpy, python-dateutil, pandas, chardet, readability-lxml
Successfully installed Automat-20.2.0 PyDispatcher-2.0.5 PyHamcrest-1.10.1 Twisted-20.3.0 attrs-21.2.0 cffi-1.14.5 chardet-4.0.0 constantly-15.1.0 cryptography-3.3.2 cssselect-1.1.0 enum34-1.1.10 functools32-3.2.3.post2 hyperlink-21.0.0 idna-2.10 incremental-21.3.0 ipaddress-1.0.23 lxml-4.6.3 numpy-1.16.6 pandas-0.24.2 parsel-1.6.0 protego-0.1.16 pyOpenSSL-20.0.1 pyasn1-0.4.8 pyasn1-modules-0.2.8 pycparser-2.20 python-dateutil-2.8.1 pytz-2021.1 queuelib-1.6.1 readability-lxml-0.8.1 scrapy-1.8.0 service-identity-21.1.0 six-1.16.0 typing-3.10.0.0 w3lib-1.22.0 zope.interface-5.4.0
$ scrapy runspider scrapy_test.py
/Users/redacted/.ve/headlines/lib/python2.7/site-packages/OpenSSL/crypto.py:14: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in the next release.
from cryptography import utils, x509
2021-06-19 18:04:09 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2021-06-19 18:04:09 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 2.7.16 (default, Jan 27 2020, 04:46:15) - [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i 8 Dec 2020), cryptography 3.3.2, Platform Darwin-18.7.0-x86_64-i386-64bit
2021-06-19 18:04:09 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2021-06-19 18:04:09 [scrapy.extensions.telnet] INFO: Telnet Password: 861e55f8a3b0e3bb
2021-06-19 18:04:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
Unhandled error in Deferred:
2021-06-19 18:04:09 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 184, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 188, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- <exception caught here> ---
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 104, in crawl
six.reraise(*exc_info)
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 86, in crawl
self.engine = self._create_engine()
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 111, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/core/engine.py", line 67, in __init__
self.scheduler_cls = load_object(self.settings['SCHEDULER'])
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/utils/misc.py", line 46, in load_object
mod = import_module(module)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/core/scheduler.py", line 7, in <module>
from queuelib import PriorityQueue
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/queuelib/__init__.py", line 1, in <module>
from queuelib.queue import FifoDiskQueue, LifoDiskQueue
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/queuelib/queue.py", line 7, in <module>
from contextlib import suppress
exceptions.ImportError: cannot import name suppress
2021-06-19 18:04:09 [twisted] CRITICAL:
Traceback (most recent call last):
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 104, in crawl
six.reraise(*exc_info)
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 86, in crawl
self.engine = self._create_engine()
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/crawler.py", line 111, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/core/engine.py", line 67, in __init__
self.scheduler_cls = load_object(self.settings['SCHEDULER'])
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/utils/misc.py", line 46, in load_object
mod = import_module(module)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/scrapy/core/scheduler.py", line 7, in <module>
from queuelib import PriorityQueue
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/queuelib/__init__.py", line 1, in <module>
from queuelib.queue import FifoDiskQueue, LifoDiskQueue
File "/Users/redacted/.ve/headlines/lib/python2.7/site-packages/queuelib/queue.py", line 7, in <module>
from contextlib import suppress
ImportError: cannot import name suppress
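The traceback pinpoints the root cause: `queuelib` (1.6.x here) imports `contextlib.suppress`, which was only added in Python 3.4, so it cannot be imported under Python 2.7 at all. A minimal sketch (Python 3) of the import and idiom queuelib relies on:

```python
# contextlib.suppress exists only in Python 3.4+,
# which is why the import fails under Python 2.7.
from contextlib import suppress

# suppress() silently swallows the named exception,
# equivalent to a try/except-pass block.
with suppress(FileNotFoundError):
    open("does_not_exist_12345.txt")  # would raise, but is ignored

print("import and usage succeeded")
```

Pinning an older `queuelib` would paper over this one import, but the real fix is running Scrapy under Python 3.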
$ scrapy version --verbose
/Users/redacted/.ve/headlines/lib/python2.7/site-packages/OpenSSL/crypto.py:14: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in the next release.
from cryptography import utils, x509
Scrapy : 1.8.0
lxml : 4.6.3.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 20.3.0
Python : 2.7.16 (default, Jan 27 2020, 04:46:15) - [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)]
pyOpenSSL : 20.0.1 (OpenSSL 1.1.1i 8 Dec 2020)
cryptography : 3.3.2
Platform : Darwin-18.7.0-x86_64-i386-64bit
'''
headline_scraper.py
A simple scrapy spider to collect web page titles
'''
import scrapy
from pandas import read_csv
from readability.readability import Document

PATH_TO_DATA = 'https://gist.githubusercontent.com/jackbandy/208028b404d8c6a6f822397e306a5a34/raw/ef7f73357e77c29c63b5b7632d840a923327e179/100_urls_sample.csv'


class HeadlineSpider(scrapy.Spider):
    name = "headline_spider"
    start_urls = read_csv(PATH_TO_DATA).url.tolist()

    def parse(self, response):
        doc = Document(response.text)
        yield {
            'short_title': doc.short_title(),
            'full_title': doc.title(),
            'url': response.url,
        }
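On the two title fields the spider yields: `Document.title()` returns the raw `<title>` text, while `short_title()` applies heuristics to strip site-name suffixes around common delimiters. As a rough stdlib-only illustration of that idea (a hypothetical helper, not readability-lxml's actual implementation):

```python
def short_title(full_title):
    # Hypothetical approximation of a short-title heuristic:
    # split on common site-name delimiters and keep the longest
    # segment, which is usually the headline itself.
    for delimiter in (' | ', ' - ', ' :: ', ' / '):
        if delimiter in full_title:
            return max(full_title.split(delimiter), key=len)
    return full_title

print(short_title('Senate Passes Budget Bill - The Example Times'))
# → Senate Passes Budget Bill
```

This is only meant to show why the spider records both fields: the full title preserves provenance (the site name), while the short title is the cleaner string for comparing headlines across outlets.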