
scrapy-redis's Introduction

Scrapy-Redis


Redis-based components for Scrapy.

Features

  • Distributed crawling/scraping

    You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.

  • Distributed post-processing

    Scraped items get pushed into a Redis queue, meaning that you can start as many post-processing processes as needed, all sharing the same items queue.

  • Scrapy plug-and-play components

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders.

  • In this forked version: added support for JSON-encoded data in Redis

    The data contains url, meta and other optional parameters, where meta is a nested JSON object carrying sub-data. This feature extracts the data and sends a FormRequest with the url, the meta and the additional formdata.

    For example:

    { "url": "https://exaple.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }

    This data can then be accessed in the Scrapy spider through the response, e.g. request.url, request.meta, request.cookies, as shown in the sketch below.
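
    A minimal sketch of how this could be used, assuming the spider reads from a Redis key named myspider:start_urls (the key name and field names are illustrative):

    import json

    import redis

    # Push a JSON-encoded job onto the spider's Redis key.
    server = redis.Redis(host='localhost', port=6379)
    job = {
        "url": "https://example.com",
        "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"},
        "url_cookie_key": "fertxsas",
    }
    server.lpush('myspider:start_urls', json.dumps(job))

    # In the spider callback, the decoded values then travel with the request,
    # e.g. response.url, response.meta.get('job-id') and response.request.cookies.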

Note

These features cover the basic case of distributing the workload across multiple workers. If you need more features, such as URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera project.

Requirements

  • Python 3.7+
  • Redis >= 5.0
  • Scrapy >= 2.0
  • redis-py >= 4.0

Installation

From pip

pip install scrapy-redis

From GitHub

git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
python setup.py install

Note

To use the JSON-supported data feature, please make sure you have not installed scrapy-redis through pip. If you already have, uninstall it first:

pip uninstall scrapy-redis

Alternative Choice

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

scrapy-redis's People

Contributors

bitdeli-chef, carlosp420, djm, featureoverload, germey, gnemoug, kleschenko, laggardkernel, lifefloating, llonchj, lmorillas, luckypigeon, mic1on, moon-clj, nieweiming, nopper, owenstranathan, pactortester, parisholley, qshine, rmax, rolando, sahupankaj10, schmich, songhao8080, sumit-158, tkrugg, tsonglew, younghz, yswhynot


scrapy-redis's Issues

Upgrade scrapy-redis

Hi darkrho,
As we all know, the newest version of Scrapy is 0.24.4. However, some parts of scrapy-redis are based on an older version.
Several Scrapy attributes used in scrapy-redis have changed in the new version, such as:

  1. Class scrapy.contrib.loader.XPathItemLoader was renamed to scrapy.contrib.loader.ItemLoader.
  2. Class scrapy.spider.BaseSpider was renamed to scrapy.spider.Spider.
  3. The data structure of ITEM_PIPELINES was changed from set/list to dict.
  4. Class HtmlXPathSelector was changed to Selector.

When we use scrapy-redis with the new version of Scrapy, some warnings appear.
So something could be done to upgrade scrapy-redis. If that's OK, I'd like to submit a pull request.

TypeError: __init__() got an unexpected keyword argument 'server'

What does this mean?

C:\Users\dell\AppData\Local\Programs\Python\Python35-32\python.exe D:/scrapyspider/tutorial/main.py
2016-07-17 01:04:49 [scrapy] INFO: Scrapy 1.2.0dev2 started (bot: tutorial)
2016-07-17 01:04:49 [scrapy] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2774.3 Safari/537.36', 'SPIDER_MODULES': ['tutorial.spiders'], 'NEWSPIDER_MODULE': 'tutorial.spiders', 'DOWNLOAD_DELAY': 1, 'BOT_NAME': 'tutorial', 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler'}
2016-07-17 01:04:49 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
2016-07-17 01:04:49 [dmoz] INFO: Reading start URLs from redis key 'dmoz:start_urls' (batch size: 16)
2016-07-17 01:04:49 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-17 01:04:49 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-17 01:04:50 [scrapy] INFO: Enabled item pipelines:
['tutorial.pipelines.DmozPipeline']
2016-07-17 01:04:50 [scrapy] INFO: Spider opened
2016-07-17 01:04:50 [scrapy] INFO: Closing spider (shutdown)
Unhandled error in Deferred:
2016-07-17 01:04:50 [twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\commands\crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\crawler.py", line 163, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\crawler.py", line 167, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\twisted\internet\defer.py", line 1273, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\twisted\internet\defer.py", line 1125, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\twisted\python\failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\crawler.py", line 87, in crawl
    yield self.engine.close()
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 100, in close
    return self._close_all_spiders()
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 340, in _close_all_spiders
    dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 340, in <listcomp>
    dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 298, in close_spider
    dfd = slot.close()
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 44, in close
    self._maybe_fire_closing()
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 51, in _maybe_fire_closing
    self.heartbeat.stop()
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\twisted\internet\task.py", line 202, in stop
    assert self.running, ("Tried to stop a LoopingCall that was "
builtins.AssertionError: Tried to stop a LoopingCall that was not running.
2016-07-17 01:04:50 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy_redis\scheduler.py", line 120, in open
    debug=spider.settings.getbool('DUPEFILTER_DEBUG'),
TypeError: __init__() got an unexpected keyword argument 'server'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\crawler.py", line 74, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
ValueError: ("Failed to instantiate dupefilter class '%s': %s", 'scrapy.dupefilters.RFPDupeFilter', TypeError("__init__() got an unexpected keyword argument 'server'",))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\twisted\internet\defer.py", line 1125, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\twisted\python\failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\crawler.py", line 87, in crawl
    yield self.engine.close()
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 100, in close
    return self._close_all_spiders()
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 340, in _close_all_spiders
    dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 340, in <listcomp>
    dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 298, in close_spider
    dfd = slot.close()
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 44, in close
    self._maybe_fire_closing()
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 51, in _maybe_fire_closing
    self.heartbeat.stop()
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\twisted\internet\task.py", line 202, in stop
    assert self.running, ("Tried to stop a LoopingCall that was "
AssertionError: Tried to stop a LoopingCall that was not running.

Process finished with exit code 0

My settings.py:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_ORDER = 'BFO'
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

Crawler stops or goes idle after a while

Hello,

I found that the Redis crawler goes idle after a while, but there are still a lot of items in Redis.

strace shows this:

strace -p 12879

gettimeofday({1421031574, 300673}, NULL) = 0
gettimeofday({1421031574, 301008}, NULL) = 0
epoll_wait(6, {}, 2, 31826) = 0
gettimeofday({1421031606, 142810}, NULL) = 0
gettimeofday({1421031606, 142931}, NULL) = 0
gettimeofday({1421031606, 143139}, NULL) = 0
gettimeofday({1421031606, 143261}, NULL) = 0
gettimeofday({1421031606, 143417}, NULL) = 0
epoll_wait(6, {}, 2, 356) = 0
gettimeofday({1421031606, 500132}, NULL) = 0
gettimeofday({1421031606, 500389}, NULL) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1159, ...}) = 0
write(9, "2015-01-12 11:00:06+0800 [Level6"..., 107) = 107
gettimeofday({1421031606, 501373}, NULL) = 0
gettimeofday({1421031606, 501497}, NULL) = 0
gettimeofday({1421031606, 501725}, NULL) = 0

It is always in epoll_wait.

The log shows:

INFO: Crawled 369 pages (at 0 pages/min), scraped 369 items (at 0 items/min)
INFO: Crawled 369 pages (at 0 pages/min), scraped 369 items (at 0 items/min)
INFO: Crawled 369 pages (at 0 pages/min), scraped 369 items (at 0 items/min)
INFO: Crawled 369 pages (at 0 pages/min), scraped 369 items (at 0 items/min)

Server info:
server A: redis
server B: crawler 1
server C: crawler 2

The spider is very simple, with code similar to the example project.

Spider stops working after running for a while

Hello,
when I run a redis spider for a while, it stops working.

The log is:
2015-04-05 02:05:51+0800 [price] DEBUG: Scraped from <200 http://product.suning.com/104266764.html>
{'commentCount': 11,
'crawl_url': 'http://product.suning.com/104266764.html',
'isSuccess': True,
'itemId': 3771807,
'mallItemid': u'104266764',
'pingfen': 5.0,
'price': 0,
'shopId': '-1',
'status': 7}
2015-04-05 02:05:52+0800 [price] DEBUG: Redirecting (301) to <GET http://www.amazon.cn/%E6%89%8B%E6%9C%BA-%E9%80%9A%E8%AE%AF/dp/B00FIEWFZ6> from <GET http://www.amazon.cn/gp/product/B00FIEWFZ6>
http://www.amazon.cn/gp/product/B00FEZW85Q
2015-04-05 02:05:54+0800 [price] DEBUG: Redirecting (301) to <GET http://www.amazon.cn/%E6%89%8B%E6%9C%BA-%E9%80%9A%E8%AE%AF/dp/B00FF1JL7W> from <GET http://www.amazon.cn/gp/product/B00FEZW85Q>

2015-04-05 02:06:30+0800 [price] INFO: Crawled 102 pages (at 26 pages/min), scraped 101 items (at 28 items/min)

2015-04-05 02:07:30+0800 [price] INFO: Crawled 102 pages (at 0 pages/min), scraped 101 items (at 0 items/min)

2015-04-05 02:08:30+0800 [price] INFO: Crawled 102 pages (at 0 pages/min), scraped 101 items (at 0 items/min)
2015-04-05 02:09:30+0800 [price] INFO: Crawled 102 pages (at 0 pages/min), scraped 101 items (at 0 items/min)

2015-04-05 02:10:30+0800 [price] INFO: Crawled 102 pages (at 0 pages/min), scraped 101 items (at 0 items/min)

Python3: TypeError: a bytes-like object is required, not 'str'

In spiders.py, line 96:

if '://' in data:

In Python 3, data is of type bytes, but '://' is a str, so this raises an error.

My solution is below. Note that data can only be decode()d when it is not None:

def next_requests(self):
        """Returns a request to be scheduled or none."""
        use_set = self.settings.getbool('REDIS_START_URLS_AS_SET')
        fetch_one = self.server.spop if use_set else self.server.lpop
        # XXX: Do we need to use a timeout here?
        found = 0
        while found < self.redis_batch_size:
            data = fetch_one(self.redis_key)
            if not data:
                # Queue empty.
                break
            data = data.decode() # this line work fine in both Py2 and Py3
            req = self.make_request_from_data(data)
            if req:
                yield req
                found += 1
            else:
                self.logger.debug("Request not made from data: %r", data)

        if found:
            self.logger.debug("Read %s requests from '%s'", found, self.redis_key)

scheduler does not honor DUPEFILTER_CLASS from settings

Here's what we have in the original Scrapy scheduler:

    dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
    dupefilter = dupefilter_cls.from_settings(settings)

In the Redis version, the class name is hardcoded:

self.df = RFPDupeFilter(self.server, self.dupefilter_key % {'spider': spider.name})
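
A minimal sketch of the settings-driven behaviour being asked for; build_dupefilter is a hypothetical helper, and a complete fix inside scrapy-redis would also need to pass the Redis server and key to its dupefilter:

from scrapy.utils.misc import load_object

def build_dupefilter(settings):
    # Instantiate whatever DUPEFILTER_CLASS the project configured,
    # instead of hardcoding RFPDupeFilter.
    dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
    return dupefilter_cls.from_settings(settings)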

How to use Scrapy to poll the Redis queue

I have a requirement: users submit URLs into a Redis queue, and Scrapy gets the URLs from Redis and fetches them. But when the Redis queue becomes empty, Scrapy stops working. I would like Scrapy to block instead of shutting down. I tried to rewrite the next_request() and has_pending_requests() functions in the Scheduler, but it failed.
Could you help me? Thank you.
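
One way to get this behaviour is to use the RedisSpider base class rather than a plain Spider: it keeps the process alive when the queue is empty and re-polls the Redis key on the spider_idle signal. A minimal sketch, with an illustrative key name:

from scrapy_redis.spiders import RedisSpider


class PollingSpider(RedisSpider):
    """Stays alive when the queue is empty and resumes as soon as new URLs are pushed."""
    name = 'myspider'
    redis_key = 'myspider:start_urls'  # list that users push URLs into

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}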

Distributed crawl?

Hi, I want to run this project on three machines and share a single items queue. I don't know how to share the same Redis queue.
Can you give me some suggestions?
Thank you!

Queue Implementation forces Breadth-First search

I have a very large crawl project, and breadth-first meant I had to wait a very long time to get my first item (they are 2 or 3 layers down from the start url).

A quick change of Queue.py line 33 from:

pipe.zrange(self.key, 0, 0).zremrangebyrank(self.key, 0, 0)

to:

pipe.zrange(self.key, -1, -1).zremrangebyrank(self.key, -1, -1)

Gives the queue depth-first-like behavior.

Perhaps the addition of a setting like REDIS_QUEUE_PRIORITIZE_DEPTH (defaulting to False) that switches between the two behaviors would be helpful for others.
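
A hedged sketch of what such a switch could look like, subclassing the scrapy-redis priority queue; REDIS_QUEUE_PRIORITIZE_DEPTH is the hypothetical setting proposed above, not an existing option, and the base-class signature may differ between versions:

from scrapy_redis.queue import PriorityQueue


class SwitchablePriorityQueue(PriorityQueue):
    """Pop from the head (breadth-first-like) or the tail (depth-first-like) of the sorted set."""

    def __init__(self, server, spider, key, serializer=None):
        super().__init__(server, spider, key, serializer)
        self.prioritize_depth = spider.settings.getbool('REDIS_QUEUE_PRIORITIZE_DEPTH', False)

    def pop(self, timeout=0):
        # Index -1 takes the entry at the other end of the sorted set,
        # which mimics depth-first behaviour as described above.
        idx = -1 if self.prioritize_depth else 0
        pipe = self.server.pipeline()
        pipe.multi()
        pipe.zrange(self.key, idx, idx).zremrangebyrank(self.key, idx, idx)
        results, count = pipe.execute()
        if results:
            return self._decode_request(results[0])

It would then be selected by pointing SCHEDULER_QUEUE_CLASS at the subclass.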

Unable to assign values to request.meta

Based on the example RedisCrawlSpider I have created an extended version, which should capture the start_url in a meta object. It looks like this doesn't work. Here is the code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.http import Request
from scrapy_redis.spiders import RedisCrawlSpider

deny_domains = ['facebook.com', 'google.com', 'linkedin.com', 'twitter.com', 'youtube.com']

class MyCrawler(RedisCrawlSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'

    rules = (
        # follow all links
        Rule(LinkExtractor(deny_domains=deny_domains), callback='parse_page', follow=True),
    )

    def make_requests_from_url(self, url):
        """A method that receives a URL and returns a Request object (or a list of Request objects) to scrape.
        This method is used to construct the initial requests in the start_requests() method,
        and is typically used to convert urls to requests.
        """
        return Request(url, dont_filter=True, meta={'start_url': url})

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        super(MyCrawler, self).__init__(*args, **kwargs)

    def parse_page(self, response):
        print response.meta.get('start_url')
        return {
            'url': response.url,
        }

I have checked with ipdb that the make_requests_from_url method is triggered, but when I set an ipdb trace in parse_page, I can't access the response.meta.get('start_url') key. Any help would be appreciated!

How to handle download timeout?

In some cases the download will time out, and I need to remove the request from the duplicate filter so that it can be requested again next time.

Can anyone give me a solution?
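
One possible approach (a sketch, not an official API) is to attach an errback and remove the request's fingerprint from the Redis dupefilter set, so that a later retry is not filtered out. It assumes a recent scrapy-redis where connection.get_redis_from_settings is available and that the default scheduler dupefilter key pattern <spider>:dupefilter has not been customized:

from scrapy import Request, Spider
from scrapy.utils.request import request_fingerprint
from scrapy_redis.connection import get_redis_from_settings
from twisted.internet.error import TCPTimedOutError, TimeoutError


class TimeoutAwareSpider(Spider):
    """Sketch: drop a timed-out request's fingerprint so it can be crawled again later."""
    name = 'timeout_aware'

    def start_requests(self):
        # Plain Redis client built from the project settings (same helper scrapy-redis uses).
        self.server = get_redis_from_settings(self.settings)
        yield Request('https://example.com', errback=self.on_error)

    def parse(self, response):
        yield {'url': response.url}

    def on_error(self, failure):
        if failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            fp = request_fingerprint(request)
            # Assumption: the scheduler uses the default '<spider>:dupefilter' key.
            self.server.srem('%s:dupefilter' % self.name, fp)
            self.logger.info('Removed fingerprint %s after a download timeout', fp)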

PY3 compatibility issue in scrapy-redis/src/scrapy_redis next_requests()

It works fine in Python 2, but it breaks in Python 3.

2016-08-22 09:00:05 [scrapy] ERROR: Error caught on signal handler: <bound method RedisMixin.spider_idle of <RighticSpider 'rightic' at 0x7efe1de59390>>
Traceback (most recent call last):
  File "/home/eric/.virtualenvs/scrapy3/lib/python3.5/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/home/eric/.virtualenvs/scrapy3/lib/python3.5/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/eric/.virtualenvs/scrapy3/lib/python3.5/site-packages/scrapy_redis/spiders.py", line 109, in spider_idle
    self.schedule_next_requests()
  File "/home/eric/.virtualenvs/scrapy3/lib/python3.5/site-packages/scrapy_redis/spiders.py", line 103, in schedule_next_requests
    for req in self.next_requests():
  File "/home/eric/.virtualenvs/scrapy3/lib/python3.5/site-packages/scrapy_redis/spiders.py", line 84, in next_requests
    req = self.make_request_from_data(data)
  File "/home/eric/.virtualenvs/scrapy3/lib/python3.5/site-packages/scrapy_redis/spiders.py", line 96, in make_request_from_data
    if '://' in data:
TypeError: a bytes-like object is required, not 'str'
2016-08-22 09:00:05 [scrapy] INFO: Closing spider (finished)
2016-08-22 09:00:05 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 8, 22, 1, 0, 5, 210158),
 'log_count/ERROR': 2,
 'log_count/INFO': 8,
 'start_time': datetime.datetime(2016, 8, 22, 1, 0, 4, 378904)}
2016-08-22 09:00:05 [scrapy] INFO: Spider closed (finished)

Consider adding decode() like this:

def next_requests(self):
        """Returns a request to be scheduled or none."""
        use_set = self.settings.getbool('REDIS_START_URLS_AS_SET')
        fetch_one = self.server.spop if use_set else self.server.lpop
        # XXX: Do we need to use a timeout here?
        found = 0
        while found < self.redis_batch_size:
            data = fetch_one(self.redis_key).decode()
            if not data:
                # Queue empty.
                break
            req = self.make_request_from_data(data)
            if req:
                yield req
                found += 1
            else:
                self.logger.debug("Request not made from data: %r", data)

        if found:
            self.logger.debug("Read %s requests from '%s'", found, self.redis_key)

NotImplementedError raised while crawling

2016-07-18 23:25:34 [scrapy] ERROR: Spider error processing <GET https://play.google.com/store/apps> (referer: None)
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0-py2.7.egg/scrapy/spiders/__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError

Does scrapy-redis bring the concurrency of scrapy down?

I notice that the connection to the Redis server in scrapy_redis.scheduler.Scheduler is in blocking mode. If it is slowed down for whatever reason, isn't the Twisted event loop handling the other ongoing requests blocked at the same time?

How to close the spider automatically after the last item was crawled?

After the last item was crawled, it always stays in this state:
[scrapy] INFO: Crawled 40 pages (at 0 pages/min), scraped 71 items (at 0 items/min)
[scrapy] INFO: Crawled 40 pages (at 0 pages/min), scraped 71 items (at 0 items/min)
But this is not good for me. I want it to close the spider after the last item['link'] is crawled. What should I do?
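
There is no built-in "everything is done" signal, since an empty queue may simply mean no work has been pushed yet. One common workaround is an idle watchdog extension that closes the spider once it has been idle for too long; a hedged sketch, where MAX_IDLE_TIME is a made-up setting name rather than a scrapy-redis option:

import time

from scrapy import signals
from scrapy.exceptions import NotConfigured


class IdleShutdown:
    """Close the spider after it has been idle for more than MAX_IDLE_TIME seconds."""

    def __init__(self, crawler, max_idle_time):
        self.crawler = crawler
        self.max_idle_time = max_idle_time
        self.idle_since = None
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)
        crawler.signals.connect(self.reset, signal=signals.response_received)

    @classmethod
    def from_crawler(cls, crawler):
        max_idle_time = crawler.settings.getint('MAX_IDLE_TIME', 0)  # hypothetical setting
        if not max_idle_time:
            raise NotConfigured
        return cls(crawler, max_idle_time)

    def reset(self, *args, **kwargs):
        # Any received response means the crawl is still making progress.
        self.idle_since = None

    def spider_idle(self, spider):
        now = time.time()
        if self.idle_since is None:
            self.idle_since = now
        elif now - self.idle_since > self.max_idle_time:
            self.crawler.engine.close_spider(spider, reason='idle_timeout')

It would be enabled through the EXTENSIONS setting together with MAX_IDLE_TIME in settings.py.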

Command Shell Error, Read URLs from Redis

Hello,

When executing a Scrapy shell on a URL, it reads from the Redis list, as you can see in the output:
https://paste.ee/r/PPkg2

Note this line:
2016-04-26 02:14:01 [epocacosmeticos.com.br] -> DEBUG: Reading URLs from redis list 'epocacosmeticos.com.br:start_urls' prior to error.

Before installing scrapy-redis, the Scrapy shell was working fine.

Scrapy 1.1.0rc3 and Python 3.5.1
Thank you

Crawler no longer gets URLs from started_urls?

I have two URLs in my started_urls queue. The first one gets picked up, but the second one never does. Is this because the crawler reads from the requests queue before started_urls?

call "from_crawler" with args and kwargs in RedisSpider and RedisCrawlSpider

For spider arguments (http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments), the code should be like this:

class RedisSpider(RedisMixin, Spider):
    """Spider that reads urls from redis queue when idle."""

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        obj = super(RedisSpider, cls).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj


class RedisCrawlSpider(RedisMixin, CrawlSpider):
    """Spider that reads urls from redis queue when idle."""

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        obj = super(RedisCrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj

TypeError when loading from Redis

Hello,

The crawler works correctly using the crawl command with scrapy-redis, but when I try to insert a job directly into Redis using Redis Desktop Manager, like lpush mycrawler:start_urls http://www.epocacosmeticos.com.br/polo-red-body-spray-ralph-lauren-spray-corporal/p, I get these errors:

2016-04-26 01:04:24 [scrapy.extensions.logstats] -> INFO: Crawled 10 pages (at 10 pages/min), scraped 0 items (at 0 items/min)
2016-04-26 01:05:24 [scrapy.extensions.logstats] -> INFO: Crawled 10 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-26 01:05:59 [scrapy.utils.signal] -> ERROR: Error caught on signal handler: <bound method RedisMixin.spider_idle of <EpocaCosmeticosSpider 'epocacosmeticos.com.br' at 0x7f01cdc4c438>>
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/usr/lib/python3.5/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/lib/python3.5/site-packages/scrapy_redis/spiders.py", line 40, in spider_idle
    self.schedule_next_request()
  File "/usr/lib/python3.5/site-packages/scrapy_redis/spiders.py", line 34, in schedule_next_request
    req = self.next_request()
  File "/usr/lib/python3.5/site-packages/scrapy_redis/spiders.py", line 30, in next_request
    return self.make_requests_from_url(url)
  File "/usr/lib/python3.5/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/lib/python3.5/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/lib/python3.5/site-packages/scrapy/http/request/__init__.py", line 51, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got bytes:
2016-04-26 01:05:59 [scrapy.core.engine] -> INFO: Closing spider (finished)

Using Scrapy 1.1.0rc3 and Python 3.5.1

Thank you
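
A workaround often used with these versions was to override make_requests_from_url and decode the bytes before building the Request; a sketch with illustrative names (make_requests_from_url has since been deprecated in Scrapy):

from scrapy.http import Request
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    name = 'myspider'
    redis_key = 'mycrawler:start_urls'

    def make_requests_from_url(self, url):
        # redis-py returns bytes under Python 3 unless decode_responses=True is set.
        if isinstance(url, bytes):
            url = url.decode('utf-8')
        return Request(url, dont_filter=True)

    def parse(self, response):
        yield {'url': response.url}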

redis.exceptions.ResponseError: WRONGTYPE Operation against a key holding the wrong kind of value

I use scrapy-redis to implement a distributed spider application (1 master and multiple slaves). However, when a slave starts, some errors appear.
Here is the error information:

Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 163, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 167, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python3.5/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 87, in crawl
yield self.engine.close()
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 100, in close
return self._close_all_spiders()
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 340, in _close_all_spiders
dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 340, in
dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 298, in close_spider
dfd = slot.close()
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 44, in close
self._maybe_fire_closing()
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 51, in _maybe_fire_closing
self.heartbeat.stop()
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/task.py", line 202, in stop
assert self.running, ("Tried to stop a LoopingCall that was "
builtins.AssertionError: Tried to stop a LoopingCall that was not running.
2016-07-20 16:33:52 [twisted] CRITICAL:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 74, in crawl
yield self.engine.open_spider(self.spider, start_requests)
redis.exceptions.ResponseError: WRONGTYPE Operation against a key holding the wrong kind of value

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/local/lib/python3.5/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 87, in crawl
yield self.engine.close()
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 100, in close
return self._close_all_spiders()
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 340, in _close_all_spiders
dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 340, in
dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 298, in close_spider
dfd = slot.close()
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 44, in close
self._maybe_fire_closing()
File "/usr/local/lib/python3.5/dist-packages/scrapy/core/engine.py", line 51, in _maybe_fire_closing
self.heartbeat.stop()
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/task.py", line 202, in stop
assert self.running, ("Tried to stop a LoopingCall that was "
AssertionError: Tried to stop a LoopingCall that was not running.

I don't know how to fix it.
Please help!
Thanks in advance!

exceptions.AttributeError: 'Pipeline' object has no attribute 'multi'

I get this error:

exceptions.AttributeError: 'Pipeline' object has no attribute 'multi'

And if I change the code like this:

pipe.multi()

pipe.exec()

this error occurs:

pipe.exec()
        ^
SyntaxError: invalid syntax

And if I comment out this code:

pipe.multi()

pipe.exec()

this error occurs:

exceptions.TypeError: zadd() got an unexpected keyword argument '{t�'

Why?

How do I know when scrapy-redis is finished?

2016-09-14 13:15:23 [scrapy] INFO: Crawled 22 pages (at 0 pages/min), scraped 20 items (at 0 items/min)
2016-09-14 13:16:23 [scrapy] INFO: Crawled 22 pages (at 0 pages/min), scraped 20 items (at 0 items/min)
2016-09-14 13:17:23 [scrapy] INFO: Crawled 22 pages (at 0 pages/min), scraped 20 items (at 0 items/min)

The output suggests the spider has finished crawling, but the spider is still running. How can I make sure the spider quits after it finishes crawling?

Spider doesn't fetch more urls from redis

Hi, I'm trying to use the RedisMixin in order to crawl some 250.000 urls from multiple machines. It's working fine, although I have an issue whereby some spiders just stop fetching more urls from the redis queue even though there are more to fetch. For example, at some point the spider just idles and says:

[scrapy] INFO: Crawled 74 pages (at 0 pages/min), scraped 71 items (at 0 items/min)
[scrapy] INFO: Crawled 74 pages (at 0 pages/min), scraped 71 items (at 0 items/min)
[scrapy] INFO: Crawled 74 pages (at 0 pages/min), scraped 71 items (at 0 items/min)
...

Any idea what might cause this?

ValueError: Function <function robo1Spider.parse.<locals>.<lambda> at 0x7f0ccc7250d0> is not a method of: <robo1Spider 'proj_20_f47c8458-705e-11e6-b152-0242ac110009' at 0x7f0d000acbe0>

Hi,

After enabling scrapy-redis I got the following error:

2016-09-01 16:13:04 [scrapy] ERROR: Spider error processing <GET https://www.domain.com/> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
GeneratorExit
Exception ignored in: <generator object iter_errback at 0x7f0cf01488b8>
RuntimeError: generator ignored GeneratorExit
Unhandled error in Deferred:
2016-09-01 16:13:04 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/twisted/internet/base.py", line 1195, in run
self.mainLoop()
File "/usr/local/lib/python3.4/site-packages/twisted/internet/base.py", line 1204, in mainLoop
self.runUntilCurrent()
File "/usr/local/lib/python3.4/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python3.4/site-packages/twisted/internet/task.py", line 671, in _tick
taskObj._oneWorkUnit()
--- <exception caught here> ---
File "/usr/local/lib/python3.4/site-packages/twisted/internet/task.py", line 517, in _oneWorkUnit
result = next(self._iterator)
File "/usr/local/lib/python3.4/site-packages/scrapy/utils/defer.py", line 63, in
work = (callable(elem, _args, *_named) for elem in iterable)
File "/usr/local/lib/python3.4/site-packages/scrapy/core/scraper.py", line 183, in _process_spidermw_output
self.crawler.engine.crawl(request=output, spider=spider)
File "/usr/local/lib/python3.4/site-packages/scrapy/core/engine.py", line 209, in crawl
self.schedule(request, spider)
File "/usr/local/lib/python3.4/site-packages/scrapy/core/engine.py", line 215, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "/usr/local/lib/python3.4/site-packages/scrapy_redis/scheduler.py", line 146, in enqueue_request
self.queue.push(request)
File "/usr/local/lib/python3.4/site-packages/scrapy_redis/queue.py", line 93, in push
data = self._encode_request(request)
File "/usr/local/lib/python3.4/site-packages/scrapy_redis/queue.py", line 36, in _encode_request
obj = request_to_dict(request, self.spider)
File "/usr/local/lib/python3.4/site-packages/scrapy/utils/reqser.py", line 21, in request_to_dict
eb = _find_method(spider, eb)
File "/usr/local/lib/python3.4/site-packages/scrapy/utils/reqser.py", line 73, in _find_method
raise ValueError("Function %s is not a method of: %s" % (func, obj))
builtins.ValueError: Function <function robo1Spider.parse.<locals>.<lambda> at 0x7f0ccc7250d0> is not a method of: <robo1Spider 'proj_20_f47c8458-705e-11e6-b152-0242ac110009' at 0x7f0d000acbe0>
2016-09-01 16:13:04 [twisted] CRITICAL:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/twisted/internet/task.py", line 517, in _oneWorkUnit
result = next(self._iterator)
File "/usr/local/lib/python3.4/site-packages/scrapy/utils/defer.py", line 63, in
work = (callable(elem, _args, *_named) for elem in iterable)
File "/usr/local/lib/python3.4/site-packages/scrapy/core/scraper.py", line 183, in _process_spidermw_output
self.crawler.engine.crawl(request=output, spider=spider)
File "/usr/local/lib/python3.4/site-packages/scrapy/core/engine.py", line 209, in crawl
self.schedule(request, spider)
File "/usr/local/lib/python3.4/site-packages/scrapy/core/engine.py", line 215, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "/usr/local/lib/python3.4/site-packages/scrapy_redis/scheduler.py", line 146, in enqueue_request
self.queue.push(request)
File "/usr/local/lib/python3.4/site-packages/scrapy_redis/queue.py", line 93, in push
data = self._encode_request(request)
File "/usr/local/lib/python3.4/site-packages/scrapy_redis/queue.py", line 36, in _encode_request
obj = request_to_dict(request, self.spider)
File "/usr/local/lib/python3.4/site-packages/scrapy/utils/reqser.py", line 21, in request_to_dict
eb = _find_method(spider, eb)
File "/usr/local/lib/python3.4/site-packages/scrapy/utils/reqser.py", line 73, in _find_method
raise ValueError("Function %s is not a method of: %s" % (func, obj))
ValueError: Function <function robo1Spider.parse.<locals>.<lambda> at 0x7f0ccc7250d0> is not a method of: <robo1Spider 'proj_20_f47c8458-705e-11e6-b152-0242ac110009' at 0x7f0d000acbe0>
2016-09-01 16:13:04 [scrapy] INFO: Closing spider (finished)
2016-09-01 16:13:05 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1072,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,

Python: 3.4.5
Scrapy: 1.1.2

Any hints on what I can check to figure out what's wrong? Thanks!

CRITICAL error when running the examples

I installed scrapy-redis using pip and cloned this project to run the examples, but I got the following error.

Here are the messages:

2016-05-18 16:29:03 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'example.spiders', 'SPIDER_MODULES': ['example.spiders'], 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler'}
2016-05-18 16:29:03 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-05-18 16:29:03 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-05-18 16:29:03 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-05-18 16:29:03 [scrapy] INFO: Enabled item pipelines:
['example.pipelines.ExamplePipeline', 'scrapy_redis.pipelines.RedisPipeline']
2016-05-18 16:29:03 [scrapy] INFO: Spider opened
2016-05-18 16:29:03 [scrapy] INFO: Closing spider (shutdown)
Unhandled error in Deferred:
2016-05-18 16:29:03 [twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "/opt/anaconda2/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "/opt/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 163, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/opt/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 167, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/opt/anaconda2/lib/python2.7/site-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/opt/anaconda2/lib/python2.7/site-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/opt/anaconda2/lib/python2.7/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/opt/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 87, in crawl
    yield self.engine.close()
  File "/opt/anaconda2/lib/python2.7/site-packages/scrapy/core/engine.py", line 100, in close
    return self._close_all_spiders()
  File "/opt/anaconda2/lib/python2.7/site-packages/scrapy/core/engine.py", line 340, in _close_all_spiders
    dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
  File "/opt/anaconda2/lib/python2.7/site-packages/scrapy/core/engine.py", line 298, in close_spider
    dfd = slot.close()
  File "/opt/anaconda2/lib/python2.7/site-packages/scrapy/core/engine.py", line 44, in close
    self._maybe_fire_closing()
  File "/opt/anaconda2/lib/python2.7/site-packages/scrapy/core/engine.py", line 51, in _maybe_fire_closing
    self.heartbeat.stop()
  File "/opt/anaconda2/lib/python2.7/site-packages/twisted/internet/task.py", line 202, in stop
    assert self.running, ("Tried to stop a LoopingCall that was "
exceptions.AssertionError: Tried to stop a LoopingCall that was not running.

I am using scrapy 1.1.0.

Could anyone help?

Thanks,
Jay

When will `set_crawler` be called?

I've read your definition of RedisSpider in spiders.py, but I can't figure out when the set_crawler method will be called, or what type the crawler parameter is.
It seems that without this initialization, the spider can't fetch URLs from Redis.

Can't get the 'start_urls' key for a given spider name

I.e. if my spider name is 'ralphlauren', the spider will open, but soon closes.
The screen doesn't show 'DEBUG: Reading URLs from redis list'.

But if I change the name to anything different, it works.

This really confuses me. I don't know why it happens.

Hoping for your help, many thanks.

builtins.TypeError: zadd() keywords must be strings

Unhandled Error
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/scrapy/commands/crawl.py", line 58, in run
self.crawler_process.start()
File "/usr/local/lib/python3.5/site-packages/scrapy/crawler.py", line 280, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 1194, in run
self.mainLoop()
File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 1203, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python3.5/site-packages/scrapy/utils/reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "/usr/local/lib/python3.5/site-packages/scrapy/core/engine.py", line 134, in _next_request
self.crawl(request, spider)
File "/usr/local/lib/python3.5/site-packages/scrapy/core/engine.py", line 209, in crawl
self.schedule(request, spider)
File "/usr/local/lib/python3.5/site-packages/scrapy/core/engine.py", line 215, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "/usr/local/lib/python3.5/site-packages/scrapy_redis/scheduler.py", line 78, in enqueue_request
self.queue.push(request)
File "/usr/local/lib/python3.5/site-packages/scrapy_redis/queue.py", line 83, in push
self.server.zadd(self.key, **pairs)
builtins.TypeError: zadd() keywords must be strings

2016-06-08 12:32:25 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/scrapy/commands/crawl.py", line 58, in run
self.crawler_process.start()
File "/usr/local/lib/python3.5/site-packages/scrapy/crawler.py", line 280, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 1194, in run
self.mainLoop()
File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 1203, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python3.5/site-packages/scrapy/utils/reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "/usr/local/lib/python3.5/site-packages/scrapy/core/engine.py", line 134, in _next_request
self.crawl(request, spider)
File "/usr/local/lib/python3.5/site-packages/scrapy/core/engine.py", line 209, in crawl
self.schedule(request, spider)
File "/usr/local/lib/python3.5/site-packages/scrapy/core/engine.py", line 215, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "/usr/local/lib/python3.5/site-packages/scrapy_redis/scheduler.py", line 78, in enqueue_request
self.queue.push(request)
File "/usr/local/lib/python3.5/site-packages/scrapy_redis/queue.py", line 83, in push
self.server.zadd(self.key, **pairs)
builtins.TypeError: zadd() keywords must be strings

2016-06-08 12:32:30 [scrapy] INFO: Closing spider (finished)
2016-06-08 12:32:30 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 8, 4, 32, 30, 385579),
'log_count/CRITICAL': 1,
'log_count/INFO': 7,
'scheduler/enqueued/redis': 1,
'start_time': datetime.datetime(2016, 6, 8, 4, 32, 25, 376144)}
2016-06-08 12:32:30 [scrapy] INFO: Spider closed (finished)

Add a redis backed StatsCollector

There seems to be no existing redis backed StatsCollector which shares stats between all spiders.

I was messing around with the following:

from scrapy.statscollectors import StatsCollector
from scrapy_redis.connection import from_settings as redis_from_settings

class RedisStatsCollector(StatsCollector):
    def __init__(self, crawler, spider=None):
        self.redis = redis_from_settings(crawler.settings)
        self.spider = spider

    @classmethod
    def from_spider(cls, spider):
        return cls(spider.crawler, spider)

    def _get_key(self, key, spider=None):
        if spider is None:
            name = '<scrapy>'
        elif self.spider is not None:
            name = self.spider.name
        else:
            name = spider.name

        return '%s:stats:%s' % (name, key)

    def get_value(self, key, default=None, spider=None):
        key = self._get_key(key, spider)
        value = self.redis.get(key)
        if value is None:
            return default
        else:
            return value

    def get_stats(self, spider=None):
        keys = self.redis.keys(self._get_key('*', spider))
        # mget() returns only the values, so pair them back up with their keys.
        return dict(zip(keys, self.redis.mget(*keys)))

    def set_value(self, key, value, spider=None):
        key = self._get_key(key, spider)
        self.redis.set(key, value)

    def inc_value(self, key, count=1, start=0, spider=None):
        pipe = self.redis.pipeline()
        key = self._get_key(key, spider)
        pipe.setnx(key, start)
        pipe.incrby(key, count)
        pipe.execute()

    def max_value(self, key, value, spider=None):
        key = self._get_key(key, spider)
        pipe = self.redis.pipeline()
        pipe.zadd(key, value, value)
        pipe.zremrangebyrank(key, 0, -2)
        pipe.execute()

    def min_value(self, key, value, spider=None):
        key = self._get_key(key, spider)
        pipe = self.redis.pipeline()
        pipe.zadd(key, value, value)
        pipe.zremrangebyrank(key, 2, -1)
        pipe.execute()

    def clear_stats(self, spider=None):
        keys = self.redis.keys(self._get_key('*', spider))
        self.redis.delete(*keys)

    def open_spider(self, spider):
        self.spider = spider

    def close_spider(self, spider, reason=None):
        self.spider = None

If there is any interest in this I can submit a PR.
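
If something like this were merged, it would presumably be enabled through Scrapy's standard STATS_CLASS setting; a sketch of the settings.py entry, with a hypothetical import path:

# settings.py
STATS_CLASS = 'scrapy_redis.stats.RedisStatsCollector'  # hypothetical module path
REDIS_URL = 'redis://localhost:6379'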

Slow crawling speeds with allowed_domains

I have a crawler setup with Scrapy version 0.24.2 and the latest version of scrapy-redis. I have seen a drastic drop in performance when I add a list of allowed_domains. If I delete the allowed_domains list, my crawler goes from 1-3 pages/min up to 200-300 pages/min. I feel that scrapy-redis has something to do with these performance issues. Have you ever encountered this? Is there anything that could be causing it?

Settings: no autothrottle, no download limit, SCHEDULER_PERSIST = False; I also tried not passing a list of start_urls.
I am also getting very low CPU usage, about 0.5%; if I remove allowed_domains it goes back up to about 10% (single thread).

I am also seeing this at the beginning of every crawl:
/usr/local/lib/python2.7/dist-packages/scrapy_redis/spiders.py:40: ScrapyDeprecationWarning: scrapy_redis.spiders.RedisSpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class RedisSpider(RedisMixin, BaseSpider):

Thanks,
Doginal

Some problems when using scrapy-redis

Hi, when I use scrapy-redis I found this problem:
I wrote two spiders. The first spider parses the list URLs, obtains the detail URLs, and pushes the detail URLs into the Redis schedule.
When I run the second spider,
I want it to call some function first, like def start_requests(self): (this method performs login and gets cookies; only after getting cookies can it visit the detail URLs). But this spider doesn't call the start_requests() function; it gets the detail URLs from the Redis schedule and requests them directly, and the response code is 302 (because the second spider has no cookies).

I want to call some function before my spider gets URLs from the Redis schedule; after the spider has run this function, it should get the URLs from the Redis schedule and request them.

def start_requests(self):
    url = 'http://XXXXX'
    yield FormRequest(url=url, formdata=data, callback=self.get_checkcode, dont_filter=True)

I hope someone can give me advice.

Thank you.
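
One approach that may work (a sketch, not a documented pattern) is to override start_requests on the Redis-based spider so that the login request runs first; once the login callback finishes and the spider goes idle, the RedisMixin idle handler starts pulling the detail URLs from Redis, and the cookies gathered during login are reused by the cookies middleware. The URLs and form fields below are placeholders:

from scrapy import FormRequest
from scrapy_redis.spiders import RedisSpider


class DetailSpider(RedisSpider):
    """Logs in first, then lets the Redis queue drive the crawl."""
    name = 'detail'
    redis_key = 'detail:start_urls'

    def start_requests(self):
        # Run the login request before any URL is taken from Redis.
        yield FormRequest(
            'http://example.com/login',                     # placeholder login URL
            formdata={'user': 'me', 'password': 'secret'},  # placeholder credentials
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        self.logger.info('Logged in; the idle handler will now read URLs from Redis')
        # Nothing is yielded here: when the spider goes idle, RedisMixin.spider_idle
        # schedules the next batch of requests from the Redis key.

    def parse(self, response):
        yield {'url': response.url}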

scheduler bug?

The scheduler doesn't seem to respect allowed_domains or settings like DOWNLOAD_DELAY. Everything works fine when you disable the scrapy-redis scheduler in the settings. I tried looking at the code but can't find what's going wrong.

Scrapy 1.0 version throwing Warnings

With the new version of Scrapy, the following warning is thrown:

/Library/Python/2.7/site-packages/scrapy_redis/dupefilter.py:4: ScrapyDeprecationWarning: Module scrapy.dupefilter is deprecated, use scrapy.dupefilters instead
from scrapy.dupefilter import BaseDupeFilter

Please modify the code and push it to the PyPI repository.
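
The fix the warning itself points at is a one-line import change in scrapy_redis/dupefilter.py:

# scrapy_redis/dupefilter.py
from scrapy.dupefilters import BaseDupeFilter  # the old scrapy.dupefilter module is deprecated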

Why is the data obtained from the Redis queue cPickle-processed?

Hello, I need multiple slave machines to read URLs from the Redis URL queue on the master machine and run the crawling tasks. But the URLs I read on the slave machines are cPickle-processed data, so the HTTP request does not recognize the URL. I would like to know how to get plain URLs from the Redis queue. The test code is myspider_redis.py; do you have any good suggestions?
