
scrapyrt's Introduction


ScrapyRT (Scrapy realtime)


Add HTTP API for your Scrapy project in minutes.

You send a request to ScrapyRT with a spider name and URL, and in response you get the items collected by that spider when it visits the URL.

  • All Scrapy project components (e.g. middleware, pipelines, extensions) are supported
  • You run ScrapyRT in a Scrapy project directory. It starts an HTTP server that lets you schedule spiders and get spider output as JSON.

Quickstart

1. install

> pip install scrapyrt

2. switch to your Scrapy project directory (e.g. the quotesbot project)

> cd my/project_path/is/quotesbot

3. launch ScrapyRT

> scrapyrt

4. run your spiders

> curl "localhost:9080/crawl.json?spider_name=toscrape-css&url=http://quotes.toscrape.com/"

5. run a more complex query, e.g. specify a callback for the Scrapy request and a zipcode argument for the spider

>  curl --data '{"request": {"url": "http://quotes.toscrape.com/page/2/", "callback":"some_callback"}, "spider_name": "toscrape-css", "crawl_args": {"zipcode":"14000"}}' http://localhost:9080/crawl.json -v

ScrapyRT looks for a scrapy.cfg file to determine your project settings and raises an error if it cannot find one. Note that you need to have all of your project's requirements installed.
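
A successful crawl returns a JSON object. A representative, abridged shape (the top-level field names are taken from an actual response quoted later on this page; the item fields depend entirely on your spider and are purely illustrative here):

    {
      "status": "ok",
      "spider_name": "toscrape-css",
      "items": [{"text": "…", "author": "…", "tags": ["…"]}],
      "items_dropped": [],
      "stats": {"downloader/request_count": 1, "finish_reason": "finished"}
    }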

Note

  • The project is not a replacement for Scrapyd, Scrapy Cloud, or other infrastructure for running long crawls
  • It is not suitable for long-running spiders; it is a good fit for spiders that fetch one response from a website and return items quickly

Documentation

Documentation is available on readthedocs.

Support

Open source support is provided here on GitHub. Please create a question issue (i.e. an issue with the "question" label).

Commercial support is also available from Zyte.

License

ScrapyRT is offered under the BSD 3-Clause license.

Development

Development takes place on GitHub.

scrapyrt's People

Contributors

aperezalbela, chekunkov, cyberplant, digenis, fcanobrash, jsargiot, lboynton, pawelmhm, pedrosalgueiro, radostyle, redapple, serhii73, sjhewitt, zolrath


scrapyrt's Issues

Scrapyrt with scrapy-splash - NotImplementedError

Hello,

I have a spider that works totally fine with scrapy-splash:


import json

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, url=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.rotate_user_agent = True
        self.start_urls = ['%s' % url]

    def start_requests(self):
        splash_args = {
            # args
        }
        for url in self.start_urls:
            # Render each start URL through Splash before parsing it.
            yield SplashRequest(url, self.parse_item,
                args=splash_args,
                endpoint='render.html'
            )

    def parse_item(self, response):
        # Some xpath processing that builds `results`
        print(json.dumps(results))

When trying to scrape with scrapyrt, I get

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 393, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 76, in parse
    raise NotImplementedError
exceptions.NotImplementedError:

Is it possible to use Splash?

Thanks for the great tools!

Update

Renaming parse_item to parse gets rid of the error. But while I get results with the command scrapy crawl myspider -a url='http://<any_url>',

I get nothing with either POST or GET requests; items is always empty:

{"status": "ok", "items": [], "spider_name": "myspider", "stats": {"downloader/request_bytes": 338, "downloader/request_count": 1, "downloader/request_method_count/GET": 1, "downloader/response_bytes": 14966, "downloader/response_count": 1, "downloader/response_status_count/200": 1, "finish_reason": "finished", "finish_time": "2016-07-23 01:30:59", "log_count/DEBUG": 1, "log_count/INFO": 8, "response_received_count": 1, "scheduler/dequeued": 1, "scheduler/dequeued/memory": 1, "scheduler/enqueued": 1, "scheduler/enqueued/memory": 1, "start_time": "2016-07-23 01:30:58"}, "items_dropped": []} 

tests.test_crawl_manager.TestLimitRuntime failing

py.test tests

fails with:


self = <tests.test_crawl_manager.TestLimitRuntime testMethod=test_limit_runtime>, condition = True, msg = None

    def assertFalse(self, condition, msg=None):
        """
            Fail the test if C{condition} evaluates to True.

            @param condition: any object that defines __nonzero__
            """
        if condition:
>           raise self.failureException(msg)
E           FailTest: None

../local/lib/python2.7/site-packages/twisted/trial/_synctest.py:296: FailTest
_____________________________________________________________________________ TestLimitRuntime.test_string_number_timeout_value _____________________________________________________________________________

self = <tests.test_crawl_manager.TestLimitRuntime testMethod=test_string_number_timeout_value>

    def test_string_number_timeout_value(self):
        _timeout = settings.TIMEOUT_LIMIT
        try:
            settings.TIMEOUT_LIMIT = '1'
            self.crawl_manager = self._create_crawl_manager()
>           self._test_limit_runtime()

tests/test_crawl_manager.py:150: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_crawl_manager.py:137: in _test_limit_runtime
    self.assertFalse(self.crawler.engine.close_spider.called)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tests.test_crawl_manager.TestLimitRuntime testMethod=test_string_number_timeout_value>, condition = True, msg = None

    def assertFalse(self, condition, msg=None):
        """
            Fail the test if C{condition} evaluates to True.

            @param condition: any object that defines __nonzero__
            """
        if condition:
>           raise self.failureException(msg)
E           FailTest: None

The failure comes from https://github.com/scrapinghub/scrapyrt/blob/master/tests/test_crawl_manager.py#L137; I think that assertion should probably be assertTrue, not assertFalse.
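
A minimal sketch of the suggested change (assuming the helper's intent is that the engine's close_spider has been called once the timeout elapses; surrounding code abridged):

    # tests/test_crawl_manager.py, _test_limit_runtime (sketch, not the actual file)
    # was: self.assertFalse(self.crawler.engine.close_spider.called)
    self.assertTrue(self.crawler.engine.close_spider.called)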

Unhandled Error

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\web\http.py", line 1906, in allContentReceived
    req.requestReceived(command, path, version)
  File "C:\Python27\lib\site-packages\twisted\web\http.py", line 771, in requestReceived
    self.process()
  File "C:\Python27\lib\site-packages\twisted\web\server.py", line 190, in process
    self.render(resrc)
  File "C:\Python27\lib\site-packages\twisted\web\server.py", line 241, in render
    body = resrc.render(self)
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\scrapyrt\resources.py", line 24, in render
    result = resource.Resource.render(self, request)
  File "C:\Python27\lib\site-packages\twisted\web\resource.py", line 250, in render
    return m(request)
  File "C:\Python27\lib\site-packages\scrapyrt\resources.py", line 161, in render_POST
    return self.prepare_crawl(request_data, spider_data, **kwargs)
  File "C:\Python27\lib\site-packages\scrapyrt\resources.py", line 201, in prepare_crawl
    spider_name, spider_data, max_requests, *args, **kwargs)
  File "C:\Python27\lib\site-packages\scrapyrt\resources.py", line 210, in run_crawl
    dfd = manager.crawl(*args, **kwargs)
  File "C:\Python27\lib\site-packages\scrapyrt\core.py", line 156, in crawl
    dfd = self.crawler_process.crawl(self.spider_name, *args, **kwargs)
  File "C:\Python27\lib\site-packages\scrapyrt\core.py", line 80, in crawl
    cleanup_handler = setup_spider_logging(crawler.spider, self.settings)
  File "C:\Python27\lib\site-packages\scrapyrt\log.py", line 141, in setup_spider_logging
    handler = logging.FileHandler(filename, encoding=encoding)
  File "C:\Python27\lib\logging\__init__.py", line 913, in __init__
    StreamHandler.__init__(self, self._open())
  File "C:\Python27\lib\logging\__init__.py", line 945, in _open
    stream = codecs.open(self.baseFilename, self.mode, self.encoding)
  File "C:\Python27\lib\codecs.py", line 896, in open
    file = __builtin__.open(filename, mode, buffering)
exceptions.IOError: [Errno 22] invalid mode ('ab') or filename: u'E:\Py workspace\amazon_crawler\logs\asinspider\2017-03-21T17:25:32.311000.log'

Port binding error on heroku boot

Hi,

I'm trying to release a ScrapyRT server on Heroku. I'm almost there, but I still have an issue on bootup: Heroku doesn't detect the port as properly 'bound'.

My Procfile looks like this:
web: scrapyrt -p $PORT

and the logs on Heroku bootup look like this:

2016-01-24T15:52:37.903625+00:00 heroku[web.1]: Starting process with command `scrapyrt -p 21555`
2016-01-24T15:52:40.549269+00:00 app[web.1]: 2016-01-24 15:52:40+0000 [-] Log opened.
2016-01-24T15:52:40.605826+00:00 app[web.1]: 2016-01-24 15:52:40+0000 [-] Site starting on 21555
2016-01-24T15:52:40.606006+00:00 app[web.1]: 2016-01-24 15:52:40+0000 [-] Starting factory <twisted.web.server.Site instance at 0x7f26e9e8b950>
2016-01-24T15:53:38.378592+00:00 heroku[web.1]: Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch

I've checked and everything seems to be running fine: when running it from a heroku run bash one-off instance and starting a scrapyrt process in the background, I can query it via curl localhost:8090/crawl.json without a problem.
I also checked with a simple Twisted server by replacing my Procfile with twistd --nodaemon web -p $PORT, and it worked fine.

I'm not a big expert on Twisted or on HTTP/TCP protocols, but I guess something is not being done the way a classic web server does it. I've seen that you use a TCPServer class; maybe there's something I'm missing? There's a Heroku doc page for a plugin that allows routing TCP sockets to an app's dynos, but I don't think that applies here. Also, I couldn't find any precise info on how Heroku checks for port binding.

Maybe you can point me in the right direction if you have some insight and save me some debugging time :)
I'd also be happy to write some doc on how to deploy to heroku afterwards, as I already ran into a few more caveats.

I ❤️ scrapy !
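
A likely cause worth checking rather than taking as confirmed: ScrapyRT binds to localhost by default, while Heroku only considers the port bound when the process listens on all interfaces. Assuming the CLI exposes a flag for the listen address (ScrapyRT documents an -i option; verify with scrapyrt --help), the Procfile would become:

web: scrapyrt -i 0.0.0.0 -p $PORT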

Scrapyrt doesn't complete job

I have a spider that yields new requests depending on how many results are displayed on my target site. My issue is that scrapyrt stops after 7 requests, but running my spider through scrapy crawl or python "Script name" gives me the expected results. Has anyone experienced anything like this?

twisted headers error

scrapyrt invokes the Scrapy crawler and crawls normally, but no response is returned after updating Twisted to 19.2.0.

crossbario/crossbar#1557

Traceback (most recent call last):
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
    _inlineCallbacks(r, g, status)
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1421, in _inlineCallbacks
    status.deferred.callback(getattr(e, "value", None))
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 460, in callback
    self._startRunCallbacks(result)
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/scrapyrt/resources.py", line 37, in finish_request
    request.write(self.render_object(obj, request))
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/scrapyrt/resources.py", line 95, in render_object
    request.setHeader('Content-Length', len(r))
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/twisted/web/http.py", line 1271, in setHeader
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 220, in setRawHeaders
    else:
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 220, in <listcomp>
    else:
  File "/home/crawler/easyitem_crawler_rest/venv/lib/python3.6/site-packages/twisted/web/http_headers.py", line 40, in _sanitizeLinearWhitespace
    above, using just L{bytes} arguments to the methods of this class will
builtins.AttributeError: 'int' object has no attribute 'splitlines'
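
The last frames suggest that newer Twisted sanitizes header values by calling splitlines() on them, so passing an int to setHeader no longer works. A sketch of the kind of change that avoids it (illustrative, not the actual scrapyrt patch):

    # scrapyrt/resources.py, render_object (sketch)
    # before: fails on Twisted >= 19.2 because len(r) is an int
    request.setHeader('Content-Length', len(r))
    # after: pass the value as a string (or bytes)
    request.setHeader('Content-Length', str(len(r)))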

GET api of ScrapyRT not working - 'Spider not found: XYZ'

I have a Scrapy project which works fine when run as a standalone Python program. However, the problem begins when I try using it from Flask: I can no longer invoke the spiders. So, after searching the Internet, I found the wrapper ScrapyRT, which exposes an HTTP interface for Scrapy.
I installed ScrapyRT and the app is now running on port 6080, but when I make a GET call to the crawl API, the following error is returned in the UI:

{"status": "error", "message": "'Spider not found: TestSpider'", "code": 404}
The URL I am trying: http://localhost:9080/crawl.json?spider_name=TestSpider&url=https://www.cnn.com
What am I doing wrong here?
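
One thing worth double-checking (a guess, since the project code isn't shown): spider_name must match the spider's name attribute, not its Python class name. For example:

    import scrapy

    class TestSpider(scrapy.Spider):
        # ScrapyRT matches spider_name against this value,
        # so the request would use spider_name=test, not TestSpider.
        name = "test"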

dockerfile python3 support

Hello. Awesome app! 🥇 The Dockerfile still uses Python 2, though. I'd like to submit a PR for a new file adding Python 3 support. You could maintain support for both with separate tags, e.g. scrapinghub/scrapyrt:python2 and scrapinghub/scrapyrt:python3. Thoughts?

errback assignment from JSON POST is not working

When CrawlManager is initialized, the code saves callback_name for later, since it doesn't yet know whether the spider has that method. This is not the case for errback, which is passed directly to create_spider_request, where a new request is generated. What happens is that Request complains that errback (still a string at that point) is not callable, since the spider object has not been loaded yet. Is it possible to treat errback the same way callback is treated?
Please let me know if I'm missing anything in my reading.

Here is a copy of the code I'm looking at:

class CrawlManager(object):
    """
    Runs crawls
    """

    def __init__(self, spider_name, request_kwargs, max_requests=None, start_requests=False):
        self.spider_name = spider_name
        self.log_dir = settings.LOG_DIR
        self.items = []
        self.items_dropped = []
        self.errors = []
        self.max_requests = int(max_requests) if max_requests else None
        self.timeout_limit = int(settings.TIMEOUT_LIMIT)
        self.request_count = 0
        self.debug = settings.DEBUG
        self.crawler_process = None
        self.crawler = None
        # callback will be added after instantiation of crawler object
        # because we need to know if spider has method available
        self.callback_name = request_kwargs.pop('callback', None) or 'parse'
        if request_kwargs.get("url"):
            self.request = self.create_spider_request(deepcopy(request_kwargs))
        else:
            self.request = None
        self.start_requests = start_requests
        self._request_scheduled = False
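
A sketch of the kind of change being asked for (hypothetical; errback_name simply mirrors the existing callback_name handling and is not actual scrapyrt code): keep the errback as a string at init time and only resolve it to a spider method once the crawler, and therefore the spider, exists.

    # Sketch only: mirror the callback_name handling for errback.
    # In __init__, alongside callback_name:
    self.errback_name = request_kwargs.pop('errback', None)

    # Later, once the crawler object exists and the spider is known:
    if self.errback_name and hasattr(crawler.spider, self.errback_name):
        self.request = self.request.replace(
            errback=getattr(crawler.spider, self.errback_name))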

Running out of file descriptors due to per-spider log files

Looking at lsof output, there are only 2-3 TCP connections; 99% of the open files are the per-spider logs.

When this happens scrapyrt stops responding to requests.

It happens about once a day under lowish load (only a couple requests a minute).
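
A quick way to confirm the leak on a running instance (assuming a single scrapyrt process and per-spider log files ending in .log):

> lsof -p $(pgrep -f scrapyrt) | grep -c '\.log'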

How to access meta values?

Hi,

I'm doing this request:
curl -XPOST -d '{ "spider_name":"quotes", "start_requests":true, "request":{ "meta": { "test": "1", } } }' "http://138.219.228.215:9080/crawl.json"

Then I try to access it from my spider with print(response.meta), and this is what it shows:

{'depth': 0, 'download_latency': 0.03323054313659668, 'download_slot': 'URL', 'download_timeout': 180.0}

Of course, response.meta["test"] throws an error.

I need to use this "test" parameter to fill the form request.

EDIT: spider: https://pastebin.com/EFz818qL

thanks!
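
One workaround worth trying (not verified against this spider): the CrawlManager code quoted earlier on this page only builds a Scrapy request when "request" contains a "url", so meta sent together with start_requests and no url is simply discarded. Passing the value as a spider argument via crawl_args (as in the README example above) makes it available on the spider instead:

> curl -XPOST -d '{"spider_name": "quotes", "start_requests": true, "crawl_args": {"test": "1"}}' "http://localhost:9080/crawl.json"

Inside the spider the value then shows up as self.test (Scrapy sets crawl arguments as spider attributes) rather than as response.meta["test"].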

Authentication mechanism on the REST API of scrapyrt

Basically, I want to prevent unauthorized clients from accessing the scrapyrt API. I would like to secure a scrapyrt API; is there anything built in that handles an authorization mechanism?

What kind of approach do you suggest?

In addition, I would like to understand whether there is some mechanism to limit the maximum number of requests per single client.
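
ScrapyRT does not appear to ship authentication or rate limiting of its own, so a common approach is to keep it bound to localhost and put a reverse proxy in front of it. A minimal nginx sketch (server name, paths, zone name and limits are illustrative assumptions):

    # /etc/nginx/conf.d/scrapyrt.conf (illustrative)
    limit_req_zone $binary_remote_addr zone=scrapyrt:10m rate=5r/s;

    server {
        listen 80;
        server_name example.com;

        location /crawl.json {
            auth_basic           "ScrapyRT";
            auth_basic_user_file /etc/nginx/.htpasswd;  # created with htpasswd
            limit_req            zone=scrapyrt burst=10;
            proxy_pass           http://127.0.0.1:9080;
        }
    }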

https seems not supported?

Is there any support for crawling https links with scrapyrt?
With the scrapy command, data can be exported easily, but the Scrapyrt API is not returning data for https links.

regards
Vinit

builtins.ModuleNotFoundError: No module named 'webscrape' when running spider

Full scrapyrt error

2019-08-12 16:37:47-0700 [scrapyrt] Unhandled Error
        Traceback (most recent call last):
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\twisted\web\http.py", line 2196, in allContentReceived
            req.requestReceived(command, path, version)
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\twisted\web\http.py", line 920, in requestReceived
            self.process()
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\twisted\web\server.py", line 199, in process
            self.render(resrc)
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\twisted\web\server.py", line 259, in render
            body = resrc.render(self)
        --- <exception caught here> ---
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\scrapyrt\resources.py", line 26, in render
            result = resource.Resource.render(self, request)
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\twisted\web\resource.py", line 250, in render
            return m(request)
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\scrapyrt\resources.py", line 127, in render_GET
            return self.prepare_crawl(api_params, scrapy_request_args, **kwargs)
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\scrapyrt\resources.py", line 217, in prepare_crawl
            start_requests=start_requests, *args, **kwargs)
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\scrapyrt\resources.py", line 226, in run_crawl
            dfd = manager.crawl(*args, **kwargs)
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\scrapyrt\core.py", line 157, in crawl
            self.get_project_settings(), self)
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\scrapyrt\core.py", line 178, in get_project_settings
            return get_project_settings(custom_settings=custom_settings)
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\scrapyrt\conf\spider_settings.py", line 27, in get_project_settings
            crawler_settings.setmodule(module, priority='project')
          File "c:\users\user\microblog\job-visualizer\venv\lib\site-packages\scrapy\settings\__init__.py", line 288, in setmodule
            module = import_module(module)
          File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\importlib\__init__.py", line 127, in import_module
            return _bootstrap._gcd_import(name[level:], package, level)
          File "<frozen importlib._bootstrap>", line 1006, in _gcd_import

          File "<frozen importlib._bootstrap>", line 983, in _find_and_load

          File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked

          File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed

          File "<frozen importlib._bootstrap>", line 1006, in _gcd_import

          File "<frozen importlib._bootstrap>", line 983, in _find_and_load

          File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked

        builtins.ModuleNotFoundError: No module named 'webscrape'

Spider data response
{'status': 'error', 'message': "No module named 'webscrape'", 'code': 500}

When attempting to run my spider in my web app, I get the above error. I'm not sure what it means by no module named 'webscrape', as I'm not importing webscrape anywhere. I've removed imports related to webscrape and relative imports that weren't needed, and I still get this error. Can anyone please help?
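
The traceback shows the module name coming from the project settings that ScrapyRT loads via scrapy.cfg, not from anything the spider imports. A guess worth checking: the scrapy.cfg next to where scrapyrt is launched points at a settings module under a package named webscrape (and that settings module's SPIDER_MODULES does too), but that package is not importable from the directory or virtualenv scrapyrt runs in. Illustrative layout:

    # scrapy.cfg (in the directory where scrapyrt is started)
    [settings]
    default = webscrape.settings

    # webscrape/settings.py
    SPIDER_MODULES = ['webscrape.spiders']

If scrapyrt is started from a different directory, or from a virtualenv where the project isn't importable, the import of webscrape fails exactly like this; launching scrapyrt from the project root usually resolves it.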

Using custom modules with Docker

I'm using this middleware in my Scrapy project: https://github.com/alecxe/scrapy-fake-useragent. It works fine when I run Scrapy from my project directly, but when I use it through Docker I get this error:

{"status": "error", "message": "No module named scrapy_fake_useragent.middleware", "code": 500}

I'm fairly new to Python, so I would appreciate help on how to get it to work. Do I have to somehow install it in the Docker container as well, or add a requirements.txt? Right now I've just pip installed the scrapy-fake-useragent package on the system, and it works fine without Docker.
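
Yes, the package has to be installed inside the container image as well; packages pip-installed on the host are not visible to it. A minimal sketch (the base image and the project path inside the image are assumptions to check against the image's documentation), assuming a requirements.txt that lists scrapy-fake-useragent:

    FROM scrapinghub/scrapyrt
    # Copy the Scrapy project into the image and install its requirements.
    COPY . /scrapyrt/project
    WORKDIR /scrapyrt/project
    RUN pip install -r requirements.txt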

POST endpoint doesn't return 404 if spider is missing

Try doing the following in the dirbot project with scrapyrt running on localhost:9080:

> curl -X POST -d '{"spider_name":"amazing", "request": {"url":"http://www.tesco.com/groceries/product/details/?id=258035097"}}' "http://localhost:9080/crawl.json" -v

That's curl with some url and spider_name, where the spider does not exist in the project. The API should return 404, and this is what the GET handler does, but POST doesn't return anything; it just hangs.

document deploy in wiki

#25 will document Heroku; we should document all options so that users won't think Heroku is the only (or preferred) method.

Issue with CORS when posting

I can't get crawled data when sending post requests through axios in a VueJS app. The following errors are being returned:

Failed to load resource: the server responded with a status of 405 (Method Not Allowed)
Failed to load http://localhost:9080/crawl.json?: Response for preflight does not have HTTP ok status.

From what I've figured out so far, scrapyrt doesn't handle OPTIONS (sent by the browser as a preflight query). How can I solve this issue?

Note: my spider works fine from my app with GET, and also with POST from the terminal using curl.
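
ScrapyRT does not appear to handle CORS preflight itself, so the usual options are (a) terminating CORS in a reverse proxy in front of it, answering OPTIONS and adding the Access-Control-* headers there, or (b) wrapping the crawl resource in your own project. A rough sketch of (b); the class name and how the subclass gets registered (e.g. via the RESOURCES setting) are assumptions to verify against your scrapyrt version:

    # myproject/resources.py (sketch)
    from scrapyrt.resources import CrawlResource

    class CORSCrawlResource(CrawlResource):
        def render(self, request):
            # Add CORS headers to every response, including errors.
            request.setHeader('Access-Control-Allow-Origin', '*')
            request.setHeader('Access-Control-Allow-Headers', 'Content-Type')
            return super(CORSCrawlResource, self).render(request)

        def render_OPTIONS(self, request):
            # Answer the browser's preflight request; depending on the
            # scrapyrt version, returning a dict for JSON encoding may be
            # needed instead of an empty body.
            request.setHeader('Access-Control-Allow-Origin', '*')
            request.setHeader('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
            request.setHeader('Access-Control-Allow-Headers', 'Content-Type')
            return b''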

Cannot override log related spider settings

AFAICT it's not possible to override LOG_LEVEL, LOG_FILE, LOG_DIR, etc for spiders because the dict from get_scrapyrt_settings is applied with priority 'cmdline'.

I assume this is due to conflicting goals:

  1. Have scrapyrt be a "drop in" runner with no config changes required
  2. Have sane logging in the presence of multiple crawls

My take is

  1. The dict should have priority 'default' (since they really are defaults - the spider developer might want to customize them)
  2. scrapyrt should use a scrapyrt.cfg file rather than scrapy.cfg

scrapy.cfg is typically small enough that requiring the user to either copy it or use a template from the documentation wouldn't be a significant burden.

from scrapy import signals, log as scrapy_log builtins.ImportError: cannot import name 'log'

I installed scrapyrt today, but calling a GET request results in an error, which was reported at scrapy/scrapyd#311

I followed the issue above, and downgraded Twisted with pip3 install Twisted==18.9.0.

After that, however, scrapyrt is producing a different error, saying...

2019-07-28 16:53:22+0200 [scrapyrt] Unhandled Error
	Traceback (most recent call last):

       ...

	File "/home/user/app/scrapy-test/my_venv/lib64/python3.6/site-packages/scrapyrt/core.py", line 9, in <module>
    	from scrapy import signals, log as scrapy_log
	builtins.ImportError: cannot import name 'log'

Could I ask for help to fix it?

It's running on Python 3.6.3
Scrapy is 1.7.1

twisted.python.failure.NoCurrentExceptionError:

File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, _args, *_kw)
File "/usr/local/lib/python2.7/dist-packages/scrapyrt-0.9-py2.7.egg/scrapyrt/resources.py", line 33, in handle_errors
return self.handle_render_errors(request, failure.value)
File "/usr/local/lib/python2.7/dist-packages/scrapyrt-0.9-py2.7.egg/scrapyrt/resources.py", line 56, in handle_render_errors
log.err()
File "/usr/local/lib/python2.7/dist-packages/scrapyrt-0.9-py2.7.egg/scrapyrt/log.py", line 32, in err
log.err(_stuff, _why, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/twisted/python/log.py", line 117, in err
_stuff = failure.Failure()
File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 214, in init
raise NoCurrentExceptionError()
twisted.python.failure.NoCurrentExceptionError:

why?

Python 3.x Support and Other Issues

Looking for Python 3.x Support

from ConfigParser import SafeConfigParser, NoOptionError, NoSectionError
ImportError: No module named 'ConfigParser'

if isinstance(module, basestring):
NameError: name 'basestring' is not defined

for route, resource_path in settings.RESOURCES.iteritems():
AttributeError: 'dict' object has no attribute 'iteritems'

Should URL be a required argument?

I realize the project is pretty young so things will change, but should URLs be required? In my particular use case, we have static URLs for each spider that get built from a keyword or id passed in through spider arguments, and the URL doesn't change that much. Just a thought.
Thanks

exceptions.IOError: [Errno 22] invalid mode ('ab') or filename:

2017-03-21 19:23:40+0800 [scrapyrt] Unhandled Error
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\web\http.py", line 1906, in allContentReceived
    req.requestReceived(command, path, version)
  File "C:\Python27\lib\site-packages\twisted\web\http.py", line 771, in requestReceived
    self.process()
  File "C:\Python27\lib\site-packages\twisted\web\server.py", line 190, in process
    self.render(resrc)
  File "C:\Python27\lib\site-packages\twisted\web\server.py", line 241, in render
    body = resrc.render(self)
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\scrapyrt\resources.py", line 24, in render
    result = resource.Resource.render(self, request)
  File "C:\Python27\lib\site-packages\twisted\web\resource.py", line 250, in render
    return m(request)
  File "C:\Python27\lib\site-packages\scrapyrt\resources.py", line 130, in render_GET
    return self.prepare_crawl(request_data, spider_data, **kwargs)
  File "C:\Python27\lib\site-packages\scrapyrt\resources.py", line 201, in prepare_crawl
    spider_name, spider_data, max_requests, *args, **kwargs)
  File "C:\Python27\lib\site-packages\scrapyrt\resources.py", line 210, in run_crawl
    dfd = manager.crawl(*args, **kwargs)
  File "C:\Python27\lib\site-packages\scrapyrt\core.py", line 156, in crawl
    dfd = self.crawler_process.crawl(self.spider_name, *args, **kwargs)
  File "C:\Python27\lib\site-packages\scrapyrt\core.py", line 80, in crawl
    cleanup_handler = setup_spider_logging(crawler.spider, self.settings)
  File "C:\Python27\lib\site-packages\scrapyrt\log.py", line 142, in setup_spider_logging
    handler = logging.FileHandler(filename, encoding=encoding)
  File "C:\Python27\lib\logging\__init__.py", line 913, in __init__
    StreamHandler.__init__(self, self._open())
  File "C:\Python27\lib\logging\__init__.py", line 945, in _open
    stream = codecs.open(self.baseFilename, self.mode, self.encoding)
  File "C:\Python27\lib\codecs.py", line 900, in open
    file = __builtin__.open(filename, mode, buffering)
exceptions.IOError: [Errno 22] invalid mode ('ab') or filename: u'E:\Py workspace\amazon_crawler\logs\asinspider\2017-03-21T19:23:40.156000.log'
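
An observation rather than something confirmed in the issue: the per-spider log filename embeds an ISO timestamp, and the colons in it are not valid characters in Windows filenames, which is exactly the kind of path that makes open() fail with errno 22 on Windows. A minimal illustration:

    # On Windows, ':' is not allowed in filenames, so this raises
    # IOError/OSError [Errno 22]:
    open("2017-03-21T19:23:40.log", "ab")

    # A colon-free name opens fine:
    open("2017-03-21T19-23-40.log", "ab").close()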

add support for request kwargs in GET handler

Looking here, there is a comment saying:

At the moment kwargs for scrapy request are not supported in GET. 
They are supported in POST handler.

This is inconvenient; we should probably add some way to support request kwargs in GET so that there is a consistent interface between POST and GET method calls.

Bug when installing scrapyrt

Hi,
When I use the command pip install scrapyrt it says I successfully installed scrapyrt-0.10, but when I run scrapyrt on the command line it shows "RuntimeError: Cannot find scrapy.cfg file".
I have already installed scrapy before; how can I solve this?
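
This usually just means scrapyrt was started outside a Scrapy project; as the README above notes, it has to be run from the directory that contains scrapy.cfg:

> cd path/to/your/scrapy/project   # the directory with scrapy.cfg in it
> scrapyrt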

Autothrottle extension

Hi,
scrapyrt works very well for me, but I would like to use it with the autothrottle extension. Why is it disabled by default? Is there an easy way to enable the extension?

Thanks,
Siavash

Where can I handle the returned JSON in the browser?

I wrote a Scrapy spider with scrapyrt. It extracts the data perfectly, and it shows in the browser at the given URL.
But how can I handle the data I got in the browser? I need to generate analytics from the JSON.
Can you please help me?

Return custom HTTP code from spider

Is there any way I can return a custom HTTP code and message to scrapyrt from my spider?

Let's say I'm trying to log in but have the wrong user and password. I check for the error text with XPath and want to return an HTTP 401 with a custom message.

Thanks for a lovely library.

Cheers

allow users to pass spider arguments via url

When running Scrapy from command line you can do:

> scrapy crawl foo_spider -a zipcode=10001

but this is NOT possible with ScrapyRT now. You cannot pass arguments to spiders; you can only pass arguments for the request. Adding support for these "command line" style arguments is not difficult to implement and seems important IMO.

You could simply pass

localhost:8050/crawl.json?spider=foo.spider&zipcode=10001&url=some_url

EDIT:
To clarify: we're talking about passing arguments to the API via the URL.

scrapyrt won't go beyond Starting factory <twisted.web.server.Site instance at xxxx>

(scrapyrt)root@ubuntu:~/project/scrapy_2088# scrapyrt
2016-05-11 18:46:39+0000 [-] Log opened.
2016-05-11 18:46:39+0000 [-] Site starting on 9080
2016-05-11 18:46:39+0000 [-] Starting factory <twisted.web.server.Site instance at 0x7fec8c6c9440>

It was working OK on my server; it stopped a few days ago and now it won't start.

I have not updated the version or anything... it seems like some cache/lock file is corrupted, or is it a bug?

cannot import name 'log' from 'scrapy'

I saw you merged in the log branch and saw the change on master, but when I pip install and run it, I still have this issue:

from scrapy import signals, log as scrapy_log
builtins.ImportError: cannot import name 'log' from 'scrapy' (/Users/hao.chen/Envs/spiders/lib/python3.7/site-packages/scrapy/__init__.py)

How do I install the newest version from pip?
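
Until a release containing the fix is published to PyPI, one option (assuming you are happy running master as-is) is to install straight from the repository:

> pip install --upgrade git+https://github.com/scrapinghub/scrapyrt.git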

return json in utf-8

My spider response data is returned in unicode. What needs to be done in order for it to be returned in utf-8?

Cheers 😄

Asynchronous running

(This should have a question label, however I can't seem to add it myself)

Hello, I've been taking a look at this project to handle running multiple, potentially long-running crawlers at once. Currently it seems that if you have a long-running crawler, the API will simply wait until it has finished before returning the response.

Is it possible to start a crawler in the background and then check the status/get the crawler result via the API?
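
ScrapyRT is deliberately synchronous (see the note in the README above about long-running crawls), so scheduling crawls in the background and polling their status is closer to what Scrapyd provides. A sketch of that workflow, assuming a default Scrapyd install on port 6800 and placeholder project/spider names:

> curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider
> curl "http://localhost:6800/listjobs.json?project=myproject"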

scrapyrt logging is a bit too noisy and not easy to customize

We should add a command line argument to allow customizing logging; this would be useful for running scrapyrt in production, but also good for apps that inherit from scrapyrt and add loggers of their own.

The argument could be called -v / --verbosity and accept an integer matching log levels.

Couldn't disable log or change log level in stdout

In my custom CrawlManager, I changed LOG_LEVEL and LOG_ENABLED.

class CrawlManager(ScrapyrtCrawlManager):

       ...

        def get_scrapyrt_settings(self):
            spider_settings = {
                "LOG_LEVEL": "INFO",
                "LOG_ENABLED": False,
                "LOG_FILE": None,
                "LOG_STDOUT": False,
                "EXTENSIONS": {
                    'scrapy.extensions.logstats.LogStats': None,
                    'scrapy.webservice.WebService': None,
                    'scrapy.extensions.telnet.TelnetConsole': None,
                    'scrapy.extensions.throttle.AutoThrottle': None
                }
            }
            return spider_settings

        def get_project_settings(self):
            custom_settings = self.get_scrapyrt_settings()
            return get_project_settings(custom_settings=custom_settings)

      ...

However, neither of them worked; I could still see logs and DEBUG output:

2018-12-20 09:56:44+1100 [scrapyrt] {'request': {'url': 'https://www.railexpress.com.au/brisbanes-cross-river-rail-will-feed-the-centre-at-the-expense-of-people-in-the-suburbs/', 'meta': {'test': False, 'show_rule': True}}, 'spider_name': 'news_scraper'}
WARNING:py.warnings:/Users/ubuntu/miniconda3/envs/streem/lib/python3.6/site-packages/scrapyrt/core.py:9: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  from scrapy import signals, log as scrapy_log

2018-12-20 09:56:44+1100 [scrapyrt] Created request for spider news_scraper with url https://www.railexpress.com.au/brisbanes-cross-river-rail-will-feed-the-centre-at-the-expense-of-people-in-the-suburbs/ and kwargs {'meta': {'test': False, 'show_rule': True}}
INFO:scrapy.crawler:Overridden settings: {'BOT_NAME': 'news_scraper', 'LOG_ENABLED': False, 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'news_scraper.spiders', 'SPIDER_MODULES': ['news_scraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage']
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
DEBUG:scrapy.core.engine:Crawled (200) <GET https://www.railexpress.com.au/brisbanes-cross-river-rail-will-feed-the-centre-at-the-expense-of-people-in-the-suburbs/> (referer: http://media.streem.com.au)
DEBUG:scrapy.core.scraper:Scraped from <200 https://www.railexpress.com.au/brisbanes-cross-river-rail-will-feed-the-centre-at-the-expense-of-people-in-the-suburbs/>
{'author': 'Industry Opinion',
 'body': 'OPINION: The rail project may well help get more commuters into the '
         '...',
 'detected_lang': {'code': 'en', 'confidence': 99.0},
 'language': 'en',
 'modified_at': None,
 'published_at': '2017-07-10T02:42:24+10:00',
 'title': 'Brisbane’s Cross River Rail will feed the centre at the expense of '
          'people in the suburbs',
 'url': 'https://www.railexpress.com.au/brisbanes-cross-river-rail-will-feed-the-centre-at-the-expense-of-people-in-the-suburbs/'}
INFO:scrapy.core.engine:Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/request_bytes': 426,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 19470,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 12, 19, 22, 56, 44, 571116),
 'item_scraped_count': 1,
 'log_count/INFO': 6,
 'log_count/WARNING': 1,
 'memusage/max': 57487360,
 'memusage/startup': 57487360,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 12, 19, 22, 56, 44, 277020)}
INFO:scrapy.core.engine:Spider closed (finished)
2018-12-20 09:56:44+1100 [-] "127.0.0.1" - - [19/Dec/2018:22:56:43 +0000] "POST /scrape HTTP/1.1" 200 7665 "-" "PostmanRuntime/7.4.0"
2018-12-20 09:57:44+1100 [-] Timing out client: IPv4Address(TCP, '127.0.0.1', 53118)

My goal is to remove all default logs from stdout, so I can use stdout to show only my own logs.
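
Looking at the pasted output, the INFO:scrapy.* and DEBUG:scrapy.* lines are emitted through Python's root logging configuration rather than through the Scrapy settings the custom CrawlManager overrides, so one hedged workaround is to raise the level on those loggers directly with the standard library (nothing scrapyrt-specific):

    import logging

    # Silence Scrapy's own loggers regardless of the Scrapy settings.
    logging.getLogger('scrapy').setLevel(logging.WARNING)
    # Optionally quiet twisted and captured warnings as well.
    logging.getLogger('twisted').setLevel(logging.WARNING)
    logging.getLogger('py.warnings').setLevel(logging.ERROR)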
