eventuallyc0nsistent / arachne
A flask API for running your scrapy spiders
Home Page: http://arachne.readthedocs.org/en/latest/
License: Other
I noticed that whenever I install Arachne using pip install Arachne, it installs Arachne==0.3.1. However, when I install it using pip install git+ssh://git@github.com/kirankoduru/arachne.git, it installs Arachne==0.4.0.
Arachne==0.3.1 isn't working for me, but Arachne==0.4.0 works just fine.
Hi, nice framework by the way. I have it set up. When I run the application and go to the URL on localhost to list the endpoints, it works, but when I try visiting an individual endpoint, say http://localhost:5000/spiders/konga, it throws a 404 Not Found error.
It would be nice if I could parse dynamic endpoints (in SPIDER_SETTINGS) like: 'endpoint': 'crawl/<string:account>'
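Flask's URL converters provide exactly this kind of dynamic segment. A minimal sketch of the requested behaviour using a plain Flask route; this illustrates the Flask feature only, it is not something Arachne currently supports, and the route and view names are made up:

```python
# Sketch of a dynamic endpoint via a Flask route converter.
# The '/crawl/<string:account>' rule and the 'crawl' view are
# illustrative, not part of Arachne's actual API.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/crawl/<string:account>')
def crawl(account):
    # The path segment arrives as the 'account' argument,
    # e.g. GET /crawl/konga -> account == 'konga'
    return jsonify({'endpoint': 'crawl', 'account': account})

# Exercise the route without starting a server:
client = app.test_client()
print(client.get('/crawl/konga').get_json())
```

Supporting this inside Arachne would mean registering such converter-style rules when it builds routes from SPIDER_SETTINGS.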
I made a small project with Arachne that parses some data and adds it to SQLite, and I want to make a webpage with some info based on the scraped data. Should I create a separate Flask app, or can I add a view function to Arachne somehow?
I have made a news scraping spider which stores items into an sqlite3 database. I've added the following to the settings.py file
SPIDER_SETTINGS = [
    {
        'endpoint': 'tech_news',
        'location': 'spiders.news_spider',
        'spider': 'TechSpider',
        'scrapy_settings': {
            'ITEM_PIPELINES': {
                'pipelines.NewsCrawlerPipeline': 300
            }
        }
    }
]
But the item pipeline doesn't seem to be working, as no data is being sent to the database.
The spider works perfectly fine when I run it through the terminal.
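For reference, a minimal sketch of what an SQLite item pipeline like the pipelines.NewsCrawlerPipeline referenced above might look like. The class name is taken from the settings; the table and column names are assumptions, not taken from the project:

```python
import sqlite3

class NewsCrawlerPipeline:
    """Hypothetical sketch of an SQLite Scrapy item pipeline.
    Table/column names ('articles', 'title', 'url') are assumptions."""

    def __init__(self, db_path='news.db'):
        self.db_path = db_path

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT)')

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item
        self.conn.execute(
            'INSERT INTO articles (title, url) VALUES (?, ?)',
            (item.get('title'), item.get('url')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

If a pipeline like this works from the terminal but not under Arachne, the pipeline itself is usually not the problem; the issue is whether the ITEM_PIPELINES setting actually reaches Scrapy.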
To allow inserting scrapy items into DB using the DB insertion pipeline
Endpoints are confusing for first-time users: /spiders and /run-spider/<spider-name> aren't intuitive, as expressed in #7.
I added this to my settings.py, but it doesn't work:
SPIDER_SETTINGS = [
    {
        'endpoint': 'dmoz',
        'location': 'spiders.dmoz',
        'spider': 'DmozSpider',
        'scrapy_settings': {
            'ITEM_PIPELINES': {
                'pipelines.AddTablePipeline': 500
            },
            'CLOSESPIDER_PAGECOUNT': 2
        }
    }
]
UPDATE: It looks like no other Scrapy setting works either, even ITEM_PIPELINES.
$ python app.py
2017-11-28 01:25:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-11-28 01:25:33 [scrapy.utils.log] INFO: Overridden settings: {}
2017-11-28 01:25:39 [py.warnings] WARNING: C:\Users\ghost\Google Диск\Active Projects\Contacts Parser Arachne\spiders\doska_orbita_co_il.py:3: ScrapyDeprecationWarning: Module `scrapy.linkextractor` is deprecated, use `scrapy.linkextractors` instead
from scrapy.linkextractor import LinkExtractor
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-11-28 01:25:39 [scrapy.core.engine] INFO: Spider opened
2017-11-28 01:25:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
The current version supports only Scrapy v0.24
To customize each spider's Scrapy settings in SPIDER_SETTINGS
Could you please highlight the fact that the application is not runnable with 'flask run' but with 'python app.py'?
Great job, Kiran. For a couple of days now I've been battling with the deployment of Flask with Arachne on DigitalOcean. First, here is my project structure on the server:
|--------FlaskApp
|----------------FlaskApp
|-----------------------static
|-----------------------templates
|-----------------------venv
|-----------------------app.py
|-----------------------etc
|----------------flaskapp.wsgi
Navigating to my IP, http://104.131.31.XXX, displays an internal server error, but if I run "sudo python app.py" and navigate to http://104.131.31.XXX:8888, it displays my site. I checked log/apache2/error.log for the error and found this:
'/var/www/FlaskApp/flaskapp.wsgi' cannot be loaded as Python module.
[Fri Dec 04 03:07:12.844591 2015] [:error] [pid 622] [client 169.255.236.76:17980] mod_wsgi (pid=622): Exception occurred processing WSGI script '/var/www/FlaskApp/flaskapp.wsgi'.
[Fri Dec 04 03:07:12.844640 2015] [:error] [pid 622] [client 169.255.236.76:17980] Traceback (most recent call last):
[Fri Dec 04 03:07:12.844688 2015] [:error] [pid 622] [client 169.255.236.76:17980] File "/var/www/FlaskApp/flaskapp.wsgi", line 8, in <module>
[Fri Dec 04 03:07:12.844822 2015] [:error] [pid 622] [client 169.255.236.76:17980] from app import app
[Fri Dec 04 03:07:12.844868 2015] [:error] [pid 622] [client 169.255.236.76:17980] ImportError: No module named app
here is my WSGI configuration
import sys
import logging
logging.basicConfig(stream=sys.stderr)
sys.path.insert(0,"/var/www/FlaskApp/")
from app import app as application
application.secret_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
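One detail worth checking, going by the project structure shown earlier: app.py sits in the inner FlaskApp directory, while this WSGI file only adds the outer /var/www/FlaskApp/ to sys.path, which would produce exactly the "No module named app" ImportError. A sketch of the extra path line that layout would need (the path is assumed from the tree above, not verified):

```python
import sys

# app.py lives one level deeper than the directory added above, so the
# inner package directory has to be importable too (path assumed from
# the project structure shown in this issue):
sys.path.insert(0, "/var/www/FlaskApp/FlaskApp/")
```

With that in place, 'from app import app' would resolve against /var/www/FlaskApp/FlaskApp/app.py.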
Apache config: /etc/apache2/sites-available/FlaskApp.conf
<VirtualHost *:80>
    ServerName 104.131.31.XXX
    ServerAdmin [email protected]
    WSGIScriptAlias / /var/www/FlaskApp/flaskapp.wsgi
    <Directory /var/www/FlaskApp/FlaskApp/>
        Order allow,deny
        Allow from all
    </Directory>
    Alias /static /var/www/FlaskApp/FlaskApp/static
    <Directory /var/www/FlaskApp/FlaskApp/static/>
        Order allow,deny
        Allow from all
    </Directory>
    ErrorLog ${APACHE_LOG_DIR}/error.log
    LogLevel warn
    CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>
Here is my app.py
from arachne import Arachne
from twisted.internet import reactor
from twisted.web.server import Site
from twisted.web.wsgi import WSGIResource

app = Arachne(__name__)
resource = WSGIResource(reactor, reactor.getThreadPool(), app)
site = Site(resource)
reactor.listenTCP(8888, site)

if __name__ == '__main__':
    reactor.run()
Please, what's missing?
@kirankoduru, please, how do I implement this concept with a Django application, and how do I declare multiple spiders in SPIDER_SETTINGS? Thanks.
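On the multiple-spiders part: SPIDER_SETTINGS, as shown in the other issues here, is a plain Python list, so declaring several spiders should just mean one dict per spider. A sketch; the endpoint names, locations, and spider classes are illustrative:

```python
# Hypothetical settings.py fragment declaring two spiders.
# Each entry in the list registers its own endpoint; names below
# are examples, not real spiders from this project.
SPIDER_SETTINGS = [
    {
        'endpoint': 'dmoz',
        'location': 'spiders.dmoz',
        'spider': 'DmozSpider',
    },
    {
        'endpoint': 'tech_news',
        'location': 'spiders.news_spider',
        'spider': 'TechSpider',
    },
]
```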
After I hit a request to a certain endpoint, I receive a JSON view. Is there a way for me to render an HTML file and return it as the response?
Hey @kirankoduru ,
How can I pass an arbitrary (dynamic) URL for the spider to scrape?
Could you please highlight the importance of a certain project structure in the documentation? It didn't work out for me when I was using the Scrapy-given project structure; it threw a 'SPIDER_SETTINGS not found' exception. Since Scrapy requires an identically named settings.py file, it would be great if you could rename that file; it may cause defects.
Hi @kirankoduru, I have just started using your module and I may be wrong about this. Here is the situation.
In my case I need to specify a pipeline, but since the SCRAPY_SETTINGS dictionary is empty in default_settings.py, the pipeline is never enabled.
I dug around a bit in your source code, and it seems like the get_spider_settings method in scrapy_utils.py only updates/sets a setting if it's already present in the SCRAPY_SETTINGS dict (line 66).
If I add an empty 'ITEM_PIPELINES' dict to SCRAPY_SETTINGS in default_settings.py, the pipeline gets enabled. I feel like there is a more elegant way of making it work; I will submit a PR when I get time.
Correct me if I am wrong.
EDIT: Just realized I am using Python 3.6 and Scrapy 1.5. Maybe the difference in versions could be causing this?
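The behaviour described in this issue can be reproduced in isolation. Below is a simplified stand-in for the merge logic, not Arachne's actual get_spider_settings source; the function and variable names only mirror the issue text:

```python
# Simplified stand-in for the behaviour described above: per-spider
# scrapy_settings only take effect for keys that already exist in the
# global SCRAPY_SETTINGS defaults.
def merge_spider_settings(defaults, spider_settings):
    merged = dict(defaults)
    for key, value in spider_settings.items():
        if key in merged:  # keys missing from the defaults are silently dropped
            merged[key] = value
    return merged

pipelines = {'ITEM_PIPELINES': {'pipelines.NewsCrawlerPipeline': 300}}

# With empty defaults the pipeline never survives the merge:
print(merge_spider_settings({}, pipelines))  # {}

# Seeding an empty ITEM_PIPELINES dict in the defaults (the workaround
# described above) lets the per-spider value through:
print(merge_spider_settings({'ITEM_PIPELINES': {}}, pipelines))
```

This also explains the "Overridden settings: {}" line in the Scrapy startup log quoted in the earlier issue.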