eventuallyc0nsistent / arachne
A flask API for running your scrapy spiders
Home Page: http://arachne.readthedocs.org/en/latest/
License: Other
I noticed that whenever I install Arachne using pip install Arachne, it installs Arachne==0.3.1. However, when I install it using pip install git+ssh://git@github.com/kirankoduru/arachne.git, it installs Arachne==0.4.0.
Arachne==0.3.1 isn't working for me, but Arachne==0.4.0 works just fine.
Hi, nice framework by the way. I have it set up. When I run the application and go to the URL on localhost to list the endpoints, it works, but when I try visiting an individual endpoint, say http://localhost:5000/spiders/konga, it throws a 404 Not Found error.
It would be nice if I could parse dynamic endpoints (in SPIDER_SETTINGS) like: 'endpoint': 'crawl/<string:account>'
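Flask's URL converters provide exactly this kind of dynamic segment. A minimal sketch of the requested behaviour using a plain Flask route; this illustrates the Flask feature only, it is not something Arachne currently supports, and the route and view names are made up:

```python
# Sketch of a dynamic endpoint via a Flask route converter.
# The '/crawl/<string:account>' rule and the 'crawl' view are
# illustrative, not part of Arachne's actual API.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/crawl/<string:account>')
def crawl(account):
    # The path segment arrives as the 'account' argument,
    # e.g. GET /crawl/konga -> account == 'konga'
    return jsonify({'endpoint': 'crawl', 'account': account})

# Exercise the route without starting a server:
client = app.test_client()
print(client.get('/crawl/konga').get_json())
```

Supporting this inside Arachne would mean registering such converter-style rules when it builds routes from SPIDER_SETTINGS.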
I made a small project with Arachne that parses some data and adds it to SQLite, and I want to make a webpage with some info based on the scraped data. Should I create a separate Flask app, or can I add a view function to Arachne somehow?
I have made a news scraping spider which stores items into an sqlite3 database. I've added the following to the settings.py file
SPIDER_SETTINGS = [
    {
        'endpoint': 'tech_news',
        'location': 'spiders.news_spider',
        'spider': 'TechSpider',
        'scrapy_settings': {
            'ITEM_PIPELINES': {
                'pipelines.NewsCrawlerPipeline': 300
            }
        }
    }
]
But the item pipeline doesn't seem to be working, as no data is being sent to the database.
The spider works perfectly fine when I run it through the terminal.
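For reference, a minimal sketch of what an SQLite item pipeline like the pipelines.NewsCrawlerPipeline referenced above might look like. The class name is taken from the settings; the table and column names are assumptions, not taken from the project:

```python
import sqlite3

class NewsCrawlerPipeline:
    """Hypothetical sketch of an SQLite Scrapy item pipeline.
    Table/column names ('articles', 'title', 'url') are assumptions."""

    def __init__(self, db_path='news.db'):
        self.db_path = db_path

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT)')

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item
        self.conn.execute(
            'INSERT INTO articles (title, url) VALUES (?, ?)',
            (item.get('title'), item.get('url')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

If a pipeline like this works from the terminal but not under Arachne, the pipeline itself is usually not the problem; the issue is whether the ITEM_PIPELINES setting actually reaches Scrapy.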
To allow inserting scrapy items into DB using the DB insertion pipeline
Endpoints are confusing for first-time users: /spiders and /run-spider/<spider-name> aren't intuitive, as expressed in #7.
I added this to my settings.py, but it doesn't work:
SPIDER_SETTINGS = [
    {
        'endpoint': 'dmoz',
        'location': 'spiders.dmoz',
        'spider': 'DmozSpider',
        'scrapy_settings': {
            'ITEM_PIPELINES': {
                'pipelines.AddTablePipeline': 500
            },
            'CLOSESPIDER_PAGECOUNT': 2
        }
    }
]
UPDATE: It looks like no other Scrapy setting works either, even ITEM_PIPELINES.
$ python app.py
2017-11-28 01:25:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-11-28 01:25:33 [scrapy.utils.log] INFO: Overridden settings: {}
2017-11-28 01:25:39 [py.warnings] WARNING: C:\Users\ghost\Google Диск\Active Projects\Contacts Parser Arachne\spiders\doska_orbita_co_il.py:3: ScrapyDeprecationWarning: Module `scrapy.linkextractor` is deprecated, use `scrapy.linkextractors` instead
from scrapy.linkextractor import LinkExtractor
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-11-28 01:25:39 [scrapy.core.engine] INFO: Spider opened
2017-11-28 01:25:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
The current version supports only Scrapy v0.24
To customize each spider's Scrapy settings in SPIDER_SETTINGS
Could you please highlight the fact that the application is not runnable with 'flask run' but with 'python app.py'?
Great job, Kiran. For a couple of days now I've been battling with the deployment of Flask with Arachne on DigitalOcean. First, here is my project structure on the server:
|--------FlaskApp
|----------------FlaskApp
|-----------------------static
|-----------------------templates
|-----------------------venv
|-----------------------app.py
|-----------------------etc
|----------------flaskapp.wsgi
Navigating to my IP, http://104.131.31.XXX, displays an internal server error, but if I run "sudo python app.py" and navigate to http://104.131.31.XXX:8888, it displays my site. I checked log/apache2/error.log for the error and found this:
'/var/www/FlaskApp/flaskapp.wsgi' cannot be loaded as Python module.
[Fri Dec 04 03:07:12.844591 2015] [:error] [pid 622] [client 169.255.236.76:17980] mod_wsgi (pid=622): Exception occurred processing WSGI script '/var/www/FlaskApp/flaskapp.wsgi'.
[Fri Dec 04 03:07:12.844640 2015] [:error] [pid 622] [client 169.255.236.76:17980] Traceback (most recent call last):
[Fri Dec 04 03:07:12.844688 2015] [:error] [pid 622] [client 169.255.236.76:17980] File "/var/www/FlaskApp/flaskapp.wsgi", line 8, in <module>
[Fri Dec 04 03:07:12.844822 2015] [:error] [pid 622] [client 169.255.236.76:17980] from app import app
[Fri Dec 04 03:07:12.844868 2015] [:error] [pid 622] [client 169.255.236.76:17980] ImportError: No module named app
here is my WSGI configuration
import sys
import logging
logging.basicConfig(stream=sys.stderr)
sys.path.insert(0,"/var/www/FlaskApp/")
from app import app as application
application.secret_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
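One detail worth checking, going by the project structure shown earlier: app.py sits in the inner FlaskApp directory, while this WSGI file only adds the outer /var/www/FlaskApp/ to sys.path, which would produce exactly the "No module named app" ImportError. A sketch of the extra path line that layout would need (the path is assumed from the tree above, not verified):

```python
import sys

# app.py lives one level deeper than the directory added above, so the
# inner package directory has to be importable too (path assumed from
# the project structure shown in this issue):
sys.path.insert(0, "/var/www/FlaskApp/FlaskApp/")
```

With that in place, 'from app import app' would resolve against /var/www/FlaskApp/FlaskApp/app.py.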
Apache config: /etc/apache2/sites-available/FlaskApp.conf
<VirtualHost *:80>
    ServerName 104.131.31.XXX
    ServerAdmin [email protected]
    WSGIScriptAlias / /var/www/FlaskApp/flaskapp.wsgi
    <Directory /var/www/FlaskApp/FlaskApp/>
        Order allow,deny
        Allow from all
    </Directory>
    Alias /static /var/www/FlaskApp/FlaskApp/static
    <Directory /var/www/FlaskApp/FlaskApp/static/>
        Order allow,deny
        Allow from all
    </Directory>
    ErrorLog ${APACHE_LOG_DIR}/error.log
    LogLevel warn
    CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>
Here is my app.py
from arachne import Arachne
from twisted.internet import reactor
from twisted.web.server import Site
from twisted.web.wsgi import WSGIResource

app = Arachne(__name__)
resource = WSGIResource(reactor, reactor.getThreadPool(), app)
site = Site(resource)
reactor.listenTCP(8888, site)

if __name__ == '__main__':
    reactor.run()
Please, what's missing?
@kirankoduru, please, how do I implement this concept with a Django application, and how do I declare multiple spiders in SPIDER_SETTINGS? Thanks.
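On the multiple-spiders part: SPIDER_SETTINGS, as shown in the other issues here, is a plain Python list, so declaring several spiders should just mean one dict per spider. A sketch; the endpoint names, locations, and spider classes are illustrative:

```python
# Hypothetical settings.py fragment declaring two spiders.
# Each entry in the list registers its own endpoint; names below
# are examples, not real spiders from this project.
SPIDER_SETTINGS = [
    {
        'endpoint': 'dmoz',
        'location': 'spiders.dmoz',
        'spider': 'DmozSpider',
    },
    {
        'endpoint': 'tech_news',
        'location': 'spiders.news_spider',
        'spider': 'TechSpider',
    },
]
```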
After I hit a request to a certain endpoint, I receive a JSON view. Is there a way for me to render an HTML file and return it as the response?
Hey @kirankoduru ,
How can I pass an arbitrary (dynamic) URL for the spider to scrape?
Could you please highlight the importance of a certain project structure in the documentation? It didn't work out for me when I was using the Scrapy-given project structure; it threw a 'SPIDER_SETTINGS not found' exception. Since Scrapy requires an identically named settings.py file, it would be great if you could rename that file; it may cause defects.
Hi @kirankoduru, I have just started using your module and I may be wrong about this. Here is the situation.
In my case I need to specify a pipeline, but since the SCRAPY_SETTINGS dictionary is empty in default_settings.py, the pipeline is never enabled.
I dug around a bit in your source code, and it seems like the get_spider_settings method in scrapy_utils.py only updates/sets a setting if it's already present in the SCRAPY_SETTINGS dict (line 66).
If I add an empty 'ITEM_PIPELINES' dict to SCRAPY_SETTINGS in default_settings.py, the pipeline gets enabled. I feel like there is a more elegant way of making it work; I will submit a PR when I get time.
Correct me if I am wrong.
EDIT: Just realized I am using Python 3.6 and Scrapy 1.5. Maybe the difference in versions could be causing this?
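The behaviour described in this issue can be reproduced in isolation. Below is a simplified stand-in for the merge logic, not Arachne's actual get_spider_settings source; the function and variable names only mirror the issue text:

```python
# Simplified stand-in for the behaviour described above: per-spider
# scrapy_settings only take effect for keys that already exist in the
# global SCRAPY_SETTINGS defaults.
def merge_spider_settings(defaults, spider_settings):
    merged = dict(defaults)
    for key, value in spider_settings.items():
        if key in merged:  # keys missing from the defaults are silently dropped
            merged[key] = value
    return merged

pipelines = {'ITEM_PIPELINES': {'pipelines.NewsCrawlerPipeline': 300}}

# With empty defaults the pipeline never survives the merge:
print(merge_spider_settings({}, pipelines))  # {}

# Seeding an empty ITEM_PIPELINES dict in the defaults (the workaround
# described above) lets the per-spider value through:
print(merge_spider_settings({'ITEM_PIPELINES': {}}, pipelines))
```

This also explains the "Overridden settings: {}" line in the Scrapy startup log quoted in the earlier issue.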