
arachne's People

Contributors

ankurdedania, eventuallyc0nsistent, kirankoduru

arachne's Issues

pip install Arachne installs older version

I noticed that whenever I install Arachne using pip install Arachne, it installs Arachne==0.3.1. However, when I install it using pip install git+ssh://git@github.com/kirankoduru/arachne.git, it installs Arachne==0.4.0.

Arachne==0.3.1 isn't working for me, but Arachne==0.4.0 works just fine.

404 not found

Hi, nice framework by the way. I have it set up, and when I run the application and go to the URL on localhost that lists the endpoints, it works. But when I try visiting an individual endpoint, say http://localhost:5000/spiders/konga, it throws a 404 Not Found error.
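
For a 404 on an individual endpoint, the first thing to check is that the spider is actually registered under that name. A hedged sketch of what the registration would need to look like for /spiders/konga to resolve, assuming the endpoint key in SPIDER_SETTINGS maps directly to the /spiders/<endpoint> path; spiders.konga and KongaSpider are hypothetical names derived from the URL above:

# settings.py - minimal sketch; names are assumptions based on this issue
SPIDER_SETTINGS = [
    {
        'endpoint': 'konga',            # served at /spiders/konga
        'location': 'spiders.konga',    # importable module containing the spider
        'spider': 'KongaSpider',        # spider class name inside that module
    }
]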

Dynamic-endpoint support

It would be nice if I could pass dynamic endpoints (in SPIDER_SETTINGS) like: 'endpoint': 'crawl/<string:account>'
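
For reference, 'crawl/<string:account>' is Flask's URL converter syntax; whether Arachne forwards endpoint strings to Flask's router this way is exactly what this issue is requesting. A minimal Flask-only sketch of the syntax (the crawl route and account parameter mirror the example above):

from flask import Flask

app = Flask(__name__)

# <string:account> is a Flask URL converter: the matched path segment is
# handed to the view function as the 'account' argument
@app.route('/crawl/<string:account>')
def crawl(account):
    return 'would crawl account: %s' % account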

Is that possible to add my own view to arachne?

I made a small project with Arachne that parses some data and adds it to SQLite, and I want to make a webpage showing some info based on the scraped data. Should I create a separate Flask app, or can I add a view function to Arachne somehow?
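
Arachne appears to be built on Flask, so registering an extra view on the app object may be enough; a hedged sketch, where the /report route, scraped.db, and the items table are all hypothetical names:

import sqlite3

from arachne import Arachne
from flask import jsonify

app = Arachne(__name__)

# Assuming Arachne extends flask.Flask, ordinary route registration should
# work alongside the generated /spiders endpoints
@app.route('/report')
def report():
    conn = sqlite3.connect('scraped.db')
    rows = conn.execute('SELECT * FROM items').fetchall()
    conn.close()
    return jsonify(items=[list(row) for row in rows])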

Item_pipeline not working

I have made a news-scraping spider which stores items in an sqlite3 database. I've added the following to the settings.py file:
SPIDER_SETTINGS = [
    {
        'endpoint': 'tech_news',
        'location': 'spiders.news_spider',
        'spider': 'TechSpider',
        'scrapy_settings': {
            'ITEM_PIPELINES': {
                'pipelines.NewsCrawlerPipeline': 300
            }
        }
    }
]

But the ITEM_PIPELINES setting doesn't seem to be working, as no data is being sent to the database.
The spider is working perfectly fine when I run it through the terminal.

CLOSESPIDER_PAGECOUNT Setting doesn't work for me

I added this to my settings.py, but it doesn't work:

SPIDER_SETTINGS = [
    {
        'endpoint': 'dmoz',
        'location': 'spiders.dmoz',
        'spider': 'DmozSpider',
        'scrapy_settings': {
            'ITEM_PIPELINES': {
                'pipelines.AddTablePipeline': 500
            },
            'CLOSESPIDER_PAGECOUNT': 2
        }
    }
]

UPD: It looks like no other Scrapy setting works either, not even ITEM_PIPELINES. The log below confirms it: Overridden settings: {} and an empty item pipelines list.

$ python app.py
2017-11-28 01:25:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-11-28 01:25:33 [scrapy.utils.log] INFO: Overridden settings: {}
2017-11-28 01:25:39 [py.warnings] WARNING: C:\Users\ghost\Google Диск\Active Projects\Contacts Parser Arachne\spiders\doska_orbita_co_il.py:3: ScrapyDeprecationWarning: Module `scrapy.linkextractor` is deprecated, use `scrapy.linkextractors` instead
  from scrapy.linkextractor import LinkExtractor

2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-11-28 01:25:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-11-28 01:25:39 [scrapy.core.engine] INFO: Spider opened
2017-11-28 01:25:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
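
If the cause is the merge behaviour described in the 'Default Settings: Empty SCRAPY_SETTINGS dict.' issue further down (per-spider scrapy_settings are only applied for keys already present in the app-level SCRAPY_SETTINGS), then the stopgap reported there - declaring the missing keys in arachne's default_settings.py - would be expected to help here too. A sketch of that edit, with the caveat that patching an installed package is a stopgap, not a fix:

# arachne/default_settings.py - stopgap sketch based on the issue below:
# declare the keys you need so the per-spider scrapy_settings can update them
SCRAPY_SETTINGS = {
    'ITEM_PIPELINES': {},
    'CLOSESPIDER_PAGECOUNT': 0,
}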

Flask + Arachne + Scrapy DigitalOcean Deployment problem

Great job Kiran. For a couple of days now I've been battling with the deployment of Flask with Arachne on DigitalOcean. First, here is my project structure on the server:

|--------FlaskApp
|----------------FlaskApp
|-----------------------static
|-----------------------templates
|-----------------------venv
|-----------------------app.py
|-----------------------etc
|----------------flaskapp.wsgi

Navigating to my IP, http://104.131.31.XXX, displays an internal server error, but if I run sudo python app.py and navigate to http://104.131.31.XXX:8888, it displays my site. I checked log/apache2/error.log for the error and found this:

'/var/www/FlaskApp/flaskapp.wsgi' cannot be loaded as Python module.
[Fri Dec 04 03:07:12.844591 2015] [:error] [pid 622] [client 169.255.236.76:17980] mod_wsgi (pid=622): Exception occurred processing WSGI script '/var/www/FlaskApp/flaskapp.wsgi'.
[Fri Dec 04 03:07:12.844640 2015] [:error] [pid 622] [client 169.255.236.76:17980] Traceback (most recent call last):
[Fri Dec 04 03:07:12.844688 2015] [:error] [pid 622] [client 169.255.236.76:17980]   File "/var/www/FlaskApp/flaskapp.wsgi", line 8, in <module>
[Fri Dec 04 03:07:12.844822 2015] [:error] [pid 622] [client 169.255.236.76:17980]     from app import app
[Fri Dec 04 03:07:12.844868 2015] [:error] [pid 622] [client 169.255.236.76:17980] ImportError: No module named app

Here is my WSGI configuration:

import sys
import logging
logging.basicConfig(stream=sys.stderr)
sys.path.insert(0,"/var/www/FlaskApp/")

from app import app as application 
application.secret_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

Apache config, /etc/apache2/sites-available/FlaskApp.conf:

<VirtualHost *:80>
                ServerName 104.131.31.XXX
                ServerAdmin [email protected]
                WSGIScriptAlias / /var/www/FlaskApp/flaskapp.wsgi
                <Directory /var/www/FlaskApp/FlaskApp/>
                        Order allow,deny
                        Allow from all
                </Directory>
                Alias /static /var/www/FlaskApp/FlaskApp/static
                <Directory /var/www/FlaskApp/FlaskApp/static/>
                        Order allow,deny
                        Allow from all
                </Directory>
                ErrorLog ${APACHE_LOG_DIR}/error.log
                LogLevel warn
                CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>

Here is my app.py:

from arachne import Arachne
from twisted.internet import reactor
from twisted.web.server import Site
from twisted.web.wsgi import WSGIResource

app = Arachne(__name__)

resource = WSGIResource(reactor, reactor.getThreadPool(), app)
site = Site(resource)
reactor.listenTCP(8888, site)

if __name__ == '__main__':
    reactor.run()

Please, what's missing?
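
One likely culprit, judging from the tree above: app.py lives in /var/www/FlaskApp/FlaskApp/, but the WSGI file only puts /var/www/FlaskApp/ on sys.path, so from app import app has nothing to import. A hedged sketch of the corrected flaskapp.wsgi:

import sys
import logging

logging.basicConfig(stream=sys.stderr)
# app.py sits in the inner FlaskApp directory, so that directory - not its
# parent - is what 'from app import app' needs on sys.path
sys.path.insert(0, "/var/www/FlaskApp/FlaskApp/")

from app import app as application
application.secret_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

Note also that reactor.listenTCP(8888, site) in app.py runs at import time; under mod_wsgi only the WSGI application object should be exposed, so the Twisted serving code would likely need to be kept out of the module Apache imports.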

Rendered-HTML Views as response

After I hit a certain endpoint, I receive a JSON view. Is there a way for me to render an HTML file and return it as the response?
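
Assuming Arachne exposes Flask's routing and templating (see the view-function question above), a custom route could return rendered HTML alongside the generated JSON endpoints. A minimal sketch, where /results-page and templates/results.html are hypothetical names:

from arachne import Arachne
from flask import render_template

app = Arachne(__name__)

# Hedged sketch: a separate route that returns rendered HTML instead of the
# JSON that the generated /spiders endpoints produce
@app.route('/results-page')
def results_page():
    return render_template('results.html')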

Documentation issue: Project structure

Could you please highlight the importance of the expected project structure in the documentation? It didn't work for me when I used the Scrapy-generated project structure; it threw a 'SPIDER_SETTINGS not found' exception. Since Scrapy requires an identically named settings.py file, it would be great if you could rename that file, as the collision may cause defects.
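
A hedged sketch of a layout consistent with the SPIDER_SETTINGS examples elsewhere in this thread - a flat app.py plus settings.py next to a spiders package - with file names other than those two being illustrative:

|--------project
|----------------app.py
|----------------settings.py (defines SPIDER_SETTINGS)
|----------------spiders
|------------------------__init__.py
|------------------------news_spider.py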

Default Settings: Empty SCRAPY_SETTINGS dict.

Hi @kirankoduru, I have just started using your module and I may be wrong about this. Here is the situation.

In my case I need to specify a pipeline, but since the SCRAPY_SETTINGS dictionary is empty in default_settings.py, the pipeline is never enabled.

I dug around a bit in your source code, and it seems like the get_spider_settings method in scrapy_utils.py only updates/sets a setting if its key is already present in the SCRAPY_SETTINGS dict (line 66).

If I add an empty 'ITEM_PIPELINES' dict to SCRAPY_SETTINGS in default_settings.py, the pipeline gets enabled. I feel like there is a more elegant way of making it work; I will submit a PR when I get time.

Correct me if I am wrong.

EDIT - Just realized I am using Python 3.6 and Scrapy 1.5. Maybe the difference in versions could be causing this?
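
For what it's worth, a hedged sketch of the 'more elegant way' hinted at above: merge the per-spider dict unconditionally instead of only updating keys that already exist. The signature is a guess; only the names get_spider_settings, SCRAPY_SETTINGS, and scrapy_settings come from the issue text.

from scrapy.settings import Settings

# Sketch only - the real get_spider_settings in scrapy_utils.py may take
# different arguments; the point is that setdict() applies both layers, so
# keys missing from the default SCRAPY_SETTINGS still get through
def get_spider_settings(default_scrapy_settings, per_spider_settings):
    settings = Settings()
    settings.setdict(default_scrapy_settings or {})  # app-level defaults
    settings.setdict(per_spider_settings or {})      # per-spider overrides
    return settings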
