
django-crawler's People

Contributors

acdha, chrisv2, dnordberg, ericholscher, miracle2k, mthornhill, zen4ever

django-crawler's Issues

Update for Django 1.3 staticfiles compatibility

Commit 28f5c8b avoids outright errors when crawling a site that uses Django 1.3's staticfiles package, but it gives us no way to verify that a static file actually exists. We could either use the same finders call that the serve view uses, simply to verify that a file can be located, or find some way to enable insecure=True before requesting static content.
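
A minimal sketch of the first option, assuming the crawler can map a linked URL to a path relative to STATIC_URL. The helper name static_file_exists and its parameters are hypothetical and not part of django-crawler; the only Django call used is django.contrib.staticfiles.finders.find, which returns a false value when no finder can locate the path.

```python
def static_file_exists(url, static_url="/static/", find=None):
    """Return True if `url` lives under `static_url` and a finder can locate it.

    `find` is a callable taking a path relative to STATIC_URL. It defaults to
    Django's staticfiles finder; passing it in explicitly is an assumption made
    here so the helper can be exercised without a configured project.
    """
    if not url.startswith(static_url):
        # Not a static URL at all, so there is nothing for the finders to check.
        return False
    if find is None:
        # Imported lazily: this requires Django settings to be configured.
        from django.contrib.staticfiles import finders
        find = finders.find
    relative_path = url[len(static_url):]
    # finders.find() returns the absolute path on success, a false value otherwise.
    return bool(find(relative_path))
```

With something like this, a link to /static/admin/css/base.css could be checked against the finders instead of being requested through the test client.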

Unable to crawl /admin URLs

Hi,

I run django-crawler as:

python manage.py crawl --auth=login:sysdba,password:XXX /admin/bpp

The password is correct; I double-checked it.

This is the output I get:

crawler [INFO] base: Log in with login: sysdba, password: ********************
crawler [WARNING] base: /admin/bpp links to /static/admin/css/base.css, which returned HTTP status 404
crawler [WARNING] base: /admin/bpp links to /static/admin/css/login.css, which returned HTTP status 404
crawler [WARNING] base: /admin/bpp links to /static/admin_tools/css/theming.css, which returned HTTP status 404
crawler [INFO] time_plugin: http://testserver/admin/bpp/ took 0.156000
crawler [INFO] time_plugin: /static/admin/css/base.css took 0.016000
crawler [INFO] time_plugin: /static/admin_tools/css/theming.css took 0.015000
crawler [INFO] time_plugin: /static/admin/css/login.css took 0.000000
crawler [INFO] time_plugin: /admin/bpp took 0.000000
make: *** [crawl] Error 1

My INSTALLED_APPS looks like this:

INSTALLED_APPS = (
    'south',
    'admin_tools',
    'admin_tools.theming',
    'admin_tools.menu',
    'admin_tools.dashboard',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'django.contrib.admin',
    'django_jenkins',
    'bpp',
    'debug_toolbar',
    'werkzeug_debugger_runserver',
    'crawler'
)
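
For reference, the --auth value in the command above looks like a comma-separated list of key:value pairs. The sketch below shows one way such a string could be turned into credentials; the helper name parse_auth is hypothetical, and this is not necessarily how django-crawler itself parses the option.

```python
def parse_auth(value):
    """Turn 'key:value,key:value' into a dict of credentials.

    Assumption: pairs are separated by commas, so a value containing a comma
    cannot be expressed in this format.
    """
    credentials = {}
    for pair in value.split(","):
        # Split on the first colon only, so a password may itself contain ':'.
        key, _, secret = pair.partition(":")
        credentials[key.strip()] = secret
    return credentials
```

The resulting dict could then be passed as keyword arguments to the Django test client's login() method, which forwards them to the configured authentication backends.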

New plugin Saver

To create a static version of the site, I propose adding this plugin:
# coding: utf8
import os

from crawler.plugins import Plugin


class Saver(Plugin):
    """Save every successful response under self.output_dir."""

    def post_request(self, sender, response, url=None, **kwargs):
        if response.status_code != 200:
            return
        is_html = response['Content-Type'].startswith('text/html')
        # Map the URL path onto the output directory (drop the leading slash).
        path = os.path.join(self.output_dir, url.lstrip('/'))
        if is_html:
            # Pages are stored as <path>/index.html.
            dirname, basename = path, 'index.html'
        else:
            dirname, basename = os.path.split(path)
        if not os.path.isdir(dirname):
            os.makedirs(dirname)
        # Binary mode, so images and other non-text assets are written unchanged.
        with open(os.path.join(dirname, basename), 'wb') as f:
            f.write(response.content)

PLUGIN = Saver

Thanks!
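
The URL-to-file mapping that the plugin above performs can be tried on its own. This is a minimal sketch, assuming URLs are site-relative paths without query strings; the function name target_path is hypothetical and not part of django-crawler.

```python
import os


def target_path(output_dir, url, content_type):
    """Return the file path under `output_dir` where a response would be saved."""
    relative = url.lstrip("/")
    if content_type.startswith("text/html"):
        # HTML pages are stored as <url path>/index.html.
        return os.path.join(output_dir, relative, "index.html")
    # Other assets keep their own file name.
    return os.path.join(output_dir, relative)
```

For example, target_path("out", "/about/", "text/html") points at index.html inside out/about, while a stylesheet keeps its original name. URLs with query strings would need extra handling before being used as file names.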
