
opengazettes_sl_scrapy's Introduction

Open Gazettes SL Scraper

Installation

  • Clone repo and cd into it
  • Make virtual environment
  • pip install -r requirements.txt
  • Set ENV variables
    • SCRAPY_AWS_ACCESS_KEY_ID - Get this from AWS
    • SCRAPY_AWS_SECRET_ACCESS_KEY - Get this from AWS
    • SCRAPY_FEED_URI=s3://name-of-bucket-here/gazettes/gazete_index.jsonlines - Where the jsonlines output of each crawl is saved. This can also be a local path
    • SCRAPY_FILES_STORE=s3://name-of-bucket-here/gazettes - Where the scraped gazettes are stored. This can also be a local path
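
Before starting a crawl it can help to confirm that the variables above are actually set. The following is a minimal sketch, not part of the project: the helper name missing_settings is hypothetical, and it assumes that both location variables are always needed and that the AWS credentials only matter when a location is an s3:// URI.

```python
import os

# Variable names as used in this README, with the SCRAPY_ prefix.
LOCATION_SETTINGS = ("SCRAPY_FEED_URI", "SCRAPY_FILES_STORE")
AWS_SETTINGS = ("SCRAPY_AWS_ACCESS_KEY_ID", "SCRAPY_AWS_SECRET_ACCESS_KEY")


def missing_settings(env=None):
    """Return the names of expected variables that are unset or empty.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    names = list(LOCATION_SETTINGS)
    # Assumption: AWS credentials are only needed for s3:// locations.
    if any(env.get(name, "").startswith("s3://") for name in LOCATION_SETTINGS):
        names += AWS_SETTINGS
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing ENV variables: " + ", ".join(missing))
    else:
        print("All expected ENV variables are set.")
```

Run it from the same shell session in which the variables were exported, so that it sees the same environment the crawl will use.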

Running it Locally

  • To run the spider locally and store the scraped files on disk, set the ENV variable SCRAPY_FILES_STORE=/directory/to/store/the/files, which should point to a local folder
  • Then run the command scrapy crawl sl_gazettes -a year=2016 -o sl_gazettes.jsonlines
    where year is the year you want to scrape gazettes from
  • sl_gazettes.jsonlines is the file where crawl results are saved; this can also be a path inside another directory
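
Once a crawl has finished, the output file can be inspected with a few lines of Python. This is a minimal sketch that relies only on the jsonlines format (one JSON object per line). The helper names read_jsonlines and count_records are hypothetical, and nothing is assumed about the fields the spider writes.

```python
import json


def read_jsonlines(path):
    """Yield one parsed record per non-empty line of a jsonlines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(path):
    """Count the records in a jsonlines file without loading it all at once."""
    return sum(1 for _ in read_jsonlines(path))


# Example usage after the crawl command above:
#     print(count_records("sl_gazettes.jsonlines"))
```

Reading line by line keeps memory use flat even when a year of gazettes produces a large index file.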

Deploying to Scrapinghub

It is recommended that you deploy your crawler to Scrapinghub for easy management. Follow these steps:

  • Sign up for a free Scrapinghub account
  • Install shub locally using pip install shub
  • shub login
  • shub deploy
  • Log in to Scrapinghub and set up the ENV variables listed above. Note that on Scrapinghub, environment variables should not have the SCRAPY_ prefix
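
The renaming in the last step can be sketched as follows. This is an illustrative helper only, not a Scrapinghub or shub API: the function names are hypothetical, and it simply strips the SCRAPY_ prefix so the resulting names can be entered in the Scrapinghub settings.

```python
PREFIX = "SCRAPY_"


def scrapinghub_name(local_name):
    """Drop the SCRAPY_ prefix from a local ENV variable name, if present."""
    if local_name.startswith(PREFIX):
        return local_name[len(PREFIX):]
    return local_name


def scrapinghub_settings(local_env):
    """Map SCRAPY_-prefixed variables to their unprefixed names.

    Variables without the prefix are ignored.
    """
    return {
        scrapinghub_name(name): value
        for name, value in local_env.items()
        if name.startswith(PREFIX)
    }


# Example usage:
#     import os
#     for name in sorted(scrapinghub_settings(os.environ)):
#         print(name)
```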

Installing scrapy-deltafetch on macOS

  • brew install berkeley-db
  • export YES_I_HAVE_THE_RIGHT_TO_USE_THIS_BERKELEY_DB_VERSION=1
  • BERKELEYDB_DIR=$(brew --cellar)/berkeley-db/6.2.23 pip install bsddb3. Replace 6.2.23 with the version of berkeley-db that you installed
  • pip install scrapy-deltafetch
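
To confirm that the steps above worked, the two packages can be looked up by their import names, bsddb3 and scrapy_deltafetch. This is a minimal sketch with a hypothetical helper name. It only checks that Python can locate the modules, not that Berkeley DB itself works.

```python
import importlib.util

# Import names of the two packages installed above.
MODULES = ("bsddb3", "scrapy_deltafetch")


def is_importable(module_name):
    """Return True if Python can locate the module in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in MODULES:
        status = "ok" if is_importable(name) else "missing"
        print(f"{name}: {status}")
```

Run it inside the same virtual environment used for the crawl, since the check only covers the active interpreter.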

opengazettes_sl_scrapy's People

Contributors

gathuboswell, boswellgathu, davidlemayian
