City Scrapers St. Louis

CI Cron Build

What are the City Scrapers and why do we want them?

Public meetings are important spaces for democracy where any resident can participate and hold public figures accountable. But how does the public know when meetings are happening? It isn’t easy! These events are spread across dozens of websites, rarely in useful data formats.

Our Mission

The mission of the City Scrapers project is to increase access and transparency around public meetings across St. Louis County by making it easier for everyone to know when and where public meetings are held.

All of the meetings gathered by our spiders can be viewed here!

What can I learn from working on the City Scrapers?

A lot about the City of St. Louis (and other municipalities of the Greater St. Louis area)! What is City Council talking about this week? What are the local school councils, and what community power do they have? What neighborhoods is the police department doing outreach in? Who governs our water?

From building a scraper, you'll gain experience with:

  • How the web works (HTTP requests and responses, reading HTML)
  • Writing functions and tests in Python
  • Version control and collaborative coding (Git and GitHub)
  • A basic data file format (JSON), working with a schema and data validation
  • Problem-solving, finding patterns, designing robust code
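To give a taste of the schema and data-validation bullet above, here is a minimal sketch of checking a scraped meeting record against a schema. The field names are simplified placeholders; the real City Scrapers schema defines many more fields and stricter rules.

```python
import json

# Simplified required fields; the real City Scrapers schema is richer.
REQUIRED_FIELDS = {"title", "start", "location"}

def validate_meeting(record):
    """Return a list of schema problems for a scraped meeting dict."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if "start" in record and not isinstance(record["start"], str):
        errors.append("start must be an ISO 8601 string")
    return errors

# A record fresh off the wire, parsed from JSON; "location" is missing.
meeting = json.loads('{"title": "Board of Aldermen", "start": "2021-03-01T18:00:00"}')
print(validate_meeting(meeting))
```

Running checks like this in tests is what catches a website redesign before bad data reaches the public listings.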

Contributing

We welcome both coders and non-coders to help out with our project!

  1. Fill out this form to join our Slack channel and meet the community!
  2. Read about how we collaborate for more details.

Don't see your local public meetings?

Fill out this form to join our Slack channel! We love hearing from the community and learning about how we can better serve our city.

If there are any public meetings that you would like us to create a scraper for, please fill out this form to make a request.

When reviewing scraper requests, we might consider things such as:

  • Are these one-off meetings or recurring?
  • If they are one-off meetings, do we expect more in the future to be announced using a similar structure?
  • Is there historical data that could also be scraped using the same spider and might that be useful?
  • What is the estimated time and effort to write the scraper versus entering meetings manually? (For example, if it takes 2-3 minutes to manually enter a single meeting and there are x meetings, how does that compare with the time needed to write the scraper?)
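The last consideration above is a simple break-even calculation. A sketch, using made-up numbers (the 2.5 minutes per meeting and 4-hour scraper estimate are purely illustrative):

```python
def manual_entry_hours(num_meetings, minutes_per_meeting=2.5):
    """Total hours to hand-enter meetings at roughly 2-3 minutes each."""
    return num_meetings * minutes_per_meeting / 60

def scraper_pays_off(num_meetings, scraper_hours, minutes_per_meeting=2.5):
    """True when writing the scraper costs less time than manual entry."""
    return scraper_hours < manual_entry_hours(num_meetings, minutes_per_meeting)

# A hypothetical 4-hour scraper vs. 100 recurring meetings of manual entry
print(scraper_pays_off(100, scraper_hours=4))
```

Recurring meetings keep accruing, so a scraper that loses the comparison today often wins it within a year.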

Notes

This project is based on a template repository provided by City Bureau. You can read more about what they do at citybureau.org.

We would also like to thank Pat at City Bureau for his patience and help setting up this project.

city-scrapers-stl's People

Contributors

bchao99, ledaliang, pjsier, vishalvishw10


city-scrapers-stl's Issues

Archive scraped pages and documents on the Wayback Machine

Great work getting this up and running so quickly! One of the aspects of the City Scrapers project we haven't documented very well is our use of a Python package we created, scrapy-wayback-middleware.

The overall goal of the City Scrapers project is to improve transparency and create an archive not just of upcoming meetings, but past meetings and related documents as well as how they change over time. An important part of that for us has been archiving (almost) every page and document we scrape on the Internet Archive's Wayback Machine as well as in our static output.

Having a second, more public and accessible location makes the meeting information more available regardless of how long the project goes. We've even used it to track potential violations of open meetings laws, since it provides an external source for seeing what content was or was not on a website at a given time. Here's an example of snapshots of the Chicago Plan Commission's website over time.

The downside of this approach is that it can make cron builds take significantly longer, but we're currently well under the 6-hour GitHub Actions time limit even with over 100 scrapers on the main City Scrapers repo.

If you're interested, you can add scrapy-wayback-middleware as a dependency, and then you'll likely want to subclass the middleware so that it also archives any documents you find, as we've done in our main middleware.py. Then you can add it in your settings/prod.py like we did in our settings.

We're only activating it when the WAYBACK_ENABLED environment variable is set, and the template cron.yml file already sets it, so once the middleware is added in your settings file you should be good to go!
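The environment-variable gating described above might look something like the sketch below. The middleware class path and the priority value are illustrative placeholders, not the project's actual configuration; check the main City Scrapers settings and middleware.py for the real values.

```python
# settings/prod.py — sketch only; the middleware path and priority
# below are hypothetical, not the project's real configuration.
import os

# Only archive to the Wayback Machine when the cron workflow opts in.
WAYBACK_ENABLED = os.getenv("WAYBACK_ENABLED") is not None

if WAYBACK_ENABLED:
    DOWNLOADER_MIDDLEWARES = {
        # A subclass of scrapy_wayback_middleware.WaybackMiddleware that
        # also submits scraped documents (see the project's middleware.py)
        "city_scrapers.middleware.CityScrapersWaybackMiddleware": 950,
    }
```

Gating on the environment variable keeps local development runs fast, since only the scheduled production builds hit the Wayback Machine.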

Let me know if you have any questions, and I'm happy to put in a PR for this if it's helpful.
