city-bureau / city-scrapers Goto Github PK

View Code? Open in Web Editor NEW

329.0 39.0 310.0 11.27 MB

Scrape, standardize and share public meetings from local government websites

Home Page: https://cityscrapers.org

License: MIT License

Python 41.25% Shell 0.03% HTML 58.72%

web-scraping python open-data scrapy city-scrapers

city-scrapers's Introduction

City Scrapers

Who are the City Bureau Documenters, and why do they want to scrape websites?

Public meetings are important spaces for democracy where any resident can participate and hold public figures accountable. City Bureau's Documenters program pays community members an hourly wage to inform and engage their communities by attending and documenting public meetings.

How does the Documenters program know when meetings are happening? It isn’t easy! These events are spread across dozens of websites, rarely in useful data formats.

That’s why City Bureau is working together with a team of civic coders to develop and coordinate the City Scrapers, a community open source project to scrape and store these meetings in a central database.

What are the City Scrapers?

The City Scrapers collect information about public meetings. Every day, the City Scrapers go out and fetch information about new meetings from the Chicago city council's website, the local school council's website, the Chicago police department's website, and many more. This happens automatically so that a person doesn't have to do it. The scrapers store all of the meeting information in a database for journalists at City Bureau to report on them.

Community members are also welcome to use this code to set up their own databases.

What can I learn from working on the City Scrapers?

A lot about the City of Chicago! What is City Council talking about this week? What are the local school councils, and what community power do they have? What neighborhoods is the police department doing outreach in? Who governs our water?

From building a scraper, you'll gain experience with:

how the web works (HTTP requests and responses, reading HTML)
writing functions and tests in Python
version control and collaborative coding (git and Github)
a basic data file format (JSON), working with a schema and data validation
problem-solving, finding patterns, designing robust code

How can I set up City Scrapers for my area?

This repo is focused on Chicago, but you can set up City Scrapers for your area by following the instructions in the City-Bureau/city-scrapers-template repo.

Community Mission

The City Bureau Labs community welcomes contributions from everyone. We prioritize learning and leadership opportunities for under-represented individuals in tech and journalism.

We hope that working with us will fill experience gaps (like using git/GitHub, working with data, or having your ideas taken seriously), so that more under-represented people will become decision-makers in both our community and Chicago’s tech and media scenes at large.

Ready to code with us?

Fill out this form to join our Slack channel and meet the community.
Read about how we collaborate and review our Code of Conduct.
Check out our documentation, and get started with Installation and Contributing a spider.

We ask all new contributors to start by writing a spider and its documentation or fixing a bug in an existing one in order to gain familiarity with our code and culture. Reach out on Slack for support if you need it.

Don't want to code?

Join our Slack channel (chatroom) to discuss ideas and meet the community!

A. We have ongoing conversations about what sort of data we should collect and how it should be collected. Help us make these decisions by commenting on issues with a non-coding label.

B. Research sources for public meetings. Answer questions like: Are we scraping events from the right websites? Are there local agencies that we're missing? Should events be updated manually or by a scraper? Triage events sources on these issues.

Support this work

This project is organized and maintained by City Bureau.

city-scrapers's People

Contributors

Stargazers

Watchers

Forkers

mwgalloway eads euniceylee ryanmasson jim csethna ryan-koch bergren2 ruthjohnson2350 dcldmartin maxine adebruine milti nickdevlin bonfirefan joshuarrrr easherma jacobroufa pjsier pauliebe randy7771026 jmattick meshulam chicagosound jackhowa nickjordan hancush ckwms63 rebecca-burwei aboodhk o-stovicek btkelly jessicamcinchak cande313 wildisle jmcbroom novellac withtwoemms mikecchan cherdeman citybureaulabs brettvanderwerff coraltint hyeness mglavey noahkconley ellissimani zizify alaaalhilly mkrump kappklot paddyboyscode thenoelman lzblack mbriley joegermuska myersjustinc macloo laurieskelly alonbee aeggenberger bdmorin maitrivasa ronakdedhia chin4thewin jjkohler kaigilly shiks07 nessapotamia abhijit-nimbalkar shrayo kaumudi indiadaniels dkori nathanderon thoo ssjavadpour ishabandi67 hellonewman nerdperks04 digitalexoh mrdiggles2 guesschess lilianhj lyles aop4 garrettmooney danielahuang pwprosap wingroove margsk jackycute robami opencleveland prairietribune jtotoole zarifmahmud jayqi kenlitton eidietrich

city-scrapers's Issues

pick documentation framework

Two top options imho:

Gitbook
Read The Docs

Leaning towards Gitbook.

Start versioning aggregator

one of the `test_status` functions should be `test_location`

SPIDER: City Colleges of Chicago Board of Trustees

geolocation of all addresses/meeting locations

failing gen_spider test

(documenters-aggregator) ➜  documenters-aggregator git:(master) ✗ pytest
======================================================================== test session starts =========================================================================
platform darwin -- Python 3.6.2, pytest-3.1.3, py-1.4.34, pluggy-0.4.0
rootdir: /Users/DE-Admin/Code/documenters-aggregator, inifile:
collected 310 items

tests/test_cchhs.py ..........................................................................................................................................................................................................................................................
tests/test_idph.py .......................................................
tests/test_tasks.py ....F

============================================================================== FAILURES ==============================================================================
_______________________________________________________________________ test_gen_html_content ________________________________________________________________________

    def test_gen_html_content():
        tasks._gen_html(SPIDER_NAME, SPIDER_START_URLS)
        test_file_content = read_test_file_content('files/testspider_articles.html.example')
        rendered_content = read_test_file_content('files/testspider_articles.html')
>       assert test_file_content == rendered_content
E       assert '<!doctype ht...ody>\n</html>' == '<!doctype htm...ody>\n</html>'
E         Skipping 3940 identical leading characters in diff, use -v to show
E         Skipping 210551 identical trailing characters in diff, use -v to show
E         - ed/common-5087336d1f748f6e2186-min.js"]; })(SQUARESPACE_ROLLUPS, 'squarespace-common');</script>
E         ?           ^ -----  ^^ ^^^  ^ ^
E         + ed/common-d8a982e40d16144e2580-min.js"]; })(SQUARESPACE_ROLLUPS, 'squarespace-common');</script>
E         ?           ^^^^^^^^   ^^ ^  ^ ^
E         - <script crossorigin="anonymous" src="//static.squarespace.com/universal/scripts-compressed/common-5087336d1f748f6e2186-min.js"></script><script>(function(rollups, name) { if (!ro...
E
E         ...Full output truncated (9 lines hidden), use '-vv' to show

tests/test_tasks.py:40: AssertionError
================================================================ 1 failed, 309 passed in 2.85 seconds ================================================================

build out contribution guide

abstract county spiders into a class

... or something

SPIDER: Chicago Board of Election Commissioners

Event format validator

I think it would be nice to have an event validator as a part of the pipeline in order to ensure that any data that gets downstream has all of the required fields in the correct formats.

document using spreadsheet better

especially using stuff via comment mode

package aggregator for pypi

proposal: add a code of conduct

I'd like to propose we implement a code of conduct for all project-related spaces and activities.

I'm partial to the generic code we developed at FreeGeek (and we stole a tremendous amount of it from the Geek Feminism anti-harassment policy).

cc @diaholliday

Link to documentation in README is a 404.

are there any events that could be collected from email lists

Suggested by @milti

scaffold out gitbook

globally disable robots.txt check

write spider creation docs

How to add public meetings for other municipalities?

Hey @diaholliday, @eads said you would be the one to ask about this. I'd like to contribute events to your aggregator from the Rockford area. Being a municipal area with one of the larger populations in Illinois, I'd like to get some action going around documenting public meetings here, as nothing like this currently exists. Ideally, I will be able to add agencies/orgs from not just Rockford proper but including the surrounding municipalities within Winnebago county.

What is the appropriate way to go about doing this? I don't want to step on toes and it looks as if your current spreadsheet has things almost exclusively limited to Chicago and Cook County. Should I add a new tab for Rockford Area, or add Rockford and related items to their own bordered section on the appropriate tabs that already exist?

I'm looking forward to contributing!

create FAQ

finish #54 then move on to this

Context for meetings

Hi all,

We've built out a few sites that seem somewhat similar to the aims of this project

We have found that these have not been widely used.

We suspect that the main barrier to participation is not knowing when and where meetings occur, but an understanding of

what takes place in the meeting
why a person might go to the meeting
how to engage in the meeting (as either as speaker or just an observer)
does going to this meeting feel safe

A comprehensive source for the time and place of public meetings can be useful, but I'm wondering if you all are thinking of addressing any of the other barriers?

what to do about files genspider tests downloads

@r-wei

The genspider tests leave new files in the tests directory, somewhat out of necessity. Not a big deal but worth noting.

we could:

ignore completely
remove when finished
do some mock stuff so they never get written

fake to test travis

SPIDER Office of the City Clerk (City Council)

Write airtable pipeline adapter

get rid of ipython dependency

Leverage OCD/Datamade scrapers and projects

This has come up a few times in various issues (#33, #5). We should evaluate the following existing projects and APIs and identify how we can leverage them to eliminate duplicated effort:

Open Civic Data has some scrapers already built that we should consider using:
- python-legistar-scraper
- opencivicdata/scrapers-us-municipal
Datamade has an API that is used to power Councilmatic, which displays Chicago City Council meetings.

Schedules in PDF format

Based on a discussion, we need to document the following process for developers in the documentation:

If a data source is difficult to scrape (data is in a PDF, image, etc.), notate that on the sources spreadsheet and don't attempt to write a scraper for it.

Tasks to close this issue:

Add documentation on how to handle difficult to scrape sources.

Original issue content

Not sure if we want to get into trying to parse PDFs, but the Chicago Housing Authority's meetings are posted as an image and PDF.

Is it worth trying to automate the import of this information? Or is is better to just have a list of sources that a human needs to enter into the system once per year?

SPIDER: Public Building Commission of Chicago

SPIDER: Cook County Dept of Public Health

Documentation Link Broken

The link to documentation in the readme file 404s.

Start with the documentation.

SPIDER: Chicago Public Library

invoke fails on windows

invoke runtests
You indicated pty=True, but your platform doesn't support the 'pty' module!

Unfortunately pseudo terminal handling is not available for Windows, and the pty=true flag prevents you from invoking runtests. Most people are probably not doing the frustrating thing of trying to run things in Windows like I am, but just case, I know one solution mentioned was to put in an os check.

OCD Spider: Cook County Board of Comissioners

Step 1: To access the data, guess the name of the client here: http://webapi.legistar.com/
The name of the client is not:
cookcounty
cook-county
cookcountyil
cookcountygov
cookcountygovil
cookcountyilgov

idph has a "special" date format for multi-day events

https://sadtrombone.com/

document data model

build out data model that all scrapers / aggregators should use

Plan for scraping ASP sites

The Cook County Clerk site uses JavaScript to load the details for events, but unfortunately, it uses ASP and therefore the AJAX requests are not as easy to fake as they would be in other web apps.

http://www.cookcountyclerk.com/countyboard/boardmeetings/Pages/default.aspx

There are probably other sites that we'll need to handle using a similar approach once we figure this one out.