invinst / chicago-police-data

a collection of public data re: CPD officers involved in police encounters

Home Page: https://invisible.institute/police-data

Python 3.14% Makefile 0.17% JavaScript 4.27% CSS 0.01% HTML 86.08% Jupyter Notebook 6.33% Shell 0.01% Dockerfile 0.01%

chicago-police-data's Issues

Condense the May data further

I haven't looked at April and February as much, but the May data can be condensed to roughly 10% of its current volume by simply eliminating the duplicate rows created by each IPRA officer on an investigation. An idea of how I formatted my condensed May data can be found here under the Condensed sheet. Note that I managed to condense the cases from ~25,000 to ~2600 just by eliminating the IPRA duplicate data. The reason that this could be useful is it allows people without coding experience to sift through the data better, increasing the ease of access.
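A minimal sketch of that condensing step in Python, assuming the duplicated rows differ only in the investigator columns. The column names `Complaint_Number`, `Officer_Name`, and `Investigator_Name` are hypothetical stand-ins, not necessarily the May file's real headers.

```python
def condense(rows, investigator_columns):
    """Keep one row per distinct combination of the non-investigator fields."""
    seen = set()
    condensed = []
    for row in rows:
        # Everything except the investigator columns identifies the record.
        kept = {k: v for k, v in row.items() if k not in investigator_columns}
        key = tuple(sorted(kept.items()))
        if key not in seen:
            seen.add(key)
            condensed.append(kept)
    return condensed


# Hypothetical sample: one incident listed once per IPRA investigator.
rows = [
    {"Complaint_Number": "1001", "Officer_Name": "A", "Investigator_Name": "X"},
    {"Complaint_Number": "1001", "Officer_Name": "A", "Investigator_Name": "Y"},
    {"Complaint_Number": "1002", "Officer_Name": "B", "Investigator_Name": "X"},
]
condensed = condense(rows, investigator_columns={"Investigator_Name"})
print(len(rows), "->", len(condensed))
```

Dropping the investigator columns loses information, so a condensed sheet like this is probably best kept alongside the full file rather than replacing it.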

Repo directory structure

As part of cleaning up the repository, we should determine a directory structure.

I personally would like something a little flatter than what we already have, so that people can see all of the datasets at a glance.

So in the root we have:

README.md -- basic info about repo, issue tracker, tell people to look at wiki for more info
[datasetname]/ -- directories that contain the relevant files to the dataset
- README.md -- basic info about that dataset
- cleaned_[datasetname].csv -- cleaned up data (but without any processing that would require interpretation)
- raw/ -- subdirectory that contains raw data files and copies of FOIA correspondence
[datasetname]/
- ... (same)
[datasetname]/
- ... (same)
context/ -- basically the same as what we currently have, extra data files for contextual information. We should try to digitize everything into a machine readable format for things that aren't already.

Naming is discussed separately in issue #46

Migrating files associated with shootings-append.csv discussed in issue #50

Append February data to shootings-append.csv

Right now, shootings-append.csv includes only data from the April and May files.

The February file includes only three types of incident categories: 18A, 18B, 20A. It uses a very different column schema from the April and May files, and it has less information than the other two files.

We would like to merge it in to shootings-append.csv, but we need to figure out how to accurately map the column schemas onto each other given that February has less information.
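One way to approach the mapping is an explicit rename table, with any column February lacks left empty. This is only a sketch: the `Log No` to `Complaint_Number` pairing is itself an unconfirmed assumption discussed elsewhere in this tracker, and the other column names below are hypothetical placeholders.

```python
# Hypothetical mapping from February column names to the April/May schema.
FEB_TO_APPEND = {
    "Log No": "Complaint_Number",      # assumed equivalent, not confirmed
    "Incident Date": "Incident_Date",  # placeholder names
}

# Placeholder subset of the target schema.
APPEND_COLUMNS = ["Complaint_Number", "Incident_Date", "Current_Category"]


def map_february_row(feb_row):
    """Rename February columns; fields February does not have stay empty."""
    mapped = {column: "" for column in APPEND_COLUMNS}
    for feb_name, append_name in FEB_TO_APPEND.items():
        if feb_name in feb_row:
            mapped[append_name] = feb_row[feb_name]
    return mapped


mapped = map_february_row({"Log No": "1001", "Incident Date": "2015-01-02"})
print(mapped)
```

Keeping the mapping in one dictionary makes every interpretive decision visible and easy to review.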

Double-check data in /Clean/Feb2016 folder

See pull request #21

Double-check data in /Clean/May2016 folder

Data was created in pull request #17

Add all of Anna's work to the repo

Compare officer statements in the IPRA dump against each other?

See if there are patterns?

Document problems with the data

Could be useful for advocating for better data systems in the future. via Chaclyn.

Change the name of June Dump Materials?

Now that we have actual June raw data, it may be beneficial to differentiate between that and the IPRA dump in the github. A newcomer may easily get confused by the difference between "June dump materials" and "raw dump June2016."

May FOIA response

Is there a copy of the May dump FOIA response letter available?

Migrate shootings-append files and related to somewhere else

Since the effort to merge these datasets (issue #4) is not trivial and involves some significant interpretation on top of what is in the raw data files, I am suggesting that we consider it to be an independent project for now. In that case, we would want to migrate all files associated with shootings-append and shootings-merge to somewhere else.

Request for information

Dear Sir, or Madam,

My name is Christopher Wilson and I am a Walden University student studying Criminal Justice. I am writing to you because I am requesting information about the use of excessive force for my dissertation study. The purpose of this quantitative study is to investigate the use of excessive force by law enforcement officers in Chicago, I am not asking for names, addresses, phone numbers for police officers or citizens for security concerns, however, I am requesting the data set from 2019-2016 in the attachment listed below that was found on GITHUB. I am requesting that under the subtitle beat, maybe you can include the community instead of the beat the officer worked. I am looking
CHICAGO POLICE DEPARTMENT DATA.xlsx

Additionally, I am requesting the statistical information in an excel spreadsheet similar to the attachment listed below. ( Please include sustained and unsustained complaint in the excel spreadsheet) If you need further information please do not hesitate to contact me.

Thank you for your time and consideration of my request.

Kind regards,

Christopher Wilson

Walden University, Student

[Help Wanted] Please make us contributors

The following people want to label issues and do other neat things but can't because we're not contributors

[email protected]
[email protected]
[email protected]
[email protected]

@jayqi @rajivsinclair

Finish COLUMN-DICTIONARY.txt

Let's set up a plan / target date for doing this @rajivsinclair @ithinkidunno.

Upload current onboarding document to the Wiki?

I think we're at a point where the Onboarding document has enough basic material to be uploaded to the Wiki. There's still more to add, but we can continue to update it as we go. Thoughts?

Display CSV table(s) as search-able HTML on GitHub Pages

create a searchable table - thanks @derekeder for this JavaScript template
submit a pull request to host it on GitHub pages in this sandbox repo

Fix date formatting in ipra-may2016

The Accused_Appointment_Date is not formatted correctly (there is simply an integer value that, if opened with Excel and changed to being formatted as a Date then correctly shows the date). This has been fixed for the dat_may2016.csv file, but not for the other file formats in that folder, or for the concise files.

The other columns with date values should also be checked to make sure this issue doesn't exist in any other columns.
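If the integers are Excel serial dates in the default 1900 date system (an assumption, though the fact that Excel displays them correctly when formatted as a Date suggests it), they can be converted without opening Excel. A minimal Python sketch:

```python
from datetime import date, timedelta

# Using 1899-12-30 as the origin absorbs Excel's fictitious 1900-02-29,
# so the result is correct for serials from 61 (1900-03-01) onward.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial):
    """Convert an Excel 1900-system serial day number to a date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))


print(excel_serial_to_date(42370).isoformat())  # 2016-01-01
```

The same function could be applied to every date-like column to check whether the problem exists elsewhere.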

Is there location data for each complaint?

I am working on a geography project around urban renewal and campaign finance, and wanted to take a look at some of the location data around complaints, but can't find location information on any of the spreadsheets (which seems strange, given that it is visualized in the website). Does this exist? Thanks!

help to resolve unmatched officer identities

in attempting to match up the officers named in the shootings-append.csv file against the officer profiles in the all-sworn-officers datatable we found 40 rows that are mismatched/malformed (missing data fields such as the first name of the accused officer ACCSUEDOFFICER_FNAME [sic]).

I propose the following methodology for attempting to resolve them:
take the not_match_officer.csv file and work through it one row at a time to fill in the missing identifying information based on date-of-appointment matches in the all-sworn-officers table.

take the date of appointment for the unmatched officer's row in not_match_officer.csv
filter the table CPD_Employees-one-row-per-individual to show only CPD employees who have the exact same Date of Appointment as the unmatched officer, and export this filtered list to a new file
look for close/confident matches and copy those officer profile rows into a new file along with a new field identifying whether the row is a proposed confident match or if it is close but not sufficiently confident without more information
repeat for all rows in not_match_officer.csv
save the new table of close+confident matches and add it to this branch or a new one
(include all your working files, e.g., exports of filtered lists of all officers with the same APPOINTED_DATE, only if you think it’s helpful / not totally redundant)
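The filtering and labeling steps above could be sketched in Python as follows. `APPOINTED_DATE` is the field named in this issue; `LAST_NAME`, `FIRST_NAME`, `MATCH_LABEL`, and the rule that a matching last name counts as "confident" are hypothetical choices made for illustration, and a human should still review every proposed match.

```python
def candidates_by_appointment_date(unmatched_row, all_officers):
    """Return every officer profile with the same APPOINTED_DATE."""
    target = unmatched_row["APPOINTED_DATE"]
    return [officer for officer in all_officers
            if officer["APPOINTED_DATE"] == target]


def label_candidate(unmatched_row, candidate):
    """Hypothetical rule: a matching last name makes the match 'confident'."""
    same_last_name = (
        unmatched_row.get("LAST_NAME", "").strip().upper()
        == candidate.get("LAST_NAME", "").strip().upper()
    )
    return "confident" if same_last_name and candidate.get("LAST_NAME") else "close"


# Hypothetical sample rows; the real files have more fields.
all_officers = [
    {"LAST_NAME": "DOE", "FIRST_NAME": "JANE", "APPOINTED_DATE": "1999-03-15"},
    {"LAST_NAME": "ROE", "FIRST_NAME": "RICHARD", "APPOINTED_DATE": "1999-03-15"},
    {"LAST_NAME": "POE", "FIRST_NAME": "PAT", "APPOINTED_DATE": "2005-07-01"},
]
unmatched = {"LAST_NAME": "Doe", "FIRST_NAME": "", "APPOINTED_DATE": "1999-03-15"}

proposed = [
    dict(candidate, MATCH_LABEL=label_candidate(unmatched, candidate))
    for candidate in candidates_by_appointment_date(unmatched, all_officers)
]
for row in proposed:
    print(row["LAST_NAME"], row["MATCH_LABEL"])
```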

Decimal ages in June data

@DGalt How were the decimal ages determined? Was it taking into account month and day as well as year?

Arrests data numbers

There are way, way more inmates listed in the Cook County table of the same name than in the arrests table for the CPD. The timespan for the former is around 3.5 times longer, but the number of rows is nearly 10x as much; this seems strange since I would expect most Cook County cases to be CPD ones. What's going on here?

Also, the overwhelming majority of Cook County inmates seem to have no charges - the charges table, at least, has far fewer rows. By contrast, the same naive count would suggest that on average CPD arrests seem to have two charges per person. There must be different definitions going on here; what is the difference?

Dataset naming

We can probably all agree that our current month designations for the datasets are not the most illuminative.

I propose a new scheme of: topic-source-monthyearofrelease

So for example:

May -> shootings-ipra-may16

I think datasets that correspond to the same FOIA request content should have the same topic name.
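To keep names consistent, the proposed scheme could be generated rather than typed by hand. A small Python sketch; the function name is hypothetical, and the two-digit year follows the `shootings-ipra-may16` example above.

```python
MONTH_ABBREVIATIONS = [
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec",
]


def dataset_name(topic, source, month, year):
    """Build a name of the form topic-source-monthyearofrelease."""
    return f"{topic}-{source}-{MONTH_ABBREVIATIONS[month - 1]}{year % 100:02d}"


print(dataset_name("shootings", "ipra", 5, 2016))  # shootings-ipra-may16
```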

What does "inmate" mean?

Questions we could ask of the data

Right now we're busy scraping, cleaning, appending, merging, and double-checking data. @Yahwes and @DGalt and @banoonoo2 have made enormous progress on this over the past week.

Once the data is in a highly usable and double checked form, we can start asking it questions. Here are a couple that @ithinkidunno and I started brainstorming on Friday. Please add your own to this thread.

Questions for the data

What patterns do we notice in terms of the officers most frequently associated with each category (involved / accused)?
Do we see any patterns in terms of officer age and rank?
How are the incidents distributed across the city? (Geocode every incident on a map)
Do we see different patterns for victims of different races when it comes to initial category vs. current category?
Do we see different patterns for victims of different races when it comes to change in outcome (“finding code”)?
Do we see different patterns for victims of different races when it comes to length of investigation?
What does the data have to say, if anything, about the highest-ranking CPD officers?

Where can I find data to run the individual task roster_1936-2017_2017-04_p058155?

After going through the workflow, I see the input/ folders missing for the import process in roster_1936-2017_2017-04_p058155/ task in individual/ ? Is the raw data required to run the scripts for this task available somewhere else?

License?

First of all, congratulations on this relaunch. I thought what you managed to collect before was impressive enough, today's investigative project and relaunch of the data is incredible.

I had planned to use what you previously had up as a usecase/scenario for a SQL book I'm writing, but am happy to use what you've now published, particularly the great examples of reporting you've provided. Concerning this repo, have you thought yet about what the license will be?

Processing on October 2016 FOIA

I didn't see any details under the complaints-cpd-2016-october, on cleanup, but if it's still something you're looking for I've taken a stab at it here chicago-police-data-cleanup. The overall totals I'm getting roughly match up with the Tribune's counts, but I'm not sure if there are any details in particular I'm overlooking.

How did we aggregate the different data files that make up each FOIA response?

Questions for @ithinkidunno:

When we say "the April data" (for example), are we referring to the result of appending all these files in shootings-data/Raw/FOIA_April2016/ together?

218 Resp SS_2012.xls
218 Resp SS_2013.xls
218 Resp SS_2014.xls
218 Resp SS_2015.xls
218 Resp SS_2016.xls

Is the outcome of appending all these files stored anywhere in this repo?

And is that what Compare_FOIA.do is doing?

Missing FOIA documents

What we're missing:

shootings-ipra-may2016 - FOIA request and/or response
complaints-ipra-apr2016 - FOIA request (we have responses but they don't quote the requests in detail)
cpdb_complaints-cpd - FOIA requests and/or responses for august 2015, march 2015, and september 2015

Scrape all information and hosted media from portal.iprachicago.org

Scrape published info

extract all published information per log number
store it in a useful data structure saved as CSV (comma-separated values) files

Capture media files

download raw files from Vimeo and SoundCloud then upload them to archive.org
download all PDF files and upload them to DocumentCloud

Create an index table

create a CSV table of archived media with the public URL for each media asset, the media type (audio/video/etc.), and the log number for the incident.
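A minimal sketch of writing such an index with Python's standard `csv` module. The column names and the sample URL are hypothetical placeholders; only the three pieces of information per row (log number, media type, public URL) come from the description above.

```python
import csv
import io

FIELDS = ["log_number", "media_type", "public_url"]


def write_index(records, handle):
    """Write one row per archived media asset, with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Placeholder record; real rows would come from the scrape.
records = [
    {
        "log_number": "0000000",
        "media_type": "video",
        "public_url": "https://archive.org/details/example-item",
    },
]
buffer = io.StringIO()  # swap for open("index.csv", "w", newline="") on disk
write_index(records, buffer)
index_csv = buffer.getvalue()
print(index_csv)
```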

Make it easily search-able (see #8)

create a searchable table in html - thanks @derekeder for this JavaScript template
submit a pull request to host it on GitHub pages in this sandbox repo

Does the Complaint dataset also include complainant officer's info?

Hello, I'm interested in exploring who were the police officers that made complaints. Is there a way to find out about the complainant's ID?

Onboarding document

It's been started here. Please feel free to add on accurate and _helpful_ information.

Data access license?

I work with pdap.io. We're looking to pull this dataset into ours; however, I don't see an open source license.

This repo would be one of our third-party datasets, and the data would be associated with it that way. Is that all right?

Assign all POs in all_sworn a unique ID?

I'm currently working on putting together data sets for the individuals in the May and April dumps that I can confidently ID as either police officers or not police officers, and I'm finding that there currently is no good way to uniquely ID a particular officer. Their name alone isn't enough, so I end up having to use a combination of different data sources to ID them.

This is fine, except when I want to go back and look at that officer again (or match him/her in another data set again) I need to once again use those different sources to ID him.

It might be worth considering assigning all of the officers in the all_sworn data set some kind of unique ID so that when I identify someone in the April and May data set as one particular officer, I can assign that entry that unique ID. It would make cross referencing these different data sets easier I think. @rajivsinclair I know that we don't have employee IDs, but have you all discussed assigning some kind of equivalent ID # to the entries in all sworn for this purpose?
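As a rough illustration of the idea, sequential IDs could be assigned by row order. This Python sketch assumes the all_sworn rows are kept in a frozen order so that IDs stay stable; the `officer_id` field name and the `OID` prefix are hypothetical.

```python
def assign_officer_ids(officers, prefix="OID"):
    """Return copies of the rows with a sequential officer_id added."""
    return [
        dict(officer, officer_id=f"{prefix}-{number:06d}")
        for number, officer in enumerate(officers, start=1)
    ]


# Hypothetical sample rows.
all_sworn = [
    {"LAST_NAME": "DOE", "FIRST_NAME": "JANE"},
    {"LAST_NAME": "DOE", "FIRST_NAME": "JANE"},  # same name, different person
]
with_ids = assign_officer_ids(all_sworn)
for row in with_ids:
    print(row["officer_id"], row["LAST_NAME"], row["FIRST_NAME"])
```

Two officers with identical names still get distinct IDs, which is the property that names alone cannot provide.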

Compare incident counts and overlap between data sources

Via #4 (comment):

One simple way to start wrapping our heads around the data might be to start by gathering up incident counts from each of the four data sets we care about: February, April, May, and June.

We could also find out how much overlap exists between the incident IDs in each pair of data sets.

That could let us answer questions like:

"Can the April data set alone let us link most of the June 3 incidents to CPD officers?"
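Counting overlap for each pair of data sets comes down to set intersection. A short Python sketch with made-up incident IDs:

```python
from itertools import combinations


def pairwise_overlap(id_sets):
    """Map each pair of data set names to the number of shared incident IDs."""
    return {
        (a, b): len(id_sets[a] & id_sets[b])
        for a, b in combinations(id_sets, 2)
    }


# Made-up IDs for illustration only.
id_sets = {
    "feb": {"1", "2", "3"},
    "apr": {"2", "3", "4"},
    "may": {"3", "5"},
}
overlap = pairwise_overlap(id_sets)
for pair, shared in overlap.items():
    print(pair, shared)
```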

Where to upload the 80 gigs of IPRA Vimeo data

@rajivsinclair Is there an account I should upload the Vimeo data under at archive.org?

Add FOIA for May

So that we can understand where the May data set came from.

Try to merge shootings-append.csv into shootings-merge.csv

shootings-append.csv appends data from 2 data sets on police shootings released pursuant to FOIA requests submitted in April and May 2016.

This data is dynamic, not static. For example, the status or details about a police shooting may change as an investigation progresses. There may also be other inconsistencies between the data sets.

We need to merge the shootings-append.csv data into a combined file, shootings-merge.csv. You can use your GitHub username or initials to identify your attempt at merging the data. For example, Matt Li (fictional person) might name his solution shootings-merge-ml.csv.

Feel free to use your own tools to merge the data. One contributor is using Stata, but you can also use Python or R. We hope that through multiple parallel attempts at merging the data, we can catch each other's mistakes.

To quote @rajivsinclair: "We are all about redundancy. We can't have enough of that. Do it your own way, look at our start, look at our finish, and compare the results. Help us identify the flaws in our approach."

We are going to work on documenting shootings-append.csv so the meaning of each column is clear. Please share questions about the data in this thread.

Rename repo?

The repo name is shootings-data, but it contains more than shootings data. Shall we rename? Great suggestion from @shua123.

Basic projects/work that newcomers can tackle?

Today we had a lot of new people come in, and while I was able to get them somewhat up to date on all of the data, I wasn't sure what work they should tackle next. What are some basic projects that these people could work on to get a feel for the data as well as contribute?

Proposal: Clarify purpose of repo and steps to organize

Main idea

Go back to the roots of this repo. From the README:

This is a living repository of public data about Chicago’s police officers and their encounters with the public.

This means separating out:

The data (this repo)
Usage of the data (keep elsewhere)

Define a clear purpose of this repo

A copy of the data for anybody (not just InvInst and ChiHackNight people) who wants to use it for something. This is both cleaned data and raw data, as well as supporting primary documents like FOIA requests and responses.
A resource to facilitate people using the data. This means documentation of data, background and context, information about CPD/IPRA processes. This probably should be stuff in the wiki, and can be evolved as needed.
A central hub for community of data users to announce and coordinate their projects

Keeping projects using data elsewhere

We've ended up with a whole bunch of threads of people investigating different things. It doesn't make sense for those to be merged into this repo, but we still want some way to keep track of that, facilitate their work, coordinate, and potentially build a community.

Basic idea: projects or investigations using the data are separate, whether independent GitHub repo, Google docs, etc. This would include stuff like, e.g., mapping of incidents; resolving officer identities; analyzing pdf reports. Still encourage people to announce and document their projects so that everyone can know what they're working on.

What we need to do

Make sure we have all the baseline dataset files, e.g., make sure we have a Cleaned April, and a system for keeping an up-to-date scrape of June IPRA Portal
Clean up stuff in repo so we only have what any random person interested in using the data would need
Document what every file is. Document what each dataset contains
Startup guide for people to get started, potentially with examples. Great idea by Evan on Slack.
Design a process for data users to interact with each other and with InvInst. This needs to at least include ways to (1) ask clarifying questions about the data or about CPD/IPRA, (2) post info about their individual projects so that other people can get involved, and (3) raise issues about the repo or wiki.

Idea: Use issue tracker as coordination tool

We're already kind of doing this, but here is a proposal to make it more structured using the tag system. We can have three tags:

question -- people can ask a question about the data or about CPD/IPRA, e.g. "Why do POs have INVOLVED_OFFICER_TYPE set to Victim?" For useful general questions we can add documentation to the wiki when the answer is figured out.
project -- people can document a project they are working on. Can link to wherever they are hosting their project. This repo would only host a link and some basic info. Inspired by new ChiHackNight breakout group tracker.
repo issue -- people can raise an issue about the repo, whether with the data or the wiki. e.g., "IPRA portal scrape is missing CRID 0000000 that is on the website"

Open question: Role of ChiHackNight group going forward

My idea here is as follows:

In short term:

Primary work is to implement the above.

In long term:

Once above work is done, then our group is done.
Still potentially value in keeping a general police accountability breakout group to facilitate ChiHackNight attendees who want to use this data for something, e.g., Chaclyn visits and answers questions; tutorial of datasets; mob programming session with data. Maybe less frequently than weekly if we don't need weekly.
If someone has a project idea that would want multiple people to contribute, can start new breakout group for that project specifically

Open question: Messaging platform (Slack or otherwise or none)

Do we need/want a messaging platform to support a potential community of data users?

Right now, contributors from ChiHackNight are guests on InvInst's Slack. Once this effort is finished, we probably wouldn't need it for that purpose anymore. Presumably, we would want to generally keep chat for independent projects separate (i.e., not open up InvInst's Slack to any random person from anywhere).

We could make a separate Slack. We could use a competitor product like gitter which has free public channels viewable to anyone and only requires a GitHub account to say something (advantages over Slack). Or we could decide that the Issues Tracker is sufficient for discussion and not have anything.

[Help Wanted] Make a FOIA label

Use to prioritize FOIA requests

Documentation for working with data?

Hi,

Great work y'all have done. Truly appreciate it. It's a force for the greater good. I looked at the wiki, the documentation on the workflow, and the data dictionary, but I'm wondering if you have any other resources for looking at the data and getting a handle on it? It's a bit unwieldy and hard to know where to start. I'm playing around some with the data now as well, and I'm not sure if what I'm doing is correct or not, and certain things I'm seeing I'm not sure what they mean (e.g. complaints-complaints with complaints-accused I find not every cr_id matches up and also there are complaints going all the way back to 1919).

How to relate Feb, April, and May data sets?

A question that I keep coming up against is what is the most appropriate way, if at all, to combine and/or relate the different data sets that we have available to us. Just to briefly summarize what we have (this is in the wiki as well):

February data set: from CPD, composed of firearm discharge data
April data set: from IPRA, composed of police misconduct data
May data set: from IPRA, composed of shooting / tasing (+a few misc others) data

The unique-identifier column in February is Log No, while in April and May datasets it is Complaint_Number

Assuming that we can treat the values in Log No in the February dataset as equivalent to the values found in the Complaint_Number column found in the April and May datasets (@chaclynhunt, @rajivsinclair can you confirm / refute this):

There are 405, 7175, and 361 unique IDs (whether Log No or Complaint_Number) in Feb, April, and May, respectively
For the Feb set: 236, 322, and 211 of the IDs do not exist in the April, May, and combined April+May datasets, respectively
For the April set: 7006, 7014, and 6903 of the IDs do not exist in the Feb, May, and combined Feb+May datasets, respectively
For the May set: 278, 200, and 175 of the IDs do not exist in the Feb, April, and combined Feb+April datasets, respectively
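The counts above can be cross-checked with simple arithmetic: the number of IDs shared by two sets must come out the same whichever side it is computed from, and inclusion-exclusion gives the number of IDs present in all three. The Python sketch below uses only the figures reported above and, like them, assumes Log No and Complaint_Number are equivalent.

```python
unique_ids = {"feb": 405, "apr": 7175, "may": 361}

# missing[(a, b)]: IDs in a that do not exist in b, as reported above.
missing = {
    ("feb", "apr"): 236, ("feb", "may"): 322,
    ("apr", "feb"): 7006, ("apr", "may"): 7014,
    ("may", "feb"): 278, ("may", "apr"): 200,
}
# IDs in each set that do not exist in the other two sets combined.
missing_from_both = {"feb": 211, "apr": 6903, "may": 175}


def shared(a, b):
    """IDs of a that also exist in b."""
    return unique_ids[a] - missing[(a, b)]


def in_all_three(a, b, c):
    """Inclusion-exclusion: |a&b| + |a&c| - |a&(b|c)| = |a&b&c|."""
    in_either = unique_ids[a] - missing_from_both[a]
    return shared(a, b) + shared(a, c) - in_either


for a, b in [("feb", "apr"), ("feb", "may"), ("apr", "may")]:
    assert shared(a, b) == shared(b, a)  # overlap must be symmetric
    print(a, b, shared(a, b))

triple = {
    in_all_three("feb", "apr", "may"),
    in_all_three("apr", "feb", "may"),
    in_all_three("may", "feb", "apr"),
}
print("in all three:", triple)
```

The reported figures are internally consistent: 169 IDs are shared by February and April, 83 by February and May, 161 by April and May, and 58 appear in all three.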

Overall the columns in the April and May sets largely correspond to each other, although there are several extra columns in May that do not exist in April (some work on trying to match the April and May columns can be found here, about halfway down the page)

February, in contrast, is a bit sparser in terms of overall data. While I think most of the columns in February can be matched to columns in April/May, there are a number of columns in April/May that do not exist in February.

One thing that might be worth looking in to is, for the unique identifiers that overlap across the datasets, how much of the data for those identifiers overlaps.

Considering, though, that these three datasets are produced by different sources (particularly reports of misconduct vs. the reports generated when an officer uses his/her firearm), I don't know that collapsing them into one large dataset is the best path forward. Or maybe it is, hence the need for discussion :)

Understanding Number of Complaints- June vs all before 2016

Comparing Complaints data sets

We have 3 data sets that are marked as containing complaints data:

CPDB complaints data
April data dump (IPRA)
June data dump (CPD)

Some questions posed by @rajivsinclair for us to look at:

evaluate completeness of the new dataset (e.g., what percent of CRIDs include some valid value for each data field)
the consistency of the overlap with our data about existing known complaints (e.g., how many existing records are apparently “updated” in this dataset, and how many of those “updates” are dramatic changes, such as a date changing by more than 3 days)
especially, how well does this _data overlap_ with our existing complaints and shootings data? E.g., how many new CRIDs are added within the time period covered by the older dataset (i.e., new CRIDs that should already have been included in our earlier data but were omitted from previous production exports for some reason), and how many known CRIDs from our previous dataset (that should fall within the timeframe of this new data update) are actually omitted, missing, or disappeared from this file?
and are any _shootings records_ or other documented uses of force included in this new dump using the same category codes? Also, do some categories of complaints tend to be more likely to be missing a certain field of data (e.g., has some officer identified, has some investigator identified, has some final investigation/disciplinary outcome, has conflicting initial vs. final outcomes)?
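Two of these checks are easy to sketch in Python: per-field completeness, and flagging a date that moved by more than 3 days. The field names and sample rows are hypothetical, and treating empty strings and `None` as "no valid value" is an assumption; real data may need a stricter definition of valid.

```python
from datetime import date


def completeness(rows, fields):
    """Percent of rows with a non-empty value for each field."""
    total = len(rows)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: 100.0
        * sum(1 for row in rows if row.get(field) not in ("", None))
        / total
        for field in fields
    }


def is_dramatic_date_change(old, new, threshold_days=3):
    """True when a date moved by more than threshold_days."""
    return abs((new - old).days) > threshold_days


# Hypothetical sample rows keyed by CRID.
rows = [
    {"CRID": "1", "officer": "A", "outcome": ""},
    {"CRID": "2", "officer": "", "outcome": ""},
    {"CRID": "3", "officer": "B", "outcome": "Sustained"},
    {"CRID": "4", "officer": "C", "outcome": None},
]
report = completeness(rows, ["officer", "outcome"])
print(report)
print(is_dramatic_date_change(date(2016, 1, 1), date(2016, 1, 5)))
```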

I will be leaving this issue open as a place to collect notes / analysis related to these questions