invinst / chicago-police-data
a collection of public data re: CPD officers involved in police encounters
Home Page: https://invisible.institute/police-data
I haven't looked at April and February as much, but the May data can be condensed to roughly 10% of its current volume by simply eliminating the duplicate rows created by each IPRA officer on an investigation. An idea of how I formatted my condensed May data can be found here under the Condensed sheet. Note that I managed to condense the cases from ~25,000 to ~2600 just by eliminating the IPRA duplicate data. The reason that this could be useful is it allows people without coding experience to sift through the data better, increasing the ease of access.
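A minimal sketch of that condensing step in pandas, assuming the duplication comes from one row per IPRA investigator. The column names `Accused_Officer` and `IPRA_Investigator` and the sample rows are hypothetical; the real May file may label these differently.

```python
import pandas as pd

# Hypothetical sample: one complaint repeated once per IPRA investigator.
may = pd.DataFrame({
    "Complaint_Number": [1001, 1001, 1001, 1002, 1002],
    "Accused_Officer": ["A. Smith", "A. Smith", "A. Smith", "B. Jones", "B. Jones"],
    "IPRA_Investigator": ["Inv 1", "Inv 2", "Inv 3", "Inv 1", "Inv 4"],
})

# Drop the investigator column, then remove the rows that became identical.
condensed = (
    may.drop(columns=["IPRA_Investigator"])
       .drop_duplicates()
       .reset_index(drop=True)
)

print(len(may), "rows ->", len(condensed), "rows")
```

This discards the investigator names; if they matter, they could be kept in a separate sheet keyed by `Complaint_Number`.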
As part of cleaning up the repository, we should determine a directory structure.
I personally would like something a little flatter than what we already have, so that people can see all of the datasets at a glance.
So in the root we have:
README.md
-- basic info about repo, issue tracker, tell people to look at wiki for more info
[datasetname]/
-- directories that contain the relevant files to the dataset
    cleaned_[datasetname].csv
    -- cleaned up data (but without any processing that would require interpretation)
    raw/
    -- subdirectory that contains raw data files and copies of FOIA correspondence
[datasetname]/
[datasetname]/
context/
-- basically the same as what we currently have, extra data files for contextual information. We should try to digitize everything that isn't already machine readable.
Naming is discussed separately in issue #46
Migrating files associated with shootings-append.csv
discussed in issue #50
Right now, shootings-append.csv includes only data from the April and May files.
The February file includes only three types of incident categories: 18A, 18B, 20A. It uses a very different column schema from the April and May files, and it has less information than the other two files.
We would like to merge it into shootings-append.csv, but we need to figure out how to accurately map the column schemas onto each other given that February has less information.
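One way to approach the mapping is an explicit rename table, so every decision about which February column corresponds to which April/May column is written down and reviewable. The sketch below is only an illustration: `Log No` and `Complaint_Number` are column names mentioned elsewhere in this tracker (and their equivalence is itself unconfirmed), while `Category`, `Incident_Category`, `Officer_Unit`, and the sample values are hypothetical.

```python
import pandas as pd

# Explicit, reviewable mapping from February column names to April/May names.
# Only "Log No" and "Complaint_Number" are real names; the rest are placeholders.
FEB_TO_APPEND = {
    "Log No": "Complaint_Number",
    "Category": "Incident_Category",
}

feb = pd.DataFrame({"Log No": [1, 2], "Category": ["18A", "20A"]})
apr_may = pd.DataFrame({
    "Complaint_Number": [3],
    "Incident_Category": ["18B"],
    "Officer_Unit": ["012"],
})

feb_mapped = feb.rename(columns=FEB_TO_APPEND).assign(source="feb")
apr_may = apr_may.assign(source="apr_may")

# Columns that February lacks are left empty (NaN) rather than guessed.
combined = pd.concat([apr_may, feb_mapped], ignore_index=True, sort=False)
print(combined)
```

Keeping a `source` column makes it possible to tell afterwards which rows are sparse because they came from February.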
See pull request #21
Data was created in pull request #17
See if there are patterns?
Could be useful for advocating for better data systems in the future. via Chaclyn.
Now that we have actual June raw data, it may be beneficial to differentiate between that and the IPRA dump in the github. A newcomer may easily get confused by the difference between "June dump materials" and "raw dump June2016."
Is there a copy of the May dump FOIA response letter available?
Since the effort to merge these datasets (issue #4) is not trivial and involves some significant interpretation on top of what is in the raw data files, I am suggesting that we consider it to be an independent project for now. In that case, we would want to migrate all files associated with shootings-append and shootings-merge to somewhere else.
Dear Sir, or Madam,
My name is Christopher Wilson and I am a Walden University student studying Criminal Justice. I am writing to request information about the use of excessive force for my dissertation study. The purpose of this quantitative study is to investigate the use of excessive force by law enforcement officers in Chicago. For security reasons, I am not asking for names, addresses, or phone numbers of police officers or citizens; however, I am requesting the data set from 2019-2016 in the attachment listed below, which was found on GitHub. Under the subtitle "beat", could you include the community instead of the beat the officer worked? I am looking
CHICAGO POLICE DEPARTMENT DATA.xlsx
Additionally, I am requesting the statistical information in an Excel spreadsheet similar to the attachment listed below. (Please include sustained and unsustained complaints in the Excel spreadsheet.) If you need further information, please do not hesitate to contact me.
Thank you for your time and consideration of my request.
Kind regards,
Christopher Wilson
Walden University, Student
The following people want to label issues and do other neat things but can't because we're not contributors
[email protected]
[email protected]
[email protected]
[email protected]
Let's set up a plan / target date for doing this @rajivsinclair @ithinkidunno.
I think we're at a point where the Onboarding document has enough basic material to be uploaded to the Wiki. There's still more to add, but we can continue to update it as we go. Thoughts?
The Accused_Appointment_Date column is not formatted correctly: it holds a plain integer value that shows the correct date only when the file is opened in Excel and the column is formatted as a Date. This has been fixed for the dat_may2016.csv file, but not for the other file formats in that folder, or for the concise files.
The other columns with date values should also be checked to make sure this issue doesn't exist in any other columns.
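If the integers are Excel serial dates (an assumption, though it is consistent with Excel displaying them correctly), they can be converted without opening Excel. The sketch below uses the Excel 1900 date system, whose day zero for modern dates is 1899-12-30; the sample values are made up.

```python
import pandas as pd

# Made-up sample values in the Excel 1900 date system.
df = pd.DataFrame({"Accused_Appointment_Date": [36526, 42491]})

# Day counts are measured from 1899-12-30 for any date after February 1900.
df["Accused_Appointment_Date"] = pd.to_datetime(
    df["Accused_Appointment_Date"],
    unit="D",
    origin=pd.Timestamp("1899-12-30"),
)

print(df)
```

The same call could be applied to every other date column once it is confirmed that those columns are stored the same way.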
I am working on a geography project around urban renewal and campaign finance, and wanted to take a look at some of the location data around complaints, but can't find location information on any of the spreadsheets (which seems strange, given that it is visualized in the website). Does this exist? Thanks!
In attempting to match up the officers named in the shootings-append.csv file against the officer profiles in the all-sworn-officers datatable, we found 40 rows that are mismatched/malformed (missing data fields such as the first name of the accused officer, ACCSUEDOFFICER_FNAME [sic]).
I propose the following methodology for attempting to resolve them:
Take the not_match_officer.csv file and work through it one row at a time to fill in the missing identifying information based on date-of-appointment matches in the all-sworn-officers table.
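A rough sketch of how that row-by-row work could be pre-screened in pandas. All column names (`row_id`, `last_name`, `first_name`, `appointment_date`) and the sample officers are hypothetical, and matching on last name in addition to appointment date is an added assumption, not part of the proposal.

```python
import pandas as pd

# Hypothetical stand-ins for not_match_officer.csv and the all-sworn-officers table.
unmatched = pd.DataFrame({
    "row_id": [1, 2],
    "last_name": ["DOE", "ROE"],
    "appointment_date": ["1999-05-03", "2004-11-29"],
})
all_sworn = pd.DataFrame({
    "first_name": ["JANE", "JOHN", "RICHARD"],
    "last_name": ["DOE", "DOE", "ROE"],
    "appointment_date": ["1999-05-03", "2001-01-08", "2004-11-29"],
})

# Look up roster rows that share the last name and date of appointment.
candidates = unmatched.merge(
    all_sworn, on=["last_name", "appointment_date"], how="left"
)

# Accept a fill only when exactly one roster row matches; the rest need a human.
n_matches = candidates.groupby("row_id")["first_name"].transform("count")
candidates["resolved"] = n_matches == 1

print(candidates)
```

Rows where `resolved` is False (zero or several candidates) would still be worked through by hand.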
@DGalt How were the decimal ages determined? Was it taking into account month and day as well as year?
There are way, way more inmates listed in the Cook County table of the same name than in the arrests table for the CPD. The timespan for the former is around 3.5 times longer, but the number of rows is nearly 10x as much; this seems strange since I would expect most Cook County cases to be CPD ones. What's going on here?
Also, the overwhelming majority of Cook County inmates seem to have no charges - the charges table, at least, has far fewer rows. By contrast, the same naive count would suggest that on average CPD arrests seem to have two charges per person. There must be different definitions going on here; what is the difference?
We can probably all agree that our current month designations for the datasets are not the most illuminative.
I propose a new scheme of: topic-source-monthyearofrelease
So for example:
shootings-ipra-may16
I think datasets that correspond to the same FOIA request content should have the same topic name.
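To keep names consistent, the scheme could be generated rather than typed by hand. This is only a sketch; the helper name `dataset_name` is made up for illustration.

```python
from datetime import date

# Fixed abbreviations so the result does not depend on the system locale.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def dataset_name(topic: str, source: str, released: date) -> str:
    """Build a topic-source-monthyearofrelease name, e.g. shootings-ipra-may16."""
    month = MONTHS[released.month - 1]
    year = released.year % 100
    return f"{topic.lower()}-{source.lower()}-{month}{year:02d}"


print(dataset_name("shootings", "ipra", date(2016, 5, 1)))
```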
Right now we're busy scraping, cleaning, appending, merging, and double-checking data. @Yahwes and @DGalt and @banoonoo2 have made enormous progress on this over the past week.
Once the data is in a highly usable and double checked form, we can start asking it questions. Here are a couple that @ithinkidunno and I started brainstorming on Friday. Please add your own to this thread.
After going through the workflow, I see that the input/ folders are missing for the import process in the roster_1936-2017_2017-04_p058155/ task in individual/. Is the raw data required to run the scripts for this task available somewhere else?
First of all, congratulations on this relaunch. I thought what you managed to collect before was impressive enough; today's investigative project and relaunch of the data is incredible.
I had planned to use what you previously had up as a use case/scenario for a SQL book I'm writing, but am happy to use what you've now published, particularly the great examples of reporting you've provided. Concerning this repo, have you thought yet about what the license will be?
I didn't see any details on cleanup under complaints-cpd-2016-october, but if it's still something you're looking for, I've taken a stab at it here: chicago-police-data-cleanup. The overall totals I'm getting roughly match up with the Tribune's counts, but I'm not sure if there are any details in particular I'm overlooking.
Questions for @ithinkidunno:
When we say "the April data" (for example), are we referring to the result of appending all these files in shootings-data/Raw/FOIA_April2016/ together?
218 Resp SS_2012.xls
218 Resp SS_2013.xls
218 Resp SS_2014.xls
218 Resp SS_2015.xls
218 Resp SS_2016.xls
Is the outcome of appending all these files stored anywhere in this repo?
And is that what Compare_FOIA.do is doing?
What we're missing:
shootings-ipra-may2016 - FOIA request and/or response
complaints-ipra-apr2016 - FOIA request (we have responses but they don't quote the requests in detail)
cpdb_complaints-cpd - FOIA requests and/or responses for august 2015, march 2015, and september 2015
Hello, I'm interested in exploring who were the police officers that made complaints. Is there a way to find out about the complainant's ID?
It's been started here. Please feel free to add on accurate and _helpful_ information.
I'm currently working on putting together data sets for the individuals in the May and April dumps that I can confidently ID as either police officers or not police officers, and I'm finding that there currently is no good way to uniquely ID a particular officer. Their name alone isn't enough, so I end up having to use a combination of different data sources to ID them.
This is fine, except when I want to go back and look at that officer again (or match him/her in another data set again) I need to once again use those different sources to ID him.
It might be worth considering assigning all of the officers in the all_sworn data set some kind of unique ID so that when I identify someone in the April and May data set as one particular officer, I can assign that entry that unique ID. It would make cross referencing these different data sets easier I think. @rajivsinclair I know that we don't have employee IDs, but have you all discussed assigning some kind of equivalent ID # to the entries in all sworn for this purpose?
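A minimal sketch of what assigning such an ID could look like, assuming the all_sworn data can be reduced to one row per officer. The column names and sample officers are hypothetical, and `officer_uid` is a made-up name for the new column.

```python
import pandas as pd

# Hypothetical stand-in for the all_sworn data set.
all_sworn = pd.DataFrame({
    "first_name": ["JOHN", "JOHN", "MARY"],
    "last_name": ["SMITH", "SMITH", "LEE"],
    "appointment_date": ["1995-06-05", "2003-10-27", "1995-06-05"],
})

key_cols = ["last_name", "first_name", "appointment_date"]

# One row per officer, in a fixed order, then number the rows 1..N.
all_sworn = (
    all_sworn.drop_duplicates(subset=key_cols)
             .sort_values(key_cols)
             .reset_index(drop=True)
)
all_sworn["officer_uid"] = all_sworn.index + 1

# Once an April/May entry has been identified, tag it with the officer's ID.
identified = pd.DataFrame([{
    "last_name": "SMITH",
    "first_name": "JOHN",
    "appointment_date": "2003-10-27",
}])
tagged = identified.merge(all_sworn, on=key_cols, how="left")

print(tagged)
```

Sequential numbers like these are only stable if the ID column is saved with the roster and never regenerated; renumbering after new officers are added would change existing IDs.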
Via #4 (comment):
One simple way to start wrapping our heads around the data might be to start by gathering up incident counts from each of the four data sets we care about: February, April, May, and June.
We could also find out how much overlap exists between the incident IDs in each pair of data sets.
That could let us answer questions like:
"Can the April data set alone let us link most of the June 3 incidents to CPD officers?"
@rajivsinclair Is there an account I should upload the Vimeo data under at archive.org?
So that we can understand where the May data set came from.
shootings-append.csv appends data from 2 data sets on police shootings released pursuant to FOIA requests submitted in April and May 2016.
This data is dynamic, not static. For example, the status or details about a police shooting may change as an investigation progresses. There may also be other inconsistencies between the data sets.
We need to merge the shootings-append.csv data into a combined file, shootings-merge.csv. You can use your GitHub username or initials to identify your attempt at merging the data. For example, Matt Li (fictional person) might name his solution shootings-merge-ml.csv.
Feel free to use your own tools to merge the data. One contributor is using Stata, but you can also use Python or R. We hope that through multiple parallel attempts at merging the data, we can catch each other's mistakes.
To quote @rajivsinclair: "We are all about redundancy. We can't have enough of that. Do it your own way, look at our start, look at our finish, and compare the results. Help us identify the flaws in our approach."
We are going to work on documenting shootings-append.csv so the meaning of each column is clear. Please share questions about the data in this thread.
The repo name is shootings-data, but it contains more than shootings data. Shall we rename? Great suggestion from @shua123.
Today we had a lot of new people come in, and while I was able to get them somewhat up to date on all of the data, I wasn't sure what work they should tackle next. What are some basic projects that these people could work on to get a feel for the data as well as contribute?
Go back to the roots of this repo. From the README:
This is a living repository of public data about Chicago’s police officers and their encounters with the public.
This means separating out:
We've ended up with a whole bunch of threads of people investigating different things. It doesn't make sense for those to be merged into this repo, but we still want some way to keep track of that, facilitate their work, coordinate, and potentially build a community.
Basic idea: projects or investigations using the data are separate, whether independent GitHub repo, Google docs, etc. This would include stuff like, e.g., mapping of incidents; resolving officer identities; analyzing pdf reports. Still encourage people to announce and document their projects so that everyone can know what they're working on.
We're already kind of doing this, but here is a proposal to make it more structured using the tag system. We can have three tags:
INVOLVED_OFFICER_TYPE set to Victim?" For useful general questions we can add documentation to the wiki when the answer is figured out.
My idea here is as follows:
In short term:
In long term:
Do we need/want a messaging platform to support a potential community of data users?
Right now, contributors from ChiHackNight are guests on InvInst's Slack. Once this effort is finished, we probably wouldn't need it for that purpose anymore. Presumably, we would want to generally keep chat for independent projects separate (i.e., not open up InvInst's Slack to any random person from anywhere).
We could make a separate Slack. We could use a competitor product like gitter which has free public channels viewable to anyone and only requires a GitHub account to say something (advantages over Slack). Or we could decide that the Issues Tracker is sufficient for discussion and not have anything.
Use to prioritize FOIA requests
Hi,
Great work y'all have done. Truly appreciate it. It's a force for the greater good. I looked at the wiki, the documentation on the workflow, and the data dictionary, but I'm wondering if you have any other resources for looking at the data and getting a handle on it? It's a bit unwieldy and hard to know where to start. I'm playing around some with the data now as well, and I'm not sure if what I'm doing is correct or not, and I'm not sure what certain things I'm seeing mean (e.g. when joining complaints-complaints with complaints-accused, I find that not every cr_id matches up, and there are complaints going all the way back to 1919).
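For the cr_id question, one quick diagnostic is an outer join with pandas' `indicator` option, which shows on which side each unmatched cr_id lives. The frames below are tiny made-up stand-ins for complaints-complaints and complaints-accused; only the `cr_id` column name comes from the data described above.

```python
import pandas as pd

# Made-up stand-ins for the two tables.
complaints = pd.DataFrame({"cr_id": [1, 2, 3]})
accused = pd.DataFrame({"cr_id": [2, 3, 3, 4]})

# Compare the distinct IDs on each side; _merge records where each ID was found.
check = (
    complaints[["cr_id"]].drop_duplicates()
    .merge(accused[["cr_id"]].drop_duplicates(),
           on="cr_id", how="outer", indicator=True)
)

print(check["_merge"].value_counts())
```

Here `left_only` means a complaint with no accused row and `right_only` means an accused row whose complaint is missing. Whether either case is expected in the real data is a question for the maintainers.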
A question that I keep coming up against is what is the most appropriate way, if at all, to combine and/or relate the different data sets that we have available to us. Just to briefly summarize what we have (this is in the wiki as well):
The unique-identifier column in February is Log No, while in the April and May datasets it is Complaint_Number.
Assuming that we can treat the values in Log No in the February dataset as equivalent to the values found in the Complaint_Number column in the April and May datasets (@chaclynhunt, @rajivsinclair can you confirm / refute this):
Overall the columns in the April and May sets largely correspond to each other, although there are several extra columns in May that do not exist in April (some work on trying to match the April and May columns can be found here, about halfway down the page)
February, in contrast, is a bit sparser in terms of overall data. While I think most of the columns in February can be matched to columns in April/May, there are a number of columns in April/May that do not exist in February.
One thing that might be worth looking into is, for the unique identifiers that overlap across the datasets, how much of the data for those identifiers overlaps.
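A small sketch of that comparison for two of the datasets, assuming one row per identifier (the real files may have several rows per complaint, which would need collapsing first). `Complaint_Number` is the column named above; `Finding` and all values are hypothetical.

```python
import pandas as pd

# Made-up April and May extracts, indexed by the shared identifier.
april = pd.DataFrame(
    {"Complaint_Number": [1, 2, 3], "Finding": ["SU", "NS", "UN"]}
).set_index("Complaint_Number")
may = pd.DataFrame(
    {"Complaint_Number": [2, 3, 4], "Finding": ["NS", "EX", "SU"]}
).set_index("Complaint_Number")

# Identifiers and columns present in both datasets.
shared_ids = april.index.intersection(may.index)
shared_cols = april.columns.intersection(may.columns)

# Fraction of shared identifiers whose values agree, per shared column.
agreement = (
    april.loc[shared_ids, shared_cols] == may.loc[shared_ids, shared_cols]
).mean()

print(len(shared_ids), "shared identifiers")
print(agreement)
```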
Considering, though, that these three datasets are produced by different sources (particularly reports of misconduct vs. the reports generated when an officer uses his/her firearm), I don't know that collapsing them into one large dataset is the best path forward. Or maybe it is, hence the need for discussion :)
We have 3 data sets that are marked as containing complaints data:
Some questions posed by @rajivsinclair for us to look at:
has some officer identified, has some investigator identified, has some final investigation/disciplinary outcomes, has conflicting initial vs final outcomes?)
I will be leaving this issue open as a place to collect notes / analysis related to these questions