
meta's People

Contributors

constantinek, csa-goose, cteea, danielmelles, dongately, ellygaytor, evanhahn, jlintag, josh-chamberlain, ktynski, mcpf15, mcsaucy, mitchyme, nfrostdev, not-new, oscarvanl, rainmana, sambarnes, timwis, zgoulson


meta's Issues

Retain anonymized PII for ability to correlate trends with individuals without revealing identity

This will largely be up to the legal team, but it would be useful to be able to store personally identifying information such as the arresting officer, judge, and attorneys in an anonymized but consistent form.

By retaining that data we would be able to track trends in, for example, an officer's ratio of arrests by race, or a judge's sentencing by race.

One useful trend to track would be an officer's frequency of charging for "Resisting an Officer/Arrest", because this charge can be an indicator of a more violent or abusive officer, especially if it correlates with the charge being applied more often to POC.

With this sort of data we could draw useful conclusions about the behavior of states, counties, courts and departments without naming individuals.
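A minimal sketch of one way to keep identifiers consistent without storing names, assuming a keyed hash (HMAC) with a secret pepper held outside the dataset; the function name and field choices are hypothetical, and the final approach depends on the legal team:

```python
import hmac
import hashlib

# Secret "pepper" held outside the dataset (e.g. in a vault). Rotating it
# breaks linkage across records, so retention policy matters. Hypothetical value.
PEPPER = b"replace-with-secret-from-vault"

def pseudonymize(value: str, role: str) -> str:
    """Map a name to a stable pseudonym, consistent across records.

    Including the role (officer, judge, attorney) in the message means the
    same person appearing in different roles gets different pseudonyms,
    which limits cross-role correlation.
    """
    message = f"{role}:{value.strip().lower()}".encode("utf-8")
    return hmac.new(PEPPER, message, hashlib.sha256).hexdigest()[:16]

# The same officer always maps to the same token, so trends can be
# aggregated without storing the name itself.
assert pseudonymize("John Smith", "officer") == pseudonymize("john smith ", "officer")
```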

Containerize scrapers using Docker

As a developer, it would be ideal if I could spin up different scrapers using a common interface, so we can test and deploy each scraper in a common way.

One possible path to accomplish this is to use Docker and build deployable containers. This will allow a CI system to quickly build and test scrapers without needing to know the specific details of each one.

Each container will need to install the specific dependencies for its scraper.

The biggest challenge will be setting up the browsers in headless mode. As of today, the Node scraper uses Puppeteer with Chrome, and the Python scraper uses Selenium with Firefox.
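A rough sketch of what one such container might look like for the Python/Selenium scraper; the base image, paths, and entrypoint are assumptions, not a settled convention:

```dockerfile
# Hypothetical layout: each scraper directory ships its own Dockerfile.
FROM python:3.8-slim

# Headless Firefox for Selenium; geckodriver would also need to be installed
# (omitted here, since the version-pinning strategy is still undecided).
RUN apt-get update && apt-get install -y --no-install-recommends firefox-esr \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /scraper
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Common entrypoint so CI can build and run every scraper the same way.
CMD ["python", "scraper.py"]
```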

Sometimes Case Numbers in Bay County represent two separate cases

After scraping continually for a day, my scraper for Bay County stopped working properly.

This was because some case numbers in the portal's results are associated with more than one case...

To reproduce:

Go to the portal's search page and search for case 19002535; a single result will appear. Now search for case 19002536, and a search window with two cases will appear: 19002536CFMB and 19002536CFMA, a behaviour that was not seen before.

This should not be hard to resolve: each case number in this 'Case Search Results' window should be opened and scraped.

Currently, the search_portal() function treats arriving at the Case Search Results window as a failed search, as the same window opens when there are 0 search results. This assumption must also be changed.
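As a rough sketch, the fix could look something like this; the selectors and helper functions are hypothetical placeholders, not the Benchmark portal's real markup:

```python
# Hypothetical sketch; submit_search, on_case_search_results_page,
# scrape_case, and scrape_current_case are placeholders for existing logic.
def search_portal(driver, case_number):
    submit_search(driver, case_number)

    if on_case_search_results_page(driver):
        # The results grid appears both for 0 results and for 2+ results,
        # so count the rows instead of treating this page as a failure.
        rows = driver.find_elements_by_css_selector("table.results tr.case-row")
        if not rows:
            return []  # genuinely no results
        # Collect the full case numbers (e.g. 19002536CFMA, 19002536CFMB)
        # and open/scrape each one individually.
        case_ids = [row.get_attribute("data-case-id") for row in rows]
        return [scrape_case(driver, case_id) for case_id in case_ids]

    # A single match skips the results grid and lands on the case page.
    return [scrape_current_case(driver)]
```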

If you intend to reproduce the behaviour, edit the bay-county-scraped.csv output and change the bottom row's CaseNum to 19002535; the scraper will then start from the broken record.

I have created an issue as I do not intend to resolve this immediately. Others are free to have a go; I know @mcsaucy has been contributing to this scraper a lot. :) If no one takes it up, I will get around to it.

Technical Direction + Design Tenets

Forgive me if we already have such things, but I think it'd be good to come up with a set of design goals/high-level tenets for the various components of the system.

It can help a lot when making a choice among differing opinions on direction, and since this is a super grass-roots organization that's trying to ramp up quickly, we're bound to end up with many of those.

Things like the below (these are only examples):

  • Understandable code is better than highly performant but unclear code
  • Strive to minimize dependencies and moving parts to increase resilience in the face of dependency outages
  • Treat data gathering and analysis with extreme care. The worst thing we can do is lose the public's trust in the accuracy of our data
  • An unreliable API erodes trust in the data we provide. If we can't keep the API up, how can the public trust our data's accuracy?
  • Keep components isolated with strict interfaces, so as to enable us to swap out backend implementations more easily in the future
  • Backwards-compatibility is paramount - we must not break existing customers of our APIs without extreme reason. Doing so risks breaking other services and tools which depend on our API access patterns

It can help keep team members on the same page, and help ramp new contributors more quickly by helping them understand the high level reasons for why we might've made certain decisions.

Extract PDF Text from Bay County Florida docket attachments

I made a scraper for Bay County Florida in #15, which was merged in #50. One limitation is that Bay County Florida does not show many of the minimum required scraped fields on its portal. Instead, these are embedded in PDF attachments in the case dockets.

My scraper downloads these and prepends the case number to the filename, e.g.: 20000113-CASE FILED 01102020 CASE NUMBER 20000113CFMA.pdf

Here's an example case filing PDF. The blue fields contain the arrest date, arresting officer, officer ID, and DOB, all of which are missing from the scraped data at present. The forms are typed, not hand-written, but the text is not embedded as a text layer; it is an image/scan of the printed form.

By extracting these fields the scraped data will be more complete.
This would require the following:

  1. Classify scraped PDFs into form types
  2. Get data from desired parts of the form using OCR/template matching/other technique
  3. Update fields in the scraped CSV with this new data

I have no experience scraping data from PDFs, some people with experience doing this have discussed this briefly in Slack. Feel free to take on this task or private message me (Oscar) if you are willing to lend advice.
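For step 2, a minimal sketch using pdf2image and pytesseract (both would be new dependencies); the crop coordinates are made-up placeholders that would come from the template for each classified form type:

```python
from pdf2image import convert_from_path
import pytesseract

def extract_field(pdf_path, box):
    """OCR a single region of the first page of a scanned form.

    box is (left, upper, right, lower) in pixels at the rendered DPI and
    would be defined per form type after classification (step 1).
    """
    page = convert_from_path(pdf_path, dpi=300)[0]  # PIL image of page 1
    region = page.crop(box)
    return pytesseract.image_to_string(region).strip()

# Hypothetical coordinates; real values would be measured from a sample PDF
# for each form type.
ARREST_DATE_BOX = (350, 410, 700, 450)
print(extract_field("20000113-CASE FILED 01102020 CASE NUMBER 20000113CFMA.pdf",
                    ARREST_DATE_BOX))
```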

Slack Invite Link is Dead

Unable to join. Please generate a new invite link and update the links in README.md and CONTRIBUTING.md.

Update text that suggests this repo is where code resides

Since we opted to switch from a mono-repo project to a multi-repo project, code related to scrapers has moved to Police-Data-Accessibility-Project/Scrapers.

In light of this, some text has become inaccurate, such as:

[screenshot of the outdated text]

This repository now seems to primarily consist of organisation and meeting minutes.

Finalize data landscape architecture

Structured, semi-structured, RDBMS or flat files...

Consider layers of the data landscape: ingestion, processing, and serving.

I propose a flat-file data lake for raw data. The data will likely have different schemas, so raw data should be curated in the processing layer by an ELT/ETL tool. Finally, data should be stored in a performant warehouse for serving.

PII Risk Assessment - seceng

Discussion for building a risk assessment for holding PII. I'd like to focus on distilling this towards a format similar to the one below.

Please be creative; threat-analyze to your heart's content.

Project decisions about PII, technical or non-technical (i.e. "PDAP will take this approach to PII"):

  • This action: ....
  • Creates this risk: ....
  • This risk has this effect on the project: .....
  • General idea on how to hedge the risk: ...
  • This is how we best hedge it: ....
  • This is how we decently hedge: .....
  • This is worst case hedge: ....
  • ELI5, the risk and the fix:

Ex:

  • This action: architecture decisions lead to automated ingestion of raw data from scrapers
  • Creates this risk: data enters our core infra w/ various levels of verifiable safety
  • This risk has this effect on the project: malicious data enters designated secure areas, and ....
  • General Idea: we want to validate the safety of the data entering our pipeline. Safety means it comes from a known or verifiable source, etc....

Thanks!

  • DG

Standardize structure for scraper dirs

'Counties/Florida' is wonky, but we'll definitely hit county name collisions otherwise. Maybe we should have Municipalities/Florida/Bay or something? We should also probably wean ourselves off of spaces (and dashes) as path components, as many languages don't like them.

This could end up looking like:

Municipalities/Florida/Bay/...
Municipalities/Florida/St_Johns/...
Municipalities/New_York/New_York_City/...

Define CSV target schema for web scrapers

Goal

Provide a package for web scrapers to validate processed data.

List of Features

  • Minimal CSV schema that is extensible (research whether existing formats can be reused).
  • Definitions of all schema items (a plain-English description of what each item is, with a unique identifier).
  • Data types: validation criteria for specific data types (e.g. ISO-formatted dates, max length, min length). See the sketch after this list.
  • Tests to reduce regressions.
  • Easy distribution method: a location to install the package from (npm, pip, etc.)
  • Graceful and useful error handling (e.g. "Record row 10302 contains invalid FIPS code")
    • Includes suggestions for moving forward to enable devs with action items (thanks @nicholastmosher)
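As a starting point for discussion, a minimal sketch of what row-level validation could look like in Python; the field names, file name, and the FIPS check are illustrative, not the agreed schema:

```python
import csv
import re
from datetime import date

# Illustrative rules only; the real schema is what this issue will decide.
FIPS_RE = re.compile(r"^\d{5}$")  # county FIPS codes are 5 digits

def validate_row(row, line_num):
    errors = []
    if not FIPS_RE.match(row.get("fips", "")):
        errors.append(f"Record row {line_num} contains invalid FIPS code")
    try:
        date.fromisoformat(row.get("arrest_date", ""))  # ISO 8601 check
    except ValueError:
        errors.append(f"Record row {line_num} has non-ISO arrest_date; "
                      "expected YYYY-MM-DD")
    return errors

with open("scraped.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
        for err in validate_row(row, i):
            print(err)
```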

Decisions

  • TODO: Place upcoming meeting info between #database_general and #scrapers_general here
  • Location of this code (repo/package)
  • Technology (I propose Python as it is accessible to many people, although I would like to do something like Scala, Go, or Java; Rust was also mentioned, which I'm OK with)
  • ...

Please add/comment on features scoped to CSV FORMATTING ONLY in this issue.

Out of scope to this discussion

  • Storing RAW Version of webpages for future fidelity checks and snapshotting
  • Data Lake/Data Warehouse/RDBMs discussion
  • ...

I'll update this as we think through what is required for this. Refer to the README for the current list of minimum fields (and information about expected PII).

To project owners: Don't be afraid to step on my toes here and edit/update this issue however you want.

PS If you are reading this and looking to get caught up on info

The #database_general slack channel has a link to the first database meeting and the meeting minutes can be found here: https://github.com/Police-Data-Accessibility-Project/Police-Data-Accessibility-Project/blob/master/Meeting%20Minutes/db_team_minutes/5_31_2020.md

Create Scraper for Bay County Florida (Benchmark Portal)

I'm working on a scraper for Bay County Florida. This portal is based on Benchmark by Pioneer Technology, so the scraper shouldn't require many changes to work on other counties with this same portal.

If anyone would like to assist please let me know. I'm mostly making this issue so that others don't begin work on the same portal system.

As language/libraries have not yet been defined, I am using Python with Selenium as the scraper library.

Data Source Search / Index

Currently: https://airtable.com/shrUAtA8qYasEaepI/tblx8XaKnFTphWNQM

Brief

This is an interface which helps people find what they need in our Data Sources db. Most of it is a table with filters for "faceted search".

As a Scraper coder, I should be able to find a data source and its ID using search filters.

As a data consumer, I should be able to find a list of data sources pertaining to a geographic or topic area.

As a PDAP volunteer, I should be able to find out which data sources have been archived, and which have scrapers.

Quick search

Police-Data-Accessibility-Project/data-sources-app#15

Advanced search

Police-Data-Accessibility-Project/data-sources-app#17

Data Source Detail

Police-Data-Accessibility-Project/data-sources-app#18

Start using labels to organize issues

We're moving more things from Slack to GitHub issues (like suggestions and design discussions). It may be helpful to have labels which clearly denote the types of issues that are present.

This also lets us clearly denote issues which are good for folks just getting involved (it's kinda rough presently).

Compile a list of counties using Tyler Technologies based systems

To build a generalised scraper that works on Tyler Technologies outsourced court record portals (or other widely used portal systems), it would be a good first step to know which counties use which portal system.

  • What are the obvious indicators of a Tyler Technologies portal? (See the sketch after this list.)
    So far I've observed that Calaveras Superior's domain (cacalaverasportal.tylerhost.net) has tylerhost in its name, though this is not the case for all Tyler Technologies portals.
    The bottom right of the page has an "Empowered by Tyler Technologies" logo.
    The bottom left of the page says '© 2020 Tyler Technologies, Inc.'.

  • Where should this list be compiled?
    I propose adding an extra column to Privacy_Public Access to Court Records State Links.csv detailing each portal's vendor.
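A quick sketch of how those indicators could be checked automatically while compiling the list, assuming requests is available; the heuristics are just the ones listed above, so false negatives are expected:

```python
import requests
from urllib.parse import urlparse

def looks_like_tyler(url):
    """Heuristic check for a Tyler Technologies portal, based on the
    indicators observed so far."""
    if "tylerhost" in urlparse(url).netloc:
        return True
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return False  # unreachable; flag for manual review instead
    return ("Empowered by Tyler Technologies" in html
            or "Tyler Technologies, Inc." in html)

print(looks_like_tyler("https://cacalaverasportal.tylerhost.net"))
```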

Outreach: email template for supporting the "National Police Misconduct Database"

Suggestion

We could create an email template supporting the National Police Misconduct Database.

Slack links

Suggested here

The file I started in Slack

Inspiration

https://www.vice.com/en_us/article/889gva/defund12-tool-emails-city-councilmembers-with-one-click

https://defund12.org/nyc

Help needed

The outreach, recognition, and copywriting teams can help.

Template

To: [AUTOMATICALLY POPULATE RELEVANT CONGRESS PEOPLE]

Subject: [*** INSERT UNIQUE SUBJECT LINE ***]

Message:

Dear Representatives,

My name is [YOUR NAME] and I am a resident of [CITY/TOWN/DISTRICT]. I am writing to you in support of the National Police Misconduct Database introduced in [TBD].

[TODO: WRITE REQUEST FOR THEM TO SUPPORT BILL AS WELL]

[TODO: EXPLAIN WHY THE BILL IS IMPORTANT]

[TODO: ]

Thank you, 
[YOUR NAME]
[YOUR ADDRESS]
[YOUR EMAIL]
[YOUR PHONE NUMBER]

Run `black` on all Python, set up a Python linter action

Having a Python linter would help us maintain good code health. Before we do that (or immediately after), we should run black on everything once and manually tidy up what's necessary so we start out healthy.

If we want to use black more going forward, I'd be down for that, but I don't have a strong opinion.
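A minimal sketch of the linter action, assuming GitHub Actions; the workflow name and file path are placeholders:

```yaml
# .github/workflows/lint.yml -- hypothetical starting point
name: lint
on: [push, pull_request]
jobs:
  black:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install black
      - run: black --check .   # fail the build on unformatted code
```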

Alameda County Courts Link

For anyone looking at scraping the court records for Alameda County, CA: it seems the link on the spreadsheet is only for civil trials, which I am assuming we are ignoring. Criminal trials are available at https://publicportal.alameda.courts.ca.gov/publicportal/ and they seem impossible to scrape by day.

Currently I think the best option is:
Search Hearings > Search type = Judicial Officer; go through all cases under each individual officer in the drop-down, from 2005 (the oldest available record) to present.

This only provides case numbers, though, which we will have to plug back into their "smart search" to actually get case data. These case records DO NOT contain any information about the arresting officer, so I don't know if it is even worth scraping.

Trouble joining the slack

I tried joining the Slack, but it doesn't seem to be emailing me even though it claims it has. I have attempted several times. Is the link expired?

Define common 'interface' for all scrapers

Possible things to define:

  • Where should data land when a scraper is run?

    • Should all scrapers handle upload to another service? If so, let's decide where.
    • Preferably, something like an $OUTPUT_DIR environment variable should be respected by the scraper, and data should be put there. Then some common sidecar process, when the scraper is run in the cloud, can handle watching that directory and moving the data towards its final destination. (See the sketch after this list.)
  • Should there be a common way to specify where to start? It isn't clear to me whether this has to be done as wholesale scraping A-Z, or if certain data sets can be targeted in batches, somewhat similar to message queue offsets.
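For the $OUTPUT_DIR idea, the scraper-side convention could be as small as this sketch; the fallback directory name and output file name are assumptions:

```python
import os
from pathlib import Path

# Respect $OUTPUT_DIR when set (e.g. by the cloud sidecar); fall back to a
# local directory for development. The fallback name is a placeholder.
output_dir = Path(os.environ.get("OUTPUT_DIR", "./output"))
output_dir.mkdir(parents=True, exist_ok=True)

with open(output_dir / "bay-county-scraped.csv", "a", newline="") as f:
    ...  # scraper writes rows here; the sidecar handles the upload
```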

Scraping Connecticut Superior Court

Link to portal

My fork

The site has an extensive form we can use to query records. Some desired data, such as

  • ChargeCount
  • ChargeStatute
  • ChargeDescription
  • ChargeDisposition
  • ChargeDispositionDate
  • ChargeOffenseDate
  • ChargeCitationNum
  • ChargePlea
  • ChargePleaDate
  • ArrestDate
  • FilingDate
  • OffenseDate
  • DivisionName

may only exist within court documents (summons, complaint, etc.) and more human-based approaches might be necessary.

Strict DigitalOcean firewall rules required on all assets

Context

Recommended by a cyber security expert on our board.

Requirements

  • Lock down DigitalOcean network environments with firewall rulesets that deny all traffic by default and only allow traffic we architect into the application
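DigitalOcean Cloud Firewalls already deny traffic that no rule expressly permits, so the work is mostly declaring the few routes we want. A hedged sketch with doctl; the firewall name, tag, and ports are placeholders for whatever we actually architect:

```bash
# Hypothetical: allow only HTTPS in, and DNS/HTTPS out, for droplets tagged "api".
doctl compute firewall create \
  --name api-default-deny \
  --tag-names api \
  --inbound-rules  "protocol:tcp,ports:443,address:0.0.0.0/0" \
  --outbound-rules "protocol:tcp,ports:443,address:0.0.0.0/0 protocol:udp,ports:53,address:0.0.0.0/0"
```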

Docs

  • Instructions in staff docs about how to request new routes
  • Guidance about when to open up new routes, what we're using certain routes for

Discuss: Mission Statement for PDAP

What is your view of PDAP's Mission Statement?

If you have an idea for the Mission Statement, please contribute using the format below; succinct and measured is best. If you'd like to contribute dialogue about other suggested Mission Statements, keep it civil and fact-based.

Example Mission Statement idea:

(Have these two)
Mission Statement: PDAP's Mission Statement is......

Why I think this: 

(Suggested)
What will this Mission Statement stop us from doing:

How does this Mission Statement help us grow:

SIEM/log monitoring and aggregation tool

Establish logging on all server assets in addition to the already existing base DO logging. My plan for this is to set up a Splunk server (free tier) and then put the Splunk Forwarding Agent on all servers.

Obtain legal counsel

Please use this issue to discuss obtaining legal counsel and the risks associated with the group's activities.

Add "Organization" page to the Wiki

I've added an adaptation of this post from the #_announcements channel in the team slack to the Wiki as the "Organization" page: https://policeaccessibility.slack.com/archives/C014C0S31A9/p1590889096146300

These changes can be viewed on my fork of the wiki.

I have made some changes that I'd like to point out for the sake of transparency:

  • I replaced all instances of "Product team" with "Engineering team" per the discussion here: https://policeaccessibility.slack.com/archives/C014X1CQ63B/p1591373296396400

  • When referring to the "Product" itself, I changed the text to instead refer to "the PDAP system". Instead of saying the project has a "product focus", I said it has a "software-deliverable focus".

  • I shortened the name of the "Security & Operations" team to "Operations", since the original writeup itself did shorten that name in the description.

  • Instead of saying that "the conversation around the mission statement is being discussed offline", I said "the conversation around the mission statement is being discussed among the leadership". This was done with the purpose of expressing more transparency.

The Slack Invite Link In README and CONTRIBUTING.md Not Working

The links to join the Slack channel in the documentation appear to have expired. They both link to a Slack page with the following message:

This link is no longer active
To join this workspace, you’ll need to ask the person who originally invited you for a new link.

Volunteer Support Doc - seceng

  • What: build a one-sheet recommendation of tools, actions, and points of contact on Security Engineering for volunteers working on PDAP

  • Why: start building communication bridges and trust b/t SecEng and the rest of the project; enable volunteers to safely participate in the project with the least possible friction and acceptable risk

  • How:
    --> aggregate ideas here
    --> split into separate docs based on role in the project
    --> Split b/t 'must do,' 'important to do,' 'consider doing,' or similar idea (ranked suggestions)
    --> distribute docs

  • Follow-on work: ID what can be enforced from these recommendations, start using this for threat vector planning, etc.

Centralized Metrics view

We're considering using the Notion API or Django here. Notion is configurable without code, and one less app to deploy. Django is more customizable.

GitHub

DoltHub

Dolt SQL API: https://docs.dolthub.com/dolthub/api

  • PRs merged in the DoltHub repo (by user)
    helpful blog post
  • Number of Datasets documented / Estimated total
  • Number Agencies documented / Estimated total
  • Number of agencies with all known datasets scraped
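For the DoltHub numbers, the SQL API linked above can be queried over HTTP; a sketch, assuming the v1alpha1 endpoint from those docs, with a hypothetical owner/repo and table name (the real schema depends on the PDAP Dolt repo):

```python
import requests

# Hypothetical query; the real table and columns depend on the repo's schema.
owner, repo, branch = "pdap", "datasets", "master"
query = "SELECT status, COUNT(*) AS n FROM datasets GROUP BY status"

resp = requests.get(
    f"https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/{branch}",
    params={"q": query},
    timeout=30,
)
for row in resp.json()["rows"]:
    print(row)
```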

token published in codebase

Saw a Reddit thread in /r/javascript about this repo, and a user there commented about a token being exposed. Adding this just to highlight it in the actual repo.

There's a token exposed in the r.js file under the scraper.

Scraping Missouri (courts.mo.gov)

I'm working on a scraper for Missouri with an emphasis on St. Louis (for now). It's a really old Java + JS site, and it's rather arcane. Once I'm done with the 22nd circuit (STL), I'll be able to generalize the scraper and capture the rest of MO.

Slack Invite Link expired

The current link[1] in CONTRIBUTING.md is expired; can someone provide a new one? Thanks!

  1. https://join.slack.com/t/policeaccessibility/shared_invite/zt-eji7fh9w-slynNpPJtcGLUUhbhBmbTg

Slack link broken

Project Status view

As a data scraper or interested stakeholder, I should be able to get a birds-eye view of how many scrapers are written, active, etc. for the quantity of datasets.

I should be able to drill down and find specific areas of contribution.

This may just be a summary of a few key metrics above, below, or overlaid on the map.

  • List the quantity of datasets with working scrapers (by status) relative to known datasets.
    • total
    • relative to geographic locations
    • relative to data types
    • scoped by data type

Patreon/Support/Donations

Great initiative!

Just wanted to suggest some way people could sponsor the compute/API/etc. costs to help support this.

Establish a project License

Hey everyone, I've started to see people talk about opening PRs and actually merging code (e.g. scrapers) already, but I'm concerned about the fact that there's no code license in this repository yet. IANAL, but my loose understanding is that if the repository does not have an explicit license, then the authors of each contribution have full copyright over their work, and others (i.e. the project owner) would not be able to assign a license to the project without the explicit approval of all the existing contributors. So I suppose I'd like to open up this issue as a place to discuss licenses for the project.

I've heard talk that it might be desirable to license the software in one way and the collected data in another. @KMK on slack suggested GNU GPLv3 for the software license, which I think would be great for this project. For anyone who's not familiar, here's a good summary of it from https://choosealicense.com/licenses/gpl-3.0/

Permissions of this strong copyleft license are conditioned on making available complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license. Copyright and license notices must be preserved. Contributors provide an express grant of patent rights.

Dataset Status update

Current statuses

[screenshot of the current dataset_status values]

We should change the dataset_status table to be these instead, in this order:

Cannot be scraped now:

1. invalid URL
2. legal blocker
3. not started
4. scraper out of date

Can be scraped now:

5. scraper and sample data exist
6. scraper can be run manually
7. scraper is run automatically
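A sketch of the migration in SQL, assuming a simple id/name lookup table where the id encodes the desired order; the actual dataset_status schema may differ:

```sql
-- Hypothetical: rebuild the lookup table so ids reflect the new ordering.
DELETE FROM dataset_status;

INSERT INTO dataset_status (id, name) VALUES
  (1, 'invalid URL'),
  (2, 'legal blocker'),
  (3, 'not started'),
  (4, 'scraper out of date'),
  (5, 'scraper and sample data exist'),
  (6, 'scraper can be run manually'),
  (7, 'scraper is run automatically');
```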
