
meta's People

Contributors

constantinek, csa-goose, cteea, danielmelles, dongately, ellygaytor, evanhahn, jlintag, josh-chamberlain, ktynski, mcpf15, mcsaucy, mitchyme, nfrostdev, not-new, oscarvanl, rainmana, sambarnes, timwis, zgoulson


meta's Issues

Retain anonymized PII for ability to correlate trends with individuals without revealing identity

This will largely be up to the legal team, but it would be useful to be able to store personally identifying information such as the arresting officer, judge, and attorneys in an anonymized but consistent form.

By retaining that data we would be able to track trends in, for example, an officer's ratio of arrests by race, or a judge's sentencing by race.

One useful trend to track would be an officer's frequency of charging for "Resisting an Officer/Arrest", because this charge can be an indicator of a more violent or abusive officer, especially if it correlates with the charge being applied more often to POC.

With this sort of data we could draw useful conclusions about the behavior of states, counties, courts and departments without naming individuals.
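A minimal sketch of one way to keep identifiers consistent without storing names, assuming a keyed hash (HMAC) with a secret pepper held outside the dataset; the function name and field choices are hypothetical, and the final approach depends on the legal team:

```python
import hmac
import hashlib

# Secret "pepper" held outside the dataset (e.g. in a vault). Rotating it
# breaks linkage across records, so retention policy matters. Hypothetical value.
PEPPER = b"replace-with-secret-from-vault"

def pseudonymize(value: str, role: str) -> str:
    """Map a name to a stable pseudonym, consistent across records.

    Including the role (officer, judge, attorney) in the message means the
    same person appearing in different roles gets different pseudonyms,
    which limits cross-role correlation.
    """
    message = f"{role}:{value.strip().lower()}".encode("utf-8")
    return hmac.new(PEPPER, message, hashlib.sha256).hexdigest()[:16]

# The same officer always maps to the same token, so trends can be
# aggregated without storing the name itself.
assert pseudonymize("John Smith", "officer") == pseudonymize("john smith ", "officer")
```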

Containerize scrapers using Docker

As a developer, it would be ideal if I could spin up different scrapers using a common interface, so we can test and deploy each scraper in a common way.

One possible path to accomplish this is to use Docker and build deployable containers. This will allow a CI system to quickly build and test scrapers without needing to know the specific details of each one.

Each container will need to install the specific dependencies for its scraper.

The biggest challenge will be setting up the browsers in headless mode. As of today, the Node scraper uses Puppeteer with Chrome, and the Python scraper uses Selenium with Firefox.
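A rough sketch of what one such container might look like for the Python/Selenium scraper; the base image, paths, and entrypoint are assumptions, not a settled convention:

```dockerfile
# Hypothetical layout: each scraper directory ships its own Dockerfile.
FROM python:3.8-slim

# Headless Firefox for Selenium; geckodriver would also need to be installed
# (omitted here, since the version-pinning strategy is still undecided).
RUN apt-get update && apt-get install -y --no-install-recommends firefox-esr \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /scraper
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Common entrypoint so CI can build and run every scraper the same way.
CMD ["python", "scraper.py"]
```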

Sometimes Case Numbers in Bay County represent two separate cases

After scraping continually for a day, my scraper for Bay County stopped working properly.

This was because some case numbers in the portal's results are associated with more than one case...

To reproduce:

Go to the portal's search page and search for case 19002535; a single result will appear. Now search for case 19002536, and a search window with two cases will appear: 19002536CFMB and 19002536CFMA, a behaviour that was not seen before.

This should not be hard to resolve: each case number in this 'Case Search Results' window should be opened and scraped.

Currently, the search_portal() function treats arriving at the Case Search Results window as a failed search, as the same window opens when there are 0 search results. This assumption must also be changed.
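As a rough sketch, the fix could look something like this; the selectors and helper functions are hypothetical placeholders, not the Benchmark portal's real markup:

```python
# Hypothetical sketch; submit_search, on_case_search_results_page,
# scrape_case, and scrape_current_case are placeholders for existing logic.
def search_portal(driver, case_number):
    submit_search(driver, case_number)

    if on_case_search_results_page(driver):
        # The results grid appears both for 0 results and for 2+ results,
        # so count the rows instead of treating this page as a failure.
        rows = driver.find_elements_by_css_selector("table.results tr.case-row")
        if not rows:
            return []  # genuinely no results
        # Collect the full case numbers (e.g. 19002536CFMA, 19002536CFMB)
        # and open/scrape each one individually.
        case_ids = [row.get_attribute("data-case-id") for row in rows]
        return [scrape_case(driver, case_id) for case_id in case_ids]

    # A single match skips the results grid and lands on the case page.
    return [scrape_current_case(driver)]
```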

If you intend to reproduce the behaviour, edit the bay-county-scraped.csv output and change the bottom row's CaseNum to 19002535; the scraper will then start from the broken record.

I have created an issue as I do not intend to resolve this immediately. Others are free to have a go; I know @mcsaucy has been contributing to this scraper a lot. :) If no one takes it up, I will get around to it.

Technical Direction + Design Tenets

Forgive me if we already have such things, but I think it'd be good to come up with a set of design goals/high-level tenets for the various components of the system.

It can help a lot when making a choice among differing opinions on direction, and since this is a super grass-roots organization that's trying to ramp up quickly, we're bound to end up with many of those.

Things like the below (these are only examples):

  • Understandable code is better than highly performant but unclear code
  • Strive to minimize dependencies and moving parts to increase resilience in the face of dependency outages
  • Treat data gathering and analysis with extreme care. The worst thing we can do is lose the public's trust in the accuracy of our data
  • An unreliable API erodes trust in the data we provide. If we can't keep the API up, how can the public trust our data's accuracy?
  • Keep components isolated with strict interfaces, so as to enable us to swap out backend implementations more easily in the future
  • Backwards-compatibility is paramount - we must not break existing customers of our APIs without extreme reason. Doing so risks breaking other services and tools which depend on our API access patterns

It can help keep team members on the same page, and help ramp new contributors more quickly by helping them understand the high level reasons for why we might've made certain decisions.

Extract PDF Text from Bay County Florida docket attachments

I made a scraper for Bay County Florida in #15, which was merged in #50. One limitation is that Bay County Florida does not show many of the minimum required scraped fields on its portal. Instead, these are embedded in PDF attachments in the case dockets.

My scraper downloads these and prepends the case number to the filename, e.g.: 20000113-CASE FILED 01102020 CASE NUMBER 20000113CFMA.pdf

Here's an example case filing PDF. The blue fields contain the arrest date, arresting officer, officer ID, and DOB, all of which are missing from the scraped data at present. The forms are typed, not hand-written, but the text is not embedded as a text layer; it is an image/scan of the printed form.

By extracting these fields the scraped data will be more complete.
This would require the following:

  1. Classify scraped PDFs into form types
  2. Get data from desired parts of the form using OCR/template matching/other technique
  3. Update fields in the scraped CSV with this new data

I have no experience scraping data from PDFs, some people with experience doing this have discussed this briefly in Slack. Feel free to take on this task or private message me (Oscar) if you are willing to lend advice.
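For step 2, a minimal sketch using pdf2image and pytesseract (both would be new dependencies); the crop coordinates are made-up placeholders that would come from the template for each classified form type:

```python
from pdf2image import convert_from_path
import pytesseract

def extract_field(pdf_path, box):
    """OCR a single region of the first page of a scanned form.

    box is (left, upper, right, lower) in pixels at the rendered DPI and
    would be defined per form type after classification (step 1).
    """
    page = convert_from_path(pdf_path, dpi=300)[0]  # PIL image of page 1
    region = page.crop(box)
    return pytesseract.image_to_string(region).strip()

# Hypothetical coordinates; real values would be measured from a sample PDF
# for each form type.
ARREST_DATE_BOX = (350, 410, 700, 450)
print(extract_field("20000113-CASE FILED 01102020 CASE NUMBER 20000113CFMA.pdf",
                    ARREST_DATE_BOX))
```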

Slack Invite Link is Dead

Unable to join. Please generate a new invite link and update the links in README.md and CONTRIBUTING.md.

Update text that suggests this repo is where code resides

Since we opted to switch from a mono-repo project to a multi-repo project, code related to scrapers has moved to Police-Data-Accessibility-Project/Scrapers.

In light of this, some text has become inaccurate, such as:

[screenshot of the outdated text]

This repository now seems to primarily consist of organisation and meeting minutes.

Finalize data landscape architecture

Structured, semi-structured, RDBMS or flat files...

Consider layers of the data landscape: ingestion, processing, and serving.

I propose a flat-file data lake for raw data. The data will likely have different schemas, so raw data should be curated in the processing layer by an ELT/ETL tool. Finally, data should be stored in a performant warehouse for serving.

PII Risk Assessment - seceng

Discussion for building a risk assessment for holding PII. I'd like to focus on distilling this towards a format similar to the one below.

Please be creative; threat-analyze to your heart's content.

Project decisions about PII, technical or non-technical (i.e. "PDAP will take this approach to PII"):

  • This action: ....
  • Creates this risk: ....
  • This risk has this effect on the project: .....
  • General idea on how to hedge the risk: ...
  • This is how we best hedge it: ....
  • This is how we decently hedge: .....
  • This is worst case hedge: ....
  • ELI5, the risk and the fix:

Ex:

  • This action: architecture decisions lead to automated ingestion of raw data from scrapers
  • Creates this risk: data enters our core infra w/ various levels of verifiable safety
  • This risk has this effect on the project: malicious data enters designated secure areas, and ....
  • General Idea: we want to validate the safety of the data entering our pipeline. Safety means it comes from a known or verifiable source, etc....

Thanks!

  • DG

Standardize structure for scraper dirs

'Counties/Florida' is wonky, but we'll definitely hit county name collisions otherwise. Maybe we should have Municipalities/Florida/Bay or something? We should also probably wean ourselves off of spaces (and dashes) as path components, as many languages don't like them.

This could end up looking like:

Municipalities/Florida/Bay/...
Municipalities/Florida/St_Johns/...
Municipalities/New_York/New_York_City/...

Define CSV target schema for web scrapers

Goal

Provide a package for web scrapers to validate processed data.

List of Features

  • Minimal CSV schema that is extensible (research whether existing formats can be reused).
  • Definitions of all schema items (a plain-English description of what each item is, with a unique identifier).
  • Data types: validation criteria for specific data types (e.g. ISO-formatted dates, max length, min length). See the sketch after this list.
  • Tests to reduce regressions.
  • Easy distribution method: a location to install the package from (npm, pip, etc.)
  • Graceful and useful error handling (e.g. "Record row 10302 contains invalid FIPS code")
    • Includes suggestions for moving forward to enable devs with action items (thanks @nicholastmosher)
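As a starting point for discussion, a minimal sketch of what row-level validation could look like in Python; the field names, file name, and the FIPS check are illustrative, not the agreed schema:

```python
import csv
import re
from datetime import date

# Illustrative rules only; the real schema is what this issue will decide.
FIPS_RE = re.compile(r"^\d{5}$")  # county FIPS codes are 5 digits

def validate_row(row, line_num):
    errors = []
    if not FIPS_RE.match(row.get("fips", "")):
        errors.append(f"Record row {line_num} contains invalid FIPS code")
    try:
        date.fromisoformat(row.get("arrest_date", ""))  # ISO 8601 check
    except ValueError:
        errors.append(f"Record row {line_num} has non-ISO arrest_date; "
                      "expected YYYY-MM-DD")
    return errors

with open("scraped.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
        for err in validate_row(row, i):
            print(err)
```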

Decisions

  • TODO: Place upcoming meeting info between #database_general and #scrapers_general here
  • Location of this code (repo/package)
  • Technology (I propose Python as it is accessible to many people, although I would like to do something like Scala, Go, or Java; Rust was also mentioned, which I'm OK with)
  • ...

Please add/comment on features scoped to CSV FORMATTING ONLY in this issue.

Out of scope to this discussion

  • Storing RAW Version of webpages for future fidelity checks and snapshotting
  • Data Lake/Data Warehouse/RDBMs discussion
  • ...

I'll update this as we think through what is required for this. Refer to the README for the current list of minimum fields (and information about expected PII).

To project owners: Don't be afraid to step on my toes here and edit/update this issue however you want.

PS If you are reading this and looking to get caught up on info

The #database_general slack channel has a link to the first database meeting and the meeting minutes can be found here: https://github.com/Police-Data-Accessibility-Project/Police-Data-Accessibility-Project/blob/master/Meeting%20Minutes/db_team_minutes/5_31_2020.md

Create Scraper for Bay County Florida (Benchmark Portal)

I'm working on a scraper for Bay County Florida. This portal is based on Benchmark by Pioneer Technology, so the scraper shouldn't require many changes to work on other counties with this same portal.

If anyone would like to assist please let me know. I'm mostly making this issue so that others don't begin work on the same portal system.

As language/libraries have not yet been defined, I am using Python with Selenium as the scraper library.

Data Source Search / Index

Currently: https://airtable.com/shrUAtA8qYasEaepI/tblx8XaKnFTphWNQM

Brief

This is an interface which helps people find what they need in our Data Sources db. Most of it is a table with filters for "faceted search".

As a Scraper coder, I should be able to find a data source and its ID using search filters.

As a data consumer, I should be able to find a list of data sources pertaining to a geographic or topic area.

As a PDAP volunteer, I should be able to find out which data sources have been archived, and which have scrapers.

Quick search

Police-Data-Accessibility-Project/data-sources-app#15

Advanced search

Police-Data-Accessibility-Project/data-sources-app#17

Data Source Detail

Police-Data-Accessibility-Project/data-sources-app#18

Start using labels to organize issues

We're moving more things from Slack to GitHub issues (like suggestions and design discussions). It may be helpful to have labels which clearly denote the types of issues that are present.

This also lets us clearly denote issues which are good for folks just getting involved (it's kinda rough presently).

Compile a list of counties using Tyler Technologies based systems

To build a generalised scraper that works on Tyler Technologies outsourced court record portals (or other widely used portal systems), it would be a good first step to know which counties use which portal system.

  • What are the obvious indicators of a Tyler Technologies portal? (See the sketch after this list.)
    So far I've observed that Calaveras Superior's domain (cacalaverasportal.tylerhost.net) has tylerhost in its name, though this is not the case for all Tyler Technologies portals.
    The bottom right of the page has an "Empowered by Tyler Technologies" logo.
    The bottom left of the page says '© 2020 Tyler Technologies, Inc.'.

  • Where should this list be compiled?
    I propose adding an extra column to Privacy_Public Access to Court Records State Links.csv detailing each portal's vendor.
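A quick sketch of how those indicators could be checked automatically while compiling the list, assuming requests is available; the heuristics are just the ones listed above, so false negatives are expected:

```python
import requests
from urllib.parse import urlparse

def looks_like_tyler(url):
    """Heuristic check for a Tyler Technologies portal, based on the
    indicators observed so far."""
    if "tylerhost" in urlparse(url).netloc:
        return True
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return False  # unreachable; flag for manual review instead
    return ("Empowered by Tyler Technologies" in html
            or "Tyler Technologies, Inc." in html)

print(looks_like_tyler("https://cacalaverasportal.tylerhost.net"))
```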

Outreach: email template for supporting the "National Police Misconduct Database"

Suggestion

We could create an email template supporting the National Police Misconduct Database.

Slack links

Suggested here

The file I started in Slack

Inspiration

https://www.vice.com/en_us/article/889gva/defund12-tool-emails-city-councilmembers-with-one-click

https://defund12.org/nyc

Help needed

The outreach, recognition, and copywriting teams can help.

Template

To: [AUTOMATICALLY POPULATE RELEVANT CONGRESS PEOPLE]

Subject: [*** INSERT UNIQUE SUBJECT LINE ***]

Message:

Dear Representatives,

My name is [YOUR NAME] and I am a resident of [CITY/TOWN/DISTRICT]. I am writing to you in support of the National Police Misconduct Database introduced in [TBD].

[TODO: WRITE REQUEST FOR THEM TO SUPPORT BILL AS WELL]

[TODO: EXPLAIN WHY THE BILL IS IMPORTANT]

[TODO: ]

Thank you, 
[YOUR NAME]
[YOUR ADDRESS]
[YOUR EMAIL]
[YOUR PHONE NUMBER]

Run `black` on all Python, set up a Python linter action

Having a Python linter would help us maintain good code health. Before we do that (or immediately after), we should run black on everything once and manually tidy up what's necessary so we start out healthy.

If we want to use black more going forward, I'd be down for that, but I don't have a strong opinion.
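A minimal sketch of the linter action, assuming GitHub Actions; the workflow name and file path are placeholders:

```yaml
# .github/workflows/lint.yml -- hypothetical starting point
name: lint
on: [push, pull_request]
jobs:
  black:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install black
      - run: black --check .   # fail the build on unformatted code
```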

Alameda County Courts Link

For anyone looking at scraping the court records for Alameda County, CA: it seems the link on the spreadsheet is only for civil trials, which I am assuming we are ignoring. Criminal trials are available at https://publicportal.alameda.courts.ca.gov/publicportal/ and they seem impossible to scrape by day.

Currently I think the best option is:
Search Hearings > Search type = Judicial Officer; go through all cases under each individual officer in the drop-down, from 2005 (the oldest available record) to present.

This only provides case numbers, though, which we will have to plug back into their "smart search" to actually get case data. These case records DO NOT contain any information about the arresting officer, so I don't know if it is even worth scraping.

Trouble joining the slack

I tried joining the Slack, but it doesn't seem to be emailing me even though it claims it has. I have attempted several times. Is the link expired?

Define common 'interface' for all scrapers

Possible things to define:

  • Where should data land when a scraper is run?

    • Should all scrapers handle upload to another service? If so, let's decide where.
    • Preferably, something like an $OUTPUT_DIR environment variable should be respected by the scraper, and data should be put there. Then some common sidecar process, when the scraper is run in the cloud, can handle watching that directory and moving the data towards its final destination. (See the sketch after this list.)
  • Should there be a common way to specify where to start? It isn't clear to me whether this has to be done as wholesale scraping A-Z, or if certain data sets can be targeted in batches, somewhat similar to message queue offsets.
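For the $OUTPUT_DIR idea, the scraper-side convention could be as small as this sketch; the fallback directory name and output file name are assumptions:

```python
import os
from pathlib import Path

# Respect $OUTPUT_DIR when set (e.g. by the cloud sidecar); fall back to a
# local directory for development. The fallback name is a placeholder.
output_dir = Path(os.environ.get("OUTPUT_DIR", "./output"))
output_dir.mkdir(parents=True, exist_ok=True)

with open(output_dir / "bay-county-scraped.csv", "a", newline="") as f:
    ...  # scraper writes rows here; the sidecar handles the upload
```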

Scraping Connecticut Superior Court

Link to portal

My fork

The site has an extensive form we can use to query records. Some desired data, such as

  • ChargeCount
  • ChargeStatute
  • ChargeDescription
  • ChargeDisposition
  • ChargeDispositionDate
  • ChargeOffenseDate
  • ChargeCitationNum
  • ChargePlea
  • ChargePleaDate
  • ArrestDate
  • FilingDate
  • OffenseDate
  • DivisionName

may only exist within court documents (summons, complaint, etc.) and more human-based approaches might be necessary.

Strict DigitalOcean firewall rules required on all assets

Context

Recommended by a cyber security expert on our board.

Requirements

  • Lock down DigitalOcean network environments with firewall rulesets that deny all traffic by default and only allow traffic we architect into the application
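DigitalOcean Cloud Firewalls already deny traffic that no rule expressly permits, so the work is mostly declaring the few routes we want. A hedged sketch with doctl; the firewall name, tag, and ports are placeholders for whatever we actually architect:

```bash
# Hypothetical: allow only HTTPS in, and DNS/HTTPS out, for droplets tagged "api".
doctl compute firewall create \
  --name api-default-deny \
  --tag-names api \
  --inbound-rules  "protocol:tcp,ports:443,address:0.0.0.0/0" \
  --outbound-rules "protocol:tcp,ports:443,address:0.0.0.0/0 protocol:udp,ports:53,address:0.0.0.0/0"
```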

Docs

  • Instructions in staff docs about how to request new routes
  • Guidance about when to open up new routes, what we're using certain routes for

Discuss: Mission Statement for PDAP

What is your view of PDAP's Mission Statement?

If you have an idea for the Mission Statement, please contribute using the format below; succinct and measured is best. If you'd like to contribute dialogue about other suggested Mission Statements, keep it civil and fact-based.

Example Mission Statement idea:

(Have these two)
Mission Statement: PDAP's Mission Statement is......

Why I think this: 

(Suggested)
What will this Mission Statement stop us from doing:

How does this Mission Statement help us grow:

SIEM/log monitoring and aggregation tool

Establish logging on all server assets in addition to the already existing base DO logging. My plan for this is to set up a Splunk server (free tier) and then put the Splunk Forwarding Agent on all servers.

Obtain legal counsel

Please use this issue to discuss obtaining legal counsel and the risks associated with the group's activities.

Add "Organization" page to the Wiki

I've added an adaptation of this post from the #_announcements channel in the team slack to the Wiki as the "Organization" page: https://policeaccessibility.slack.com/archives/C014C0S31A9/p1590889096146300

These changes can be viewed on my fork of the wiki.

I have made some changes that I'd like to point out for the sake of transparency:

  • I replaced all instances of "Product team" with "Engineering team" per the discussion here: https://policeaccessibility.slack.com/archives/C014X1CQ63B/p1591373296396400

  • When referring to the "Product" itself, I changed the text to instead refer to "the PDAP system". Instead of saying the project has a "product focus", I said it has a "software-deliverable focus".

  • I shortened the name of the "Security & Operations" team to "Operations", since the original writeup itself did shorten that name in the description.

  • Instead of saying that "the conversation around the mission statement is being discussed offline", I said "the conversation around the mission statement is being discussed among the leadership". This was done with the purpose of expressing more transparency.

The Slack Invite Link In README and CONTRIBUTING.md Not Working

The links to join the Slack channel in the documentation appear to have expired. They both link to a Slack page with the following message:

This link is no longer active
To join this workspace, you’ll need to ask the person who originally invited you for a new link.

Volunteer Support Doc - seceng

  • What: build a one-sheet recommendation of tools, actions, and points of contact on Security Engineering for volunteers working on PDAP

  • Why: start building communication bridges and trust b/t SecEng and the rest of the project; enable volunteers to safely participate in the project with the least possible friction and acceptable risk

  • How:
    --> aggregate ideas here
    --> split into separate docs based on role in the project
    --> Split b/t 'must do,' 'important to do,' 'consider doing,' or similar idea (ranked suggestions)
    --> distribute docs

  • Follow-on work: ID what can be enforced from these recommendations, start using this for threat vector planning, etc.

Centralized Metrics view

We're considering using the Notion API or Django here. Notion is configurable without code, and one less app to deploy. Django is more customizable.

GitHub

DoltHub

Dolt SQL API: https://docs.dolthub.com/dolthub/api

  • PRs merged in the DoltHub repo (by user)
    helpful blog post
  • Number of Datasets documented / Estimated total
  • Number Agencies documented / Estimated total
  • Number of agencies with all known datasets scraped
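For the DoltHub numbers, the SQL API linked above can be queried over HTTP; a sketch, assuming the v1alpha1 endpoint from those docs, with a hypothetical owner/repo and table name (the real schema depends on the PDAP Dolt repo):

```python
import requests

# Hypothetical query; the real table and columns depend on the repo's schema.
owner, repo, branch = "pdap", "datasets", "master"
query = "SELECT status, COUNT(*) AS n FROM datasets GROUP BY status"

resp = requests.get(
    f"https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/{branch}",
    params={"q": query},
    timeout=30,
)
for row in resp.json()["rows"]:
    print(row)
```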

token published in codebase

Saw a Reddit thread in /r/javascript about this repo, and a user there commented about a token being exposed. Adding this just to highlight it in the actual repo.

There's a token exposed in the r.js file under the scraper.

Scraping Missouri (courts.mo.gov)

I'm working on a scraper for Missouri with an emphasis on St. Louis (for now). It's a really old Java + JS site, and it's rather arcane. Once I'm done with the 22nd circuit (STL), I'll be able to generalize the scraper and capture the rest of MO.

Slack Invite Link expired

The current link[1] in CONTRIBUTING.md is expired; can someone provide a new one? Thanks!

  1. https://join.slack.com/t/policeaccessibility/shared_invite/zt-eji7fh9w-slynNpPJtcGLUUhbhBmbTg

Slack link broken

Project Status view

As a data scraper or interested stakeholder, I should be able to get a birds-eye view of how many scrapers are written, active, etc. for the quantity of datasets.

I should be able to drill down and find specific areas of contribution.

This may just be a summary of a few key metrics above, below, or overlaid on the map.

  • List the quantity of datasets with working scrapers (by status) relative to known datasets.
    • total
    • relative to geographic locations
    • relative to data types
    • scoped by data type

Patreon/Support/Donations

Great initiative!

Just wanted to suggest some way people could sponsor the compute/API/etc. costs to help support this.

Establish a project License

Hey everyone, I've started to see people talk about opening PRs and actually merging code (e.g. scrapers) already, but I'm concerned about the fact that there's no code license in this repository yet. IANAL, but my loose understanding is that if the repository does not have an explicit license, then the authors of each contribution have full copyright over their work, and others (i.e. the project owner) would not be able to assign a license to the project without the explicit approval of all the existing contributors. So I suppose I'd like to open up this issue as a place to discuss licenses for the project.

I've heard talk that it might be desirable to license the software in one way and the collected data in another. @KMK on slack suggested GNU GPLv3 for the software license, which I think would be great for this project. For anyone who's not familiar, here's a good summary of it from https://choosealicense.com/licenses/gpl-3.0/

Permissions of this strong copyleft license are conditioned on making available complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license. Copyright and license notices must be preserved. Contributors provide an express grant of patent rights.

Dataset Status update

Current statuses

[screenshot of the current dataset_status values]

We should change the dataset_status table to be these instead, in this order:

Cannot be scraped now:

1. invalid URL
2. legal blocker
3. not started
4. scraper out of date

Can be scraped now:

5. scraper and sample data exist
6. scraper can be run manually
7. scraper is run automatically
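A sketch of the migration in SQL, assuming a simple id/name lookup table where the id encodes the desired order; the actual dataset_status schema may differ:

```sql
-- Hypothetical: rebuild the lookup table so ids reflect the new ordering.
DELETE FROM dataset_status;

INSERT INTO dataset_status (id, name) VALUES
  (1, 'invalid URL'),
  (2, 'legal blocker'),
  (3, 'not started'),
  (4, 'scraper out of date'),
  (5, 'scraper and sample data exist'),
  (6, 'scraper can be run manually'),
  (7, 'scraper is run automatically');
```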
