
rcp2

Red Cross Fire Risk Map v2

Documentation Status

What areas in the USA are at the highest risk of home fires, and where should the American Red Cross go to install smoke alarms?

In Phase 1, DKDC created 6 models to analyze fire response data, smoke alarm data, and census data and assign a fire risk score to census tracts across the United States. The results from these models helped generate a map of high-risk census tracts across the United States, which informed planning and helped allocate resources. Now, DKDC has been asked to replicate this effort at the census block level (a smaller geographic unit), so that the Red Cross can more efficiently target smoke detector distribution efforts. This phase of work will help ensure smoke alarms are handed out where they are most needed.

Phase 2 has two primary objectives:

  1. Refine and update risk model to include smaller geographic areas and new data.
  2. Set up a method so that the model can easily be refreshed by the Red Cross team when new home fire datasets are available.

Documentation

See project documentation for an API reference.

Quickstart Guide

Here is a general overview of the steps you'll need to perform to get started on our project. For more details on any of the steps, read getting-started.md in the Docs folder.

1. Get on our Slack Channel

https://dkdc.herokuapp.com/

2. Get the data repository link from RCP2_public

3. Download "Master Project Data" folder RCP2 > 02_data > Master Project Data

4. Fork this repo and place 'Master Project Data' into Data folder

5. Python installation (optional but recommended)

Download Anaconda (https://anaconda.org/).

Go to the command line (or the Anaconda terminal), navigate to this directory (usually documents/github/rcp2), and run:

conda env create -f environment.yml 
conda activate RCP2
jupyter lab

This creates and activates the environment, and you'll be ready to go.

6. (Optional) Download GitHub Desktop

It will make your life easier if you are new to GitHub: https://desktop.github.com/

7. (Optional) Read Up

The Google Drive has many great resources, both in Master Project Data and at 01_project_overview > Additional reading.

8. Find a Task

Click on the Projects board (above), then RCP2, to see all the current tasks.

Project Discussion and Materials

Project Discussion is on the DKDC Slack in the #rcp2_public channel.

Data Location and Dictionaries: We have new NFIRS, Red Cross, and ACS data that we would like to incorporate. We would like to consider adding new types of data as well such as climate data.

Phase 1 Map: Fire Risk Map

Phase 1 Blog Post: DataKind Blog.

How to Get Involved and Help Out

Please review the skills we are looking for below, and let us know if you’d like to get involved by emailing a data ambassador or posting in the Slack channel - we’d love your help!

Skills used/needed: There are two main components of the project: data modeling and visualization. The modeling component requires aggregating, joining, and geocoding large datasets, then modeling fire risk from the resulting variables. Python has been the main language for the project so far and is recommended for beginners, but R, Tableau, or GIS are also welcome if you are more comfortable with them. The visualization portion of the project needs front-end web development skills, particularly Mapbox GL, D3.js, HTML, and JavaScript.

| Input Data | Folder Name | Geo Type | Description / Comments |
| --- | --- | --- | --- |
| American Community Survey | 02_inputdata_ACS | census tract | Socio-economic variables. |
| American Housing Survey | | | |
| American Red Cross Preparedness Data | | | ARC home visits for smoke alarm installation and fire safety instruction. Includes the number of smoke alarms installed, environmental hazards, the number of alarms that existed in a dwelling prior to an ARC visit, etc. |
| American Red Cross Response Data | | | ARC home visits for smoke alarm installation and fire safety instruction. Includes the number of smoke alarms installed, environmental hazards, the number of alarms that existed in a dwelling prior to an ARC visit, etc. |
| American Red Cross Boundary Data | | | ARC regions/chapters, zip codes. |
| Census | | | |
| Homeland Infrastructure Foundation Level Data (HIFLD) fire station locations | | | 2010 list of >50k fire stations and their latitude & longitude coordinates in the USA [source]. 2017 Census tracts & blocks added. Shapefiles can be found at source. |
| HIFLD emergency medical service locations | | | 2010 list of ambulatory and EMS locations in the USA [source]. 2017 Census tracts & blocks added. Shapefiles can be found at source. |
| NFIRS | | Census tract | CSV file containing the address, latitude & longitude, Census tract, and Census block information for home fires in the USA from 2009-2016. |
| SVI 2016 | | | CDC's Social Vulnerability Index, which is based on the Census's American Community Survey data. Includes indexes for socioeconomic status, household composition and disability, minority status and language, and housing and transportation that summarize a population's resilience to stressors. |

DataKind DataCorps

DataKind DataCorps brings together teams of pro bono data scientists with social change organizations on long-term projects that use data science to transform their work and their sector. We help organizations define their needs and discover what’s possible, then match them with a team that can translate those needs into data science problems and solve them with advanced analytics.

We are very proud to re-partner with the American Red Cross!

Project Organization

├── LICENSE
├── requirements.txt   <- List of Python packages currently used in the project.
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── interim        <- Intermediate data that you have transformed.
│   ├── Master Project Data <- The final, canonical data sets for modeling. (on Google Drive)
│   └── raw            <- The original, immutable data dump. (on Google Drive)
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks in progress. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials (moved to Google Drive in Master Project Data).
│
│   ---- Future roadmap (not currently implemented) ----
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│

Project based on the cookiecutter data science project template. #cookiecutterdatascience


Contributors

anilca-lab, chellison, dschon, fbartsch, gianfoss, jaboola9, jacob-spiegel, jessezlotoff, jsya, kelsonss, mattbarger, nbanion, rmcarder, sylvest00, thwhitfield, trav-work, yechielk


rcp2's Issues

Compare Fire Propensity Model and Fire Severity Model

Determine:

  1. Are the risk factors for severe fires the same as for fire propensity?
  2. What is the overlap in the model predictions?
  • How many blocks rank high in severe fires but not total fires?
  • How many blocks have a high number of fires but not severe fires?
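The overlap questions above can be sketched with plain set operations on the GEOIDs each model flags as high risk. The GEOIDs below are hypothetical stand-ins for real model output:

```python
# High-risk census blocks flagged by each model (hypothetical GEOIDs).
high_propensity = {"360470001001", "360470002002", "360470003003"}
high_severity = {"360470002002", "360470004004"}

# Blocks high in severe fires but not total fires, and vice versa.
severe_not_total = high_severity - high_propensity
total_not_severe = high_propensity - high_severity
overlap = high_propensity & high_severity
```

With real model output, the same set arithmetic answers both bullet questions directly.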

Calculate rurality of counties, census tracts, and census block groups

Data sources: SVI, GIS, Census Data

  • County Rurality (Populated Areas) - Use SVI sq mile and population estimate data (E_TOTPOP) to determine rurality of populated areas in counties. Exclude any Census Tracts w/ 0 est pop. Score and rank order.

  • Census Tract Rurality (Populated Tracts) - Determine (and document) means to measure Census Tract Rurality (see Census Tract Rurality Tab). Exclude zero pop areas, score and rank order.

  • Census Block Group Rurality (Populated Block Groups) - Based on outputs from census tract rurality, determine means to assign rurality scores to block groups (based on size most likely). Should be relative to tract rurality score
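A minimal sketch of the county step, assuming SVI-style column names (`E_TOTPOP` is named in the issue; `AREA_SQMI` and the toy FIPS codes are assumptions): compute population density, exclude zero-population rows, and rank order.

```python
import pandas as pd

# Toy SVI-style table; AREA_SQMI and the FIPS values are hypothetical.
svi = pd.DataFrame({
    "FIPS": ["01001", "01003", "01005"],
    "E_TOTPOP": [55000, 200000, 0],
    "AREA_SQMI": [600.0, 1600.0, 900.0],
})

svi = svi[svi["E_TOTPOP"] > 0].copy()           # exclude zero-population areas
svi["density"] = svi["E_TOTPOP"] / svi["AREA_SQMI"]
svi["rurality_rank"] = svi["density"].rank(ascending=True)  # 1 = most rural
```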

Smoke_alarm_model Missing some NFIRS Geoids

In the smoke alarm model, some blocks have never been visited in the ARC data, so a number of blocks are not included in the current model.

To fix:
Cross-reference the ARC data to ensure all blocks exist, and add rows with zeros for blocks with no visits.
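One way to sketch the fix, assuming a master list of block GEOIDs is available (column names and GEOIDs below are hypothetical): reindex the ARC table against the full block list and fill missing blocks with zeros.

```python
import pandas as pd

# Hypothetical master list of block GEOIDs and an ARC visit table
# that is missing some blocks.
all_blocks = ["b1", "b2", "b3", "b4"]
arc = pd.DataFrame({"geoid": ["b1", "b3"], "visits": [5, 2]})

# Reindex against the full block list; unvisited blocks get zero visits.
arc_full = (
    arc.set_index("geoid")
       .reindex(all_blocks, fill_value=0)
       .reset_index()
)
```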

OctoberDataJam Munging: Housing age

ACS data has several housing-age categories:

house_prc_built_before_1929
house_prc_built_1929_1939
...
house_prc_built_1969_1979

Create a feature called house_prc_before_1980 that aggregates all these features into one.
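Since the columns share a common prefix, the aggregation can be sketched as a row-wise sum (the toy values below are illustrative, not real ACS data):

```python
import pandas as pd

# Toy slice of the ACS housing-age percentage columns.
acs = pd.DataFrame({
    "house_prc_built_before_1929": [0.10, 0.05],
    "house_prc_built_1929_1939": [0.05, 0.10],
    "house_prc_built_1969_1979": [0.20, 0.15],
})

# All pre-1980 columns share the same prefix; sum them per row.
pre_1980_cols = [c for c in acs.columns if c.startswith("house_prc_built")]
acs["house_prc_before_1980"] = acs[pre_1980_cols].sum(axis=1)
```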

Gather data from ACS and AHS

Data sources: ACS, AHS

  • Get All Available ACS Data @ BlockGroup - Pull from Census API, 5-year estimates

  • Get All Available AHS Data @ Tract Level - Pull from Census API, Desire years 2013 (last with smoke alarm data) and most recent available data.

Make a rcp2 package proof of concept.

Purpose

It should be easy for volunteers and Red Cross partners to use the rcp2 project. By packaging our code and investing in project infrastructure, we can make it easier to maintain, document, and reproduce our work.

Objectives

  1. Discuss our needs, guiding principles, and vision for the high level infrastructure.
  2. Implement an extensible proof of concept (POC) for the desired infrastructure.

The POC won't be comprehensive. We can close this issue after, say, setting up the infrastructure for one data source. After that, we can add other issues to build out the package infrastructure.

Describe the fire propensity model.

@dschon, may I have your help documenting the fire propensity model?

The goal is to enable a newcomer to understand the purpose, approach, and basic mechanics of the model so that they could get started working on it. Details welcome. We'll add your content to the documentation where it'll help all users leverage your work. Here's a sneak peek from my own rcp2 fork.

A few key questions. Feel free to add anything else that's pertinent.

  • What does the model predict?
  • What scripts / notebooks are involved?
  • What data sources do you use?
  • What are the dimensions of the data?
  • What does each record represent?
  • What years do you use?
  • What is the outcome variable?
  • What features do you use?
  • What modeling approach do you use?
  • How do you transform the raw data?
  • When / how often will more data become available?
  • Any other important details?

I'm happy to wordsmith or discuss 1:1. Thanks for your help.

Visualize model outputs using Mapbox

  • Create a Master datasets consisting of:
  1. The three model outputs
  2. the previous model outputs
  3. census geography
  • upload data to mapbox
  • distribute visualization and allow fellow members to analyze.

ACS Data Exploration

We recently cleaned all of the ACS data at the block level. We need to do some exploratory data analysis to QC and look at some simple trends within this data.
Data is located in: Master Project Data > ACS 5yr Block Data

  • Using missingno or pandas, create a visualization of the number of NaNs / missing values in each column of the data.

  • Using seaborn (or your favorite plotting tool), create histograms of each column. Are the answers normally distributed? If not, can they be easily clustered (e.g. low-income, medium-income, high-income)?

  • Using seaborn (or your favorite plotting tool), create a correlation matrix of the columns in the ACS data (see thw_1.0_EDA_NFIRS_SVI in the notebooks page for inspiration). Which columns are highly correlated with each other?
    **Note:** if a column isn't population adjusted (i.e. it isn't a percentage or <1), do that first or it will be correlated with block size.
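The first and third bullets can be sketched with plain pandas (missingno and seaborn produce richer plots; the toy frame below stands in for the ACS table):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ACS block-level table.
acs = pd.DataFrame({
    "median_income": [50_000, np.nan, 72_000, 61_000],
    "prc_owner_occupied": [0.6, 0.5, np.nan, 0.7],
})

missing_per_col = acs.isna().sum()   # NaN count per column
corr = acs.corr()                    # pairwise correlation matrix
# acs.hist() would draw the per-column histograms via matplotlib.
```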

ARC home fire campaign impact

Broad project to study the impact of ARC's home fire campaign. Suggested datasets:

  1. ARC response data: approximate ARC presence in a region by analyzing the proportion of recorded home fires they attended in a region

  2. NFIRS data: record of fires

Analyze Fire Station/GEOID Reporting quality. Apply those quality scores to county, census tract, and blockgroups

Data sources: NFIRS, ARC Response

  • Fire Reporting Quality Assessment Step 1 - Determine statistical outliers for NFIRS reporting at the county level using year-over-year and month-over-month reporting. Assign a confidence score for county reporting consistency. Take into account "zero" reporters. | NFIRS

  • Fire Reporting Quality Assessment Step 2 - Determine possible outliers for NFIRS reporting at the county level by looking at annual ARC reported totals and comparing to NFIRS. NFIRS should be greater | NFIRS, ARC Response

  • Fire Reporting Quality Assessment Step 3 - Determine possible NFIRS reporting deficiencies for census tracts, e.g. FD reports fire consistently, but only in 5 of county's 10 tracts. Determine best way to interpret data. If possible, assign a census tract reporting score to county, e.g. percentage of tracts where fires were reported based on aggregate data.
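One simple way to sketch Step 1, assuming annual county totals are available (column names and counts below are hypothetical): flag year-over-year outliers with a per-county z-score, which also catches counties that suddenly report zero.

```python
import pandas as pd

# Toy annual NFIRS fire counts; county A stops reporting in 2016.
counts = pd.DataFrame({
    "county": ["A"] * 4 + ["B"] * 4,
    "year": [2013, 2014, 2015, 2016] * 2,
    "fires": [100, 105, 98, 0, 40, 42, 41, 43],
})

# Z-score each county's annual counts against its own history.
z = counts.groupby("county")["fires"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)
counts["outlier"] = z.abs() > 1.5   # threshold is an assumption
```

A production version would also need the month-over-month view and the ARC comparison from Step 2.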

Parent / Child Relationships between Fire Stations and Fire Departments

There are approximately 30k different fire departments across the country; some have one station, some have dozens. Very occasionally, two different fire departments may use the same fire station. The DHS HIFLD dataset and the US Geological Survey have a (somewhat dated) dataset that lists 53k fire stations. This task is to create parent / child relationships between fire departments and fire stations. This may involve more than simple joins and fuzzy searches: there are several different lists of fire departments that don't all agree, and several datasets on fire stations that don't agree either. We also might need to hit the Google Places API to pull fire station location data for some of the most questionable records.

ACS NFIRS EDA

Home Fire Area Profiles: Use ACS data to identify common demographic and economic themes for Census block areas reporting fires. Document methodology and results.

Only use NFIRS data from 2013-2017, which corresponds to the years covered by the ACS dataset.

See thw_1.0_EDA_NFIRS_SVI for inspiration

Calculate absolute and driving distance from GEOIDs to closest one or more fire stations

Data Sources - FD Locations, GEOID geographies

  • Assign GEOID (Full FIPS) to FD Locations File - Use FD Location file's Lat Longs to identify Census Block Group Info (GEOID) for all Fire Stations, update master file.

  • Determine distance between FD and Tracts - Determine as the crow flies distance to nearest fire station from census tract centroid, Document distance in output. Denote if paid or volunteer, if located in same county, and FDID

  • Determine distance between FD and Block Group - Determine as the crow flies distance to nearest fire station from census block group centroid, Document distance in output. Denote if paid or volunteer, if located in same county, and FDID

  • Determine Avg Drive time from FD to Census Tract Boundary - Use FD location data to determine closest possible drive distance to ingress at census tract level. Denote if paid or volunteer, if located in same county, and FDID

  • Determine Avg Drive time from FD to Census Block Group Boundary - Use FD location data to determine closest possible drive distance to ingress at census block group level. Denote if paid or volunteer, if located in same county, and FDID
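The as-the-crow-flies bullets above reduce to a nearest-neighbor search over station coordinates. A minimal sketch using the haversine formula (station IDs and coordinates below are hypothetical; drive times would need a routing service instead):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """As-the-crow-flies distance between two lat/lon points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(a))   # mean Earth radius ~3958.8 mi

def nearest_station(centroid, stations):
    """Return (station_id, distance) of the station closest to a centroid.

    `centroid` is (lat, lon); `stations` is an iterable of (id, lat, lon).
    """
    return min(
        ((sid, haversine_miles(*centroid, lat, lon)) for sid, lat, lon in stations),
        key=lambda pair: pair[1],
    )
```

For ~53k stations and every block group, a KD-tree (e.g. `scipy.spatial.cKDTree` on projected coordinates) would be far faster than this brute-force scan.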

Describe the smoke alarm model.

@kelsonSS, may I have your help documenting the smoke alarm model?

The goal is to enable a newcomer to understand the purpose, approach, and basic mechanics of the model so that they could get started working on it. Details welcome. We'll add your content to the documentation where it'll help all users leverage your work. Here's a sneak peek from my own rcp2 fork.

A few key questions. Feel free to add anything else that's pertinent.

  • What does the model predict?
  • What scripts / notebooks are involved?
  • What data sources do you use?
  • What are the dimensions of the data?
  • What does each record represent?
  • What years do you use?
  • What is the outcome variable?
  • What features do you use?
  • What modeling approach do you use?
  • How do you transform the raw data?
  • When / how often will more data become available?
  • Any other important details?

I'm happy to wordsmith or discuss 1:1. Thanks for your help.

Use SVI and HFC Data to identify factors related to fire alarm rate, fire rate, and lives saved rate

Data Sources: SVI, HFC Home Visits, HVC Lives Saved

  • Use SVI and HFC data to identify common demographic and economic themes for Census Tract areas where alarms were installed. Document methodology and results

  • Home Fire Area Profiles - Use SVI and HFC data to identify common demographic and economic themes for Census Tract areas reporting fires. Document methodology and results

  • Lives Saved Area Profiles - Use SVI and HFC data to identify common themes for Lives Saved Locations. Document methodology and results

Draft guidelines for contributions to the project package.

Establish guidelines that will help developers contribute to a cohesive package. Guidelines should cover:

  • Module organization.
  • Documentation and doc string guidance.
  • Unit testing guidance.
  • Basic coding style guidance.

Our goal is to share manageable, practical advice that helps teammates collaborate. No need to get too detailed or to go crazy with enforcement.

Document the guidelines in the project Sphinx docs to close out the issue.

Document the project data pipeline.

In the Sphinx documentation, describe and diagram this project's pipeline from raw data sources to model predictions. This work will help new users contribute to the project, and it will help us organize project package code.

Make links between the project repo and project docs.

Add details that link the project repo to the project documentation.

  • Link to the docs from the README.
  • Link to the project repo from the docs.
  • Automate documentation to build when the master branch moves.
  • Add a badge to show the documentation build status.

Review other fire risk models

  • Review other fire risk models / risk models in general | Seek out other fire risk models and, if possible, document findings, relevant data used, etc.

Conda environment setup fails: ResolvePackageNotFound.

Hi @kelsonSS, I tried setting up using the new environment.yml file, and I get a ResolvePackageNotFound error.

$ conda env create -f environment.yml 
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound: 
  - python.app
  - appnope

This Stack Overflow response suggests installing the problem dependencies using a pip subprocess, though they don't cite a source or explain their reasoning much. I tried it, and while conda successfully solves the environment, pip cannot find requirement python.app.

I'm installing on Ubuntu 18.04.4 LTS.

write script to download google drive directory

I think it would be helpful if we could write a script to download the raw and processed data files from google drive, and then place that into the makefile as the first step. I've been able to download individual files using gdown, but I haven't been able to figure out how to download a whole google drive directory.

If anyone knows of a package or has an approach to download a whole google drive directory as part of a script, that would be great.

Import project documentation to Read the Docs.

Host our Sphinx documentation on Read the Docs so that it is easy for users to access searchable, readable documentation about this project and its codebase.

I already imported the rcp2 documentation from DataKind-DC/rcp2. The docs currently fail to build due to an issue with requirements.txt. I expect we can fix the issue with a Read the Docs config file.

Some resources for completing this issue:

October DataJam: Housing Price

There are a number of housing variables called:
House_val_15K_20K
House_vall_25K_30K
etc.

  • Create a variable called housing_prc_under_250K which aggregates all house_val categories underneath that value.

  • Create a variable called median_house_value that finds which category holds the cumulative 50th percentile and returns the mean of that bracket as an int. For example:
  25-30K = 5%
  30-50K = 20%
  50-150K = 25%
  150-250K = 25%
  In this scenario the cumulative 50th percentile falls in the 50-150K bracket, so return 100,000 (the average of 50K and 150K).
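The median-bracket rule above can be sketched as a cumulative walk over (low, high, share) tuples; the brackets mirror the worked example:

```python
# Brackets from the example above: (low, high, share of households).
brackets = [
    (25_000, 30_000, 0.05),
    (30_000, 50_000, 0.20),
    (50_000, 150_000, 0.25),
    (150_000, 250_000, 0.25),
]

def median_house_value(brackets):
    """Return the midpoint of the bracket holding the cumulative 50th percentile."""
    total = sum(share for _, _, share in brackets)
    cumulative = 0.0
    for low, high, share in brackets:
        cumulative += share
        if cumulative >= total / 2:
            return int((low + high) / 2)   # mean of the median bracket
    raise ValueError("empty brackets")
```

On the example brackets this returns 100,000, matching the worked scenario.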

Aggregate Rankings at Census Tract and County level

Jake wants to preserve as many features from the previous map as possible. Looking at the V1 map (link below), it had the ability to switch between tracts and counties and show the absolute ranking of each geography.

Therefore we will need to be able to aggregate and rank our models across multiple geographies.

For all three models:

  1. (If needed) convert binary classification estimates to probabilities.

  2. Create a function that ranks GEOIDs by probability.

  3. Create a function that takes these probabilities and a geography level and gives the average probability for that geography level.

https://home-fire-risk.github.io/smoke_alarm_map/
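Steps 2 and 3 can be sketched in pandas. Because Census GEOIDs nest, a parent geography is a prefix of the child GEOID; the toy IDs below fake that nesting with a `-` separator:

```python
import pandas as pd

# Hypothetical block-level risk probabilities; "T1"/"T2" stand in for
# parent tract GEOIDs, which in real data are prefixes of the block GEOID.
blocks = pd.DataFrame({
    "block_geoid": ["T1-b1", "T1-b2", "T2-b1", "T2-b2"],
    "risk_prob": [0.9, 0.7, 0.2, 0.4],
})

# Step 2: rank GEOIDs by probability (1 = highest risk).
blocks["rank"] = blocks["risk_prob"].rank(ascending=False)

# Step 3: average probability per parent geography.
blocks["tract_geoid"] = blocks["block_geoid"].str.split("-").str[0]
tract_risk = blocks.groupby("tract_geoid")["risk_prob"].mean()
```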
