Maxwell Austensen 2017-05-06
The purpose of this project is to predict serious housing code violations in multi-family rental buildings in New York City. All data is taken from publicly available sources and is organized at the borough-block-lot (BBL) level. The plan is to use all data available in 2015 to predict violations in 2016.
All prediction results can be visualized in this Shiny app: https://maxwell-austensen.shinyapps.io/violations-app/
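Since every source is organized at the BBL level, a short sketch of how such a key works may help. A BBL is conventionally written as 10 digits: a 1-digit borough code, a 5-digit block, and a 4-digit lot. The helper `make_bbl()` and the column names below are hypothetical and only for illustration; the scripts in `./munge` may build and join keys differently.

```r
# Hypothetical helper: build a 10-digit BBL key from its three parts
# (1-digit borough code, 5-digit zero-padded block, 4-digit zero-padded lot).
make_bbl <- function(borough, block, lot) {
  sprintf("%d%05d%04d", as.integer(borough), as.integer(block), as.integer(lot))
}

# Toy, made-up rows standing in for two cleaned sources
buildings <- data.frame(borough = c(1, 3), block = c(16, 245), lot = c(100, 7))
buildings$bbl <- make_bbl(buildings$borough, buildings$block, buildings$lot)

violations <- data.frame(bbl = "1000160100", serious_violations = 2)

# Left join on the shared key so buildings without violations are kept
joined <- merge(buildings, violations, by = "bbl", all.x = TRUE)
```

Storing the key as a zero-padded string rather than a number keeps joins exact across sources that format block and lot differently.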
Directory | Description |
---|---|
`./` | Pseudo makefiles for data and analysis/maps |
`./analysis` | R Notebook files for main analysis |
`./violations-app` | Shiny files for app to visualize model predictions |
`./maps` | R scripts to create maps and final map images |
`./munge` | R scripts to download raw files, clean data, and prep for joining all sources |
`./data-raw` | Raw data files and cleaned individual data sets, including crosswalks (git-ignored due to file size) |
`./data-documentation` | Documentation files downloaded for data sources |
`./data` | Final cleaned and joined data sets (only samples of data are not git-ignored) |
`./functions` | R functions used throughout project |
`./presentations` | Slide presentations for class using `xaringan`, including final presentation PDF |
`./packrat` | Files for packrat R package management system (do not edit) |
- Clone the repo and open the RStudio project file `edsp17proj-austensen.Rproj`.
  - The package `packrat` will be automatically installed from source files in the repository. Then all the other packages used in this project will be installed from instructions saved in this repo. All installed packages will be saved in the packrat sub-directories of this repo. This allows you to easily get all the packages you need to reproduce this project without disrupting your own local package library (e.g., changing versions).
- Run `source("make_data.R")` to download and prepare all the data necessary to reproduce the analysis.
- Run `source("make_analysis_maps.R")` to run all the analysis scripts, rendering .nb.html files and generating map images.
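The two `source()` steps above can also be run back to back from the project root. The sketch below is one way to do that; `run_pipeline()` is a hypothetical convenience wrapper, not a function shipped in this repo, and it assumes the working directory is the project root (as it is when the `.Rproj` file is open in RStudio).

```r
# Hypothetical wrapper: run the pseudo-makefiles in order,
# failing early with a clear message if a script cannot be found.
run_pipeline <- function(scripts = c("make_data.R", "make_analysis_maps.R")) {
  missing <- scripts[!file.exists(scripts)]
  if (length(missing) > 0) {
    stop("Script(s) not found; is the working directory the project root? ",
         paste(missing, collapse = ", "))
  }
  for (script in scripts) {
    message("Running ", script)
    source(script)  # evaluated in the global environment, like a manual source()
  }
  invisible(scripts)
}

# Usage, from the project root:
# run_pipeline()
```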
- Improve logit model using `MASS::stepAIC()` to choose a model
- Plot decision tree (look at `rpart.plot` package)
- Consider changing from classification to regression using adjusted serious violations count
- Deal with missing data problems
  - Impute missing data
    - simple mean imputation,
    - mean by zip code and/or building type,
    - should also see if missing-not-at-random
    - look for values in past years of data (older pluto/rpad versions),
    - regressions using other variables
- Add to evaluation of models using tests recommended in Dietrich (1997) reading
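As a starting point for the missing-data items above, here is a minimal base-R sketch of mean imputation by group (for example, zip code), falling back to the overall mean when an entire group is missing. The function `impute_group_mean()`, the column names, and the toy values are all hypothetical; this is one possible approach, not the project's implemented method.

```r
# Hypothetical helper: replace NAs with the mean of their group,
# using the overall mean when a group has no observed values.
impute_group_mean <- function(x, group) {
  overall <- mean(x, na.rm = TRUE)
  # ave() returns, for each element, the mean of that element's group
  group_mean <- ave(x, group, FUN = function(v) mean(v, na.rm = TRUE))
  group_mean[is.nan(group_mean)] <- overall  # groups that are entirely missing
  ifelse(is.na(x), group_mean, x)
}

# Toy, made-up data for illustration only
buildings <- data.frame(
  zip = c("10001", "10001", "10001", "11201", "11201", "10451"),
  year_built = c(1920, NA, 1930, 1960, NA, NA)
)

# Keep a flag for originally missing values before overwriting anything
buildings$year_built_missing <- is.na(buildings$year_built)
buildings$year_built_imputed <- impute_group_mean(buildings$year_built, buildings$zip)
```

Keeping the missingness flag as its own column makes it possible to check later whether missingness is related to the outcome, which bears on the missing-not-at-random question.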
- Building permits (DOB)
- Oil Boilers
- Rodent Inspection
- Subsidized Housing Database
- Likely Rent-Regulated Units
- Certificates of Occupancy (DCP - FOIL)
- Open Balance File (Property Tax Delinquency) (DOF - FOIL)
- HPD registration files - corporate owner
- DOF sales data - price and date of last sale
- Tract-level ACS - median rent, poverty rate, etc.
- SBA-level HVS - building quality, pests, etc.