labs-pad-normalize

An R script that normalizes PAD data into discrete address records. Part of the NYC Geosearch Geocoder Project.

Introduction

The NYC Geosearch API is built on Pelias, the open source geocoding engine that powered Mapzen Search. For its address data, Labs uses the authoritative Property Address Directory (PAD) from the NYC Department of City Planning's Geographic Systems Section. However, because PAD represents addresses as ranges, the data must be normalized into an "expanded" form that Pelias will understand. This expansion process involves many factor-specific nuances in translating the ranges into discrete address rows.
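
As a rough illustration of what "expanding" a range means, the sketch below turns one low/high house number pair into discrete address rows. It is a simplified assumption, not the logic in munge.R: the function name expand_range, the column names, and the same-parity stepping rule are all hypothetical, and real PAD rows include suffixes and other cases that this ignores.

```r
# Hypothetical helper: expand one address range into discrete rows.
# Assumes plain integer house numbers with low <= high. With parity = TRUE it
# steps by 2, keeping only numbers with the same odd/even parity as `low`.
expand_range <- function(low, high, street, parity = TRUE) {
  stopifnot(low <= high)
  step <- if (parity) 2L else 1L
  data.frame(
    house_number = seq(low, high, by = step),
    street = street,
    stringsAsFactors = FALSE
  )
}

# Illustrative input, not real PAD data.
expanded <- expand_range(120L, 126L, "EXAMPLE STREET")
print(expanded)
```

Each row of the result is one discrete address, which is the shape the importer expects to pick up.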

We treat the normalization of the PAD data as a data workflow separate from the PAD Pelias Importer. This script starts with the published PAD file and outputs a normalized CSV of discrete addresses, ready to be picked up by the importer.

Data

This script downloads a version of the PAD data from NYC's Bytes of the Big Apple. The Property Address Directory (PAD) contains geographic information about New York City’s approximately one million tax lots (parcels of real property) and the buildings on them. PAD was created and is maintained by the Department of City Planning’s (DCP’s) Geographic Systems Section (GSS). PAD is released under the BYTES of the BIG APPLE product line four times a year, reflecting tax geography changes, new buildings and other property-related changes.

R Script

This script writes the expanded output to a file called final.csv in the /data directory. To make sure the script is getting the latest version of PAD, check that the download source points to the most recent PAD release.
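
After a run, a quick sanity check can catch a missing or empty output file. This is a minimal sketch, assuming final.csv is an ordinary comma-separated file with a header row and that the path is relative to the repository root. check_output is a hypothetical helper, and no column names are assumed.

```r
# Hypothetical helper: confirm the expanded output exists and is non-empty.
# The default path is an assumption based on the /data directory named above.
check_output <- function(path = "data/final.csv") {
  if (!file.exists(path)) stop("Output not found: ", path)
  rows <- read.csv(path, stringsAsFactors = FALSE)
  if (nrow(rows) == 0) stop("Output has no rows: ", path)
  # Return the row count invisibly so callers can log or compare it.
  invisible(nrow(rows))
}
```

Calling check_output() from the repository root after the script finishes returns the number of address rows, or stops with an error if the file is missing or empty.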

Status

The script is incomplete! Find sample output here. Over the coming weeks, it should be finalized.

Deploy

To "deploy" data as the source for the geosearch importer, run npm run deploy. You must have s3cmd configured, because the deploy step runs it to upload the output files. To set up s3cmd for DigitalOcean Spaces, see: https://www.digitalocean.com/community/tutorials/how-to-configure-s3cmd-2-x-to-manage-digitalocean-spaces.

For a new version of PAD, two file references need to be updated. In download_data, ensure that the download link points to the latest PAD version (17D, 18A, etc.), and in load_data, make sure the path to the street name dictionary (snd17Dcow.txt, snd18Acow.txt, etc.) reflects the current release.
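
The street name dictionary file name can be derived from the release string, which may make version bumps less error-prone. This is only a sketch: pad_version and snd_file_name are hypothetical names, the file-name pattern is inferred from the two examples above, and the download link is left out because its format is not given here.

```r
# Hypothetical helper: build the street name dictionary file name for a release.
# The "snd<version>cow.txt" pattern is inferred from snd17Dcow.txt / snd18Acow.txt.
snd_file_name <- function(pad_version) {
  # Expect two digits plus a release letter, e.g. "17D" or "18A".
  # Limiting the letter to A-D (four releases a year) is an assumption.
  stopifnot(grepl("^[0-9]{2}[A-D]$", pad_version))
  sprintf("snd%scow.txt", pad_version)
}

pad_version <- "18A"
print(snd_file_name(pad_version))
```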

How to run

Make sure R is installed on your machine. If you just want the command-line tools, install it with Homebrew:

$ brew install R

Install the necessary packages:

$ R
> install.packages(c("tidyverse", "jsonlite", "downloader"))

(Note: this may take a long time. Go get a coffee or something)

Run the R script to normalize the new PAD data:

$ Rscript ./munge.R

Due to the nature of the PAD dataset, it is very likely that some data processing will be incompatible with new versions. At the very least, it is likely that new entries will need to be added to the suffix lookup table data. Do not despair. Use RStudio to step through the munging process one step at a time. You'll get there. You got this!

If you're happy with your data, push it to DigitalOcean using the included shell script:

$ ./push-to-bucket.sh

Contributors

allthesignals, jasondecastro, jtalati
