nycplanning / ceqr-app-data-archive Goto Github PK

(DEPRECATED)data pipelines for CEQR app, managed by data engineering

Home Page: https://github.com/NYCPlanning/ceqr-app-data

Python 84.40% Shell 4.43% TSQL 5.38% CSS 1.41% HTML 4.38%

data-pipelines data-engineering etl-pipeline

ceqr-app-data-archive's Introduction

CEQR App Data User Guides

This repository includes ETL pipelines for all the datasets fed into CEQR (City Environmental Quality Review) app. It is managed by NYC Planning's data engineering team.

Building Instructions

Set environmental variables: RECIPE_ENGINE, CEQR_DATA, BUILD_ENGINE,and EDM_DATA under the /ceqr directory. See .env.example.
Run python3 -m venv base to set up the virtual environment.
Run source base/bin/activate to activate the virtual environment. To deactivate once finished, type deactivate.
Run pip3 install -e . to install packages required accross multiple data schema.
Run ceqr run <schema_name> at root directory. For example, ceqr run ceqr_school_buildings, which allows you to build ceqr_school_buildings from scratch

Repo directory structure

The ETL pipelines have been stored under the /ceqr/recipes directory as individual folders named by the datasets.
Each dataset subfolder contains the following items.
- build.py A python script that will transform and integrate source datas into a target table
- config.json A configuration file specifying the input table names, output table name and DDL (output table schemas).
  - It is noted that the DDLs of "nysdec_state_facility_permits", "nysdec_title_v_facility_permits" and "sca_capacity_projects" actually reflect the schemas for their geo_rejects tables
- README.md The metadata about the ETL pipeline
- requirements The required dependencies need to install to run the python script
- runner.sh A shell script, by executing which, you can build a dataset from scratch. or you can execute ceqr run <schema_name> at root directory

├── ceqr
│   ├── recipes
│   │   ├── <schema_name_1>
│   │   │   ├── build.py
│   │   │   ├── config.json
│   │   │   ├── README.md
│   │   │   ├── requirements.txt
│   │   │   └── runner.sh
│   │   ├── <schema_name_2>
│   │   │   ├── build.py
│   │   │   ├── config.json
│   │   │   ├── README.md
│   │   │   ├── requirements.txt
│   │   │   └── runner.sh
...

How to build a new dataset

Create a new folder named by the dataset and put it under the /ceqr/recipes directory
Create config.json, README.md, build.py, requirements.txt and runner.sh as described in the Repo directory structure within this new folder
- For the output table schema, besides the requirements specified by the data users, it also need to follow the CEQR data schema standards.
Follow the Building Instructions to test the ETL pipeline
Output table can be found in EDM_DATA under a schema named by the dataset.

ceqr-app-data-archive's People

Contributors

Stargazers

Watchers

Forkers

sptkl

ceqr-app-data-archive's Issues

CEQR Air Quality: Major or large emission source

‘Major or large emission source’
Inputs:
Title V State Permits

Facility State Permits

Steps:

load to recipes
Geocode
Merge
Deduplicate

Odor-producing facilities are included in these data

CEQR Air: Work with EARD to determine which fields to display and field aliases for each dataset

CEQR Air: remove rail from A-typical roadways

If this is an issue with the file on the network drive, make sure rail is removed from dataset

CEQR Schools: Capacity projects - take lat and long from Centerline Level Information

If you don't get a lat and long from Property Level Information, but a lat / long exists from the Centerline Level Information take the lat long from the Centerline Level Information

CEQR Schools: Generate SCA Capital Projects tables with just Capacity Projects

Generate SCA Capital Projects tables with just Capacity Projects
Input data tables come from sections 6 and 9 here http://www.nycsca.org/community/capital-plan-reports-data#Local-Law-167-Reports-352

Touch base with Amanda before starting to agree on output data schema

CEQR Air Quality: Ambient Air Quality

Source data
State has stations monitoring air quality

Understand how source data are used
If it's determined to be useful, build data pipeline

CEQR source data - schools

@pichot @mshapiro1
Could you share with us the raw inputs for the following three datasets which are in the ceqr_data?

Ideally, we should be able to load all these datasets from a csv or url into our recipe_data

Thank you!

only generate current school year school_buildings

SCHOOL_YEAR = f'{datetime.date.today().year-1}-{datetime.date.today().year}'
print(f'current school year is {SCHOOL_YEAR}')

Improve Geocoding / address cleanup for DEP air quality permits

1. Where GRC code or GRC code 2 is 42 or 41, output subset of records and send to GRU, because addresses may be missing in PAD
2. Add keywords such as Cross and Crs to geocoding intersection function
3. Update JFK international airport so that it gets a hit from Geosupport
4. Explore on-to-from formats and build out function in Geocoding suite
5. Where reason code is EE data engineering will fix addresses if it's clear what it should be

If any component of the address is missing (house number (i.e. 13B), street name, borough) we cannot fix the address
Do not fix addresses where the error code indicates there is an extra hyphen (GRC 13 3)
Most likely cannot fix addresses where GRC = 13

Apply standard data schema on ctpp and school data

ctpp_censustract_centroids: centriods-->geom
sca_capital_projects: geometry-->geom, start_date::date, planned_end_date::date
doe_significant_utilization_changes: vote_date::date

Fix typo in sca_capacity_projects_tcu input data

Reload to recipe, then rerun sca_capacity_projects

CEQR Air quality

file_name	description
dot_traffic_cameras.zip	DOT traffic cameras
nysdot_functional_class.zip	NYS DOT Functional Class
nysdot_aadt.zip	NYS Traffic Volume Data
nysdot_traffic_counts.zip (2015, 2016, 2017)	General Highway Data
nysdec_air_monitoring_stations.zip	Air monitoring stations
facilities_garages.zip	Garages
tunnel_ventilation_tower_map.geojson	Ventilation Outlets
area_of_concern.geojson	Areas of Concern
atypical_roadways.geojson	Atypical Roadway
dep_cats_permits.zip	Manufacturing or processing facilities, boilers
nysdec_facility_permits.zip	Title V State Permits and Facility State Permits

set up helper functions

exporter
engines
config_loader

CEQR Schools: sca_e_projections

~~Exclude PK (PS should only include K-5)~~
Use IS sub district percentages to calculate IS e_projections
Correct years (include 2018 in final output)
Reach a consensus on rounding

Apply Standard data schema on Air Datasets

Standard data schema: https://docs.google.com/spreadsheets/d/1Z41fgiU_mi1KltlS783kUZpPcC8Sn7hUxTqhaNxQFg8/edit?pli=1#gid=0

dep_cats_permits: update borough values to proper case, add geo_ to all fields return by geosupport, geometry-->geom, expirationdate::date, issuedate::date, add geo_address, (geo_latitude,geo_longitude)::double precision
nysdec_facility_permits: add geo_ to all fields return by geosupport, geometry-->geom, , issue_date::date, expire_date::date
dot_traffic_cameras: geometry-->geom
atypical_roadways: l_zip-->left_zipcode, r_zip-->right_zipcode
facilities_garages: add geo_ to all fields return by geosupport, geometry-->geom
nysdec_air_monitoring_stations: geometry-->geom, (Add geo_ in front of longitude, latitude if they are from geosupport), map county-->borough (remove outsiteNYC by setting region='2'), borocode::integer, (geo_latitude,geo_longitude, geo_x_coord, geo_y_coord)::double precision
nysdot_functional_class: geometry-->geom
nysdot_aadt: geometry-->geom
nysdot_traffic_counts: should we rename county_code to borocode and remove records outside NYC?

Purge the DEP CATS permtis

Purge the dep_cats_permits using the following logic

All Cancelled permits
All G-prefix permits
All C-prefix + Registration (including Registration, Boiler Registration II, and Registration Inspection)
All CA-Prefix + Work Permit + Expired permits

CEQR: Schools SCA housing pipeline

Work with Capital Planning and HED to create data processing pipeline to produce the SCA housing pipeline that is used in CEQR

Output tables:

sca_housing_pipeline_by_boro
sca_housing_pipeline_by_sd

CEQR Air: Find state spatial dataset to join to for state traffic counts dataset

There is an application with the locations of the traffic count locations

type conversion

http://initd.org/psycopg/docs/usage.html#adaptation-of-python-values-to-sql-types

Schools: Capacity projects update

Generate February version and deliver update to Labs

Replace capacity projects in process with the updated table from CP
Add TCU as a new input
Continue using the November tables for capacity projects from the 2015-2019 capital plan

CEQR Air Quality: Areas of Concern (shapefiles)

Work with EARD to generate more accurate maps of:

Downtown Brooklyn
Long Island City
Manhattan (details in technical manual)

Add urls to nysdec_title_v_permits

https://www.dec.ny.gov/dardata/boss/afs/issued_atv.html

webscrape from the above source

ceqr_school_buildings clean up

The ETL pipeline is almost ready, except:

We still need the following four fields for doe_lcgms: org_level, ps_enrollment, ms_enrollment, hs_enrollment
The 2019 table has doubled its number of records compared to the 2018 and 2017 tables. Based on my observation, this issue might get solved after applying the following filter.

SELECT * FROM ceqr_school_buildings."2019"
WHERE source = 'bluebook'
AND (org_level = 'PK'
OR org_level = 'PS'
OR org_level = 'PSIS'
OR org_level = 'IS'
OR org_level = 'ISHS'
OR org_level = 'HS'
);

CEQR Air Permits

Geocode the three open datasets (dec_title_v_facility_permits, dec_state_facility_permits and dep_cats_permits)
merge dec_title_v_facility_permits and dec_state_facility_permits
deduplicate between the three datasets on facility_name and facility_location

Include PK in sca_enrollment_pct_by_sd table

Include PK in output and republish November version

CEQR Air: Manually research and create addresses for non geocoded permits

For records that are part of the list of 52 records that did not get geocoded, and the record is in NYC, correct the address so that it can geocode. When reporting records that didn't get geocoded only include records that are in NYC

CEQR Air Quality: DOT Traffic Cameras

Add URL field that links to traffic camera for that location

CEQR Air: Include all records for CATS records

Remove filter on the CATS data to include all records
ERD will inform us how the data should be symbolized and broken out into group layers

CEQR Schools: School buildings lists of schools to exclude

Confirmed lists

Specialized high schools

The Bronx High School of Science (X445)
The Brooklyn Latin School (K449)
Brooklyn Technical High School (K430)
High School for Mathematics, Science and Engineering at City College of New York ((M692)
High School of American Studies at Lehman College (X696)
Queens High School for the Sciences at York College (Q687)
Staten Island Technical High School (R605)
Stuyvesant High School (M475)

Citywide G&T schools below:

1. N.E.S.T. (M539)
1. The Anderson School (M334)
1. Tag Young Scholars (M012)
1. Brooklyn School of Inquiry (K686)
1. The 30th Avenue School (Q300)

From here: https://insideschools.org/

CEQR Schools: Capacity projects - take centroid of line string geometries

If we're going to use the Segment function of GeoSupport, take the centroid of the linestring that's generated so that we consistently have only points in the output dataset

If there's a more efficient method than just taking the centroid of the linestring feel free to implement it, so long as the outcome is the same

CEQR Air: Connect tunnel ventilation dataset with DOT study

Here is the link to the 2016 New York City Bridge Traffic Volumes Report: http://www.nyc.gov/html/dot/downloads/pdf/nyc-bridge-traffic-report-2016.pdf

direct users to this link

Feb update: regenerate doe_significant_utilization_changes table

CEQR Air: corrections to make after QAQC

dep_cats_permits

make values in geo_bbl NULL and not 0 ( or can we not do this because the field is a bigint?)

atypical_roadways

don't include records where FeatureTyp = 1

nysdot_functional_class

There is a segment showing up in Albany. Remove the segment from the dataset. It has urban_area_code 63217 and borough 'Queens,' which is probably causing the problem

QA/QC reports

Generate report that reports the geocoding / mapping success rate for each CEQR dataset
Output the records that are part of the target subset that did not geocode for each dataset, including grc code

Determine appropriate date format for SCA capacity projects

Currently converted to full timestamp. Input data has years, so the default is Jan 1 at midnight. Is this appropriate?

doe_bluebook and doe_lcgms merge

https://gitlab.com/nycplanning/ceqr-app-data/issues/5
Currently, we provide schools data in two schemas on CEQR Data: doe_lcgms and sca_bluebook. This does not make intuitive sense on the app side, as these are just sources for the ultimate need of the app: a row representing one school.

This new schema would be named something like ceqr_school_buildings and look like:

CREATE TABLE ceqr_school_buildings."2017" (
    district integer,
    subdistrict integer,
    borocode integer,
    bldg_name character varying,
    excluded boolean,
    bldg_id character varying,
    org_id character varying,
    org_level character varying,
    name character varying,
    address character varying,
    ps_capacity integer,
    ps_enroll integer,
    is_capacity integer,
    is_enroll integer,
    hs_capacity integer,
    hs_enroll integer,
    source text, -- either 'bluebook' or 'lcgms'
    geom geometry(Point,4326)
);

This table would replace bluebook and lcgms tables in the app moving forward. The yearly data generation task would be to combine bluebook and lcgms to provide the most accurate list of schools open in the current school year, with their enrollments and capacities.

load both datasets into RECIPE_ENGINE
change column names for sensitive info
create build.py to merge the two

CEQR Air Quality: Ventilation Outlets

DOT data for ventilation outlets

Identify source data
This refers to the tunnel ventilation towers for the 4 major vehicle tunnels in NYC (Midtown, Lincoln, Holland, Hugh L. Carey Tunnel aka Battery)
Create a spatial dataset of the vents on either side of the tunnel

Create sca_e_projections for 2017 and 2018

Using the data in the current CEQR database, generate sca_e_projections so that data are reported at the sub district level

Package and ship updated air permits data

For the latest ceqr updates, we added intersection geocoding logic which improves the geocoding result moderatively. Three datasets are affected by this update, and some new fields are added into their shapefiles and geo_rejects.csv:

dep_cats_permits:
- address - address after programmetic cleaning,
- streetname_1 - parsed cross street part 1,
- streetname_2 - parsed cross street part 2,
- geo_x_coord - x coordinate returned by Geosupport,
- geo_y_coord - ycoordinate returned by Geosupport,
- geo_function - the Geosupport function being applied
nysdec_state_facility_permits:
- streetname_1 - parsed cross street part 1,
- streetname_2 - parsed cross street part 2,
- borough - backfilled borough via zipcode_borough look up table,
- geo_x_coord - x coordinate returned by Geosupport,
- geo_y_coord - ycoordinate returned by Geosupport,
- geo_function - the Geosupport function being applied
nysdec_title_v_facility_permits:
- streetname_1 - parsed cross street part 1,
- streetname_2 - parsed cross street part 2,
- borough - backfilled borough via zipcode_borough look up table,
- geo_x_coord - x coordinate returned by Geosupport,
- geo_y_coord - ycoordinate returned by Geosupport,
- geo_function - the Geosupport function being applied

@aferrar the latest dataset can be found here, and please feel free to hide any above fields on your side and we plan to ship them to EARD by today. Let me know if you have any questions. Thanks!

CEQR Air Quality: Garages

Garages

Sanitation
School Buses
MTA bus garages
Uses Transportation trigger (> 85 spaces ?)

Tie to improvement of FacDB

Refactor the sca_e_projections by using sca source data

Mismatch in the number of DEP CATS permits

Please investigate the number of DEP CATS permits comparing to what EARD has purged with the following categories:

All Cancelled permits
All G-prefix permits
All C-prefix + Registration (including Registration, Boiler Registration II, and Registration Inspection)
All CA-Prefix + Work Permit + Expired permits

CATS Permits raw 010720.zip

CEQR Air Quality: Manufacturing or processing facilities, boilers

Source:
DEP CATS Permits

Steps:

load into recipes
only select records that have CURRENT in their status
Geocode

Add metadata for Air datasets

Work with GIS to determine what template to use and generate metadata

CEQR Air Quality: DOT data sources

Create data layer for each source. We'll need to investigate to see what data are available on Open Data or other sources.

NYC DOT Traffic Cameras: https://www1.nyc.gov/html/dot/html/motorist/atis.shtml#traffic
NYS DOT Functional Class: https://gis3.dot.ny.gov/html5viewer/?viewer=FC
NYS Traffic Volume Data: https://gis3.dot.ny.gov/html5viewer/?viewer=tdv (some data points haven't been updated in 10 years so there might be a better source)
General Highway Data: https://www.dot.ny.gov/divisions/engineering/technical-services/highway-data-services/hdsb

Create sca_e_projections by boro

Aggregate HS data to boro level

Create a new folder named sca_e_projections_by_boro under the ceqr_app_date/ceqr/recipes directory
Create config.json, README.md, build.py, and requirements.txt
use sca_e_projections.2019 as the input data
target schema @bfreeds please confirm the schema for this CEQR table

CREATE TABLE sca_e_projections_by_boro."2019" (
    school_year integer,
    borocode integer,
    hs integer
);

build its ETL pipeline using sca_e_projections as an example
- convert the input table from a wide table to a long table using a pandas syntax called melt
- calculate the hs projections by summing up projected = [9,10,11,12], don't use projected = '9-12 Total' from sca_e_projections.2019 which has incorrect values
- write a python function or dictionary to map district to its matchingborocode
- aggregate the data to boro level
export the output to EDM_DATA
push your code to another branch

CEQR Air Quality: Atypical Roadway

What is an Atypical Roadway? A roadway that is either elevated or depressed compared to the surrounding environment. Not on the same level. If site is 200 feet from an Atypical Roadway then this needs to be considered in the analysis.

Determine best source for extracting Atypical Roadways (LION, Planimetrics, or...?)
Publish subset of data
Create line dataset (convert to roadbed?)

update cook to allow version specification

add additional parameter to allow custom versioning

version must be in YYYY/MM/DD format
if not specified, default to current date

CEQR Air: Split Title V and State permits into 2 files

CEQR Air Quality: Medical, chemical, or research labs, manufacturing or processing facilities

Work with EARD to determine how to filter PLUTO by land use or building type to output the desired subset
Add fields to clarify data (i.e. type of facility, based on use group)
Mostly covered from CATS data (because generators), might not need a seperate layer from PLUTO based on Land Use

Improvement to how we process sca_e_projections

Don't use "total" rows from sca_e_projections and SUM the grade rows instead
based on errors found by Capital Planning