nickmorales / orie4741project

Jupyter Notebook 52.42% HTML 47.20% TeX 0.38%

orie4741project's Issues

Peer review

This project is trying to predict what kinds of building permits will be approved in NYC. A variety of datasets contain information about the characteristics of the surrounding area such as crime rates, access to public transit, proportion of residential buildings, what permits have been approved historically, etc. The end goal is to identify which features are the most influential in determining if a permit is approved or not.

I like the variety of datasets/features potentially being included in the model. I do think that all these features have an impact on the success of a permit being issued. However, I would be interested in knowing how exactly you will define features like availability of restaurants/stores (will you be using a firm 1-mile radius or would easy access to the subway increase the radius to something like within a 5-min subway ride?).
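To make the radius question concrete, here is a minimal Python sketch of a "count of places within a fixed radius" feature. The function names, the 1-mile default, and the coordinates are illustrative assumptions, not anything taken from the proposal.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def count_within(site, places, radius_miles=1.0):
    """Number of places whose distance from `site` is at most `radius_miles`."""
    return sum(haversine_miles(*site, *p) <= radius_miles for p in places)

# Made-up coordinates: one permit site and two nearby points of interest.
site = (40.7580, -73.9855)
places = [(40.7590, -73.9845), (40.6892, -74.0445)]
nearby_1_mile = count_within(site, places, radius_miles=1.0)
nearby_10_miles = count_within(site, places, radius_miles=10.0)
```

A travel-time definition (such as a 5-minute subway ride) would need transit data instead of straight-line distance, so it cannot be expressed with this helper alone.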

I also like the larger question posed, because I do think that knowing what areas of the city are under heavier development could definitely influence future development and also whether people/businesses move to those areas. However, despite building permits being indicators of development, the mentioned features seem to predict whether permits are requested more than whether they are successfully approved. I think that the approval of a permit is more objective in terms of laws and regulations, so it could be helpful in your model to include other data on building-specific information (especially since different types of buildings could have different restrictions).

I'd also be interested in knowing if there is any temporal factor that could be incorporated into the datasets since it is entirely possible that 10 years ago, development was a completely different landscape and/or restrictions were different, etc. so weighting older data differently could be useful.
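One possible way to weight older data differently is an exponential decay on record age, sketched below in Python. The decay form, the `half_life_years` value, and the function name are assumptions made for illustration only.

```python
def recency_weights(filing_years, current_year, half_life_years=5.0):
    """Exponential-decay sample weights: a record loses half its weight
    every `half_life_years` (an assumed, tunable value)."""
    return [0.5 ** ((current_year - year) / half_life_years) for year in filing_years]

# Hypothetical filing years for four permit records.
weights = recency_weights([2008, 2013, 2016, 2018], current_year=2018)
```

Weights like these could be passed as per-sample weights to any model that accepts them, so that recent permits count more during fitting.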

Overall, I like how general this model is and that it could be applied to other cities/areas to help predict development there.

Peer Review -mt788

This project is really interesting because of the big-picture economic focus of its question. It would be interesting to look at the regional effect and understand if certain permits are skewed to certain regions.
What probably makes it more interesting is the fact that this is being related to other factors like the crime rate and subway presence. This makes the analysis more rigorous and interesting.

However, I do feel that it may require a lot of work to relate all the factors mentioned later in the proposal. I'd suggest coming up with many hypotheses and testing them one by one. This way the team is less likely to get lost trying to correlate too many factors and end up losing time.

Potential hypotheses include:

The region has an effect on how construction permits are given, and this also depends on the type of business.
The crime rate in the location plays a role in how permits are given, for a given business type and region.
And many more...

This way the team can make steady progress while ensuring that work gets done along the way.

Overall, the project is definitely ambitious and requires effort and I hope the team continues the work even after the end of the course.

Final Report Peer Review

The project aims to predict whether a permit should be issued and how long it takes for a permit to be issued. The data is collected from SF Permit Data, Treasurer & Tax Collector’s Office Business, restaurant violations Health Data and Fire Data.

One of the things I like about the project is that they provide a lot of visualizations of the data. This gives me an overview of how the data is distributed. Another thing I like is that they collect a lot of different datasets to help them analyze the data. This is very helpful for training a more general model. Finally, I think they build a lot of different models to answer different kinds of questions. For example, they try to classify whether or not a permit will be issued by training on both issued and unissued permit data; classify whether or not an issued permit will be issued the same day it’s filed (such permits are referred to as “same-day” permits from now on); and predict the number of days it takes for non-same-day permits to be issued.

One thing I don’t like about this project is that they only use misclassification rate to judge the performance of a model. They should also consider other metrics such as F1 score and confusion matrix for classification. Another thing I don’t like is that they don’t mention how they do the train-test split, so I don’t know whether the dataset is imbalanced or not. Finally, I think they should try to reduce some content of the report since it is a little bit too long for me to follow.
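The point about misclassification rate can be illustrated with a small Python sketch. The labels below are made up (1 = issued, 0 = not issued) and the helper names are hypothetical; the sketch only shows why a confusion matrix and per-class F1 can tell a different story than the error rate on imbalanced data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy imbalanced data: nine issued permits, one unissued.
y_true = [1] * 9 + [0]
y_pred = [1] * 10  # a model that always predicts "issued"

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive=1)
error_rate = (fp + fn) / len(y_true)
f1_issued = f1_score(tp, fp, fn)

# Same predictions, scored with "not issued" as the positive class.
tp0, fp0, fn0, _ = confusion_counts(y_true, y_pred, positive=0)
f1_not_issued = f1_score(tp0, fp0, fn0)
```

In this toy case the error rate is only 10%, yet the F1 score for the "not issued" class is zero because the model never predicts it.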

Overall, the group made a great effort to produce the final report and provided meaningful insights for answering this question.

Final Peer Review ell65

This group aims to analyze building permit applications in San Francisco and see if the model they develop on this dataset is also applicable to a similar dataset on permit applications in NYC. I am initially intrigued by this project proposal because of the vast differences in the two cities including geography and climate, which would seem to affect building permits significantly. I have worked for a contractor before and have dealt with these municipal entities firsthand and am therefore very interested in the possibility that such a model could allow builders to analyze the feasibility of a project without waiting long amounts of time and spending money on applications only to be denied.

Some things I liked:

This type of model could be useful to improve efficiency in businesses as well as evaluate whether a government is responsibly issuing permits.
The group noticed some significant limitations to their data such as the issue of recently filed permits that are experiencing longer approval times not showing up in the data due to still being in the process. However, they also noticed that simply removing every entry where a permit has not been issued would be inherently problematic as well. I thought this section was an astute analysis of the trends in their data.
The group tried many models on their data to find what worked best with their dataset. I found the analysis of the error metrics to be very well-explained and the group got some fantastic error rates in my opinion!

Suggestions:

I found the addition of the fire department, business, and restaurant datasets interesting, but would have appreciated a bit more insight about why the group thinks that fire department calls, businesses, and restaurant inspections would predict the number of permits in a zipcode. I am wondering if these things would almost have more of an effect on the days to issuance than the number of permits in a zip code.
For the zip code analysis, it might be worth exploring the differences in those zip codes to see if there is anything about certain areas that would alter the results.
While there were certainly a handful of graphics in the report, I imagine there may be a visual representation other than a histogram or scatter plot that could provide insight about this problem.

Overall fantastic project!

Final Report Peer Review - hg426

Overall I really liked the project idea and the written report - it is a problem that I don't have much domain knowledge on but one that I think would be useful for real estate or other building development people. I will give my feedback here mostly from a data science perspective.

The part that I really liked about your project as a whole is the fact that you did not settle on just one objective or one particular dataset. You explored the problem in NY, SF, and tried to ask different insightful questions to see if a model can be used to answer those questions. Each model's objective was clear and there were dedicated materials that went into how these models would or would not become helpful. I think you did a great job finding additional datasets, joining them, and creating a wealth of features and data for you guys to work with.

The parts that I thought you guys should elaborate more on are as follows. First, I would like to know more about what feature engineering you guys did. For example, there were mentions of data cleaning, joining, and other manipulations, but there was not as much discussion of how you explored transformations or even the creation of other useful features. Maybe that is because you already have enough features to work with, but it is still useful to have some discussion here. There were over 20-30 features in your dataset - were any of them correlated, or could you use some dimension reduction method to lower your feature space dimensionality (e.g. PCA)? Another big thing that I would like to see in the report is how you chose the models - SVM, Random Forest, etc. - which was not discussed much. Why do these models work well in this situation, and how do they perform better than the other ones? You could also include some cross-validation here to test whether your model is overfitting, and describe how you plan to address that.
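As a rough illustration of the cross-validation suggestion, the Python sketch below runs k-fold validation around a one-feature least-squares fit. The model, the toy data, and all function names are stand-ins chosen for brevity; they are not the group's models or data.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def cv_mse(xs, ys, k=5):
    """Average validation MSE across k folds."""
    fold_errors = []
    for train, val in k_fold_indices(len(xs), k):
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared = [(ys[i] - (a + b * xs[i])) ** 2 for i in val]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k

# Toy, perfectly linear data; real permit data would be noisy.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
score = cv_mse(xs, ys, k=5)
```

A validation error well above the training error would point to overfitting. The folds here are contiguous, so rows that are sorted (for example by filing date) should be shuffled first unless a time-ordered split is intended.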

Peer-review gt332

NYCBuildingPermit Midterm Review

The project is about NYC (now converted to SF) building permits and the likelihood of getting approval given data about the application's region. The group is considering data from several SF government resources including data on: DOB permit issuance, subway entrances, crime rates, wifi spots and taxi travel. The group's objective is to identify factors that impact building permit approval, potentially providing recommendations to home and business owners.

One thing I like about this project so far is the diverse feature transformations the group has utilized in order to model future building permits. The group has considered many of the transformations we learned in the airBnB homework, which have turned out to be very useful in future predictions. Another aspect of the project that I liked is how the group dealt with selection bias, which could have greatly influenced the model's prediction accuracy. Finally, I like that the group considered more than one model for presentation, as it helps the reader better follow the reasoning of the researchers and how they were trying to best approach modeling the given dataset.

One area for improvement on the project would be the description and reasoning for the explanatory variables considered in the model. This helps the reader align their thoughts with how the researchers are approaching their model. Another area for improvement would be a more detailed explanation for why the group used MSE as an error metric versus other possible error metrics that penalize outliers differently. One last area (which is very subjective to the reader) would include a reason that this study is important. Maybe even just a figure or estimate like $Xbn a year is spent on attempting to get building permits which gives the reader a sense for the implications of a good predicting model for this question.
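To show how error metrics penalize outliers differently, here is a small Python comparison of MSE, MAE, and Huber loss on made-up residuals. The residual values and the `delta` threshold are assumptions for illustration, not numbers from the report.

```python
def mse(errors):
    """Mean squared error: large residuals dominate quadratically."""
    return sum(e * e for e in errors) / len(errors)

def mae(errors):
    """Mean absolute error: every residual counts linearly."""
    return sum(abs(e) for e in errors) / len(errors)

def huber(errors, delta=1.0):
    """Huber loss: quadratic for |e| <= delta, linear beyond it."""
    def loss(e):
        a = abs(e)
        return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)
    return sum(loss(e) for e in errors) / len(errors)

# Made-up residuals in days; the last one plays the role of an outlier.
residuals = [1.0, -1.0, 2.0, 30.0]

mse_value = mse(residuals)      # 226.5
mae_value = mae(residuals)      # 8.5
huber_value = huber(residuals)  # 8.0
```

The single 30-day residual contributes 900 of the 906 units of squared error, while under MAE or Huber it contributes roughly in proportion to its size.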

Overall I enjoyed reading the report and good luck on the rest of the project.

Midterm Report Peer Review - cl2567

The project group is trying to identify factors which affect whether a building permit will be issued, and how long it will take in days for a permit application to move from the filing stage to the issued stage.

Things that look really good:

The objective is very clear and the group has a very detailed, feasible methodology to achieve the goal. For instance, they split their project into two separate parts. The first part solves a classification problem that classifies whether or not a permit will be issued by training on both issued and unissued permit data. The second part predicts the number of days for a permit to be issued by training on permits that we know were eventually issued.
They identified the physical limitations of the project and provided feasible ways to solve them.
The preliminary predictions look very good and provide implications for future steps.

Things that can be considered in the next step:

The group talked about the unbalanced categorical features problem. Maybe they can use some resampling techniques (e.g. SMOTE) to make the classification more reliable.
For the feature engineering part, maybe they can test the correlation among features and drop some of the correlated features.
Try some nonlinear models since the group talked about the problem of underfitting.
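The correlation suggestion above could look something like the following Python sketch: compute pairwise Pearson correlations and greedily keep a feature only if it is not too correlated with one already kept. The column names, values, and the 0.9 threshold are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a column only if |r| with every kept column <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns (names and values are made up).
columns = {
    "existing_units": [1, 2, 3, 4, 5],
    "proposed_units": [2, 4, 6, 8, 11],  # nearly a multiple of existing_units
    "estimated_cost": [5, 1, 4, 2, 3],
}
kept = drop_correlated(columns)
```

The result depends on column order, since the first feature in a correlated pair is the one that survives, so this is only a starting point before a more careful selection.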

Peer review comments

The project attempts to use various geographic/communal features to predict which active construction permits in the NYC areas have a high chance of being approved, using the various open data sets from NYC that are available to the public.

I like how the project’s main focus was framed in a broader context of understanding different rates in growth in different areas of a city. It was also useful seeing specific examples of features which might affect development. I’m interested in seeing which features turn out to have relatively more influence than others on approval rate of permits. This may even potentially suggest certain actions that cities can take to increase development in particular zones (e.g. monitor crime rates, create more commercial zones, etc.).

Peer Review - vvs24

The project that I am reviewing is about predicting construction permits. The datasets used are numerous and include several datasets from the site data.cityofnewyork.us. This site includes datasets about permits issued in New York, crime statistics, park zones, etc. The objective of this project is to predict whether a permit (both residential and commercial) will be approved based on various features including average income, crime, park zones, etc. in the area.

Things I liked about the proposal:

The proposal identifies a large number of datasets that the project can draw from. There is an abundance of features as well as a large number of data entries in each dataset which means that there are a lot of different directions this project can be taken in.
The problem is very relevant. I'm sure many companies and home owners would be interested in how likely they would be to get a permit based on various conditions. I think it's interesting that the project proposal identified the permit prediction problem as a subproblem under the larger question of how different features affect the overall development of an area.
I like that a potential reach question for your project relates development to time. I think there is a lot of potential here with analyzing the time series data in your datasets.

A few concerns that I have:

The datasets have a large number of features but the proposal did not list any plan on how the datasets will be used or how the different features in the datasets will be incorporated.
Some of the features in the datasets will be hard to embed / vectorize / represent as inputs to a classification model. For example, I wish there was more description about how the park zone maps would be used as inputs to the problem.
Be careful of using features that might result in a model that could be offensive to certain groups of people. Your final model should not classify based on stereotypical features.

Overall great work!

Peer review

NYCBuildingPermit Project Proposal

The project is about NYC building permits and the likelihood of getting approval given data about the application's region. The group is using data from several NYC government resources including data on: DOB permit issuance, subway entrances, crime rates, wifi spots and taxi travel. The group's objective is to identify factors that impact building permit approval, potentially providing recommendations to home and business owners.

One thing I like about the group's proposal is the wide array of data sets that they plan to use. I think this could provide a rich hypothesis space for developing potential functions used to give recommendations. Specifically I thought the wifi spot data was interesting. Another thing I liked about the proposal is the large audience who would find this research helpful. Sometimes in these courses we get caught up in questions that will make no difference to anyone around us, but this question clearly has an impact on a large group of people in NYC. Finally, I liked that the proposal lends itself directly to topics we have covered in this class thus far, which makes me believe this group can find solutions to their given proposal. I will be interested to see how they choose to incorporate feature engineering into their algorithm.

In regards to areas for improvement I will try to state three. I think if I were to get hypercritical, one area for improvement could be the use of other academic sources who have already tried to tackle a problem like this and reference their materials. It would be interesting to see if they came up with similar conclusions and whether or not the timeline of data had any sort of influence. Secondly, I think there could be some challenges determining whether or not some of these variables will have spurious correlation and whether or not they should be included in the model. I like the fact that the group has offered so many potential variables, but I am wondering if there will be any issue with having too many variables and knowing which ones carry the most predictive power. Finally, I think the dataset provided by NYC will be a valuable resource, however it will likely give a very biased result if the group is interested in applying what results they find more broadly across the world. I think this issue could be bridged by finding a larger dataset from different cities and using them as well. I wonder if this would result in models that need to be retrained based on the city we are considering or if there are factors that can be used at large for any location.

Final Report Peer Review - ys447

The project was on the analyses of building permits in San Francisco, and a total of 6 models are fitted based on the dataset you have. I really like the perspectives you chose, i.e., the number of days for a building permit to be approved and the number of total permits approved in San Francisco, which I believe are the two factors that most accurately reflect the building permit conditions in the city. The models considered in the project are exactly those we have explored in class, and their analyses demonstrated that they have truly understood the techniques we discussed in each of the models.

What I really liked about the report is that they provide very clear and illustrative graphs. They have attached the distribution of the fire, health and business across different locations in San Francisco using heat maps, and analyzed the distributions of the factors they are considering using histograms and scatter plots. Also, they provided clear analyses and insights into the features they are looking at, including how to deal with missing values and how to correct the zero-inflation.

However, the following are what I think they can further improve on:

In terms of model comparison, model (5), the AR model predicting the number of permits, has a higher training error than testing error. You might want to look into the algorithm/dataset to check why this is the case and give some possible explanations.
It would be more illustrative if you provide more explanations on the coefficients you get from linear regressions, and try to analyze how the predictions would change given changes in a single feature.
You could provide more reasoning on why certain features are chosen over others. For example, why did you choose permit type, existing number of units in the building, the type of planset, and whether or not this permit was reserved for fire use when training the SVM?
It may be helpful if you look more into the generalization part of the project--how well would the model based on San Francisco data perform on NYC data? Are the cities similar in any way?

Midterm Report Peer Review - bw436

The goals for this group are to identify factors that affect the issue decision for a permit and predict how long it will take for an application to move from the filing stage to the issued stage. They elected to use the SF data for model training and development.

Initial thoughts:

Report is well-written and thorough.
Plots are great and complement the explanations well.
Great project idea and great start.

Moving forward:

Maybe take into account whether you prefer more false positives or false negatives when scoring models?
Make sure to look into the features that have a dominant value and see if they are problematic.
Remember to incorporate concepts/algorithms from class.
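The first suggestion above, about preferring false positives or false negatives, can be sketched as a cost-weighted error in Python. The label encoding (1 = issued), the cost values, and both sets of predictions are invented for illustration.

```python
def weighted_error(y_true, y_pred, fp_cost=1.0, fn_cost=1.0):
    """Total cost of mistakes, with label 1 = permit issued (assumed encoding)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp_cost * fp + fn_cost * fn

y_true = [1, 1, 1, 0, 0, 0]
model_a = [1, 1, 1, 1, 1, 0]  # two false positives
model_b = [1, 0, 0, 0, 0, 0]  # two false negatives

# Equal costs: the two models tie.
plain_a = weighted_error(y_true, model_a)
plain_b = weighted_error(y_true, model_b)

# If a false negative is judged five times worse, model_a is preferred.
costly_a = weighted_error(y_true, model_a, fn_cost=5.0)
costly_b = weighted_error(y_true, model_b, fn_cost=5.0)
```

Both models make two mistakes, so plain misclassification rate cannot separate them; the cost ratio is what encodes the preference.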

Mid-term Peer Review -- zh378

This project aims to use big data techniques to predict whether a building permit will be issued, and how long it will take in days for a permit application to move from the filing stage to the issued stage. The data sets used are the SF Permit Data and the NYC Permit Data, which contain a large number of records and highly useful features. The results of the projects can benefit many groups of people, such as homeowners, property developers, real estate agents, etc.

I like many aspects of the project and the team’s achievements. First, the team provided detailed discussions of the physical limitations of the data sets and their plans to overcome these challenges. Second, the team produced nice visualizations of the data sets and made legible analyses of the plots. Third, the team has run great preliminary analyses of the data set, and promising future plans to improve the results have been put forward.

Here are several suggestions that I hope can be useful for the team:

It would be better if a more detailed and clearer description of the data sets can be provided, such as how many features are present, what is the type of each feature, etc.
It is mentioned in the report that there are about 1300 data points that are missing location data, so it is advisable for the team to discuss how they plan to handle these missing values.
The team mentioned at the end of the report that some additional data sets, such as crime rates data and transportation access data, will be needed. I would recommend the team to introduce how they plan to get these data, and what potential challenges in using these data sets might exist.

Final Peer Review - zh378

This project aims to use big data techniques to analyze the permit approval process of a building. In particular, the team builds models to predict the number of days for a permit to be issued and the number of permits in a zip code. Besides the primary SF Permit Data, the project uses three additional datasets — the fire department dataset, the business dataset, and the restaurant inspection dataset. The models produced have impressive predictive power, and the predictions made are of high quality, fair and free of the risks of math destruction. I believe these results will be useful for business owners to optimize their operations, as well as for cities to further their commercial development.

Many aspects of this project are fairly impressive. First, the team implemented effective feature engineering and incorporated several other useful datasets to extend the predictive power of their original dataset. Second, the team used linear and non-linear models appropriately, and provided detailed reports of prediction accuracy and analysis of the strengths and weaknesses of the model. Third, the team presented in-depth interpretations of their results and discussion of fairness and other social effects.

Here are several suggestions that I hope can be useful for the team:

As mentioned at the beginning of the final report, it would be better if the team can discuss whether the results can generalize onto the buildings of other cities, including but not limited to New York City.
For model (1) and (2), it is advisable for the team to introduce the reasoning to choose SVM rather than other linear (or non-linear) models.
As we know, when using Random Forest (and other models with tunable hyperparameters), the choice of hyperparameters is critical for preventing underfitting and overfitting, as well as for producing the optimal results. I would recommend that the team describe what hyperparameter values were used and how they were determined.

Peer Review -- av529

Summary

The project focuses on NYC Building Permits, and aims to identify zones and neighborhoods that are more likely to have building permits approved in a larger attempt to explore the development of different regions in the city. The primary data is from the NYC Open Data DOB Permit Issuance and is complemented by other NYC Open Data datasets that collect relevant transit, social, and economic features. The end goal is to identify factors that impact building permit approval, and formulate it in such a way that can benefit home and business owners along with City Officials.

What I like about the project

Their proposal of combining permit approval data along with external data is exciting and may lead to interesting observations. It also will lead to a holistic picture of what's going on within different areas of the city.
I liked how the project's main focus is understanding regional rates of growth by studying building permits. It is a quaint and immensely informative lens with which to view the development of a city.
The outcomes of this project will lead to interesting conclusions that will be applicable to cities, and useful for local governments across the world.

Areas of improvement / Queries

I would recommend adding tax data (income and corporate) to get a better idea of what's going on within the city neighborhoods.
From what I have observed building permits are usually approved if they simply abide by laws and regulations, and therefore are less subjective. If this is the case, then I don't understand how the outcomes of the project should be interpreted.
Given the temporal nature of the problem, there are innumerable factors that may influence the results. I am uncertain whether the given data will be sufficient to come to a satisfactory conclusion.

Midterm Review yl737

The objective of the group is to identify factors which affect whether a building permit will be issued, and how long it will take in days for a permit application to move from the filing stage to the issued stage. The data is collected from the official NYC and SF websites.

I like the group's detailed data analysis; the visualization and outlier analysis in particular introduce the dataset to us well. The group explores each feature's distribution, such as permit status, zip codes, permit types, etc. More importantly, the group has access to a large amount of data to support their model construction.

Following are some points that the team could improve on:
The preliminary analyses predict the number of days to issuance through linear regression using different combinations of features. The group could try various combinations rather than only adding new features to the previous set of features for regression. Moreover, the group could adopt different models rather than trying only linear or non-linear regression models.
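A short Python sketch of what trying "various combinations" could mean in practice: enumerate every feature subset instead of only growing one nested sequence. The feature names are hypothetical and the helper is an illustration, not the group's procedure.

```python
from itertools import combinations

def feature_subsets(features, min_size=1, max_size=None):
    """Yield every subset of `features` with size in [min_size, max_size]."""
    max_size = len(features) if max_size is None else max_size
    for size in range(min_size, max_size + 1):
        for subset in combinations(features, size):
            yield subset

# Hypothetical feature names, not taken from the report.
features = ["permit_type", "zipcode", "estimated_cost"]
subsets = list(feature_subsets(features))
```

With three features this yields seven subsets, compared with the three that a strictly nested sequence would test. The count doubles with each added feature, so for many features a cap on `max_size` or a stepwise search would be needed.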