Wake County Housing Market Tableau Public Link
Team Seven - Marla, Yolanda, Robert
Modeling Wake County Real Estate - Model to compare and analyse price based on living area, bedrooms, bathrooms, and lot size. Using property types and zip codes as categorical bounds, we analysed factors driving the total sale price of a property, and the differences in weight of those features based on the property type and location.
Using API data from recent Zillow sales in Wake County, North Carolina, we built a database that integrates the features of recent sales for our machine learning model.
Our multiple linear regression model includes four features:
-living area
-lot size
-bedrooms
-bathrooms
These variables are modeled to determine their impact on sales price.
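As a minimal sketch of how a four-feature multiple linear regression can be fit, the example below uses NumPy least squares. The rows, intercept, and weights are made up for illustration and are not taken from the project dataset, and this is not necessarily the library or code the project used.

```python
import numpy as np

# Hypothetical rows for illustration only (not from the project dataset).
# Columns: living area (sq ft), lot size (acres), bedrooms, bathrooms.
features = np.array([
    [1500, 0.25, 3, 2],
    [2268, 0.71, 3, 3],
    [3100, 0.50, 4, 3],
    [1200, 0.10, 2, 1],
    [2700, 1.20, 4, 4],
    [1900, 0.33, 3, 2],
], dtype=float)

# Synthetic prices generated from made-up weights so the fit can be checked.
true_intercept = 100_000.0
true_weights = np.array([150.0, 50_000.0, -10_000.0, 5_000.0])
prices = true_intercept + features @ true_weights

# Prepend a column of ones so the model learns an intercept (the "constant").
design = np.column_stack([np.ones(len(features)), features])

# Ordinary least squares: choose beta to minimize ||design @ beta - prices||^2.
beta, *_ = np.linalg.lstsq(design, prices, rcond=None)
intercept, weights = beta[0], beta[1:]

print("intercept:", round(float(intercept), 2))
print("weights:", np.round(weights, 4))
```

Because the synthetic prices contain no noise, the fitted intercept and weights should match the made-up values up to floating-point error. Real sales data would not fit this cleanly.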
Incorporating ZIP code as a categorical variable allowed us to determine the geographical impact of the features across the county. Splitting the data by property type also allowed us to model the implied value of the land itself without the building improvements - a useful tool in a location with some aging housing stock that may be candidates for 'knockdown' redevelopment.
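One way to picture this slicing is to group sale records by property type, or by property type and ZIP code, before fitting a model on each group. The sketch below uses only the standard library. The field names, property-type labels, and sale records are hypothetical and do not come from the project data.

```python
from collections import defaultdict

# Hypothetical sale records; field names and values are assumptions.
sales = [
    {"zip": "27601", "property_type": "SINGLE_FAMILY", "price": 450_000},
    {"zip": "27601", "property_type": "LOT", "price": 120_000},
    {"zip": "27513", "property_type": "SINGLE_FAMILY", "price": 610_000},
    {"zip": "27513", "property_type": "SINGLE_FAMILY", "price": 575_000},
]


def split_by(rows, *keys):
    """Group rows into lists keyed by the tuple of values for the given keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return dict(groups)


by_type = split_by(sales, "property_type")
by_type_and_zip = split_by(sales, "property_type", "zip")

for key, rows in by_type_and_zip.items():
    print(key, len(rows))
```

Each group could then be passed to the regression separately. Smaller groups leave fewer samples per fit.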
As a baseline, our model was executed against all samples in our dataset regardless of property type or ZIP code.
Summary stats for our model:
Our model showed an effective fit, with an R-squared of 81.3%.
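For readers unfamiliar with the statistic, R-squared is the share of variance in sale price that the model explains: one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, with made-up prices used purely for illustration:

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for paired actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical sale prices and model predictions (not project data).
actual_prices = [300_000, 450_000, 600_000]
predicted_prices = [320_000, 440_000, 590_000]
print(round(r_squared(actual_prices, predicted_prices), 3))
```

A value of 1.0 means the predictions match every sale exactly, and a value of 0.0 means the model does no better than always predicting the average price.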
Given our results for our total sample and some of our segmented data, we can make well-informed estimates to predict a sale price range for a property based on the four feature inputs used in our model. If a hypothetical 2,268 square foot single family home with 3 bedrooms and 3 bathrooms on a 0.71 acre lot were input to the model, we would expect a sale price of $642,045.
Reminder of our feature weights and confidence intervals:
Our model calculated the price by applying the model's coefficients (coef) to the hypothetical values listed above.
To expand our model and show the math, we use the following formula to generate the price estimate:
Estimated Price = constant + (number of bedrooms * -4.487e+04) + (number of bathrooms * -1.415e+04) + (acres of lot * 2.335e+05) + (square feet of living area * 200.1412)
Now the same formula with our hypothetical sample property (and eliminating the pesky scientific notation):
$642,045 = 199,400 + (3 * -44,870) + (3 * -14,150) + (0.71 * 233,500) + (2,268 * 200.1412)
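The same arithmetic can be checked with a short script. The constant and coefficients are the ones quoted in the formula above. The function and variable names are hypothetical and chosen only for this sketch.

```python
# Constant and coefficients as quoted in the formula above.
CONSTANT = 199_400
COEF_BEDROOMS = -44_870
COEF_BATHROOMS = -14_150
COEF_LOT_ACRES = 233_500
COEF_LIVING_AREA_SQFT = 200.1412


def estimate_price(bedrooms, bathrooms, lot_acres, living_area_sqft):
    """Apply the linear model: constant plus each feature times its coefficient."""
    return (
        CONSTANT
        + bedrooms * COEF_BEDROOMS
        + bathrooms * COEF_BATHROOMS
        + lot_acres * COEF_LOT_ACRES
        + living_area_sqft * COEF_LIVING_AREA_SQFT
    )


# The hypothetical property: 3 bedrooms, 3 bathrooms, 0.71 acres, 2,268 sq ft.
price = estimate_price(bedrooms=3, bathrooms=3, lot_acres=0.71, living_area_sqft=2268)
print(f"${price:,.0f}")  # $642,045
```

Note that the negative bedroom and bathroom coefficients do not mean extra rooms lower a home's value in isolation. With living area already in the model, they describe the effect of adding a room while holding square footage fixed.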
The model can be adapted to drill down and isolate particular ZIP codes and property types, but because those slices are taken from the total sample, we lose strength of fit in the model. As the wide confidence interval on the bathroom feature indicates, our model struggles to narrow the range on this feature. Cleaner input data would help the model, as would being able to avoid some of the data fill operations mentioned below.
Unfortunately the precision of this model is such that no one could use this tool to realistically bid on a property, but as a proof of concept, this model was able to execute the goal of the project.
Our machine learning model required no nulls, so a choice was required to fill null values. Given unlimited time or a pristine dataset, we could have avoided some distortion in our data. Many null values could be filled with zeros correctly, but in certain cases this was less than ideal. Rather than intervening on the data cell by cell, we opted to make the fill in this project and flag it as an area that could be improved in future iterations of the model. The distortion shown below had little impact on the overall model, but because this subset of the data has so few entries, it did affect the model for that subset.
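To illustrate why zero-filling can distort a small subset, the sketch below compares the mean living area after filling missing values with 0 (the same effect as a pandas `fillna(0)` call) against the mean of only the observed values. The numbers are made up for illustration and are not project data.

```python
from statistics import mean

# Hypothetical living areas in square feet; None marks a missing value.
living_area = [1850, None, 2400, None, 1600]

# Option used in this project: replace every missing value with 0.
zero_filled = [0 if value is None else value for value in living_area]

# Alternative for comparison: keep only the rows where a value was observed.
observed_only = [value for value in living_area if value is not None]

print("mean after zero fill:", mean(zero_filled))      # 1170
print("mean of observed rows:", mean(observed_only))   # 1950
```

With only five rows, two zero-filled entries pull the average down by 780 square feet. In a large sample, the same two entries would barely move it.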
Evidence of the fillNA operation can be seen by the cluster of results sitting on the "0" Living Area below: