We are consultants hired by a realtor company to analyse the house sales data in King County and provide recommendations on their next investment strategy and the prices they can command.
They ask us the following questions:
- Can you show us a map of the prices in different areas of King County (the Seattle area)?
- What are the most impactful features on the price?
- Can we predict the price of a house based on its most important features?
Our data is located in the "kc_house_data.csv" file. To prepare the dataset for machine learning, we use Python libraries such as pandas, NumPy, Matplotlib and seaborn to format and normalize the data. This work is done in the Jupyter notebook "Module 1 Final Project - King County Houses Data Analysis.ipynb".
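Loading and inspecting the file is the first step. The sketch below uses a two-row miniature of the CSV (illustrative values, not the real data) so it is self-contained; in the notebook this would simply be `pd.read_csv("kc_house_data.csv")`:

```python
import pandas as pd
from io import StringIO

# Two illustrative rows standing in for kc_house_data.csv
# (the real file has ~21,600 rows and the full set of columns)
csv = StringIO(
    "id,date,price,bedrooms,bathrooms,sqft_living,grade,zipcode\n"
    "7129300520,20141013T000000,221900,3,1.0,1180,7,98178\n"
    "6414100192,20141209T000000,538000,3,2.25,2570,7,98125\n"
)
df = pd.read_csv(csv)

# Quick sanity checks before any cleaning
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
```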
Our data has 21 columns with the descriptions as follows:
- id - unique identifier for a house
- date - Date the house was sold
- price - Price (the prediction target)
- bedrooms - Number of bedrooms
- bathrooms - Number of bathrooms
- sqft_living - Square footage of the home
- sqft_lot - Square footage of the lot
- floors - Total floors (levels) in house
- waterfront - House which has a view to a waterfront
- view - Has been viewed
- condition - How good the condition is ( Overall )
- grade - overall grade given to the housing unit, based on King County grading system
- sqft_above - square footage of house apart from basement
- sqft_basement - square footage of the basement
- yr_built - Built Year
- yr_renovated - Year when house was renovated
- zipcode - ZIP code
- lat - Latitude coordinate
- long - Longitude coordinate
- sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
- sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors
To build our machine learning model we chose the scikit-learn and StatsModels libraries.
In order to use those libraries, we transform the bedrooms, bathrooms, floors, waterfront, view, condition and grade columns into categorical columns.
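The conversion can be sketched with pandas' `astype("category")` (toy data with illustrative values, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the King County data (illustrative values only)
df = pd.DataFrame({
    "bedrooms": [3, 4, 2],
    "bathrooms": [1.0, 2.25, 1.5],
    "floors": [1.0, 2.0, 1.0],
    "waterfront": [0, 0, 1],
    "view": [0, 0, 2],
    "condition": [3, 4, 3],
    "grade": [7, 8, 6],
})

# Columns treated as categorical rather than continuous
cat_cols = ["bedrooms", "bathrooms", "floors",
            "waterfront", "view", "condition", "grade"]
df[cat_cols] = df[cat_cols].astype("category")
print(df.dtypes)
```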
We observe that the price doesn't follow a normal distribution: it is right-skewed.
We create a log_price column using NumPy's log() function to transform prices to a log scale, which is better suited for machine learning.
Then we use the LabelEncoder class from sklearn.preprocessing to normalize the values of those columns, and we drop the price, zipcode, lat and long columns.
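These two steps, the log transform and the label encoding, can be sketched as follows (a small illustrative frame stands in for the full dataset, and only a couple of columns are encoded to keep the example short):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative frame; the notebook applies this to the full dataset
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0],
    "grade": [7, 8, 6],
    "condition": [3, 4, 3],
    "zipcode": [98178, 98125, 98028],
    "lat": [47.5112, 47.7210, 47.7379],
    "long": [-122.257, -122.319, -122.233],
})

# Right-skewed prices -> log scale
df["log_price"] = np.log(df["price"])

# Encode categorical-style columns as integer labels
for col in ["grade", "condition"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Drop columns not used as model features
df = df.drop(columns=["price", "zipcode", "lat", "long"])
print(df.head())
```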
Finally, we can run the Ordinary Least Squares (OLS) method from statsmodels.api with the target y being the log_price column and X being our feature columns.
The model shows that the bedrooms feature has a p-value of 0.52, so we chose to drop that feature from our model.
This first model achieves an R-squared of 81.4%.
We then built two other models, more useful for our client, using only the top 3 and top 2 features.
- Can you show us a map of the prices in different areas of King County (the Seattle area)?
We can show heat maps to the realtor company to help them choose the best location for their project.
Here are the 3 different heat maps we created:
- Price as per location
- Price per sqft lot as per location
- Price per sqft living as per location
You will find the heat maps in the Jupyter notebook and also in the slides folder.
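A static version of such a map can be sketched with a Matplotlib scatter coloured by price (the notebook's actual heat maps may use a different tool; the coordinates and prices below are synthetic, roughly spanning King County, purely for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Synthetic coordinates roughly spanning King County (illustrative only)
lat = rng.uniform(47.15, 47.78, n)
long = rng.uniform(-122.52, -121.31, n)
price = rng.lognormal(mean=13, sigma=0.5, size=n)

fig, ax = plt.subplots(figsize=(8, 6))
sc = ax.scatter(long, lat, c=price, cmap="hot", s=10)
fig.colorbar(sc, label="price ($)")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Price as per location")
fig.savefig("price_heatmap.png")
```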
- What are the most impactful features on the price?
Our first model scores an R² of 0.814, meaning that 81.4% of the variation in price is explained by the features.
Ranked from highest to lowest impact on price, the top features are:
- grade
- sqft_living
- zip_mean_price
- Can we predict the price of a house based on its most important features?
Using only the 3 features above as arguments, we built the function "predict()", which gives an idea of the price of a house in King County based on its square footage, zipcode and grade.
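A hypothetical sketch of such a helper is shown below. The coefficient values are illustrative placeholders, not the fitted ones; in practice they would come from the trained OLS model, and since the model was fitted on log(price), the result is exponentiated back to dollars:

```python
import numpy as np

# PLACEHOLDER coefficients for illustration only; the real values
# come from the fitted OLS model in the notebook.
COEFFS = {"const": 10.0, "grade": 0.18,
          "sqft_living": 0.0003, "zip_mean_price": 0.9e-6}

def predict(sqft_living, grade, zip_mean_price):
    """Estimate a house price from the model's top 3 features.

    The model predicts log(price), so exponentiate to get dollars.
    """
    log_price = (
        COEFFS["const"]
        + COEFFS["grade"] * grade
        + COEFFS["sqft_living"] * sqft_living
        + COEFFS["zip_mean_price"] * zip_mean_price
    )
    return float(np.exp(log_price))

print(predict(sqft_living=2000, grade=8, zip_mean_price=500000))
```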