We are consultants hired by a realtor company to analyse the house sales data in King County and provide recommendations on their next investment strategy and the prices they can command.
They ask us the following questions:
- Can you show us a map of the prices in different areas of King County (the Seattle area)?
- What are the most impactful features on the price?
- Can we predict the price of a house based on its most important features?
Our data is located in the "kc_house_data.csv" file. To prepare the dataset for machine learning, we use Python libraries such as pandas, NumPy, Matplotlib and seaborn to format and normalize the data. This work is done in the Jupyter notebook "Module 1 Final Project - King County Houses Data Analysis.ipynb".
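Loading and inspecting the file is the first step. The sketch below uses a two-row miniature of the CSV (illustrative values, not the real data) so it is self-contained; in the notebook this would simply be `pd.read_csv("kc_house_data.csv")`:

```python
import pandas as pd
from io import StringIO

# Two illustrative rows standing in for kc_house_data.csv
# (the real file has ~21,600 rows and the full set of columns)
csv = StringIO(
    "id,date,price,bedrooms,bathrooms,sqft_living,grade,zipcode\n"
    "7129300520,20141013T000000,221900,3,1.0,1180,7,98178\n"
    "6414100192,20141209T000000,538000,3,2.25,2570,7,98125\n"
)
df = pd.read_csv(csv)

# Quick sanity checks before any cleaning
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
```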
Our data has 21 columns with the descriptions as follows:
- id - unique identifier for a house
- date - Date the house was sold
- price - Price (the prediction target)
- bedrooms - Number of bedrooms
- bathrooms - Number of bathrooms
- sqft_living - Square footage of the home
- sqft_lot - Square footage of the lot
- floors - Total floors (levels) in house
- waterfront - House which has a view to a waterfront
- view - Has been viewed
- condition - How good the condition is ( Overall )
- grade - overall grade given to the housing unit, based on King County grading system
- sqft_above - square footage of house apart from basement
- sqft_basement - square footage of the basement
- yr_built - Built Year
- yr_renovated - Year when house was renovated
- zipcode - ZIP code
- lat - Latitude coordinate
- long - Longitude coordinate
- sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
- sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors
To build our machine learning model we chose the scikit-learn and StatsModels libraries.
In order to use those libraries, we transform the bedrooms, bathrooms, floors, waterfront, view, condition and grade columns into categorical columns.
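The conversion can be sketched with pandas' `astype("category")` (toy data with illustrative values, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the King County data (illustrative values only)
df = pd.DataFrame({
    "bedrooms": [3, 4, 2],
    "bathrooms": [1.0, 2.25, 1.5],
    "floors": [1.0, 2.0, 1.0],
    "waterfront": [0, 0, 1],
    "view": [0, 0, 2],
    "condition": [3, 4, 3],
    "grade": [7, 8, 6],
})

# Columns treated as categorical rather than continuous
cat_cols = ["bedrooms", "bathrooms", "floors",
            "waterfront", "view", "condition", "grade"]
df[cat_cols] = df[cat_cols].astype("category")
print(df.dtypes)
```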
We observe that the price doesn't follow a normal distribution: it is right-skewed.
We create a log_price column using NumPy's log() function to transform prices to a log scale, which is better suited for machine learning.
Then we use the LabelEncoder class from sklearn.preprocessing to normalize the values of those columns, and we drop the price, zipcode, lat and long columns.
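These two steps, the log transform and the label encoding, can be sketched as follows (a small illustrative frame stands in for the full dataset, and only a couple of columns are encoded to keep the example short):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative frame; the notebook applies this to the full dataset
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0],
    "grade": [7, 8, 6],
    "condition": [3, 4, 3],
    "zipcode": [98178, 98125, 98028],
    "lat": [47.5112, 47.7210, 47.7379],
    "long": [-122.257, -122.319, -122.233],
})

# Right-skewed prices -> log scale
df["log_price"] = np.log(df["price"])

# Encode categorical-style columns as integer labels
for col in ["grade", "condition"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Drop columns not used as model features
df = df.drop(columns=["price", "zipcode", "lat", "long"])
print(df.head())
```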
Finally, we can run the Ordinary Least Squares (OLS) method from statsmodels.api with the target y being the log_price column and X being our feature columns.
The model shows that the bedrooms feature has a p-value of 0.52, so we chose to drop that feature from our model.
This first model achieves an R-squared of 81.4%.
We then built two other models, more useful for our client, using only the top 3 and top 2 features.
- Can you show us a map of the prices in different areas of King County (the Seattle area)?
We can show heat maps to the realtor company to help them choose the best location for their project.
Here are the 3 different heat maps we created:
- Price as per location
- Price per sqft lot as per location
- Price per sqft living as per location
You will find the heat maps in the Jupyter notebook and also in the slides folder.
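A static version of such a map can be sketched with a Matplotlib scatter coloured by price (the notebook's actual heat maps may use a different tool; the coordinates and prices below are synthetic, roughly spanning King County, purely for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Synthetic coordinates roughly spanning King County (illustrative only)
lat = rng.uniform(47.15, 47.78, n)
long = rng.uniform(-122.52, -121.31, n)
price = rng.lognormal(mean=13, sigma=0.5, size=n)

fig, ax = plt.subplots(figsize=(8, 6))
sc = ax.scatter(long, lat, c=price, cmap="hot", s=10)
fig.colorbar(sc, label="price ($)")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Price as per location")
fig.savefig("price_heatmap.png")
```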
- What are the most impactful features on the price?
Our first model scores an R² of 0.814, meaning that 81.4% of the variation in price is explained by the features.
Ranked from highest to lowest impact on price, the top features are:
- grade
- sqft_living
- zip_mean_price
- Can we predict the price of a house based on its most important features?
Using only the 3 features above as arguments, we built the function "predict()", which gives an idea of the price of a house in King County based on its square footage, zipcode and grade.
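A hypothetical sketch of such a helper is shown below. The coefficient values are illustrative placeholders, not the fitted ones; in practice they would come from the trained OLS model, and since the model was fitted on log(price), the result is exponentiated back to dollars:

```python
import numpy as np

# PLACEHOLDER coefficients for illustration only; the real values
# come from the fitted OLS model in the notebook.
COEFFS = {"const": 10.0, "grade": 0.18,
          "sqft_living": 0.0003, "zip_mean_price": 0.9e-6}

def predict(sqft_living, grade, zip_mean_price):
    """Estimate a house price from the model's top 3 features.

    The model predicts log(price), so exponentiate to get dollars.
    """
    log_price = (
        COEFFS["const"]
        + COEFFS["grade"] * grade
        + COEFFS["sqft_living"] * sqft_living
        + COEFFS["zip_mean_price"] * zip_mean_price
    )
    return float(np.exp(log_price))

print(predict(sqft_living=2000, grade=8, zip_mean_price=500000))
```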