Project King County Data Analysis

Our objectives are defined as follows:

We are consultants hired by a realtor company to analyse the house data in King County and provide recommendations on their next investment strategy and the kind of prices that they can command.

They ask us the following questions:

  1. Can you show us a map of the prices in different areas of Seattle's King County?
  2. What are the most impactful features on the price?
  3. Can we predict the price of a house based on its most important features?

Data Cleaning Process

Our data are located in the "kc_house_data.csv" file. To prepare the dataset for machine learning, we use Python libraries such as pandas, numpy, matplotlib and seaborn to format and normalize the data. This is done in the Jupyter notebook "Module 1 Final Project - King County Houses Data Analysis.ipynb".
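As a rough illustration of this kind of preparation, here is a minimal sketch with pandas. The helper name `load_house_data`, the specific clean-up steps (parsing `date`, coercing `sqft_basement` to numbers and filling gaps with 0) and the tiny inline sample are assumptions made for the example; the notebook may clean the data differently.

```python
import io

import pandas as pd


def load_house_data(source):
    """Read the house data and apply a few generic clean-up steps.

    `source` can be a path such as "kc_house_data.csv" or any file-like object.
    The steps below are illustrative assumptions, not the notebook's exact code.
    """
    df = pd.read_csv(source)
    # Parse the sale date into a proper datetime column.
    df["date"] = pd.to_datetime(df["date"])
    # Turn non-numeric placeholders into NaN, then fill them with 0.
    df["sqft_basement"] = pd.to_numeric(df["sqft_basement"], errors="coerce").fillna(0.0)
    return df


# Tiny made-up sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,date,price,sqft_living,sqft_basement\n"
    "1,10/13/2014,221900,1180,0.0\n"
    "2,12/09/2014,538000,2570,?\n"
)
sample = load_house_data(sample_csv)
print(sample.dtypes)
```

With the real file, the call would be `load_house_data("kc_house_data.csv")`.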

Our data has 21 columns, described as follows:

  • id - unique identifier for a house
  • date - Date house was sold
  • price - Price is prediction target
  • bedrooms - Number of Bedrooms/House
  • bathrooms - Number of bathrooms/bedrooms
  • sqft_living - square footage of the home
  • sqft_lot - square footage of the lot
  • floors - Total floors (levels) in house
  • waterfront - House which has a view to a waterfront
  • view - Has been viewed
  • condition - How good the condition is ( Overall )
  • grade - overall grade given to the housing unit, based on King County grading system
  • sqft_above - square footage of house apart from basement
  • sqft_basement - square footage of the basement
  • yr_built - Built Year
  • yr_renovated - Year when house was renovated
  • zipcode - ZIP code
  • lat - Latitude coordinate
  • long - Longitude coordinate
  • sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
  • sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

Building our model

To build our machine learning model, we chose to use the scikit-learn library and the StatsModels library.

In order to use those libraries we transform bedrooms, bathrooms, floors, waterfront, view, condition and grade columns into categorical columns.
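A minimal sketch of that conversion with pandas, assuming the `category` dtype is what is meant by "categorical columns". The helper name `to_categorical` and the sample values are made up for illustration.

```python
import pandas as pd

# Columns listed above as categorical.
CATEGORICAL_COLUMNS = [
    "bedrooms", "bathrooms", "floors", "waterfront", "view", "condition", "grade",
]


def to_categorical(df, columns=CATEGORICAL_COLUMNS):
    """Return a copy of df with the given columns cast to the pandas category dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


# Made-up rows, only to show the dtype change.
sample = pd.DataFrame({
    "bedrooms": [3, 4, 2],
    "bathrooms": [1.0, 2.25, 1.5],
    "floors": [1.0, 2.0, 1.0],
    "waterfront": [0, 0, 1],
    "view": [0, 2, 4],
    "condition": [3, 4, 5],
    "grade": [7, 8, 10],
})
converted = to_categorical(sample)
print(converted.dtypes)
```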

We observe that the price doesn't follow a normal distribution: it is right-skewed.
We create a log_price column using the numpy log() function to transform prices to a log scale, which is better suited for machine learning.
Then we use the LabelEncoder class from sklearn.preprocessing to encode the values of those categorical columns as integers, and we drop the price, zipcode, lat and long columns.
Finally we run the Ordinary Least Squares (OLS) method from statsmodels.api, with the target y being the log_price column and X being our feature columns.
The model shows that the bedrooms feature has a p-value of 0.52, well above the usual 0.05 threshold, so we chose to drop it from our model.
This first model achieves an R-squared of 0.814 (81.4%).
We then built two other models, more useful for our client, using only the top 3 and top 2 features.

Answering our initial questions

  1. Can you show us a map of the prices in different areas of Seattle's King County?

We can show heat maps to the realtor company to help them choose the best location for their project.
Here are the 3 different heat maps we created:

  • Price as per location
  • Price per sqft lot as per location
  • Price per sqft living as per location

You will find the heat maps in the Jupyter notebook and also in the slides folder.
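For readers who want to reproduce something similar, here is a minimal sketch that derives the two price-per-square-foot columns and draws the three maps as scatter plots of long against lat, coloured by value. The matplotlib scatter approach, the helper names and the sample rows are assumptions; the notebook may build its heat maps with a different tool.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def add_price_per_sqft(df):
    """Return a copy of df with price per sqft of lot and of living space."""
    out = df.copy()
    out["price_per_sqft_lot"] = out["price"] / out["sqft_lot"]
    out["price_per_sqft_living"] = out["price"] / out["sqft_living"]
    return out


def plot_price_maps(df, columns=("price", "price_per_sqft_lot", "price_per_sqft_living")):
    """Draw one long/lat scatter plot per column, coloured by that column."""
    fig, axes = plt.subplots(1, len(columns), figsize=(6 * len(columns), 5), squeeze=False)
    for ax, col in zip(axes[0], columns):
        points = ax.scatter(df["long"], df["lat"], c=df[col], cmap="hot", s=10)
        fig.colorbar(points, ax=ax, label=col)
        ax.set_xlabel("long")
        ax.set_ylabel("lat")
        ax.set_title(col)
    return fig


# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "price": [200000.0, 500000.0, 900000.0],
    "sqft_living": [1000, 2000, 3000],
    "sqft_lot": [5000, 10000, 9000],
    "lat": [47.51, 47.62, 47.58],
    "long": [-122.26, -122.20, -122.05],
})
sample_maps = add_price_per_sqft(sample)
fig = plot_price_maps(sample_maps)
fig.savefig("price_maps.png")
```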

  2. What are the most impactful features on the price?

Our first model scores an R-squared of 0.814, meaning that 81.4% of the variation in log_price is explained by the features.
Ranked from highest to lowest impact on price, the top features are:

    1. grade
    2. sqft_living
    3. zip_mean_price

  3. Can we predict the price of a house based on its most important features?

Using only the 3 features above as arguments, we built the function "predict()", which gives an estimate of the price of a house in King County from its square footage, zipcode and grade.
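A hedged sketch of how such a function could look. It assumes that zip_mean_price is the mean sale price within each zipcode and that the model is a linear regression on log_price; the names `build_predictor` and `FEATURES`, the use of scikit-learn's `LinearRegression` and the synthetic data are choices made for this example, and the notebook's own predict() may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["grade", "sqft_living", "zip_mean_price"]


def build_predictor(df):
    """Fit a 3-feature model on log(price) and return a predict() function."""
    # Assumption: zip_mean_price is the mean sale price within each zipcode.
    zip_mean_price = df.groupby("zipcode")["price"].mean()
    X = df[["grade", "sqft_living"]].copy()
    X["zip_mean_price"] = df["zipcode"].map(zip_mean_price)
    model = LinearRegression().fit(X[FEATURES], np.log(df["price"]))

    def predict(sqft_living, zipcode, grade):
        """Estimate a price in dollars from square footage, zipcode and grade."""
        row = pd.DataFrame([{
            "grade": grade,
            "sqft_living": sqft_living,
            "zip_mean_price": zip_mean_price.loc[zipcode],
        }])
        # The model predicts log(price); exp() converts back to dollars.
        return float(np.exp(model.predict(row[FEATURES])[0]))

    return predict


# Synthetic data, for illustration only (not the King County dataset).
rng = np.random.default_rng(1)
n = 300
zipcode = rng.choice([98001, 98004], size=n)
grade = rng.integers(5, 12, size=n)
sqft_living = rng.integers(600, 4000, size=n)
base = np.where(zipcode == 98004, 12.0, 11.0)
price = np.exp(base + 0.15 * grade + 0.0003 * sqft_living + rng.normal(0, 0.1, size=n))
sample = pd.DataFrame({
    "zipcode": zipcode, "grade": grade, "sqft_living": sqft_living, "price": price,
})

predict = build_predictor(sample)
small_house = predict(sqft_living=1000, zipcode=98004, grade=9)
large_house = predict(sqft_living=3000, zipcode=98004, grade=9)
print(small_house, large_house)
```

A zipcode that is absent from the training data raises a `KeyError` in this sketch, since no mean price is available for it.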

Presenting our findings to the Realtor company

Here is the link to our presentation to the realtor company

Contributors

locsta, ravidahiya74