- Dataset Content
- Business Requirements
- Project hypothesis and validation
- Rationale to map the business requirements to the Data Visualizations and ML tasks
- ML Business Case
- Dashboard Design
- Unfixed Bugs
- Deployment to Heroku
- Main Data Analysis and Machine Learning Libraries
- Credits
- The dataset is sourced from Kaggle. We created then a fictitious user story where predictive analytics can be applied in a real project in the workplace.
- The dataset has almost 1.5 thousand rows and represents housing records from Ames, Iowa; indicating house profile (Floor Area, Basement, Garage, Kitchen, Lot, Porch, Wood Deck, Year Built) and its respective sale price for houses built between 1872 and 2010.
Variable | Meaning | Units |
---|---|---|
1stFlrSF | First Floor square feet | 334 - 4692 |
2ndFlrSF | Second floor square feet | 0 - 2065 |
BedroomAbvGr | Bedrooms above grade (does NOT include basement bedrooms) | 0 - 8 |
BsmtExposure | Refers to walkout or garden level walls | Gd: Good Exposure; Av: Average Exposure; Mn: Mimimum Exposure; No: No Exposure; None: No Basement |
BsmtFinType1 | Rating of basement finished area | GLQ: Good Living Quarters; ALQ: Average Living Quarters; BLQ: Below Average Living Quarters; Rec: Average Rec Room; LwQ: Low Quality; Unf: Unfinshed; None: No Basement |
BsmtFinSF1 | Type 1 finished square feet | 0 - 5644 |
BsmtUnfSF | Unfinished square feet of basement area | 0 - 2336 |
TotalBsmtSF | Total square feet of basement area | 0 - 6110 |
GarageArea | Size of garage in square feet | 0 - 1418 |
GarageFinish | Interior finish of the garage | Fin: Finished; RFn: Rough Finished; Unf: Unfinished; None: No Garage |
GarageYrBlt | Year garage was built | 1900 - 2010 |
GrLivArea | Above grade (ground) living area square feet | 334 - 5642 |
KitchenQual | Kitchen quality | Ex: Excellent; Gd: Good; TA: Typical/Average; Fa: Fair; Po: Poor |
LotArea | Lot size in square feet | 1300 - 215245 |
LotFrontage | Linear feet of street connected to property | 21 - 313 |
MasVnrArea | Masonry veneer area in square feet | 0 - 1600 |
EnclosedPorch | Enclosed porch area in square feet | 0 - 286 |
OpenPorchSF | Open porch area in square feet | 0 - 547 |
OverallCond | Rates the overall condition of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
OverallQual | Rates the overall material and finish of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
WoodDeckSF | Wood deck area in square feet | 0 - 736 |
YearBuilt | Original construction date | 1872 - 2010 |
YearRemodAdd | Remodel date (same as construction date if no remodeling or additions) | 1950 - 2010 |
SalePrice | Sale Price | 34900 - 755000 |
As a good friend, you are requested by your friend, who has received an inheritance from a deceased great-grandfather located in Ames, Iowa, to help in maximizing the sales price for the inherited properties.
Although your friend has an excellent understanding of property prices in her own state and residential area, she fears that basing her estimates for property worth on her current knowledge might lead to inaccurate appraisals. What makes a house desirable and valuable where she comes from might not be the same in Ames, Iowa. She found a public dataset with house prices for Ames, Iowa, and will provide you with that
- 1 - The client is interested in discovering how the house attributes correlate with the sale price. Therefore, the client expects data visualizations of the correlated variables against the sale price to show that.
- 2 - The client is interested to predict the house sales price from her 4 inherited houses, and any other house in Ames, Iowa.
- We suspect that the distribution of the sale prices is skewed to the right which might lead to a problem when it comes to predicting high sale prices. To validate the project hypothesis about the shape of the distribution, we plot a combined boxplot/histogram of the sale price.
- Business requirement 1: Correlation study and data visualization
- As a client I want to inspect the house records data so that I can get an idea of which variables are important for the sale price.
- As a client I want to display a heatmap of the spearman correlation coefficients so that I can order the variables by importance concerning the sale price.
- As a client I want to plot the important variables against the sale price so that I can visualize how such a variable is correlated with the sale price.
- Business requirement 2:
- As a client I want to display the inherited houses records data so that I can easily find a house attribute.
- As a client I want to use an ML model so that I can predict the price of my four inherited houses in Ames, Iowa.
- As a client I want to use the ML model so that I can predict the price of any other house in Ames, Iowa.
- What are the business requirements?
- The client is interested in discovering how house attributes correlate with sale prices. Therefore, the client expects data visualizations of the correlated variables against the sale price.
- The client is interested in predicting the house sale prices from her 4 inherited houses, and any other house in Ames, Iowa.
- Is there any business requirement that can be answered with conventional data analysis?
- Yes, we can use conventional data analysis to investigate how house attributes are correlated with the sale prices.
- Does the client need a dashboard or an API endpoint?
- The client needs a dashboard
- What does the client consider as a successful project outcome?
- A study showing the most relevant variables correlated to sale price.
- Also, a capability to predict the sale price for the 4 inherited houses, as well as any other house in Ames, Iowa.
- Can you break down the project into Epics and User Stories?
- Information gathering and data collection.
- Data visualization, cleaning, and preparation.
- Model training, optimization and validation.
- Dashboard planning, designing, and development.
- Dashboard deployment and release.
- Ethical or Privacy concerns?
- No. The client found a public dataset.
- Does the data suggest a particular model?
- The data suggests a regressor where the target is the sale price.
- What are the model's inputs and intended outputs?
- The inputs are house attribute information and the output is the predicted sale price.
- What are the criteria for the performance goal of the predictions?
- We agreed with the client on an R2 score of at least 0.75 on the train set as well as on the test set.
- How will the client benefit?
- The client will maximize the sales price for the inherited properties.
The dashboard consists of five pages:
- The first page describes the project dataset and states the business requiremnents.
- The second page fullfills the first project requirement. It starts with stating the requirement in an info box. Three checkboxes implement the user stories relating to the first project requirement (see Business Requirements). When checked they display:
- A table showing the dataset.
- A heatmap of Spearman correlation coefficients.
- Scatterplots of correlated variables against sell price. The page also has a description of the meaning of the variables and a general conclusion.
- The third page fullfills the second project requirement (see Business Requirements). It has two tables showing the client's inherited houses data and predicted sale prices respectively. The sum of the sale prices is also displayed. The second part of the page has two input widgets and a button that enables the user to predict the sale price based on the inputs.
- The fourth page states the project hypothesis and its validation. It shows the distribution of sale price. Finally there is a paragraph about the model's limitation and how it may be connected to the project hypothesis.
- The fifth page starts with a general conclusion about the performance of the ML model. The pipeline steps are then presented followed by a bar plot showing the importance of each feature in the train set. The remaining two parts evaluate the ML model by computing the R2 score and three different error measures and by displaying a scatterplot of predicted versus actual sale price (which is the target).
- There is no unfixed bugs.
- The App live link is: https://housepricesfarid.herokuapp.com/
- The project was deployed to Heroku using the following steps:
- Create a Procfile which tells Heroku how to run the project
- Create a setup.sh file containing the streamlit configuration requirements
- Heroku needs the requirements.txt which contains all external libraries used in the project
- Log in to Heroku and create an App
- At the Deploy tab, select GitHub as the deployment method
- Select your repository name and click Search. Once it is found, click Connect
- Select the branch you want to deploy (main), then click Deploy Branch
- One may also enable automatic deploys so that the app is updated for every push to Github. Click now the button Open App on the top of the page to access your App
The libraries used in this project are:
- numpy==1.19.1
- pandas==1.1.2
- matplotlib==3.3.1
- seaborn==0.11.0
- pandas-profiling==3.2.0
- streamlit==1.10.0
- feature-engine==1.0.2
- scikit-learn==0.24.2
As an example, Seaborn was used to creating the heatmap of correlation coefficients, pandas-profiling to explore the variables in the dataset by showing their distribution, how many missing data they contain etc.
My main inspiration came from the Code Institute Churnometer walkthrough project and the Scikit-learn lesson in the Data Analysis and Machine Learning Toolkit module (also from the Code Institute). The structure of the project and most of the code is taken from there and adapted to this project.