
airbnb_price_prediction's Introduction

Airbnb Price Prediction with Random Forest Regressor

Personal mini-project

Python3

Description

I found a comprehensive Airbnb data set of listings in NYC, which sparked my curiosity and inspired me to apply a random forest regressor to price prediction.

After adjusting the features and the random forest regressor (RFR) attributes multiple times, I came up with a pretty solid prediction model based on 6 features and a price range. The data set includes qualitative and quantitative data; the qualitative features have been converted into binary vectors with the one-hot encoding method from the pandas library (get_dummies). Only relevant features have been considered for this predictive model, and one RFR attribute, min_samples_split, has been changed from its default. The reason for picking this attribute over others is further explained in this repo.

This price prediction model can be useful for individuals who own Airbnb listings or who are interested in seeing what their listing would be worth. In my opinion, the model could be greatly improved if the number of rooms, number of guests, number of bathrooms, and other quantitative data were available. However, even with such a restricted number of features, the model still performs well, which shows how effective RFRs can be.

Next Step

Find a related data set that includes space measurements and more quantitative descriptions of the listings. Even without that data, this project shows the strength of RFRs on data sets that do not seem very promising.

The data set contains 48,900 entries, and model fitting depends on the price range.

GUI

The GUI takes in 6 entries plus a price range with a lower and an upper limit. This was my biggest hurdle, because I had to learn how to connect it to my Python code. I used PyQt Designer for most of the GUI.

GUI

Difficulties

The difficulty arises when dealing with price ranges in NYC, because of extremely high-priced luxury apartments. This is dealt with by specifying a price range and training the model only on the listings within that range, not on the entire data set.
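A minimal sketch of that price-range restriction, assuming the listings sit in a pandas DataFrame with a `price` column. The helper name `filter_by_price` and the sample rows are made up for illustration and are not taken from the repo:

```python
import pandas as pd


def filter_by_price(listings: pd.DataFrame, lower: float, upper: float) -> pd.DataFrame:
    """Keep only listings whose price lies in [lower, upper] (both ends inclusive)."""
    mask = listings["price"].between(lower, upper)
    return listings.loc[mask].reset_index(drop=True)


# Tiny made-up sample; the real data comes from the Kaggle .csv file.
sample = pd.DataFrame({
    "price": [60, 150, 199, 250, 10000],
    "number_of_reviews": [12, 40, 3, 8, 0],
})

# The $10,000 outlier and everything else outside 100..200 is dropped
# before any model would be trained.
in_range = filter_by_price(sample, 100, 200)
print(in_range)
```

Filtering before fitting keeps a handful of luxury listings from dominating the error the forest tries to minimise.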

Code Explained

Data set used comes from https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data.

Data set is divided by these columns:

Column features from .csv file

The set-up of the data frame looks like this:

Data frame setup from .csv file

The .csv file for the Airbnb data has both qualitative and quantitative data, which is not ideal when working with random forest regressors, because qualitative data in itself does not hold any mathematical meaning. For example, if we were to assign Manhattan=1 and Bronx=2, the model would treat those numbers as an ordering or magnitude and make false predictions based on them, which is not sound for qualitative data. Hence, it was imperative to pick and transform the features that would be most useful for training the model to predict a listing's price. Only the most relevant features were taken into consideration. For example, the total number of reviews was kept, while the number of reviews per month was dropped, as it does not add to the model's accuracy and only slows the model down.

To take the qualitative data into consideration, it was converted to binary vectors using pandas' version of a one-hot encoder, get_dummies. This allows data such as the neighbourhood group, ["Bronx", "Brooklyn", "Manhattan", ...], to be stored as 0's and 1's: each category becomes its own column, holding one if the listing belongs to that category and zero if it does not. This was applied to two qualitative features: the neighbourhood group and the room type, ["Entire home/apt", "Shared room", "Private room"].

Method:

    house_data = pd.get_dummies(data=house_data, columns=['neighbourhood_group', 'room_type'])
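To see what that call produces, here is a small self-contained sketch on a made-up three-row frame. The rows and the variable names `toy` and `encoded` are illustrative only; the two column names match the call above:

```python
import pandas as pd

# Made-up rows standing in for the real listings.
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Brooklyn", "Manhattan"],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [60, 150, 90],
})

# Each category becomes its own indicator column, named "<column>_<category>".
# The original text columns are dropped; untouched columns such as price stay.
encoded = pd.get_dummies(data=toy, columns=["neighbourhood_group", "room_type"])
print(list(encoded.columns))
```

Each row ends up with exactly one "on" indicator per encoded feature, so no artificial ordering between boroughs or room types is introduced.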

The desired features were then split in order to create a training data set for the model and a test data set.

Split data
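The split itself is only shown as an image in the original README. One common way to do it is scikit-learn's `train_test_split`; the sketch below assumes that approach. The feature columns, the 25% test share, and the `random_state` are illustrative choices, not values confirmed by the repo:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up numeric rows; in the project these would be the selected features.
data = pd.DataFrame({
    "number_of_reviews": [12, 40, 3, 8, 0, 25, 7, 19],
    "minimum_nights": [1, 2, 3, 1, 30, 2, 1, 4],
    "price": [60, 150, 199, 250, 120, 95, 80, 175],
})

X = data.drop(columns=["price"])  # features
y = data["price"]                 # target to predict

# Hold out a quarter of the rows for testing; random_state makes it repeatable.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(train_X), len(test_X))
```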

Two functions were created in order to find the best attribute to change for the RFR. The attributes considered were max_leaf_nodes, which caps the number of leaf nodes in each tree, and min_samples_split, which is the minimum number of samples required to split an internal node. Each function takes a max_leaf_nodes or min_split parameter, so that main() can loop over candidate values to find the best one, and returns the mean absolute error for the resulting model.

functions
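The original functions are only shown as an image, so the following is a hedged sketch of what the min_samples_split variant might look like. The function name `get_mae_split`, the synthetic data, and the candidate values are assumptions made for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def get_mae_split(min_split, train_X, test_X, train_y, test_y):
    """Fit a forest with the given min_samples_split and return its test MAE."""
    model = RandomForestRegressor(
        n_estimators=10, min_samples_split=min_split, random_state=0
    )
    model.fit(train_X, train_y)
    return mean_absolute_error(test_y, model.predict(test_X))


# Synthetic stand-in data: a noisy linear "price" driven by the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
y = 100 + 80 * X[:, 0] + rng.normal(0, 10, size=400)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# Loop over candidate values, as main() is described as doing.
scores = {s: get_mae_split(s, train_X, test_X, train_y, test_y) for s in (2, 10, 50, 75)}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```

A max_leaf_nodes variant would look the same, with `max_leaf_nodes=...` passed to `RandomForestRegressor` instead.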

Little graphs:

This graph shows the position of listings with color-coded price ranges for values up to $300.

price per location

Another graph that is largely self-explanatory.

neighbourhoods

In order to test if the model works, I compared attributes from an actual listing online to what my model would predict.

This is the posting I used to find the price range.

website

Which resulted in these attributes.

attributes

As the results below show, the model converges towards a better answer as the price range becomes more specific.

With the above attributes and a model trained on the .csv file with min_samples_split=75 and n_estimators=10, the model predicted these values:

    Price range    Predicted price
    0:400          $225
    0:300          $172.52
    0:200          $145.23
    100:200        $150.75

With a price range of $100:$200, the prediction for the $150 listing was $150.75!

We can clearly see that as the price range becomes more specific, the price prediction becomes more accurate. In this case our prediction was $0.75 off, but the mean absolute error was about $25. As the upper and lower price limits are made more specific, the model is trained mostly on listings in that price range and is able to predict more accurately.
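The per-range procedure can be sketched as follows, assuming one forest is fitted per price range on the filtered rows. The listings here are random stand-ins with no real signal, so the printed numbers are not meaningful and will not match the table above. Only the mechanics are illustrated, and the feature names and query values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Random stand-in listings; prices are unrelated to the features on purpose.
rng = np.random.default_rng(1)
n = 2000
listings = pd.DataFrame({
    "number_of_reviews": rng.integers(0, 300, size=n),
    "minimum_nights": rng.integers(1, 30, size=n),
})
listings["price"] = rng.integers(20, 1000, size=n)

features = ["number_of_reviews", "minimum_nights"]
query = pd.DataFrame({"number_of_reviews": [40], "minimum_nights": [2]})

predictions = {}
for lower, upper in [(0, 400), (0, 300), (0, 200), (100, 200)]:
    # Train only on listings inside the requested price range.
    subset = listings[listings["price"].between(lower, upper)]
    model = RandomForestRegressor(n_estimators=10, min_samples_split=75, random_state=0)
    model.fit(subset[features], subset["price"])
    predictions[(lower, upper)] = float(model.predict(query)[0])

for price_range, value in predictions.items():
    print(price_range, round(value, 2))
```

Because a forest's prediction is an average of training prices, each prediction necessarily falls inside the range it was trained on. That is part of why narrowing the range pulls the estimate closer to the true price.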

Setup

Special thanks to https://github.com/ademilly for his help on the set-up tools.

Using the virtual environment package venv:

    $ python3 -m venv venv
    $ source venv/bin/activate
    $ pip install -r requirements.txt

Run:

    $ python3 airbnb_predict.py

If you find any errors, let me know. I am open to any and all comments.

email: [email protected]
