
Airbnb-Data-Science

This repository consists of the following two projects:

Predictive Price Modeling for Airbnb listings

This project aims to predict the price of an Airbnb listing from a number of features. It involved exploratory data analysis, data pre-processing, feature selection, Model fitting and comparison, and deploying the containerised Webapp on AWS using a CI/CD Pipeline.

Project Resources

Alternate Search Rankings for Airbnb

The project involved coming up with alternate search rankings based on Listing Vibe and the Aesthetic Quality of listing photos, and using A/B testing to compare the different search rankings.

Project Resources

Project Report: Predictive Price Modeling for Airbnb listings

This project aims to predict the price of an Airbnb listing from a number of features. It involved exploratory data analysis, data pre-processing, feature selection, Model fitting and comparison, and deploying the containerised Webapp on AWS using a CI/CD Pipeline.

Project Goals and Objectives

The Short Answer: Assisting Airbnb hosts in setting an appropriate price for their listings.

The Problem: Currently there is no convenient way for a new Airbnb host to decide the price of their listing. New hosts must often rely on the prices of neighbouring listings when pricing their own.

The Solution: A Predictive Price Modelling tool whereby a new host can enter all the relevant details (location of the listing, listing properties, available amenities etc.), and the Machine Learning Model will suggest a price for the listing. The Model is trained beforehand on similar data from existing Airbnb listings.

Project Overview

The project involved the following steps:

  • Exploratory Data Analysis: Explore the various features and their distributions using Histograms and Box-plots
  • Pre-processing and Data Cleaning: Normalisation, filling missing values, encoding categorical values
  • Feature Selection: Study the correlation with the response variable (Listing Price) and determine which features are most useful in predicting the price
  • Model Fitting and Selection: Training different models, tuning hyper-parameters and studying Model performance using Learning Curves
  • Model Serving: Using Flask to deploy and serve Model predictions via a REST API
  • Containerisation: Using Docker to containerise the Web Application
  • Production: Using an AWS CI/CD Pipeline for continuous integration and deployment

End Result

A screen capture of the entire application in use is shown below. Users can enter all the relevant details of their listing, and the trained Predictive Model then predicts and returns a price for the listing given those features. The Webapp can be explored here.

About Dataset

The dataset used in this project was obtained from public.opendatasoft.com. It contains a total of 494,954 records, each of which holds the details of one Airbnb listing. The total size of the dataset is 1.89 GB.

The dataset has a large number of features, which can be categorised into the following types:

  • Location related: Country, City, Neighbourhood
  • Property related: Property Type, Room Type, Accommodates, Bedrooms, Beds, Bed Type, Cancellation Policy, Minimum Nights
  • Booking Availability: Availability 30, Availability 60, Availability 90, Availability 365
  • Reviews related: Number of Reviews, Reviews per Month, Review Scores Rating, Review Scores Accuracy, Review Scores Cleanliness, Review Scores Checkin, Review Scores Communication, Review Scores Location, Review Scores Value
  • Host related: Host Since, Host Response Time, Host Response Rate, Calculated host listings count, Host Since Days, Host Has Profile Pic, Host Identity Verified, Is Location Exact, Instant Bookable, Host Is Superhost, Require Guest Phone Verification, Require Guest Profile Picture, Requires License
  • Amenities: TV, Wireless Internet, Kitchen, Heating, Family/kid friendly, Washer, Smoke detector, Fire extinguisher, Essentials, Cable TV, Internet, Dryer, First aid kit, Safety card, Shampoo, Hangers, Laptop friendly workspace, Air conditioning, Breakfast, Free parking on premises, Elevator in building, Buzzer/wireless intercom, Hair dryer, Private living room, Iron, Wheelchair accessible, Hot tub, Carbon monoxide detector, 24-hour check-in, Pets live on this property, Dog(s), Gym, Lock on bedroom door, Private entrance, Indoor fireplace, Smoking allowed, Pets allowed, Cat(s), Self Check-In, Doorman Entry, Suitable for events, Pool, Lockbox, Bathtub, Room-darkening shades, Game console, Doorman, High chair, Pack 'n Play/travel crib, Keypad, Other pet(s), Smartlock

The price of each listing serves as the label for the regression task. The goal of this project is to predict these listing prices.

Exploratory Analysis

To get a better insight into where the listings are located, the numbers of listings in various cities and countries are plotted in the figures below. In this dataset, the United States has the largest number of listings, followed by European countries and Australia. Among cities, Paris, London, New York, Berlin and Los Angeles have the most listings.
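As a minimal sketch, such count plots can be produced with pandas and matplotlib, assuming the dataset is loaded into a DataFrame df with the "Country" and "City" columns described in the previous section (the file name here is hypothetical):

# Sketch: counting listings per country and city
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("airbnb_listings.csv")  # hypothetical file name

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
df["Country"].value_counts().head(10).plot(kind="barh", ax=ax1, title="Listings per Country")
df["City"].value_counts().head(10).plot(kind="barh", ax=ax2, title="Listings per City")
plt.tight_layout()
plt.show()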

Airbnb offers three types of listings:

  • Entire home/apartment
  • Private Room
  • Shared Room
Entire home/apartment is by far the most popular type of listing, followed by Private Rooms; Shared Rooms make up only a small share of the total.

Intuitively, it is reasonable to expect the listing location and the listing type to be two of the most important factors in determining the price. The following plots show the distribution of listing prices across various cities and the price differences amongst the three listing types.

A few noticeable observations from the above plots:

  • Netherlands, US, Switzerland, Ireland and UK have amongst the highest average listing prices.
  • In terms of cities, 10 of the top 12 cities with the highest average listing price are in the US. Clearly, Airbnb listings are more expensive in the US than in European cities.
  • As expected, the cities with the highest listing prices are all major tourist destinations. Outside of the US, Amsterdam and Venice have the highest average listing prices.
  • As expected, Entire homes have the highest prices, followed by Private Rooms and then Shared Rooms.

Feature Engineering: What features will be useful in predicting the listing price?

Although the dataset contains a large number of features, not all of them help in predicting the listing price, and different features have different degrees of influence on it. Feature Engineering refers to selecting a subset of features, or adding new features, which aid in better prediction of the response variable, which in this project is the Listing Price.

The following figures show the distribution of various features against the listing price. This helps determine which features are correlated with the listing price and can thereby help the Models make better predictions.

As expected, the most important factors in determining the price of a listing are the number of people accommodated, the number of bedrooms and the number of beds, all of which have a Pearson correlation of about 0.45 or more with the Listing Price. Amenities like a TV or AC also show a slight positive correlation. It is clear that no hidden feature plays a major role in determining the listing price: the bigger the home/apartment (more bedrooms and beds), the higher the listing price.
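As a rough sketch, such correlations can be computed directly with pandas, assuming df holds the cleaned dataset with a numeric "Price" column:

# Sketch: Pearson correlation of numerical features with the listing price
correlations = (df.select_dtypes(include="number")
                  .corrwith(df["Price"])
                  .sort_values(ascending=False))
print(correlations.head(10))  # Accommodates, Bedrooms, Beds near the top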

Data pre-processing and cleaning

Before feeding these features as input to the Machine Learning Model, the data will need to be pre-processed and cleaned. The following block diagram shows the Data Pipeline with the operations involved in pre-processing and data splitting.

Data Pre-processing

The pre-processing operations involved are listed in the following table.

| Name | Feature Type | Operation |
|------|--------------|-----------|
| Imputer | Numerical | Replace NULL values with the median |
| Standard Scaler | Numerical | Standardise input data to zero mean and unit variance |
| Ordinal Encoder | Categorical | Encode discrete values as integers |

The following code snippet shows the pre-processing pipeline implemented using the Python library Scikit-learn.

# Preprocessing pipeline for Numerical and Categorical features

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer


############ Pipeline for numerical features #############
# Numerical features
numerical_attribs = [   'Accommodates', 'Bedrooms', 'Beds', 'Minimum Nights',
             'Availability 30', 'Availability 60', 'Availability 90',
             'Availability 365', 'Number of Reviews', 'Reviews per Month',
             'Review Scores Rating', 'Review Scores Accuracy', 'Review Scores Cleanliness',
             'Review Scores Checkin', 'Review Scores Communication', 'Review Scores Location',
             'Review Scores Value', 'Host Response Rate']

# Pipeline for numerical features
# 1. SimpleImputer: Replace NULL values with median
# 2. FunctionTransformer: Add extra features
# 3. StandardScaler: Normalise values
numerical_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), 
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
    ])


########## Pipeline for categorical features #############
# Categorical features
categorical_attribs = ["Country", "City", "Neighbourhood Cleansed",
                    "Property Type", "Room Type", "Bed Type", "Cancellation Policy",
                    "Host Response Time"]

# OrdinalEncoder(): Encode categorical features as integers
categorical_pipeline = Pipeline([
                            ('ordinal_encoder', OrdinalEncoder()),
                            ])


########## Combined Pipeline for all features ############
preprocessing_pipeline = ColumnTransformer([
        ("categorical", categorical_pipeline, categorical_attribs),
        ("numerical", numerical_pipeline, numerical_attribs),
    ])

# Labels
label = ["Price"]

# Combine numerical and categorical features
df_attribs = df[categorical_attribs + numerical_attribs + label].copy()
# Fit Preprocessing pipeline
df_prepared = preprocessing_pipeline.fit_transform(df[categorical_attribs + numerical_attribs])

# Save preprocessing pipeline
save_model(model=preprocessing_pipeline, save_path="preprocessing_pipeline.pkl")

After pre-processing, the dataset is divided into 3 splits, the details of which are listed in the following table.

| Data | Purpose | Split Ratio | Number of Samples |
|------|---------|-------------|-------------------|
| Training | To fit the Model | 0.8 | 270,058 |
| Validation | To tune hyper-parameters | 0.1 | 33,757 |
| Test | To evaluate Model performance | 0.1 | 33,757 |
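A minimal sketch of this 80/10/10 split, applying Scikit-learn's train_test_split twice to the pre-processed data from the previous section:

# Sketch: 80/10/10 split into Training, Validation and Test sets
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(
    df_prepared, df[label], test_size=0.2, random_state=42)  # 80% Training
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)          # 10% Validation, 10% Test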

Modeling: Training Machine learning models, Model Selection

Model Evaluation Metric

Since this is a regression task (predicting the price of a listing), various evaluation metrics such as Explained Variance Score, Mean Absolute Error, R2-score and RMSE (Root Mean Squared Error) can be used. In this project, RMSE is used to evaluate and compare the different Machine Learning Models.
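As a sketch, the RMSE on the test split can be computed for any fitted model as follows:

# Sketch: computing RMSE for a fitted model on the Test set
import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test RMSE: {rmse:.2f}")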

Regression Models

The following Regression Models were explored in this project:

  • Baseline Models
    • Average Neighbourhood Price
    • K-Nearest Neighbours Regression
  • Linear Regression
  • Decision Tree Regression
  • Random Forest Regression
  • XGBoost Regression

Baseline Models

Before trying various Machine Learning Models, it is important to set baseline performances based on simple heuristics or simple models. Accordingly, the following two models were used as baselines against which to compare the other Machine Learning Models (a sketch of both follows the list).

  • Average Neighbourhood Price: Estimate the listing price as the average price of all the listings in the neighbourhood.
  • K-Nearest Neighbours Regression: As defined in the Scikit-learn documentation, the target is predicted by local interpolation of the targets associated with the k-nearest neighbours in the training set.
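A minimal sketch of the two baselines (an assumed implementation, not necessarily the repo's exact code), reusing the splits and column names from the pre-processing section:

# Baseline 1: predict the mean price of the listing's neighbourhood
neighbourhood_mean = df.groupby("Neighbourhood Cleansed")["Price"].mean()
baseline_preds = df["Neighbourhood Cleansed"].map(neighbourhood_mean)

# Baseline 2: k-nearest neighbours regression on the pre-processed features
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
knn_preds = knn.predict(X_valid)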

Linear Regression

As defined here, LinearRegression fits a linear model to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

Linear Regression: Feature Importance

The importance of features as determined by the Linear Regression coefficients is shown in the following plot. As expected, Room Type, the number of people accommodated and the number of bedrooms are the most important features in determining the price of the listing.
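A sketch of how such coefficients can be extracted, assuming the FunctionTransformer step adds no extra columns so the feature order matches the ColumnTransformer defined earlier:

# Sketch: ranking features by the magnitude of their regression coefficients
import pandas as pd
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

feature_names = categorical_attribs + numerical_attribs  # order used by the ColumnTransformer
coefs = pd.Series(lin_reg.coef_.ravel(), index=feature_names)
print(coefs.sort_values(key=abs, ascending=False).head(10))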

Linear Regression: Model Learning, Performance and Stability

The following code snippet shows how Scikit-learn's learning_curve function can be used to study Model learning, performance and stability.

from sklearn.model_selection import learning_curve

# sklearn's learning_curve function can be used for following purposes
# 1. Learning curve: To study how training and validation error varies with more training examples
#      train_scores, valid_scores vs train_sizes
# 2. Model scalability: To study the time required to fit model as training data size increases
#      fit_times vs train_sizes
# 3. Model performance: To study how training error changes with time required to fit
#      train_scores vs fit_times

train_sizes, train_scores, valid_scores, fit_times, score_times = learning_curve(model, df_features, df_labels,
                                                                                 train_sizes=[0.25, 0.5, 0.75, 1], cv=10,
                                                                                 scoring="neg_mean_squared_error", 
                                                                                 return_times=True)

The learning, performance and stability curves for the Linear Regression Model are shown in the following figures. The learning curve shows how the Model predictions improve as it sees more training examples. The fact that the training and validation RMSE converge to a similar value shows that the Model is not overfitting.

Model scalability can be studied by plotting the time it takes to fit the Model as the number of training examples increases, and Model performance by plotting the Model Evaluation Metric (RMSE) against the fitting time. Together, these curves are very useful in comparing various Models and selecting a final Model for predictions.
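As a sketch, the learning curve itself can be plotted from the learning_curve outputs above; the scores are negative MSE, so they are negated and square-rooted to obtain RMSE:

# Sketch: plotting Training and Validation RMSE against training set size
import numpy as np
import matplotlib.pyplot as plt

train_rmse = np.sqrt(-train_scores.mean(axis=1))
valid_rmse = np.sqrt(-valid_scores.mean(axis=1))

plt.plot(train_sizes, train_rmse, "o-", label="Training RMSE")
plt.plot(train_sizes, valid_rmse, "o-", label="Validation RMSE")
plt.xlabel("Number of training examples")
plt.ylabel("RMSE")
plt.legend()
plt.show()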

Decision Tree Regressor

Decision Tree Regressor: Feature Importance

The following figure shows the top 5 features sorted by importance as determined by the Decision Tree Regressor. As expected, the number of bedrooms, the number of people accommodated, the room type and the location of the listing (country, city) are the most important features in determining the listing price.

Decision Tree Regressor: Hyper-parameter tuning

In order to get the best possible results from any Model, it is vital to determine the right combination of hyper-parameters, a process known as hyper-parameter tuning. It involves training the Model with different values for a set of parameters and computing the Model performance on the Validation Dataset; the parameter combination which yields the best performance is the one eventually selected when comparing the various Models. The code snippet for doing this using Grid Search and Randomised Search Cross Validation is shown here:

# Tune hyperparameters using Grid search and Randomised search Cross Validation

from scipy.stats import randint
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Grid Search Cross Validation
# Specify discrete values for hyperparameters
param_grid = [
    {'max_depth': [1, 20, 100], 'max_features': [1, 5, 15, 20],
     'max_leaf_nodes': [5, 50, 100]},
]
search = GridSearchCV(estimator=model, param_grid=param_grid, cv=10,
                      scoring='neg_mean_squared_error', return_train_score=True)

# Randomised search Cross Validation
# Specify distribution for hyperparameters
param_distribs = {
  'max_depth': randint(low=1, high=10),
  'max_features': randint(low=1, high=20),
}
search = RandomizedSearchCV(estimator=model, param_distributions=param_distribs,
                            n_iter=5, cv=10, scoring='neg_mean_squared_error',
                            random_state=42, return_train_score=True)

To determine the best possible hyper-parameter values for the Decision Tree Regressor, the following values were tried using Grid Search Cross Validation.

| Hyper-parameter | Values |
|-----------------|--------|
| Maximum Depth | [1, 20, 100] |
| Maximum Number of Features | [1, 5, 15, 20] |
| Maximum Number of Leaf Nodes | [5, 50, 100] |

The best estimator was found to have the following hyper-parameter values: Max Depth=20, Max Features=20 and Max Leaf Nodes=100.
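As a sketch, the search defined above is run with fit, after which the best hyper-parameters and estimator can be read off:

# Sketch: running the search and retrieving the best hyper-parameters
search.fit(X_train, y_train)
print(search.best_params_)  # e.g. {'max_depth': 20, 'max_features': 20, 'max_leaf_nodes': 100}
best_model = search.best_estimator_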

Decision Tree: Learning Curve, Model Scalability and Performance

Model Evaluation and Comparison

The final step of Model Selection is to compare the prediction RMSE of all the tuned models on the Test Dataset. In this step, it is important to consider not just the mean or median RMSE but the entire range of RMSE values obtained over different samples in the Test Dataset, or over different splits of the Cross Validation Data. The box plots of RMSE on the Test Data for the different Regression Models are shown in the following plot.

Amongst the Models considered here, Random Forest has the lowest median RMSE as well as the lowest IQR (Inter-Quartile Range). The median RMSE for Random Forest is less than 20 USD, so the Model is largely successful in predicting the price of a listing.
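A sketch of how such box plots can be produced, assuming a dictionary models mapping model names to the tuned estimators (the dictionary name is hypothetical):

# Sketch: cross-validated RMSE box plots for the tuned models
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

rmse_per_model = {}
for name, estimator in models.items():
    scores = cross_val_score(estimator, X_test, y_test, cv=10,
                             scoring="neg_mean_squared_error")
    rmse_per_model[name] = np.sqrt(-scores)

plt.boxplot(rmse_per_model.values(), labels=rmse_per_model.keys())
plt.ylabel("RMSE (USD)")
plt.show()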

Deployment, Serving and Production: CI/CD Pipeline

Model deployment using FLASK Application

A Flask Webapp was developed in order to serve the Model predictions and showcase the capabilities of the project. The following code block shows how the Model is used to produce a prediction for a given set of listing details.

# FLASK Webapp for serving price Predictions for Airbnb listings 
import os, sys
sys.path.append(".")
import webapp_predict_price

import pickle
from flask import Flask
import flask
import sklearn
import joblib
import pandas as pd

# Load pre-trained machine learning model.
BASE_PATH = "webapp_predict_price/"
model = joblib.load(BASE_PATH + "best_model.pkl")


def create_app(test_config=None):
    # create and configure the app
    app = Flask(__name__, instance_relative_config=True)
    app.config.from_mapping(
        SECRET_KEY='dev',
        #DATABASE=os.path.join(app.instance_path, 'flaskr.sqlite'),
    )

    if test_config is None:
        # load the instance config, if it exists, when not testing
        app.config.from_pyfile('config.py', silent=True)
    else:
        # load the test config if passed in
        app.config.from_mapping(test_config)

    # ensure the instance folder exists
    try:
        os.makedirs(app.instance_path)
    except OSError:
        pass

    # Landing page
    @app.route('/', methods=['GET', 'POST'])
    def hello():
        # return 'Hello, World!'
        
        # Return landing page
        if flask.request.method == 'GET':
            return(flask.render_template('base.html'))

        # Return prediction output
        if flask.request.method == 'POST':

            # Create input to Model from form data
            # (each field below, e.g. `country`, is read from
            #  flask.request.form; the individual assignments are
            #  elided here)
            df_input = pd.DataFrame([[country, city, neighbourhood, propertytype, roomtype, bedtype,
                                    cancellationpolicy, hostresponsetime, accommodates, num_bedrooms, num_beds,
                                    min_nights, availability_30, availability_60, availability_90, availability_365,
                                    num_reviews, reviews_per_month, review_scores_rating, review_scores_accuracy,
                                    review_scores_cleanliness, review_scores_checkin, review_scores_communication,
                                    review_scores_location, review_scores_value, host_response_rate,
                                    ]], dtype=float)

            # Inference: Get prediction from Model
            prediction_price = model.predict(df_input)[0]
            prediction_price = round(prediction_price)

            return(flask.render_template('base.html', result=prediction_price))

    return app


# if this is the main thread of execution first load the model and
# then start the server
if __name__ == "__main__":
    print(("* Loading Scikit-learn model and Flask starting server..."
        "please wait until server has fully started"))
    app = create_app()
    app.run(host='0.0.0.0', port=5000)

Serving Model Predictions: REST API as Web Service

The Model predictions can also be served as a Web Service via a REST API. The following code snippet shows how this can be accomplished; the Model output is returned as a JSON object.

# REST API Service to get Price Predictions for Airbnb listings
from flask_restful import reqparse  # reqparse comes from the flask_restful package

@app.route("/predict", methods=["POST"])
def predict():
    
    # initialize the data dictionary that will be returned from the view
    data = {"success": False}

    # ensure a request was properly posted to our endpoint
    if flask.request.method == "POST":
        data["predictions"] = []

        parser = reqparse.RequestParser()
        parser.add_argument('country', type=str, help='Country')
        parser.add_argument('city', type=str, help='City')
        parser.add_argument('neighbourhood', type=str, help='Neighbourhood')
        parser.add_argument('roomtype', type=str, help='Room type')
        args = parser.parse_args()
        # (only a few arguments are shown; the remaining fields are
        #  parsed in the same way)

        # Create input to Model from the parsed arguments
        # (e.g. country = args['country']; the assignments are elided here)
        df_input = pd.DataFrame([[country, city, neighbourhood, propertytype, roomtype, bedtype,
                            cancellationpolicy, hostresponsetime, accommodates, num_bedrooms, num_beds,
                            min_nights, availability_30, availability_60, availability_90, availability_365,
                            num_reviews, reviews_per_month, review_scores_rating, review_scores_accuracy, 
                            review_scores_cleanliness, review_scores_checkin, review_scores_communication,
                            review_scores_location, review_scores_value, host_response_rate,
                            ]], dtype=float)

        # Inference: Get prediction from Model
        prediction_price = model.predict(df_input)[0]
        prediction_price = round(prediction_price)

        # Add prediction results to JSON data
        # (`features` is assumed to hold the echoed input fields)
        r = {"prediction_price": prediction_price, "features": features}
        data["predictions"].append(r)
        # indicate that the request was a success
        data["success"] = True

    # return the data dictionary as a JSON response
    return flask.jsonify(data)

The following figure shows how Model predictions can be obtained using the above REST API.
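As a sketch, the endpoint can be queried with the requests library, assuming the app is running locally on port 5000 and that the field names match those parsed by reqparse above:

# Sketch: querying the /predict REST endpoint
import requests

payload = {"country": "United States", "city": "New York",
           "neighbourhood": "Harlem", "roomtype": "Private room"}
response = requests.post("http://localhost:5000/predict", data=payload)
print(response.json())  # {"success": true, "predictions": [{"prediction_price": ...}]}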

Model in Production: FLASK, Docker, AWS CI/CD Pipeline

The production pipeline consists of the following components:

  • Flask Webapp: Webapp and REST API to serve Model predictions
  • Docker: Containerised Flask Webapp which can then be deployed in any environment
  • AWS: CI/CD Pipeline
    • ECR Repository: The Docker Image is stored in this repository. Any changes to this image trigger the rest of the pipeline, and the updated image is then deployed to the Web Application.
    • CodeCommit: The pipeline is configured to use a source location where the following two files are stored:
      • Amazon ECS Task Definition file: The task definition file lists the Docker image name, container name, Amazon ECS service name, and load balancer configuration.
      • CodeDeploy AppSpec file: This specifies the name of the Amazon ECS task definition file, the name of the updated application's container, and the container port to which CodeDeploy reroutes production traffic.
    • CodeDeploy: Used during deployment to reference the correct deployment group, target groups, listeners and traffic rerouting behaviour. CodeDeploy uses a listener to reroute traffic to the port of the updated container specified in the AppSpec file.
      • ECS Cluster: Cluster where CodeDeploy routes traffic during deployment
      • Load Balancer: The load balancer uses a VPC with two public subnets in different Availability Zones.

Alternate Search Rankings for Airbnb

The project involved coming up with alternate search rankings based on Listing Vibe and the Aesthetic Quality of listing photos, and using A/B testing to compare the different search rankings.

Project Goals

Alternate Searches: Come up with novel alternate ways of searching Airbnb listings, with the aim of making it easier for users to find the most appropriate listings.

  • Listing Vibe: Determine the vibe of a listing based on Topic Modelling of the listing description
  • Image Aesthetics: Sort listings by image aesthetics as determined by a Deep Learning image assessment model

About Dataset

The dataset used in this project was obtained from public.opendatasoft.com. It contains a total of 494,954 records, each of which holds the details of one Airbnb listing. The total size of the dataset is 1.89 GB.

The dataset has a large number of features, which can be categorised into the following types:

  • Location related: Country, City, Neighbourhood
  • Property related: Property Type, Room Type, Accommodates, Bedrooms, Beds, Bed Type, Cancellation Policy, Minimum Nights
  • Booking Availability: Availability 30, Availability 60, Availability 90, Availability 365
  • Reviews related: Number of Reviews, Reviews per Month, Review Scores Rating, Review Scores Accuracy, Review Scores Cleanliness, Review Scores Checkin, Review Scores Communication, Review Scores Location, Review Scores Value
  • Host related: Host Since, Host Response Time, Host Response Rate, Calculated host listings count, Host Since Days, Host Has Profile Pic, Host Identity Verified, Is Location Exact, Instant Bookable, Host Is Superhost, Require Guest Phone Verification, Require Guest Profile Picture, Requires License
  • Text Features: Listing Description, House Rules, Neighbourhood Description
  • Image URL: Link to listing image (one per listing)

For this project, the following features are used:

  • Text Features such as the Listing Description, House Rules and Neighbourhood Description, in order to determine the Listing Vibe via Topic Modelling
  • Listing Images, in order to search by Image Aesthetics

Search by Image Aesthetics: Using pre-trained Deep Learning Model to assess image quality

Why search by image quality and aesthetics?

Users of online home listing portals such as Airbnb have to rely solely on the information provided by hosts, so it is vital that the images posted by a host are clear and an accurate depiction of reality. In this regard, it makes sense that users would prefer listings with very good image quality and aesthetics. Currently there is no easy way for users to search by image quality. In this project, a deep learning model is used to assess the images posted by hosts: an image quality score is assigned to each image, and users can then sort the listings by this score so that the listings with the best image quality appear at the top, making it easier for users to find what they are looking for.

Pipeline: Search by Image Aesthetics

The Deep Learning Model used to assess image quality is Google's Neural Image Assessment (NIMA) model, which is based on Convolutional Neural Networks (CNNs). This implementation of the model was used to assign scores to the photos of listings.
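A minimal sketch of the scoring-and-sorting step; nima_score is a hypothetical helper that wraps the pre-trained NIMA implementation and returns one aesthetic score per image:

# Sketch: score each listing photo and sort listings by aesthetic quality
df["aesthetic_score"] = df["Image URL"].apply(nima_score)  # nima_score: hypothetical wrapper
ranked = df.sort_values("aesthetic_score", ascending=False)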

Images with Best Aesthetics
Images with Poor Aesthetics

The results indicate that the Deep Learning model has accurately assigned high aesthetic scores to brightly lit images of rooms with clearly visible amenities, whereas images shot in low light or with poor clarity are assigned lower scores. This feature would be very useful for users to filter out such listings, and would encourage more hosts to upload better-quality pictures.

Classifying listings based on Topic Modelling of listing descriptions

Why search by Listing Vibe?

Airbnb lets users search for listings based on a number of criteria such as Location, Price, Room Type and Number of people accommodated. However, one aspect missing from these is: what type of guests are most welcome in the listing? Generally, Airbnb guests fall into one of the categories below:

  • Family: Guests looking for accommodation for a family vacation. Such guests usually look for a place in a good neighbourhood which is safe for kids, amongst other criteria.
  • Friends: When travelling with friends, some of the important criteria for booking a listing are tolerance of loud talking, permission for parties, and proximity to restaurants, bars etc.
  • Solo/Budget travel: Typically these guests search for budget-friendly options, even if the place is relatively small and does not include all the amenities.
  • Business Visit: Guests visiting for business purposes typically look for a place in a good neighbourhood, close to downtown for socialising, with excellent amenities.

However, the Airbnb webpage does not support searching for listings based on the above characteristics; there is no way for users to find listings with a particular theme like the ones listed above. One of the objectives of this project is to provide an option for users to search by Listing Vibe. The next few sections describe how this is achieved.

Pipeline: Search by Listing vibe

The listing description and neighbourhood description for each listing are extracted from the dataset and fed to an NLP Pipeline which converts words and sentences into a set of features. These features are then used to perform Topic Modelling, which generates a set of topics by grouping words that frequently occur together. Every listing is then assigned to one of the topics, and users can filter the listings by these categories.

NLP Pipeline for Topic Modelling

The NLP Pipeline converts sentences into words and ultimately into a set of features which can serve as input to a Machine Learning Model. The process consists of the following steps (a code sketch follows the list):

  • Input: The listing and neighbourhood descriptions together form the input data
  • Tokenise Words: Convert sentences into lists of words, removing punctuation
  • Stop Words Removal: Words which occur frequently across all documents add little value when extracting useful features, so such words are removed
  • Bigram Models: Identify words that commonly occur together (example: "art deco"); in addition to individual words, these serve as extra features
  • Lemmatisation: Reduce the various inflected forms of a word to its root; for example, "located" is reduced to "locate"
  • Term Document Frequency: Count the number of occurrences of each word in every document
  • Topic Modelling using the Latent Dirichlet Allocation (LDA) Model: The id-to-word mapping and the per-document word counts serve as inputs to the LDA Model, which generates a predefined number of distinct topics by grouping similar words that occur together across documents
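A sketch of this pipeline using gensim and NLTK (assumed libraries; the original implementation may differ):

# Sketch: NLP pipeline for Topic Modelling with gensim + NLTK
# (`descriptions` is assumed to hold the combined listing and
#  neighbourhood description for each listing; NLTK's stopwords and
#  wordnet corpora may need to be downloaded first)
from gensim.utils import simple_preprocess
from gensim.models import Phrases, LdaModel
from gensim.corpora import Dictionary
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Tokenise, remove stop words, lemmatise
texts = [[lemmatizer.lemmatize(w) for w in simple_preprocess(doc) if w not in stop_words]
         for doc in descriptions]

# Add bigrams that frequently occur together (e.g. "art_deco")
bigram = Phrases(texts, min_count=20)
texts = [bigram[t] for t in texts]

# Id-to-word mapping and term-document frequencies
id2word = Dictionary(texts)
corpus = [id2word.doc2bow(t) for t in texts]

# LDA with a predefined number of topics
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=3, random_state=42)
print(lda.print_topics())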

Topic Modelling: Identifying and Labelling Topics

The LDA Model returns the predefined number of topics, each represented as a set of closely related words. After processing the Airbnb listing and neighbourhood descriptions through the above pipeline, the following sets of words were returned:

[(0, '0.036*"walk" + 0.031*"restaurant" + 0.027*"block" + 0.025*"place" + 0.025*"train" + 0.024*"away" + 0.024*"subway" + 0.022*"minute" + 0.021*"close" + 0.019*"good"'),
 (1, '0.021*"guest" + 0.020*"stay" + 0.015*"space" + 0.015*"share" + 0.014*"available" + 0.014*"home" + 0.012*"use" + 0.012*"private" + 0.012*"access" + 0.011*"need"'),
 (2, '0.028*"full" + 0.022*"large" + 0.018*"size" + 0.014*"tv" + 0.014*"private" + 0.013*"include" + 0.013*"fully" + 0.011*"space" + 0.011*"building" + 0.011*"high"')]

It is now up to the ML practitioner to assign a label to each of these sets of words. Since the objective in this project is to assign a Listing Vibe, the following figure shows the topics that were assigned based on the words present in each group.

The above figure also shows an example listing description for each of the three topics that were assigned. It is interesting to note that the LDA Model did return 3 sets of words which roughly correspond to the types of listings mentioned earlier: Family/Kids, Friends, and Solo/Business Visits.

Topic Modelling: Visualisation

The following screen capture shows the visualisation of the 3 topics, produced with a topic-model visualisation library. Each circle corresponds to a topic; the three large circles lying in different quadrants indicate that the identified topics are specific and distinct. Hovering over a topic (circle) shows the most dominant words in that topic.

Screenshot: Search by Listing Vibe

The following screen capture of the Webapp illustrates how search by Listing Vibe works. Users can now filter listings by the Topic assigned to each listing. This should add a new dimension to searching for accommodation, make it easier to find the type of listing a user is looking for, and thereby reduce the booking time and improve the conversion rate.

How to study the effectiveness of newly added search features? A/B Testing

A/B Testing: What is it and what is its purpose?

So far, this project has introduced two alternate ways of searching for listings on Airbnb, namely:

  • Sort listings by Image Aesthetics
  • Search by Listing Vibe
The next obvious question is: how can we test the effectiveness of these newly introduced features? The answer is a process called A/B Testing, which compares the existing version of the website against a version with the newly introduced changes. The methodology used is described in the following sections.

A/B Testing Methodology: How to do it?

The A/B Testing methodology consists of the following steps, each of which is described in detail in the sections below:

  • Research, Define Goals and Set up Metrics
  • Hypothesis Formulation
  • Create Variation
  • Run A/B (Split) Testing
  • Collecting Data and Statistical Analysis
  • Analyse Results and Draw Conclusions
Step 1: Research, Define Goals and Set up Metrics

The first step before getting started with A/B Testing is prior research: study how the current website works and inspect how effective the current features are. To serve this purpose, a number of metrics should be logged and monitored: number of site visitors, time spent on various pages, time to booking, and conversion rate (the fraction of all users who complete a booking).

The above analysis and the metrics collected will help in understanding which parts of the website can be improved in order to increase sales or engagement. Based on this, a few specific metrics can be chosen for improvement through A/B Testing. For this project, the following metrics were chosen to be optimised:

  • Time to complete booking
  • Conversion Rate

Step 2: Hypothesis Formulation

The next step is to formulate hypotheses. For every metric we want to improve, a Null and an Alternate Hypothesis are introduced. The Null Hypothesis states that the newly introduced feature made no change compared to the existing version, whereas the Alternate Hypothesis suggests that the metric changed (for better or worse) due to the newly introduced feature. For the two metrics chosen in this project, the Null and Alternate Hypotheses are as follows:

  • Hypothesis 1: Time to complete booking
    • Null Hypothesis: Mean time to complete booking is the same for both control and variation
    • Alternate Hypothesis: Mean time to complete booking is different for control and variation
  • Hypothesis 2: Conversion Rate
    • Null Hypothesis: Booking Conversion Rate is the same for control and variation
    • Alternate Hypothesis: Booking Conversion Rate is different for control and variation

The goal of A/B Testing is to conclude, based on statistical analysis, whether the newly introduced feature resulted in any change to the defined metric. If there is a significant change, the Null Hypothesis can be rejected. Further, if the change is an improvement in the metric, the new feature can be deployed permanently as part of the website; if the change made the metric worse, the feature can be discarded. In this way, A/B Testing provides a quantitative approach to measuring the effectiveness of any new feature.

Step 3: Create Variation

Once the goals and metrics are defined and the hypotheses formulated, the next step is to add the new feature to be tested. This version of the webpage is referred to as the Variation, and the existing version as the Control. The following figure shows one possible option for the Control and Variation versions of the webpage for this project.

Control (Existing version)
Variation (Version with new features)
Step 4: Run A/B (Split) Testing

After the Control and Variation versions of the webpage are set up, the next step is to run the split tests. For this purpose, visitors to the webpage are split and redirected to the two different versions: a portion of the visitors see the Control version while the rest see the Variation. The following test parameters need to be defined before running the tests:

| Parameter | Value | Description |
|-----------|-------|-------------|
| Split Ratio | 0.5 | The ratio in which visitors are split between the Control and Variation versions. |
| Test Duration | 10,000 sessions | How long the test runs. This is a trade-off: the test must run long enough to establish statistical significance and draw meaningful conclusions, but if the new feature (Variation) degrades sales or engagement, it should not run too long, in order to minimise the loss in revenue. |
| Sample Distribution | Normal | The assumed distribution of the metric values, needed to choose a suitable test statistic. For example, the Booking Time values can be assumed to be Gaussian. |
| Test Statistic | Z-test | A Z-test is any statistical test whose test statistic can be approximated by a normal distribution under the Null Hypothesis. It measures how far the observed statistic lies from the mean of that distribution; the higher the value, the less likely the observation is under the Null Hypothesis, allowing it to be rejected with greater confidence. |
| Significance Level (p-value) | 0.01 | The probability that an observed difference could have occurred by random chance alone; the lower the p-value, the greater the statistical significance of the observed difference. |

The following animation illustrates how the Z-test statistic and p-value vary for different distributions of the Variation as compared to the Control.

Step 5: Collecting Data and Statistical Analysis

For the purposes of this project, simulated sample data is used to perform the statistical analysis of the A/B Test. The values of the two pre-defined project metrics are shown in the following table.

| Version | No. of Sessions | Average Booking Time | Std. Deviation (Booking Time) | Conversion Rate |
|---------|-----------------|----------------------|-------------------------------|-----------------|
| Control | 10,000 | 300 seconds | 105 seconds | 1.50 % |
| Variation | 10,000 | 296 seconds | 120 seconds | 1.80 % |

The following code snippets show how the test statistics can be obtained for the simulated data presented in the table above.

# A/B Testing: Z-test for Booking Time (Normal distribution)
import numpy as np
from scipy.stats import norm

mu_A, std_A, n_A = 300, 105, 10000   # Control
mu_B, std_B, n_B = 296, 120, 10000   # Variation

Z = (mu_A - mu_B)/np.sqrt(std_B**2/n_B + std_A**2/n_A)
pvalue = norm.sf(Z)


# A/B Testing: Conversion Rate: chi-squared test on expected vs observed counts
from scipy.stats import chi2

T = np.array([165, 165, 9835, 9835])   # Expected counts (pooled conversion rate)
O = np.array([150, 180, 9850, 9800])   # Observed counts (Control, Variation)

D = np.sum(np.square(T-O)/T)

pvalue = chi2.sf(D, df=1)

The following figures show the distributions of the test statistics (the Z-score for the Booking Time hypothesis and the chi-squared distance for the Conversion Rate hypothesis) and the corresponding p-values.

Hypothesis 1: Time For Booking
Hypothesis 2: Conversion Rate
Step 6: Analyse Results and Draw Conclusions

The final step in A/B Testing is to analyse the results of the statistical analysis and draw conclusions.

  • Can the Null Hypothesis be rejected with confidence?
    • Hypothesis 1: Booking Time - Was there a significant decrease in mean booking time? The Z-test score is 2.51, which corresponds to a p-value of 0.0061, lower than the pre-defined significance level of 0.01. We can therefore reject the Null Hypothesis and conclude that the mean booking time was reduced by 4 seconds in the Variation compared to the Control version of the webpage.
    • Hypothesis 2: Conversion Rate - Was there a significant increase in conversion rate? The distance statistic is 2.87, which corresponds to a p-value of 0.089, higher than the pre-defined significance level of 0.01. We therefore cannot reject the Null Hypothesis, and cannot conclude that there was any significant improvement in conversion rate in the Variation compared to the Control version of the webpage.
  • What was the eventual impact on the business metric (was there a significant increase in revenue)? If the test is done methodically, this should be evident at the end of the A/B Test. At times, however, a new feature improves the business metric in the short term but not over the long term. Hence it is important to constantly monitor the metrics, set up continuous testing and keep learning from changing customer behaviour.
  • Learn from user behaviour and set up further A/B Tests: One additional benefit of A/B Testing is that it occasionally surfaces unexpected results or insights unrelated to the metric being optimised. So even if the new feature does not significantly improve sales or engagement, these other insights can be used to set up further tests in the future.

Deployment, Serving and Production: CI/CD Pipeline

A Flask Webapp was developed to demonstrate the Search by Image Aesthetics and Search by Listing Vibe features. Using it, users can sort listings by Image Aesthetics and filter them by Listing Vibe. The Webapp was containerised using Docker and deployed on AWS Cloud, with a CI/CD Pipeline set up to facilitate continuous integration and deployment. The following block diagram shows all the components of the pipeline. The deployed Webapp can be accessed here.

The production pipeline consists of the following components:

  • Flask Webapp: Webapp and REST API to serve Model predictions
  • Docker: Containerised Flask Webapp which can then be deployed in any environment
  • AWS: CI/CD Pipeline
    • ECR Repository: The Docker Image is stored in this repository. Any changes to this image trigger the rest of the pipeline, and the updated image is then deployed to the Web Application.
    • CodeCommit: The pipeline is configured to use a source location where the following two files are stored:
      • Amazon ECS Task Definition file: The task definition file lists the Docker image name, container name, Amazon ECS service name, and load balancer configuration.
      • CodeDeploy AppSpec file: This specifies the name of the Amazon ECS task definition file, the name of the updated application's container, and the container port to which CodeDeploy reroutes production traffic.
    • CodeDeploy: Used during deployment to reference the correct deployment group, target groups, listeners and traffic rerouting behaviour. CodeDeploy uses a listener to reroute traffic to the port of the updated container specified in the AppSpec file.
      • ECS Cluster: Cluster where CodeDeploy routes traffic during deployment
      • Load Balancer: The load balancer uses a VPC with two public subnets in different Availability Zones.
