Customer Churning Analysis and Prediction

• Customer churn is a major problem faced by companies in the telecom industry, where customers have the freedom to switch to other companies to cater to their communication and internet needs.
• The churn rate, which represents the number of customers that cancel or do not renew their subscription with the company, directly affects the company's revenue. Therefore, it is crucial for companies to analyze and understand the factors that contribute to customer churn and build strategies to improve customer retention.
• In this study, we aim to classify potential churn customers based on numerical and categorical features using machine learning algorithms.
• We explore the use of K-means clustering for customer segmentation and logistic regression for churn prediction. We also apply data preprocessing techniques such as normalization, standardization, and data balancing to improve the performance of our models. We discuss our results and provide recommendations for companies to improve their customer retention strategies.

EXPLORATORY DATA ANALYSIS

A. Dataset

The dataset used for this study is from IBM, published 5 years ago on Kaggle. The data attributes are as follows:
• customerID : Customer ID.
• gender : Whether the customer is a male or a female.
• SeniorCitizen : Whether the customer is a senior citizen or not (1, 0).
• Partner : Whether the customer has a partner or not (Yes, No).
• Dependents : Whether the customer has dependents or not (Yes, No).
• tenure : Number of months the customer has stayed with the company.
• PhoneService : Whether the customer has a phone service or not (Yes, No).
• MultipleLines : Whether the customer has multiple lines or not (Yes, No, No phone service).
• InternetService : Customer’s internet service provider (DSL, Fiber optic, No).
• OnlineSecurity : Whether the customer has online security or not (Yes, No, No internet service).
• OnlineBackup : Whether the customer has online backup or not (Yes, No, No internet service).
• DeviceProtection : Whether the customer has device protection or not (Yes, No, No internet service).
• TechSupport : Whether the customer has tech support or not (Yes, No, No internet service).
• StreamingTV : Whether the customer has streaming TV or not (Yes, No, No internet service).
• StreamingMovies : Whether the customer has streaming movies or not (Yes, No, No internet service).
• Contract : The contract term of the customer (Month-to-month, One year, Two year).
• PaperlessBilling : Whether the customer has paperless billing or not (Yes, No).
• PaymentMethod : The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).
• MonthlyCharges : The amount charged to the customer monthly.
• TotalCharges : The total amount charged to the customer.
• Churn : Whether the customer churned or not (Yes or No).

The dataset consists of 7043 records. Of the 21 attributes, customerID is dropped since it carries no information useful for this study. Only three columns are numerical: tenure, MonthlyCharges, and TotalCharges. The remaining attributes are categorical. TotalCharges has 11 missing values, and these records were dropped. The categorical values are transformed into numerical values using LabelEncoder from sklearn.preprocessing.
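The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up sample, not the project's actual code. The sample values, the `numerical_cols` list, and the assumption that missing TotalCharges entries appear as blank strings are all hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up sample with a few of the dataset's columns (illustrative values only).
raw = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C", "0004-D"],
    "gender": ["Female", "Male", "Male", "Female"],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "tenure": [1, 34, 0, 45],
    "MonthlyCharges": [29.85, 56.95, 52.55, 42.30],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],  # blank = missing (assumption)
    "Churn": ["Yes", "No", "No", "No"],
})

# customerID is an identifier with no predictive value, so drop it.
df = raw.drop(columns=["customerID"])

# Parse TotalCharges as numbers; unparseable entries become NaN and are dropped.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)

# Everything except the three numerical columns is treated as categorical.
numerical_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_cols = [c for c in df.columns if c not in numerical_cols]

# Label-encode each categorical column, keeping the encoders for later decoding.
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
```

LabelEncoder assigns integer codes in sorted order of the class labels, so the codes carry no meaningful ordering for nominal features such as PaymentMethod.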

For visualization purposes, we have divided the features into 3 groups.
• Customer Information – gender, SeniorCitizen, Partner, Dependents.
• Services Signed up for – PhoneService, MultipleLines, InternetService, StreamingTV, StreamingMovies, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport.
• Payment Information – PaymentMethod, Contract, PaperlessBilling.

B. Target Variable Visualization (Churn)

The target variable is “Churn”, which takes the values 0 and 1, making this a binary classification problem. The dataset is imbalanced, with a ratio of roughly 3 : 1 for Not-Churn : Churn customers.

C. Categorical Features vs Target Variable (Churn)

Some of the Inferences Obtained –
• Churn is very similar for male and female customers.
• The number of SeniorCitizen customers is quite low, and roughly 40% of them churned.
• Customers living with a Partner churned less than those not living with a Partner.
• Similarly, churn is high for customers who have no Dependents.
• For PhoneService, among customers with no phone service, more were retained than dropped the services.
• A high number of customers have shown resistance to Fiber optic as their InternetService. In contrast, the graph suggests that customers prefer DSL for their InternetService.
• StreamingTV and StreamingMovies display nearly identical graphs. Whether or not customers subscribed to StreamingTV & StreamingMovies, many of them churned.

D. Categorical Features vs Positive Target Variable

Some of the Inferences obtained –
• We can observe a clear-cut 50% - 50% split between the male and female customers who have switched their services. Hence, the reason for switching is likely related to the service or to some process that customers reacted badly to.
• 75% of the churned customers are not SeniorCitizen.
• Customers living by themselves have cut off the services. From the Partner & Dependents data, an average of 73.4% of churned customers were living by themselves.
• Despite having PhoneService, a high percentage of customers have switched.
• Similarly, the availability of MultipleLines did not matter, as customers unsubscribed regardless.
• Customers did not take to Fiber Optic for InternetService, with Fiber Optic customers making up 70% of those who opted out of the services.
• For StreamingTV & StreamingMovies, customers without these services clearly canceled their subscriptions; however, an average of 43.7% of customers switched despite consuming the streaming content.
• Month-to-Month contracts account for the dominant share of churn, at 88.6% of churned customers!
• PaperlessBilling does not seem to be liked by the customers.
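Sections C and D look at the same counts from two directions: C asks what fraction of each category churned, while D asks how the churned customers split across categories. A minimal sketch of the difference, using made-up toy data that is not taken from the dataset:

```python
import pandas as pd

# Toy data (illustrative only): 5 senior and 24 non-senior customers; 1 = churned.
df = pd.DataFrame({
    "SeniorCitizen": [1] * 5 + [0] * 24,
    "Churn": [1, 1, 0, 0, 0] + [1] * 6 + [0] * 18,
})

# Section C view: within each SeniorCitizen group, what fraction churned?
churn_rate = pd.crosstab(df["SeniorCitizen"], df["Churn"], normalize="index")

# Section D view: among churned customers only, how are the groups represented?
churned_mix = df.loc[df["Churn"] == 1, "SeniorCitizen"].value_counts(normalize=True)

print(churn_rate)
print(churned_mix)
```

In this toy sample 40% of senior citizens churn, yet seniors make up only 25% of churned customers because the group is small. The two views answer different questions and should not be compared directly.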

E. Numerical Features Distribution

tenure and MonthlyCharges show bimodal distributions, with peaks at 0 - 70 and 20 - 80 respectively. TotalCharges displays a positively skewed (right-skewed) distribution.

F. Numerical Features vs Target Variable

Inferences obtained –
• Considering tenure, a high number of customers left after the 1st month. This high rate of cancellation continues for 4 - 5 months, although churn declines after the 1st month. As tenure increases, fewer customers drop out.
• This results in low customer churn at higher tenure.
• For MonthlyCharges, the churn rate is high for values between 65 (13x5) and 105 (21x5). Customers in this MonthlyCharges range were the most likely to switch.
• A very high number of customers opted out of the services at TotalCharges below 500. This churn continues across the TotalCharges range from 0 (0x500) to 1000 (2x500).
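The notation "13x5" and "2x500" suggests that MonthlyCharges and TotalCharges were grouped into fixed-width bins (width 5 and 500) before plotting. That reading is an assumption, and the sketch below only shows how such a binning could be done. The sample values and the `_bin` column names are hypothetical.

```python
import pandas as pd

# Made-up sample values (illustrative only); 1 = churned.
df = pd.DataFrame({
    "MonthlyCharges": [20.5, 66.0, 70.1, 99.9, 104.0, 45.0],
    "TotalCharges": [20.5, 450.0, 980.0, 3500.0, 120.0, 2200.0],
    "Churn": [0, 1, 1, 1, 1, 0],
})

# Integer bin index: bin k covers the interval [k * width, (k + 1) * width).
df["MonthlyCharges_bin"] = (df["MonthlyCharges"] // 5).astype(int)
df["TotalCharges_bin"] = (df["TotalCharges"] // 500).astype(int)

# Number of churned customers per bin, as a grouped count plot would show.
monthly_churn = df.groupby("MonthlyCharges_bin")["Churn"].sum()
total_churn = df.groupby("TotalCharges_bin")["Churn"].sum()

print(monthly_churn)
print(total_churn)
```

Under this scheme, bin 13 of MonthlyCharges starts at 65 and bin 21 starts at 105, which matches the range quoted above.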

G. Outliers

There are no outliers in the data.

H. Summary of EDA (exploratory data analysis)

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

Order / Values of features for customer churn cases:

1. Categorical Features (Order):

• gender : Male = Female
• SeniorCitizen : No SeniorCitizen > SeniorCitizen
• Partner : No Partner > Partner
• Dependents : No Dependent > Dependent
• PhoneService : PhoneService > No PhoneService
• MultipleLines : MultipleLines > No MultipleLines > No PhoneService
• InternetService : Fiber Optic > DSL > No InternetService
• OnlineSecurity : Absent > Present > No InternetService
• OnlineBackup : Absent > Present > No InternetService
• DeviceProtection : Absent > Present > No InternetService
• TechSupport : Absent > Present > No InternetService
• StreamingTV : Absent > Present > No InternetService
• StreamingMovies : Absent > Present > No InternetService
• Contract : Month-to-Month > One year > Two year
• PaperlessBilling : Present > Absent
• PaymentMethod : Electronic check > Mailed check > Bank Transfer (automatic) > Credit Card (automatic)

2. Numerical Features (Range):

• tenure : 1 - 5 months
• MonthlyCharges : 65 - 105
• TotalCharges : 0 – 1000

FEATURE ENGINEERING

A. Data Transformation and Scaling

Since the numerical features are not normally distributed, we tried several techniques to make them closer to normal, but only the Box-Cox transformation showed promising results, and only for TotalCharges. TotalCharges is therefore transformed using Box-Cox, and StandardScaler is used to standardize the numerical features.
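As a rough sketch of this step, the snippet below applies `scipy.stats.boxcox` followed by `StandardScaler` to synthetic right-skewed data standing in for TotalCharges. The synthetic data and the `skewness` helper are assumptions for illustration, not the project's code.

```python
import numpy as np
from scipy.stats import boxcox
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic right-skewed, strictly positive values standing in for TotalCharges.
total_charges = rng.lognormal(mean=6.5, sigma=1.0, size=1000)

# Box-Cox requires strictly positive input; lambda is fitted by maximum likelihood.
transformed, fitted_lambda = boxcox(total_charges)

# Standardize to zero mean and unit variance.
scaled = StandardScaler().fit_transform(transformed.reshape(-1, 1)).ravel()

def skewness(x):
    """Sample skewness: third central moment divided by the cubed std deviation."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

print("lambda:", fitted_lambda)
print("skewness before:", skewness(total_charges))
print("skewness after:", skewness(scaled))
```

Standardization only shifts and rescales the values, so any reduction in skewness comes from the Box-Cox step.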

B. Correlation Matrix

No collinearity is observed between the features. MultipleLines, PhoneService, gender, StreamingTV, StreamingMovies, and InternetService do not display any meaningful correlation. Features with a correlation coefficient in the range (-0.1, 0.1) are dropped. The rest display a significant positive or negative correlation.
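A possible way to apply this filter is sketched below. It assumes the coefficient in question is each feature's correlation with Churn, which the text does not state explicitly. The data is synthetic, with one feature built to track churn and one built as pure noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000
churn = rng.integers(0, 2, size=n)

df = pd.DataFrame({
    # Constructed to disagree with churn 80% of the time (strong negative correlation).
    "Contract": np.where(rng.random(n) < 0.8, 1 - churn, churn),
    # Pure noise, generated independently of churn.
    "gender": rng.integers(0, 2, size=n),
    "Churn": churn,
})

# Correlation of every feature with the target (assumed selection criterion).
corr_with_churn = df.drop(columns="Churn").corrwith(df["Churn"])

# Features whose coefficient lies inside (-0.1, 0.1) are treated as uninformative.
weak = corr_with_churn[corr_with_churn.abs() < 0.1].index.tolist()
selected = df.drop(columns=weak)

print(corr_with_churn)
print("dropped:", weak)
```

Pearson correlation on label-encoded nominal features depends on the arbitrary integer codes, so this filter is best read as a coarse screen.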

MODELING & ANALYSIS OF RESULTS

A. Linear Regression

• Linear regression assumes a linear relationship between the independent variables and the dependent variable, and the residual errors should be normally distributed with constant variance.
• However, in the given dataset, there are categorical variables, and the relationship between the independent variables and the dependent variable is not necessarily linear. Therefore, linear regression is not suitable for modeling this data.

B. Logistic Regression

• Logistic regression can be used to predict the probability of churn for customers based on features such as their tenure and monthly charges, among others. It is a binary classification algorithm that models the relationship between a set of features and a binary target variable, in this case whether a customer will churn or not.
• The logistic regression model estimates the probability of churn from the input features and outputs a binary value of 1 (churn) or 0 (not churn) for each customer. By setting a probability threshold, businesses can use the model to identify the customers who are at high risk of churning and take proactive measures to retain them.
• Since the dataset is imbalanced in a 3:1 ratio for Not-Churn:Churn customers, the model may tend to predict the majority class more often and ignore the minority class. Class weights assign a higher weight to the minority class and a lower weight to the majority class, allowing the model to give more importance to the minority class during training. This can lead to better performance in predicting the minority class.
• Our Logistic Regression model has an accuracy of 0.74, and for the positive class a precision of 0.5, a recall of 0.78, and an F1-score of 0.61. These metrics indicate that the model identifies a high percentage of the churned customers, but it also misclassifies a significant number of non-churned customers as churned. However, the area under the ROC curve is 0.83, which indicates that the model distinguishes well between the positive and negative classes.
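The workflow described above (class-weighted logistic regression, a probability threshold, and the reported metrics) can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the project's model. The 0.5 threshold, the train/test split, and `class_weight="balanced"` as the weighting scheme are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data with a roughly 3:1 class ratio.
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# class_weight="balanced" weights each class inversely to its frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

churn_proba = model.predict_proba(X_test)[:, 1]  # estimated P(churn)
threshold = 0.5                                  # hypothetical cut-off
y_pred = (churn_proba >= threshold).astype(int)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, churn_proba),
}
print(metrics)

# Consistency check on the reported figures: F1 = 2PR / (P + R).
reported_f1 = 2 * 0.5 * 0.78 / (0.5 + 0.78)
print(round(reported_f1, 2))  # 0.61
```

Lowering the threshold flags more customers as at risk, raising recall at the cost of precision. This is the same trade-off visible in the reported recall of 0.78 and precision of 0.5.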

C. K-Means Clustering

• K-means clustering is an unsupervised machine learning algorithm used to group similar data points together in a dataset. It tries to identify natural groupings in the data by minimizing the sum of squared distances between data points and their assigned cluster centroid. The algorithm works by randomly initializing k centroids, assigning data points to the nearest centroid, computing the new centroids based on the mean of the assigned data points, and repeating the process until convergence. The optimal number of clusters is typically determined through trial and error or using a statistical method.
• From the graph, 4 is the optimal number of clusters for our data. After fitting the data with 4 clusters, we obtained a similar number of data points in each cluster. We can also observe that Clusters 0 and 2 had a higher churn rate than Clusters 1 and 3.
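The K-means steps listed above (random initialization, assignment, centroid update, repeat until convergence) can be written out directly. The following is a minimal NumPy sketch on synthetic blobs. It also computes the within-cluster sum of squares for several values of k, as an elbow-style plot would, although the text does not say which graph was used to choose 4. The function name `kmeans` and all data are illustrative.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    # Final assignment and within-cluster sum of squared distances.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = float(dists[np.arange(len(points)), labels].sum())
    return centroids, labels, inertia

# Four well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia for several k; an elbow plot looks for where the decrease flattens.
inertias = {k: kmeans(points, k)[2] for k in range(1, 7)}
print(inertias)
```

A single random initialization can settle in a poor local optimum. scikit-learn's `KMeans` mitigates this with k-means++ initialization and an `n_init` parameter for multiple restarts.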

We found that customers in Clusters 0 & 2 typically –
• Have shorter tenure than customers in other clusters
• Are on a Month-to-Month contract
• Are more likely not to be senior citizens
• Use Fiber Optic (Cluster 2 customers also use DSL)
• Do not have OnlineSecurity

Cluster analysis is helpful for placing customers into segments using data, which allows businesses to decide which segment(s) to target with distinct marketing mixes that satisfy the needs and wants of each targeted cluster. For a new customer, we can easily determine which cluster they fall into, and this can be used to customize plans for them. We can also cluster the data at different time intervals and then compare the results to see whether any customers have moved from one cluster to another.
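Assigning a new customer to an existing segment could look roughly like the sketch below, which uses scikit-learn's `KMeans`. For brevity it uses two synthetic groups and two features rather than the 4 clusters from the study. All values, including the new customer's, are made up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic, already-scaled features for two customer groups (illustrative only).
short_tenure = rng.normal(loc=[-1.0, 1.0], scale=0.3, size=(100, 2))
long_tenure = rng.normal(loc=[1.0, -1.0], scale=0.3, size=(100, 2))
features = pd.DataFrame(np.vstack([short_tenure, long_tenure]),
                        columns=["tenure", "MonthlyCharges"])

# Made-up churn labels: the short-tenure group churns more often.
churn = np.concatenate([rng.random(100) < 0.6, rng.random(100) < 0.1]).astype(int)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
segments = features.assign(cluster=kmeans.labels_, Churn=churn)

# Churn rate per cluster indicates which segments need attention.
cluster_churn = segments.groupby("cluster")["Churn"].mean()
print(cluster_churn)

# Assign a new (hypothetical) customer to the nearest existing centroid.
new_customer = pd.DataFrame({"tenure": [-0.9], "MonthlyCharges": [1.1]})
new_cluster = int(kmeans.predict(new_customer)[0])
print("new customer falls in cluster", new_cluster)
```

The new customer must be scaled with the same transformation that was fitted on the training data before calling `predict`, otherwise distances to the centroids are not comparable.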

MEASURES FOR REDUCING CUSTOMER CHURN & INCREASING REVENUE

• 3 types of customers should be targeted: SeniorCitizen, those living with a Partner, and those living alone.
• The number of SeniorCitizen customers is low, but the lower limit of their MonthlyCharges is higher than that of other customers. Thus, SeniorCitizen customers are ready to pay more, but they need to be offered a matching level of service.
• In order to build a strong customer base, the Telco company needs to create an easy and affordable entry point for its services. During the first 6 months of tenure, it needs to focus extensively on OnlineSecurity, OnlineBackup, DeviceProtection & TechSupport, as this period is the most critical and uncertain for customers.
• Once it has built solid support services for customers, it needs to push the usage of MultipleLines & Fiber Optic cables for PhoneService & InternetService respectively. Lower the entry price at least initially; prices can be increased afterwards.
• StreamingTV and StreamingMovies need to be made affordable. The content of these services should target all types of customers. This needs to be followed up with an easy and hassle-free PaymentMethod.
• Put an end to Electronic check as a payment option due to its high churn, and focus entirely on Bank Transfer (automatic) & Credit Card (automatic).
• Once the MonthlyCharges for any single service hit the 70 mark, customers become very conscious of their MonthlyCharges. Lower them by providing offers for a certain period of time.

CONCLUSION

• This is a great dataset that offers an opportunity to peek into a real-world business problem that can be tackled with Data Science techniques.
• Insights gained from the Data Analysis are very valuable for understanding the effectiveness of the existing systems that are in place. They also assist in drawing up plans & measures to counter the problems and to sustain a continuous cycle of improvement.
• We have successfully developed models that help predict the probability of customer churn (which helps the telecom company act) and determine which of the 4 clusters a customer falls into, which in turn informs the type of action plan required to prevent churning.
