Lead Scoring Model

This project develops a lead scoring model that predicts the probability that a potential client converts into a paying customer. The model is trained on data provided by the client from their Event Management SaaS application.

Directory Structure

.       
├── README.md                           # Project documentation.        
├── data                                # Contains raw and processed data files.        
│   ├── interim             
│   │   ├── full_dataset.csv        
│   │   └── leads_data_cleaned.csv              
│   ├── processed               
│   │   └── full_dataset.csv                
│   └── raw             
│       ├── leads.csv               
│       └── offers.csv              
├── docs                
├── models              
├── notebooks               
│   └── data_preprocessing.ipynb                
├── reports             
│   ├── leads_report.html               
│   ├── model_training.log      
│   └── offers_report.html              
├── requirements.txt                    # Required dependencies for the project.        
└── src                                 # Contains source code for the project.        
    ├── app.py                            
    ├── data 
    ├── models              
    │   ├── __init__.py             
    │   ├── predict_model.py                
    │   └── train_model.py              
    └── utils               
        └── logger.py    

Data Description

There are two datasets that reflect the phases of the client's sales process: leads.csv (all potential clients) and offers.csv (clients that reached at least the demo meeting). The target column is Status from the offers dataset, which indicates whether a client has bought the product.

leads.csv

  • Id: Unique identifier for the lead.
  • First Name: Lead's first name.
  • Use Case: Type of use case for the potential client.
  • Source: Lead source (e.g., Inbound, Outbound).
  • Status: Current status of the lead.
  • Discarded/Nurturing Reason: Reason for lead discard or nurturing.
  • Acquisition Campaign: Acquisition campaign that generated the lead.
  • Created Date: Lead creation date.
  • Converted: Target variable, indicating whether the lead converted (1) or not (0).
  • City: City of the lead.

offers.csv

  • Id: Unique identifier for the offer.
  • Use Case: Type of use case for the offer.
  • Status: Current status of the offer.
  • Created Date: Offer creation date.
  • Close Date: Offer closing date.
  • Price: Offer price.
  • Discount code: Applied discount code.
  • Pain: Potential customer's pain level.
  • Loss Reason: Reason for offer loss.

Data Preparation and Modeling

  1. Data Fusion: The leads.csv and offers.csv datasets were merged using the unique identifier (Id).

  2. Handling Missing Data:

    • Removed rows from leads.csv where there were null values in the Id column.
    • Dropped the First Name, Use Case, Created Date, Status, and Converted columns from leads.csv.
    • Dropped the Id, Discarded/Nurturing Reason, and Acquisition Campaign columns from the merged data.
    • Imputed missing values in the Loss Reason column according to the Status values in the merged data.
    • Imputed missing values in categorical columns of the merged data with the mode.
    • Imputed missing values in numerical columns of the merged data with the mean.
  3. Mapping target column values:

    • Grouped the minority classes into a new class called Other.
  4. Label Encoding:

    • Applied Label Encoding to categorical columns: Source, City, Loss Reason, Pain, Discount code, Status, Use Case.
  5. Data Scaling:

    • Used StandardScaler to scale numerical features in the dataset.
  6. Model Selection:

    • Explored various classification algorithms, including Random Forest, Adaboost, Extra Trees, Bagging Classifier, Gradient Boosting, Decision Tree, Naive Bayes, KNN, Logistic Regression, SGD Classifier, MLP Classifier, and SVM.
  7. Model Evaluation:

    • Compared model performance using cross-validation scores.
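
The preparation steps above (merge, imputation, label encoding, scaling) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two small DataFrames are invented stand-ins for leads.csv and offers.csv, the left join from offers is an assumption (the README only says the datasets were merged on Id), and the Status-based imputation of Loss Reason is left out because its rule is not described here.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Invented stand-ins for leads.csv and offers.csv (values are not real data).
leads = pd.DataFrame({
    "Id": ["a1", "a2", "a3", None],
    "Source": ["Inbound", "Outbound", None, "Inbound"],
    "City": ["Madrid", None, "Bogota", "Lima"],
})
offers = pd.DataFrame({
    "Id": ["a1", "a2", "a3"],
    "Status": ["Closed Won", "Closed Lost", "Closed Won"],
    "Price": [300.0, None, 500.0],
    "Loss Reason": [None, "price", None],
})

# Remove leads without an Id, then merge on Id.
# Assumption: a left join that keeps every offer, since the target lives in offers.
leads = leads.dropna(subset=["Id"])
merged = offers.merge(leads, on="Id", how="left").drop(columns=["Id"])

# Split columns by type before any encoding changes their dtypes.
num_cols = [c for c in merged.columns if is_numeric_dtype(merged[c])]
cat_cols = [c for c in merged.columns if c not in num_cols]

# Impute: mode for categorical columns, mean for numerical columns.
for col in cat_cols:
    merged[col] = merged[col].fillna(merged[col].mode().iloc[0])
for col in num_cols:
    merged[col] = merged[col].fillna(merged[col].mean())

# Label-encode each categorical column with its own encoder.
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    merged[col] = encoders[col].fit_transform(merged[col])

# Standardize numerical features to zero mean and unit variance.
scaler = StandardScaler()
merged[num_cols] = scaler.fit_transform(merged[num_cols])

print(merged)
```

Keeping the fitted encoders and scaler around makes it possible to apply the same transformations to new leads at prediction time. In a full pipeline they would typically be fitted on the training split only, to avoid leaking information from the test set.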

Dependencies

Ensure you have the necessary dependencies installed by running:

pip install -r requirements.txt

Usage

  • To pre-process the data, run the data_preprocessing.py script:
python3 -m src.eda.data_preprocessing
  • To train and evaluate the models, run the train_model.py script. It loads the processed dataset, splits it into training and testing sets, trains multiple models, and displays their cross-validation scores:
python3 -m src.models.train_model
  • To run the application with the Shimoku API:
python3 src/app.py
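
As a rough idea of what the training step does, the sketch below compares a few of the listed classifiers by cross-validation score. It is an assumption-laden illustration, not the contents of train_model.py: the data comes from scikit-learn's make_classification instead of the processed dataset, only three of the explored algorithms are included, and the 80/20 split and 5-fold setting are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic three-class data standing in for data/processed/full_dataset.csv.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Hold out a test set; cross-validation runs on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# A subset of the candidate models, with fixed seeds for repeatability.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {}
for name, model in models.items():
    cv = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    scores[name] = cv.mean()
    print(f"{name}: {cv.mean():.3f} (+/- {cv.std():.3f})")

# Pick the model with the highest mean score for further evaluation.
best_name = max(scores, key=scores.get)
print("Best model:", best_name)
```

The held-out test set is left untouched during model selection so that it can give an unbiased estimate for the chosen model afterwards.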

Results

  • Data Transformation: Classes in the "Status" column were grouped into three categories ("Closed Won", "Closed Lost", and "Other") to address class imbalance.

  • Best Model: The Gradient Boosting model achieved the highest cross-validation score (0.91) and was selected for further evaluation.

  • Classification Results:

    • Accuracy: 90.4%
    • Precision, Recall, F1-score: Detailed classification metrics for each class are provided in the updated classification report.
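
The grouping of Status values and the per-class metrics mentioned above can be illustrated with a short sketch. The status labels other than "Closed Won" and "Closed Lost", and the predictions, are hypothetical and chosen only to make the example run; grouping by an explicit keep-list is one possible way to do it, not necessarily the one used in this project.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical raw Status values; only the two "Closed" labels come from the README.
status = pd.Series(
    ["Closed Won", "Closed Lost", "Negotiation", "Closed Won", "Shared", "Closed Lost"]
)

# Keep the two majority classes and map everything else to "Other".
keep = ["Closed Won", "Closed Lost"]
grouped = status.where(status.isin(keep), "Other")

# Hypothetical model predictions for the same six rows.
predicted = ["Closed Won", "Closed Lost", "Other", "Closed Lost", "Other", "Closed Lost"]

# Overall accuracy plus precision, recall, and F1-score per class.
accuracy = accuracy_score(grouped, predicted)
print(f"Accuracy: {accuracy:.3f}")
print(classification_report(grouped, predicted, zero_division=0))
```

With imbalanced classes, the per-class rows of the report are usually more informative than the overall accuracy.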
