Introduction

This Code Pattern is part of a series on Red Hat Marketplace operator FP Predict Plus. Please refer to the Prerequisites section for Getting Started.

How to build a Machine Learning Regression model using FP Predict Plus

Machine learning is a large field of study that overlaps with and inherits ideas from many related fields such as artificial intelligence. The focus of the field is learning, that is, acquiring skills or knowledge from experience. Most commonly, this means synthesizing useful concepts from historical data. As a practitioner, you may encounter many different types of learning, from whole fields of study to specific techniques.

Regression in machine learning and statistics is a supervised learning approach in which the computer program learns from the data given to it and then makes predictions for new observations. In this technique, the target variable takes continuous numeric values (for example, an amount spent) rather than the discrete categories of a classification problem.

Examples of regression problems include:

Given historical data, predict the temperature
Given historical data, predict the sales
Given historical data, predict the house price
Given historical data, predict customer spend (this code pattern)
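
To make the idea of a continuous target concrete, here is a minimal sketch of one-variable least-squares regression in plain Python. The visit and spend numbers are made up for illustration and are not taken from the datasets in this repo. FP Predict Plus builds its models automatically, and this sketch does not claim to reproduce its algorithm.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Return the continuous prediction for a new input value."""
    return slope * x + intercept


# Hypothetical history: number of store visits vs. amount spent.
visits = [1, 2, 3, 4, 5]
spend = [12.0, 19.5, 31.0, 41.5, 50.0]

slope, intercept = fit_line(visits, spend)
print(round(predict(6, slope, intercept), 2))
```

The prediction for six visits is a real number on a continuous scale, which is what distinguishes regression from classification.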

We will focus on predicting customer spend using historical data and demonstrate the automated process of building models using the FP Predict Plus operator from Red Hat Marketplace.

When the reader has completed this code pattern, they will understand how to:

Quickly set up the instance on an OpenShift cluster for model building.
Ingest the data and initiate the FP Predict Plus process.
Build different models using FP Predict Plus and evaluate the performance.
Choose the best model and complete the deployment.
Generate new predictions using the deployed model.

Architecture Diagram

User logs into the FP Predict Plus platform using an instance of the FP Predict Plus operator.
User uploads the data file in CSV format to the Kubernetes storage on the platform.
User initiates the model building process using the FP Predict Plus operator on the OpenShift cluster and creates pipelines.
User evaluates the different pipelines from FP Predict Plus and selects the best model for deployment.
User generates accurate predictions by using the deployed model.

We will be using the FP Predict Plus operator from Red Hat Marketplace to solve this use case. Please refer to the Included components section to learn more about the FP Predict Plus operator and Red Hat Marketplace.

Prerequisites

We need to install and set up the FP Predict Plus operator on an OpenShift cluster as per the instructions given below.

Install and set up FP Predict Plus operator on Red Hat Marketplace

Included components

Red Hat Marketplace: A simpler way to buy and manage enterprise software, with automated deployment to any cloud.
FP Predict Plus: An automated, self-learning, multi-modeling AI that handles discrete target variables, continuous target variables, and time series data with no need for coding.
Red Hat OpenShift Container Platform: Empower developers to innovate and ship faster with the leading hybrid cloud, enterprise container platform.

Featured technologies

Artificial Intelligence: Any system which can mimic cognitive functions that humans associate with the human mind, such as learning and problem solving.
Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
Analytics: Analytics delivers the value of data for the enterprise.
Machine Learning: Machine learning is a form of AI that enables a system to learn from data rather than through explicit programming.

Steps

Follow these steps to set up and run this code pattern using FP Predict Plus.

Add the data
Create a job
Review the job details
Analyze results
Download the Results & Model file
Prediction using new data
Create predict job
Check job summary
Analyze results of predict job
Download predicted results

1. Add the data

Launch the FP Predict Plus platform and sign in using the default credentials. Let's begin by adding datasets. Clone this repo and navigate to the data folder to download the datasets onto your local file system.

Click on Dataset Management, which is the third option on the left navigation pane, and select Datasets at the top.

Click on Browse and select the three CSV files for upload. The datasets are uploaded to the platform in about a minute; the upload time depends on the size of the datasets.

Note :- Only the CSV format is supported, and each dataset needs to have a column with unique values. In these CSV files, we have added a Row_num column to serve as the unique identifier. The datasets need to be split into training, testing and holdout (validation) datasets beforehand. Citation is needed to use these datasets for other projects.
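
If you are preparing your own data rather than using the files in this repo, the two requirements in the note (a unique identifier column and a train/test/holdout split done beforehand) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the split sizes, the random seed and the helper names are hypothetical, and nothing here is part of FP Predict Plus.

```python
import csv
import random


def has_unique_values(rows, column):
    """Return True if every row has a distinct value in `column`."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def split_rows(rows, test_fraction=0.05, holdout_count=4, seed=42):
    """Shuffle rows and split them into train, test and holdout lists.

    The default sizes are assumptions chosen for illustration only.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    holdout = shuffled[:holdout_count]
    remaining = shuffled[holdout_count:]
    test_size = max(1, round(len(remaining) * test_fraction))
    return remaining[test_size:], remaining[:test_size], holdout


def read_rows(path):
    """Read a CSV file into (column names, list of row dicts)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        return reader.fieldnames, list(reader)


def write_rows(path, fieldnames, rows):
    """Write row dicts to a CSV file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

A typical use would be to call read_rows on your source file, confirm has_unique_values(rows, "Row_num"), and then pass each list returned by split_rows to write_rows with a file name of your choice.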

2. Create a job

We need to create a new job in the platform. Click on Dashboard, which is the first option on the left navigation pane, and hit Start at the top right.

Click on No, as our data does not contain dates or timestamps. The platform creates a Predict job when the data has no dates or timestamps, and a Forecast job when it does.

Let's go ahead and create a new job by filling in the details as follows. Update the job name and select Model + Predict as the task, since we will both build a model and generate predictions. (Select Model to only build a model, or Predict if a model file is already available for generating predictions.) Set the dataset location to Cloud and select the train and test datasets using Browse. Select Spend as the Target Variable and Row_num as the Unique Identifier. Under Advanced settings, set Operation Mode to Automated. Hit Run to start the job.

The job will take a couple of minutes to complete. We can observe how many models are getting created and the different scenarios evaluated in the process.

Note :- The regression model will take a little longer to complete when compared with a binary classification model.

The platform tries different scenarios, such as using all variables or only a subset of them, when generating predictions. We should see the job status as shown below.

3. Review the job details

Let's review the job that has been created by clicking on Dashboard, the first option on the left navigation pane, and selecting the job named build-model. The number preceding the job name indicates how many jobs have been run so far and can be ignored. In the model distribution, we can observe that model M-9 scored 23 records in the test data, while models M-3 and M-5 scored one record each.

We can observe the complete job details, such as Description, Modelling and Prediction. The system built 7 models using 479 records from the training data in 290 seconds and generated predictions on 25 records from the testing data.

4. Analyze results

Let's review the model performance in detail. Click on the Predicted vs Actual option to see the model performance. We can observe that the predicted values are very close to the actual values. The model was able to learn from the data and generate predictions with good accuracy.

Note :- It was observed that the model is sensitive to outliers in the dataset. We can deal with outliers in different ways: identify the root cause and exclude them if necessary, or retain them in the dataset for further analysis and treat them accordingly.
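
One common way to spot candidate outliers before uploading data is the interquartile range (IQR) rule, sketched below. This is a general statistical heuristic and an assumption on our part, not a feature of FP Predict Plus; the sample values are made up for illustration.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical spend values with one suspicious entry.
sample_spend = [10, 12, 11, 13, 12, 11, 10, 12, 500]
print(find_outliers(sample_spend))
```

Flagged values still need a human decision: investigate the root cause first, then exclude or keep them as described in the note above.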

Click on Models to understand how many predictions were made by each model. We can observe that model M-9 scored 23 records from the testing data.

Click on Variables to understand the significance of each variable in predicting the outcome. We can observe that all variables except LoanAmount were used by most of the models.

Click on Variables of Models to understand the different scenarios explored by the models. We can observe that models M-3 and M-9 used all variables, whereas M-5 used ten variables.

5. Download the Results & Model file

We can download the results, which contain all the model details mentioned above, for further analysis. Let's go ahead and click on Download Results and Download Model File under the Download Files option and save them to the local system. The Results file is named build-model-report and the model file is named build-model-file.models.

In the Results file, under the Prediction Result tab, we can see the model performance: the Mean Absolute Percentage Error (MAPE) is 5.59%, which corresponds to a model accuracy of about 94% (100% minus the MAPE). These results are great given that we have used relatively small datasets. We can also review other details of the model in the Excel file.

The Results and Model File for this experiment are also available in this repo under the reports and model folders. The Model file can be uploaded to the cloud using the Dataset Management option as described earlier.
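
For readers who want to recompute the metric themselves, the sketch below shows the standard definition of MAPE, the mean of |actual - predicted| / |actual| expressed in percent. Whether FP Predict Plus computes it in exactly this way is an assumption, and the sample numbers are made up rather than taken from the Results file.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    Actual values must be nonzero, since each error is divided by one.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


# Hypothetical actual and predicted spend values.
actual_spend = [100.0, 200.0, 400.0]
predicted_spend = [90.0, 220.0, 380.0]

error = mape(actual_spend, predicted_spend)
print(f"MAPE: {error:.2f}%  accuracy: {100 - error:.2f}%")
```

Applying the same arithmetic to the figure reported above gives 100 - 5.59 = 94.41, which is where the roughly 94% accuracy comes from.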

6. Prediction using new data

In this section, we will learn how to generate predictions using the model on a new dataset. We will use the saved model from the previous step to generate predictions for new records from the holdout data. The holdout data file needs to include the target variable column Spend (without any values); otherwise, the system reports an error stating that the headers do not match between the training data and the holdout data.
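
A quick local check can catch a header mismatch before the file is uploaded. The sketch below is an illustration only: the helper names and the Age and Income columns are hypothetical, and it compares column names without regard to order, which may be looser than the platform's own check.

```python
import csv


def read_header(path):
    """Return the first row (the column names) of a CSV file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


def missing_columns(train_header, holdout_header):
    """List columns present in the training data but absent from the holdout data."""
    return [name for name in train_header if name not in holdout_header]


def blank_target(rows, target="Spend"):
    """Return copies of the rows with the target column present but empty."""
    return [{**row, target: ""} for row in rows]


# Hypothetical headers; only Row_num and Spend come from this code pattern.
train_header = ["Row_num", "Age", "Income", "Spend"]
holdout_header = ["Row_num", "Age", "Income"]
print(missing_columns(train_header, holdout_header))

# Add the empty Spend column to holdout rows that lack it.
holdout_rows = [{"Row_num": "1", "Age": "34", "Income": "52000"}]
print(blank_target(holdout_rows))
```

In practice, read_header can be pointed at the training and holdout files to produce the two header lists being compared.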

7. Create predict job

Let's create a new job for prediction by clicking on the Dashboard option in the left navigation pane and hitting Start. Update the job name and description, and select Predict under Tasks, as we have already built the model in the previous steps. Upload the model file and holdout data from cloud or local storage, whichever is convenient, and select Row_num as the Unique Identifier.

The predict job will start as shown below. We should get a message stating Job Completed Successfully in a couple of minutes.

8. Check job summary

Let's look at the job summary by clicking Dashboard and selecting the generate-predictions job. We can observe that models M-3 and M-9 scored two records each in the holdout data.

9. Analyze results of predict job

We can get more details in the next step, where we can observe that predictions were made on all four records from the holdout dataset and that the trained model file used 479 records to build seven models.

Note :- The Predicted vs Actual option is not clickable because there are no actual values to compare against. We have generated predicted values given a set of input parameters. We can review Models, Variables & Variables of Models as part of model evaluation.

10. Download predicted results

We can get all the details about model performance by clicking on Download Results under the Download Files option as shown below.

In the file, under the Predicted Result tab, we can observe four values under Predicted Value, which are the generated predictions for the holdout data. The Results file, named generated-predictions-report, is also available under the reports folder for reference.

Summary

With this, we have come to the end of this code pattern. We have learnt how to use the FP Predict Plus platform to build AI models using the regression technique and explored how to generate predictions on a new dataset. This platform will be extremely beneficial for developers and data scientists who need to build AI solutions quickly across different domains.

Citation for data:

The dataset referenced in this tutorial was prepared by R.K.Sharath Kumar, Data Scientist, IBM India Software Labs.
