Build your Machine Learning models the easy way with SPSS

Introduction

This tutorial explains how to graphically build and evaluate machine learning models by using the SPSS Modeler flow feature in IBM® Watson™ Studio. IBM Watson SPSS Modeler flows in Watson Studio provide an interactive environment for quickly building machine learning pipelines that flow data from ingestion to transformation to model building and evaluation, without needing any code. This tutorial introduces the SPSS Modeler components and explains how you can use them to build, test, evaluate, and deploy models.

As with the other tutorials in this learning path, we use a customer churn data set that is available on Kaggle.

Prerequisites

To complete this tutorial, you need:

An IBM Cloud account

Estimated time

It should take you approximately 60 minutes to complete this tutorial.

Steps

If you do not have an IBM Cloud account, create one; otherwise, log in to your existing account.
Make sure that you have cloned or downloaded this repository.

Set up your environment

Create an IBM Cloud Object Storage service.
Create an IBM Watson Studio project.
Provision IBM Cloud services.
Upload the data set.

You must complete these steps before continuing with the learning path. If you have finished setting up your environment, continue with the next step, creating a model flow.

1. Create a Watson Studio Service

Once you are logged in, search for Watson Studio and create a Lite instance.

2. Create IBM Cloud Object Storage service

To create an IBM Cloud Object Storage service, search for Object Storage and create a Lite instance.

3. Create a Watson Machine Learning Service instance

To create a Watson Machine Learning instance, search for Machine Learning and create a Lite instance.

4. Upload data set

Clone or download this repository:

git clone https://github.com/fawazsiddiqi/WatsonSPSS

In Watson Studio, go to the Assets tab, then drag and drop customer-churn-kaggle.csv from the Data folder of the repository that you just cloned or downloaded.

Create model flow

To create an initial machine learning flow:

From the Assets page, click Add to project.
In the Choose asset type page, select Modeler Flow.
On the Modeler page, select the ‘From File’ tab.

Locate the model flow named ‘customer-churn-flow.str’ in the data folder of the repository that you downloaded earlier. If you have not downloaded the repository yet, download it first.
Drag the downloaded modeler flow file to the upload area. This also sets the name for the flow.
Change the name and provide a description for the machine learning flow (optional).
Click Create. This opens the Flow Editor that can be used to create a machine learning flow.

You have now imported an initial flow that we’ll explore in the rest of this tutorial.

Under the Modeling drop-down menu, you can see the various supported modeling techniques. The first one is Auto Classifier, which tries several techniques and then presents the results of the best one.

The main flow itself defines a pipeline consisting of several steps:

A Data Asset node for importing the data set
A Type node for defining metadata for the features, including a selection of the target attributes for the classification
An Auto Data Prep node for preparing the data for modeling
A Partition node for partitioning the data into a training set and a testing set
An Auto Classifier node called ‘churn’ for creating and evaluating the model

Additional nodes have been associated with the main pipeline for viewing the input and output. These are:

A Table output node called ‘Input Table’ for previewing the input data
A Data Audit node called ‘21 fields’ (default name) for auditing the quality of the input data set (min, max, mean, and standard deviation)
An Evaluation node for evaluating the generated model
A Table output node called ‘Result Table’ for previewing the results of the test prediction

Other input and output types can be viewed by selecting the Outputs drop-down menu.

Assign data asset and run the flow

To run the flow, you must first connect it to the appropriate data set in your project.

Select the three dots of the Data Asset node to the left of the flow (the input node).
Select the Open command from the menu. This shows the attributes of the node in the right part of the page.

Click Change data asset to change the input file.
On the next page, select the .csv file that contains the customer churn data, and click OK.
Click Save.
Click Run (the arrowhead) in the toolbar to run the flow.

Running the flow creates a number of outputs or results that can be inspected in more detail.

Understanding the data

Now that you have run the flow, take a closer look at the data.

Select the Input Table node at the top of the flow diagram.
Select the three dots in the upper-right corner and invoke the Profile command from the pop-up menu.

The last interaction might run part of the flow again but has the advantage that the page provides a Profile tab for profiling the data and a Visualization tab for creating dashboards.

Now, let’s take a closer look at each of the data columns, such as the values for their minimum, maximum, mean, and standard deviation:

Click one level back in the breadcrumb list at the top of the page to return to your flow.

Select the View outputs and versions command from the upper-right portion of the toolbar.
Select the Outputs tab.

Double-click the output for the “data audit” node named “21 Fields.” Alternatively, select the three dots associated with the output and select Open from the pop-up menu.

This gives you an overview like the one in the following image.

For each feature, the overview shows the distribution in graphical form and whether the feature is categorical or continuous. For numerical features, the computed min, max, mean, standard deviation, and skewness are shown as well. The column named Valid shows 3333 valid values for each listed feature, which means that no values are missing. You therefore do not need to filter or transform columns with missing values during preprocessing.
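To make the audit columns more concrete, here is a minimal Python sketch that computes comparable statistics for a single numeric column. It is an illustration only, not how SPSS Modeler computes its values: the function name `audit_column` and the sample values are hypothetical, and the sketch uses population formulas for the standard deviation and skewness, so the numbers may differ slightly from those shown by the Data Audit node.

```python
from statistics import mean, pstdev


def audit_column(values):
    """Summarize one numeric column; None marks a missing value.

    Assumes at least one non-missing value is present.
    """
    valid = [v for v in values if v is not None]
    mu = mean(valid)
    sigma = pstdev(valid)  # population standard deviation (an assumption)
    if sigma == 0:
        skewness = 0.0
    else:
        # Population skewness: third central moment divided by sigma cubed.
        skewness = sum((v - mu) ** 3 for v in valid) / (len(valid) * sigma ** 3)
    return {
        "valid": len(valid),
        "missing": len(values) - len(valid),
        "min": min(valid),
        "max": max(valid),
        "mean": mu,
        "std_dev": sigma,
        "skewness": skewness,
    }


# Hypothetical sample: four rows, one of them missing.
print(audit_column([0, 25, None, 31]))
```

A column whose `valid` count equals the number of rows, as in the churn data set, has no missing values.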

Data preparation

You can change the initial assessment of the features made by the import by using the Type node, which happens to be the next node in the pipeline. To achieve this:

Go back to the Flow Editor by selecting customer-churn-flow in the toolbar.
Select the Type node.
Select the Open command from the pop-up menu.

This provides a table that shows the features (that is, the fields), their measure (for example, continuous or flag), and their role, along with other properties.

The Measure can be changed if needed using this node and it is also possible to specify the role of a feature. In this case, the role of the churn feature (which is a Flag with True and False values) has been changed to Target. The Check column might give you more insight into the values of the field.

Click Cancel to close the property editor for the Type node.

The next node in the pipeline is the Auto Data Prep node. This node automatically transforms the data, such as converting categorical fields into numerical ones. To view its results:

Select the Auto Data Prep node in the flow editor.
Select Open from the pop-up menu.

This node offers a multitude of settings, for example, for defining the objective of the transformation (optimize for speed or for accuracy).

The previous image shows that the transformation has been configured to exclude fields with too many missing values (the threshold is 50%) and to exclude fields with too many unique categories. Assume that the latter applies to the phone numbers and don’t worry about them.
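The two exclusion rules can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Auto Data Prep implementation: the function name `fields_to_exclude` is hypothetical, the threshold of 50 is assumed to be a percentage, and the `max_categories` default of 100 is invented for the example because the tutorial does not state the actual limit.

```python
def fields_to_exclude(columns, categorical, max_missing_pct=50.0, max_categories=100):
    """Return {field name: reason} for fields that the two rules would drop.

    columns:     dict mapping field name to a list of values (None = missing)
    categorical: set of field names to treat as categorical
    """
    excluded = {}
    for name, values in columns.items():
        missing = sum(1 for v in values if v is None)
        missing_pct = 100.0 * missing / len(values) if values else 0.0
        if missing_pct > max_missing_pct:
            excluded[name] = "too many missing values"
        elif name in categorical:
            unique = {v for v in values if v is not None}
            if len(unique) > max_categories:
                excluded[name] = "too many unique categories"
    return excluded


# Hypothetical miniature data set: every phone number is unique.
sample = {
    "phone number": ["351-7269", "382-4657", "371-7191", "358-1921"],
    "state": ["NY", "NY", "OH", "OH"],
}
print(fields_to_exclude(sample, {"phone number", "state"}, max_categories=3))
```

A field such as the phone number, which is nearly unique per customer, carries no generalizable signal, which is why excluding it is reasonable.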

The next node in the pipeline is the Partition node, which splits the data set into a training set and a testing set. For the current Partition node, an 80-20 split has been used.
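For intuition, an 80-20 split can be sketched in Python as follows. This is a minimal sketch, not the Partition node's algorithm: the function name `partition` and the seed are hypothetical, and the sketch shuffles the records and cuts the list at a fixed point, whereas the node may assign records to partitions differently.

```python
import random


def partition(records, train_fraction=0.8, seed=42):
    """Shuffle the records reproducibly and cut them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 3333 customer records.
training, testing = partition(range(3333))
print(len(training), len(testing))
```

Keeping the testing set out of training is what makes the later evaluation an honest estimate of performance on unseen customers.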

Training the model

The next node in the SPSS Modeler flow is the Auto Classifier node named “churn.” This node trains the model based on various build options, such as how to rank and discard generated models (using threshold accuracy).

If you Open the node and select the BUILD OPTIONS option from the drop-down menu, you see the property Number of models to use is set to 3, which is the default value. Feel free to change it to a higher number, and then click Save to save the changes.

NOTE: Remember to rerun the flow if you change any build settings.

Evaluating the model

To get more details about the generated model:

Select the yellow model icon.
Select View Model from the drop-down menu.

This overview section gives you a list of classifier models and their accuracy. In this example, I set the Number of models to use to 10.

As you navigate through this overview section, you’ll notice that the number of options and views that are associated with each estimator varies. In some cases, a hyperlink is provided to dig down into more details.

For example, take a look at the poorly performing ‘C&R Tree’ model by clicking its name in the table.

On the next page, select the Tree Diagram link to the left to get the tree diagram for the estimator.

You can now hover over either one of the nodes or one of the branches in the tree to get more detailed information about a decision made at a given point.

Go back by clicking the left arrow in the upper-left part of the page. Then, select the MLP Neural Network link to get the details for that estimator. Note that it has different options than the tree model.

Click the Feature Importance tab.

This graphs the relative importance of each predictor in estimating the model.

Click the Confusion Matrix tab.

The table compares what is predicted with what is observed. The numbers of correct predictions are shown in the cells along the main diagonal.

If you would like to get the confusion matrix for the complete data set, you can add a Matrix Output node to the canvas.

Go back to the flow.
Add a Matrix node from the Outputs menu.

Attach the matrix node to the specified model output node.

NOTE: To attach the new node, click the right-side bubble of the existing ‘churn’ model output node and drag the connector to the new matrix node.

Open the Matrix node.
Put the target attribute ‘churn’ in the Rows and the binary prediction ‘$XF-churn’ in the Columns.

For Cell contents, select Cross-tabulations.
Click Appearance and select Counts, Percentage of Row, Percentage of Column, and Include row and column totals.

Click Save.
Run the Matrix node.
Select View Output and Versions in the upper-right corner.
Open the output for the Matrix node (named ‘churn x $XF-churn’) by double-clicking it.

The percentages in the main-diagonal cells give the recall values (as row percentages) and the precision values (as column percentages). They are expressed as percentages, that is, 100 times the proportions that are generally reported for these metrics. The F1 statistics and the weighted versions of precision and recall over both categories must be calculated manually. The results shown are the combined results of applying all three algorithms. If you want to see the results for Random Forest alone, go back to the Auto Classifier node, open it, and clear the check boxes for all models other than Random Forest. Then, rerun the flow.
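The manual calculation can be sketched in Python. The counts below are hypothetical and are not taken from the tutorial's output; replace them with the counts from your own Matrix node output (rows are the actual churn values, columns are the predicted $XF-churn values). The function name `per_class_metrics` is also an assumption made for illustration, and the sketch assumes a non-empty matrix.

```python
def per_class_metrics(matrix):
    """Compute precision, recall, and F1 per class plus support-weighted averages.

    matrix[actual][predicted] holds the cell counts of the confusion matrix.
    """
    labels = list(matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    report = {}
    for label in labels:
        true_pos = matrix[label].get(label, 0)
        row_total = sum(matrix[label].values())                   # all actual `label`
        col_total = sum(matrix[a].get(label, 0) for a in labels)  # all predicted `label`
        recall = true_pos / row_total if row_total else 0.0
        precision = true_pos / col_total if col_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": row_total,
        }
    # Weight each class by its share of the actual observations.
    weighted = {
        metric: sum(report[l][metric] * report[l]["support"] for l in labels) / total
        for metric in ("precision", "recall", "f1")
    }
    return report, weighted


# Hypothetical counts, for illustration only.
counts = {
    "False": {"False": 90, "True": 10},
    "True": {"False": 20, "True": 30},
}
report, weighted = per_class_metrics(counts)
print(report["True"])
print(weighted)
```

Note that the support-weighted recall equals the overall accuracy, which is a useful sanity check against the accuracy reported in the model overview.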

If you want only the confusion matrix, open the Matrix Output node and clear ‘Percentage of Row’ and ‘Percentage of Column’ in the Appearance section. Then save the node, run it again, and open its output as described above.

A more graphical way of showing the confusion matrix can be achieved by using SPSS visualizations. For that purpose, you need to select the Result Table output node, then select the Profile option in the drop-down menu.

Click the Visualizations tab. Then, click more options (the double arrow icon) to view the types of charts available. Select the Treemap chart.

Set the Columns values to churn and $XF-churn, and select Count in the Summary.

Notice that the current pipeline performs a simple split of test and training data using the Partition node. It’s also possible to use cross-validation and stratified cross-validation to achieve slightly better model performance, but at the cost of complicating the pipeline. See the article k-fold Cross-validation in IBM SPSS Modeler for details on how this can be achieved.
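To illustrate the idea behind stratified cross-validation, the following Python sketch assigns row indices to k folds so that each fold keeps roughly the same share of churners. It is a conceptual sketch only and does not show how this is configured in SPSS Modeler; the function names `stratified_folds` and `train_test_splits`, the seed, and the sample labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Distribute row indices over k folds, class by class, in round-robin order."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(folds):
    """Yield (training indices, testing indices), using each fold once for testing."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical labels: 10 churners among 50 customers.
labels = [True] * 10 + [False] * 40
for train, test in train_test_splits(stratified_folds(labels, k=5)):
    print(len(train), len(test), sum(labels[i] for i in test))
```

Each record is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split than a single partition does.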

There are two more ways of viewing the results of the evaluation.

Go back to the flow editor for the Customer Churn Flow.
Select View outputs and versions from the top toolbar.
Double-click the output named Evaluation of [$XF-churn] : Gains to select it.

The generated outputs for the model appear.

Saving and deploying the model

After you create, train, and evaluate a model, you can save and deploy it.

To save the SPSS model:

Go back to the flow editor for the model flow.
Select the Predicted Output node and open its pop-up menu by selecting the three dots in the upper-right corner.
Select Save branch as model from the pop-up menu.

A new window opens.

Type a model name (for example, ‘customer-churn-spss-model’).
Click Save.

The model is saved to the current project.

The model should now appear in the Models section of the Assets tab for the project.

Select your model in the Assets tab of the project and select Promote to Deployment Space.

Select New Space +.

Create a new deployment space, select your Machine Learning service and your Object Storage instance, and click Create.

Once the deployment space is created, select it and click Promote.

Click the hamburger menu and select Deployment spaces. Your new deployment space should be listed there; open it.

Select the model and click Create deployment.

When prompted, select Online, give your deployment a name, and click Create.

Wait for the status to change to Deployed.

Testing the model

Now, the model is deployed and can be used for prediction. However, before using it in a production environment, it might be worthwhile to test it using real data. You can do this interactively or programmatically using the API for the IBM Watson Machine Learning service. For now, we test it interactively.
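For readers curious about the programmatic route, here is a rough Python sketch of how a scoring request could be assembled with the standard library. It rests on several assumptions: `scoring_url` stands for the endpoint that you copy from your own deployment's details page, `bearer_token` stands for an IBM Cloud access token that you obtain separately, and the function names `build_scoring_request` and `send` are hypothetical. Check the API reference for your deployment before relying on any of this.

```python
import json
import urllib.request


def build_scoring_request(scoring_url, bearer_token, payload):
    """Prepare (but do not send) an HTTP POST that carries the JSON payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {bearer_token}",
        },
        method="POST",
    )


def send(request):
    """Send the prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values; nothing is sent until send() is called.
request = build_scoring_request(
    "https://example.invalid/your-scoring-endpoint",  # placeholder, not a real endpoint
    "YOUR_ACCESS_TOKEN",                              # placeholder token
    {"input_data": []},
)
print(request.get_method(), request.full_url)
```

Separating request construction from sending makes it easy to inspect the payload first, which is the same check that the interactive Test tab performs visually.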

The UI provides two options for testing the prediction: by entering the values one by one in distinct fields (one for each feature) or by specifying the feature values using a JSON object. We use the second option because it is the most convenient one when tests are performed more than once (which is usually the case), and when a large set of feature values is needed.

To get a predefined test data set:

Download the test data from GitHub in the file customer-churn-test-data.txt.
Open the file and copy the value.

Notice that the JSON object defines the names of the fields first, followed by a sequence of observations to be predicted, each in the form of a sequence:

{"input_data":[{"fields": ["state", "account length", "area code", "phone number", "international plan", "voice mail plan", "number vmail messages", "total day minutes", "total day calls", "total day charge", "total eve minutes", "total eve calls", "total eve charge", "total night minutes", "total night calls", "total night charge", "total intl minutes", "total intl calls", "total intl charge", "customer service calls"], "values": [["NY",161,415,"351-7269","no","no",0,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4]]}]}

Note that some of the features, such as state and phone number, are expected to be in the form of strings (which should be no surprise), whereas the true numerical features can be provided as integers or floats as appropriate for the given feature.
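If you want to generate such a JSON object yourself rather than edit it by hand, a short Python sketch like the following can help. The field names and the sample row are copied from the test data above; the function name `build_payload` and the length check are assumptions added for illustration.

```python
import json

# Field names in the order used by the test data above.
FIELDS = [
    "state", "account length", "area code", "phone number",
    "international plan", "voice mail plan", "number vmail messages",
    "total day minutes", "total day calls", "total day charge",
    "total eve minutes", "total eve calls", "total eve charge",
    "total night minutes", "total night calls", "total night charge",
    "total intl minutes", "total intl calls", "total intl charge",
    "customer service calls",
]

# The single observation from the test data above.
SAMPLE_ROW = [
    "NY", 161, 415, "351-7269", "no", "no", 0, 332.9, 67, 56.59,
    317.8, 97, 27.01, 160.6, 128, 7.23, 5.4, 9, 1.46, 4,
]


def build_payload(rows):
    """Wrap observations in the input_data structure, checking each row's length."""
    for row in rows:
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} values, got {len(row)}")
    return {"input_data": [{"fields": FIELDS, "values": [list(row) for row in rows]}]}


print(json.dumps(build_payload([SAMPLE_ROW])))
```

The length check catches the most common mistake when assembling observations by hand: a value that is missing or duplicated, which would shift every later feature by one position.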

To test the model at run time:

Select the deployment that you just created by clicking the deployment name (for example, ‘customer-churn-spss-model-web-service’).
This opens a new page that shows an overview of the properties of the deployment (for example, name, creation date, or status).
Select the Test tab.
Select the file icon, which then lets you enter the values using JSON.
Paste the JSON object from the downloaded customer-churn-test-data.txt file into the Enter input data field.
Click Predict to view the results.

The prediction result is given in terms of the probability that the customer will churn (True) or not (False). You can try it with other values, for example, by substituting the values with values taken from the customer-churn-kaggle.csv file. Another test is to change the phone number to something like “XYZ” and then run the prediction again. The prediction result should be the same, which indicates that the feature is not a factor in the prediction.
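The phone number experiment can be prepared with a small helper that copies the JSON object and overwrites a single feature. This is a minimal sketch: the function name `with_field_value` is hypothetical, and the shortened payload below keeps only two of the twenty fields for brevity, so use the full test object when you try it against the deployment.

```python
import copy
import json


def with_field_value(payload, field, new_value):
    """Return a copy of the payload with `field` set to `new_value` in every row."""
    changed = copy.deepcopy(payload)  # leave the original payload untouched
    for block in changed["input_data"]:
        column = block["fields"].index(field)
        for row in block["values"]:
            row[column] = new_value
    return changed


# Shortened, illustrative payload with two of the original fields.
original = {
    "input_data": [
        {"fields": ["state", "phone number"], "values": [["NY", "351-7269"]]}
    ]
}
variant = with_field_value(original, "phone number", "XYZ")
print(json.dumps(variant))
```

Submitting both the original and the variant and comparing the two predictions is a simple way to confirm that a feature has no influence on the model.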

If you are interested in seeing other examples of using SPSS Modeler to predict customer churn, see the tutorial Predict Customer Churn by Building and Deploying Models Using Watson Studio Flows.

Conclusion

This tutorial covered the basics of using the SPSS Modeler flow feature in Watson Studio, which included:

Creating a project
Provisioning and assigning services to the project
Adding assets to the project, such as data sets
Creating a Modeler flow
Using the Modeler flow editor to run and examine the model
Training and evaluating the model
Deploying the model as a web service
Scoring the machine learning model with test data

Using the SPSS Modeler flow feature of Watson Studio provides a non-programming approach to creating a model to predict customer churn.

Workshop Resources

Sign up for IBM Cloud
The slides used can be found here
Replay can be found here
