
click-through-machine-learning-final-project

(30 million rows × 24 columns), RStudio on AWS

Data Understanding - Data Description

This machine learning project is best framed as a traditional binary classification problem. The original training data has over 30 million rows and 24 columns: 1 id column, 22 categorical predictor variables, and 1 dependent variable indicating whether or not a customer clicked on the ad. The goal of the project is to predict whether a customer will click on an ad based on the categorical variables.

Modeling

After transforming our data, 21 predictor variables remained in the training data, so it was not necessary to implement PCA or other dedicated feature-selection methods. Our tree model, however, reports feature importance, which gives us a sense of the relative importance of the features. Specifically, we can group some categories together based on how strongly their individual coefficients relate to click in the logistic regression analysis. To keep computation time reasonable when building our models, we tuned parameters on a relatively small data set (1 million observations) drawn at random from the full training data. We then fit the models with those parameters on a subset (1 million records), as described in Data Preparation, and predicted on the validation data (about 12 million records) to check performance. After comparing the models on the same validation data, we combined the best models with all of the training data and ensembled them to produce the final predictions on the test data. We used 3 models in our analysis: logistic regression, random forest, and neural network.
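A rough sketch of this setup in Python follows; the file name, chunk handling, and sampling step are assumptions, since the README does not show the actual data-loading code:

```python
import pandas as pd

# Hypothetical file name; the README does not give the actual path.
# Reading the ~30M-row training file in 1M-row chunks keeps memory bounded.
reader = pd.read_csv("train.csv", chunksize=1_000_000)

# First 1M-row chunk, used here as the tuning set for brevity
# (the project drew its tuning sample at random from the full data).
tune_sample = next(reader)
```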

Model 1: Logistic Regression

First, we used a label encoder to map the category levels to integers. Because the integers are assigned arbitrarily, this imposes an artificial order on the categories, which logistic regression then misinterprets. The log loss for this model was 0.441.

Secondly, we used a one-hot encoder to turn every category level into a dummy variable. The output of the one-hot encoder is a sparse matrix, which keeps the memory and computation requirements modest. The log loss for this model was 0.399. However, the one-hot encoder keeps every level without dropping one per variable: the large number of dimensions risks the curse of dimensionality, and retaining all levels introduces multicollinearity.

Lastly, we settled on the most common method, creating dummy variables with one level dropped per variable. As mentioned in Data Preparation, we combined several levels within selected categories so that a laptop could handle the computation in a timely manner. With the default parameters (C = 1, l2 penalty), the logistic regression gave a log loss of 0.41050; with C = 1 and the l1 penalty, the log loss was 0.41053, so we kept the l2 penalty. To tune the model, we wrote a for loop that iterates over different values of C with the l2 penalty. The table below shows the log loss and accuracy for each value of C. The differences are small: log loss centers around 0.41 and accuracy around 0.83 for every model. We therefore picked the C with the smallest log loss, which is C = 1, the default setting.
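A minimal sketch of this tuning loop, assuming scikit-learn and pandas; `train_raw`, `val_raw`, `y_train`, and `y_val` are hypothetical objects holding the prepared predictors and labels, and the C grid is illustrative:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score

# Dummy-encode the level-reduced categoricals, dropping one level per
# variable to avoid the multicollinearity noted above.
X_train = pd.get_dummies(train_raw, drop_first=True)
X_val = (pd.get_dummies(val_raw, drop_first=True)
           .reindex(columns=X_train.columns, fill_value=0))

# Iterate over candidate C values with the l2 penalty.
for C in [0.01, 0.1, 1, 10, 100]:  # illustrative grid
    model = LogisticRegression(C=C, penalty="l2", max_iter=1000)
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]
    print(C,
          log_loss(y_val, proba),
          accuracy_score(y_val, (proba > 0.5).astype(int)))
```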

Model 2: Random Forest

We also tried tree models. We quickly found that a random forest, being an ensemble of trees, produced a lower log loss than a single tree. To try to decrease the loss further, we experimented with the parameters. The table below, computed on validation data of 400,000 records, shows how the log loss gradually improved.

We intended to use GridSearch at first, but it was too computationally intensive and took too long to run. We therefore used a for loop instead, tuning the most important parameters first; the resulting list of log-loss values is easy to compare. Varying n_estimators, the number of trees in the forest, did not affect our results much, so fixing it at 25 saved time while still giving good results. The next step was to tune max_depth and min_samples_split. RF 4 returned a log loss of 0.401.

It is also important to note that we considered other performance measures, using 0.5 as the classification threshold. As log loss is minimized through the process, accuracy increases to 0.83 (roughly the share that non-clicks make up) and precision increases, while recall, f1-score and AUC drop. The feature importances printed for RF 4 show that site_id and site_domain are the most important attributes for separating clicks from non-clicks. However, the importance measure can be biased towards features with many categories, and site_id and site_domain each have more than 2,000 levels. We therefore treated feature importance as a reference rather than a strict standard for filtering variables.

After applying RF 4 to the training and validation data we prepared, it performed best on the validation set among all our models. We then fit this model on each 1-million-record chunk and ensembled the predictions to construct the final predictions for the test data. In addition, variables with a large number of levels can be problematic: for variables containing over 100 categories, we tried keeping only the most prevalent levels, those covering over 80% of the data, and recoding all other levels as ‘1111111’. This did not lower the log loss in the tree models, however, so we kept all levels for every variable.
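A sketch of the for-loop tuning described above, assuming scikit-learn; the candidate values for max_depth and min_samples_split are illustrative, not the ones that produced RF 4:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

# n_estimators fixed at 25 as described above; loop over the two
# parameters that mattered most.
for max_depth in [10, 20, 30]:
    for min_samples_split in [2, 10, 50]:
        rf = RandomForestClassifier(n_estimators=25,
                                    max_depth=max_depth,
                                    min_samples_split=min_samples_split,
                                    n_jobs=-1, random_state=42)
        rf.fit(X_train, y_train)
        proba = rf.predict_proba(X_val)[:, 1]
        print(max_depth, min_samples_split, log_loss(y_val, proba))

# Feature importances from the last fitted forest,
# read as a rough guide rather than a filter.
print(sorted(zip(X_train.columns, rf.feature_importances_),
             key=lambda t: t[1], reverse=True)[:5])
```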

Model 3: Neural Networks

We also attempted to model the data using a neural network, although we did not end up using it. The table below breaks down the models and their results.

We created 5 neural network models by individually varying the scoring technique, number of hidden layers, nodes per layer, and activation functions. Specifically, we used binary_crossentropy as the loss, which is just the negative of the Bernoulli log-likelihood. We also tried the Adam optimizer, which replaces the classical stochastic gradient descent procedure and optimizes the network weights by updating them iteratively over the training data. Last but not least, we used one-hot encoding to transform the training data; since this yields a matrix of dummy variables (0 or 1), the inputs were already on a common scale and no further standardization was needed.

As can be seen from the table above, NN3 is left blank because its results are similar to NN2. Surprisingly, adding just one hidden layer or reducing the nodes in each layer changed the measures, including precision, recall, f1-score and AUC, considerably. NN4 was the best of the 5 models on a sample of 1 million, so we tested its performance again on the validation data to compare it with the logistic and random forest models.

To avoid an extremely large matrix and the problem of too many categories, we reduced the number of categories before running the NN models. For any variable in the validation data with more than 100 levels, we kept the top 80 categories, because our category analysis showed that the top 80 categories cover over 75% of the data; all other categories were recoded as ‘1111111’. We then assigned ‘1111111’ to any level not found in the training data and applied one-hot encoding to both sets. The log loss we obtained for this model was 0.41, which is higher than the random forest model's.
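A minimal sketch of one such network, assuming Keras; the layer sizes, batch size, and epoch count are illustrative, since the README's model table is not reproduced here:

```python
from tensorflow import keras

# Small feed-forward network on the one-hot-encoded matrix.
model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # click probability
])

# binary_crossentropy loss with the Adam optimizer, as in the text.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.fit(X_train, y_train, epochs=5, batch_size=4096,
          validation_data=(X_val, y_val))
```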

Final Model Selection & Predictions

Ensembling was critical to the final predictions. We tried aggregating the predicted values across different models and across different training data. The lowest log loss turned out to come from the mean of the predicted values obtained by applying the same random forest model to different chunks: with 5 chunks, the log loss came down to about 0.3972 on the validation data. We therefore encoded each training chunk together with the test data as a pair, to keep the ordinal encoding levels consistent, and ran the model on each chunk to predict the test set. The mean of the 31 sets of predicted values was submitted as the final predictions.
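A sketch of this chunk-wise ensembling, assuming scikit-learn; `chunks` and `X_test` are hypothetical prepared inputs, and the per-chunk pairing of encodings is glossed over:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit the chosen RF configuration on each 1M-row training chunk and
# average the predicted click probabilities across chunks.
preds = []
for X_chunk, y_chunk in chunks:
    rf = RandomForestClassifier(n_estimators=25, n_jobs=-1, random_state=42)
    rf.fit(X_chunk, y_chunk)
    preds.append(rf.predict_proba(X_test)[:, 1])

final_pred = np.mean(preds, axis=0)  # mean over the per-chunk predictions
```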

Evaluation

Below is the confusion matrix from the ensembled model, which uses 5 chunks of training data to predict the 12M-record validation data. In the confusion matrix, 0 represents no click and 1 represents click. The log loss is 0.3972. Lower log loss is better, and the model performs well on this measurement, but it is not enough to look at a single performance criterion.

The accuracy is 83%, meaning the model predicts correctly 83% of the time. However, since 83% of the records are 0's, accuracy is not a good criterion when the classes are imbalanced, as they are in this project. The AUC is 0.53, only slightly better than a random classifier's 0.5; and since the AUC is around 0.53 for all the models we created, we do not expect large differences among their ROC curves. From the confusion matrix, precision for class 1 is 0.60 and recall for class 1 is 0.08: precision is relatively good, but recall is very low, meaning the model rarely predicts 1 when the true label is 1. The model is in some sense conservative about predicting 1, so it predicts few 1's.

An extension of this project should look at the cost-benefit matrix. In the context of ad clicks, we care more about recall than precision, since we would like to understand under which conditions a user will click an ad rather than merely predict accurately. Low log loss therefore does not guarantee the best model, and we should keep the main goal in mind when selecting models.
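The reported measures can be reproduced from the predictions with scikit-learn; a sketch, assuming `y_val` and the ensembled probabilities `final_pred` from above:

```python
from sklearn.metrics import (confusion_matrix, log_loss,
                             precision_score, recall_score, roc_auc_score)

# Hard labels at the 0.5 threshold used in the text.
y_hat = (final_pred > 0.5).astype(int)

print(confusion_matrix(y_val, y_hat))
print("log loss :", log_loss(y_val, final_pred))
print("precision:", precision_score(y_val, y_hat))
print("recall   :", recall_score(y_val, y_hat))
print("AUC      :", roc_auc_score(y_val, final_pred))
```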
