Predict book average rating with some simple models.
-
map.py: clean the raw data.
-
reduce.py: transform the publication_date feature.
Train_data.csv --> Train_data_1.csv
-
process_isbn13_to_nation.py: extract nation info from isbn13
Train_data_1.csv --> Train_data_2.csv
Test_data.csv --> Test_data_2.csv
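As a rough sketch of what process_isbn13_to_nation.py does, the digit after the 978/979 prefix of an ISBN-13 is the registration group, which identifies a language/region. The mapping and function below are illustrative assumptions, not the script's exact code; the group codes follow the public ISBN ranges (0/1 English, 2 French, 3 German, 4 Japan).

```python
# Hypothetical sketch of mapping an ISBN-13 registration group to a
# coarse nation/region label, as process_isbn13_to_nation.py might do.
GROUP_TO_NATION = {
    "0": "english",
    "1": "english",
    "2": "french",
    "3": "german",
    "4": "japan",
}

def isbn13_to_nation(isbn13: str) -> str:
    """Return a coarse nation/region label for an ISBN-13 string."""
    isbn13 = isbn13.strip()
    if len(isbn13) != 13 or not isbn13.startswith(("978", "979")):
        return "unknown"
    # The digit(s) after the 978/979 prefix form the registration group.
    return GROUP_TO_NATION.get(isbn13[3], "other")
```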
-
preprocessing.ipynb: almost the same code as preprocessing.py. Transforms the language, nation, author, publisher and date features with Bayesian Target Encoding.
Train_data_2.csv --> train_processed_language_author_publisher_nation_date.csv
Test_data_2.csv --> test_processed_language_author_publisher_nation_date.csv
-
date_score_relation.py: plots the rating-versus-date graphs.
-
plot.py: plots the other figures.
- Run
python map.py | python reduce.py
- Run
python process_isbn13_to_nation.py
- Run
python preprocessing.py
Then, you can get the processed files under the processed_file directory.
Use MapReduce to clean the noise and transform the date feature.
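The map/reduce step can be sketched roughly as below. This is an assumed reconstruction, not the actual map.py/reduce.py code: the map side drops malformed rows, and the reduce side converts a M/D/YYYY publication_date into an ordinal day number so later stages can treat it numerically.

```python
# Sketch of the cleaning pipeline (assumed behaviour of map.py and
# reduce.py; field layout and date format are illustrative).
from datetime import date

def map_line(line: str):
    """map.py side: keep the row if it splits into fields, else drop it."""
    fields = line.rstrip("\n").split(",")
    return fields if len(fields) > 1 else None

def reduce_fields(fields, date_idx: int):
    """reduce.py side: replace a M/D/YYYY date field with its ordinal."""
    try:
        m, d, y = (int(x) for x in fields[date_idx].split("/"))
        fields[date_idx] = str(date(y, m, d).toordinal())
    except ValueError:
        return None  # drop rows whose date cannot be parsed
    return fields
```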
-
author
The authors feature is string-typed. In our experience, the author is an important factor that can affect the rating score significantly. A single book may have several different authors, so when we select authors as one of the features, the impact of the number of authors should be considered. According to our preprocessing, there are 8442 different authors in the training data.
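To account for the number of authors, the string field has to be split into individual names first. The sketch below assumes the Goodreads convention of joining multiple authors with "/" (e.g. "J.K. Rowling/Mary GrandPre"); the helper names are illustrative.

```python
# Sketch: split the authors string and count distinct authors per book
# (assumption: names are joined with "/", as in the Goodreads dump).
def split_authors(authors: str):
    return [a.strip() for a in authors.split("/") if a.strip()]

def author_count(authors: str) -> int:
    return len(split_authors(authors))
```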
-
isbn13
The feature isbn13 contains some implicit information, such as nation, language and publisher. Nation and language are related to the existing language_code feature, and publisher already exists in the raw data. So, isbn13 itself is abandoned once the nation information has been extracted.
-
language
language_code is related to the final average rating. It is also a string-typed feature, so we need to transform it into a numeric value before using it. There are 26 different languages in the training data, and their average ratings are shown below. There are clear gaps between the average ratings of different languages, so language can be a useful feature for predicting the test data.
-
num_pages
From experience, the number of pages of a book can affect its rating score to some extent, so it is useful in our prediction models. Before using this feature, we should normalize it and the other numeric features to the same magnitude.
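The normalization step mentioned above can be as simple as min-max scaling, sketched below (the exact scaler used in preprocessing.py is not specified, so this is one plausible choice).

```python
# Sketch of min-max normalization to bring numeric features such as
# num_pages, ratings_count and text_review_count to the same [0, 1] scale.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]
```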
-
ratings_count
The ratings_count feature can also affect the average rating of a book. As we know, if the rating count is too small, the book's rating score is more likely to be extreme, e.g. near 0 or near 5. Thus, we regard ratings_count as a useful feature for predicting the average book rating.
-
text_review_count
Text review activity is closely related to the rating level, so we keep this feature. The value of text_review_count may be very large compared with other numeric features, so the normalization step is necessary.
-
publication_date
publication_date is a date field. To check whether it is useful for our models, we plot the rating score over the full date range, as shown below. We cannot detect any useful signal, because the average rating stays almost constant from beginning to end. Some implicit periodic pattern might still exist, so we also plot the monthly average rating from 2000 to 2004, as shown below. Unfortunately, we still cannot observe any periodic pattern in figure 3, so we decided to abandon the publication_date feature.
-
publisher
The publisher feature is similar to authors: several publishers may be in charge of one book, so we need to consider the effect of the number of publishers. According to our preprocessing, there are 2130 different publishers in the training data.
-
isbn
Abandoned, since isbn13 already covers it.
-
bookID
Abandon.
-
title
Abandon.
The features we use are authors, language_code, num_pages, ratings_count, text_review_count and publisher. The normalization step is necessary, and the encoding method we use for string-typed data is Bayesian Target Encoding.
- Linear Regression
- Decision Tree
- MLP
- AdaBoost
- Bagging with MLP
- SVR
- Random Forest
The scores are based on 10-fold cross-validation.
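The 10-fold split works by partitioning the sample indices into k nearly equal folds, each of which serves once as the validation set. A dependency-free sketch of the index split (the actual code likely uses a library helper such as scikit-learn's KFold):

```python
# Sketch of k-fold index partitioning for cross-validation: n samples
# are split into k contiguous folds, the first n % k folds one larger.
def k_fold_indices(n, k=10):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```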
Feature importance is based on the Random Forest model, which can estimate the importance of each feature.
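Reading importances out of a Random Forest can be sketched as below. This assumes the models are scikit-learn estimators (plausible given the model list); the feature names match the section above, but the data here is synthetic, purely for illustration.

```python
# Sketch: fit a Random Forest on synthetic data and rank the six
# features used in this project by their estimated importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["authors", "language_code", "num_pages",
                 "ratings_count", "text_review_count", "publisher"]
rng = np.random.default_rng(0)
X = rng.random((200, len(feature_names)))
# Synthetic target: mostly driven by the ratings_count column (index 3).
y = 3.9 + 0.5 * X[:, 3] + 0.1 * rng.standard_normal(200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importance = sorted(zip(feature_names, rf.feature_importances_),
                    key=lambda p: p[1], reverse=True)
```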