Predict book average rating with some simple models.
-
map.py: clean the raw data.
-
reduce.py: transform the publication_date feature.
Train_data.csv --> Train_data_1.csv
-
process_isbn13_to_nation.py: extract nation info from isbn13
Train_data_1.csv --> Train_data_2.csv
Test_data.csv --> Test_data_2.csv
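As a rough sketch of what process_isbn13_to_nation.py does, the digit after the 978/979 prefix of an ISBN-13 is the registration group, which identifies a language/region. The mapping and function below are illustrative assumptions, not the script's exact code; the group codes follow the public ISBN ranges (0/1 English, 2 French, 3 German, 4 Japan).

```python
# Hypothetical sketch of mapping an ISBN-13 registration group to a
# coarse nation/region label, as process_isbn13_to_nation.py might do.
GROUP_TO_NATION = {
    "0": "english",
    "1": "english",
    "2": "french",
    "3": "german",
    "4": "japan",
}

def isbn13_to_nation(isbn13: str) -> str:
    """Return a coarse nation/region label for an ISBN-13 string."""
    isbn13 = isbn13.strip()
    if len(isbn13) != 13 or not isbn13.startswith(("978", "979")):
        return "unknown"
    # The digit(s) after the 978/979 prefix form the registration group.
    return GROUP_TO_NATION.get(isbn13[3], "other")
```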
-
preprocessing.ipynb: almost the same code as preprocessing.py. Transforms the language, nation, author, publisher and date features with Bayesian Target Encoding.
Train_data_2.csv --> train_processed_language_author_publisher_nation_date.csv
Test_data_2.csv --> test_processed_language_author_publisher_nation_date.csv
-
date_score_relation.py: plots the rating-versus-date graphs.
-
plot.py: plots the other figures.
- Run
python map.py | python reduce.py
- Run
python process_isbn13_to_nation.py
- Run
python preprocessing.py
Then, you can get the processed files under the processed_file directory.
Use MapReduce to clean the noise and transform the date feature.
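The map/reduce step can be sketched roughly as below. This is an assumed reconstruction, not the actual map.py/reduce.py code: the map side drops malformed rows, and the reduce side converts a M/D/YYYY publication_date into an ordinal day number so later stages can treat it numerically.

```python
# Sketch of the cleaning pipeline (assumed behaviour of map.py and
# reduce.py; field layout and date format are illustrative).
from datetime import date

def map_line(line: str):
    """map.py side: keep the row if it splits into fields, else drop it."""
    fields = line.rstrip("\n").split(",")
    return fields if len(fields) > 1 else None

def reduce_fields(fields, date_idx: int):
    """reduce.py side: replace a M/D/YYYY date field with its ordinal."""
    try:
        m, d, y = (int(x) for x in fields[date_idx].split("/"))
        fields[date_idx] = str(date(y, m, d).toordinal())
    except ValueError:
        return None  # drop rows whose date cannot be parsed
    return fields
```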
-
author
The authors feature is string-typed. In our experience, the author is an important factor that can affect the rating score significantly. A single book may have several different authors, so when we select authors as one of the features, the impact of the number of authors should be considered. According to our preprocessing, there are 8442 different authors in the training data.
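To account for the number of authors, the string field has to be split into individual names first. The sketch below assumes the Goodreads convention of joining multiple authors with "/" (e.g. "J.K. Rowling/Mary GrandPre"); the helper names are illustrative.

```python
# Sketch: split the authors string and count distinct authors per book
# (assumption: names are joined with "/", as in the Goodreads dump).
def split_authors(authors: str):
    return [a.strip() for a in authors.split("/") if a.strip()]

def author_count(authors: str) -> int:
    return len(split_authors(authors))
```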
-
isbn13
The feature isbn13 contains some implicit information, such as nation, language and publisher. Nation and language are related to the existing language_code feature, and publisher already exists in the raw data. So, isbn13 itself is abandoned once the nation information has been extracted.
-
language
language_code is related to the final average rating. It is also a string-typed feature, so we need to transform it into a numeric value before using it. There are 26 different languages in the training data, and their average ratings are shown below. There are clear gaps between the average ratings of different languages, so language can be a useful feature for predicting the test data.
-
num_pages
From experience, the number of pages of a book can affect its rating score to some extent, so it is useful in our prediction models. Before using this feature, we should normalize it and the other numeric features to the same magnitude.
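The normalization step mentioned above can be as simple as min-max scaling, sketched below (the exact scaler used in preprocessing.py is not specified, so this is one plausible choice).

```python
# Sketch of min-max normalization to bring numeric features such as
# num_pages, ratings_count and text_review_count to the same [0, 1] scale.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]
```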
-
ratings_count
The ratings_count feature can also affect the average rating of a book. As we know, if the rating count is too small, the book's rating score is more likely to be extreme, e.g. near 0 or near 5. Thus, we regard ratings_count as a useful feature for predicting the average book rating.
-
text_review_count
Text review activity is closely related to the rating level, so we keep this feature. The value of text_review_count may be very large compared with other numeric features, so the normalization step is necessary.
-
publication_date
publication_date is a date field. To check whether it is useful for our models, we plot the rating score over the full date range, as shown below. We cannot detect any useful signal, because the average rating stays almost constant from beginning to end. Some implicit periodic pattern might still exist, so we also plot the monthly average rating from 2000 to 2004, as shown below. Unfortunately, we still cannot observe any periodic pattern in figure 3, so we decided to abandon the publication_date feature.
-
publisher
The publisher feature is similar to authors: several publishers may be in charge of one book, so we need to consider the effect of the number of publishers. According to our preprocessing, there are 2130 different publishers in the training data.
-
isbn
Abandoned, since isbn13 already covers it.
-
bookID
Abandon.
-
title
Abandon.
The features we use are authors, language_code, num_pages, ratings_count, text_review_count and publisher. The normalization step is necessary, and the encoding method we use for string-typed data is Bayesian Target Encoding.
- Linear Regression
- Decision Tree
- MLP
- AdaBoost
- Bagging with MLP
- SVR
- Random Forest
The scores are based on 10-fold cross-validation.
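The 10-fold split works by partitioning the sample indices into k nearly equal folds, each of which serves once as the validation set. A dependency-free sketch of the index split (the actual code likely uses a library helper such as scikit-learn's KFold):

```python
# Sketch of k-fold index partitioning for cross-validation: n samples
# are split into k contiguous folds, the first n % k folds one larger.
def k_fold_indices(n, k=10):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```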
Feature importance is based on the Random Forest model, which can estimate the importance of each feature.
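Reading importances out of a Random Forest can be sketched as below. This assumes the models are scikit-learn estimators (plausible given the model list); the feature names match the section above, but the data here is synthetic, purely for illustration.

```python
# Sketch: fit a Random Forest on synthetic data and rank the six
# features used in this project by their estimated importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["authors", "language_code", "num_pages",
                 "ratings_count", "text_review_count", "publisher"]
rng = np.random.default_rng(0)
X = rng.random((200, len(feature_names)))
# Synthetic target: mostly driven by the ratings_count column (index 3).
y = 3.9 + 0.5 * X[:, 3] + 0.1 * rng.standard_normal(200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importance = sorted(zip(feature_names, rf.feature_importances_),
                    key=lambda p: p[1], reverse=True)
```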