
BookRating

Predict the average rating of books with some simple models.

Files

  • map.py: cleans the data.

  • reduce.py: transforms the publication_date feature.

    Train_data.csv --> Train_data_1.csv

  • process_isbn13_to_nation.py: extracts nation info from isbn13.

    Train_data_1.csv --> Train_data_2.csv

    Test_data.csv --> Test_data_2.csv

  • preprocessing.ipynb: notebook version of preprocessing.py (the code is almost identical). Transforms the language, nation, author, publisher, and date features with Bayesian Target Encoding.

    Train_data_2.csv --> train_processed_language_author_publisher_nation_date.csv

    Test_data_2.csv --> test_processed_language_author_publisher_nation_date.csv

  • date_score_relation.py: plots the rating-versus-date graphs.

  • plot.py: plots the other graphs.

Preprocessing Steps

  1. Run python map.py | python reduce.py
  2. Run python process_isbn13_to_nation.py
  3. Run python preprocessing.py

Then, you can find the processed files in the processed_file directory.
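Step 2 derives a nation feature from the ISBN-13 prefix. A minimal sketch of the idea, assuming the registration-group digit that follows the 978/979 prefix identifies a country/language area; the mapping table below is a small illustrative subset, not the one the actual script uses:

```python
# Illustrative subset of the ISBN registration-group table. The digit after
# the 978/979 prefix identifies a country/language area; the real table is
# far larger, and groups may span several digits.
GROUP_TO_NATION = {
    "0": "English-speaking", "1": "English-speaking",
    "2": "French-speaking", "3": "German-speaking",
    "4": "Japan", "7": "China",
}

def isbn13_to_nation(isbn13):
    """Map an ISBN-13 string to a coarse nation/region label."""
    digits = isbn13.replace("-", "")
    if len(digits) != 13 or not digits.startswith(("978", "979")):
        return "unknown"
    return GROUP_TO_NATION.get(digits[3], "other")

print(isbn13_to_nation("9780439785969"))  # an English-speaking group
```

A production version would need the complete ISBN registration-group ranges, since real group elements are variable-length.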

Preprocessing Details

Data cleaning

We use a MapReduce-style job (map.py and reduce.py) to remove the noise and transform the date feature.
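A minimal sketch of what such a map step might look like, assuming publication_date arrives as an 'M/D/YYYY' string; the row shape and date format are assumptions for illustration, not the actual map.py code:

```python
from datetime import datetime

def map_clean(row):
    """Map step: parse publication_date, return None for noisy rows.

    The 'M/D/YYYY' string format and the dict row shape are assumptions;
    the real map.py may differ.
    """
    try:
        row["publication_date"] = datetime.strptime(
            row["publication_date"], "%m/%d/%Y")
        return row
    except (KeyError, ValueError):
        return None  # drop rows with a missing or impossible date

rows = [
    {"publication_date": "9/16/2006"},   # valid
    {"publication_date": "11/31/2000"},  # November has 30 days -> noise
]
cleaned = [r for r in map(map_clean, rows) if r is not None]
```

The reduce step would then aggregate or re-emit the surviving rows into Train_data_1.csv.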

Feature selection

  1. author

    The authors feature is string-typed. In our experience, the author can significantly affect a book's rating. A single book may have several different authors, so when we select authors as a feature, the impact of the author count must be considered. According to our preprocessing, there are 8442 different authors in the training data.

  2. isbn13

    The isbn13 feature carries implicit information such as nation, language, and publisher. Nation and language overlap with the existing language_code feature, and publisher already exists in the raw data, so after process_isbn13_to_nation.py extracts the nation info, the raw isbn13 column is dropped.

  3. language

    language_code is related to the final average rating. It is also a string-typed feature, so we need to transform it into numeric values before using it. There are 26 different languages in the training data, and their average ratings are shown below (figure: average rating per language). There are clear gaps between the average ratings of different languages, so language can be a useful feature for predicting the test data.

  4. num_pages

    In our experience, a book's page count can affect its rating to some extent, so it is useful in our prediction models. Before using this feature, we should normalize it, and the other numeric features, to the same magnitude.

  5. ratings_count

    The ratings_count feature can also affect a book's average rating. If the rating count is too small, the book's score is more likely to be extreme, i.e. near 0 or near 5. Thus, we regard ratings_count as a useful feature for predicting the average rating.

  6. text_review_count

    Text review counts track the rating level, so we keep this feature. Its values can be much larger than those of other numeric features, so the normalization step is necessary.

  7. publication_date

    publication_date is a date-valued feature. To confirm whether it is useful for our models, we plot the average rating over the whole date range (figure: overall average rating by date). We cannot detect any useful information there: the average rating is almost flat from beginning to end. There might still be implicit periodic structure, so we also plot the monthly average rating from 2000 to 2004 (figure: 2000-2004 monthly average rating). Unfortunately, we still cannot observe any periodic pattern, so we decided to abandon the publication_date feature.

  8. publisher

    The publisher feature is similar to authors: several publishers may be in charge of one book, so we need to consider the effect of the publisher count. According to our preprocessing, there are 2130 different publishers in the training data.

  9. isbn

    Abandoned, since isbn13 already covers it.

  10. bookID

    Abandoned.

  11. title

    Abandoned.
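Two of the checks above lend themselves to a short sketch: the per-language average ratings (point 3) and the normalization of numeric features to one magnitude (point 4). The records below are toy data, not the real training set:

```python
from collections import defaultdict

# --- Per-language mean rating (toy (language_code, rating) pairs) ---
records = [("eng", 4.0), ("eng", 3.5), ("spa", 4.5), ("fre", 3.0)]
sums, counts = defaultdict(float), defaultdict(int)
for lang, rating in records:
    sums[lang] += rating
    counts[lang] += 1
lang_mean = {lang: sums[lang] / counts[lang] for lang in sums}
# Gaps between per-language means suggest language is predictive.

# --- Min-max normalization, bringing numeric features to one scale ---
def min_max(values):
    """Scale a numeric column into [0, 1]; constant columns map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

pages = [100, 300, 500]   # num_pages lives on a larger scale than ratings
scaled = min_max(pages)   # now comparable in magnitude to other features
```

Standardization (zero mean, unit variance) would be an equally reasonable choice for the scaling step.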

Brief summary

The features we use are authors, language_code, num_pages, ratings_count, text_review_count, and publisher. The normalization step is necessary, and the string-typed features are encoded with Bayesian Target Encoding.
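A minimal sketch of target encoding in this Bayesian spirit: each category's observed mean rating is blended with the global mean, so that rare categories are pulled toward the prior. The smoothing strength `m` and the toy data are assumptions, not the project's actual settings:

```python
from collections import defaultdict

def bayesian_target_encode(categories, targets, m=10.0):
    """Smoothed (Bayesian) target encoding for a string-typed feature.

    Each category maps to a blend of its own mean target and the global
    mean; `m` (a hypothetical smoothing strength) controls how strongly
    rare categories shrink toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in sums}

# Toy example: category "a" has mean 4.5, "b" has mean 3.0, global mean 4.0
enc = bayesian_target_encode(["a", "a", "b"], [4.0, 5.0, 3.0], m=1.0)
```

With larger `m`, both encodings move closer to the global mean of 4.0; with `m=0` the encoding degenerates to the plain per-category mean.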

Models

  1. Linear Regression
  2. Decision Tree
  3. MLP
  4. AdaBoost
  5. Bagging with MLP
  6. SVR
  7. Random Forest
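The seven models above map naturally onto scikit-learn estimators. A sketch assuming scikit-learn is the library used; all hyperparameters here are library defaults, not the tuned values behind the reported results:

```python
# Hypothetical model zoo matching the list above (scikit-learn assumed).
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              RandomForestRegressor)
from sklearn.svm import SVR

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(),
    "MLP": MLPRegressor(),
    "AdaBoost": AdaBoostRegressor(),
    # Bagging with an MLP base learner, as in model 5
    "Bagging with MLP": BaggingRegressor(MLPRegressor()),
    "SVR": SVR(),
    "Random Forest": RandomForestRegressor(),
}
```

Each estimator shares the same fit/predict interface, so the evaluation loop below can treat them uniformly.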

Performance

Based on 10-fold cross-validation.

(figure: model performance)
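A minimal sketch of how 10-fold splits can be generated, written with plain index arithmetic for illustration (the actual code may well use a library helper such as scikit-learn's KFold):

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Splits n samples into k contiguous folds; the last fold absorbs any
    remainder. No shuffling is done here, which a real run would add.
    """
    idx = list(range(n))
    fold = n // k
    for i in range(k):
        start = i * fold
        stop = n if i == k - 1 else start + fold
        yield idx[:start] + idx[stop:], idx[start:stop]

folds = list(k_fold_indices(20, k=10))
# Each model is trained on the 9 train folds and scored on the held-out
# fold; the 10 scores are then averaged.
```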

Feature importance

Based on the Random Forest model, which can measure the importance of each feature.

(figure: feature importances)
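A sketch of how a Random Forest exposes feature importances, shown on synthetic data (the data, seed, and hyperparameters are illustrative, not the project's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# The target depends almost entirely on column 0, so that column
# should dominate the learned importances.
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = model.feature_importances_  # one weight per feature, sums to 1
```

Ranking the real features by these weights is what produces a figure like the one above.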

Contributors

marschildx, triprunn
