Inspired by Spotify Million Playlist Dataset Challenge
People love playlists. Spotify reported in 2008 that their users have generated over 4 bn playlists [2]. Various industry studies indicate that playlists account for a third of all music playtime [1], and over a half of users say that playlists are replacing albums in they music listening habits [2].
Playlists create benefits for consumers by providing personalised music discovery and reccomendations for various occasions, moods and themes. The importance of playlist for the music industry is also paramount, covering use cases like consumer engagement improvement, increased playtime, better music search, and also helping less known artist get discovered though automatically generated playlists.
In this project I have explored Content Based Filtering (CBF) and Collaborative Filtering (CF) with python to solve the task of automatic playlist creation based on first n tracks from a playlist or n randomly selected items from a playlist.
Dataset comes from original [1] Spotify Million Playlist Dataset Challenge
As mentioned above, the project used Collaborative Filtering (CF) and Content Based Filtering (CBF) as two main approaches
Collaborative Filtering: This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.
Notebooks:
- Memory based:
modeling-notebooks/CF00_Memory_based_scaled_down.ipynb
- Model based:
- Surprise
modeling-notebooks/CF01_Model_Surprise_50pct_sample.ipynb
modeling-notebooks/CF01_Model_Surprise_scaled_down.ipynb
modeling-notebooks/CF01_Model_Surprise_scaled_down.ipynb
- Alternating Least Squares with Implicit
modeling-notebooks/F02_Model_ALS_Implicit_binary.ipynb
- contains demomodeling-notebooks/CF02_Model_ALS_Implicit_pos.ipynb
- contains demo
- SVD
modeling-notebooks/CF03_Model_SVD_sparse_matrix_binary_ratings.ipynb
modeling-notebooks/CF03_Model_SVD_sparse_matrix_pos_ratings.ipynb
- Surprise
Content-Based Filtering: This method uses only information about the description and attributes of the items users has previously consumed to model user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended.
Notebooks:
modeling-notebooks/CBF00_audio_features.ipynb
- contains demo- WIP:
modeling-notebooks/CBF01_Audio_features_genres_data_preparation.ipynb
- WIP:
modeling-notebooks/CBF01_Audio_features_genres_model.ipynb
Two information retrival systems evaluation metrics were used, partially following the set up from the original challenge.
Notebook and the Original Challenge Definitions.
To complete the evaluation, data set is split into pseudo train and test sets. Each playlist in the test set is split into two subsets: seed tracks and hold-out, or ground thruth, tracks. Train playlists and test playlists containing only seed tracks are then used to train the models. Test playlists containing only seed tracks are then used to obtain recommendations from the models. The R-precision
and NDCG
of the obtained recommendation is calculated against the Ground truth
R-precision
measures the number of retrieved relevant tracks divided by the number of known relevant tracks (i.e., the number of withheld tracks)Normalised Discounted Cumulative Gain (NDGS)
Discounted Cumulative Gain (DCG) measures the ranking quality of the recommended tracks, increasing when relevant tracks are placed higher in the list. Normalized DCG (NDCG) is determined by calculating the DCG and dividing it by the ideal DCG in which the recommended tracks are perfectly ranked
- python version: python 3.8.3
- dependencies:
requirements.txt
├── data-processing-notebooks/ <- Notebooks with data extractions and processing
├── evaluation/ <- Evaluation results in csv
├── modeling-notebooks/ <- Notebooks with models and evaluation
├── README.md <- High level readme file
├── requirements.txt <- requirements.txt
└── src/ <- scripts provided with data set to to obtain basic descriptive statistics of the dataset
[2] Spotify Million Playlist Dataset Challenge
[3] Spotify Web API
[5] Introduction to recommender systems
[6] Playlists: Good Or Bad For Musicians?
[7] Spotify Sentiment Analysis
[8] Recommender Systems in Python 101
[9] Evaluate your Recommendation Engine using NDCG
[10] How to Build a Memory-Based Recommendation System using Python Surprise
[11] How to Build a Model-Based Recommendation System using Python Surprise
[12] ALS Implicit Collaborative Filtering
[13] Beginner Tutorial: Recommender Systems in Python
[14] Stop One-Hot Encoding Your Categorical Variables
[15] Brief presentation