In this lab, we will build a basic version of a low-rank matrix factorization engine for recommending movies using NumPy and SciPy. The purpose of this lesson is to help develop a sound understanding of how this process works in detail before we implement it in PySpark. We will use a small movie recommendation dataset suitable for such experiments. The system should make meaningful, though not highly accurate, recommendations.
- Build a basic recommendation system using the MovieLens 1 million ratings dataset
- Use SciPy and NumPy to build a recommendation system using matrix factorization.
For this lab, we will use a dataset of 1 million movie ratings available from the MovieLens project, collected by GroupLens Research at the University of Minnesota. The website offers many versions and subsets of the complete MovieLens dataset. We have downloaded the 1 million ratings subset for you, and you can find it under the folder ml-1m. Visit this link and also the MovieLens site above to get an understanding of the format of the included files before moving on:
- ratings.dat
- users.dat
- movies.dat
Let's first read our datasets into pandas dataframes before moving on.
Our datasets are in .dat format, with features split using the delimiter '::'. Perform the following tasks:
- Read the files ratings.dat, movies.dat and users.dat using Python's open(). Use encoding='latin-1' for these files
- Split the above files on the delimiter '::' and create arrays for users, movies and ratings
- Create ratings and movies dataframes from the arrays above with columns:
  - ratings = ['UID', 'MID', 'Rating', 'Time']
  - movies = ['MID', 'Title', 'Genre']
- View the contents of the movies and ratings datasets
Note: Make sure to change the appropriate datatypes to int (numeric) in these datasets.
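The parsing steps above can be sketched on a few in-memory lines. This is only a minimal illustration: the helper name parse_dat is hypothetical, the sample lines are copied from the expected output shown in this lab, and in the lab itself the lines would come from open() on the files in the ml-1m folder with encoding='latin-1'.

```python
import pandas as pd

# A few lines in the ml-1m layout, standing in for the contents of
# ratings.dat and movies.dat (values copied from the lab's sample output).
raw_ratings = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "1::914::3::978301968",
]
raw_movies = [
    "1::Toy Story (1995)::Animation|Children's|Comedy",
    "2::Jumanji (1995)::Adventure|Children's|Fantasy",
]


def parse_dat(lines, columns):
    """Hypothetical helper: split each line on '::' and build a dataframe."""
    rows = [line.strip().split("::") for line in lines if line.strip()]
    return pd.DataFrame(rows, columns=columns)


# Every ratings column is numeric, so the whole frame can be cast to int.
ratings = parse_dat(raw_ratings, ["UID", "MID", "Rating", "Time"]).astype(int)

# Only the movie ID is numeric in the movies table.
movies = parse_dat(raw_movies, ["MID", "Title", "Genre"])
movies["MID"] = movies["MID"].astype(int)

print(movies.head())
print()
print(ratings.head())
```

Splitting on '::' rather than ':' matters here because movie titles can contain single colons.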
# Code here
# Uncomment Below
# print(movies.head())
# print()
# print(ratings.head())
MID Title Genre
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
UID MID Rating Time
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
Matrix factorization, as we saw in the previous lesson, uses a "Utility Matrix" of users x movies. The intersection points between users and movies indicate the ratings users have given to movies. We saw that this is a mostly sparse matrix. Here is a quick refresher below:
Next, our job is to create such a matrix from the ratings table above in order to proceed forward with SVD.
- Create a utility matrix A to contain one row per user and one column per movie. Pivot the ratings dataframe to achieve this.
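As a rough sketch of the pivot step, the snippet below builds a utility matrix from a made-up four-row ratings table (the values are illustrative, not taken from the dataset). Filling unrated cells with 0 is an assumption that matches the expected output shown in this lab.

```python
import pandas as pd

# Toy ratings table with the same columns as the lab's ratings dataframe.
ratings = pd.DataFrame({
    "UID": [1, 1, 2, 3],
    "MID": [1, 3, 2, 1],
    "Rating": [5, 3, 4, 2],
    "Time": [0, 0, 0, 0],
})

# One row per user, one column per movie; movies a user never rated
# come out as NaN from pivot() and are replaced with 0 here.
A = ratings.pivot(index="UID", columns="MID", values="Rating").fillna(0)
print(A)
```

Note that pivot() raises an error if the same (UID, MID) pair appears twice, so it assumes each user rated each movie at most once.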
# Create a utility matrix A by pivoting the ratings dataframe
# Code here
MID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 3943 | 3944 | 3945 | 3946 | 3947 | 3948 | 3949 | 3950 | 3951 | 3952 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UID | |||||||||||||||||||||
1 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 3706 columns
Finally, let's perform mean normalization on our utility matrix, and save it as a numpy array for decomposition tasks.
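A minimal sketch of mean normalization on a toy array follows. It assumes the user mean is taken over every column, including the zeros that stand for unrated movies, which is what the expected output in this lab suggests (a rating of 5 becomes roughly 4.94).

```python
import numpy as np

# Toy utility matrix: 3 users x 4 movies, 0 means "not rated".
A = np.array([
    [5.0, 0.0, 0.0, 3.0],
    [0.0, 4.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
])

# Mean rating per user (row), computed over all columns.
user_ratings_mean = A.mean(axis=1)

# Reshape the means into a column so they broadcast across each row.
A_norm = A - user_ratings_mean.reshape(-1, 1)
print(A_norm)
```

With a pandas dataframe, calling .values (or .to_numpy()) first gives the array used here.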
# Mean normalize dataframe A to numpy array A_norm
# Code here
array([[ 4.94009714, -0.05990286, -0.05990286, ..., -0.05990286,
-0.05990286, -0.05990286],
[-0.12924987, -0.12924987, -0.12924987, ..., -0.12924987,
-0.12924987, -0.12924987],
[-0.05369671, -0.05369671, -0.05369671, ..., -0.05369671,
-0.05369671, -0.05369671],
...,
[ 3.85429034, -0.14570966, -0.14570966, ..., -0.14570966,
-0.14570966, -0.14570966],
[ 4.89314625, -0.10685375, -0.10685375, ..., -0.10685375,
-0.10685375, -0.10685375],
[ 4.55477604, 4.55477604, -0.44522396, ..., -0.44522396,
-0.44522396, -0.44522396]])
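SVD decomposes the matrix $A$ into the product of three matrices: $A = U\Sigma V^T$.
$A$ above is the users' ratings (utility) matrix, $U$ is the user features matrix, $\Sigma$ is the diagonal matrix of singular values, and $V^T$ is the movie features matrix.
To get the lower rank approximation, we take these matrices and keep only the top $k$ features.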
Perform the following tasks:
- Import svds from SciPy
- Decompose A_norm using 50 factors, i.e. pass the k=50 argument to svds()
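The call can be tried on a small made-up matrix first. This sketch uses k=2 because svds() needs k to be smaller than both matrix dimensions with its default solver; the toy values are illustrative only.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy mean-normalized matrix: 5 users x 4 movies.
A_norm = np.array([
    [ 3.0, -2.0, -2.0,  1.0],
    [-1.0,  3.0, -1.0, -1.0],
    [ 1.5, -0.5, -0.5, -0.5],
    [-2.0, -2.0,  2.0,  2.0],
    [ 0.0,  1.0, -1.0,  0.0],
])

# Keep only the k largest singular values and their vectors.
U, sigma, Vt = svds(A_norm, k=2)

print(U.shape)      # user factors: (users, k)
print(sigma.shape)  # singular values as a 1-D array: (k,)
print(Vt.shape)     # movie factors: (k, movies)
```

Unlike np.linalg.svd, svds() returns only k singular triplets, which is what keeps the decomposition cheap on a large, sparse utility matrix.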
# Code here
The sigma above is returned as a 1-dimensional array of latent factor values. As we need to perform matrix multiplication in our next part, let's convert it to a diagonal matrix using np.diag(). Here is an explanation for this
- Convert sigma factors into a diagonal matrix with np.diag()
- Check and confirm the shape of sigma before and after conversion
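A short sketch of the conversion, using three of the singular values from the lab output as sample input. The small matrix U below is made up purely to show what multiplying by the diagonal matrix does.

```python
import numpy as np

# Three sample singular values (a 1-D array, as returned by svds()).
sigma = np.array([147.18581225, 670.41536276, 1544.10679346])
print(sigma.shape)   # shape before conversion

# np.diag() places a 1-D array on the diagonal of a square matrix.
sigma_diag = np.diag(sigma)
print(sigma_diag.shape)   # shape after conversion

# Multiplying by the diagonal matrix scales each column of U by its
# singular value, which is the same as broadcasting the 1-D array.
U = np.arange(6, dtype=float).reshape(2, 3)
scaled = np.dot(U, sigma_diag)
```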
# Code here
[ 147.18581225 147.62154312 148.58855276 150.03171353 151.79983807
153.96248652 154.29956787 154.54519202 156.1600638 157.59909505
158.55444246 159.49830789 161.17474208 161.91263179 164.2500819
166.36342107 166.65755956 167.57534795 169.76284423 171.74044056
176.69147709 179.09436104 181.81118789 184.17680849 186.29341046
192.15335604 192.56979125 199.83346621 201.19198515 209.67692339
212.55518526 215.46630906 221.6502159 231.38108343 239.08619469
244.8772772 252.13622776 256.26466285 275.38648118 287.89180228
315.0835415 335.08085421 345.17197178 362.26793969 415.93557804
434.97695433 497.2191638 574.46932602 670.41536276 1544.10679346] (50,)
(array([[ 147.18581225, 0. , 0. , ..., 0. ,
0. , 0. ],
[ 0. , 147.62154312, 0. , ..., 0. ,
0. , 0. ],
[ 0. , 0. , 148.58855276, ..., 0. ,
0. , 0. ],
...,
[ 0. , 0. , 0. , ..., 574.46932602,
0. , 0. ],
[ 0. , 0. , 0. , ..., 0. ,
670.41536276, 0. ],
[ 0. , 0. , 0. , ..., 0. ,
0. , 1544.10679346]]), (50, 50))
Excellent, we changed sigma from a vector of length 50 to a 2D diagonal matrix of size 50x50. We can now move on to making predictions from our decomposed matrices.
Now we have everything required to make movie rating predictions for every user. We will do it all at once by following the math and matrix multiplying $U$, $\Sigma$ and $V^T$ back together to get a rank-50 approximation of the normalized matrix.
- Use np.dot() to multiply $U$, $\Sigma$ and $V^T$
- Add the user rating means back to get the actual star rating predictions.
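The two steps above can be sketched end to end on a small random matrix. The data here is made up (so the predictions mean nothing), and k=2 stands in for the k=50 used in the lab.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy utility matrix: 6 users x 5 movies, ratings 0-5 where 0 = unrated.
rng = np.random.default_rng(42)
A = rng.integers(0, 6, size=(6, 5)).astype(float)

# Mean-normalize each user's row before decomposing.
user_ratings_mean = A.mean(axis=1)
A_norm = A - user_ratings_mean.reshape(-1, 1)

# Truncated SVD, then turn the singular values into a diagonal matrix.
U, sigma, Vt = svds(A_norm, k=2)
sigma = np.diag(sigma)

# Rebuild the low-rank approximation and add the user means back.
predictions = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
print(predictions.shape)
```

The result has the same shape as A, so every user gets a predicted value for every movie, including those they never rated.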
# Code here
For a practical system, the value of k above would be identified by creating test and training datasets and selecting the optimal value of this parameter. We will leave this bit for our detailed experiment in the next lab.
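One possible way to compare values of k is sketched here, assuming a simple hold-out scheme: hide a few known ratings, refit for each k, and score the hidden cells with RMSE. The helper names predict_with_rank and rmse are hypothetical, and the data is random, so the selected k is not meaningful; only the mechanics are.

```python
import numpy as np
from scipy.sparse.linalg import svds


def predict_with_rank(A, k):
    """Rank-k SVD predictions for a dense utility matrix A (0 = unrated)."""
    user_means = A.mean(axis=1, keepdims=True)
    U, sigma, Vt = svds(A - user_means, k=k)
    return U @ np.diag(sigma) @ Vt + user_means


def rmse(predictions, held_out):
    """Root mean squared error over held-out (row, col, rating) triples."""
    errors = [predictions[r, c] - rating for r, c, rating in held_out]
    return float(np.sqrt(np.mean(np.square(errors))))


# Toy fully-rated matrix: 20 users x 15 movies, ratings 1-5.
rng = np.random.default_rng(0)
A_full = rng.integers(1, 6, size=(20, 15)).astype(float)

# Hide a handful of ratings from the training matrix.
cells = [(0, 1), (3, 4), (7, 2), (12, 9)]
held_out = [(r, c, A_full[r, c]) for r, c in cells]
A_train = A_full.copy()
for r, c, _ in held_out:
    A_train[r, c] = 0.0

# Score a few candidate ranks and keep the one with the lowest RMSE.
scores = {k: rmse(predict_with_rank(A_train, k), held_out) for k in (2, 4, 8)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```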
Here, we'll see how to make recommendations based on the predictions array created above.
With the predictions matrix for every user, we can build a function to recommend movies for any user. We need to return the movies with the highest predicted rating that the specified user hasn't already rated. We will also merge in the user information to get a more complete picture of the recommendations. We will also return the list of movies the user has already rated, for the sake of comparison.
- Create a dataframe from predictions and view its contents
- Use the column names of A as the new names for this dataframe
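A small sketch of this step on made-up values: the predictions array has no labels of its own, so the movie IDs are borrowed from the columns of A.

```python
import numpy as np
import pandas as pd

# Toy utility matrix with user IDs as the index and movie IDs as columns.
A = pd.DataFrame(
    [[5.0, 0.0, 3.0], [0.0, 4.0, 0.0]],
    index=pd.Index([1, 2], name="UID"),
    columns=pd.Index([1, 2, 5], name="MID"),
)

# Toy predictions array with the same shape as A.
predictions = np.array([[4.3, 0.1, 2.9], [0.7, 3.6, 0.2]])

# Reuse A's movie IDs as column names; the row index defaults to 0, 1, ...
predictions_df = pd.DataFrame(predictions, columns=A.columns)
print(predictions_df.head())
```

Because the row index starts at 0 while user IDs start at 1, looking up a user by position means using UID - 1.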
# Code here
MID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 3943 | 3944 | 3945 | 3946 | 3947 | 3948 | 3949 | 3950 | 3951 | 3952 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4.288861 | 0.143055 | -0.195080 | -0.018843 | 0.012232 | -0.176604 | -0.074120 | 0.141358 | -0.059553 | -0.195950 | ... | 0.027807 | 0.001640 | 0.026395 | -0.022024 | -0.085415 | 0.403529 | 0.105579 | 0.031912 | 0.050450 | 0.088910 |
1 | 0.744716 | 0.169659 | 0.335418 | 0.000758 | 0.022475 | 1.353050 | 0.051426 | 0.071258 | 0.161601 | 1.567246 | ... | -0.056502 | -0.013733 | -0.010580 | 0.062576 | -0.016248 | 0.155790 | -0.418737 | -0.101102 | -0.054098 | -0.140188 |
2 | 1.818824 | 0.456136 | 0.090978 | -0.043037 | -0.025694 | -0.158617 | -0.131778 | 0.098977 | 0.030551 | 0.735470 | ... | 0.040481 | -0.005301 | 0.012832 | 0.029349 | 0.020866 | 0.121532 | 0.076205 | 0.012345 | 0.015148 | -0.109956 |
3 | 0.408057 | -0.072960 | 0.039642 | 0.089363 | 0.041950 | 0.237753 | -0.049426 | 0.009467 | 0.045469 | -0.111370 | ... | 0.008571 | -0.005425 | -0.008500 | -0.003417 | -0.083982 | 0.094512 | 0.057557 | -0.026050 | 0.014841 | -0.034224 |
4 | 1.574272 | 0.021239 | -0.051300 | 0.246884 | -0.032406 | 1.552281 | -0.199630 | -0.014920 | -0.060498 | 0.450512 | ... | 0.110151 | 0.046010 | 0.006934 | -0.015940 | -0.050080 | -0.052539 | 0.507189 | 0.033830 | 0.125706 | 0.199244 |
5 rows × 3706 columns
Now we have a predictions dataframe built from the reduced set of factors. We can now get predicted recommendations for every user. We will create a new function recommender() as shown below:
- recommender()
  - Inputs: predictions dataframe, chosen UserID, movies dataframe, original ratings df, num_recommendations
  - Outputs: Movies already rated by user, predicted ratings for remaining movies for user
- Get the set of predictions for the selected user and sort in descending order
- Get the movies already rated by the user and sort them in descending order by rating
- Create a set of recommendations from all movies not yet rated by the user
# Recommending top movies not yet rated by user
def recommender(predictions_df, UID, movies_df, original_ratings_df, num_recommendations=5):

    # Get and sort the user's predictions
    user_row = None # UID starts at 1, not 0
    sorted_predictions = None

    # Get the original user data and merge in the movie information
    user_data = None
    user_full = None

    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = None

    # Print user information
    pass #return user_full, recommendations
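One possible way to fill in the skeleton is sketched below under a different name, recommender_sketch, to make clear it is an illustration and not the reference solution. It assumes that row i of the predictions dataframe belongs to user i + 1 and that the prediction columns are movie IDs. The movie titles come from the sample output in this lab; the ratings and predictions are made up.

```python
import pandas as pd


def recommender_sketch(predictions_df, UID, movies_df, original_ratings_df,
                       num_recommendations=5):
    # Assumption: rows are ordered by user ID and UIDs start at 1.
    user_row = UID - 1
    sorted_predictions = predictions_df.iloc[user_row].sort_values(ascending=False)

    # Movies the user already rated, with titles merged in, best first.
    user_data = original_ratings_df[original_ratings_df["UID"] == UID]
    user_full = (user_data.merge(movies_df, how="left", on="MID")
                 .sort_values("Rating", ascending=False))

    # Drop already-rated movies, then keep the top predicted ones.
    unseen = sorted_predictions[~sorted_predictions.index.isin(user_full["MID"])]
    top = unseen.head(num_recommendations)
    recommendations = (pd.DataFrame({"MID": top.index, "Predicted": top.values})
                       .merge(movies_df, how="left", on="MID"))

    print(f"User {UID} has already rated {len(user_full)} movies.")
    return user_full, recommendations


# Toy inputs: 4 movies, 2 users.
movies = pd.DataFrame({
    "MID": [1, 2, 3, 4],
    "Title": ["Toy Story (1995)", "Jumanji (1995)",
              "Grumpier Old Men (1995)", "Waiting to Exhale (1995)"],
    "Genre": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy",
              "Comedy|Romance", "Comedy|Drama"],
})
ratings = pd.DataFrame({
    "UID": [1, 2, 2],
    "MID": [2, 1, 3],
    "Rating": [3, 5, 4],
    "Time": [0, 0, 0],
})
predictions_df = pd.DataFrame(
    [[2.0, 3.1, 1.0, 0.5], [4.5, 3.0, 4.0, 3.8]], columns=[1, 2, 3, 4]
)

rated, recommended = recommender_sketch(predictions_df, 2, movies, ratings, 2)
print(recommended)
```

The left merge keeps the order of the sorted predictions, so the first row of the result is the strongest recommendation.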
Using the above function, we can now get a set of recommendations for any user.
# Get a list of already rated and recommended movies for a selected user
# Uncomment to run below
#rated, recommended = recommender(predictions_df, 100, movies, ratings, 10)
User 100 has already rated 76 movies.
Recommending highest 10 predicted ratings movies not already rated.
# Uncomment to run
# rated.head(10)
| | UID | MID | Rating | Time | Title | Genre |
---|---|---|---|---|---|---|
1 | 100 | 800 | 5 | 977593915 | Lone Star (1996) | Drama\|Mystery |
63 | 100 | 527 | 5 | 977594839 | Schindler's List (1993) | Drama\|War |
16 | 100 | 919 | 5 | 977594947 | Wizard of Oz, The (1939) | Adventure\|Children's\|Drama\|Musical |
17 | 100 | 924 | 4 | 977594873 | 2001: A Space Odyssey (1968) | Drama\|Mystery\|Sci-Fi\|Thriller |
29 | 100 | 969 | 4 | 977594044 | African Queen, The (1951) | Action\|Adventure\|Romance\|War |
22 | 100 | 2406 | 4 | 977594142 | Romancing the Stone (1984) | Action\|Adventure\|Comedy\|Romance |
47 | 100 | 318 | 4 | 977594839 | Shawshank Redemption, The (1994) | Drama |
20 | 100 | 858 | 4 | 977593950 | Godfather, The (1972) | Action\|Crime\|Drama |
49 | 100 | 329 | 4 | 977594297 | Star Trek: Generations (1994) | Action\|Adventure\|Sci-Fi |
50 | 100 | 260 | 4 | 977593595 | Star Wars: Episode IV - A New Hope (1977) | Action\|Adventure\|Fantasy\|Sci-Fi |
# Uncomment to run
# print ("\nTop Ten recommendations for selected user" )
# recommended
Top Ten recommendations for selected user
| | MID | Title | Genre |
---|---|---|---|
1311 | 1374 | Star Trek: The Wrath of Khan (1982) | Action\|Adventure\|Sci-Fi |
1148 | 1193 | One Flew Over the Cuckoo's Nest (1975) | Drama |
1312 | 1376 | Star Trek IV: The Voyage Home (1986) | Action\|Adventure\|Sci-Fi |
285 | 296 | Pulp Fiction (1994) | Crime\|Drama |
570 | 590 | Dances with Wolves (1990) | Adventure\|Drama\|Western |
1184 | 1240 | Terminator, The (1984) | Action\|Sci-Fi\|Thriller |
877 | 912 | Casablanca (1942) | Drama\|Romance\|War |
1161 | 1214 | Alien (1979) | Action\|Horror\|Sci-Fi\|Thriller |
1524 | 1617 | L.A. Confidential (1997) | Crime\|Film-Noir\|Mystery\|Thriller |
997 | 1036 | Die Hard (1988) | Action\|Thriller |
For the randomly selected user 100 above, we can subjectively judge that the recommender is doing a decent job. The recommended movies are quite similar in taste to the movies the user has already rated. Remember that this system is built using only a small subset of the complete MovieLens database, so there is potential for significant improvement in predictive performance.
- Run the experiment again using validation testing to identify the optimal value of rank k
- Create test and train datasets to predict and evaluate the ratings using a suitable method (e.g. RMSE)
- How much of an improvement do you see in recommendations as a result of validation?
- Ask other interesting questions
https://towardsdatascience.com/how-to-build-a-simple-song-recommender-296fcbc8c85 - A similar system recommending songs to users
-
http://surpriselib.com/ Surprise is a Python Scikit is a very popular module for building receommendation systems
In this lab, we learned that we can make good recommendations with collaborative filtering methods using latent features from low-rank matrix factorization methods. This technique also scales significantly better to larger datasets. We will next work with a larger MovieLens dataset and, using the MapReduce techniques seen in the previous section, we will use PySpark to implement a similar approach using an implementation of matrix factorization called the Alternating Least Squares (ALS) method.