Singular Value Decomposition with Numpy and SciPy

Introduction

In this lab, we will build a basic low-rank matrix factorization engine for recommending movies using NumPy and SciPy. The purpose of this lesson is to develop a sound understanding of how this process works in detail before we implement it in PySpark. We will use a small movie recommendation dataset suitable for such experiments. The recommendations will not be highly accurate, but they should be meaningful.

Objectives

Build a basic recommendation system using the MovieLens 1 million ratings dataset
Use SciPy and NumPy to build a recommendation system using matrix factorization

Dataset

For this lab, we will use a dataset of 1 million movie ratings available from the MovieLens project, collected by GroupLens Research at the University of Minnesota. The website offers many versions and subsets of the complete MovieLens dataset. We have downloaded the 1 million ratings subset for you, and you can find it in the folder ml-1m. Visit this link and also the MovieLens site above to get an understanding of the format of the included files before moving on:

ratings.dat
users.dat
movies.dat

Let's first read our dataset into pandas dataframes before moving on.

Our datasets are in .dat format, with features separated by the delimiter '::'. Perform the following tasks:

Read the files ratings.dat, movies.dat and users.dat using Python's open(). Use encoding='latin-1' for these files
Split each line of these files on the delimiter '::' and create arrays for users, movies and ratings
Create ratings and movies dataframes from the arrays above with columns:
- ratings = ['UID', 'MID', 'Rating', 'Time']
- movies = ['MID', 'Title', 'Genre']
View the contents of the movies and ratings datasets

Note: Make sure to convert the numeric columns in these datasets to int.
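
The reading steps above could be sketched roughly as follows. This is only an illustration, not the lab's reference solution: the helper name `parse_dat_lines` and the sample lines are assumptions made for the example. With the real files, you would pass an open file handle such as `open('ml-1m/ratings.dat', encoding='latin-1')` in place of the in-memory lists.

```python
import pandas as pd

def parse_dat_lines(lines, columns, int_columns):
    """Split '::'-delimited lines into a DataFrame and cast selected columns to int.

    Hypothetical helper for illustration; `lines` can be any iterable of strings,
    including an open file handle.
    """
    rows = [line.rstrip('\n').split('::') for line in lines if line.strip()]
    df = pd.DataFrame(rows, columns=columns)
    # Everything is read as text, so convert the numeric columns explicitly
    return df.astype({col: int for col in int_columns})

# Two sample lines in the ratings.dat layout (UID::MID::Rating::Time)
sample_ratings = ['1::1193::5::978300760\n', '1::661::3::978302109\n']
ratings_demo = parse_dat_lines(sample_ratings,
                               columns=['UID', 'MID', 'Rating', 'Time'],
                               int_columns=['UID', 'MID', 'Rating', 'Time'])

# One sample line in the movies.dat layout (MID::Title::Genre)
sample_movies = ["1::Toy Story (1995)::Animation|Children's|Comedy\n"]
movies_demo = parse_dat_lines(sample_movies,
                              columns=['MID', 'Title', 'Genre'],
                              int_columns=['MID'])

print(ratings_demo)
print(movies_demo)
```

Splitting on '::' rather than ':' matters here, because movie titles can contain single colons.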

# Code here

# Uncomment Below

# print(movies.head())
# print()
# print(ratings.head())

   MID                               Title                         Genre
0    1                    Toy Story (1995)   Animation|Children's|Comedy
1    2                      Jumanji (1995)  Adventure|Children's|Fantasy
2    3             Grumpier Old Men (1995)                Comedy|Romance
3    4            Waiting to Exhale (1995)                  Comedy|Drama
4    5  Father of the Bride Part II (1995)                        Comedy

   UID   MID  Rating       Time
0    1  1193       5  978300760
1    1   661       3  978302109
2    1   914       3  978301968
3    1  3408       4  978300275
4    1  2355       5  978824291

Creating the Utility Matrix

Matrix factorization, as we saw in the previous lesson, uses a "Utility Matrix" of users x movies. Each cell of this matrix holds the rating a user has given to a movie. We saw that this matrix is mostly sparse, because each user rates only a small fraction of the movies.

Next, our job is to create such a matrix from the ratings table above so that we can proceed with SVD.

Create a utility matrix A containing one row per user and one column per movie. Pivot the ratings dataframe to achieve this.
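
As a rough sketch of the pivot step, the toy example below builds a small utility matrix from a hand-made ratings table. The names `ratings_toy` and `A_toy` and the values in them are made up for illustration. Filling unrated cells with 0 is an assumption that matches the zeros visible in the output shown in this lab.

```python
import pandas as pd

# Toy ratings table in the same layout as the lab's ratings dataframe
ratings_toy = pd.DataFrame({
    'UID':    [1, 1, 2, 3],
    'MID':    [10, 20, 20, 30],
    'Rating': [5, 3, 4, 2],
})

# One row per user, one column per movie; cells with no rating become 0
A_toy = ratings_toy.pivot(index='UID', columns='MID', values='Rating').fillna(0)
print(A_toy)
```

Note that `pivot` only creates columns for movies that appear in the ratings table, which is why the matrix in this lab has 3706 columns even though movie IDs run up to 3952.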

# Create a utility matrix A by pivoting ratings.df

# Code here

MID	1	2	3	4	5	6	7	8	9	10	...	3943	3944	3945	3946	3947	3948	3949	3950	3951	3952
UID
1	5.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 3706 columns

Finally, let's mean-normalize our utility matrix by subtracting each user's mean rating from that user's row, and save the result as a NumPy array for the decomposition tasks.
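
A minimal sketch of this normalization on a toy array is given below. The names `A_toy`, `user_means` and `A_norm_toy` are illustrative. The sketch takes each user's mean over all movie columns, zeros included, which is an assumption; it appears consistent with the sample output in this lab, where a user's unrated entries all share one small negative value.

```python
import numpy as np

# Toy utility matrix: 2 users x 4 movies, 0 means "not rated"
A_toy = np.array([[5.0, 0.0, 0.0, 3.0],
                  [0.0, 4.0, 0.0, 0.0]])

# Mean of each user's row, taken over all movies (zeros included)
user_means = A_toy.mean(axis=1)

# Reshape the means to a column so the subtraction broadcasts row by row
A_norm_toy = A_toy - user_means.reshape(-1, 1)
print(A_norm_toy)
```

Keeping `user_means` around is useful, because the same values are added back after the decomposition to return to the star rating scale.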

# Mean normalize dataframe A to numpy array A_norm


# Code here

array([[ 4.94009714, -0.05990286, -0.05990286, ..., -0.05990286,
        -0.05990286, -0.05990286],
       [-0.12924987, -0.12924987, -0.12924987, ..., -0.12924987,
        -0.12924987, -0.12924987],
       [-0.05369671, -0.05369671, -0.05369671, ..., -0.05369671,
        -0.05369671, -0.05369671],
       ...,
       [ 3.85429034, -0.14570966, -0.14570966, ..., -0.14570966,
        -0.14570966, -0.14570966],
       [ 4.89314625, -0.10685375, -0.10685375, ..., -0.10685375,
        -0.10685375, -0.10685375],
       [ 4.55477604,  4.55477604, -0.44522396, ..., -0.44522396,
        -0.44522396, -0.44522396]])

Matrix Factorization with SVD

SVD helps us decompose the matrix $A$ and obtain the best lower-rank approximation of it. Mathematically, it decomposes $A$ into two unitary matrices and a diagonal matrix:

$$ A = U\Sigma V^{T} $$

Here $A$ is the users' ratings (utility) matrix, $U$ is the user "features" matrix, $\Sigma$ is the diagonal matrix of singular values (essentially weights), and $V^{T}$ is the movie "features" matrix. $U$ and $V^{T}$ are orthogonal, and they represent different things: $U$ represents how much users "like" each feature, and $V^{T}$ represents how relevant each feature is to each movie.

To get the lower rank approximation, we take these matrices and keep only the top $k$ features, which we think of as the underlying tastes and preferences vectors.

Perform the following tasks:

Import svds from scipy.sparse.linalg
Decompose A_norm using 50 factors, i.e. pass the argument k=50 to svds()
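
The call could look roughly like the sketch below, which runs `svds` on a small random matrix standing in for A_norm. The matrix, its size and the value k=3 are assumptions chosen so the example runs quickly; the lab itself uses k=50 on the full normalized matrix.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Small random stand-in for A_norm (8 "users" x 6 "movies")
rng = np.random.default_rng(0)
A_toy = rng.random((8, 6))

# Keep only the k largest singular values and their vectors
k = 3
U, sigma, Vt = svds(A_toy, k=k)

print(U.shape, sigma.shape, Vt.shape)
```

`svds` returns the singular values as a 1-dimensional array rather than a matrix. With the default solver, k has to be smaller than both dimensions of the input.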

# Code here

Creating diagonal matrix for sigma factors

The sigma above is returned as a 1-dimensional array of singular values. As we need to perform matrix multiplication in the next part, let's convert it to a diagonal matrix using np.diag(). Here is an explanation for this

Convert sigma factors into a diagonal matrix with np.diag()
Check and confirm the shape of sigma before and after conversion
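
A small sketch of the conversion, using three made-up singular values in place of the fifty returned in the lab:

```python
import numpy as np

# Stand-in for the 1-dimensional sigma array returned by svds()
sigma_toy = np.array([2.0, 3.0, 5.0])

# np.diag() places a 1-D array on the main diagonal of a square matrix
sigma_diag = np.diag(sigma_toy)

print(sigma_toy, sigma_toy.shape)
print(sigma_diag, sigma_diag.shape)
```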

# Code here

[ 147.18581225  147.62154312  148.58855276  150.03171353  151.79983807
  153.96248652  154.29956787  154.54519202  156.1600638   157.59909505
  158.55444246  159.49830789  161.17474208  161.91263179  164.2500819
  166.36342107  166.65755956  167.57534795  169.76284423  171.74044056
  176.69147709  179.09436104  181.81118789  184.17680849  186.29341046
  192.15335604  192.56979125  199.83346621  201.19198515  209.67692339
  212.55518526  215.46630906  221.6502159   231.38108343  239.08619469
  244.8772772   252.13622776  256.26466285  275.38648118  287.89180228
  315.0835415   335.08085421  345.17197178  362.26793969  415.93557804
  434.97695433  497.2191638   574.46932602  670.41536276 1544.10679346] (50,)





(array([[ 147.18581225,    0.        ,    0.        , ...,    0.        ,
            0.        ,    0.        ],
        [   0.        ,  147.62154312,    0.        , ...,    0.        ,
            0.        ,    0.        ],
        [   0.        ,    0.        ,  148.58855276, ...,    0.        ,
            0.        ,    0.        ],
        ...,
        [   0.        ,    0.        ,    0.        , ...,  574.46932602,
            0.        ,    0.        ],
        [   0.        ,    0.        ,    0.        , ...,    0.        ,
          670.41536276,    0.        ],
        [   0.        ,    0.        ,    0.        , ...,    0.        ,
            0.        , 1544.10679346]]), (50, 50))

Excellent! We changed sigma from a vector of length 50 to a 2D diagonal matrix of size 50x50. We can now move on to making predictions from our decomposed matrices.

Making Predictions from the Decomposed Matrices

Now we have everything required to make movie rating predictions for every user. We will do it all at once by following the math: matrix multiply $U$, $\Sigma$, and $V^{T}$ back together to get the rank $k=50$ approximation of $A$. Perform the following tasks to achieve this:

Use np.dot() to multiply $U$, $\Sigma$ and $V^{T}$
Add the user rating means back to get the actual star rating predictions
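
The two steps above can be sketched end to end on a toy matrix. Everything here (the 5 x 4 matrix, k=2, and the `_toy` names) is an assumption for illustration; with the lab's data you would reuse the U, sigma, Vt and user means computed earlier.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy utility matrix: 5 users x 4 movies, 0 means "not rated"
A_toy = np.array([[5.0, 0.0, 3.0, 0.0],
                  [4.0, 0.0, 0.0, 1.0],
                  [0.0, 2.0, 0.0, 5.0],
                  [0.0, 3.0, 4.0, 0.0],
                  [1.0, 0.0, 5.0, 4.0]])

# Mean-normalize each user's row before decomposing
user_means = A_toy.mean(axis=1)
A_norm_toy = A_toy - user_means.reshape(-1, 1)

# Rank-2 decomposition (the lab uses k=50 on the full matrix)
U, sigma, Vt = svds(A_norm_toy, k=2)
sigma_diag = np.diag(sigma)

# Multiply the factors back together, then add each user's mean back
predictions_toy = np.dot(np.dot(U, sigma_diag), Vt) + user_means.reshape(-1, 1)
print(predictions_toy.round(2))
```

The result has the same shape as the original utility matrix, but every cell now holds a predicted rating, including the cells that were 0 before.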

# Code here

For a practical system, the value of `k` above would be chosen by creating training and test datasets and selecting the value that performs best on the test data. We will leave this for our detailed experiment in the next lab.

Here, we'll see how to make recommendations based on the predictions array created above.

Making Recommendations

With the predictions matrix for every user, we can build a function to recommend movies for any user. We need to return the movies with the highest predicted rating that the specified user hasn't already rated. We will also merge in the user information to get a more complete picture of the recommendations. We will also return the list of movies the user has already rated, for the sake of comparison.

Create a Dataframe from predictions and view contents
Use column names of A as the new names for this dataframe
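
A minimal sketch of this step, with a tiny stand-in for A and made-up prediction values:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the utility matrix A (users x movies)
A_toy = pd.DataFrame([[5.0, 0.0], [0.0, 4.0]],
                     index=pd.Index([1, 2], name='UID'),
                     columns=pd.Index([10, 20], name='MID'))

# Made-up predictions with the same shape as A_toy
predictions_toy = np.array([[4.3, 0.1],
                            [0.7, 3.9]])

# Reuse the movie IDs from A as column names
predictions_df_toy = pd.DataFrame(predictions_toy, columns=A_toy.columns)
print(predictions_df_toy.head())
```

Because only the columns are passed in, the rows get a default index starting at 0, as in the output shown in this lab. A user's row position is therefore UID - 1 when the UIDs are contiguous and start at 1.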

# Code here

MID	1	2	3	4	5	6	7	8	9	10	...	3943	3944	3945	3946	3947	3948	3949	3950	3951	3952
0	4.288861	0.143055	-0.195080	-0.018843	0.012232	-0.176604	-0.074120	0.141358	-0.059553	-0.195950	...	0.027807	0.001640	0.026395	-0.022024	-0.085415	0.403529	0.105579	0.031912	0.050450	0.088910
1	0.744716	0.169659	0.335418	0.000758	0.022475	1.353050	0.051426	0.071258	0.161601	1.567246	...	-0.056502	-0.013733	-0.010580	0.062576	-0.016248	0.155790	-0.418737	-0.101102	-0.054098	-0.140188
2	1.818824	0.456136	0.090978	-0.043037	-0.025694	-0.158617	-0.131778	0.098977	0.030551	0.735470	...	0.040481	-0.005301	0.012832	0.029349	0.020866	0.121532	0.076205	0.012345	0.015148	-0.109956
3	0.408057	-0.072960	0.039642	0.089363	0.041950	0.237753	-0.049426	0.009467	0.045469	-0.111370	...	0.008571	-0.005425	-0.008500	-0.003417	-0.083982	0.094512	0.057557	-0.026050	0.014841	-0.034224
4	1.574272	0.021239	-0.051300	0.246884	-0.032406	1.552281	-0.199630	-0.014920	-0.060498	0.450512	...	0.110151	0.046010	0.006934	-0.015940	-0.050080	-0.052539	0.507189	0.033830	0.125706	0.199244

5 rows × 3706 columns

Now we have a predictions dataframe built from a reduced number of factors. We can now generate recommendations for every user. We will create a new function recommender() as described below:

recommender()
- Inputs: predictions dataframe , chosen UserID, movies dataframe, original ratings df, num_recommendations
- Outputs: Movies already rated by user, predicted ratings for remaining movies for user
Get the set of predictions for the selected user and sort them in descending order
Get the movies already rated by the user and sort them in descending order by rating
Create a set of recommendations from the movies not yet rated by the user

# Recommending top movies not yet rated by user
def recommender(predictions_df, UID, movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions
    user_row = None # UID starts at 1, not 0
    sorted_predictions = None
    
    # Get the original user data and merge in the movie information 
    user_data = None 
    user_full = None

    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = None
                    
    # Print user information
    
    
    pass #return user_full, recommendations
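
One possible way to fill in the skeleton is sketched below. It is not the lab's reference solution: the function is named `recommender_sketch` to keep it separate from the exercise, the toy dataframes are made up, and it assumes that UIDs are contiguous and start at 1, so that a user's row position in the predictions dataframe is UID - 1.

```python
import pandas as pd

def recommender_sketch(predictions_df, UID, movies_df, original_ratings_df,
                       num_recommendations=5):
    # Row position of the user (assumes UIDs are contiguous and start at 1)
    user_row = UID - 1
    sorted_predictions = predictions_df.iloc[user_row].sort_values(ascending=False)

    # Movies the user has already rated, with title and genre merged in
    user_data = original_ratings_df[original_ratings_df['UID'] == UID]
    user_full = (user_data.merge(movies_df, how='left', on='MID')
                          .sort_values('Rating', ascending=False))

    print('User {} has already rated {} movies.'.format(UID, user_full.shape[0]))
    print('Recommending highest {} predicted ratings movies not already rated.'
          .format(num_recommendations))

    # Movies not yet rated by the user, ranked by predicted rating
    unrated = movies_df[~movies_df['MID'].isin(user_full['MID'])]
    preds = sorted_predictions.rename('Prediction').rename_axis('MID').reset_index()
    recommendations = (unrated.merge(preds, how='left', on='MID')
                              .sort_values('Prediction', ascending=False)
                              .head(num_recommendations)
                              .drop(columns='Prediction'))
    return user_full, recommendations

# Toy data, made up for illustration
movies_toy = pd.DataFrame({'MID': [10, 20, 30],
                           'Title': ['Movie A', 'Movie B', 'Movie C'],
                           'Genre': ['Drama', 'Comedy', 'Action']})
ratings_toy = pd.DataFrame({'UID': [1, 2], 'MID': [10, 20],
                            'Rating': [5, 4], 'Time': [0, 0]})
predictions_toy = pd.DataFrame([[4.5, 1.0, 3.0],
                                [0.5, 3.8, 2.0]], columns=[10, 20, 30])

rated_toy, recommended_toy = recommender_sketch(predictions_toy, 1, movies_toy,
                                                ratings_toy, 2)
print(recommended_toy)
```

In the toy call, user 1 has rated only movie 10, so the two remaining movies are returned in order of predicted rating. Movies that have no column in the predictions dataframe get a missing prediction after the left merge and are sorted to the end.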

Using the above function, we can now get a set of recommendations for any user.

# Get a list of already rated and recommended movies for a selected user
# Uncomment to run below


#rated, recommended = recommender(predictions_df, 100, movies, ratings, 10)

User 100 has already rated 76 movies.
Recommending highest 10 predicted ratings movies not already rated.

# Uncomment to run 

# rated.head(10)

	UID	MID	Rating	Time	Title	Genre
1	100	800	5	977593915	Lone Star (1996)	Drama|Mystery
63	100	527	5	977594839	Schindler's List (1993)	Drama|War
16	100	919	5	977594947	Wizard of Oz, The (1939)	Adventure|Children's|Drama|Musical
17	100	924	4	977594873	2001: A Space Odyssey (1968)	Drama|Mystery|Sci-Fi|Thriller
29	100	969	4	977594044	African Queen, The (1951)	Action|Adventure|Romance|War
22	100	2406	4	977594142	Romancing the Stone (1984)	Action|Adventure|Comedy|Romance
47	100	318	4	977594839	Shawshank Redemption, The (1994)	Drama
20	100	858	4	977593950	Godfather, The (1972)	Action|Crime|Drama
49	100	329	4	977594297	Star Trek: Generations (1994)	Action|Adventure|Sci-Fi
50	100	260	4	977593595	Star Wars: Episode IV - A New Hope (1977)	Action|Adventure|Fantasy|Sci-Fi

# Uncomment to run

# print ("\nTop Ten recommendations for selected user" )
# recommended

Top Ten recommendations for selected user

	MID	Title	Genre
1311	1374	Star Trek: The Wrath of Khan (1982)	Action|Adventure|Sci-Fi
1148	1193	One Flew Over the Cuckoo's Nest (1975)	Drama
1312	1376	Star Trek IV: The Voyage Home (1986)	Action|Adventure|Sci-Fi
285	296	Pulp Fiction (1994)	Crime|Drama
570	590	Dances with Wolves (1990)	Adventure|Drama|Western
1184	1240	Terminator, The (1984)	Action|Sci-Fi|Thriller
877	912	Casablanca (1942)	Drama|Romance|War
1161	1214	Alien (1979)	Action|Horror|Sci-Fi|Thriller
1524	1617	L.A. Confidential (1997)	Crime|Film-Noir|Mystery|Thriller
997	1036	Die Hard (1988)	Action|Thriller

For the randomly selected user 100 above, a subjective evaluation suggests that the recommender is doing a decent job: the recommended movies are similar in taste to the movies the user has already rated. Remember that this system is built using only a small subset of the complete MovieLens database, so there is potential for significant improvement in predictive performance.

Level Up - Optional

Run the experiment again using validation testing to identify the optimal value of the rank k
Create test and train datasets to predict and evaluate the ratings using a suitable metric (e.g. RMSE)
How much of an improvement do you see in recommendations as a result of validation?
Ask other interesting questions
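
As a starting point for the evaluation, RMSE over a set of held-out ratings could be computed as in the sketch below. The helper name `rmse` and the three held-out values are made up for illustration; how you split the ratings into train and test sets is left to the experiment.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Made-up held-out ratings and the model's predictions for them
held_out_actual = [5, 3, 4]
held_out_predicted = [4.0, 3.0, 2.0]

print(rmse(held_out_predicted, held_out_actual))
```

Repeating this for several values of k and keeping the one with the lowest held-out RMSE is one simple way to pick the rank.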

Additional Resources

https://towardsdatascience.com/how-to-build-a-simple-song-recommender-296fcbc8c85 - A similar system recommending songs to users
http://surpriselib.com/ - Surprise is a popular Python scikit for building recommendation systems

Summary

In this lab, we learned that we can make good recommendations with collaborative filtering methods using latent features from low-rank matrix factorization. This technique also scales significantly better to larger datasets. Next, we will work with a larger MovieLens dataset and, using the MapReduce techniques seen in the previous section, implement a similar approach in PySpark with a matrix factorization method called Alternating Least Squares (ALS).
