Similarity Based Recommendation System

_______________________________________________________________________________________________________________________________

Luis Terán

A recommendation engine is an information filtering system that delivers information tailored to a user's interests, preferences, or behavioral history. It predicts a specific user's preference for an item based on their profile. With product recommendation systems, customers can find the items they are looking for easily and quickly. Several recommendation systems have been developed so far to suggest products related to those the user has watched, bought or otherwise interacted with in the past.

The recommendation engine is a powerful marketing tool, especially for e-commerce, and is also useful for increasing sales, revenue and profits in general. That is why personalized product recommendations are so widely used in the retail industry, which further highlights the importance of recommendation engines in e-commerce.

Recommendation system

Recommendation systems use a number of different technologies. We can classify these systems into two broad groups:

Content-based systems: Content-based systems examine properties of the items recommended. For instance, if a Netflix user has watched many cowboy movies, then recommend a movie classified in the database as having the “cowboy” genre.
Collaborative filtering systems: Collaborative filtering systems recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users.

The aim of this project is to create a similarity-based recommendation system, using collaborative filtering, capable of making concrete movie recommendations based on the movies the user has previously rated. The original dataset was obtained from:

https://grouplens.org/datasets/movielens/latest/

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data files used in this project were:

Movies Data File Structure (movies.csv): Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the (movieId, title, genres) format.
Ratings Data File Structure (ratings.csv): All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the (userId, movieId, rating, timestamp) format.

The process followed by the project was:

Exploratory Data Analysis
User Item Matrix
Similar User Search
Recommendation System Implementation
Final Evaluation

1. Exploratory Data Analysis

1.1 General Analysis

First of all, we need to import both datasets (movies and ratings) and take a look at the data. There are 9,742 movies with id, title and genres, and 100,836 ratings of those movies. Before proceeding, we need to take care of some aspects of the dataset:

Missing values: No missing values were found in either dataset.
Duplicates: There were no duplicate observations in either dataset.
Unnecessary variables: Since the recommender system is based only on previous evaluations, we only need to know the rating, which movie was rated and who rated it ('rating', 'movieId', 'userId').

# All the libraries needed for the project are loaded
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
import numpy as np
import warnings

np.random.seed(0)
warnings.filterwarnings('ignore')

# Importing movie dataset
movies = pd.read_csv("./data/movies.csv")
movies.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy

# Showing number of rows (9742) and columns (3) of the movies dataset
movies.shape

(9742, 3)

# Looking for null values
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
movieId    9742 non-null int64
title      9742 non-null object
genres     9742 non-null object
dtypes: int64(1), object(2)
memory usage: 152.3+ KB

# Importing rating dataset
ratings = pd.read_csv("./data/ratings.csv")
ratings.head()

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

# Showing number of rows (100836) and columns (4) of the ratings dataset
ratings.shape

(100836, 4)

# Looking for null values
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
userId       100836 non-null int64
movieId      100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB

# Looking for duplicated values
ratings.duplicated().sum()

1.2 Distribution of the ratings dataset

We'll start by looking at which rating values are most frequent in the ratings dataset and how they are distributed. A rating takes one of 10 possible values between 0.5 and 5, in steps of 0.5. We can see that 75% of the ratings are greater than or equal to 3 and 50% are greater than or equal to 3.5 stars. Low ratings are infrequent in the dataset, and 4 stars is the most frequent rating.

(pd.DataFrame(ratings['rating']
              .value_counts(sort=False))
              .sort_index()
              .plot(kind='bar', color='#4472C4', figsize=(15,5), alpha=0.6))
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

ratings['rating'].describe().to_frame()

	rating
count	100836.000000
mean	3.501557
std	1.042529
min	0.500000
25%	3.000000
50%	3.500000
75%	4.000000
max	5.000000

Now, let's count how those ratings are distributed among the users. The number of ratings per user ranges between 20 and 2,698, but half of the users have rated 70 or fewer movies and 75% have rated 168 or fewer.

(pd.DataFrame(ratings.groupby('userId')
              .count()['rating'])
              .plot(kind='hist', color='#4472C4', figsize=(15,5), alpha=0.6))
plt.title('Number of ratings by user')
plt.xlabel('Number of ratings')
plt.ylabel('Number of users')
plt.show()

pd.DataFrame(ratings.groupby('userId').count()['rating']).describe()

	rating
count	610.000000
mean	165.304918
std	269.480584
min	20.000000
25%	35.000000
50%	70.500000
75%	168.000000
max	2698.000000

One important aspect of the recommendation system is what makes a movie recommendable. A movie could have an average rating of 5 stars with only two reviews, but that doesn't make it the most recommendable movie. For that reason, we'll define a threshold: the minimum number of reviews a movie needs to be considered a reliable recommendation. We will drop movies below the threshold so that only popular movies are recommended. We don't have an explicit measure of how popular a movie is, but we can count how many reviews each movie has received.

# Reviews per movie plot
ratesMovies = pd.DataFrame(ratings['movieId'].value_counts(sort=False))
ratesMovies.plot(kind='hist', color='#4472C4', figsize=(15,5), alpha=0.6)
plt.title('Reviews per movie')
plt.xlabel('Number of reviews')
plt.ylabel('Number of movies')
plt.show()

At first glance, most of the movies have no more than 30 reviews, but there is no obvious value to select as the threshold.

# Reviews per movie plot between 0 and 100
ratesMovies.plot.hist(xlim=(0,100), bins=300, color='#4472C4', figsize=(15,5), alpha=0.6)
plt.title('Number of reviews for movies')
plt.xlabel('Number of reviews')
plt.ylabel('Number of movies')
plt.show()

Taking a closer look, we see that most of the movies have fewer than 30 reviews, and beyond this value the frequency remains roughly constant, so 30 will be used as the threshold. Then we need to remove all reviews that belong to movies that are not popular, keeping only movies with more than 30 reviews. Of the 100,836 initial reviews, only 57,358 remained.

# The threshold is defined
threshold = 30

# Ratings before removing the less popular movies
len(ratings)

# Removing less popular movies
ratesMovies.reset_index(inplace=True)
ratesMovies.columns=['movieId', 'reviews']
ratings = pd.merge(ratings, ratesMovies, on='movieId', how="left").copy()
ratings = ratings[ratings['reviews']>threshold].copy()

# Ratings after removing the less popular movies
len(ratings)

# We will also remove not popular movies from the movies dataset
remaining_movies = list(ratings.movieId.values)
movies = movies[movies['movieId'].isin(remaining_movies)].copy()
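
The same popularity filter can also be written without the intermediate merge. The following is a minimal sketch on a made-up ratings table (the toy values and the `min_reviews` name are illustrative and not part of the project). It assumes the same rule as the code above: a movie is kept only if it has strictly more reviews than the threshold.

```python
import pandas as pd

# Hypothetical toy ratings: movie 10 has 3 reviews, movies 20 and 30 have 2 each
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20, 30],
    "rating":  [4.0, 3.0, 5.0, 2.5, 4.5, 3.5, 1.0],
})

min_reviews = 2  # illustrative threshold (the project uses 30)

# For every rating row, count how many reviews its movie received
reviews_per_row = toy.groupby("movieId")["rating"].transform("count")

# Keep only rows whose movie has strictly more reviews than the threshold
popular = toy[reviews_per_row > min_reviews].copy()
print(popular)
```

Here only the three ratings of movie 10 survive, because it is the only movie with more than two reviews.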

1.3 Distribution of the movies dataset

As mentioned before, the recommendations are determined by how users have interacted with the movies, so the movie characteristics are not as important to the system as the information in the ratings dataset. Nevertheless, we'll take a quick look at the characteristics in the movies dataset in order to have a full picture of the data we are handling. Some extra information can be obtained by extracting the year of release from the movie titles. From that we can conclude that the majority of the movies in the dataset were released around the year 2000.

# Extracting the year out of the title column
movies['year'] = movies['title'].str.extract(r"\((\d+)\)", expand=False)
movies['year'] = pd.to_numeric(movies['year'])
movies.head()

	movieId	title	genres	year
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1995
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy	1995
2	3	Grumpier Old Men (1995)	Comedy\|Romance	1995
4	5	Father of the Bride Part II (1995)	Comedy	1995
5	6	Heat (1995)	Action\|Crime\|Thriller	1995
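
The pattern used above captures the first run of digits found in parentheses, which is enough for the titles shown. As a hedged alternative, the sketch below anchors the pattern to a four-digit year at the end of the title. The helper name `extract_year` and the second and third sample titles are made up for illustration.

```python
import re

# Four digits in parentheses at the very end of the title (trailing spaces allowed)
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the release year as an int, or None when no year is found."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else None

print(extract_year("Toy Story (1995)"))           # real title from the dataset
print(extract_year("Made-up Sequel (2) (2001)"))  # hypothetical title
print(extract_year("Title Without Year"))         # hypothetical title
```

The same regular expression can be passed to `movies['title'].str.extract(..., expand=False)` if the stricter behaviour is preferred.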

# Year of release histogram
movies['year'].plot(kind = 'hist', xlim=(1900,2020), bins=300, color='#4472C4', figsize=(15,5), alpha=0.6)
plt.title('Year of release of movies')
plt.xlabel('Year')
plt.ylabel('Frequency')
plt.show()

Another important characteristic is the movie genre. A movie can be described by more than one genre, so we will count the total number of occurrences of each genre in the genres column. The most frequent genres are Drama and Comedy, each roughly three times as frequent as Fantasy, while Horror does not even reach the top ten.

# Genre frequency list
genres = list(movies.genres.values)
movie_genres = [movie.split('|') for movie in genres]
movie_genres = [x for sublist in movie_genres for x in sublist]
genres_freq = pd.DataFrame(pd.Series(movie_genres, name='Genre movies').value_counts().head(10))
genres_freq.reset_index(inplace=True)
genres_freq

	index	Genre movies
0	Drama	355
1	Comedy	345
2	Action	292
3	Thriller	242
4	Adventure	236
5	Sci-Fi	161
6	Romance	159
7	Crime	150
8	Fantasy	120
9	Children	82

# Genre frequency barplot
fig = plt.figure(figsize=(16,6))
ax = sns.barplot(x="index", y="Genre movies", data=genres_freq)
plt.title('Frequency of movies genres', size=18)
plt.xticks(size=12)
plt.xlabel('Movie genre', size=12)
plt.yticks(size=12)
plt.ylabel('Movies count', size=12)
plt.grid(alpha=0.5)
plt.show()

1.4 Top movies

Finally, since this is a dataset of real users, just for fun we can see which popular movies are the best rated and which are the worst rated. One interesting fact is that the Drama genre is frequent among the top rated movies, and Crime movies, even though they are not frequent, are really well rated. On the other hand, Action and Comedy are frequent genres among the worst rated movies.

# Obtaining mean rating for every movie
sorted_movies = ratings.groupby('movieId').mean()
sorted_movies.sort_values(['rating', 'reviews'], ascending=[False, False], inplace = True)
top10movies = list(sorted_movies.head(10).index)
bottom10movies = list(sorted_movies.tail(10).index)

# Top rated movies
movies[movies['movieId'].isin(top10movies)]

	movieId	title	genres	year
277	318	Shawshank Redemption, The (1994)	Crime\|Drama	1994
602	750	Dr. Strangelove or: How I Learned to Stop Worr...	Comedy\|War	1964
659	858	Godfather, The (1972)	Crime\|Drama	1972
686	904	Rear Window (1954)	Mystery\|Thriller	1954
906	1204	Lawrence of Arabia (1962)	Adventure\|Drama\|War	1962
914	1213	Goodfellas (1990)	Crime\|Drama	1990
922	1221	Godfather: Part II, The (1974)	Crime\|Drama	1974
975	1276	Cool Hand Luke (1967)	Drama	1967
2226	2959	Fight Club (1999)	Action\|Crime\|Drama\|Thriller	1999
6315	48516	Departed, The (2006)	Crime\|Drama\|Thriller	2006

# Worst rated movies
movies[movies['movieId'].isin(bottom10movies)]

	movieId	title	genres	year
163	193	Showgirls (1995)	Drama	1995
313	355	Flintstones, The (1994)	Children\|Comedy\|Fantasy	1994
379	435	Coneheads (1993)	Comedy\|Sci-Fi	1993
396	455	Free Willy (1993)	Adventure\|Children\|Drama	1993
607	762	Striptease (1996)	Comedy\|Crime	1996
1174	1562	Batman & Robin (1997)	Action\|Adventure\|Fantasy\|Thriller	1997
1235	1644	I Know What You Did Last Summer (1997)	Horror\|Mystery\|Thriller	1997
1373	1882	Godzilla (1998)	Action\|Sci-Fi\|Thriller	1998
2029	2701	Wild Wild West (1999)	Action\|Comedy\|Sci-Fi\|Western	1999
2860	3826	Hollow Man (2000)	Horror\|Sci-Fi\|Thriller	2000

2. User Item Matrix

Next, we need to create the User Item Matrix (UIM). This is a matrix with all the movies in the dataset as columns and all the users as rows, so every element of the matrix is the rating a user has given to a particular movie. A column (a specific movie) contains all the ratings users have given to that movie, while a row contains all the ratings made by the same user.

User Item Matrix

For that, we start by selecting only the useful information from the ratings dataset ('userId', 'movieId', 'rating') and then use a pivot table to define users as rows, movies as columns and ratings as the values of the matrix. Because most users have not seen most movies, many of the values will be 'NaN'. These 'NaN' values will be replaced by 0 to represent unseen movies, which is unambiguous because the minimum value of the rating scale is 0.5.

# Getting only significant columns for the UIM matrix
uim = ratings[['userId', 'movieId', 'rating']].copy()
uim.head()

	userId	movieId	rating
0	1	1	4.0
1	1	3	4.0
2	1	6	4.0
3	1	47	5.0
4	1	50	5.0

# Creating UIM matrix using pivot
uim = uim.pivot(index='userId', columns='movieId', values='rating')
uim.head()

movieId	1	2	3	5	6	7	10	11	16	17	...	115617	116797	119145	122882	122886	122904	134130	134853	139385	152081
userId
1	4.0	NaN	4.0	NaN	4.0	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	5.0	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	4.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 860 columns

# Filling NA's with 0
uim = uim.fillna(0)
uim.head()

movieId	1	2	3	5	6	7	10	11	16	17	...	115617	116797	119145	122882	122886	122904	134130	134853	139385	152081
userId
1	4.0	0.0	4.0	0.0	4.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	5.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5	4.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 860 columns

3. Similar User Search

Now that each row holds all the reviews made by a particular user, we need a way to measure the similarity between a new user and the users in the dataset. For this we will use cosine similarity.

Cosine similarity measures the similarity between two vectors of an inner product space. It is the cosine of the angle between the two vectors and determines whether they are pointing in roughly the same direction. A vector can represent thousands of attributes; in this case every attribute is the rating of a particular movie. Thus, each user's set of ratings is represented as a vector, which is a single row of our User Item Matrix.

Cosine similarity

The cosine of the angle between two vectors (theta) can also be defined as their dot product divided by the product of their magnitudes. The resulting value changes with the angle between them: as we can see in the picture, when the angle is zero the two vectors overlap, so they are really similar. When the angle is wide, the vectors are very different from each other. Because the ratings are never negative, the cosine of the angle takes values between 0 and 1.

As the cosine similarity gets closer to 1 (angle = 0°), the vectors are more similar.
As the cosine similarity gets closer to 0 (angle = 90°), the vectors are less similar.
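
To make the definition concrete, here is a small worked sketch of cos(theta) = (u · v) / (|u| |v|) using NumPy only. The three rating vectors are hypothetical and the helper name `cosine` is introduced just for this illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([4.0, 0.0, 4.0])  # hypothetical user A
b = np.array([4.0, 0.0, 0.0])  # hypothetical user B: shares one rated movie with A
c = np.array([0.0, 5.0, 0.0])  # hypothetical user C: no rated movie in common with A

print(cosine(a, a))  # a vector compared with itself: approximately 1
print(cosine(a, b))  # partial overlap: 16 / (sqrt(32) * 4), about 0.707
print(cosine(a, c))  # no overlap: 0
```

Users with no rated movie in common always get a similarity of 0, since every term of the dot product vanishes.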

Considering this, we will create a function that iteratively calculates the similarity between the studied user and every user in the UIM matrix. The function returns a list of the similar users and a list of their respective cosine similarities, both sorted from most to least similar to the studied user.

We tested the function with the first user of our dataset. As expected, the greatest similarity is with that same user (position 0), followed by the users at positions 597, 366, 311 and 467. Note that the function identifies users by their row position in the UIM matrix, not by their userId.

from sklearn.metrics.pairwise import cosine_similarity

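# Function for finding similar users
# Receives:
# - user: The ratings made by the user to study
# - uim: The User Item Matrix with all other users to calculate similarity
# Returns:
# - The row positions of all users in the UIM and their cosine similarities,
#   both sorted from most to least similar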
def findSimilarUsers(user, uim):
    similarity = []
    for i,row in enumerate(uim.values):
        cos = cosine_similarity(user, row.reshape(1, -1))[0][0]
        similarity.append([i, cos])
    temp = pd.DataFrame(similarity, columns=['userId', 'similarity'])
    temp = temp.sort_values(by=['similarity'], ascending=False).copy()
    similar_users = list(temp['userId'].values)
    similarities = list(temp['similarity'].values)

    return (similar_users, similarities)

# Test user is created by selecting the first user of the UIM matrix
user = uim.iloc[0].values.reshape(1, -1)
temp = findSimilarUsers(user, uim)

# The top 5 similar users are:
temp[0][0:5]

[0, 597, 366, 311, 467]

# The cosine similarity obtained respectively for that users are:
temp[1][0:5]

[0.9999999999999998,
 0.4855035808289292,
 0.4784148761010285,
 0.4756257951946654,
 0.4717509600003482]
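
The row-by-row loop above can also be expressed as a single matrix operation. The sketch below is an alternative written with NumPy only, not the project's implementation. The name `rank_by_cosine` and the toy matrix are assumptions made for illustration, and rows without any rating are given a similarity of 0.

```python
import numpy as np

def rank_by_cosine(user, matrix):
    """Return (row positions, similarities), sorted from most to least similar."""
    user = np.asarray(user, dtype=float).ravel()
    matrix = np.asarray(matrix, dtype=float)
    # Dot product of every row with the studied user, computed in one step
    dots = matrix @ user
    # Product of the magnitudes of each row and of the studied user
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(user)
    # Rows with no ratings have zero norm; leave their similarity at 0
    sims = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    order = np.argsort(-sims, kind="stable")
    return order.tolist(), sims[order].tolist()

# Hypothetical 3-user, 3-movie matrix (0 = unseen)
toy_uim = np.array([
    [4.0, 0.0, 4.0],
    [4.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
])
positions, sims = rank_by_cosine(toy_uim[0], toy_uim)
print(positions)
print(sims)
```

As in the test above, the studied user comes first when they are part of the matrix, and the returned identifiers are row positions rather than userId values.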

4. Recommendation System Implementation

Now we have similar users to draw recommendations from, but how do we select which movies to recommend? For that problem we decided to use a weighted selection.

Once we know which users are similar to the studied user, we first check which movies the studied user has seen and remove them from the ratings of the similar users, in order to avoid recommending movies that the studied user has already seen.

Selecting unseen movies

From the remaining movies, we'll keep only those that the similar users liked, that is, those they rated highly. To do this we need to define a cutoff that decides whether a rating is kept or removed. For the images presented a cutoff of 4 was selected; the code below keeps only ratings strictly greater than 4, that is, 4.5 or 5 stars. This results in a matrix of movies that the studied user hasn't watched, classified into the ones they would probably like (1) and the ones they would not (0).

Filtering movies according to the rating

Classified matrix

Next, we use the cosine similarity calculated for every user as the weight of their recommendations: each similar user's row of candidate movies is multiplied by that user's cosine similarity, which is different for every user.

Cosine similarity

Cosine similarity multiplied

Finally, the recommendation score of every movie is obtained by summing the individual scores of every user for that movie (1.6 for the image above). In other words, since every column of the matrix represents a single movie, the sum of the values in that column gives us the final score. This way, we take into account how many times a movie is rated highly by similar users: if several similar users have given the same movie high ratings, it becomes a highly recommendable movie. The importance (weight) given to the opinion of every user is their similarity to the studied user.

This is a simplification of the process for the sake of understanding, and the actual implementation differs in some details. Nevertheless, the main idea is the one shown above.
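
The steps described above (drop seen movies, keep highly rated ones, weight by similarity, sum per movie) can be sketched in a few NumPy lines. All numbers below are hypothetical, and this is an illustration of the idea rather than the project's function, which additionally multiplies every contribution by 10.

```python
import numpy as np

# Hypothetical ratings of three similar users for four movies (0 = unseen)
neighbours = np.array([
    [5.0, 4.5, 0.0, 3.0],
    [5.0, 0.0, 4.0, 5.0],
    [0.0, 5.0, 2.0, 4.5],
])
similarities = np.array([0.9, 0.7, 0.5])               # hypothetical cosine similarities
seen_by_user = np.array([False, False, False, True])   # the studied user saw the last movie

# 1 where a similar user rated the movie above 4, 0 otherwise
liked = (neighbours > 4).astype(float)
# Movies the studied user has already seen cannot be recommended
liked[:, seen_by_user] = 0.0

# Weight every row by its user's similarity and sum per column (movie)
scores = similarities @ liked
# Column positions sorted from highest to lowest score
ranking = np.argsort(-scores, kind="stable")

print(scores)
print(ranking)
```

In this toy case the first movie scores 0.9 + 0.7 = 1.6 and the second 0.9 + 0.5 = 1.4, so the movie liked by the two most similar users is ranked first.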

# Function that returns the most suitable recommendations of movies
# Requires:
# - user: The ratings made by the user to study
# - uim: The User Item Matrix with all other users to calculate similarity
# - recommendations: Number of expected recommendations
# - analyzed_users: Number of similar users to analyze
# - findSimilarUsers: Function that finds similar users from the dataset

def findSimilarMovies(user, uim, recommendations=10, analyzed_users=10):
    # Looking for movies that the user has already seen
    seen = list(uim.columns[list((user>0)[0])])
    
    # Looking for similar users
    similars = findSimilarUsers(user, uim)
    
    # The DataFrame of results is created
    scores = pd.DataFrame(columns=['movieId', 'score'])

    # For the top similar users (analyzed users) the process gets repeated
    for sim_user, sim_score in zip(similars[0][0:analyzed_users], similars[1][0:analyzed_users]):
        # Dropping movies that the studied user has already seen
        rec_movies = uim.iloc[sim_user].drop(seen)
        # Dropping low rated movies
        rec_index = list(rec_movies[rec_movies>4].index.values)        
        
        if (len(rec_index)>0):
            # For every recommended movie of a particular similar user:
            for movie in rec_index:
                # If the movie is not in the dataframe, it is added to the dataframe
                if (movie not in scores['movieId'].values):
                    scores.loc[len(scores)] = (movie, 10*sim_score)
                else: 
                    # If the movie is already in the dataframe, its score is increased
                    scores.loc[scores['movieId']==movie, 'score'] += 10*sim_score
        # The values are sorted by the score obtained
        scores.sort_values(by='score', ascending=False, inplace=True)
        
        # Only the top scores are kept; there may be fewer than requested
        # if not enough similar users have highly rated unseen movies
        scores = scores.head(recommendations).copy()
        scores['movieId'] = scores['movieId'].astype(int)
    
    # The movie id's and their corresponding scores are returned in lists
    return (list(scores.movieId.values), list(scores.score.values))

# movie Id and Score obtained for that recommendation
ids, scores = findSimilarMovies(user, uim)    
pd.DataFrame({'movieId': ids, 'Score': scores})

	movieId	Score
0	293	14.059126
1	4226	9.519094
2	4973	9.377341
3	112	4.855036
4	6874	4.855036
5	6711	4.855036
6	5669	4.855036
7	4848	4.855036
8	3949	4.855036
9	52973	4.522305

# Looking for what movies belong to those movie ids
movies[movies['movieId'].isin(ids)]

	movieId	title	genres	year
99	112	Rumble in the Bronx (Hont faan kui) (1995)	Action\|Adventure\|Comedy\|Crime	1995
254	293	Léon: The Professional (a.k.a. The Professiona...	Action\|Crime\|Drama\|Thriller	1994
2945	3949	Requiem for a Dream (2000)	Drama	2000
3141	4226	Memento (2000)	Mystery\|Thriller	2000
3544	4848	Mulholland Drive (2001)	Crime\|Drama\|Film-Noir\|Mystery\|Thriller	2001
3622	4973	Amelie (Fabuleux destin d'Amélie Poulain, Le) ...	Comedy\|Romance	2001
4012	5669	Bowling for Columbine (2002)	Documentary	2002
4529	6711	Lost in Translation (2003)	Comedy\|Drama\|Romance	2003
4615	6874	Kill Bill: Vol. 1 (2003)	Action\|Crime\|Thriller	2003
6481	52973	Knocked Up (2007)	Comedy\|Drama\|Romance	2007

5. Final Evaluation

Finally we have a movie recommendation system. We can now generate recommendations, but we still don't know whether they are correct. It's hard to really evaluate whether the system is predicting the right movies or not. The best way to evaluate it is to make predictions for real users and see if the movies are correctly predicted, or to split the data by reducing the UIM matrix, hiding some movies the users have liked and checking whether they are recommended. A simpler way is random genre selection: since the recommendations are not content-based, the system never considers the genre of the movie (it uses only the ratings dataset), so if we create a user who only likes movies of one genre, we expect the recommended movies to belong to that genre too.

Recommendation system

For this part of the project we'll make this prediction for two genres:

Comedy movies
Horror movies

For the creation of the users, 10 random movies were selected that contain the genre "Comedy"/"Horror" in the movie genre description. Another important fact is that the more movies the user rates, the higher the recommendation scores can be; that is, as the user rates more movies the system creates better recommendations. For the cases presented we've only used 10 ratings. As we expected, most of the recommended movies belong to the genre the user liked or to a related one. Nevertheless, there are some different genres in the recommendations, because many movies have multiple genres rather than a single one. In general, though, most of the recommendations are related and of the same kind.

5.1 User that only likes comedy movies

comedy_movies = movies[movies['genres'].str.contains('Comedy', regex=False)].copy()
comedy_movies = comedy_movies.sample(10).copy()
comedy_movies

	movieId	title	genres	year
99	112	Rumble in the Bronx (Hont faan kui) (1995)	Action\|Adventure\|Comedy\|Crime	1995
1727	2321	Pleasantville (1998)	Comedy\|Drama\|Fantasy	1998
1603	2145	Pretty in Pink (1986)	Comedy\|Drama\|Romance	1986
820	1080	Monty Python's Life of Brian (1979)	Comedy	1979
3568	4886	Monsters, Inc. (2001)	Adventure\|Animation\|Children\|Comedy\|Fantasy	2001
8636	119145	Kingsman: The Secret Service (2015)	Action\|Adventure\|Comedy\|Crime	2015
383	440	Dave (1993)	Comedy\|Romance	1993
6016	38061	Kiss Kiss Bang Bang (2005)	Comedy\|Crime\|Mystery\|Thriller	2005
1005	1307	When Harry Met Sally... (1989)	Comedy\|Romance	1989
5938	34162	Wedding Crashers (2005)	Comedy\|Romance	2005

user1 = []
for col in uim.columns:   
    user1.append(5) if (col in list(comedy_movies['movieId'].values))  else user1.append(0)      
user1 = np.array(user1).reshape(1,-1)

print ('User 1 has {} rated movies and {} unseen movies'.format(len(user1[user1==5]), len(user1[user1==0])))

User 1 has 10 rated movies and 850 unseen movies

ids, scores = findSimilarMovies(user1, uim)  
sc = pd.DataFrame({'movieId': ids, 'Score': scores}).copy()
recs = movies[movies['movieId'].isin(ids)].copy()
pd.merge(sc, recs).sort_values('Score', ascending=False)

	movieId	Score	title	genres	year
0	1	5.089188	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1995
1	356	3.815754	Forrest Gump (1994)	Comedy\|Drama\|Romance\|War	1994
2	6942	2.687132	Love Actually (2003)	Comedy\|Drama\|Romance	2003
3	7451	2.687132	Mean Girls (2004)	Comedy	2004
4	1215	2.652919	Army of Darkness (1993)	Action\|Adventure\|Comedy\|Fantasy\|Horror	1993
5	8641	2.652000	Anchorman: The Legend of Ron Burgundy (2004)	Comedy	2004
6	6188	2.652000	Old School (2003)	Comedy	2003
7	88163	1.525983	Crazy, Stupid, Love. (2011)	Comedy\|Drama\|Romance	2011
8	1265	1.525983	Groundhog Day (1993)	Comedy\|Fantasy\|Romance	1993
9	1258	1.503717	Shining, The (1980)	Horror	1980

5.2 User that only likes horror movies

horror_movies = movies[movies['genres'].str.contains('Horror', regex=False)].copy()
horror_movies = horror_movies.sample(10).copy()
horror_movies

	movieId	title	genres	year
2078	2762	Sixth Sense, The (1999)	Drama\|Horror\|Mystery	1999.0
6630	56174	I Am Legend (2007)	Action\|Horror\|Sci-Fi\|Thriller\|IMAX	2007.0
2641	3535	American Psycho (2000)	Crime\|Horror\|Mystery\|Thriller	2000.0
920	1219	Psycho (1960)	Crime\|Horror	1960.0
5335	8874	Shaun of the Dead (2004)	Comedy\|Horror	2004.0
1997	2657	Rocky Horror Picture Show, The (1975)	Comedy\|Horror\|Musical\|Sci-Fi	1975.0
915	1214	Alien (1979)	Horror\|Sci-Fi	1979.0
1083	1407	Scream (1996)	Comedy\|Horror\|Mystery\|Thriller	1996.0
957	1258	Shining, The (1980)	Horror	1980.0
2027	2699	Arachnophobia (1990)	Comedy\|Horror	1990.0

user2 = []
for col in uim.columns:   
    user2.append(5) if (col in list(horror_movies['movieId'].values))  else user2.append(0)      
user2 = np.array(user2).reshape(1,-1)

print ('User 2 has {} rated movies and {} unseen movies'.format(len(user2[user2==5]), len(user2[user2==0])))

User 2 has 10 rated movies and 426 unseen movies

ids, scores = findSimilarMovies(user2, uim)  
sc = pd.DataFrame({'movieId': ids, 'Score': scores}).copy()
recs = movies[movies['movieId'].isin(ids)].copy()
pd.merge(sc, recs).sort_values('Score', ascending=False)

	movieId	Score	title	genres	year
0	1387	9.695435	Jaws (1975)	Action\|Horror	1975.0
1	253	9.543883	Interview with the Vampire: The Vampire Chroni...	Drama\|Horror	1994.0
2	1215	7.747767	Army of Darkness (1993)	Action\|Adventure\|Comedy\|Fantasy\|Horror	1993.0
3	1997	5.851641	Exorcist, The (1973)	Horror\|Mystery	1973.0
4	5952	3.922561	Lord of the Rings: The Two Towers, The (2002)	Adventure\|Fantasy	2002.0
5	1252	3.900174	Chinatown (1974)	Crime\|Film-Noir\|Mystery\|Thriller	1974.0
6	1222	3.891607	Full Metal Jacket (1987)	Drama\|War	1987.0
7	778	2.219149	Trainspotting (1996)	Comedy\|Crime\|Drama	1996.0
8	3471	2.028370	Close Encounters of the Third Kind (1977)	Adventure\|Drama\|Sci-Fi	1977.0
9	79132	1.863237	Inception (2010)	Action\|Crime\|Drama\|Mystery\|Sci-Fi\|Thriller\|IMAX	2010.0
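
The genre check in sections 5.1 and 5.2 can be turned into a number by counting how many recommendations carry the target genre. The sketch below is one simple way to do that; the function name `genre_hit_rate` is an assumption introduced here, and the genre strings are copied from the two recommendation tables above.

```python
def genre_hit_rate(genre_strings, target):
    """Fraction of movies whose pipe-separated genre string contains the target genre."""
    if not genre_strings:
        return 0.0
    hits = sum(target in genres.split("|") for genres in genre_strings)
    return hits / len(genre_strings)

# Genres of the ten recommendations for the comedy user (section 5.1)
comedy_recs = [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Comedy|Drama|Romance|War",
    "Comedy|Drama|Romance",
    "Comedy",
    "Action|Adventure|Comedy|Fantasy|Horror",
    "Comedy",
    "Comedy",
    "Comedy|Drama|Romance",
    "Comedy|Fantasy|Romance",
    "Horror",
]

# Genres of the ten recommendations for the horror user (section 5.2)
horror_recs = [
    "Action|Horror",
    "Drama|Horror",
    "Action|Adventure|Comedy|Fantasy|Horror",
    "Horror|Mystery",
    "Adventure|Fantasy",
    "Crime|Film-Noir|Mystery|Thriller",
    "Drama|War",
    "Comedy|Crime|Drama",
    "Adventure|Drama|Sci-Fi",
    "Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX",
]

print(genre_hit_rate(comedy_recs, "Comedy"))  # 0.9
print(genre_hit_rate(horror_recs, "Horror"))  # 0.4
```

For these two runs, nine of the ten comedy recommendations include the Comedy genre, while four of the ten horror recommendations include Horror. Since each user was built from a random sample of 10 movies, these figures will vary from run to run.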
