Git Product home page Git Product logo

k-nearest-neighbors-project-tutorial's Introduction

K-Nearest neighbors - Step by step guide

  • Understanding a new dataset.
  • Model the data using a KNN.
  • Analyze the results and optimize the model.

๐ŸŒฑ How to start this project

Follow the instructions below:

  1. Create a new repository based on machine learning project by clicking here.
  2. Open the newly created repository in Codespace using the Codespace button extension.
  3. Once the Codespace VSCode has finished opening, start your project by following the instructions below.

๐Ÿš› How to deliver this project

Once you have finished solving the exercises, be sure to commit your changes, push them to your repository, and go to 4Geeks.com to upload the repository link.

๐Ÿ“ Instructions

Movie recommendation system

Would we be able to predict which movies might or might not be a commercial success? This dataset collects part of the knowledge from the API TMDB, which contains only 5000 movies out of the total number. The following resources are available:

  • tmdb_5000_movies:
https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv
  • tmdb_5000_credits:
https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv

Step 1: Loading the dataset

We must load the two files and store them in two separate data structures (Pandas DataFrames). On one side, we will have stored the information about the movies and their credits.

Step 2: Creation of a database

Create a database to store the two DataFrames in separate tables. Then join the two tables with SQL (and integrate it with Python) to generate a third table containing information from both tables unified. The key through which the join can be done is the title of the movie (title).

Now, clean the generated table and leave only the following columns:

  • movie_id
  • title
  • overview
  • genres
  • keywords
  • cast
  • crew

Step 3: Transform the data

As you can see, there are some JSON formatted columns. From each of the JSONs, select the name attribute and replace the genres and keywords columns. For the cast column, select the first three names.

The only columns left to modify are crew (team) and overview (summary). For the first column, convert it to contain the name of the director. For the second, convert it to a list.

Once we have finished processing the columns and the recommendation model is not confused, for example, between Jennifer Aniston and Jennifer Conelly, we will remove the spaces between the words. Apply this function to the columns genres, cast, crew and keywords.

Finally, we will reduce our dataset by combining all of our previous converted columns into a single column called tags (which we will create). This column will now have all the elements separated by commas and then we will replace them with blanks. It should look something like this:

new_df["tags"][0]

>>>>"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron"

Step 4: Build a KNN

To solve this problem we will create our own KNN. The first thing to do is to vectorize the text, following the same steps you learned in the previous lesson.

Once you have done that, we will have to choose a distance to compare texts. In this module we have seen a few, and the only one compatible with what we want to do is the cosine distance:

from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(vectors)

With this code, we can see the similarity between our vectors (vector representations of the tags column).

Finally, we can design our similarity function based on the cosine distance. Our proposal is as follows:

def recommend(movie):
    movie_index = new_df[new_df["title"] == movie].index[0]
    distances = similarity[movie_index]
    movie_list = sorted(list(enumerate(distances)), reverse = True , key = lambda x: x[1])[1:6]
    
    for i in movie_list:
        print(new_df.iloc[i[0]].title)

In such a way that we would return the 5 movies most similar to the one we enter in the title. We could use it as follows:

recommend("Enter a film name")

Note: We also incorporated the solution samples on ./solution.ipynb that we strongly suggest you only use if you are stuck for more than 30 min or if you have already finished and want to compare it with your approach.

k-nearest-neighbors-project-tutorial's People

Contributors

danielaaz04 avatar alesanchezr avatar josemoracard avatar cvazquezlos avatar lorenagubaira avatar tommygonzaleza avatar arnaldoperez avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.