Git Product home page Git Product logo

movie-classifier's Introduction

movie-classifier

Introduction

Multi-label classification problem involving prediction of a movie's genres given a title and a text description.

Dataset

MovieLens dataset.

Approach

The aim of this project is to build a ML system capable of assigning a set of label tags based on a name and a natural language text. Since a movie can have multiple genres ([Action, Comedy, Drama]), we cannot threat the problem as a multi-class classification problem. Instead, we can simplify it by generalising the problem to multiple binary classifications: commonly known as the one-vs-rest approach.

The scikit-learn library has been chosen for this project because it enables quick prototyping by offering a wide range of ML algorithms and a easy to use API. In particular, the implemented model (OvRModel) is a wrapper of the sklearn.multiclass.OneVsRestClassifier strategy. The advantage of this method is that it can use any type of classifier that inherits the sklearn's BaseEstimator class.

Furthermore, special attention has been given to the extendibility of the project: the use of an abstract class for the Model makes the integration of future ML algorithms a very straightforward task.

In order to transform the raw text data into useful features, we must apply a series of processing techniques which include:

  • Convert to lowercase.
  • Remove non-ASCII characters.
  • Remove special characters like punctuation and extra spaces.
  • Remove common stopwords.
  • Convert numbers to text.
  • Lemmatisation.

Then the TF-IDF algorithm is used to vectorize the transformed text, obtaining a feature vector that can be used to train the model.

The base estimator used in this project is the logistic regression, which achieves the following results (micro average) using a threshold value of 0.2:

Precision Recall F1-score
0.512 0.701 0.592

Installation

The requirements.txt file lists all the required libraries, to install them:

pip install -r requirements.txt

Download the nltk corpus:

python install_nltk_corpus.py

Run the setup.py to install the packages:

pip install .

Docker

Alternatively, you can run a docker container. First, build the image from the Dockerfile:

docker image build -t pylearn .

Then run the container using:

docker run -it -v $pwd'':/home/jovyan/work --name movie-classifier-app pylearn

Note: add -p 8888:8888 if you want to run jupyter notebooks.

Usage

The following commands assumes that the current directory is /src.

Inference

Run the script movie_classifier.py to make predictions:

python movie_classifier.py --title "The Shawshank Redemption" --description "In 1947 Portland, Maine, banker Andy Dufresne is convicted of murdering his wife and her lover and is sentenced to two consecutive life sentences at the Shawshank State Penitentiary. He is befriended by Ellis Red Redding, an inmate and prison contraband smuggler serving a life sentence."

Data preprocessing

To clean the raw dataset (as csv), use the prepare_data.py script:

python prepare_data.py

Note: use -f "PATH" to specify the raw dataset, -s "PATH" to indicate where to save.

Training

Although a trained model is included in this repository (models/model.hal), you can train a new one by running the training.py script:

python training.py

Note: use -f "PATH" to specify the cleaned dataset, -s "PATH" to indicate where to save the model, and --testsize # to set the proportion of the test set.

Testing (unittest)

You can run the unittests from the project root direcotory using:

python -m unittest tests/test_*

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.