ramos-iyer / prediction-of-box-office-profit-of-movies-using-hierarchical-clustering-random-forest-ensemble

This research paper aims to predict the box office profit of a movie using an ensemble technique based on clustering and random forests. The motivation is the need for better predictive techniques in the movie industry, owing to its unpredictable nature. The literature survey found metadata and social data to be important factors in predicting a movie's box-office success. The clustering and random forest ensemble achieved an accuracy of 78 percent on the training data and 62 percent on unseen test data, with a correlation of 78 percent between actual and predicted values. The variable importance chart showed budget, budget year ratio, release month, run time, genre, collection and review sentiment score to be influential in the profitability of a movie. The business use cases for this model and data are mainly for on-demand video service providers and television service companies.

R 80.33% Python 19.67%

Prediction-of-box-office-profit-of-movies-using-hierarchical-clustering-random-forest-ensemble

Masters in Data Analytics Project

Project: Prediction of box-office profit of movies using hierarchical clustering random forest ensemble

Overview

This project aims to predict the box office profit of a movie using an ensemble technique based on clustering and random forests. The motivation is the need for better predictive techniques in the movie industry, owing to its unpredictable nature. The literature survey found metadata and social data to be important factors in predicting a movie's box-office success. The clustering and random forest ensemble achieved an accuracy of 78 percent on the training data and 62 percent on unseen test data, with a correlation of 78 percent between actual and predicted values. The variable importance chart showed budget, budget year ratio, release month, run time, genre, collection and review sentiment score to be influential in the profitability of a movie. The business use cases for this model and data are mainly for on-demand video service providers and television service companies.

Components

There are three components to this project:

Data Extraction and Merging

File 'Visualization.py' :

  • Extracts the movie metadata and movie reviews dataset.
  • Performs sentiment analysis on movie reviews.
  • Merges the two datasets to form the final dataset.
  • Exports the merged data into a .csv file.
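The four steps above can be sketched roughly as follows. This is a minimal illustration and not the code in 'Visualization.py': it uses only the Python standard library to stay dependency-free, whereas the project lists pandas among its prerequisites. The word lists, the field names (`id`, `movie_id`, `content`, `review_sentiment`) and the sample records are all assumptions made up for the example, and the toy lexicon scorer merely stands in for whichever sentiment method the project actually uses.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical word lists, used only to keep the sketch self-contained.
POSITIVE = {"great", "good", "excellent", "fun"}
NEGATIVE = {"bad", "boring", "poor", "dull"}


def sentiment_score(text):
    """Return (positive words - negative words) divided by total words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def merge_movies_and_reviews(movies, reviews):
    """Attach the mean review sentiment to each movie (left join on movie id)."""
    scores = defaultdict(list)
    for review in reviews:
        scores[review["movie_id"]].append(sentiment_score(review["content"]))
    merged = []
    for movie in movies:
        movie_scores = scores.get(movie["id"], [])
        mean = sum(movie_scores) / len(movie_scores) if movie_scores else None
        merged.append({**movie, "review_sentiment": mean})
    return merged


def to_csv(rows):
    """Serialise the merged rows to CSV text; missing scores become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample records standing in for the extracted JSON data.
movies = json.loads(
    '[{"id": 1, "title": "A", "budget": 100}, {"id": 2, "title": "B", "budget": 50}]'
)
reviews = json.loads(
    '[{"movie_id": 1, "content": "Great fun!"}, {"movie_id": 1, "content": "Boring plot."}]'
)

merged = merge_movies_and_reviews(movies, reviews)
print(to_csv(merged))
```

A left join is used here so that movies without any reviews are kept with an empty sentiment value; whether the project keeps or drops such movies is not stated.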

Data Visualizations

File 'Visualizations.pbix' :

  • Imports the merged dataset.
  • Creates visualizations using Microsoft PowerBI.
  • Exports the visualizations into images.

Data Transformations and application of predictive model

File 'Prediction Model.r' :

  • Imports the merged dataset.
  • Applies transformations to the data to extract more information.
  • Applies the predictive model to the data.
  • Generates evaluation metrics.
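The two metrics quoted in the Overview, accuracy and the correlation between actual and predicted values, can be computed along the lines below. This is a sketch under assumptions and not the code in 'Prediction Model.r': it assumes profit is encoded as ordered integer classes, which the README does not state, and the function names and sample values are invented for illustration.

```python
import math


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual class exactly."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up profit classes (1 = lowest band, 4 = highest); not project results.
actual = [1, 2, 3, 4, 2, 3]
predicted = [1, 2, 3, 3, 2, 4]

print(f"accuracy:    {accuracy(actual, predicted):.2f}")
print(f"correlation: {pearson_correlation(actual, predicted):.2f}")
```

Reporting both numbers is informative for ordered classes: accuracy only counts exact matches, while the correlation also rewards predictions that land in a neighbouring class.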

Running the Code

The data extraction and merging code in 'Visualization.py' is written in Python and requires Python, along with any IDE that supports Python code, in order to run.

The 'Visualizations.pbix' file contains visualizations created in PowerBI from the merged dataset. PowerBI must be installed in order to open and view them.

The code in 'Prediction Model.r' should be opened in RStudio and can be run as a whole or line by line. Comments in the code describe what each chunk does to the data.

Run the files in the following sequence:

  • Visualization.py
  • Visualizations.pbix (Optional, as it contains only visualizations)
  • Prediction Model.r

System Configuration Steps

The following are required in order to run the code:

  • Python and a Python IDE: The data extraction and merging code is written in Python, so Python and any IDE are required to execute it. The following packages are prerequisites (json ships with the Python standard library; the others must be installed):

requests, json, pandas, numpy

  • PowerBI: The visualizations were created in PowerBI, so PowerBI is required to open them.
  • R and RStudio: The prediction model is developed in R, so both R and RStudio must be installed in order to open and execute the file. The following packages must be installed before the code is executed:

tidyverse, plotly, ggthemes, viridis, corrplot, gridExtra, VIM, lubridate, randomForest, caret, psych, RWeka, car, MLmetrics, cluster, clValid, StatMatch
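Before running 'Visualization.py', the Python prerequisites listed above can be checked with a short script such as the one below. It is a convenience sketch and not part of the repository; the helper name `missing_packages` is hypothetical. It relies on `importlib.util.find_spec`, which looks a module up without importing it.

```python
import importlib.util

# Python packages named in the prerequisites list above.
REQUIRED_PACKAGES = ["requests", "json", "pandas", "numpy"]


def missing_packages(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required Python packages are available.")
```

The R packages would need a separate check inside R itself.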

File Descriptions

The following files and folders are part of the project implementation:

  1. Code:
  • Visualization.py: Extracts the movie metadata and review data from the API and merges them to form the final dataset.
  • Prediction Model.r: Imports the merged dataset and applies the predictive model to it.
  2. Visualization Dataset:
  • TMDB Movie Reviews.xlsx: Contains the movie reviews data
  • TMDB Movies.xlsx: Contains the movie metadata
  • Visualizations.pbix: Contains the visualizations created in PowerBI
  3. DAPA.json: Contains the movie metadata in .json format
  4. DAPA_Reviews.json: Contains the movie reviews data in .json format
  5. TMDB Movies Full.csv: Contains the merged dataset used for model application

Credits and Acknowledgements

  • TMDB for providing the API to extract the data used for this project.
  • NCI for setting a challenging project as part of the subject 'Domain Applications of Predictive Analytics' in their full-time master's course in data analytics.
