
ada-2021-project-pyca's Introduction

Box-office and Quotebank

Data story

Our Data story is hosted on https://axgoujon.github.io/adatastory/.

It shows and explains our thought process and our findings for this project.

Note

As per https://ada2021epfl.zulipchat.com/#narrow/stream/302232-Project/topic/Deliverable.20notebook, we chose to use several notebooks. Their use is described below.

Abstract

Recent research has investigated the impact of attributes (e.g., budget, release time) on predicting box-office revenues [1]. In particular, lead actors have been considered one of the critical drivers of success in the motion picture industry [2].

However, an important attribute in forecasting box-office receipts has remained mostly unnoticed: what the main film crew says in media coverage, and how they say it. Not only can the fame of a lead actor or director influence box-office revenue, but the public statements of the main film crew might also shape the audience's attitude and intention to watch a movie.

One salient research question in our study is whether the main film crew's quotes in the press trigger an increase in box-office revenue over time. We will apply different machine learning methods with cross-validation to show which characteristics of the main film crew's quotes might forecast box-office results.

Introduction

The motion picture industry has become a roaring success, reaching an all-time high of 42 billion U.S. dollars at the global box office in 2019 [3]. In the U.S. and Canada, box-office receipts exceeded 10 billion U.S. dollars per year over 2015-2019 [4].

Figure: Global box office revenue per year by format

In recent years, numerous studies have used automated methods to uncover which attributes might predict the financial success of motion pictures after their release, and why some movies become "hits" or "flops" [5][6][7].

In our study, we explore the role of claims from the main film crew in predicting box-office receipts in 2015-2019, following the quotes over time (i.e., before and after the release date). More specifically, we address the following key questions:

  • Does the quantity of quotes from the main film crew provide a boost to box-office revenue?
  • How much does the fame of the speaker influence the box-office revenue?
  • Which textual factors (e.g., sentiment polarity, the number of topics) from quotes might influence the financial performance of movies?
  • Which model (e.g., logistic regression, SVM) has better predictive performance in forecasting the box-office receipts?

Datasets Used

In our study, we combine the quotation-centric version of the Quotebank database [8] with the IMDb movie information database [9] and box-office receipts from Box Office Mojo [10].

First, the IMDb movie database provides six types of movie data: (i) movie title; (ii) genre, such as comedy or fantasy; (iii) release year (we select movies released in 2015-2020); (iv) region (we focus on the U.S. and the U.K.); (v) runtime (in minutes); (vi) the main people involved in the movie.

Second, Box Office Mojo reports the financial performance of movies: (i) total gross revenue (in U.S. dollars); (ii) box-office ranking; (iii) release date and year; (iv) opening week gross; (v) opening week gross percentage.

We then merge the two datasets on movie name and release year.
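The merge on movie name and release year can be sketched as follows. This is a minimal illustration using plain dictionaries, not the code from the notebooks: the field names (`title`, `year`, `genre`, `total_gross`), the sample rows, and the title normalization are all assumptions made for the example.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace,
    so that small spelling differences between sources still match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_movies(imdb_rows, mojo_rows):
    """Inner-join two lists of dicts on (normalized title, release year)."""
    mojo_index = {
        (normalize_title(row["title"]), row["year"]): row for row in mojo_rows
    }
    merged = []
    for row in imdb_rows:
        key = (normalize_title(row["title"]), row["year"])
        match = mojo_index.get(key)
        if match is not None:
            # On shared keys (e.g. "title") the IMDb value wins.
            merged.append({**match, **row})
    return merged


# Hypothetical rows, for illustration only.
imdb = [
    {"title": "The Example: Part One", "year": 2016, "genre": "Comedy"},
    {"title": "Another Example", "year": 2017, "genre": "Drama"},
]
mojo = [
    {"title": "The Example - Part One", "year": 2016, "total_gross": 1000000},
    {"title": "Another Example", "year": 1999, "total_gross": 5000},
]
print(merge_movies(imdb, mojo))
```

Joining on the year as well as the title avoids pairing a movie with an older release that shares its name, as with the second hypothetical row above.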

Finally, we extract the key people in each movie (e.g., actors and actresses, director, and producer) in order to retrieve their quotes from the Quotebank database [8]. In doing so, we assume that the spread of a quote from a key figure associated with a movie might influence the financial performance of that movie.

For the top 50 movies in terms of box-office revenue, we also extract all the quotes that mention these movies, or a term related to them, in the 25 weeks around the release date.

See movieDataSetBuilder.ipynb, movieBoxOffice.ipynb and linkQuotesToMovies.ipynb to extract and merge these datasets.
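As a rough sketch of the time-window selection, the snippet below keeps the quotes dated within a given number of weeks of a release and counts them per week relative to the release. It assumes the window means 25 weeks on either side of the release date (the text does not say whether the window is symmetric), and the `date` key on each quote record is a hypothetical name.

```python
from collections import Counter
from datetime import date, timedelta


def weeks_from_release(quote_date, release_date):
    """Signed week offset of a quote: 0 is the release week,
    negative values are weeks before the release."""
    return (quote_date - release_date).days // 7


def quotes_in_window(quotes, release_date, weeks=25):
    """Keep quotes dated at most `weeks` weeks before or after the release."""
    span = timedelta(weeks=weeks)
    return [q for q in quotes if abs(q["date"] - release_date) <= span]


def weekly_quote_counts(quotes, release_date):
    """Count quotes per week offset relative to the release date."""
    return Counter(weeks_from_release(q["date"], release_date) for q in quotes)


# Hypothetical release date and quote dates, for illustration only.
release = date(2017, 6, 2)
quotes = [
    {"date": release - timedelta(days=10)},
    {"date": release + timedelta(days=3)},
    {"date": release + timedelta(days=200)},  # outside the window
]
kept = quotes_in_window(quotes, release)
print(weekly_quote_counts(kept, release))
```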

Methods

Volume of quotes and box office revenue: In our preliminary analysis, we plot how the number of quotes from the main film crew changes around the release date. The number of quotes in media coverage peaks within one week after the release, which suggests that the main crew engages in frequent media exposure to promote the movie around its release date. Furthermore, the Spearman correlation plot shows a significant positive correlation between box-office revenue and the number of main-crew quotes, and the relationship seems to follow some sort of power law.

Below are some early artifacts produced by our analysis. They show correlations between the number of quotes and the time to the release date, and between the number of quotes and box-office success.
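For readers unfamiliar with the Spearman correlation, here is a small self-contained sketch. It is not the notebook's code (in practice `scipy.stats.spearmanr` is the usual choice), and the sample numbers are made up. Spearman's coefficient is the Pearson correlation of the ranks, so it stays high for any monotonic relationship, including a power law.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up numbers: revenue grows as a power of the quote count.
quote_counts = [1, 2, 3, 4, 5]
revenues = [c ** 2 for c in quote_counts]
print(spearman(quote_counts, revenues))
```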

Figures: Press Activity

See analysisQuote.ipynb for the preliminary analysis.

Percentage of box office revenue after first week release: We split the movies into the three categories below, based on the percentage of opening revenue over total revenue:

  • (i) high % first WE (the percentage of opening revenue over total revenue is greater than the two-thirds quantile);
  • (ii) intermediate score (the percentage of opening revenue over total revenue is between the one-third and the two-thirds quantiles);
  • (iii) high % after first WE (the percentage of opening revenue over total revenue is less than the one-third quantile).
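The three-way split can be sketched with tercile cut points from the standard library. This is an illustration under two assumptions: that a movie whose opening share lies above the two-thirds quantile belongs to the "high % first WE" group, and that `opening_shares` (a hypothetical name) holds opening revenue divided by total revenue for each movie.

```python
from statistics import quantiles


def opening_share_groups(opening_shares):
    """Label each movie by the tercile of its opening-revenue share."""
    # n=3 returns the two cut points (one-third and two-thirds quantiles).
    low_cut, high_cut = quantiles(opening_shares, n=3)
    labels = []
    for share in opening_shares:
        if share > high_cut:
            labels.append("high % first WE")
        elif share < low_cut:
            labels.append("high % after first WE")
        else:
            labels.append("intermediate score")
    return labels


# Made-up opening shares, for illustration only.
shares = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
for share, label in zip(shares, opening_share_groups(shares)):
    print(share, label)
```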

Sentiment analysis and semantic analysis: We plan to pre-process the text using NLP libraries (namely spaCy, NLTK, gensim and scikit-learn). First, we will compute a sentiment polarity score for each quote with the dictionary-based package Afinn. Then, for movie-related quotes, we will compute values for six lexical categories ("warmth", "fun", "love", "emotional", "disappointment", "hate") using the Empath package.
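To illustrate what a dictionary-based polarity score does, the sketch below sums per-word valences over a quote. The tiny lexicon is a made-up stand-in, not the AFINN word list; with the real package, `Afinn().score(text)` plays the role of `polarity` here.

```python
import re

# Toy valence lexicon, for illustration only (NOT the AFINN word list).
TOY_LEXICON = {
    "love": 3,
    "great": 3,
    "fun": 2,
    "disappointing": -2,
    "boring": -2,
    "hate": -3,
}


def polarity(text, lexicon=TOY_LEXICON):
    """Sum the valence of every known word; unknown words count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


print(polarity("I love this cast, it was so much fun!"))
print(polarity("A boring script. I hate it."))
```

A positive total suggests a favorable quote and a negative total an unfavorable one. Word-level lexicons ignore negation and context, so scores for individual quotes are noisy and are more informative when aggregated over many quotes.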

Project Timeline

Week 1 (8 Nov-14 Nov): Project proposal, web scraping all available datasets, initial descriptive analysis.

Week 2 (22 Nov-28 Nov): Data cleaning, feature selection for all variables, compute textual characteristics of the quotes (e.g., sentiment polarity, semantic analysis), compute percentage of opening revenue relative to total revenue for box-office.

Week 3 (29 Nov-5 Dec): Visualize data and test for statistical significance.

Week 4 (6 Dec-12 Dec): Wrap up results, and write down data stories.

Week 5 (13 Dec-17 Dec): Double check code and prepare the final storytelling about our data results.

Organization within the team

  • Alex: Web scraping datasets, initial data analysis, pre-process datasets, data visualization
  • Christos: Run tests, prepare data story
  • Pierre: Develop algorithm, feature engineering, code quality
  • Yiming: Analyze quote text using NLP methods, write project data story

Organization of the repository

Project hierarchy

 .
 ├── moviePreprocessing/
 │   ├── movieBoxOffice.ipynb
 │   └── movieDataSetBuilder.ipynb
 │
 ├── mergeDataSets/
 │   └── linkQuotesToMovies.ipynb
 │
 ├── analysis/
 │   ├── analysisQuote.ipynb
 │   ├── sentimentAnalysis.ipynb
 │   └── Text_Analysis.ipynb
 │
 └── data/
     │   # Generated by `movieDataSetBuilder.ipynb`
     ├── movie_data_2015_2020.csv
     │
     │   # Generated by `movieBoxOffice.ipynb`
     ├── boxoffice.csv
     │
     │   # Downloaded from `QuoteBank`
     ├── quotes-2015.json.bz2
     ├── quotes-2016.json.bz2
     ├── quotes-2017.json.bz2
     ├── quotes-2018.json.bz2
     ├── quotes-2019.json.bz2
     ├── quotes-2020.json.bz2
     │
     │   # Downloaded from `IMDb`
     ├── name.basics.tsv.gz
     ├── title.akas.tsv.gz
     ├── title.basics.tsv.gz
     ├── title.crew.tsv.gz
     ├── title.episode.tsv.gz
     ├── title.principals.tsv.gz
     ├── title.ratings.tsv.gz
     │
     │   # Generated by `linkQuotesToMovies.ipynb`
     ├── movie_2015_crew_quotes.csv.gz
     ├── movie_2016_crew_quotes.csv.gz
     ├── movie_2017_crew_quotes.csv.gz
     ├── movie_2018_crew_quotes.csv.gz
     ├── movie_2019_crew_quotes.csv.gz
     ├── movie_2020_crew_quotes.csv.gz
     │
     │   # Generated by `sentimentAnalysis.ipynb`
     │   # and `Text_Analysis.ipynb`
     ├── 50movies_sentiment_emotion.csv.gz
     ├── 50movies_sentiment_polarity.csv.gz
     ├── 50moviesquotes.csv.gz
     ├── text_clean.csv.gz
     └── text_final.csv.gz

All datasets, including the intermediate ones (so that you do not have to re-run everything), can be found at https://drive.google.com/drive/folders/1Kwv7boEYxS1DRev6KCIJLhoYiBRBMHUV?usp=sharing.

Pre-existing datasets

The quotes-{year}.json.bz2 datasets come from the Quotebank database [8].

The title.{name}.tsv.gz and name.basics.tsv.gz datasets come from the IMDb movie information database [9].

Creation of our own datasets

The movieDataSetBuilder.ipynb notebook merges the IMDb movie information database [9] and the box-office receipts from Box Office Mojo [10] to generate the movie_data_2015_2020.csv dataset, which contains all the information we need about each movie, including its box-office results.

The linkQuotesToMovies.ipynb notebook filters the Quotebank database [8] to keep only the quotes whose speaker appears in the previously generated movie_data_2015_2020.csv, and generates the movie_{year}_crew_quotes.csv.gz datasets.

Note: All movie_{year}_crew_quotes.csv.gz datasets (containing the quotes whose speaker is in the main crew of a 2015-2020 movie) can be found in the Google Drive folder linked above, to avoid running the notebooks (which takes one hour).
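A speaker filter over one Quotebank file could look like the sketch below. It is not the notebook's implementation. It assumes one JSON object per line with a `speaker` field, and it matches on lowercased names; the function name and the sample records are hypothetical.

```python
import bz2
import json
import os
import tempfile


def filter_quotes_by_speaker(path_in, speakers):
    """Yield the quote records whose speaker is in `speakers`,
    streaming the compressed file line by line to keep memory use low."""
    wanted = {name.lower() for name in speakers}
    with bz2.open(path_in, "rt", encoding="utf-8") as source:
        for line in source:
            record = json.loads(line)
            speaker = record.get("speaker") or ""
            if speaker.lower() in wanted:
                yield record


# Build a tiny sample file with made-up quotes, then filter it.
sample = [
    {"quoteID": "q1", "quotation": "We had fun on set.", "speaker": "Jane Doe"},
    {"quoteID": "q2", "quotation": "No comment.", "speaker": "None"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quotes-sample.json.bz2")
    with bz2.open(path, "wt", encoding="utf-8") as out:
        for row in sample:
            out.write(json.dumps(row) + "\n")
    for quote in filter_quotes_by_speaker(path, ["Jane Doe"]):
        print(quote["quoteID"], quote["quotation"])
```

Matching on names alone can confuse people who share a name, so a stricter variant would compare unique identifiers instead, where both datasets provide them.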

The other datasets are intermediate results that have been exported to avoid having to run everything from scratch.

Analyses

The analysisQuote.ipynb notebook contains the preliminary analysis. It revealed a few interesting correlations and produces some of the plots shown in this README.

The Text_Analysis.ipynb and sentimentAnalysis.ipynb notebooks contain the sentiment analysis computations. They also produce plots that show the relations between sentiments extracted from quotes and box-office results.

Footnotes

  1. https://arxiv.org/abs/1506.05382

  2. https://journals.sagepub.com/doi/10.1509/jmkg.71.4.102

  3. https://www.billboard.com/articles/news/8547827/2019-global-box-office-revenue-hit-record-425b-despite-4-percent-dip-in-us

  4. https://www.statista.com/statistics/259987/global-box-office-revenue

  5. Buzz et recommandations sur Internet: quels effets sur le box-office?

  6. Blogs, Advertising, and Local-Market Movie Box Office Performance

  7. Predicting box-office success of motion pictures with neural networks

  8. QuoteBank datasets

  9. IMDb datasets

  10. Box Office Mojo datasets

Contributors

ax-goujon, pgimalac, photonampere, yiming-li3008, yo-art7
