This is an ongoing project that uses data on a large number of films (budget, domestic box-office gross, worldwide box-office gross, etc.) to build a linear model reflecting patterns in film performance. The original project guidelines from Metis are described below. The primary twist is to look not only at a film's gross, but also at the net profit the studio earns once the film's budget is accounted for.
Current Status: I am brainstorming an approach that would require more manual data entry, since no reliable source for this data currently exists: categorizing films as completely original, sequels, reboots, based on existing source material, etc., and then building linear models on top of this categorization, ideally to find what best determines the success of a completely original film.
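As a rough illustration of the idea, the sketch below computes net profit as gross minus budget and averages it per category. The column names, category labels, and all figures are made-up placeholders rather than real box-office data, and the simple subtraction ignores costs such as marketing, so treat it as one possible starting point and not the project's actual pipeline.

```python
import pandas as pd

# Made-up placeholder figures (in millions of USD), not real box-office data.
# Column names and category labels are assumptions for illustration only.
films = pd.DataFrame(
    {
        "title": ["Film A", "Film B", "Film C", "Film D"],
        "category": ["original", "sequel", "original", "reboot"],
        "budget": [50.0, 150.0, 20.0, 100.0],
        "worldwide_gross": [120.0, 400.0, 15.0, 180.0],
    }
)

# Net profit in its simplest form: worldwide gross minus budget.
films["net_profit"] = films["worldwide_gross"] - films["budget"]

# Average net profit per category, a first look before fitting any model.
mean_profit_by_category = films.groupby("category")["net_profit"].mean()
print(mean_profit_by_category)
```

Once the manual categorization exists, a table of this shape could feed one model per category or a single model with the category as a feature.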
Using information we scrape from the web, can we build linear regression models from which we can learn about the movie industry?
- acquisition: web scraping
- storage: flat files
- sources: Box Office Mojo, The Numbers, any other publicly available information
- basics of the web (requests, HTML, CSS, JavaScript)
- web scraping
- `numpy` and `pandas`
- `statsmodels`, `scikit-learn`
- linear regression
- organized project repository
- slide presentation
- visual and oral communication in presentations
- write-up of process and results
- iterative design process
- scoping
- "MVP"s and building outward
We'll learn about web scraping using two popular tools: BeautifulSoup and Selenium. You'll need to know the very basics of HTML. We'll also be evolving the way we use IPython notebooks. During this project we'll begin to use the notebook as a development scratchpad, where we test things out through interactive scripting, and then solidify our work in Python modules with reusable functions and classes.
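A minimal sketch of what such a reusable scraping function might look like with BeautifulSoup is shown here. The HTML snippet, the table `id`, and the helper names `parse_money` and `parse_film_table` are all hypothetical; real pages on sites such as Box Office Mojo use different markup, so the selectors would need to be adapted.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page; the structure and
# the figures are placeholders, not taken from any real site.
SAMPLE_HTML = """
<table id="films">
  <tr><th>Title</th><th>Budget</th><th>Gross</th></tr>
  <tr><td>Film A</td><td>$50,000,000</td><td>$120,000,000</td></tr>
  <tr><td>Film B</td><td>$150,000,000</td><td>$400,000,000</td></tr>
</table>
"""


def parse_money(text):
    """Turn a string like '$50,000,000' into an integer number of dollars."""
    return int(text.replace("$", "").replace(",", "").strip())


def parse_film_table(html):
    """Return one dict per data row of the sample table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table", id="films").find_all("tr")
    films = []
    for row in rows[1:]:  # skip the header row
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        title, budget, gross = cells
        films.append(
            {
                "title": title,
                "budget": parse_money(budget),
                "gross": parse_money(gross),
            }
        )
    return films


films = parse_film_table(SAMPLE_HTML)
print(films)
```

Keeping the parsing in small functions like these makes it easy to try them out in a notebook first and then move them into a module unchanged.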
We'll practice using linear regression. We'll have a first taste of feature selection, this time based on our intuition and some trial and error, and we'll build and refine our models.
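To make the regression step concrete, here is a small sketch that fits a one-feature linear model with ordinary least squares. It uses `numpy.linalg.lstsq` to keep dependencies minimal, which is a substitution for illustration; `statsmodels` or `scikit-learn` would be the more typical choice in this project. The figures are made-up placeholders constructed to lie exactly on a line.

```python
import numpy as np

# Made-up placeholder figures in millions of USD, not real data.
# The gross values are constructed as 2 * budget + 5 so the fit is exact.
budget = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
gross = np.array([25.0, 45.0, 65.0, 85.0, 105.0])

# Design matrix: a column of ones for the intercept plus the budget feature.
X = np.column_stack([np.ones_like(budget), budget])

# Ordinary least squares: find coefficients minimizing ||X @ coef - gross||^2.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, gross, rcond=None)
intercept, slope = coef

# Predict the gross for a hypothetical film with a 60M budget.
predicted = intercept + slope * 60.0
print(f"intercept={intercept:.2f}, slope={slope:.2f}, predicted={predicted:.2f}")
```

Adding or removing columns in `X` is the simplest form of the trial-and-error feature selection described above.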
We'll work in groups for brainstorming and design, and code sharing will be highly encouraged, but the final projects will be individual.
This project will really give you the freedom to challenge yourself, no matter your skill level. Find your boundaries, meet them, and push them a little further.
We are very excited to see what you will learn and do for Project Luther!