For this project, I conducted my own data analysis within a Jupyter Notebook and used it to create a shareable file that documents my findings. I started by taking a look at my dataset and brainstorming what questions I could answer using it. Then I used pandas
and NumPy
to answer the questions I'm most interested in, and created a report sharing the answers.
Several dataset were provided by Udacity for this project, here is the link. I have choosen to investigate the TMDb dataset. This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. Some of the features of this dataset to keep in mind are:
- Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters
- There are some odd characters in the ‘cast’ column, which we won't worry about
- The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time
Here are the questions posed in this analysis:
- What is the ideal movie runtime for higher revenue?
- Does higher budget equate to more revenue?
- What month of the year has the highest revenue?
- Is there a correlation between a movies popularity and its revenue?