Movies-ETL

In this module we are using ELT method to clean the data files, parse the data that we extracted to make it looks how we want. Merge parsed data sets and load the load the merged sets into pgAdmin for further use.

Step1

Using the ELT_function_test file by step:

Created a function to read in three separate files
Read in the kaggle metadata and MovieLens ratings CSV files as Pandas DataFrames.
Opened the Wikipedia JSON file and use the json() function to convert the JSON data to raw data.
Read in the raw wiki movie data as a Pandas DataFrame.
Use the code return the three DataFrames

return wiki_movies_df, kaggle_metadata, ratings

Create a variables provide to a path to the Wikipedia data, the Kaggle metadata, and the MovieLens rating data files.
Set the three variables in Step 6 equal to the function created in Step 1.
Set the DataFrames from the return statement equal to the file names in Step 6. In this step, you are reassigning the variables created in Step 6 to the variables in the return statement. To make sure your dataframes look correct we use the .head().

kaggle_metadata.head()

ratings

Step 2

Extract and Transform the Wikipedia Data with the ETL_clean_wiki_movies.ipynb:

Create the code for the clean movie function that takes in the argument "movie".
Add the function you created in step 1 that reads in the three data files.
Inside the function you created in step 1's python file, remove the code that creates the wiki_movies_df DataFrame from the wiki_movies_raw file, then write a list comprehension that filters out TV shows from the wiki_movies_raw file.
Write a list comprehension to iterate through the cleaned wiki movies list that you created in Step 3.
Read in the cleaned movies list from Step 4 as a DataFrame.
Write a try-except block that will catch errors while extracting the IMDb IDs with a regular expression string and dropping any imdb_id duplicates. If there is an error, capture and print the exception.
Write a list comprehension to keep the columns that have non-null values from the DataFrame created in Step 5, then create a wiki_movies_df DataFrame from the list.
Create a variable that will hold all the non-null values from the "Box office" column
Convert the box office data created in Step 8 to string values using the lambda and join functions.
Write a regular expression to match the six elements of form_one of the box office data.
Write a regular expression to match the three elements of form_two of the box office data.
Add the parse_dollars() function.
Add the code that cleans the box office column in the wiki_movies_df DataFrame using the form_one and form_two lists created in Steps 10 and 11, respectively.
Add code that cleans the budget column in the wiki_movies_df DataFrame.
Add code that cleans the release date column in the wiki_movies_df DataFrame.
Add code that cleans the running time column in the wiki_movies_df DataFrame.
Use the variables provided to create a path to the Wikipedia data, the Kaggle metadata, and the MovieLens rating data files.
Set the three variables in Step 17 equal to the function created in Deliverable 1.
Set the wiki_movies_df equal to the wiki_file variable.
Add the columns from wiki_movies_df DataFrame to a list, and confirm that they are the same as this image:

wiki_movies_df.columns.to_list()

Step 3

Extract and Transform the Kaggle Data using the ETL_clean_kaggle_data.ipynb

Add the function was created in our very first python file that reads in the three data files and creates the kaggle_metadata and ratings DataFrames. Also, make sure you add in all of the code from the ETL_clean_wiki_movies.ipynb file
Below the code that cleans the running time column in the wiki_movies_df DataFrame from ETL_clean_wiki_movies.ipynb, add the code that cleans the Kaggle metadata.
Merge the wiki_movies_df DataFrame and the kaggle_metadata DataFrames, then name the new DataFrame, movies_df.
Drop unnecessary columns from the movies_df DataFrame.
Add the fill_missing_kaggle_data() function that fills in the missing Kaggle data on the movies_df DataFrame.
Call the fill_missing_kaggle_data() function with the movies_df DataFrame and the Kaggle and Wikipedia columns to be cleaned as the arguments.
Filter the movies_df DataFrame to keep the necessary columns.
Rename the columns in the movies_df DataFrame.
Transform and merge the ratings DataFrame with the movies_df DataFrame, name the new DataFrame movies_with_ratings_df, then clean the movies_with_ratings_df DataFrame.
Use the variables provided to create a path to the Wikipedia data, the Kaggle metadata, and the MovieLens rating data files.
Set the three variables from Step 17 of ETL_clean_wiki_movies.ipynb equal to the function created in the first python file we created.
Set the DataFrames from the return statement after Step 9 equal to the file names in Step 11.
Check that your wiki_movies_df DataFrame is the same as in ETL_clean_wiki_movies.ipynb.
Confirm that your DataFrames look correct using the .head() method.

movies_with_ratings_df.head()

movies_df.head()

Step 4

By using our knowledge of Python, Pandas, the ETL process, code refactoring, and PostgreSQL to add the movies_df DataFrame and MovieLens rating CSV data to a SQL database.

Create database engine to communicate with SQL server.

"postgres://[user]:[password]@[location]:[port]/[database]"

Create the database engine.

engine = create_engine(db_string)

Save the movies_df DataFrame to a SQL table.

movies_df.to_sql(name='movies', con=engine, if_exists = 'replace')

The ratings data is too large to import in one statement, so it has to be divided into "chunks" of data.
- create a variable for the number of rows imported
- print out the range of rows that are being imported
- increment the number of rows imported by the size of 'data'
- print that the rows have finished importing
- also as an optional step we added the bild in time module to print the total amount of time elapsed at every step.

start_time = time.time()

time.time() - start_time

Once the cell finishes running, we check that the tables imported correctly using the pgAdmin. Confirm that the movies table has 6,052 rows and the ratings table has 26,024,289 rows. We just extracted really messy and almost unusable data, combed through it carefully to transform it, and then loaded it into a SQL database.

dangcoop / movie_etl Goto Github PK

movie_etl's Introduction

Movies-ETL

In this module we are using ELT method to clean the data files, parse the data that we extracted to make it looks how we want. Merge parsed data sets and load the load the merged sets into pgAdmin for further use.

Step1

Step 2

Step 3

Step 4

movie_etl's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent