Git Product home page Git Product logo

movies-etl's Introduction

Movies-ETL.

Overview

The purpose is to create an automated pipeline that takes in new data, performs the appropriate transformations, and loads the data into existing tables. To do this, Movie data from 1990 to 2018 is analyzed from Wikipedia json file, extracted large data set from Kaggle, then transforming the data into a usable dataset for a "hacking competition." Once the data was transformed and narrowed in scope for the hack-a-thon, the DataFrames were loaded into PostgresSQL.

Process: ETL (Extract, Transform, Load)

The main goal is to create a refactorable and intuitive ETL Pipeline that helps automate processing large sets of data.

Extract - extract scraped Wikipedia data stored as a JSON, and Kaggle data stored in CSVs

  • Create an ETL pipeline from raw data to a SQL database.
  • Extract data from disparate sources using Python. Transform
  • Clean and transform data using Python, Pandas.
  • Use regular expressions to parse data and to transform text into numbers. Load
  • Load data with PostgreSQL.

Resources

  • Python 3.7.6, JupyterLab 2.26
  • PostgreSQL 12.2, Pgadmin 4.20
  • Data Source
    • Wikipedia: (format: json) 7,311 thousand movie titles that include information about the movies, including budgets, box office returns, from 1990 to 2018. wikipedia-movies.json is in the Resources folder
    • Kaggle: - 2 files (format: .csv): Movie Database download link. MovieLens is a website run by the GroupLens research group at the University of Minnesota. The Kaggle dataset pulls from the MovieLens dataset of over 20 million reviews and contains a metadata file with details about the movies from The Movie Database (TMDb). Files we will be using are:
      • movies_metadata.csv (in Resource folder)
      • ratings.csv (note: due to size of the raw data files, they are not included within this repo. A sample with 1000 entries is included in the Resources folder. For the entire dataset, click on the "The Movie Database" link above).

Results

Extract : Write an ETL Function to Read the Three Data Files.

  • ETL_function_test.ipynb
    • An ETL function is written to read in the three data files.

    • The function converts the Wikipedia JSON file to a Pandas DataFrame data-08-first-five-rows-of-wiki-movies-df-DataFrame_1

    • The function converts the Kaggle metadata file to a Pandas DataFrame data-08-first-five-rows-of-kaggle-metadata-DataFrame

    • The function converts the MovieLens ratings data file to a Pandas DataFrame data-08-ratings-DataFrame

Transform

  • ETL_clean_wiki_movies.ipynb: The transformation of the Wikipedia data in the ETL function
    • A list comprehension is used to keep columns with non-null values.
    • The non-null box office data is converted to string values using the lambda and join functions.
    • A regular expression is used to match the six elements of "form_one" of the box office data.
    • A regular expression is used to match the three elements of "form_two" of the box office data.
    • The following columns are cleaned in the Wikipedia DataFrame: (8 pt)
      • The box office column
      • The budget column
      • The release date column
      • The running time column

data-08-first-five-rows-of-wiki-movies-df-DataFrame_2

data-08-columns-of-the-wiki-movies-df-DataFrame

  • ETL_clean_kaggle_data.ipynb: Transform the Kaggle Data
  1. The extraction and transformation of the Kaggle metadata (movies_metadata.csv) using the ETL function does the following:
  • The Kaggle metadata is cleaned.
  • The Wikipedia and Kaggle DataFrames are merged
  • The following is performed on the merged Wikipedia and Kaggle DataFrames to create the movies_df
    • Unnecessary columns are dropped.
    • A function is used to fill in the missing Kaggle data.
    • The movies_df DataFrame is filtered to keep specific columns.
    • The movies_df DataFrame columns are renamed

data-08-first-five-rows-of-the-movies-df-DataFrame

  1. The extraction and transformation of the MovieLens ratings data (ratings.csv) using the ETL function does the following:
  • The ratings counts are cleaned.
  • The movies_df DataFrame is merged with the cleaned ratings DataFrame to create the movies_with_ratings_df DataFrame.
  • The empty values in the movies_with_ratings_df DataFrame are filled with โ€œ0โ€.

data-08-first-five-rows-of-movies-with-ratings-df-DataFrame

Load

ETL_create_database.ipynb: Create the Movie Database

  • The data from the movies_df DataFrame replaces the current data in the movies table in the SQL database, as determined by the movies_query.png.
  • The data from the MovieLens rating CSV file is added to the ratings table in the SQL database, as determined by the ratings_query.png.
  • The elapsed time to add the data to the database is displayed in the ETL_create_database.ipynb file.

movies_query

ratings_query

movies-etl's People

Contributors

ramya-ramamur avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.