Git Product home page Git Product logo

imdb-etl-project's Introduction

IMDB-ETL-Project

Implemented an end-to-end data pipeline utilizing Talend Studio, with Alteryx for data profiling, MySQL and Microsoft Azure Database for staging, SCD for data integrity, and Power BI/Tableau for analytics.

I. Perform Data Profiling with Alteryx

image image

  • The Alteryx workflows involve data profiling on TSV files containing information from the IMDB database, specifically focusing on six tables. In the first table, "IMDB NAME BASICS," individual identifiers, names, birth and death years, professions, and known titles are stored, each with specific data constraints.

  • The second table, "IMDB TITLE AKAS," contains details about movie titles, including language, region, and release information.

  • The third table, "IMDB TITLE BASICS," provides information about movies, such as title type, adult content, release years, and genres. These tables are interconnected through unique identifiers, ensuring data integrity and relational structure.

  • The fourth table, "IMDB TITLE CREW," lists the directors and writers associated with each movie, with creation timestamps.

  • The fifth table, "IMDB TITLE PRINCIPALS," includes details about individuals involved in a movie, such as their roles, order of appearance, and associated characters.

  • Lastly, the sixth table, "IMDB TITLE RATINGS," involves ratings assigned to movies, with director and writer information. The data in these tables is crucial for understanding the relationships between movies, individuals, and their respective roles in the filmmaking process.

  • Additionally, the workflows involve analyzing revenue data, which is spread across nine separate CSV files. Each file has 11 columns with specific data constraints. The revenue data includes movie identifiers, titles, dates, ranks, gross earnings, percentages, theater information, total gross, and duration. Data validation checks reveal issues such as trailing spaces and null values, providing insights into data quality and potential cleaning requirements.

  • The comprehensive overview of these Alteryx workflows and associated TSV files underscores the complexity and richness of the IMDB dataset, offering valuable insights for data analysis and exploration.

II. Load Data into Staging Tables with Talend

Staging Jobs – S.no Job name 1 Actors_combined 2 Combined_titles 3. Converting_runtime_int 4. Json_staging_names 5. Json_staging_titles 6. Sql_scripts_staging 7. Tsv_file_staging

DIM_Loading Jobs – S.no Job Name

  1. Date_dim_loading
  2. Directors_dim
  3. Genres_dim
  4. Location_dim
  5. Movie_dim_Scd
  6. Person_dim_scd
  7. Principals_dim 8, Writers_dim

Fact_Loading Jobs – S.no Job Name

  1. Fact_movie_accomplishments
  2. Fact_movie_revenue

Bridge_Loading Jobs – S.no Job Name

  1. Genre_movie_bridge
  2. Region_bridge_loading

III. Dimensional Modelling

image

IV. Mapping of source and target

image

Categorised into 4 kinds

  1. Movie - image

  2. Revenue/accomplishments - image

  3. Workflow - image

  4. Person - image

imdb-etl-project's People

Contributors

sashank1079 avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.