The etl-processes from nebyouaz

Introduction

A startup named Sparkify aims to analyze the data collected on songs and user activity within their new music streaming app.

The analytics team is specifically focused on understanding user song preferences. The project involves creating a Microsoft SQL Database Schema on Azure and an ETL (Extract, Transform, Load) pipeline to optimize queries for song play analysis

Project Description

This project involves modeling data with Microsoft SQL Database and building an ETL pipeline using Python. On the database side, fact and dimension tables are defined for a Star Schema with a specific focus. The ETL pipeline transfers data from files in two local directories into these tables in Microsoft SQL Database using Python and SQL.

Schema for Song Play Analysis

Fact Table

songplays records in log data associated with song plays

Dimension Tables

users in the app

songs in music database

artists in music database

time: timestamps of records in songplays broken down into specific units

Project Design

The database design is optimized, requiring a minimal number of tables and specific joins to extract the most information for analysis.

The ETL design is simplified, involving reading JSON files, parsing them accordingly, and storing the data into specific columns with proper formatting.

Database Script

Running the command python create_tables.py in the terminal facilitates the creation and recreation of tables.

Jupyter Notebook

etl.ipynb: A Jupyter notebook provided for verifying each command and data. The statements from this notebook are copied into etl.py, which can be run in the terminal using the command python etl.py. Afterward, running test.ipynb verifies whether data has been successfully loaded into all tables.

Relevant Files Provided

test.ipnb Displays the first few rows of each table, allowing you to check your database

create_tables.py Drops and creates your tables

etl.ipynb Reads and processes a single file from song_data and log_data, loading it into your tables in a Jupyter notebook

sql_queries.py Contains all the T-SQL queries and is imported into the last three files above

nebyouaz / etl-processes Goto Github PK

etl-processes's Introduction

etl-processes's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent