Introduction
A startup named Sparkify aims to analyze the data collected on songs and user activity within their new music streaming app.
The analytics team is specifically focused on understanding user song preferences. The project involves creating a Microsoft SQL Database Schema on Azure and an ETL (Extract, Transform, Load) pipeline to optimize queries for song play analysis
Project Description
This project involves modeling data with Microsoft SQL Database and building an ETL pipeline using Python. On the database side, fact and dimension tables are defined for a Star Schema with a specific focus. The ETL pipeline transfers data from files in two local directories into these tables in Microsoft SQL Database using Python and SQL.
Schema for Song Play Analysis
Fact Table
songplays records in log data associated with song plays
Dimension Tables
users in the app
songs in music database
artists in music database
time: timestamps of records in songplays broken down into specific units
Project Design
The database design is optimized, requiring a minimal number of tables and specific joins to extract the most information for analysis.
The ETL design is simplified, involving reading JSON files, parsing them accordingly, and storing the data into specific columns with proper formatting.
Database Script
Running the command python create_tables.py
in the terminal facilitates the creation and recreation of tables.
Jupyter Notebook
etl.ipynb: A Jupyter notebook provided for verifying each command and data. The statements from this notebook are copied into etl.py, which can be run in the terminal using the command python etl.py. Afterward, running test.ipynb verifies whether data has been successfully loaded into all tables.
Relevant Files Provided
test.ipnb Displays the first few rows of each table, allowing you to check your database
create_tables.py Drops and creates your tables
etl.ipynb Reads and processes a single file from song_data and log_data, loading it into your tables in a Jupyter notebook
sql_queries.py Contains all the T-SQL queries and is imported into the last three files above