Git Product home page Git Product logo

nebyouaz / etl-processes Goto Github PK

View Code? Open in Web Editor NEW
0.0 1.0 0.0 266 KB

This project involves modeling data with Microsoft SQL Database and building an ETL pipeline using Python. On the database side, fact and dimension tables are defined for a Star Schema with a specific focus. The ETL pipeline transfers data from files in two local directories into these tables in Microsoft SQL Database using Python and SQL.

Python 19.26% Jupyter Notebook 80.74%

etl-processes's Introduction

Introduction

A startup named Sparkify aims to analyze the data collected on songs and user activity within their new music streaming app.

The analytics team is specifically focused on understanding user song preferences. The project involves creating a Microsoft SQL Database Schema on Azure and an ETL (Extract, Transform, Load) pipeline to optimize queries for song play analysis

Project Description

This project involves modeling data with Microsoft SQL Database and building an ETL pipeline using Python. On the database side, fact and dimension tables are defined for a Star Schema with a specific focus. The ETL pipeline transfers data from files in two local directories into these tables in Microsoft SQL Database using Python and SQL.

Schema for Song Play Analysis

Fact Table

songplays records in log data associated with song plays

Dimension Tables

users in the app

songs in music database

artists in music database

time: timestamps of records in songplays broken down into specific units

Project Design

The database design is optimized, requiring a minimal number of tables and specific joins to extract the most information for analysis.

The ETL design is simplified, involving reading JSON files, parsing them accordingly, and storing the data into specific columns with proper formatting.

Database Script

Running the command python create_tables.py in the terminal facilitates the creation and recreation of tables.

Jupyter Notebook

etl.ipynb: A Jupyter notebook provided for verifying each command and data. The statements from this notebook are copied into etl.py, which can be run in the terminal using the command python etl.py. Afterward, running test.ipynb verifies whether data has been successfully loaded into all tables.

Relevant Files Provided

test.ipnb Displays the first few rows of each table, allowing you to check your database

create_tables.py Drops and creates your tables

etl.ipynb Reads and processes a single file from song_data and log_data, loading it into your tables in a Jupyter notebook

sql_queries.py Contains all the T-SQL queries and is imported into the last three files above

etl-processes's People

Contributors

nebyouaz avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.