Git Product home page Git Product logo

udend-data-warehouse-project's Introduction

Sparkify Data Warehouse Project for song play analysis

The purpose of this project is to build an ETL pipeline that will be able to extract song data from an S3 bucket and transform that data to make it suitable for analysis. This data can be used with business intelligence and visualization apps that will help the analytics team to better understand what songs are commonly listened to on the app.

Database Schema Design

The project creates a redshift database in the cluster with staging tables that contain all the data retrieved from the s3 bucket and copied over to the tables. They are columnar in nature which helps with parallelizing the execution of one query on multiple CPUs which makes it's peformance much faster for large sets of data as compared to tables in relational databases. Each column on the tables corresponds to the keys in the json files and for the case of the staging_events table, since the column names were different from those in the file, I utilized a json path map file that maps the data elements to the relevant columns.

The project also creates a relational database with a fact and dimension tables, therefore making it a star schema. The reason as to why I utilized star schema as opposed to 3rd normal form is because the star schema is more suitable and optimized for OLAP operations which will be the purpose of the database. This will store data from the staging tables that has been transformed to provide the relevant data in the tables.

ETL Pipeline

The data gets that gets extracted will need to be transformed to to fit the data model in the target destination tables. For instance the source data for timestamp is in unix forrmat and that will need to be converted to timestamp from which the year, month, day, hour values etc can be extracted which will fit in the target database table schema.

Since the datasets, once copied over to the staging tables, will contain quite a number of columns, the query that inserts the data in the respective tables has a nested query that selects only the relevant columns for the table. In the case of the fact table, there is an added constraint for the data from each staging column that ensures that the values are not null.

Datasets used

The datasets used are retrieved from the s3 bucket and are in the JSON format. There are two datasets namely log_data and song_data. The song_data dataset is a subset of the the Million Song Dataset while the log_data contains generated log files based on the songs in song_data.

Getting Started

In order to have a copy of the project up and running locally, you will need to take note of the following:

Prerequisites

  • Python 2.7 or greater.

  • AWS Account.

  • IAM role with AWS service as Redshift-Customizable and permissions set to s3 read only access.

  • Security group with inbound rules appropriately set as below:

    Type: Custom TCP Rule.
    Protocol: TCP.
    Port Range: 5439,
    Source: Custom IP, with 0.0.0.0/0
    
  • Set your AWS access and secret key in the config file.

    [AWS]
    KEY =<your aws key>
    SECRET =<your aws secret>
    

Installation

  • Make a new directory and clone/copy project files into it.

  • Create a virtualenv that will be your development environment, i.e:

    $ virtualenv sparkify-project
    $ source sparkify-project/bin/activate
    
  • Install the following packages in your virtual environment:

       - configparser
       - boto3
       - psycopg2
    
  • Alternatively you can install the requirements in the requirements.txt that's in this project by running the command:

    $ pip install -r requirements.txt
    

Terminal commands

  • To execute the code that creates the cluster, creates and drops the database tables, run the following command on the terminal:

    $ python create_tables.py
    
  • After running the above command, the next step is to execute the ETL pipeline which is done by running:

    $ python etl.py
    
  • Once done with the cluster, you can delete it by running the following:

    $ python delete_cluster.py
    

Screenshots of what the final tables look like

Artists Table

Artists Table

Songplays Table

Songplays Table Songplays Table

Songs Table

Songs Table

Time Table

Time Table

Users Table

Users Table

Built With

  • Python and SQL
  • AWS Redshift

Authors

udend-data-warehouse-project's People

Contributors

dependabot-preview[bot] avatar jonathankamau avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

udend-data-warehouse-project's Issues

Weekly Digest (12 July, 2020 - 19 July, 2020)

Here's the Weekly Digest for jonathankamau/udend-data-warehouse-project:


ISSUES

Last week 5 issues were created.
Of these, 4 issues have been closed and 1 issues are still open.

OPEN ISSUES

๐Ÿ’š #74 Bump boto3 from 1.9.205 to 1.14.23, by dependabot-preview[bot]

CLOSED ISSUES

โค๏ธ #73 Bump botocore from 1.12.205 to 1.17.22, by dependabot-preview[bot]
โค๏ธ #72 Bump boto3 from 1.9.205 to 1.14.22, by dependabot-preview[bot]
โค๏ธ #71 Bump botocore from 1.12.205 to 1.17.21, by dependabot-preview[bot]
โค๏ธ #70 Bump boto3 from 1.9.205 to 1.14.21, by dependabot-preview[bot]

NOISY ISSUE

๐Ÿ”ˆ #70 Bump boto3 from 1.9.205 to 1.14.21, by dependabot-preview[bot]
It received 1 comments.


PULL REQUESTS

Last week, 7 pull requests were created, updated or merged.

UPDATED PULL REQUEST

Last week, 4 pull requests were updated.
๐Ÿ’› #74 Bump boto3 from 1.9.205 to 1.14.23, by dependabot-preview[bot]
๐Ÿ’› #63 Bump ipython from 5.8.0 to 7.16.1, by dependabot-preview[bot]
๐Ÿ’› #62 Bump urllib3 from 1.25.3 to 1.25.9, by dependabot-preview[bot]
๐Ÿ’› #12 Bump configparser from 3.7.5 to 5.0.0, by dependabot-preview[bot]

MERGED PULL REQUEST

Last week, 3 pull requests were merged.
๐Ÿ’œ #73 Bump botocore from 1.12.205 to 1.17.22, by dependabot-preview[bot]
๐Ÿ’œ #8 Bump six from 1.12.0 to 1.15.0, by dependabot-preview[bot]
๐Ÿ’œ #6 Bump enum34 from 1.1.6 to 1.1.10, by dependabot-preview[bot]


COMMITS

Last week there were 4 commits.
๐Ÿ› ๏ธ Merge pull request #73 from jonathankamau/dependabot/pip/botocore-1.17.22 Bump botocore from 1.12.205 to 1.17.22 by jonathankamau
๐Ÿ› ๏ธ Bump botocore from 1.12.205 to 1.17.22 Bumps botocore from 1.12.205 to 1.17.22. - Release notes - Changelog - Commits Signed-off-by: dependabot-preview[bot] [email protected] by dependabot-preview[bot]
๐Ÿ› ๏ธ Merge pull request #8 from jonathankamau/dependabot/pip/six-1.15.0 Bump six from 1.12.0 to 1.15.0 by jonathankamau
๐Ÿ› ๏ธ Merge pull request #6 from jonathankamau/dependabot/pip/enum34-1.1.10 Bump enum34 from 1.1.6 to 1.1.10 by jonathankamau


CONTRIBUTORS

Last week there were 2 contributors.
๐Ÿ‘ค jonathankamau
๐Ÿ‘ค dependabot-preview[bot]


STARGAZERS

Last week there were no stargazers.


RELEASES

Last week there were no releases.


That's all for last week, please ๐Ÿ‘€ Watch and โญ Star the repository jonathankamau/udend-data-warehouse-project to receive next weekly updates. ๐Ÿ˜ƒ

You can also view all Weekly Digests by clicking here.

Your Weekly Digest bot. ๐Ÿ“†

Weekly Digest (19 July, 2020 - 26 July, 2020)

Here's the Weekly Digest for jonathankamau/udend-data-warehouse-project:


ISSUES

Last week 12 issues were created.
Of these, 7 issues have been closed and 5 issues are still open.

OPEN ISSUES

๐Ÿ’š #87 Bump botocore from 1.17.22 to 1.17.27, by dependabot-preview[bot]
๐Ÿ’š #86 Bump boto3 from 1.9.205 to 1.14.27, by dependabot-preview[bot]
๐Ÿ’š #85 Bump urllib3 from 1.25.3 to 1.25.10, by dependabot-preview[bot]
๐Ÿ’š #77 Bump pygments from 2.4.2 to 2.6.1, by dependabot-preview[bot]
๐Ÿ’š #76 Bump docutils from 0.14 to 0.16, by dependabot-preview[bot]

CLOSED ISSUES

โค๏ธ #84 Bump boto3 from 1.9.205 to 1.14.26, by dependabot-preview[bot]
โค๏ธ #83 Bump botocore from 1.17.22 to 1.17.26, by dependabot-preview[bot]
โค๏ธ #82 Bump boto3 from 1.9.205 to 1.14.25, by dependabot-preview[bot]
โค๏ธ #81 Bump botocore from 1.17.22 to 1.17.25, by dependabot-preview[bot]
โค๏ธ #80 Bump boto3 from 1.9.205 to 1.14.24, by dependabot-preview[bot]
โค๏ธ #79 Bump botocore from 1.17.22 to 1.17.24, by dependabot-preview[bot]
โค๏ธ #78 Bump botocore from 1.17.22 to 1.17.23, by dependabot-preview[bot]

NOISY ISSUE

๐Ÿ”ˆ #78 Bump botocore from 1.17.22 to 1.17.23, by dependabot-preview[bot]
It received 1 comments.


PULL REQUESTS

Last week, 5 pull requests were created, updated or merged.

UPDATED PULL REQUEST

Last week, 5 pull requests were updated.
๐Ÿ’› #87 Bump botocore from 1.17.22 to 1.17.27, by dependabot-preview[bot]
๐Ÿ’› #86 Bump boto3 from 1.9.205 to 1.14.27, by dependabot-preview[bot]
๐Ÿ’› #85 Bump urllib3 from 1.25.3 to 1.25.10, by dependabot-preview[bot]
๐Ÿ’› #77 Bump pygments from 2.4.2 to 2.6.1, by dependabot-preview[bot]
๐Ÿ’› #76 Bump docutils from 0.14 to 0.16, by dependabot-preview[bot]


COMMITS

Last week there were no commits.


CONTRIBUTORS

Last week there were no contributors.


STARGAZERS

Last week there were no stargazers.


RELEASES

Last week there were no releases.


That's all for last week, please ๐Ÿ‘€ Watch and โญ Star the repository jonathankamau/udend-data-warehouse-project to receive next weekly updates. ๐Ÿ˜ƒ

You can also view all Weekly Digests by clicking here.

Your Weekly Digest bot. ๐Ÿ“†

Weekly Digest (26 July, 2020 - 2 August, 2020)

Here's the Weekly Digest for jonathankamau/udend-data-warehouse-project:


ISSUES

Last week 10 issues were created.
Of these, 8 issues have been closed and 2 issues are still open.

OPEN ISSUES

๐Ÿ’š #98 Bump boto3 from 1.9.205 to 1.14.32, by dependabot-preview[bot]
๐Ÿ’š #97 Bump botocore from 1.17.22 to 1.17.32, by dependabot-preview[bot]

CLOSED ISSUES

โค๏ธ #96 Bump botocore from 1.17.22 to 1.17.31, by dependabot-preview[bot]
โค๏ธ #95 Bump boto3 from 1.9.205 to 1.14.31, by dependabot-preview[bot]
โค๏ธ #94 Bump boto3 from 1.9.205 to 1.14.30, by dependabot-preview[bot]
โค๏ธ #93 Bump botocore from 1.17.22 to 1.17.30, by dependabot-preview[bot]
โค๏ธ #92 Bump botocore from 1.17.22 to 1.17.29, by dependabot-preview[bot]
โค๏ธ #91 Bump boto3 from 1.9.205 to 1.14.29, by dependabot-preview[bot]
โค๏ธ #90 Bump boto3 from 1.9.205 to 1.14.28, by dependabot-preview[bot]
โค๏ธ #89 Bump botocore from 1.17.22 to 1.17.28, by dependabot-preview[bot]

NOISY ISSUE

๐Ÿ”ˆ #89 Bump botocore from 1.17.22 to 1.17.28, by dependabot-preview[bot]
It received 1 comments.


PULL REQUESTS

Last week, 2 pull requests were created, updated or merged.

UPDATED PULL REQUEST

Last week, 2 pull requests were updated.
๐Ÿ’› #98 Bump boto3 from 1.9.205 to 1.14.32, by dependabot-preview[bot]
๐Ÿ’› #97 Bump botocore from 1.17.22 to 1.17.32, by dependabot-preview[bot]


COMMITS

Last week there were no commits.


CONTRIBUTORS

Last week there were no contributors.


STARGAZERS

Last week there were no stargazers.


RELEASES

Last week there were no releases.


That's all for last week, please ๐Ÿ‘€ Watch and โญ Star the repository jonathankamau/udend-data-warehouse-project to receive next weekly updates. ๐Ÿ˜ƒ

You can also view all Weekly Digests by clicking here.

Your Weekly Digest bot. ๐Ÿ“†

Weekly Digest (2 August, 2020 - 9 August, 2020)

Here's the Weekly Digest for jonathankamau/udend-data-warehouse-project:


ISSUES

Last week 11 issues were created.
Of these, 10 issues have been closed and 1 issues are still open.

OPEN ISSUES

๐Ÿ’š #110 Bump boto3 from 1.9.205 to 1.14.37, by dependabot-preview[bot]

CLOSED ISSUES

โค๏ธ #109 Bump botocore from 1.17.22 to 1.17.37, by dependabot-preview[bot]
โค๏ธ #108 Bump boto3 from 1.9.205 to 1.14.36, by dependabot-preview[bot]
โค๏ธ #107 Bump botocore from 1.17.22 to 1.17.36, by dependabot-preview[bot]
โค๏ธ #106 Bump botocore from 1.17.22 to 1.17.35, by dependabot-preview[bot]
โค๏ธ #105 Bump boto3 from 1.9.205 to 1.14.35, by dependabot-preview[bot]
โค๏ธ #104 Bump boto3 from 1.9.205 to 1.14.34, by dependabot-preview[bot]
โค๏ธ #103 Bump botocore from 1.17.22 to 1.17.34, by dependabot-preview[bot]
โค๏ธ #102 Bump botocore from 1.17.22 to 1.17.33, by dependabot-preview[bot]
โค๏ธ #101 Bump boto3 from 1.9.205 to 1.14.33, by dependabot-preview[bot]
โค๏ธ #100 Bump ipython from 5.8.0 to 7.17.0, by dependabot-preview[bot]

NOISY ISSUE

๐Ÿ”ˆ #101 Bump boto3 from 1.9.205 to 1.14.33, by dependabot-preview[bot]
It received 1 comments.


PULL REQUESTS

Last week, 10 pull requests were created, updated or merged.

UPDATED PULL REQUEST

Last week, 1 pull request was updated.
๐Ÿ’› #110 Bump boto3 from 1.9.205 to 1.14.37, by dependabot-preview[bot]

MERGED PULL REQUEST

Last week, 9 pull requests were merged.
๐Ÿ’œ #109 Bump botocore from 1.17.22 to 1.17.37, by dependabot-preview[bot]
๐Ÿ’œ #100 Bump ipython from 5.8.0 to 7.17.0, by dependabot-preview[bot]
๐Ÿ’œ #85 Bump urllib3 from 1.25.3 to 1.25.10, by dependabot-preview[bot]
๐Ÿ’œ #77 Bump pygments from 2.4.2 to 2.6.1, by dependabot-preview[bot]
๐Ÿ’œ #76 Bump docutils from 0.14 to 0.16, by dependabot-preview[bot]
๐Ÿ’œ #59 Bump pexpect from 4.7.0 to 4.8.0, by dependabot-preview[bot]
๐Ÿ’œ #49 Bump prompt-toolkit from 1.0.16 to 3.0.5, by dependabot-preview[bot]
๐Ÿ’œ #12 Bump configparser from 3.7.5 to 5.0.0, by dependabot-preview[bot]
๐Ÿ’œ #7 Bump jmespath from 0.9.4 to 0.10.0, by dependabot-preview[bot]


COMMITS

Last week there were 11 commits.
๐Ÿ› ๏ธ Merge pull request #109 from jonathankamau/dependabot/pip/botocore-1.17.37 Bump botocore from 1.17.22 to 1.17.37 by jonathankamau
๐Ÿ› ๏ธ Merge pull request #100 from jonathankamau/dependabot/pip/ipython-7.17.0 Bump ipython from 5.8.0 to 7.17.0 by jonathankamau
๐Ÿ› ๏ธ Bump botocore from 1.17.22 to 1.17.37 Bumps botocore from 1.17.22 to 1.17.37. - Release notes - Changelog - Commits Signed-off-by: dependabot-preview[bot] [email protected] by dependabot-preview[bot]
๐Ÿ› ๏ธ Bump ipython from 5.8.0 to 7.17.0 Bumps ipython from 5.8.0 to 7.17.0. - Release notes - Commits Signed-off-by: dependabot-preview[bot] [email protected] by dependabot-preview[bot]
๐Ÿ› ๏ธ Merge pull request #85 from jonathankamau/dependabot/pip/urllib3-1.25.10 Bump urllib3 from 1.25.3 to 1.25.10 by jonathankamau
๐Ÿ› ๏ธ Merge pull request #77 from jonathankamau/dependabot/pip/pygments-2.6.1 Bump pygments from 2.4.2 to 2.6.1 by jonathankamau
๐Ÿ› ๏ธ Merge pull request #76 from jonathankamau/dependabot/pip/docutils-0.16 Bump docutils from 0.14 to 0.16 by jonathankamau
๐Ÿ› ๏ธ Merge pull request #59 from jonathankamau/dependabot/pip/pexpect-4.8.0 Bump pexpect from 4.7.0 to 4.8.0 by jonathankamau
๐Ÿ› ๏ธ Merge pull request #49 from jonathankamau/dependabot/pip/prompt-toolkit-3.0.5 Bump prompt-toolkit from 1.0.16 to 3.0.5 by jonathankamau
๐Ÿ› ๏ธ Merge pull request #12 from jonathankamau/dependabot/pip/configparser-5.0.0 Bump configparser from 3.7.5 to 5.0.0 by jonathankamau
๐Ÿ› ๏ธ Merge pull request #7 from jonathankamau/dependabot/pip/jmespath-0.10.0 Bump jmespath from 0.9.4 to 0.10.0 by jonathankamau


CONTRIBUTORS

Last week there were 2 contributors.
๐Ÿ‘ค jonathankamau
๐Ÿ‘ค dependabot-preview[bot]


STARGAZERS

Last week there were no stargazers.


RELEASES

Last week there were no releases.


That's all for last week, please ๐Ÿ‘€ Watch and โญ Star the repository jonathankamau/udend-data-warehouse-project to receive next weekly updates. ๐Ÿ˜ƒ

You can also view all Weekly Digests by clicking here.

Your Weekly Digest bot. ๐Ÿ“†

Dataset

Hi,
Please add the dataset to repo. You can compress it and add a .tar.gz file. It is very difficult to find or convert data to json from the dataset link given in the README file

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.