This project demonstrates how to build a data lake with Spark. It is part of the Udacity Data Engineering Nanodegree program.
A music streaming startup, Sparkify, has grown its user base and song database even more and wants to move its data warehouse to a data lake. Its data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata on the songs in the app.
The project consists of building an ETL pipeline that extracts the data from S3, processes it with Spark, and loads it back into S3 as a set of dimensional tables.
Submit the etl.py script to your Spark cluster with:

```sh
$ spark-submit etl.py
```
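A minimal sketch of what the SparkSession setup in etl.py might look like; the hadoop-aws version below is an assumption and should match the Hadoop build bundled with your Spark distribution:

```python
from pyspark.sql import SparkSession

def create_spark_session():
    """Create a SparkSession able to read from and write to S3 via s3a://."""
    return (
        SparkSession.builder
        .appName("sparkify-data-lake")
        # hadoop-aws supplies the s3a:// filesystem connector; the version
        # here is an assumption, align it with your Hadoop version
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
        .getOrCreate()
    )

spark = create_spark_session()
```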
To access the S3 bucket, you need to configure your AWS credentials. Create a dl.cfg file with the following content:

```ini
[KEYS]
aws_access_key_id = <your access key>
aws_secret_access_key = <your secret key>
```
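Inside etl.py the credentials can then be loaded with the standard library's configparser and exported as environment variables, which the S3A connector picks up automatically. A minimal sketch, assuming the [KEYS] section shown above:

```python
import os
import configparser

config = configparser.ConfigParser()
config.read("dl.cfg")

# The S3A connector reads these standard AWS environment variables
os.environ["AWS_ACCESS_KEY_ID"] = config["KEYS"]["aws_access_key_id"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["KEYS"]["aws_secret_access_key"]
```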
The datasets used in this project are the following:
- song_data - JSON files with metadata on songs, containing the following fields:
  - song_id - a unique ID for each song
  - title - the title of the song
  - artist_id - the ID of the artist
  - artist_name - the name of the artist
  - artist_location - the location of the artist
  - artist_latitude - the latitude of the artist
  - artist_longitude - the longitude of the artist
  - year - the year the song was released
  - duration - the length of the song in seconds
  - num_songs - the number of songs by the artist
- log_data - JSON files with logs of user activity on the app, containing the following fields:
  - artist - the name of the artist
  - auth - the authentication method used
  - firstName - the first name of the user
  - gender - the gender of the user
  - itemInSession - the number of items in the session
  - lastName - the last name of the user
  - length - the length of the song played, in seconds
  - level - the level of the user (free or paid)
  - location - the location of the user
  - method - the HTTP method of the request
  - page - the page accessed
  - registration - the registration date of the user
  - sessionId - the ID of the session
  - song - the name of the song
  - status - the HTTP status of the request
  - ts - the timestamp of the request
  - userAgent - the user agent of the user
  - userId - the ID of the user
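Both datasets can be read straight from S3 with Spark's JSON reader. A sketch, assuming the standard Udacity bucket layout (substitute your own paths if they differ):

```python
# Song files are nested several directories deep; log files are nested by year/month
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# Only "NextSong" events represent actual song plays
log_df = log_df.where(log_df.page == "NextSong")
```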
The Spark job creates a data lake in S3 containing the following dimensional tables:
- artists - artists in the music database
- users - users in the app
- time - timestamps of records in songplays, broken down into specific units
- songs - songs in the music database

And the following fact table:
- songplays - records in the log data associated with song plays
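As an illustration of the load step, the songs table might be derived from the song dataset and written back to S3 as parquet, partitioned by year and artist. A sketch, assuming song_df from the reading example above and a hypothetical output bucket:

```python
# Build the songs dimensional table from the song dataset
songs_table = (
    song_df
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)

# Write as parquet files partitioned by year and artist_id
(
    songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://your-output-bucket/songs/")
)
```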
This project is licensed under the MIT License.