big_data_repository's Introduction

Spotify Song Data Analysis with PySpark

Overview

This project uses PySpark, the Python API for Apache Spark, to analyze Spotify song data. The script covers data exploration, clustering, dimensionality reduction, and collaborative filtering for song recommendations.

Dataset

The analysis is conducted on two main datasets:

  1. Final Spotify Database (Final_database.csv): This dataset contains detailed information about various Spotify songs, including features like danceability, energy, instrumentalness, valence, and more.
  2. Database for Calculating Popularity (Database_to_calculate_popularity.csv): This dataset includes information about the popularity and listening statistics of songs, such as position, track URI, and country.

Both datasets are utilized to extract insights and patterns related to song popularity, artist trends, and user preferences.

Requirements

Python

PySpark

Pandas

Seaborn

Scikit-learn

SQL

HDFS

Clustering

MapReduce

Install the required dependencies:

pip install pyspark pandas seaborn scikit-learn

EDA (Exploratory Data Analysis)

EDA is the process of uncovering structure and hidden information in a dataset by examining it in different ways, such as finding duplicates, finding and handling null values, and visualizing the data through plots, charts, and graphs. Here we run SQL queries, produce plots and figures, and use PySpark to filter results from the dataset.

Clustering

Clustering is the task of dividing unlabeled data points into clusters such that points in the same cluster are more similar to each other than to points in other clusters. In simple terms, the aim is to group data points with similar traits and assign them to the same cluster.

Key Features

  1. Data Loading: Loads Spotify song data from CSV files (Final_database.csv and Database_to_calculate_popularity.csv) using PySpark.

  2. Exploratory Data Analysis (EDA): Conducts EDA to understand various aspects of the dataset, such as popular artists, genres, and trends over time.

  3. Clustering: Applies clustering techniques, including KMeans, to identify patterns and similarities among songs.

  4. Dimensionality Reduction: Uses PCA (Principal Component Analysis) to reduce the dimensionality of the data and visualize it in two dimensions.

  5. Collaborative Filtering: Implements collaborative filtering with the ALS (Alternating Least Squares) algorithm to provide song recommendations based on user listening history.

  6. Song Similarity: Lets users find the songs most similar to a given input song using cosine similarity.
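The song similarity feature in item 6 rests on cosine similarity between feature vectors. The sketch below shows the idea in plain Python. The function names, the catalog, and the three-number feature vectors are all hypothetical, and the project itself may compute this with PySpark rather than with Python lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, catalog, top_n=2):
    """Rank every other song in the catalog by similarity to the query song."""
    target = catalog[query]
    scores = [
        (title, cosine_similarity(target, vector))
        for title, vector in catalog.items()
        if title != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical feature vectors: (danceability, energy, instrumentalness).
catalog = {
    "Song A": [0.90, 0.80, 0.10],
    "Song B": [0.85, 0.75, 0.15],
    "Song C": [0.10, 0.20, 0.90],
}

ranked = most_similar("Song A", catalog)
ranked_titles = [title for title, _ in ranked]
```

Cosine similarity compares the direction of two vectors rather than their magnitude, so features on very different scales are usually normalized first.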

big_data_repository's People

Contributors

shams261, jaspreet0411, vamsi-sirasanagandla, vivek77777
