spotify-songs's Introduction

ORIE 4741: Spotify Song Trends & Influencers

ORIE 4741 Learning with Big Messy Data Final Project

Kevin Van Vorst (kpv23), Manya Walia (mtw62), Saaqeb Siddiqi (ss3759)

spotify-songs's People

Contributors

saaqebs, kevinpatrickvan1, manyawalia

spotify-songs's Issues

Peer Review (wmx2)

This data analysis project aims to predict which songs will become popular and identify the qualities of a song that make it so. It does this using a large dataset, collected via the Spotify API, of 160K+ songs and their features (e.g. name, danceability, valence, speechiness, etc.). Through this project, one can begin to determine what kinds of features make a song "popular", and thus music producers can start to tailor their music in such a way as to increase its chances of being popular. It also aims to build a recommendation system based on these findings.

The points I enjoyed the most about this proposal were:

  • I enjoyed the specific examples you went over to add credibility to your points, rather than speaking in generalities. It gives me the impression that you know where you want to take this project. I particularly enjoyed the reference to Taylor Swift's folklore album - nice taste in music!
  • I like that you didn't stop just at what made music popular, but also wanted to take it a step further to also consider how to improve Spotify's recommendation algorithm (which is already quite good). What kinds of methods are you thinking of using here to ensure quality recommendations? k-means clustering? Something else? How will you measure the quality of these recommendations?
  • You were very specific about the dataset you chose, and made sure to list all of the important points, like its size and what fields were in it, which gives the impression that you did a good amount of looking into the data. Again, this was quite thorough!

The points I am most concerned about for this proposal are:

  • There are many similar projects to this one done throughout the course. For example, Hot Spotify Tracks is a project done last year with an almost identical premise and goal. The Spotify Effect is a project from another group this year that has the same dataset, but a different goal. How will you make sure to differentiate yourself from these other projects and not create a carbon copy?
  • How is this data going to be publicized for a general (e.g. music producer) audience, if at all? What I mean by this: how will music producers be able to accurately classify, e.g., the "speechiness" of their song (which is a float in 0..1)? This seems rather subjective unless Spotify has publicly released a formula for determining these fields. Does the song have to be uploaded to Spotify first? That would defeat the purpose of being able to create and finalize a hit song before it goes live.
  • Are you going to find the general traits that make a song popular, or are you going to separate it out (e.g. by genre)? In my previous experience with the Spotify API, there was no way to get the genre of a song directly, requiring imperfect workarounds to integrate this data into the dataset. This could add to the messiness of the data at the cost of some accuracy. If this project aims to find general traits, I suspect a large number of popular songs will be classified as "pop", as that is what gets played on the radio most often. If you separate it out by genre, how are you going to (a) find the genre data, which is not in the dataset, and (b) deal with "relative" popularity? Within a certain genre or group of people, a certain song might be a smash hit while being virtually unknown to the rest of the world.

Midterm peer review

3 things I like:

1. The description of the data is very clear and well-structured, including how they cleaned the data and the data type of each column. And the visualizations provide readers with a straightforward insight into the data.
2. I am amazed at their high R-squared and low MSE on both the training set and, especially, the test set. What's more, they also ran an F-test to check whether the explanatory variables are jointly significant, which shows their rigor with this model.
3. It's worth mentioning that this group has done a very deep analysis of their model. For example, they investigate its effectiveness when excluding the "year" variable. These techniques build a deeper understanding of the dataset and create opportunities to lower the errors.

Suggestions on this report:

1. I think it would be better if they included some explanation of each variable at the very beginning; maybe one sentence each is enough. This would help readers more clearly capture the intuition behind this report, especially those who don't know much about music.
2. There is a discrepancy in the labeling of the figures. It seems this was also mentioned in some other reviews.
3. Maybe this report could include more explanatory variables. Because there are so many data points, we don't need to worry about overfitting as much as we do in the homework.

Final peer review

Summary:
This project investigates which features are relevant to a song's popularity, using data from the Spotify API.

What I love about this project:
Your dataset has good coverage, and you provided good descriptions and visualizations of the data and features. I also love your use of statistics to evaluate your model, and your explanation of which models are used and why. You made a great observation that older songs' popularity is predicted more accurately. I think the report is beautifully written, and I especially enjoyed reading the conclusion, although you could've brought more detail to it (since there is new information there).

What I think could be improved:
My research interest is actually in information diffusion, i.e., how stuff gets popular, so this project really intrigues me and I'm super happy to review it. I think it would be better if you kept the artist, because popular artists tend to produce popular songs. By removing this feature, you are actually leaving out a quite important confounding variable. Additionally, some features could be better defined - what exactly are danceability and energy, and how are they measured? I understand that these come from the dataset, but it would've been clearer to the reader if they were defined. I don't quite understand why you decided to make this a binary classification project instead of using regression, and it is not clear why you chose 0.8 as the cutoff for popularity. I'd also love to see your reasoning on why older songs are easier to predict. What does popularity look like by year? Are older songs generally less popular?

Overall, good modeling and a well written report!

Final Peer Review

This project's goal is to understand what makes a particular song popular on Spotify. The team's intention is to build a model that correctly classifies songs as popular or not popular. To do this, they gather historical data for songs from the past 100 years and study it by building several kinds of models.

Some necessary feature transformations and data cleaning are performed. After this, they fit four different kinds of models (LSR, kNN, SVM, and logReg). Performance from all models is high, and the final model chosen can accurately predict a song's popularity on Spotify.
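As a rough illustration of the multi-model comparison described above, here is a minimal scikit-learn sketch on synthetic stand-in data. The feature matrix, labels, and hyperparameters are assumptions for demonstration, not the project's actual setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Spotify data: 500 songs, 5 audio features,
# and a binary popular/not-popular label.
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(500) > 0.9).astype(int)

models = {
    "logReg": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
}

# 5-fold cross-validated accuracy per model, for side-by-side comparison.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```

Comparing several model families against the same cross-validation folds, as the group did, gives a fairer benchmark than a single train/test split.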

Three positive things about the project are:

  • The feature transformations and data preparation process is highly detailed and clearly explained.
  • Data modeling has been conducted with several different approaches. This allows for better comparison benchmarks to measure the accuracy of the models.
  • Overall project is easy to read and the findings are insightful and adequately backed.

Three possible areas of improvement are:

  • The initial approach regarding the popular/not popular division may benefit from some further evidence supporting the split that was conducted.
  • Some of the plots used were not too useful and other plots to explain the dataset could have been added (maybe something related to popularity over the years)
  • I think that even though I agree that this is not a WMD, it may have some dangerous aspects. The performance metric is provided by Spotify, which also provides the rest of the information and is the main beneficiary of promoting/hiding some content. I am somewhat skeptical of the model's high performance, which suggests there may be some hidden correlation between your variables and the target feature.

All in all, the project is interesting and well explained. However, I wonder if the approach is the most accurate one as I would guess that music tastes shift through time, and the "popularity" measure predicted is decided by Spotify which may create a feedback loop. Nevertheless, I think the work was conducted in an orderly manner and the result was very insightful.

Nico Oriol

Midterm Peer Review-yz2772

Summary
This project is about predicting the popularity of Spotify songs based on a number of features such as year, danceability, tempo, and acousticness.

Things I like

  1. Your dataset seems to be fairly large, with about 160,000 rows and 19 columns. Since you are planning to use most of the features (17 out of 19), I feel like you could achieve some impressive prediction results if you explore and analyze your dataset in depth.
  2. I really like how you analyzed each column and selected your features, backed by sound reasoning and explanations.
  3. I like that you presented the R-squared and F-statistics from your regression results to analyze your model, and you also explored the correlations between several features.

Things that can be improved

  1. For such a large dataset, I agree with you that you would definitely want to use k-fold cross-validation on your model, which will probably reduce the error a lot.
  2. You mentioned figures 3 and 4 in your report, but I only saw figures 1 and 2.
  3. I wish you could state your plans for analyzing the rest of the features in your dataset. For example, how are you going to treat features like artist, duration_ms, key, mode, explicit, etc.? Also, you could provide more information on how some of the continuous features are calculated. To me, features like "energy", "danceability", and "liveness" are kind of vague because, for instance, I am not sure what a song with energy = 0.6 would mean.
  4. Since your model was underfitting, I would suggest your group try fitting different models to songs from different years. In other words, you could consider fitting one model for songs from 1920-1960, when Country, Blues, and Jazz were the most popular genres; another for songs from 1960-1975, when Rock, Folk, Funk, and Soul were some of the popular genres; and maybe another for songs from 1975-1990, when Disco, New Wave, and other electronic genres were booming. My point is, the definition of popular music has constantly changed over the past decades, and different music genres can have completely different feature profiles in your dataset. So instead of trying to fit one universal model that predicts all songs from 1920-2020, which will very likely underfit, you could try to generate several models with different selected features for songs from different eras.
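The era-by-era modeling suggested in point 4 could be sketched roughly as follows. This uses synthetic stand-in data with hypothetical column names; the era boundaries follow the examples above:

```python
import numpy as np
import pandas as pd

# Toy stand-in dataset; the real project has ~160k rows with similar columns.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "year": rng.integers(1920, 2021, size=1000),
    "danceability": rng.random(1000),
    "energy": rng.random(1000),
})
df["popularity"] = 0.3 * df["danceability"] + 0.2 * df["energy"] + 0.1 * rng.random(1000)

# Bin songs into eras and fit one least-squares model per era.
eras = pd.cut(df["year"], bins=[1919, 1960, 1975, 1990, 2020],
              labels=["1920-1960", "1960-1975", "1975-1990", "1990-2020"])
era_models = {}
for era, group in df.groupby(eras, observed=True):
    # Design matrix: intercept column plus the two toy features.
    X = np.column_stack([np.ones(len(group)), group["danceability"], group["energy"]])
    coef, *_ = np.linalg.lstsq(X, group["popularity"].to_numpy(), rcond=None)
    era_models[era] = coef  # intercept + one weight per feature
```

Each era then gets its own coefficients, so features can matter differently in, say, the Disco era than in the Jazz era.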

Final Peer Review - mk848

What's it about?
The report is about identifying features that make a song popular on Spotify.

What data are they using?
Spotify Web API was used. In particular, the dataset consisted of 160,000 songs from 1921 to 2020.

What's their objective?
The objective is to use the Spotify API to perform various supervised learning analyses and create models that accurately predict whether a song will be popular on Spotify.

Three things I like about the Midterm Report
• I find the data processing section to be very detailed and useful, as it explains the imputation choices, such as completely dropping unnecessary columns.
• The explanation of the algorithms used and the process by which they were applied allowed me to easily comprehend the overall methodology of the Data Modeling portion.
• The use of the R-squared value was unique and helped me quantitatively understand how well the various models fit the data.

Three areas for improvement
• The data visualization section was incredibly terse due to a lack of explanation of how the various scatterplots and histograms of features help differentiate successful and unsuccessful songs.
• The use of the confusion matrix was confusing, as it seems unhelpful for figuring out how the accuracy test score was obtained.
• The report does not take into consideration the changing popularities across years and the means by which identifying successful songs can be adapted with time.

Midterm Peer Review (mcm382)

What I liked:

  1. I like your plan for the future to incorporate genre into your model. It will be interesting to see how you do that and I think it will help.
  2. Dimension reduction when dropping key and tempo columns was a good idea, maybe go further with this using PCA?
  3. My group is using this dataset as well! It's cool seeing it be used differently.

What could improve:

  1. There are discrepancies in figure labeling. I'm not sure where to find Figures 3 and 4.
  2. How is popularity defined? Is it a measure of how popular a song is currently or when it was first released? If the latter, then I would suggest keeping the year column in your model to account for changes in music tastes over the years.
  3. I would suggest trying out different loss functions, there are some outliers in Figure 1 at the bottom right. I'm curious if they affect your model significantly.

Midterm Review-qw273

Summary

This project aims to predict a song's popularity from a number of descriptive features, such as acousticness and danceability. In the midterm report, the team presents the data cleaning and some preliminary data exploration.

Positives

  • I like the six histograms at the beginning. With labels like “acousticness” and “danceability”, it is really hard to understand the features' meaning without looking at the histogram distributions.

  • I like the Preliminary Analysis part, especially how you did the preliminary regressions and decided to drop some features, such as “tempo”, to get a more accurate prediction.

  • The year feature is definitely worth looking at. As you brought up, there is an overreliance on the year the song was released. My personal reasoning is that popularity is defined in the current era, so it is possible that songs from a specific time are regarded as the “popular trend” now.

Points of Improvement

  • For the correlation part, I feel like a correlation table could help the audience to see what’s going on, which is also more accurate than “largely positively” or “largely negatively.”
  • It would be helpful if you could show some data to the audience so we could understand what kind of dataset you are dealing with. For example, you can show the first 10 rows of the data set.
  • I saw you mention figures 2, 3, and 4 in the Going Forward part, but I could only see figures 1 and 2 attached below.
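A correlation table like the one suggested in the first point takes only a line with pandas. The feature names and data below are illustrative stand-ins, not the project's actual dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few of the Spotify audio features named in the report.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((200, 4)),
                  columns=["acousticness", "danceability", "energy", "popularity"])

# One call gives the full pairwise Pearson correlation table,
# with exact values instead of "largely positively/negatively correlated".
corr_table = df.corr()
```

Showing `corr_table` (or a heatmap of it) would let readers see precise coefficients at a glance.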

Midterm Peer Review

This project is aiming to predict a song's popularity from a number of descriptive features.

Good job with the data preparation and visualizations. Maybe add a sentence restating your objective (I had to go to your proposal to figure out your target variable at first).
One thing for your preliminary analyses I might recommend is to look at a correlation matrix to get exact values for ‘largely positively correlated’. Also you could look at the correlation coefficient between features to see which ones might be redundant.
Good job creating the baseline model and reporting some stats on it. It is helpful to refer back to these and gives a good sense of baseline performance.
In your effectiveness section, you mention figure 3 and figure 4 but I was not sure where to find them (the figures on page 3 are labeled 1 and 2).
I definitely agree in using cross validation and investigating combinations or transformations of features going forward.
The overreliance on the year might be due to the fact that Spotify only grew popular recently, and thus was not around to capture the ‘popularity’ generated by a song's release in older years (kind of like how a song's streams within its first week of release greatly determine its future success - I hope I'm making sense here).
I like your idea to investigate the three separate files to see if it can improve performance. I think this is a good idea (especially by genre).
It also might be beneficial to investigate a subset of artists. I would imagine there are probably a lot of artists who have only one or two songs. Constructing a model using data from a couple of artists with many songs might be interesting to look at.
Overall good job! I think it's an interesting project that can be very useful and applicable.

Midterm Peer Review

Summary

This project is about predicting trends in Spotify song data. Key features in the dataset include metrics on danceability, energy, etc provided by the Spotify API. For the midterm report, they presented some preliminary regression analysis that gave them insights on feature selection.

Positives

  1. I liked that you only handpicked a couple histograms to show.
  2. I liked that you actually tried to run a simple regression analysis to decide which features can be dropped. You also computed some error values/F-statistics to back up your findings/model.
  3. Good work investigating the year feature. There definitely can be a collinear relationship between year and popularity.

Points of Improvement

  1. Pictures speak a thousand words, and it would've been great if, instead of just describing some of the correlations, you had also added some visualizations.
  2. Did you observe anything special about the histograms you included on page 1?
  3. You also could have described more about what your dataset is/what your features are.

Overall good start!

Project Proposal Peer Review (hl2362)

Summary
The project is about the popularity of songs listed on Spotify. The dataset the team will be using is from Kaggle, which includes variables describing the audio characteristics of the song as well as metadata such as name of artist, release date, etc. The main goal of the project is to determine whether or not a song will become popular based on the previously mentioned predictors.

Things I liked

  • The dataset has a large number of rows, covering a long period of time (1921-2020) and many different types of songs. This can be used to build a very robust prediction model.
  • The dataset is very well documented and highly usable. The team would have less trouble dealing with data quality issues.
  • The model can potentially have many practical uses. Artists would be able to know what types of songs listeners enjoy; the company (Spotify) would be able to feature and promote songs likely to be popular, to increase revenue.

Ideas and areas for improvement

  • The team did not provide a threshold that determines whether or not the song is popular (the "popularity" feature in the dataset is on a 0-100 scale). I would suggest the team try out different thresholds to see if it makes a difference on the model.
  • Some of the predictors, most notably the artist, may overshadow other predictors like audio characteristics in deciding the popularity of the song. I suggest the team think of some way to "normalize" the influence of different artists so different artists can have a fair comparison.
  • I suggest the team perform exploratory data analysis to see if the different audio characteristics make a big difference on the popularity of the song. For example, if there are many songs in the dataset that have similar characteristics but are heavily spread out on the popularity scale, then there might be a significant impact on the quality of the model.
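Trying out different popularity thresholds, as suggested in the first point, could look roughly like this sketch on synthetic scores. The 0-100 scale matches the dataset's description, but the candidate cutoffs and data here are assumptions:

```python
import numpy as np

# Toy popularity scores on the dataset's 0-100 scale.
rng = np.random.default_rng(3)
popularity = rng.integers(0, 101, size=10_000)

# Sweep candidate thresholds and record how imbalanced each split is;
# a very high cutoff leaves few positive ("popular") examples.
split_balance = {
    t: (popularity >= t).mean()   # fraction of songs labeled popular
    for t in (50, 60, 70, 80, 90)
}
```

Retraining the model at each threshold and comparing test metrics would then show how sensitive the results are to where the popular/not-popular line is drawn.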

Final Report Peer Review

This project is about predicting trends in Spotify song data. They utilized different models to try to predict whether a given song is popular or not, based on features from Spotify's API.

Overall, the report was really well written. The structure and organization were super clear, and I was able to follow the flow of their logic and the different models they ran. Another strength was that the explanations of why each model was used were really clear. It was clearly demonstrated that the members of this group are knowledgeable about each model and the algorithm behind it. Finally, I thought the confusion matrix was really well done and helped us understand the results of the model.

An area of improvement is the data visualizations. In general, the histograms were useful, but I thought the scatterplots were not as useful, since with so many data points there's no clear trend or conclusion I can draw from them. There could have been other visualizations better suited to your objective. Another area of improvement is to consider using a different popularity threshold, since very few songs actually pass the current one. This might've caused the models to appear more effective than they actually are. Finally, I would've liked to see deeper explanations of why the models worked as well as they did, in addition to the numerical results.

Good job overall! I really enjoyed reading it :)

Peer Review

This project involves determining whether a song will be popular and whether to recommend it to other users through Spotify. The team uses data from Kaggle that shows different characteristics of Spotify songs, such as acousticness, danceability, energy, and popularity. Since there are plenty of characteristics for each song, the researchers believe they can use different supervised models to predict the popularity of songs.
What I like about the proposal:

  1. The dataset they chose is great, and I believe there are plenty of features they can use in fitting their model.
  2. The model they build can be used to predict which kinds of music will be popular, which can be widely used in many music services.
  3. The model they build can also be used in many areas besides music, such as predicting popular movies or TV programs.

What I'll suggest they improve:

  1. There are a lot of supervised models; how do you determine which model is best to use? It would be better to think about that more.
  2. There are plenty of characteristics of the music in your dataset; how do you determine whether a feature is useful?
  3. How do you determine the training and testing data sizes? How do you prevent underfitting and overfitting?

Peer Review from Shelley Li

The project is about predicting the popularity of Spotify songs. The objective is to find a model that uses features of Spotify music to predict whether a song is popular or not, plus a recommendation system that recommends similar songs. A dataset containing 169,909 data points, each with 19 features, is used.

I like that the results of this project can be quite useful, since most of us listen to music in our lives. I also think the dataset is large enough to generate a good model. The feature space is quite large as well; it probably covers all the features that need to be considered, and thus would give a good model.

I am, however, confused about how you would use the generated model to produce a recommendation system. I assume in your proposal you want to use those “unique characteristics” that make a song popular also as factors when considering whether two songs are similar. I am thinking that what makes a song popular may not be enough to indicate whether it is similar to another song; if you'd like to do that, it's better to train another model that predicts whether two songs are similar. In addition, I would recommend deleting some features that may not affect popularity, since too many features may cause overfitting. There are also 33375 unique values for the feature ‘artist’, which are too many. I would recommend rating artists by how popular they were when the song was released (this may require some other information) and taking a weighted average over all artists of the song.
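The artist-rating suggestion above could be sketched like this. The lookup table, artist names, weights, and helper function are all hypothetical illustrations, not part of the project:

```python
# Hypothetical artist-popularity lookup (in practice this would come from
# external data about each artist's popularity at the song's release date).
artist_rating = {"Artist A": 85.0, "Artist B": 40.0, "Artist C": 10.0}

def song_artist_score(artists, weights=None):
    """Collapse a song's artist list into one numeric feature via a weighted average."""
    if weights is None:
        weights = [1.0] * len(artists)          # equal weights by default
    total = sum(weights)
    # Unknown artists default to 0.0 rather than raising an error.
    return sum(artist_rating.get(a, 0.0) * w for a, w in zip(artists, weights)) / total

# A duet weighted 70/30 toward the lead artist.
score = song_artist_score(["Artist A", "Artist B"], weights=[0.7, 0.3])
```

This replaces 33,375 one-hot artist categories with a single continuous feature, which is much easier for the models to use.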

Final Peer Review

Through this project, the team is trying to use features presented on Spotify to identify which songs will be popular. The dataset they are using consists of songs available on Spotify from 1921 to 2020 and their related features. By doing this project, they hope to shed some light on what makes a song popular, and to what extent.

What I like about this project:

  1. The report is clear and concise. Each method they used is clearly explained, and the effectiveness of each model is clearly shown using confusion matrices.
  2. What their dataset includes is clearly explained using a table, and the relevant data cleaning is carried out, with clear explanations of why they did it that way.
  3. They have tried out different methods that are relevant to fit the model.

Some suggestions that I think could be helpful:

  1. For the data visualization part, instead of simply including all the graphs in the report, maybe it is a better idea for the team to select the ones they think are useful, explain what information they drew from each graph, and describe how it helped them make decisions about how to model the data.
  2. I wish the report had talked a little bit more about specifically what features make a song more popular, through discussing the parameters they got for each model, rather than simply stating the performance of each model.
  3. When talking about the popularity of music, we would expect the type of music that is popular to change from year to year (music that is trending now could be very different from what was trending in 2000). The team's training and testing sets both include the year feature, meaning the test set is predicted given information about what 80% of popular/non-popular music looks like. But in reality, especially early in the year, we wouldn't know that information. Maybe it is a good idea to explore whether the model also performs well when it is trained on data from previous years and tested on future songs.
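A temporal split like the one suggested above could be sketched as follows. The column names, cutoff year, and data are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy stand-in: songs with a release year and a popular/not-popular label.
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "year": rng.integers(1921, 2021, size=5000),
    "popular": rng.integers(0, 2, size=5000),
})

# Instead of a random 80/20 split, train on earlier years and test on later
# ones, mimicking the real task of predicting future songs.
cutoff_year = 2010
train = df[df["year"] < cutoff_year]
test = df[df["year"] >= cutoff_year]
```

If performance drops sharply under this split, that would confirm the model is leaning on year-specific information it would not have in practice.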
