
tmci_project's Introduction

Thematic Analysis of Album Lyrics

Navigating the Repo

Our main code is in the Project Pipeline.ipynb notebook. To run it, first extract 380000-lyrics-from-metrolyrics.zip to obtain lyrics.csv, which the notebook reads as its dataset. The visualizations folder contains the .png files of our visualizations, which can also be seen in the notebook itself. The file lda_tuning_results.csv contains all intermediate results of the grid search we used to optimize our parameters.
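The extraction step can also be scripted. The following is a minimal sketch, not part of the notebook: the helper name `extract_lyrics_csv` is hypothetical, and it assumes only that the archive contains a member whose file name is lyrics.csv.

```python
import zipfile
from pathlib import Path


def extract_lyrics_csv(zip_path, out_dir="."):
    """Extract lyrics.csv from the archive and return the path to it."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        # Find the member named lyrics.csv, whatever folder it sits in.
        members = [m for m in archive.namelist() if Path(m).name == "lyrics.csv"]
        if not members:
            raise FileNotFoundError("lyrics.csv not found in " + str(zip_path))
        extracted = archive.extract(members[0], path=out_dir)
    return Path(extracted)


if __name__ == "__main__":
    archive_path = Path("380000-lyrics-from-metrolyrics.zip")
    if archive_path.exists():
        print(extract_lyrics_csv(archive_path))
```

Run from the repository root, this would place lyrics.csv next to the notebook, assuming the archive stores the file at its top level.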

Abstract

The aim of the project is to evaluate to what extent a musical artist’s work reflects their cultural world. More specifically, we intend to conduct topic analyses on the lyrics of well-established artists with considerably large discographies: Pink Floyd, David Bowie, Black Sabbath, Joy Division, and Metallica. These artists published music in similar timeframes, which makes it easier to compare the results across discographies. Each artist was also chosen for a distinct blend of genres; although Black Sabbath and Metallica are both metal bands, they differ in their subgenres. The relevance and motivation of this project lie in the discipline of cultural studies: we want to see whether there is any within-period, cross-genre similarity between artists’ lyrics. If any similarity is found, we will try to link it to historical events that are common knowledge for the respective time periods, to see whether there is a relationship between an artist’s lyrics and their socio-cultural environment at the time of release.

Research questions

The research questions we would like to address during the project:

  • Is there any within-period, cross-genre similarity between artists’ lyrics?
  • To what extent do lyrics from different time periods share similar content?
  • To what extent do lyrics from varying musical genres share similar content?
  • Is there a relationship between an artist’s lyrics and their socio-cultural environment at the time of release?

Dataset

For this project we need data containing the lyrics of at least some of the songs from each album of each artist in the period under consideration (1970-1980), along with metadata on the artist, the name of the song and, ideally, the publication date of the song. Useful datasets containing song lyrics and some metadata are the following:

We have not yet been able to check whether the artists we want are in these datasets, but we believe that they most probably are.

The datasets contain on average 250000 rows, each of which holds the full text of one song. We are concerned that loading these datasets on our laptops might be heavy on memory, at least until we filter out all the songs we do not want to consider. Hopefully, the lyrics are stored as strings with a reasonably standard encoding.
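One way to ease the memory concern is to stream the file row by row and keep only the songs of interest, so the full dataset never sits in memory. The sketch below is an illustration under assumptions: the column names `artist`, `year`, `song` and `lyrics`, the artist slugs, and the helper name `filter_songs` are all hypothetical and should be adapted to the actual CSV header.

```python
import csv

# Assumed spellings for the five artists named in the abstract; adjust them
# to whatever form the dataset actually uses.
ARTISTS = {"pink-floyd", "david-bowie", "black-sabbath", "joy-division", "metallica"}


def filter_songs(csv_path, artists=ARTISTS, first_year=1970, last_year=1980):
    """Yield rows for the chosen artists and years, reading one row at a time."""
    # Lyrics fields can be long, so raise the csv module's per-field size limit.
    csv.field_size_limit(10**7)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            try:
                year = int(row["year"])
            except (KeyError, TypeError, ValueError):
                continue  # skip rows without a usable year
            if row.get("artist") in artists and first_year <= year <= last_year:
                yield row
```

Because `filter_songs` is a generator, `list(filter_songs("lyrics.csv"))` materializes only the matching rows.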

We will process the lyrics string using the standard Natural Language Processing pipeline.
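As a rough illustration of the first stages of such a pipeline, the sketch below lowercases a lyrics string, tokenizes it and removes stop words. It is a simplified stand-in rather than the notebook's code: the stop-word list is a tiny placeholder, the function name `preprocess` is hypothetical, and lemmatization is left out because it needs an external NLP library.

```python
import re

# A tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on",
              "is", "it", "i", "you", "me", "my"}


def preprocess(lyrics, stop_words=STOP_WORDS):
    """Lowercase, tokenize on letters and apostrophes, and drop stop words."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", lyrics.lower())
    # Keep tokens longer than one character that are not stop words.
    return [t for t in tokens if t not in stop_words and len(t) > 1]


print(preprocess("The lunatic is on the grass"))  # ['lunatic', 'grass']
```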

A tentative list of milestones for the project


Milestones
Week 1:

  • Convert datasets to a usable format; check whether there is enough data on the selected artists; if necessary, decide on new artists - Jaël
  • Extract the selected artists from the datasets; clean data if necessary - Ludovica
  • Develop natural language processing pipeline (tokenize sentences, lemmatize and normalize) - Jaël

Week 2:

Week 3:

  • Visualization of results using Matplotlib - António
  • Results analysis - Jaël, António, Ludovica

Week 4:

  • Documentation - Jaël, António, Ludovica

Week 5:

  • Presentation - Jaël, António, Ludovica

Documentation

This can be added as the project unfolds. You should describe, in particular, what your repo contains and how to reproduce your results.

Sources

A preliminary list of sources we would like to use:

Implementations of topic analysis of lyrics:

Some nice analyses of lyrics:

A nice explanation of topic modeling:

tmci_project's People

Contributors

coppercity17, ludovicaschaerf, jaelkortekaas


tmci_project's Issues

Update 3

Thanks to the lab we managed to get a lot done this week. We have now finished modelling the topics and are working on enhancements and visualizations. Contrary to our expectations, the topics do not magically make sense right away. We think this is partly because they require careful filtering and tuning, and partly because songs typically use general words rather than strongly topical ones.
For this reason we have left the master branch behind and are working on three points: improving the filtering and introducing PoS tagging (jael-branch); modelling at different cardinalities and layers, i.e. topics from all songs from all artists, all songs from one artist, all songs from all artists from one year, and so on (model_different_cardinalities branch); and trying different methods for visualizing our results (antonio-branch).
Over the next couple of days we hope to wrap this up and focus on analyzing our results and evaluating our model (probably against manual topic annotations that we make ourselves on songs picked at random).
That being said, we noticed that some words tend to appear in almost all of our topics, and these words are usually rather meaningless. We were hoping you could give us some suggestions on what would be best to do in this case (e.g. can we retrain the already trained model after deleting those words, or should we immediately delete words that appear too often in each song?).
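One common option, shown here only as a sketch and not as the answer we received, is to drop tokens that occur in a large share of the songs before building the model again. The function name `drop_common_words` and the 0.5 threshold are hypothetical choices for illustration.

```python
from collections import Counter


def drop_common_words(documents, max_doc_fraction=0.5):
    """Remove tokens that occur in more than max_doc_fraction of the documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    limit = max_doc_fraction * len(documents)
    too_common = {tok for tok, n in doc_freq.items() if n > limit}
    filtered = [[t for t in tokens if t not in too_common] for tokens in documents]
    return filtered, too_common
```

If the pipeline uses gensim, as the tutorial mentioned in Update 1 does, `Dictionary.filter_extremes` with its `no_above` parameter offers comparable filtering on the dictionary side.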

Update 1

This week we extracted the data we wanted from our csv file. However, we couldn't find three of the artists we wanted to consider, so we will need to either use an additional csv file or find three new artists.
For the time being, we have started the pipeline following this tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ and we got as far as finishing the preprocessing.
The most recent version of our code is at https://github.com/ludovicaschaerf/TMCI_Project/blob/master/Project%20Pipeline.ipynb.
We have been following the code from the tutorial very closely, so we were wondering whether that is okay or whether we are meant to program it from scratch ourselves.

Update 2

We are now ready to start with the topic modelling, which we will implement after the explanation in the next class. Currently, we have songs from 5 artists (we added 3 new ones), and we have added a column to the original dataframe that contains the bag of words for each lyric, including the 20 most popular bigrams.
As we discussed in class, we stopped following the tutorial, and the current code is all written by us.
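To make the bag-of-words column concrete, here is a minimal sketch of one way to append frequent bigrams to each song's tokens. It is not the notebook's code: it assumes that "most popular" means most frequent across the whole corpus, and the function name `add_top_bigrams` is hypothetical.

```python
from collections import Counter


def add_top_bigrams(token_lists, top_n=20):
    """Append each song's occurrences of the corpus-wide top_n bigrams to its tokens."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    top = {pair for pair, _ in counts.most_common(top_n)}
    bags = []
    for tokens in token_lists:
        bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:]) if pair in top]
        bags.append(list(tokens) + bigrams)
    return bags
```

Joining the two words with an underscore keeps each bigram as a single token, so it can sit in the same bag as the unigrams.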
