
gov-cuomo's Introduction

Analyzing sentiments of tweets mentioning Gov Andrew Cuomo through the COVID-19 pandemic.

Overview

In this project, I analyzed tweets that mentioned 'cuomo' over three months of the pandemic. I used the pre-trained sentiment model in the flair library to score every tweet and visualized the results.

The first case of COVID-19 in the U.S. state of New York was confirmed on March 1, 2020. I ran sentiment analysis on tweets from 02/01/2020 to 05/27/2020, so that the month before the first case serves as a baseline for mentions.

Scope of the project:

I excluded retweets, replies, and tweets containing links, because I believe retweets and replies are a response to a specific tweet or article rather than a general sentiment towards Gov. Cuomo or his decisions. Excluding links also filters out news articles, since those are shared as links. Finally, I excluded tweets that contain 'chris' to avoid scoring sentiment about Chris Cuomo (brother of Gov. Andrew Cuomo).
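The scope rules above can be sketched as a small filter. This is only an illustration, not the project's actual code: the record layout (a dict with 'text', 'is_retweet' and 'is_reply' keys) and the name in_scope are assumptions, and real scraper output may use different field names.

```python
import re

# Assumed (hypothetical) record layout: each tweet is a dict with
# 'text', 'is_retweet' and 'is_reply' keys.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def in_scope(tweet):
    """Return True if the tweet passes the scope rules described above."""
    text = tweet.get("text", "")
    lowered = text.lower()
    if tweet.get("is_retweet") or tweet.get("is_reply"):
        return False
    if text.startswith("RT @"):  # text-level retweet marker, as a fallback
        return False
    if URL_PATTERN.search(text):  # drops links, and with them news articles
        return False
    if "chris" in lowered:  # avoid tweets about Chris Cuomo
        return False
    return "cuomo" in lowered


sample = [
    {"text": "Cuomo's briefing today was clear", "is_retweet": False, "is_reply": False},
    {"text": "RT @someone: Cuomo said ...", "is_retweet": True, "is_reply": False},
    {"text": "Chris Cuomo on air tonight", "is_retweet": False, "is_reply": False},
    {"text": "Cuomo update https://example.com/a", "is_retweet": False, "is_reply": False},
]
kept = [t for t in sample if in_scope(t)]
```

Note that the 'chris' rule is a plain substring match, so it would also drop tweets containing words such as 'Christmas'; that trade-off follows from the rule as stated.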

Part 1: Data Collection

The conventional Twitter API client, Tweepy, can be used to download the tweets. However, Tweepy has important limitations, such as restricted access to historical tweets and rate limits.

Other libraries such as GetOldTweets3 and twitterscraper provide excellent alternatives, especially when downloading historical data.

There are a few ways of downloading the tweets; they are all provided here. Note: because of errors such as request timeouts, it is advisable to download the tweets in batches (e.g., one day at a time) and to handle timeouts in code. A total of 327,894 tweets were extracted in JSON format. The full raw data can be found in the 'data' subfolder.

The secondary data (COVID-19 counts for New York) was collected from the New York City government website.
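As a rough sketch of the batching advice, the date range can be split into one-day windows and each window retried on failure. The function names daily_windows and fetch_with_retry, and the fetch_day callable, are hypothetical stand-ins for whichever scraper is used. The exception types are also an assumption, since each library raises its own errors.

```python
import time
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield (since, until) date pairs covering start..end one day at a time."""
    day = start
    while day <= end:
        yield day, day + timedelta(days=1)
        day += timedelta(days=1)


def fetch_with_retry(fetch_day, since, until, retries=3, wait_seconds=0.0):
    """Call fetch_day(since, until), retrying on timeouts or connection errors.

    fetch_day is a hypothetical callable wrapping the scraper of your choice.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch_day(since, until)
        except (TimeoutError, ConnectionError):
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(wait_seconds * attempt)  # simple linear backoff


# One window per day of the study period
windows = list(daily_windows(date(2020, 2, 1), date(2020, 5, 27)))
```

Saving each day's result to its own file before moving on means a failed day can be re-run without repeating the rest.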

Part 2: How does flair work?

Flair is a state-of-the-art library for NLP. Sentiment analysis is done using DistilBERT, a smaller, distilled version of BERT.

$ pip install flair

Flair sentiment is based on a character-level pretrained LSTM network built on PyTorch, which takes the context of words into account when predicting the overall label. It is an open-source library with many contributors to its model training, which helps it make good predictions. Because it works at the character level, it is also well equipped to handle typos, which suits tweets, as they are bound to contain typos.

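from flair.data import Sentence
from flair.models import TextClassifier

example_tweet = "insert tweet here"
# Load the pre-trained sentiment classifier (downloaded on first use)
tagger = TextClassifier.load('sentiment')
# predict() expects a Sentence object, not a raw string
sentence = Sentence(example_tweet)
tagger.predict(sentence)
print(sentence.labels)  # predicted label with its confidence score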

Each tweet was scored individually by flair. Here is how it works under the hood:

The left column of the heatmap shows the overall score for the sentence, followed by individual scores for the words. The second tweet has the word 'f*cked' in it. As mentioned above, even when words are misspelled or written incorrectly, flair recognizes them and tags them with a negative score.

Part 3: Analysis

As COVID-19 cases rose, the number of tweets rose almost in step with them, as shown by the graph below. Note that the data covers 02/01/2020 to 05/27/2020.

However, a stacked plot of tweet counts by sentiment shows that negative tweets far outnumber neutral and positive tweets. I have also annotated the graph with some factual events to provide perspective.

We can see the largest spike on March 24th. Looking further into that date, I found that it was when Gov. Cuomo gave a compelling press conference on the coronavirus and asked the federal government for more ventilators. His tweets were quite 'strong' that day.
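The per-day counts behind a stacked sentiment plot could be computed roughly as follows. This is a minimal sketch using only the standard library; the input format (date string and label pairs) and the sample rows are made up for illustration, and how neutral tweets were identified is not stated here, so the NEUTRAL label is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical input: (date string, sentiment label) pairs, one per scored tweet.
scored = [
    ("2020-03-24", "NEGATIVE"),
    ("2020-03-24", "NEGATIVE"),
    ("2020-03-24", "POSITIVE"),
    ("2020-03-25", "NEGATIVE"),
]

# Count labels separately for each day
daily_counts = defaultdict(Counter)
for day, label in scored:
    daily_counts[day][label] += 1

# One aligned series per label, ordered by day; missing labels count as 0
days = sorted(daily_counts)
labels = ["NEGATIVE", "NEUTRAL", "POSITIVE"]
series = {label: [daily_counts[d][label] for d in days] for label in labels}
```

Each list in series lines up with days, so the three lists can be handed to a stacked area plot in whichever plotting library the notebook uses.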

Plotting the tweets as a 'clock plot' made it evident that most tweets were posted after 1:00 PM (13:00 UTC), which was when he held daily press conferences, so the correlation was interesting to see. Even on days with lower counts, the cluster in that timeframe remained.

The times are in UTC.
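A clock plot needs tweets binned by hour of day. The sketch below shows one way to do that binning in UTC. The timestamps are invented examples, the ISO 8601 input format is an assumption about how the data is stored, and hour_histogram is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical timestamps in ISO 8601 with an explicit UTC offset.
timestamps = [
    "2020-03-24T15:35:00+00:00",
    "2020-03-24T15:50:00+00:00",
    "2020-03-24T09:05:00+00:00",
]


def hour_histogram(stamps):
    """Count tweets per UTC hour, returning a list of 24 counts."""
    counts = Counter()
    for stamp in stamps:
        moment = datetime.fromisoformat(stamp).astimezone(timezone.utc)
        counts[moment.hour] += 1
    return [counts[h] for h in range(24)]


per_hour = hour_histogram(timestamps)
# Angle of each hour on a 24-hour clock face, in degrees
angles = [h * 360 / 24 for h in range(24)]
```

The per_hour counts and angles can then be drawn on a polar axis to get the clock-style view.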

The code to produce the results/visualizations is in analysis.ipynb

Part 4: Conclusion

It was interesting to compare the sentiments with the volume of tweets. The trends in the numbers followed an expected pattern, and observing them visually confirmed my prior belief. Projects like this can be vital in understanding reactions to events and concerns. Since social media provides an abundance of data, leveraging it to draw out insights can benefit both small- and large-scale analytics.
