- Overview
- Methodology
- Components
- Data Extraction and Pre-Processing
- Emotion Classification using NRC Lexicon and LSTM based DNN
- Emotion Classification using Vader Lexicon and LSTM+CNN based DNN
- Sentiment Classification using Vader Lexicon and LSTM+CNN based DNN
- Sentiment Polarity Analysis using Vader Lexicon and Bi-Directional LSTM based DNN
- Running the Code
- Screenshots
- System Configuration steps
- File Descriptions
- Credits and Acknowledgements
Twitter data gives organisations direct insight into public opinion. This project analyses English-language tweets and categorises them by the sentiment and emotion of the user. The literature survey conducted showed promising results for hybrid methodologies in sentiment and emotion analysis, so four different hybrid methodologies are used to analyse tweets across various categories. A combination of classification and regression approaches built on deep learning models such as Bidirectional LSTM, LSTM and Convolutional Neural Network (CNN) is implemented to perform sentiment and emotion analysis of the tweets. A novel approach combining the Vader and NRC lexicons is used to generate the sentiment and emotion polarities and categories. Evaluation metrics such as accuracy, mean absolute error and mean squared error are used to test the performance of the models. A typical business use case for these models is understanding customer opinion towards a business in order to improve its service. Contrary to the suggestions of Google's S/W ratio method, the LSTM models performed better than the CNN models on both the categorical and the regression problems.
The diagram below shows the methodology followed for the project and the analysis therein:
File 'Data Cleaning and Pre-Processing.ipynb' :
- Imports the full dataset containing Twitter tweets for one day (01-Aug-2019)
- Filters the data using Language, Retweets and Hashtags
- Exports the filtered and final data into a .csv file
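The filtering steps above can be sketched with pandas; the column names (`lang`, `text`) and the `RT ` prefix check are assumptions about the raw archive schema, not the notebook's exact code.

```python
import pandas as pd

# Toy stand-in for the raw tweet archive; the real data comes from the
# Archive Team Twitter Stream Grab for 01-Aug-2019.
raw = pd.DataFrame({
    "lang": ["en", "fr", "en", "en"],
    "text": ["Great service! #happy", "Bonjour", "RT @x: old news", "No tags here"],
})

# Keep English-language tweets only
tweets = raw[raw["lang"] == "en"]
# Drop retweets (prefixed with "RT " in the raw text)
tweets = tweets[~tweets["text"].str.startswith("RT ")]
# Keep tweets that contain at least one hashtag
tweets = tweets[tweets["text"].str.contains(r"#\w+")]

# Export the filtered data to CSV, as the notebook does
tweets.to_csv("August01_Tweets_Final.csv", index=False)
print(len(tweets))  # prints 1
```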
File 'NRC_Emotion Category.ipynb' :
- Imports the filtered and final dataset of Twitter tweets
- Performs text analysis on the data
- Applies NRC Lexicon to generate the emotions for each tweet
- Applies the LSTM based DNN to create a model that predicts the emotion based on the tweet
- Generates evaluation metrics for comparison
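The lexicon-labeling step can be sketched as follows, with a toy word-to-emotion dictionary standing in for the full NRC lexicon (the notebook uses the `nrclex` package); the dominant emotion across a tweet's words becomes its label.

```python
from collections import Counter

# Toy stand-in for the NRC emotion lexicon; nrclex maps each word
# to zero or more of NRC's eight emotion categories.
TOY_LEXICON = {
    "love": ["joy", "trust"],
    "hate": ["anger", "disgust"],
    "afraid": ["fear"],
    "great": ["joy"],
}

def dominant_emotion(tweet: str) -> str:
    """Count lexicon emotions over the tweet's words; return the most frequent."""
    counts = Counter()
    for word in tweet.lower().split():
        counts.update(TOY_LEXICON.get(word, []))
    return counts.most_common(1)[0][0] if counts else "neutral"

print(dominant_emotion("I love this great phone"))  # joy counted twice -> "joy"
print(dominant_emotion("Just a plain update"))      # no lexicon hits -> "neutral"
```

The emotion labels produced this way become the targets that the LSTM based DNN is then trained to predict directly from tweet text.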
File 'Vader_Emotion Category.ipynb' :
- Imports the filtered and final dataset of Twitter tweets
- Performs text analysis on the data
- Applies Vader Lexicon along with clustering to generate the emotions for each tweet
- Applies the LSTM and CNN based DNN to create a model that predicts the emotion based on the tweet
- Generates evaluation metrics for comparison
File 'Sentiment Category.ipynb' :
- Imports the filtered and final dataset of Twitter tweets
- Performs text analysis on the data
- Applies Vader Lexicon to generate the sentiment for each tweet
- Applies the LSTM and CNN based DNN to create a model that predicts the sentiment based on the tweet
- Generates evaluation metrics for comparison
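The Vader labeling step can be sketched with the compound-score thresholds conventionally recommended by Vader's authors (±0.05); whether the notebook uses exactly these cut-offs is an assumption.

```python
def sentiment_category(compound: float) -> str:
    """Map a Vader compound score in [-1, 1] to a sentiment class
    using the commonly recommended +/-0.05 thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(sentiment_category(0.62))   # "positive"
print(sentiment_category(-0.40))  # "negative"
print(sentiment_category(0.01))   # "neutral"
```

With `vaderSentiment` installed, the score itself would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`.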
File 'Sentiment Polarity.ipynb' :
- Imports the filtered and final dataset of Twitter tweets
- Performs text analysis on the data
- Applies Vader Lexicon to generate the sentiment polarity scores for each tweet
- Applies the Bi-Directional LSTM based DNN to create a model that predicts the sentiment polarity based on the tweet
- Generates evaluation metrics for comparison
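A minimal sketch of a Bi-Directional LSTM regressor in Keras; the vocabulary size, sequence length and layer widths are placeholder assumptions, not the notebook's tuned values.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000   # placeholder vocabulary size
MAX_LEN = 50         # placeholder padded tweet length

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 64),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(16, activation="relu"),
    # Single linear output: the continuous Vader polarity score
    layers.Dense(1, activation="linear"),
])

# Regression setup: MAE and MSE match the evaluation metrics listed above
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()
```

The single linear output unit is what makes this a regression rather than a classification network, which is why mean absolute error and mean squared error are the evaluation metrics here.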
- Download the base dataset from the link below and store it in the same folder as the code: https://archive.org/details/twitterstream?and[]=year%3A"2019" (only download the 01-Aug-2019 data zip file)
- Execute the "Data Cleaning and Pre-Processing.ipynb" file to generate the final dataset used for analysis
- Execute the respective model .ipynb files to perform the analysis and see the results
In order to run the code, below are the necessary requirements:
- Python and Jupyter Notebook: The code for data extraction and merging is written in Python and distributed as Jupyter notebooks, so a Python installation together with Jupyter Notebook is required to execute it. The following packages are prerequisites:
os, tarfile, pandas, pyspark, vaderSentiment, matplotlib, numpy, re, tensorflow, sklearn, bs4, string, nltk, emoji, nrclex, seaborn, keras, itertools, scikitplot, gensim, operator, pickle, pathlib, nlp_utils
Below are the files and the folders that are part of the project implementation:
- Cleaned Data:
- August01_Tweets_Final.csv: Contains the data used for analysis after filtering the raw tweets.
- Code:
- Data Cleaning and Pre-Processing.ipynb: Contains the code to clean, pre-process and filter the raw Twitter data
- NRC_Emotion Category.ipynb: Contains the code to apply Emotion Classification using NRC Lexicon and LSTM based DNN model
- Sentiment Category.ipynb: Contains the code to apply Sentiment Classification using Vader Lexicon and LSTM+CNN based DNN model
- Sentiment Polarity.ipynb: Contains the code to apply Sentiment Polarity Analysis using Vader Lexicon and Bi-Directional LSTM based DNN model
- Vader_Emotion Category.ipynb: Contains the code to apply Emotion Classification using Vader Lexicon and LSTM+CNN based DNN model
- Archive Team: The Twitter Stream Grab for providing the data used for this project.
- NCI: For setting this challenging project as part of the subject 'Data Mining and Machine Learning 2' in the full-time master's in Data Analytics.