Topic Modelling and Sentiment Analysis on Tweets Using LDA
This project uses LDA to model topics in queried tweets and performs sentiment analysis on those tweets. The concepts used here apply to many other text-analysis problems.
Note: the explanation below is my best attempt, and it is heavily simplified.
Assume there were only 3 topics in the world, and that each topic is represented by a set of word probabilities that sum to 1:
- Programming (0.10*Python + 0.10*C++ + 0.10*C + ... = 1)
- Animals (0.10*Panda + 0.10*Bear + ... = 1)
- Video Games (0.10*Dota2 + 0.10*LeagueofLegends + ... = 1)
Now if we had an article about programming:
"Python is my favorite programming language. It is a dynamically typed language, whereas C++ is static and as a result you run into a lot more errors. However, pointers and references are life so I kinda like C++ a lot too"
As a human, it is obvious that this article is about Topic 1 (programming), isn't it?
Well, what if we scrambled the text:
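To make that intuition concrete, here is a minimal sketch that scores the article against the three toy topics. The word probabilities, the `UNSEEN` fallback value, and the helper names are all made up for illustration. This is not LDA itself (LDA has to discover the topics on its own); it only shows why word probabilities alone are enough to pick the topic.

```python
import math
import re

# Hypothetical topic-word probabilities, invented for this illustration only.
TOPICS = {
    "Programming": {"python": 0.10, "c++": 0.10, "language": 0.10, "pointers": 0.05},
    "Animals": {"panda": 0.10, "bear": 0.10, "python": 0.02},
    "Video Games": {"dota2": 0.10, "leagueoflegends": 0.10},
}
UNSEEN = 1e-4  # assumed small probability for words a topic does not list

ARTICLE = (
    "Python is my favorite programming language. It is a dynamically typed "
    "language, whereas C++ is static and as a result you run into a lot more "
    "errors. However, pointers and references are life so I kinda like C++ a lot too"
)

def tokenize(text):
    # Lowercase and keep runs of letters, digits and '+', so "C++" survives.
    return re.findall(r"[a-z0-9+]+", text.lower())

def log_score(words, topic):
    # Sum of log-probabilities of every word under one topic.
    return sum(math.log(topic.get(w, UNSEEN)) for w in words)

words = tokenize(ARTICLE)
scores = {name: log_score(words, probs) for name, probs in TOPICS.items()}
best = max(scores, key=scores.get)
print(best)  # Programming
```

Log-probabilities are summed rather than multiplying raw probabilities, which avoids underflow on longer texts.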
"It errors. a lot programming However, and references a too static and typed kinda you is into pointers a more Python result run are is so I whereas like language. C++ life C++ a language, my dynamically as lot favorite is"
As humans, we can still kind of tell what the text is about, can't we?
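The reason scrambling does not hurt is that the word counts stay the same. The sketch below shuffles the article with a fixed seed (the seed and variable names are arbitrary choices for this example) and checks that the "bag of words" is unchanged.

```python
import random
from collections import Counter

text = (
    "Python is my favorite programming language. It is a dynamically typed "
    "language, whereas C++ is static and as a result you run into a lot more "
    "errors. However, pointers and references are life so I kinda like C++ a lot too"
)

words = text.split()

# Shuffle a copy so the original word order is kept for comparison.
shuffled = words[:]
random.Random(0).shuffle(shuffled)

# A bag of words is just a word -> count mapping; order plays no role.
bag_original = Counter(words)
bag_shuffled = Counter(shuffled)

print(" ".join(shuffled))
print(bag_original == bag_shuffled)  # True
```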
LDA (Latent Dirichlet Allocation) works in a very similar way.
We feed it a bag of words that it assumes to be related (syntax, order, and punctuation don't matter!), and it tries to infer the topics that could have produced that bag of words.
For this reason, it makes sense to train the model on Wikipedia articles, where the words in each article are related (to an extent).
Requires Python 2.7.x
Gensim requires NumPy and SciPy
pip install numpy
pip install scipy
Requires Gensim
pip install gensim
Requires Flask
pip install flask
Requires VaderSentiment
pip install vaderSentiment
Requires NLTK
pip install nltk
Requires Twython
pip install twython
Requires Langdetect
pip install langdetect
After we have NLTK installed, we have to install its data. We do that by running:
import nltk
nltk.download()
and selecting download all data.
Navigate to:
Download the file named:
"enwiki-latest-pages-articles.xml.bz2"
cd into the directory and run
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 wiki_en
This creates one file that maps each word to a unique id, and another that stores the tf-idf (term frequency-inverse document frequency) vectors. We need both to train the model. This process took 8-10 hours on my computer.
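For a rough idea of what those two outputs contain, here is a tiny pure-Python sketch of a word-to-id mapping and tf-idf vectors over three made-up documents. It uses the textbook formula (term frequency times the log of the inverse document frequency); gensim's exact weighting and normalization may differ, and all names here are illustrative.

```python
import math
from collections import Counter

# Three tiny made-up "articles", already tokenized.
docs = [
    ["python", "is", "a", "language"],
    ["the", "panda", "is", "a", "bear"],
    ["dota2", "is", "a", "game"],
]

# Map each distinct word to an integer id (the role of the word-id file).
vocabulary = sorted({w for d in docs for w in d})
word_to_id = {w: i for i, w in enumerate(vocabulary)}

# Document frequency: in how many documents does each word appear?
df = Counter(w for d in docs for w in set(d))
n_docs = len(docs)

def tfidf(doc):
    # Sparse vector {word id: weight}; common words get weights near zero.
    counts = Counter(doc)
    return {
        word_to_id[w]: (c / float(len(doc))) * math.log(n_docs / float(df[w]))
        for w, c in counts.items()
    }

vectors = [tfidf(d) for d in docs]
print(vectors[0])
```

Words such as "is" and "a" appear in every document, so their weight is zero, while a distinctive word such as "python" gets a positive weight.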
Run the script in this repo named:
"ldatraining.py2"
with all the files obtained from the preprocessing step in the same directory
You need to obtain your own Twitter API keys and put them in the placeholders:
APP_KEY = 'hiE4tBmkZaf1wC5PT4hUBaxMz'
APP_SECRET = '7OEgCKwGrQYAwl5I1kYwXwk14wZc2HBC6GnXFUxZlrrTCnge3C'
twitter = Twython(APP_KEY, APP_SECRET)
The current keys are just placeholders (they will not work). Keys can be obtained from
Surprisingly, some of my earlier approaches, which trained on the tweets themselves rather than the entire Wikipedia corpus, yielded better results. I'm going to try changing the chunk size, the number of passes, and the number of topics to see what happens. Unfortunately, the corpus is so large and training takes so long that you don't really have the option to "play around" much.
I've been experimenting with Word2Vec a lot recently, but I haven't had the success I was looking for yet. I trained a model on the Brown corpus from the NLTK library, but that corpus is too small to give good results. I also tried to load pre-trained vectors from the Google News dataset, but they took far too much RAM and I couldn't get them to work properly on my computer. I'll definitely look into this more and try to get better results.
While the current Flask application works well for visualizing the calculated results, much more could be done. I plan to build a REST API and an Express/Angular/Node web app that communicates with the API and displays the information in a more visually pleasing way.