This repository hosts the source code for Class Assignment A2, which analyzes text from stock analysts' reports on Zacks.com and descriptions of various startups. It applies several sentiment analysis techniques to evaluate the analysts' opinions, and uses industry classification and topic modeling to study the startup descriptions.
After installing the dependencies with `poetry install`, you can run the entire analytical workflow end-to-end by executing `python -m src.main`.
Remark: it takes around 45 minutes to run the entire pipeline; the bottleneck appears to be the string matching done with `fuzzywuzzy`.
- Text preprocessing for both input dataframes (`zacks_arguments.csv` and `startups.xlsx`) and the financial sentiment dictionaries. The following steps are executed sequentially:
  - Conversion to lower case
  - Removal of special characters via regular expressions
  - Trimming of extra spaces
  - Application of the spaCy pipeline for stop word elimination and lemmatization
  Remark: The preprocessed text is saved in two distinct formats to suit different analytical needs: a list of words is more practical for dictionary-based approaches, while a single string is preferred for LDA topic modeling.
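The steps above can be sketched as follows. This is a minimal, dependency-free illustration: the actual pipeline delegates stop-word removal and lemmatization to spaCy, so the tiny `STOP_WORDS` set below (and the absence of lemmatization) are simplifications, and the function name is illustrative.

```python
import re

# Illustrative stop-word set; the real pipeline uses spaCy's stop-word
# list and lemmatizer instead of this hand-rolled stand-in.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "for", "on"}

def preprocess(text: str, as_list: bool = False):
    """Lowercase, strip special characters, trim spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Two output formats: a word list for dictionary matching,
    # a single string for LDA topic modeling.
    return tokens if as_list else " ".join(tokens)
```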
- Sentiment score calculation using 3 different techniques:
  - Without negation
  - With negation
  - NLTK's VADER compound score
  Remark: Due to the inherent contextual ambiguity in language, lemmatization cannot achieve perfect accuracy. Rather than strictly matching text against a dictionary, I opt for identifying words that are highly similar (a similarity score of 90 or above and a closely matched length) using the `fuzzywuzzy` package. This approach minimizes the loss of information. The `Buy` and `Sell` label distributions of the top 100 sentiment scores across the different techniques are shown below:
- Construct a TF-IDF vector representation of the description column from the processed startup data. We fit a LightGBM classification model using the TF-IDF matrix as the features and the industry as the response. Hyperparameter tuning is done through Optuna, and the model built with the best parameters achieved `{'accuracy': 0.8923263825325767, 'f1': 0.8916645781780539, 'roc_auc': 0.9694871832469132}`. The best combination of hyperparameters is contained in a JSON file in the output folder.
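To make the feature construction concrete, here is a toy TF-IDF computation in plain Python. The pipeline itself would use a library vectorizer (presumably scikit-learn's `TfidfVectorizer`, which adds smoothing and L2 normalization), so the exact weights below differ from production output.

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy TF-IDF: term frequency scaled by log inverse document frequency."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return rows

weights = tfidf([["cloud", "software"], ["cloud", "biotech"]])
# "cloud" appears in every document, so its weight collapses to zero;
# rarer, more discriminative terms get positive weight.
```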
- Construct the Bag of Words matrix from the text, limiting the maximum number of features to 1000 so it will not run into a memory error. Run LDA for both 3 topics and 10 topics:
  - For the 3-topic case, all three topics resemble `Information Technology`.
  - For the 10-topic case, topic 5 is the most similar to `Computer Software and Services`.
- Hyperparameter optimization was conducted using Optuna, as GridSearchCV proved excessively slow for our requirements. Through this process, a configuration of 10 topics was determined to be the most suitable for the text data under consideration.