This repository hosts the source code for Class Assignment A2, which analyzes text from stock analysts' reports on Zacks.com and descriptions of various startups. It applies several sentiment analysis techniques to evaluate the analysts' opinions, and uses industry classification and topic modeling to study the startup descriptions.
After installing the dependencies with `poetry install`, you can run the entire analytical workflow end-to-end by executing `python -m src.main`.
Remark: it takes around 45 minutes to run the entire pipeline; the bottleneck appears to be the string matching done with `fuzzywuzzy`.
- Text preprocessing for both input dataframes (`zacks_arguments.csv` and `startups.xlsx`) and the financial sentiment dictionaries. The following steps are executed sequentially:
  - Conversion to lower case
  - Removal of special characters via regular expressions
  - Trimming of extra spaces
  - Application of the spaCy pipeline for stop word elimination and lemmatization
  Remark: The preprocessed text is saved in two distinct formats to suit different analytical needs: a list of words is more practical for dictionary-based approaches, while a single string is preferred for LDA topic modeling.
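The steps above can be sketched as follows. This is a minimal, dependency-free illustration: the actual pipeline delegates stop-word removal and lemmatization to spaCy, so the tiny `STOP_WORDS` set below (and the absence of lemmatization) are simplifications, and the function name is illustrative.

```python
import re

# Illustrative stop-word set; the real pipeline uses spaCy's stop-word
# list and lemmatizer instead of this hand-rolled stand-in.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "for", "on"}

def preprocess(text: str, as_list: bool = False):
    """Lowercase, strip special characters, trim spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Two output formats: a word list for dictionary matching,
    # a single string for LDA topic modeling.
    return tokens if as_list else " ".join(tokens)
```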
- Sentiment score calculation using 3 different techniques:
  - Without negation
  - With negation
  - NLTK's VADER compound score
  Remark: Due to the inherent contextual ambiguity in language, lemmatization cannot achieve perfect accuracy. Rather than strictly matching text against a dictionary, I opt for identifying words that are highly similar (a similarity score of 90 or above and a closely matched length) using the `fuzzywuzzy` package. This approach minimizes the loss of information. The `Buy` and `Sell` label distributions of the top 100 sentiment scores across the different techniques are shown below:
- Construct a TF-IDF vector representation of the description column from the processed startup data. We fit a LightGBM classification model using the TF-IDF matrix as the features and the industry as the response. Hyperparameter tuning is done through Optuna, and the model built with the best parameters achieved `{'accuracy': 0.8923263825325767, 'f1': 0.8916645781780539, 'roc_auc': 0.9694871832469132}`. The best combination of hyperparameters is contained in a JSON file in the output folder.
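To make the feature construction concrete, here is a toy TF-IDF computation in plain Python. The pipeline itself would use a library vectorizer (presumably scikit-learn's `TfidfVectorizer`, which adds smoothing and L2 normalization), so the exact weights below differ from production output.

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy TF-IDF: term frequency scaled by log inverse document frequency."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return rows

weights = tfidf([["cloud", "software"], ["cloud", "biotech"]])
# "cloud" appears in every document, so its weight collapses to zero;
# rarer, more discriminative terms get positive weight.
```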
- Construct the Bag of Words matrix from the text, limiting the maximum number of features to 1000 so it will not run into a memory error. Run LDA for both 3 topics and 10 topics:
  - For the 3-topic case, all three topics resemble `Information Technology`.
  - For the 10-topic case, topic 5 is the most similar to `Computer Software and Services`.
- Hyperparameter optimization was conducted using Optuna, as GridSearchCV proved excessively slow for our requirements. Through this process, a configuration of 10 topics was determined to be the most suitable for the text data under consideration.