nlp_class_assignment's Introduction

Summary

This repository hosts the source code for Class Assignment A2, focusing on analyzing text data from stock analysts’ reports on Zacks.com and descriptions of various startups. It employs several sentiment analysis techniques to evaluate the analysts' opinions and uses industry classification and topic modeling to analyze the startup descriptions.

After installing the dependencies with poetry install, you can run the entire analytical workflow end-to-end with the command python -m src.main.

Remark: The full pipeline takes around 45 minutes to run. The bottleneck appears to be the string matching done with fuzzywuzzy.

Processing and Modelling Logic:

  1. Text preprocessing for both input dataframes (zacks_arguments.csv and startups.xlsx) and financial sentiment dictionaries

    • The following steps are executed sequentially:
      • Conversion to lower case
      • Removal of special characters via regular expressions
      • Trimming of extra spaces
      • Application of the spaCy pipeline for stop word elimination and lemmatization

    Remark: The preprocessed text is saved in two formats to suit different analyses. A list of words is more practical for the dictionary-based approaches, while a single string is preferred for LDA topic modeling.
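The first three preprocessing steps can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the function names and the exact regular expression are assumptions, and the spaCy stop word and lemmatization step is left out because it needs an installed language model.

```python
import re


def clean_text(text):
    """Lower-case, strip special characters, and trim extra spaces."""
    text = text.lower()
    # Assumed pattern: keep only letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def to_both_formats(text):
    """Return the cleaned text as (list of words, single string)."""
    cleaned = clean_text(text)
    return cleaned.split(), cleaned


tokens, string = to_both_formats("  Strong-Buy rating,   raised to $5 ")
print(tokens)
print(string)
```

The token list would feed the dictionary-based scoring, and the string form would feed the vectorizer used for topic modeling.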

  2. Sentiment score calculation using 3 different techniques:

    1. Without negation
    2. With negation
    3. NLTK's VADER compound score
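The difference between the first two techniques can be illustrated with a small dictionary-based scorer. This is a sketch under stated assumptions: the word sets, the negation list, and the rule "flip the polarity when a negation word appears within `window` tokens before the match" are all hypothetical, since the README does not describe the exact negation rule.

```python
# Hypothetical negation words; the repository's list is not given.
NEGATIONS = {"not", "no", "never"}


def sentiment_score(tokens, positive, negative, handle_negation=False, window=1):
    """Count positive minus negative dictionary hits in a token list.

    With handle_negation=True, a hit is flipped when one of the
    `window` tokens immediately before it is a negation word.
    """
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in positive) - (tok in negative)  # +1, -1, or 0
        if polarity and handle_negation:
            preceding = tokens[max(0, i - window):i]
            if any(word in NEGATIONS for word in preceding):
                polarity = -polarity
        score += polarity
    return score


# Toy dictionaries standing in for the financial sentiment dictionaries.
positive = {"growth", "strong"}
negative = {"risk"}
tokens = ["growth", "not", "strong", "risk"]

print(sentiment_score(tokens, positive, negative))
print(sentiment_score(tokens, positive, negative, handle_negation=True))
```

Note that negation handling only works if negation words survive preprocessing, so a stop word list that drops "not" would need adjusting.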

    Remark: Because language is contextually ambiguous, lemmatization can never be perfectly accurate. Rather than requiring an exact dictionary match, I use the fuzzywuzzy package to accept words that are highly similar to a dictionary entry (a similarity score of 90 or above) and close to it in length. This keeps the loss of information to a minimum.
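The matching rule in the remark above can be sketched as follows. To keep the example dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, scaled to a 0 to 100 score. The function names and the `max_len_diff` parameter are assumptions; the README only says that lengths must be "closely matched".

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale (stand-in for a fuzzywuzzy ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def fuzzy_in_dictionary(word, dictionary, threshold=90, max_len_diff=1):
    """True if `word` is close enough to any dictionary entry.

    An entry qualifies when its length differs from the word's by at
    most `max_len_diff` and the similarity is at least `threshold`.
    """
    for entry in dictionary:
        if abs(len(word) - len(entry)) > max_len_diff:
            continue  # cheap length filter before the costly comparison
        if similarity(word, entry) >= threshold:
            return True
    return False


dictionary = {"downgrade"}
print(fuzzy_in_dictionary("downgraded", dictionary))  # imperfect lemma still matches
print(fuzzy_in_dictionary("upgrade", dictionary))     # different word is rejected
```

Applying the length filter first skips many pairwise comparisons, which matters because every token is compared against every dictionary entry.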

    [Figure: Buy and Sell label distribution of the top 100 sentiment scores for each technique]

  3. Construct a TF-IDF vector representation of the description column from the processed startup data. A LightGBM classification model is then fitted with the TF-IDF matrix as the features and the industry as the response. Hyperparameters are tuned with Optuna, and the model built with the best parameters achieved {'accuracy': 0.8923263825325767, 'f1': 0.8916645781780539, 'roc_auc': 0.9694871832469132}. The best combination of hyperparameters is stored in the JSON file in the output folder.
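To make the TF-IDF representation concrete, here is a hand-rolled sketch using only the standard library. It is not the repository's implementation, which the README does not name; the function name is hypothetical, and the weighting (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) is one common convention chosen for illustration. The LightGBM and Optuna steps are not shown.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build L2-normalized TF-IDF rows from a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard empty docs
        rows.append([v / norm for v in row])
    return vocab, rows


docs = [["cloud", "software"], ["cloud", "biotech"]]
vocab, rows = tfidf_matrix(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that appear in every description, such as "cloud" here, receive a lower weight than terms that single out one description, which is what makes the matrix useful as classifier input.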

  4. Construct the Bag of Words matrix from the text, capping the vocabulary at 1000 features to avoid a memory error. Run LDA with 3 topics and with 10 topics:

    • For the 3-topic case, all three topics resemble Information Technology

    • For the 10-topic case, topic 5 is the most similar to Computer Software and Services
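The vocabulary cap in step 4 can be sketched as follows: keep only the most frequent terms across the corpus, then count them per document. This is a standard-library illustration with a hypothetical function name, not the repository's code, and it stops at the count matrix; the LDA fit itself is not shown.

```python
from collections import Counter


def bag_of_words(docs, max_features=1000):
    """Build a count matrix restricted to the most frequent terms."""
    # Total frequency of each term across the whole corpus.
    totals = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(tok for tok, _ in totals.most_common(max_features))
    index = {tok: j for j, tok in enumerate(vocab)}

    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc:
            j = index.get(tok)
            if j is not None:  # terms outside the capped vocabulary are dropped
                row[j] += 1
        matrix.append(row)
    return vocab, matrix


docs = [["app", "app", "data"], ["data", "cloud"], ["app", "data"]]
vocab, matrix = bag_of_words(docs, max_features=2)
print(vocab)
print(matrix)
```

The matrix has one column per kept term, so its width is bounded by `max_features` regardless of how many distinct words the descriptions contain.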

  5. Hyperparameter optimization was conducted using Optuna, because GridSearchCV proved too slow. The search found that 10 topics is the most suitable configuration for this text data.
