bert-topic-modeling's Introduction

Topic Modeling

A topic model is a type of statistical model for discovering abstract "topics" that occur in a collection of documents. It provides a general summary of the topics being discussed in the data and the associations between those topics.

Project Motivation

I was familiar with traditional topic modeling techniques like Latent Dirichlet Allocation (LDA), but I wanted to explore BERTopic, a more recent and more sophisticated topic modeling algorithm. It overcomes certain drawbacks of LDA, such as heavy pre-processing requirements, unstable results and high computational cost, by leveraging techniques like UMAP, c-TF-IDF and word embeddings.

Problem Statement

With the emergence of meme stocks and several online forums for stock trading, it has become necessary for financial institutions to factor in market sentiment from such sources when making investment decisions. A case in point is the GameStop short squeeze, which caused major financial consequences for certain hedge funds and large losses for short sellers.

r/WallStreetBets on Reddit is one of many public forums where people discuss recent market trends and express their sentiments about them. This project focuses on WallStreetBets (the forum behind the GME short squeeze mentioned above) to identify popular topics and stocks being discussed in its comments and posts, and to recommend stocks to buy based on that analysis.

Installation

  • BERTopic
  • praw (Python Reddit API Wrapper)
  • pmaw (Pushshift Multithread API Wrapper)
  • scikit-learn (imported as sklearn)
  • joblib==1.1.0 (needed because of conflicts with BERTopic): pip install --upgrade joblib==1.1.0

Data Source

For Topic Modelling

  • Scraped ~500k comments & posts from the subreddit r/Wallstreetbets
  • Date range: Sept 1, 2022 to Sept 30, 2022

For Sentiment Analysis

  • Scraped data to find the top stock tickers mentioned from Sept 1, 2022 to Sept 30, 2022
  • Scraped data for only those top tickers to perform sentiment analysis
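
Finding the top tickers in scraped text can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the KNOWN_TICKERS set, the function name top_tickers and the regular expression are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical whitelist; a real run would load a full list of listed symbols.
KNOWN_TICKERS = {"GME", "AMC", "TSLA", "AAPL", "SPY"}

# Matches 1-5 uppercase letters, optionally prefixed with "$" ("$GME" or "GME").
TICKER_RE = re.compile(r"\$?\b([A-Z]{1,5})\b")


def top_tickers(texts, n=10):
    """Return the n most-mentioned known tickers as (ticker, count) pairs."""
    counts = Counter()
    for text in texts:
        for symbol in TICKER_RE.findall(text):
            # The whitelist filters out ordinary uppercase words such as "YOLO".
            if symbol in KNOWN_TICKERS:
                counts[symbol] += 1
    return counts.most_common(n)
```

The whitelist matters because uppercase words are common on the forum; without it, words like "YOLO" or "I" would be counted as tickers.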

Stock Simulator - Retrospect

  • Scraped Yahoo Finance for actual monthly closing prices

Approach

For Topic Modelling

  • Remove comments with fewer than 10 words so that only well-explained opinions remain
  • Only keep content with more than 5 upvotes to weed out irrelevant content
  • Remove outlier topics (taken care of by BERTopic, which assigns outlier documents to topic -1)
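
The filtering steps above can be sketched as below. This is an illustrative outline rather than the project's code: the function names and the (text, upvotes) input shape are assumptions, and BERTopic is used with its default settings.

```python
def keep_submission(text, upvotes, min_words=10, min_upvotes=5):
    """Apply the two filters described above to a single comment or post."""
    # Comments with fewer than 10 words are removed, so 10 or more are kept.
    has_enough_words = len(text.split()) >= min_words
    # Only content with more than 5 upvotes is kept, so the comparison is strict.
    has_enough_upvotes = upvotes > min_upvotes
    return has_enough_words and has_enough_upvotes


def filter_submissions(rows):
    """rows: iterable of (text, upvotes) pairs; returns the texts that pass."""
    return [text for text, upvotes in rows if keep_submission(text, upvotes)]


def fit_topics(docs):
    """Fit BERTopic with default settings and drop documents in the outlier topic."""
    from bertopic import BERTopic  # lazy import; requires `pip install bertopic`

    topic_model = BERTopic()
    topics, _probs = topic_model.fit_transform(docs)
    # BERTopic labels documents it cannot cluster as topic -1.
    kept = [(doc, topic) for doc, topic in zip(docs, topics) if topic != -1]
    return topic_model, kept
```

A typical call would be fit_topics(filter_submissions(rows)), where rows holds the scraped comments and posts with their upvote counts.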

For Sentiment Analysis

  • Modify VADER's sentiment lexicon using terms we got from topic modelling
  • Assign +2 to positive-sentiment terms and -10 to negative-sentiment terms
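
One way to apply such scores is to update the analyzer's lexicon dictionary, as in the sketch below. It assumes the vaderSentiment package (the project may have used NLTK's copy of VADER instead), and the helper names and example terms are hypothetical.

```python
def build_lexicon_overrides(positive_terms, negative_terms,
                            positive_score=2.0, negative_score=-10.0):
    """Map each term to the score it should receive in VADER's lexicon."""
    # VADER looks tokens up in lowercase, so keys are lowercased here.
    overrides = {term.lower(): positive_score for term in positive_terms}
    overrides.update({term.lower(): negative_score for term in negative_terms})
    return overrides


def make_analyzer(overrides):
    """Return a VADER analyzer whose lexicon includes the overrides."""
    # Lazy import; requires `pip install vaderSentiment`.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    # The lexicon is a plain dict of term -> valence, so update() is enough.
    analyzer.lexicon.update(overrides)
    return analyzer
```

For example, make_analyzer(build_lexicon_overrides(["moon"], ["bagholder"])).polarity_scores("GME to the moon") returns VADER's usual dictionary of neg, neu, pos and compound scores, now influenced by the added terms. The terms "moon" and "bagholder" are placeholders, not the lexicon the project derived.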

Stock Simulator - Retrospect

  • Perform sentiment analysis on the top 10 tickers in windows of 90, 60 and 30 days
  • Get actual data from Yahoo Finance for the same top 10 tickers to compare against our recommendations
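
The comparison step can be sketched as a simple return calculation over closing prices. The function names, the ticker symbols and the prices in the usage note are made up for illustration; the project's real figures came from Yahoo Finance.

```python
def percent_return(start_close, end_close):
    """Percentage change between two closing prices."""
    return (end_close - start_close) / start_close * 100.0


def score_recommendations(recommended, closes):
    """Compare recommended tickers against actual closing prices.

    closes maps ticker -> (closing price at start, closing price at end).
    Returns each ticker's percentage return and how many ended in profit.
    """
    returns = {ticker: percent_return(*closes[ticker]) for ticker in recommended}
    profitable = sum(1 for value in returns.values() if value > 0)
    return returns, profitable
```

With hypothetical closes of {"AAA": (100.0, 110.0), "BBB": (50.0, 40.0)}, score_recommendations(["AAA", "BBB"], closes) reports one profitable ticker out of two.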

Novel insight

image

The Queen of England's death is closely associated with topics related to inflation (the cluster of topics at the center).

Future scope

  • Use an embedding-based sentiment analyzer (e.g., Flair) instead of a heuristic-based technique like VADER. We stopped at VADER since it was giving good results (8 of the 10 recommended stocks were in profit)
  • Create a multi-processing scraper using joblib for faster scraping
