This is a data mining project with three goals.
First, I want to test whether characteristics of securities derived from social media have significant power in explaining the time-series variation in daily returns. The Social Media Factor, the “sixth” factor, is distinct from the traditional five factors proposed by Fama and French.
Second, I want to predict excess returns using sentiment scores and the Fama-French factors, and to understand which features explain the most variance.
Third, I want to explore how the extracted sentiment scores relate to, or affect, fluctuations in stock returns.
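
To make the first goal concrete, here is a minimal sketch of a six-factor time-series regression: daily excess returns are regressed on the five Fama-French factors plus a social media factor, and the fit is compared with and without the sixth factor. Everything in it is an assumption for illustration: the data are synthetic, the column label `SMF`, the helper `fit_factor_model`, and the chosen betas are hypothetical, and plain NumPy least squares stands in for whatever estimator the project actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 500

# Synthetic daily factor returns (placeholders, NOT the project's data).
# "SMF" is a hypothetical label for the social media factor.
factor_names = ["Mkt-RF", "SMB", "HML", "RMW", "CMA", "SMF"]
factors = rng.normal(0.0, 0.01, size=(n_days, len(factor_names)))

# Assumed loadings used only to generate the toy excess returns.
true_betas = np.array([1.0, 0.3, -0.2, 0.1, 0.05, 0.4])
excess_return = factors @ true_betas + rng.normal(0.0, 0.002, size=n_days)


def fit_factor_model(y, X):
    """OLS of y on X with an intercept; returns (alpha, betas, r_squared)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return coef[0], coef[1:], 1.0 - ss_res / ss_tot


# Six-factor fit versus the five-factor baseline (first five columns only).
alpha, betas, r2_six = fit_factor_model(excess_return, factors)
_, _, r2_five = fit_factor_model(excess_return, factors[:, :5])

for name, beta in zip(factor_names, betas):
    print(f"{name:>7}: {beta: .3f}")
print(f"R^2 five factors: {r2_five:.3f}, six factors: {r2_six:.3f}")
```

In-sample R² never decreases when a regressor is added, so on real data the comparison would also need a significance test on the sixth loading or an out-of-sample check.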
In this project, I applied several data mining and machine learning techniques:
- FinBERT and LSTM NLP models that convert text into sentiment scores
- Fama-French model with social media as the sixth factor
- PCA dimensionality reduction
- Data wrangling, feature engineering, and model engineering
- K-means clustering
- Regression models: linear, SVR, decision tree, random forest, bagging, voting, gradient boosting, and AdaBoost
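
As an illustration of the PCA step in the list above, the following sketch reduces a standardized feature matrix to its leading principal components using a singular value decomposition. It is a toy under stated assumptions: the features are synthetic (eight columns driven by two latent signals), the helper name `pca_reduce` is hypothetical, and NumPy is used here in place of whichever library the project relies on.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Standardize columns, then project onto the top principal components.

    Returns (scores, explained_variance_ratio) for the kept components.
    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    X = (features - features.mean(axis=0)) / features.std(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    scores = X @ Vt[:n_components].T
    return scores, explained[:n_components]


rng = np.random.default_rng(1)
n_samples = 300

# Synthetic features: two latent signals, each driving four noisy columns.
latent = rng.normal(size=(n_samples, 2))
loadings = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                     [0, 0, 0, 0, 1, 1, 1, 1]], dtype=float)
features = latent @ loadings + rng.normal(0.0, 0.3, size=(n_samples, 8))

scores, explained = pca_reduce(features, n_components=2)
print("reduced shape:", scores.shape)
print("explained variance ratio:", np.round(explained, 3))
```

The reduced `scores` matrix could then be passed to a clustering or regression step in place of the original columns.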
To access the full data source, please visit https://www.kaggle.com/priyapitre/ff-project and scroll to the bottom to download the CSV files and the pre-trained LSTM model.