The carbon footprint of the experiments and GPU model training is 4.06 kg of CO2, as calculated with the Eco2AI library.
In this work, we aim to test the hypothesis that semantic features and context are important in predicting financial market trends, and compare our approach to baseline sentiment-based solutions.
In the conducted experiment, the correlation between sentiment score and price volatility in financial markets was analyzed. The study used a dataset of historical market data covering a 5-year period, together with sentiment scores obtained from the Twitter social network. The sentiment scores were calculated from the sentiments expressed in social media posts throughout the day.
The correlation analysis showed a strong positive relationship between sentiment score and price volatility: as the sentiment score increases, price volatility increases as well. This relationship was found to be statistically significant.
This study provides evidence that sentiment score and price volatility are highly correlated and can be used together to improve the performance of predictive models for financial markets. The results suggest that incorporating the sentiment score as a feature in predictive models leads to improved predictions of price movements and can provide valuable insight into the underlying market dynamics.
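As an illustration of the correlation check described above, here is a minimal sketch in pure Python. The toy sentiment and volatility series are hypothetical and not taken from the actual dataset:

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: daily sentiment scores vs. daily price volatility.
sentiment = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
volatility = [0.02, 0.05, 0.04, 0.09, 0.08, 0.10]
r = pearson_corr(sentiment, volatility)
print(f"Pearson r = {r:.3f}")  # close to 1 for a strong positive relationship
```

A value of r near +1 corresponds to the strong positive relationship reported above; significance would additionally require a hypothesis test (e.g. a t-test on r).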
The goal of this paper was to investigate whether sentence embeddings would yield better results in stock price prediction than a sentiment analysis approach. The experiments showed that the sentiment polarity extraction approach outperformed sentence embeddings in both accuracy and training time when predicting the stock closing price 3 and 5 days ahead.
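A minimal sketch of how targets for 3- and 5-day-ahead prediction can be constructed by shifting the closing-price series. The helper name and toy prices are illustrative, not taken from the project code:

```python
def make_horizon_pairs(closes, horizon):
    """Pair each day's value with the closing price `horizon` days ahead.

    Returns (inputs, targets): inputs[i] is the price at day i and
    targets[i] is the price at day i + horizon.
    """
    inputs = closes[:-horizon]
    targets = closes[horizon:]
    return inputs, targets

closes = [100.0, 101.5, 99.8, 102.3, 103.1, 104.0, 102.9, 105.2]
x3, y3 = make_horizon_pairs(closes, 3)   # targets 3 days ahead
x5, y5 = make_horizon_pairs(closes, 5)   # targets 5 days ahead
print(len(x3), len(x5))  # 5 3
```

Note that a longer horizon leaves fewer training pairs, which is one reason accuracy typically degrades as the forecast horizon grows.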
- `Project_Files`
  - `Financial` contains notebooks regarding stock market data
  - `Twitter` contains notebooks for Twitter dataset exploration, sentence embeddings, and exploration of the sentiment score feature (in a separate notebook)
  - `Preprocessed_Files` stores files produced by the prediction and preprocessing functions, saved in pickle format for later use so that calculations do not have to be repeated every time:
    - sentence embeddings
    - historical predictions of validation datasets
    - sentiment scores from multiple models
    - BERTopic models for each company
    - total dataframes combining all the data required for training
    - Twitter files for all companies
    - etc.
- `TimeSeries_Prediction`
  - `darts_logs` stores checkpoint files for trained models
  - Jupyter notebooks used for model training
  - `helper_funcs` contains multiple helper functions used throughout the project to collect data, preprocess it, and make predictions
  - `models` stores the init files of TFT, N-Linear, and other models used for prediction
  - `scinet` contains an implementation of the SCINet paper, used for time series prediction without covariates
- `Others` contains residual files
  - `png` stores pictures generated by predictions and other visualizations
- `Datasets`
  - `kaggle` contains the Twitter dataset downloaded from Kaggle
  - `market` contains market data downloaded manually from Yahoo Finance
- `emission.csv` is a file generated by the Eco2AI library containing information about the CO2 emissions produced during the training process
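The `Preprocessed_Files` folder effectively serves as a cache. A minimal sketch of that pattern is shown below; the file name, helper name, and toy data are illustrative, not the project's actual code:

```python
import os
import pickle

def cached(path, compute):
    """Load a pickled result from `path`, or compute it and pickle it once."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Usage: an expensive step runs once, then is reloaded from disk on reruns.
scores = cached("sentiment_scores.pkl", lambda: {"AAPL": 0.42, "TSLA": -0.13})
```

On the second call with the same path, the `compute` function is skipped entirely, which is what saves the repeated calculations mentioned above.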
To run this project, install the dependencies locally using either conda or pip. The `environment.yml` file stored in the root folder of this repository lists all Python libraries on which the notebooks depend; you can replicate the environment using the following conda commands.
- First, create the environment:
  `conda env create -f environment.yml`
- Then, activate it:
  `conda activate hse-stock`
- Verify that the new environment was installed correctly:
  `conda env list`
Alternatively, the following command installs the packages according to the configuration file `requirements.txt` stored in the root folder of this repository:
`pip install -r requirements.txt`
Some of the files could not be uploaded to the GitHub repository due to storage limitations. For this reason, all the notebooks stored in the project repository are saved with their outputs and do not need to be rerun to see the results. All of the missing files will be generated automatically if you run the functions, but this will take considerable time.