
stock-market-prediction-via-google-trends's Introduction


About

The data used is downloaded from Google Trends. The concept for this project comes from research by Tobias Preis, Helen Susannah Moat, and H. Eugene Stanley, "Quantifying Trading Behavior in Financial Markets Using Google Trends". This research found that the search volume for certain (financial) words is linked to the price of the Dow Jones Industrial Average and can, in most cases, predict a dip in the market. The purpose of this project is to combine this research with machine learning.

Results

Two machine learning algorithms have been explored for this project: XGBoost and MLPClassifier. The MLPClassifier clearly performed better than XGBoost: XGBoost's best model achieved an annual return of 44.2%, whereas MLPClassifier's best model achieved 91.3% between 2008 and the present. A large contribution to these unusually high annual returns was the coronavirus: the resulting stock market crash was a major source of profit for these algorithms.

MLPClassifier

MLPClassifier performed very well on the test data. The algorithm was strong at recognising that the small day-to-day changes between crashes are impossible to predict. Thus, for the most part, it held a buy-and-hold strategy, but during a stock market crash (like the corona crash) or other, somewhat larger, moves, it performed well, as can be seen in figure 8.

Comparison of the MLPClassifier, 10,000 random simulations and a buy-and-hold strategy

Figure 8. Comparison of the mean plus and minus one standard deviation of 10,000 random simulations, the MLPClassifier algorithm, and a buy-and-hold strategy.

XGBoost

XGBoost lacked the insight that MLPClassifier had: it tried to predict the small changes, and ultimately failed at that. However, XGBoost was still able to predict the stock market crash caused by the coronavirus, which is why it still achieved such a large annual return (44.2%).

Comparison of the MLPClassifier, XGBoost, 10,000 random simulations and a buy-and-hold strategy

Figure 9. Comparison of the mean plus and minus one standard deviation of 10,000 random simulations, the MLPClassifier algorithm, the XGBoost algorithm, and a buy-and-hold strategy.

Data

Data Collection

Two datasets were needed for this project: daily Google Trends data for a specific keyword, and daily stock price data for a specific ticker. To collect the Google Trends daily data, you have to download all 6-month increments, all 5-year increments, and the 2004–present data within the 2004–2020 timespan. All this data is eventually adjusted to be relative across the whole timespan, instead of only within its respective increment. To collect the daily stock price data for the ticker you want to predict, download its historical data from a website like Yahoo Finance.

Data Visualisation

Correlation

To show that there is indeed a correlation between Google Trends data (e.g. 'debt') and stock prices (e.g. the Dow Jones Industrial Average), I plotted the DJIA stock price with indicators of peaks in the search volume for 'stock market'. As you can see, before a major stock market crash there are usually some peaks to be observed. There are also some peaks in the middle of a crash, but the peaks before the crash are quite indicative.

DJIA stock price data with peak-indicators of 'stock market'.

Figure 1. A graph of the DJIA stock price with red dots where a peak in the search volume for 'stock market' has been observed. From this graph it can be observed that erratic movement in search volume precedes a major stock crash.

Adjusted

After all adjustments, which make the daily data relative across the entire timespan instead of only within its own increment, the data looks as follows:

Adjusted daily data over entire timespan.

Figure 2. A graph in which the adjusted daily data is visualised.

Restrictions

All data on Google Trends is relative (0–100) within one timeframe, and you can only get daily data in 6-month increments, weekly data in 5-year increments, and monthly data over the entire available timespan. Aggregating all the data needed for this project was therefore quite a challenge, and because of these restrictions the result isn't completely accurate. However, the method I used was the only way to get daily data over the entire available timespan, which is crucial for this project.

Method

To make all the data relative to each other, instead of only within its 6-month increment, I had to merge the increments based on the weekly data. However, the weekly data is only available in 5-year increments, so I had to merge those 5-year increments based on the monthly data, which is available for the entire timespan needed for this project. To merge the 6-month and 5-year increments, I computed the percentage change of each data point within its respective increment. Then I took one data point per increment from the coarser periodicity and reconstructed the missing days by applying the percentage changes to that data point.
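A minimal sketch of this merging step, assuming each increment is a pandas Series of daily values and `anchors` is the coarser weekly (or monthly) series; the function name and signature are hypothetical, not taken from this project's scripts:

```python
import pandas as pd

def merge_increments(increments, anchors):
    """Rescale fine-grained increments so they are relative to each other.

    increments: list of pd.Series, each relative only within its own timespan.
    anchors:    pd.Series of coarser data spanning all increments, providing
                one known reference point per increment.
    """
    adjusted = []
    for inc in increments:
        # Percentage change within the increment preserves its shape.
        pct = inc.pct_change()
        # Start from the anchor value at the increment's first day and
        # rebuild the series forward by compounding the percentage changes.
        rebuilt = [anchors.asof(inc.index[0])]
        for change in pct.iloc[1:]:
            rebuilt.append(rebuilt[-1] * (1 + change))
        adjusted.append(pd.Series(rebuilt, index=inc.index))
    return pd.concat(adjusted)
```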

Example

An example for the search term 'debt' ('debt' is the best search term for predicting market change, according to the research mentioned earlier) in the timespan 2007–2009:

Before adjustments

Before adjustments of example.

Figure 3. A graph where the unadjusted relative daily data is visualised. The black vertical lines indicate the edges of the 6-month increments.

After adjustments

After adjustments of example.

Figure 4. A graph where the adjusted relative daily data is visualised. The graph follows the actual weekly data much better.

Weekly

Actual weekly data.

Figure 5. The actual weekly data.

Features

To get better results, the raw data had to be feature engineered. The features used include the simple moving average delta and Bollinger bands, described in the subsections below.

After computing these features, each of them is shifted by 3 through 10 days: Google Trends data only becomes available three days after the fact, and the target may correlate better with further-shifted data. This results in 272 features. The 50 features that correlate best with the target (according to the Pearson correlation coefficient) are used for training and for predicting the direction of the Dow Jones Industrial Average.
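A minimal sketch of the shifting and selection step, assuming a pandas DataFrame of engineered features and a binary up/down target; the names `shift_and_select`, `features`, and `target` are hypothetical (shifts of 3 through 10 days give the 8 lagged copies per feature described above):

```python
import pandas as pd

def shift_and_select(features: pd.DataFrame, target: pd.Series, top: int = 50):
    """Lag every feature by 3-10 days and keep the best correlators."""
    shifted = {}
    for col in features.columns:
        for lag in range(3, 11):  # Trends data lags 3 days behind reality
            shifted[f"{col}_lag{lag}"] = features[col].shift(lag)
    frame = pd.DataFrame(shifted).dropna()
    # Rank by absolute Pearson correlation with the up/down target.
    corr = frame.corrwith(target.loc[frame.index]).abs()
    return frame[corr.nlargest(top).index]
```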

Simple Moving Average Delta

SMA delta.

Figure 6. When this feature becomes more volatile, the close price follows, which makes it a good indicator for a machine learning algorithm. It can also be seen that the close price percentage change loosely follows the line of the feature.
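The README doesn't spell out how the delta is computed; a minimal sketch under one plausible reading (the day-over-day change of a simple moving average, with a hypothetical 20-day window):

```python
import pandas as pd

def sma_delta(series: pd.Series, window: int = 20) -> pd.Series:
    # Day-over-day change of the simple moving average.
    return series.rolling(window).mean().diff()
```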

Bollinger Bands

Bollinger bands.

Figure 7. When the 20-day simple moving average crosses the upper Bollinger band, the close price becomes more volatile. The stock close percentage change also loosely follows the lower Bollinger band.
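Figure 7 refers to the standard Bollinger band construction; a minimal sketch, with the 20-day window taken from the caption and the usual two standard deviations as an assumption:

```python
import pandas as pd

def bollinger_bands(close: pd.Series, window: int = 20, k: float = 2.0):
    """Standard Bollinger Bands: an SMA plus/minus k rolling standard deviations."""
    sma = close.rolling(window).mean()
    std = close.rolling(window).std()
    return sma, sma + k * std, sma - k * std
```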

Project Organisation

    ├── LICENSE
    ├── Makefile           <- Makefile with commands like `make data` or `make train`
    ├── README.md          <- The top-level README for developers using this project.
    ├── data
    │   ├── processed      <- The final, canonical data sets for modeling.
    │   └── raw            <- The original, immutable data dump.
    │
    ├── docs               <- A default Sphinx project; see sphinx-doc.org for details
    │
    ├── models             <- Trained and serialized models, model predictions, or model summaries
    │
    ├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
    │                         the creator's initials, and a short `-` delimited description, e.g.
    │                         `1.0-jqp-initial-data-exploration`.
    │
    ├── references         <- Data dictionaries, manuals, and all other explanatory materials.
    │
    ├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
    │   └── figures        <- Generated graphics and figures to be used in reporting
    │
    ├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
    │                         generated with `pip freeze > requirements.txt`
    │
    ├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
    └── src                <- Source code for use in this project.
        ├── __init__.py    <- Makes src a Python module
        │
        ├── data           <- Scripts to download or generate data
        │   └── make_dataset.py
        │
        └── features       <- Scripts to turn raw data into features for modeling
            └── build_features.py

MIT License

Copyright (c) 2020 Cristian Perez Jensen


stock-market-prediction-via-google-trends's Issues

README for `data`-folder

Description

Create a README.md for the data folder, with multiple graphs visualising the data in this folder. The plots should be made using seaborn; all plots for this project will be made with seaborn so that all READMEs are consistent with each other.
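A minimal seaborn sketch of the kind of plot intended here; the file name and columns are made up, since the actual files live under data/processed:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for a processed CSV in the data folder.
df = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=90, freq="D"),
                   "search_volume": (pd.Series(range(90)) % 30) + 40})
sns.lineplot(data=df, x="date", y="search_volume")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("search_volume.png")
```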

Acceptance Criteria

When a README has been made. The graphs should be clear, and there should be multiple plots (look in plot galleries for inspiration).

Why?

This is required because it shows people what the data looks like. They can of course also look at the deployment, but graphs in a README are more accessible.

Configurable model

Description

Make a neural network which is configurable to the user's liking.

Acceptance Criteria

When a configurable neural network model has been made. This will most likely be done using TensorFlow.

Why?

Because one size doesn't fit all in neural networks. The model has to be configurable to account for this.

Search box change

Description

The search box should only change the graph when a search term is selected.

Acceptance Criteria

When the graph only changes after hitting "enter" or clicking a search term suggestion, and not in any other way.

Why?

Because otherwise the graph will be empty half the time, and it causes lag because the graph keeps updating.

Search box suggestions

Description

The suggestions shown by the search box while typing need to (1) look nice and (2) be clickable.

Acceptance Criteria

When the above criteria are fulfilled.

Why?

This makes the search box easier to use and nicer to look at.

Combine ML features

Description

Combine features and figure out which features are the best for making predictions.

Acceptance Criteria

When the best possible machine learning algorithm, with the best possible features, has been found.

Why?

Better features mean a better algorithm. This could strengthen the machine learning model and improve its accuracy.

Feature engineer percentage changes.

Description

Instead of using absolute values, relative values would be more valuable. They would also be easier to compute when making predictions after deployment.

Why?

Because percentage changes are easier to retrieve after the fact: there is no need to normalise future data points against the existing data points (Google Trends is quite annoying in this regard). Percentage changes are always the same; the fact that Google Trends data is relative doesn't affect them.
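A quick demonstration of the last point with pandas (the numbers are toy data, not real search volumes): rescaling a series by a constant leaves its percentage changes untouched.

```python
import pandas as pd

# Toy series standing in for relative search volume.
debt = pd.Series([60.0, 66.0, 63.0, 72.0])
print(debt.pct_change())          # NaN, 0.10, -0.045..., 0.142...
print((debt * 0.5).pct_change())  # identical: the rescaling cancels out
```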

Update docstrings in `make_dataset.py`

Description

Make all docstrings in make_dataset.py comply with the Google Docstrings Style.

Why?

This is the convention that has been chosen for this project. It has already been implemented in build_features.py.

Feature based on days since last peak

Description

A feature which is essentially the number of days since the last peak. Peaks can be found using the scipy.signal library. Only major peaks should be used, not all the small ones (of which there are hundreds); for example, only the peaks preceding a stock crash. Define the characteristics of these peaks so that the algorithm can look for the same kind of peaks in the future. This could definitely help with prediction.

Also, this feature could be one-hot encoded. For example:

Days since last peak

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > 15
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
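A minimal sketch of how this feature could be built, assuming a pandas Series of daily search volume; the `find_peaks` prominence threshold is a placeholder, since defining a 'major' peak is exactly one of the acceptance criteria below:

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

def days_since_last_peak(volume: pd.Series, cap: int = 15) -> pd.DataFrame:
    """One-hot encode the number of days since the last major peak."""
    # Placeholder prominence threshold; would need tuning per keyword.
    peaks, _ = find_peaks(volume.to_numpy(), prominence=volume.std())
    peak_set = set(peaks)
    days = np.full(len(volume), cap + 1)  # cap + 1 maps to the '> 15' bucket
    last = None
    for i in range(len(volume)):
        if i in peak_set:
            last = i
        if last is not None:
            days[i] = min(i - last, cap + 1)
    columns = [str(d) for d in range(cap + 1)] + [f"> {cap}"]
    onehot = pd.DataFrame(0, index=volume.index, columns=columns)
    for i, d in enumerate(days):
        onehot.iloc[i, d] = 1
    return onehot
```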

Acceptance Criteria

Three criteria:

  • Whether or not this feature is possible;
  • Definition of the characteristics of the peaks preceding a stock crash (may have to be done per keyword);
  • Best way of presenting this feature to the machine learning model.

Why?

Any feature that could improve accuracy is a feature that should be explored.

new complementary tool

My name is Luis, I'm a big-data machine-learning developer, I'm a fan of your work, and I usually check your updates.

I was afraid that my savings would be eaten by inflation, so I have created a powerful tool based on past technical patterns (volatility, moving averages, statistics, trends, candlesticks, support and resistance, stock index indicators): all the ones you know (RSI, MACD, STOCH, Bollinger Bands, SMA, DEMARK, Japanese candlesticks, Ichimoku, Fibonacci, WilliamsR, balance of power, Murrey Math, etc.) and more than 200 others.

The tool creates prediction models of correct trading points (buy and sell signals; every stock is traded well in time and direction). For this I have used big-data tools like pandas, stock market libraries like tablib, TAcharts and pandas_ta for data collection and calculation, and powerful machine-learning libraries such as sklearn.RandomForest, sklearn.GradientBoosting, XGBoost, Google TensorFlow and Google TensorFlow LSTM.

With models trained on a selection of the best technical indicators, the tool is able to predict trading points (where to buy, where to sell) and send real-time alerts to Telegram or mail. The points are calculated based on learning the correct trading points of the last two years (including the change to a bear market after the rate hike).

I think it could be useful to you. I would like to share it with you so it can improve; if you are interested in improving it and collaborating, I am willing, and if not, just file it away.

README for the deployment folder

The deployment needs to be shown via GIFs. No one will open the index.html in a live server, so this is the easiest and best way to show it.

Pull stock data

Description

The stock price data has to be pulled so it can be fed to the neural network. However, weekends aren't included in the stock price data, so it has to be manipulated to fit. Either the weekends have to be skipped, marked as NaN, or given the same data as the Friday before (or the last working day before the gap). Research has to be done on this subject.
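A minimal sketch of the forward-fill option with pandas (function and variable names hypothetical); the other option marks the missing days as NaN by dropping the `ffill` call:

```python
import pandas as pd

def fill_non_trading_days(prices: pd.DataFrame) -> pd.DataFrame:
    """Reindex onto a full calendar, carrying the last close through
    weekends and holidays (the forward-fill option described above)."""
    full = pd.date_range(prices.index.min(), prices.index.max(), freq="D")
    return prices.reindex(full).ffill()
```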

Acceptance Criteria

When a viable solution to the weekend/holiday problem has been found and incorporated.

Why?

The neural network has to be given an outcome (a target) for all of the Google Trends data.

Adjust Google Trends data

Description

The data pulled from Google Trends has to be adjusted after it has been pulled. The adjustments have to be made according to the Method section of the README.md.

Acceptance Criteria

This issue will be considered done when, after the data from Google Trends has been pulled, it is automatically adjusted and exported to a .csv file, as described in Method.

Why?

This feature is required because, in order to feed data to the neural network, the structure of the data has to stay consistent.

How can I set the location to Global?

Hello, thank you for this repository. I've been doing some research and all of your methodology was very helpful. But I have a question: how can I set the API to get daily data without the geo parameter? Every time I try, I get an error. Thanks for your time!

Pull data from Google Trends

Description

Pulling the data from Google Trends has to be quick and automatic. Google Trends doesn't have an official API, so it will have to be done from scratch.

Acceptance Criteria

This issue will be considered closed when a script, effectively an unofficial API, is able to pull data from Google Trends.

Why?

This feature is necessary because, if users want to use another search term for their instance, they would otherwise have to spend hours collecting all the data; with this feature it only takes seconds or minutes. It would also provide a foundation for collecting new data from Google Trends while actively using this program.

Structure

Description

Structure the project according to the cookiecutter data science template.

Why?

Because it makes the project more structured and helps people navigate through it.

Update the main README

Description

The main README is quite outdated; a lot more progress has been made since its creation, so an updated README should be made. Things to cover:

  • New plots (made with Seaborn);
  • Look over the existing text and determine whether it is still usable;
  • More text, containing information on:
    • the machine learning model;
    • the deployment of the webpage;
    • the feature engineering and its various methods.

Acceptance Criteria

When a good-looking README has been made that is up to par with the Google documentation style.

Why?

This feature is required to get more people interested in the project. It helps others understand the project and the decisions made.

Initial letter doesn't work on Google Chrome

Description

On Google Chrome, initial-letter is not supported (it is in Safari). Thus, a way has to be found to make the initial letter also work on Google Chrome. This could be the fix.

Acceptance Criteria

When the drop cap works on all web browsers.

Why?

Because accessibility is important, and accessibility means supporting all browsers.

Opacity problem

Description

The opacity doesn't go to 1 when the page is being scrolled quickly.

Update README.md with Data Collection information

Description

Update the README.md with information on how the data_collector.py script works.

Acceptance Criteria

When a technically written paragraph about the data_collector.py script has been added to the README.md.

Why?

To improve the documentation.

Change all " to '

Description

Consistency, consistency, consistency...

A lot of " in make_dataset.py in particular.

Complete the README

Description

Make sure that all information that should be in the documentation (the README) is present. Also write it following the Google developer documentation style guide.

Acceptance Criteria

When all necessary documentation is written according to the Google developer documentation style guide, such that anyone who is not necessarily an expert (but is interested) understands everything written in it.

Why?

This feature is required because the documentation is the first thing people look at, and good-looking documentation encourages people to leave a star or get engaged in the project.

Hyperparameter tuning

Description

Use a cloud computing service (AWS, Google Cloud, Microsoft Azure, ...) to find the best hyperparameters for the model.
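A minimal sketch of what such a search could look like with scikit-learn's GridSearchCV; the grid, the use of MLPClassifier, and the `TimeSeriesSplit` choice are assumptions, not the project's actual setup:

```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neural_network import MLPClassifier

# The grid below is illustrative, not the project's actual search space.
grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "alpha": [1e-4, 1e-3, 1e-2],
    "learning_rate_init": [1e-3, 1e-2],
}
search = GridSearchCV(
    MLPClassifier(max_iter=1000),
    grid,
    cv=TimeSeriesSplit(n_splits=5),  # respects time order in market data
    scoring="accuracy",
    n_jobs=-1,  # cloud instances can parallelise across cores
)
# search.fit(X_train, y_train)  # X_train, y_train: features and up/down target
# print(search.best_params_, search.best_score_)
```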

Acceptance Criteria

When a model with > 0.65 accuracy has been found.

Why?

Hyperparameter tuning is a big part of machine learning and can make or break an algorithm. Good hyperparameters mean better accuracy, and better accuracy means more money in this case.

Merge data_adjuster.py and data_collector.py

Description

Merge data_adjuster.py and data_collector.py into one script, so that not all files from Google Trends are downloaded; only the adjusted daily data is output.

Acceptance Criteria

When one script does the job of both of these scripts.

Why?

This saves space on hard drives, declutters the project, and looks cleaner.

Figures for README.md (and deployment).

Plots needed

  • A graph where the Google Trends data is shown as a heatmap, with a line plot of the stock price data over it;
    • This is to indicate the correlation between the Google Trends data and the stock price.
  • Various graphs where the adjustments made are clear and concise, perhaps an example.
    • This is to indicate why the adjustments are needed, and how they were made.

Various graphs could be added to this issue.

Some data in the CSV files overlap

Description

The last day of one daily data file and the first day of the next are the same day; the daily data files shouldn't end on the first day of a month, but on the last day of the previous month.

Acceptance Criteria

When, after export, the data doesn't overlap anymore.

Why?

It makes the files easier to manipulate as pandas DataFrames.

Feed data to neural network

Description

There has to be a method for feeding the data to the neural network, which is what this issue has to solve. It has to answer questions like "How many weeks back will the model be fed?".
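A minimal sketch of one possible answer, assuming a NumPy feature matrix with one row per trading day; `weeks_back = 4` and 5 trading days per week are placeholders, since choosing the window is exactly what this issue is about:

```python
import numpy as np

def make_windows(features: np.ndarray, target: np.ndarray, weeks_back: int = 4):
    """Stack the previous `weeks_back` weeks of daily features per sample."""
    window = weeks_back * 5  # ~5 trading days per week (assumption)
    X, y = [], []
    for i in range(window, len(features)):
        X.append(features[i - window:i].ravel())  # flatten the window
        y.append(target[i])
    return np.array(X), np.array(y)
```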

Acceptance Criteria

This issue will be considered done when a viable method of feeding data to the neural network has been made.

Why?

This is required because the data the neural network is fed also determines its accuracy.

Feature engineering

Description

Determine which features are worth keeping and which aren't.

Acceptance Criteria

When a model with features has been made that can outperform a buy-and-hold strategy on the stock market.

Why?

This is an essential part of the machine learning workflow.

Use many features with one keyword

Description

Use only one keyword ('stock market') and create many types of features from it (Bollinger Bands, EMA, MA, etc.). Use 'stock market' because, from what I can find, it is the search term that correlates best with the stock market.

Acceptance Criteria

When good features for this problem have been found, implemented, and visualised.

Why?

This will make for some great visualisations, and I might be able to find out which features are best across multiple keywords.

Text under graphs

Description

There has to be text under the graphs to explain what is being visualised.

Use K-fold cross-validation

Description

Utilise k-fold cross-validation in the machine learning model.
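A minimal sketch with scikit-learn; the synthetic data stands in for the project's features and target, which this example does not assume access to:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-ins for the project's feature matrix and up/down target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = rng.integers(0, 2, size=500)

# shuffle=False keeps the folds in chronological order, which matters for
# market data; a large gap between these fold scores and a training
# accuracy of 1.0 would confirm the overfitting described under Why?.
scores = cross_val_score(MLPClassifier(max_iter=300), X, y,
                         cv=KFold(n_splits=5, shuffle=False))
print(scores.mean(), scores.std())
```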

Acceptance Criteria

When K-fold cross-validation has been implemented.

Why?

All machine learning models that perform well so far are heavily overfitted (training accuracy = 1.0). This may help against that.

Search box compare to stock price.

Description

There has to be the ability to compare the search terms to the stock price of ^DJI.

Acceptance Criteria

When the line for the stock price of ^DJI is also in the "Explore" graph.

Why?

Comparing the search terms to the stock price is the interesting part of the project, since that is its whole point.
