
stock-market-prediction-via-google-trends's Introduction


About

The data used is downloaded from Google Trends. The concept for this project comes from research by Tobias Preis, Helen Susannah Moat, and H. Eugene Stanley, "Quantifying Trading Behavior in Financial Markets Using Google Trends". This research found that the search volume for certain (financial) words is linked to the price of the Dow Jones Industrial Average and can, in most cases, predict a dip in the market. The purpose of this project is to combine this research with machine learning.

Results

Two machine learning algorithms have been explored for this project: XGBoost and MLPClassifier. The MLPClassifier clearly performed better than XGBoost: XGBoost's best model achieved an annual return of 44.2%, whereas MLPClassifier's best model achieved 91.3% between 2008 and the present. A large contribution to these unusually high annual returns was the coronavirus: the resulting stock market crash was a major source of profit for these algorithms.

MLPClassifier

MLPClassifier performed very well on the test data. The algorithm was strong at recognising that the small day-to-day changes between crashes are impossible to predict. Thus, for the most part, it held a buy-and-hold strategy, but during a stock market crash (like the corona crash) or other, somewhat larger, moves, it performed well, as can be seen in figure 8.

Comparison of the MLPClassifier, 10,000 random simulations and a buy-and-hold strategy

Figure 8. Comparison of the mean plus and minus one standard deviation of 10,000 random simulations, the MLPClassifier algorithm, and a buy-and-hold strategy.

XGBoost

XGBoost lacked the insight that MLPClassifier had: it tried to predict the small changes, and ultimately failed at that. However, XGBoost was still able to predict the stock market crash caused by the coronavirus, which is why it still achieved such a large annual return (44.2%).

Comparison of the MLPClassifier, XGBoost, 10,000 random simulations and a buy-and-hold strategy

Figure 9. Comparison of the mean plus and minus one standard deviation of 10,000 random simulations, the MLPClassifier algorithm, the XGBoost algorithm, and a buy-and-hold strategy.

Data

Data Collection

Two datasets were needed for this project: daily Google Trends data for a specific keyword, and daily stock price data for a specific ticker. To collect the Google Trends daily data, you have to download all 6-month increments, all 5-year increments, and the 2004–present data within the 2004–2020 timespan. All this data is eventually adjusted to be relative across the whole timespan, instead of only within its respective increment. To collect the daily stock price data for the ticker you want to predict, download its historical data from a website like Yahoo Finance.

Data Visualisation

Correlation

To show that there is indeed a correlation between Google Trends data (e.g. 'debt') and stock prices (e.g. the Dow Jones Industrial Average), I plotted the DJIA stock price with indicators of peaks in the search volume for 'stock market'. As you can see, before a major stock market crash there are usually some peaks to be observed. There are also some peaks in the middle of a crash, but the peaks before the crash are quite indicative.

DJIA stock price data with peak-indicators of 'stock market'.

Figure 1. A graph of the DJIA stock price with red dots where a peak in the search volume for 'stock market' has been observed. From this graph it can be observed that erratic movement in search volume precedes a major stock crash.

Adjusted

After all adjustments, which make the daily data relative across the entire timespan instead of only within its own increment, the data looks as follows:

Adjusted daily data over entire timespan.

Figure 2. A graph in which the adjusted daily data is visualised.

Restrictions

All data on Google Trends is relative (0–100) within one timeframe, and you can only get daily data in 6-month increments, weekly data in 5-year increments, and monthly data over the entire available timespan. Aggregating all the data needed for this project was therefore quite a challenge, and because of these restrictions the result isn't completely accurate. However, the method I used was the only way to get daily data over the entire available timespan, which is crucial for this project.

Method

To make all the data relative to each other, instead of only within its 6-month increment, I had to merge the increments based on the weekly data. However, the weekly data is only available in 5-year increments, so I had to merge those 5-year increments based on the monthly data, which is available for the entire timespan needed for this project. To merge the 6-month and 5-year increments, I computed the percentage change of each data point within its respective increment. Then I took one data point per increment from the coarser periodicity and reconstructed the missing days by applying the percentage changes to that data point.
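A minimal sketch of this merging step, assuming each increment is a pandas Series of daily values and `anchors` is the coarser weekly (or monthly) series; the function name and signature are hypothetical, not taken from this project's scripts:

```python
import pandas as pd

def merge_increments(increments, anchors):
    """Rescale fine-grained increments so they are relative to each other.

    increments: list of pd.Series, each relative only within its own timespan.
    anchors:    pd.Series of coarser data spanning all increments, providing
                one known reference point per increment.
    """
    adjusted = []
    for inc in increments:
        # Percentage change within the increment preserves its shape.
        pct = inc.pct_change()
        # Start from the anchor value at the increment's first day and
        # rebuild the series forward by compounding the percentage changes.
        rebuilt = [anchors.asof(inc.index[0])]
        for change in pct.iloc[1:]:
            rebuilt.append(rebuilt[-1] * (1 + change))
        adjusted.append(pd.Series(rebuilt, index=inc.index))
    return pd.concat(adjusted)
```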

Example

An example for the search term 'debt' ('debt' is the best search term for predicting market change, according to the research mentioned earlier) in the timespan 2007–2009:

Before adjustments

Before adjustments of example.

Figure 3. A graph where the unadjusted relative daily data is visualised. The black vertical lines indicate the edges of the 6-month increments.

After adjustments

After adjustments of example.

Figure 4. A graph where the adjusted relative daily data is visualised. The graph follows the actual weekly data much better.

Weekly

Actual weekly data.

Figure 5. The actual weekly data.

Features

To get better results, the raw data had to be feature engineered. The features used include the simple moving average delta and Bollinger bands, described in the subsections below.

After computing these features, each of them is shifted by 3 through 10 days: Google Trends data only becomes available three days after the fact, and the target may correlate better with further-shifted data. This results in 272 features. The 50 features that correlate best with the target (according to the Pearson correlation coefficient) are used for training and for predicting the direction of the Dow Jones Industrial Average.
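A minimal sketch of the shifting and selection step, assuming a pandas DataFrame of engineered features and a binary up/down target; the names `shift_and_select`, `features`, and `target` are hypothetical (shifts of 3 through 10 days give the 8 lagged copies per feature described above):

```python
import pandas as pd

def shift_and_select(features: pd.DataFrame, target: pd.Series, top: int = 50):
    """Lag every feature by 3-10 days and keep the best correlators."""
    shifted = {}
    for col in features.columns:
        for lag in range(3, 11):  # Trends data lags 3 days behind reality
            shifted[f"{col}_lag{lag}"] = features[col].shift(lag)
    frame = pd.DataFrame(shifted).dropna()
    # Rank by absolute Pearson correlation with the up/down target.
    corr = frame.corrwith(target.loc[frame.index]).abs()
    return frame[corr.nlargest(top).index]
```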

Simple Moving Average Delta

SMA delta.

Figure 6. When this feature becomes more volatile, the close price follows, which makes it a good indicator for a machine learning algorithm. It can also be seen that the close price percentage change loosely follows the line of the feature.
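The README doesn't spell out how the delta is computed; a minimal sketch under one plausible reading (the day-over-day change of a simple moving average, with a hypothetical 20-day window):

```python
import pandas as pd

def sma_delta(series: pd.Series, window: int = 20) -> pd.Series:
    # Day-over-day change of the simple moving average.
    return series.rolling(window).mean().diff()
```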

Bollinger Bands

Bollinger bands.

Figure 7. When the 20-day simple moving average crosses the upper Bollinger band, the close price becomes more volatile. The stock close percentage change also loosely follows the lower Bollinger band.
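Figure 7 refers to the standard Bollinger band construction; a minimal sketch, with the 20-day window taken from the caption and the usual two standard deviations as an assumption:

```python
import pandas as pd

def bollinger_bands(close: pd.Series, window: int = 20, k: float = 2.0):
    """Standard Bollinger Bands: an SMA plus/minus k rolling standard deviations."""
    sma = close.rolling(window).mean()
    std = close.rolling(window).std()
    return sma, sma + k * std, sma - k * std
```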

Project Organisation

    ├── LICENSE
    ├── Makefile           <- Makefile with commands like `make data` or `make train`
    ├── README.md          <- The top-level README for developers using this project.
    ├── data
    │   ├── processed      <- The final, canonical data sets for modeling.
    │   └── raw            <- The original, immutable data dump.
    │
    ├── docs               <- A default Sphinx project; see sphinx-doc.org for details
    │
    ├── models             <- Trained and serialized models, model predictions, or model summaries
    │
    ├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
    │                         the creator's initials, and a short `-` delimited description, e.g.
    │                         `1.0-jqp-initial-data-exploration`.
    │
    ├── references         <- Data dictionaries, manuals, and all other explanatory materials.
    │
    ├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
    │   └── figures        <- Generated graphics and figures to be used in reporting
    │
    ├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
    │                         generated with `pip freeze > requirements.txt`
    │
    ├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
    └── src                <- Source code for use in this project.
        ├── __init__.py    <- Makes src a Python module
        │
        ├── data           <- Scripts to download or generate data
        │   └── make_dataset.py
        │
        └── features       <- Scripts to turn raw data into features for modeling
            └── build_features.py

MIT License

Copyright (c) 2020 Cristian Perez Jensen


stock-market-prediction-via-google-trends's Issues

README for `data`-folder

Description

Create a README.md for the data folder, with multiple graphs visualising the data in this folder. The plots should be made using seaborn; all plots for this project will be made with seaborn so that all READMEs are consistent with each other.
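A minimal seaborn sketch of the kind of plot intended here; the file name and columns are made up, since the actual files live under data/processed:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for a processed CSV in the data folder.
df = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=90, freq="D"),
                   "search_volume": (pd.Series(range(90)) % 30) + 40})
sns.lineplot(data=df, x="date", y="search_volume")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("search_volume.png")
```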

Acceptance Criteria

When a README has been made. The graphs should be clear, and there should be multiple plots (look in plot galleries for inspiration).

Why?

This is required because it shows people what the data looks like. They can of course also look at the deployment, but graphs in a README are more accessible.

Configurable model

Description

Make a neural network which is configurable to the user's liking.

Acceptance Criteria

When a configurable neural network model has been made. This will most likely be done using TensorFlow.

Why?

Because one size doesn't fit all in neural networks. The model has to be configurable to account for this.

Search box change

Description

The search box should only change the graph when a search term is selected.

Acceptance Criteria

When the graph only changes after hitting "enter" or clicking a search term suggestion, and not in any other way.

Why?

Because otherwise the graph will be empty half the time, and it causes lag because the graph keeps updating.

Search box suggestions

Description

The suggestions shown by the search box while typing need to (1) look nice and (2) be clickable.

Acceptance Criteria

When the above criteria are fulfilled.

Why?

This makes the search box easier to use and nicer to look at.

Combine ML features

Description

Combine features and figure out which features are the best for making predictions.

Acceptance Criteria

When the best possible machine learning algorithm, with the best possible features, has been found.

Why?

Better features mean a better algorithm. This could strengthen the machine learning model and improve its accuracy.

Feature engineer percentage changes.

Description

Instead of using absolute values, relative values would be more valuable. They would also be easier to compute when making predictions after deployment.

Why?

Because percentage changes are easier to retrieve after the fact: there is no need to normalise future data points against the existing data points (Google Trends is quite annoying in this regard). Percentage changes are always the same; the fact that Google Trends data is relative doesn't affect them.
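A quick demonstration of the last point with pandas (the numbers are toy data, not real search volumes): rescaling a series by a constant leaves its percentage changes untouched.

```python
import pandas as pd

# Toy series standing in for relative search volume.
debt = pd.Series([60.0, 66.0, 63.0, 72.0])
print(debt.pct_change())          # NaN, 0.10, -0.045..., 0.142...
print((debt * 0.5).pct_change())  # identical: the rescaling cancels out
```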

Update docstrings in `make_dataset.py`

Description

Make all docstrings in make_dataset.py comply with the Google Docstrings Style.

Why?

This is the convention that has been chosen for this project. It has already been implemented in build_features.py.

Feature based on days since last peak

Description

A feature which is essentially the number of days since the last peak. Peaks can be found using the scipy.signal library. Only major peaks should be used, not all the small ones (of which there are hundreds); for example, only the peaks preceding a stock crash. Define the characteristics of these peaks so that the algorithm can look for the same kind of peaks in the future. This could definitely help with prediction.

Also, this feature could be one-hot encoded. For example:

Days since last peak

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > 15
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
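A minimal sketch of how this feature could be built, assuming a pandas Series of daily search volume; the `find_peaks` prominence threshold is a placeholder, since defining a 'major' peak is exactly one of the acceptance criteria below:

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

def days_since_last_peak(volume: pd.Series, cap: int = 15) -> pd.DataFrame:
    """One-hot encode the number of days since the last major peak."""
    # Placeholder prominence threshold; would need tuning per keyword.
    peaks, _ = find_peaks(volume.to_numpy(), prominence=volume.std())
    peak_set = set(peaks)
    days = np.full(len(volume), cap + 1)  # cap + 1 maps to the '> 15' bucket
    last = None
    for i in range(len(volume)):
        if i in peak_set:
            last = i
        if last is not None:
            days[i] = min(i - last, cap + 1)
    columns = [str(d) for d in range(cap + 1)] + [f"> {cap}"]
    onehot = pd.DataFrame(0, index=volume.index, columns=columns)
    for i, d in enumerate(days):
        onehot.iloc[i, d] = 1
    return onehot
```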

Acceptance Criteria

Three criteria:

  • Whether or not this feature is possible;
  • Definition of the characteristics of the peaks preceding a stock crash (may have to be done per keyword);
  • Best way of presenting this feature to the machine learning model.

Why?

Any feature that could improve accuracy is a feature that should be explored.

new complementary tool

My name is Luis, I'm a big-data machine-learning developer, I'm a fan of your work, and I usually check your updates.

I was afraid that my savings would be eaten by inflation, so I have created a powerful tool based on past technical patterns (volatility, moving averages, statistics, trends, candlesticks, support and resistance, stock index indicators): all the ones you know (RSI, MACD, STOCH, Bollinger Bands, SMA, DEMARK, Japanese candlesticks, Ichimoku, Fibonacci, WilliamsR, balance of power, Murrey Math, etc.) and more than 200 others.

The tool creates prediction models of correct trading points (buy and sell signals; every stock is traded well in time and direction). For this I have used big-data tools like pandas, stock market libraries like tablib, TAcharts and pandas_ta for data collection and calculation, and powerful machine-learning libraries such as sklearn.RandomForest, sklearn.GradientBoosting, XGBoost, Google TensorFlow and Google TensorFlow LSTM.

With models trained on a selection of the best technical indicators, the tool is able to predict trading points (where to buy, where to sell) and send real-time alerts to Telegram or mail. The points are calculated based on learning the correct trading points of the last two years (including the change to a bear market after the rate hike).

I think it could be useful to you. I would like to share it with you so it can improve; if you are interested in improving it and collaborating, I am willing, and if not, just file it away.

README for the deployment folder

The deployment needs to be shown via GIFs. No one will open the index.html in a live server, so this is the easiest and best way to show it.

Pull stock data

Description

The stock price data has to be pulled so it can be fed to the neural network. However, weekends aren't included in the stock price data, so it has to be manipulated to fit. Either the weekends have to be skipped, marked as NaN, or given the same data as the Friday before (or the last working day before the gap). Research has to be done on this subject.
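A minimal sketch of the forward-fill option with pandas (function and variable names hypothetical); the other option marks the missing days as NaN by dropping the `ffill` call:

```python
import pandas as pd

def fill_non_trading_days(prices: pd.DataFrame) -> pd.DataFrame:
    """Reindex onto a full calendar, carrying the last close through
    weekends and holidays (the forward-fill option described above)."""
    full = pd.date_range(prices.index.min(), prices.index.max(), freq="D")
    return prices.reindex(full).ffill()
```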

Acceptance Criteria

When a viable solution to the weekend/holiday problem has been found and incorporated.

Why?

The neural network has to be given an outcome (a target) for all of the Google Trends data.

Adjust Google Trends data

Description

The data pulled from Google Trends has to be adjusted after it has been pulled. The adjustments have to be made according to the Method section of the README.md.

Acceptance Criteria

This issue will be considered done when, after the data from Google Trends has been pulled, it is automatically adjusted and exported to a .csv file, as described in Method.

Why?

This feature is required because, in order to feed data to the neural network, the structure of the data has to stay consistent.

How can I set the location to Global?

Hello, thank you for this repository. I've been doing some research and all of your methodology was very helpful. But I have a question: how can I set the API to get daily data without the geo parameter? Every time I try, I get an error. Thanks for your time!

Pull data from Google Trends

Description

Pulling the data from Google Trends has to be quick and automatic. Google Trends doesn't have an official API, so it will have to be done from scratch.

Acceptance Criteria

This issue will be considered closed when a script, effectively an unofficial API, is able to pull data from Google Trends.

Why?

This feature is necessary because, if users want to use another search term for their instance, they would otherwise have to spend hours collecting all the data; with this feature it only takes seconds or minutes. It would also provide a foundation for collecting new data from Google Trends while actively using this program.

Structure

Description

Structure the project according to the cookiecutter data science template.

Why?

Because it makes the project more structured and helps people navigate through it.

Update the main README

Description

The main README is quite outdated; a lot more progress has been made since its creation, so an updated README should be made. Things to cover:

  • New plots (made with Seaborn);
  • Look over the existing text and determine whether it is still usable;
  • More text, containing information on:
    • the machine learning model;
    • the deployment of the webpage;
    • the feature engineering and its various methods.

Acceptance Criteria

When a good-looking README has been made that is up to par with the Google documentation style.

Why?

This feature is required to get more people interested in the project. It helps others understand the project and the decisions made.

Initial letter doesn't work on Google Chrome

Description

On Google Chrome, initial-letter is not supported (it is in Safari). Thus, a way has to be found to make the initial letter also work on Google Chrome. This could be the fix.

Acceptance Criteria

When the drop cap works on all web browsers.

Why?

Because accessibility is important, and accessibility means supporting all browsers.

Opacity problem

Description

The opacity doesn't go to 1 when the page is being scrolled quickly.

Update README.md with Data Collection information

Description

Update the README.md with information on how the data_collector.py script works.

Acceptance Criteria

When a technically written paragraph about the data_collector.py script has been added to the README.md.

Why?

To improve the documentation.

Change all " to '

Description

Consistency, consistency, consistency...

A lot of " in make_dataset.py in particular.

Complete the README

Description

Make sure that all information that should be in the documentation (the README) is present. Also write it following the Google developer documentation style guide.

Acceptance Criteria

When all necessary documentation is written according to the Google developer documentation style guide, such that anyone who is not necessarily an expert (but is interested) understands everything written in it.

Why?

This feature is required because the documentation is the first thing people look at, and good-looking documentation encourages people to leave a star or get engaged in the project.

Hyperparameter tuning

Description

Use a cloud computing service (AWS, Google Cloud, Microsoft Azure, ...) to find the best hyperparameters for the model.
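A minimal sketch of what such a search could look like with scikit-learn's GridSearchCV; the grid, the use of MLPClassifier, and the `TimeSeriesSplit` choice are assumptions, not the project's actual setup:

```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neural_network import MLPClassifier

# The grid below is illustrative, not the project's actual search space.
grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "alpha": [1e-4, 1e-3, 1e-2],
    "learning_rate_init": [1e-3, 1e-2],
}
search = GridSearchCV(
    MLPClassifier(max_iter=1000),
    grid,
    cv=TimeSeriesSplit(n_splits=5),  # respects time order in market data
    scoring="accuracy",
    n_jobs=-1,  # cloud instances can parallelise across cores
)
# search.fit(X_train, y_train)  # X_train, y_train: features and up/down target
# print(search.best_params_, search.best_score_)
```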

Acceptance Criteria

When a model with > 0.65 accuracy has been found.

Why?

Hyperparameter tuning is a big part of machine learning and can make or break an algorithm. Good hyperparameters mean better accuracy, and better accuracy means more money in this case.

Merge data_adjuster.py and data_collector.py

Description

Merge data_adjuster.py and data_collector.py into one script, so that not all files from Google Trends are downloaded; only the adjusted daily data is output.

Acceptance Criteria

When one script does the job of both of these scripts.

Why?

This saves space on hard drives, declutters the project, and looks cleaner.

Figures for README.md (and deployment).

Plots needed

  • A graph where the Google Trends data is shown as a heatmap, with a line plot of the stock price data over it;
    • This is to indicate the correlation between the Google Trends data and the stock price.
  • Various graphs where the adjustments made are clear and concise, perhaps an example.
    • This is to indicate why the adjustments are needed, and how they were made.

Various graphs could be added to this issue.

Some data in the CSV files overlap

Description

The last day of one daily data file and the first day of the next are the same day; the daily data files shouldn't end on the first day of a month, but on the last day of the previous month.

Acceptance Criteria

When, after export, the data doesn't overlap anymore.

Why?

It makes the files easier to manipulate as pandas DataFrames.

Feed data to neural network

Description

There has to be a method for feeding the data to the neural network, which is what this issue has to solve. It has to answer questions like "How many weeks back will the model be fed?".
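A minimal sketch of one possible answer, assuming a NumPy feature matrix with one row per trading day; `weeks_back = 4` and 5 trading days per week are placeholders, since choosing the window is exactly what this issue is about:

```python
import numpy as np

def make_windows(features: np.ndarray, target: np.ndarray, weeks_back: int = 4):
    """Stack the previous `weeks_back` weeks of daily features per sample."""
    window = weeks_back * 5  # ~5 trading days per week (assumption)
    X, y = [], []
    for i in range(window, len(features)):
        X.append(features[i - window:i].ravel())  # flatten the window
        y.append(target[i])
    return np.array(X), np.array(y)
```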

Acceptance Criteria

This issue will be considered done when a viable method of feeding data to the neural network has been made.

Why?

This is required because the data the neural network is fed also determines its accuracy.

Feature engineering

Description

Determine which features are worth keeping and which aren't.

Acceptance Criteria

When a model with features has been made that can outperform a buy-and-hold strategy on the stock market.

Why?

This is an essential part of the machine learning workflow.

Use many features with one keyword

Description

Use only one keyword ('stock market') and create many types of features from it (Bollinger Bands, EMA, MA, etc.). Use 'stock market' because, from what I can find, it is the search term that correlates best with the stock market.

Acceptance Criteria

When good features for this problem have been found, implemented, and visualised.

Why?

This will make for some great visualisations, and I might be able to find out which features are best across multiple keywords.

Text under graphs

Description

There has to be text under the graphs to explain what is being visualised.

Use K-fold cross-validation

Description

Utilise k-fold cross-validation in the machine learning model.
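A minimal sketch with scikit-learn; the synthetic data stands in for the project's features and target, which this example does not assume access to:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-ins for the project's feature matrix and up/down target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = rng.integers(0, 2, size=500)

# shuffle=False keeps the folds in chronological order, which matters for
# market data; a large gap between these fold scores and a training
# accuracy of 1.0 would confirm the overfitting described under Why?.
scores = cross_val_score(MLPClassifier(max_iter=300), X, y,
                         cv=KFold(n_splits=5, shuffle=False))
print(scores.mean(), scores.std())
```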

Acceptance Criteria

When K-fold cross-validation has been implemented.

Why?

All machine learning models that perform well so far are heavily overfitted (training accuracy = 1.0). This may help against that.

Search box compare to stock price.

Description

There has to be the ability to compare the search terms to the stock price of ^DJI.

Acceptance Criteria

When the line for the stock price of ^DJI is also in the "Explore" graph.

Why?

Comparing the search terms to the stock price is the interesting part of the project, since that is its whole point.
