
thesis's Introduction


thesis

Overview

This repository contains all the resources for my thesis on option trade classification at Karlsruhe Institute of Technology.

notes 📜: see the references folder; download Obsidian from obsidian.md to easily browse the notes.
schedule ⌚: link to tasks and milestones.
experiments 🧪: link to Weights & Biases (requires login).
computing resources ☄️: link to GCP (requires login) and to bwHPC (requires login).
document 🎓: see releases.

Development

Set up git pre-commit hooks 🐙

Pre-commit hooks run checks before each commit to avoid committing error-prone code. The checks are defined in .pre-commit-config.yaml. Install and run them using:

pip install .[dev]
pre-commit install
pre-commit run --all-files

Run tests 🧯

Tests can be run using tox. Just type:

tox

Acknowledgement

The authors acknowledge support by the state of Baden-Württemberg through bwHPC.

Our implementation is based on:

Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2021). Revisiting Deep Learning Models for Tabular Data. Advances in Neural Information Processing Systems, 34, 18932–18943.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: Unbiased boosting with categorical features. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 32, 6639–6649.
Rubachev, I., Alekberov, A., Gorishniy, Y., & Babenko, A. (2022). Revisiting pretraining objectives for tabular deep learning (arXiv:2207.03208). arXiv. http://arxiv.org/abs/2207.03208

thesis's People

Contributors

dependabot[bot], github-actions[bot], karelze, pre-commit-ci[bot]


thesis's Issues

Set up baseline 🧸

  • Calculate some stats from @CaroGrau paper
  • Implement classical rules
  • Convert classical rules to sklearn classifier
  • Document code
  • Clean up requirements / fix version
  • Set up a hyperparameter search with optuna as in this notebook.
  • Set up a simple model, e.g., logistic regression, lightgbm, or catboost
  • Fill missing with -1
  • Plot learning curves as done here.
  • Set up simple robustness checks as in this notebook.
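For converting classical rules into an sklearn-style classifier, a minimal sketch of the tick rule with a fit/predict interface might look as follows; the class name and the buy default on the first trade are assumptions:

```python
# Hedged sketch: the tick rule written with the sklearn fit/predict
# interface in mind, so it can later be wrapped as a proper
# BaseEstimator/ClassifierMixin. Class name and the buy default on the
# first trade are assumptions.
class TickRuleClassifier:
    """Sign a trade +1 (buy) if its price rose versus the previous
    trade, -1 (sell) if it fell; carry the last signal on zero ticks."""

    def fit(self, X=None, y=None):
        # The tick rule needs no training; fit exists only for API parity.
        return self

    def predict(self, prices):
        labels, last, prev = [], 1, None
        for p in prices:
            if prev is not None and p != prev:
                last = 1 if p > prev else -1
            labels.append(last)
            prev = p
        return labels

clf = TickRuleClassifier().fit()
print(clf.predict([10.0, 10.5, 10.5, 10.2]))  # -> [1, 1, 1, -1]
```

Wrapping it this way keeps the rule drop-in compatible with sklearn tooling such as pipelines and metrics.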

Data preprocessing and feature engineering 🚜

  • Write proposal for feature sets based on EDA / literature research
  • Write script for feature engineering
  • Create more features from quote and price data. See, e.g., Rosenthal or Prado
  • Combine relative bid and ask into one measure. Look at the distribution first. Look at the information density feature found in the Rosenthal paper.
  • Perform adversarial validation on newly created feature sets, e.g., with min-max scaling, with/without log-transform, etc., and get feature importances
  • Create features that are hard to learn for neural nets and gradient boosting machines
  • When loading data, verify wandb hashes
  • Add economic intuition to each feature. Which paper suggests the feature, and why does it make sense? (Feedback from @CaroGrau.) Research why a log-transform on prices makes sense from a theoretical perspective.
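One hedged way to fold the relative bid and ask into a single measure is a signed proximity-to-quotes feature; the function name and the NaN convention for crossed quotes are assumptions:

```python
# Hedged sketch of combining relative bid and ask into one measure: a
# signed proximity to quotes that is 0 at the midpoint, +1 at the ask,
# and -1 at the bid; values outside [-1, 1] flag trades outside the
# spread. Function name and the NaN convention are assumptions.
def proximity_to_quotes(price, bid, ask):
    mid = 0.5 * (bid + ask)
    half_spread = 0.5 * (ask - bid)
    if half_spread <= 0:  # crossed or locked quotes: undefined
        return float("nan")
    return (price - mid) / half_spread

print(proximity_to_quotes(10.0, 9.0, 11.0))  # midpoint trade -> 0.0
print(proximity_to_quotes(11.0, 9.0, 11.0))  # trade at the ask -> 1.0
```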

Set-up writing environment

  • Set-up obsidian vault
  • Renew Grammarly subscription
  • Migrate old notes on transformer, GBMs etc. to obsidian
  • Set-up GitHub action to built document
  • Write checklist / script to test for common mistakes like long sentences, nested sentences, repeat main learning... (#4)

Model training and optimization 🎯

  • Decide whether I want to treat option trade classification as a probabilistic classification problem or not
  • Bundle training in a Docker container that can run in any pod or on SCC infrastructure
  • Add parametrized script to run studies
  • Get access to BwUniCluster2.0
  • Write script to start studies
  • Free memory with gc.collect()
  • Set up pre-commit hooks, e.g., mypy
  • Implement gradient boosting using catboost
  • Implement TabTransformer using TabSurvey
  • Try out TabTransformer closer to the original paper without einops, using default PyTorch implementations for attention (see, e.g., here).
  • Write custom DataSet for PyTorch
  • Improve training performance of TabTransformer
  • Look into data pipes / data loader 2 for PyTorch
  • Implement save callback for PyTorch models
  • Use multiple GPUs in PyTorch (DataParallel). See here.
  • Implement FTTransformer
  • Add timing code
  • Implement TabNet using TabSurvey
  • Set up weights and biases integration as shown here or here
  • Set up test cases for gbms
  • Set up test cases for classical classifier
  • Set up test cases for TabTransformer.
  • Set up test cases for FTTransformer. See here
  • Set up test cases for TabNet
  • Simplify objective code through train() and test() method as done here.
  • Add weighted loss for neural networks
  • Adjust early stopping of neural networks to work with accuracy
  • Migrate heroku db
  • Save completed study objects to gcs and track in wandb
  • Track saved models in wandb
  • Visualize learning curves for best CatBoost model
  • Research, if early stopping in neural nets should also be done based on the accuracy
  • Add code to obtain feature importances from trained models
  • Add shared embeddings to TabTransformer. See paper or here.
  • Address the problem of how to handle the high dimensionality of categorical variables
  • Research if there is a broader theory / concept behind decay, e.g., exponential smoothing, weighted regression, etc.
  • Visualize decay parameter and find optimal decay factor
  • Study samples / probabilities where the prediction is wrong
  • Fully understand what the target value should be for gradient boosting: -1/1, or 0/1 as in neural nets?
  • Add code to study learning curves, e.g., in wandb
  • Add code for pre-training
  • Experiment with learning rate scheduling
  • Add code for attention visualization. First viz done in #85, but still have to research the best approach to combine maps over different attention heads and layers. Finally decided for a method proposed by Chefer et al. See here and here.
  • Get slurm running
  • Go through the Google playbook
  • Define rounds for what I want to optimize
  • Do searches more systematically. See this article here, here, and here.
  • Extend experiment tracking as shown here.
  • Normalize data
  • Set up configuration for training
  • Set up a concrete action plan for how to improve training
  • PyTorch 2.0 integration
  • Differentiate into exploration and exploitation phase
  • Think about folding the validation set into the training set and retrain the best configuration
  • Check out retraining
  • Change early stopping criterion
  • Set up option to fix some hyperparams through a config or so
  • Verify hyperparameter search space
  • Use batch size finder (Implemented in #125)
  • Add some option to generate results fast
  • Implement a retraining
  • Replace early stopping with checkpointing?
  • Check if logits is the right word in code
  • What samples does the model get wrong?
  • Add training curves to wandb
  • Check in wandb if the hyperparameter search space is chosen optimally
  • Rerun studies with different initializations. See how it affects the results
  • Add code to average results from different initializations
  • Add code for visualizations, e.g., hyperparameter search space, influence of randomness, etc.
  • Do code review with @pheusel or @lxndrblz
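For averaging results from different initializations, a minimal sketch could aggregate accuracies over seeds and report a mean with a spread; the seed values and accuracies below are placeholders:

```python
from statistics import mean, stdev

# Placeholder accuracies from reruns of the same configuration with
# different random seeds; real values would come from the study logs.
acc_by_seed = {0: 0.741, 42: 0.738, 1337: 0.744}

accs = list(acc_by_seed.values())
summary = f"{mean(accs):.3f} +/- {stdev(accs):.3f}"
print(summary)  # -> 0.741 +/- 0.003
```

Reporting mean and standard deviation over seeds makes the effect of random initialization visible in the result tables.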

Run studies for `SelfTrainingClassifier`🅾️

  • Add support to insert unlabelled trades chronologically
  • Regenerate unlabelled feature sets
  • Verify predicted probabilities are high enough
  • Run on feature set classical
  • Run on feature set classical-size
  • Run on feature set ml
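A minimal sketch of sklearn's `SelfTrainingClassifier`, where unlabelled trades are marked with -1 in the target; the toy data and the 0.6 threshold are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy feature matrix; -1 in y marks unlabelled trades that the
# self-training loop may pseudo-label when it is confident enough.
X = np.array([[0.0], [0.1], [0.9], [1.0], [0.05], [0.95]])
y = np.array([0, 0, 1, 1, -1, -1])

clf = SelfTrainingClassifier(LogisticRegression(), threshold=0.6)
clf.fit(X, y)
pred = clf.predict(np.array([[0.02], [0.98]]))
print(pred)  # -> [0 1]
```

The `threshold` controls how confident the base learner must be before an unlabelled trade is pseudo-labelled, which matters for the "verify predicted probabilities are high enough" item above.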

Print-ready tables🖨️

  • Typeset my auto-generated tables
  • Title case index and columns
  • Add proper captions
  • Indent first column as done here. → not possible with pandas. Do manually.
  • Show best in class in bold
  • Fix warning in universal result notebook
  • Decide which columns to include

Add review comments from Christian

  • Page 2 (Related Work (3 p)): Introductory text?

  • Page 2 (Related Work (3 p)): Why not make this Section 2.1?

  • Page 2 (Related Work (3 p)): Space

  • Page 2 (Related Work (3 p)): Space

  • Page 3 (Related Work (3 p)): Section 2.2?

  • Page 4 (Related Work (3 p)): very nice chapter

  • Page 6 (Quote Rule): .

  • Page 6 (Quote Rule): is the comma superfluous?

  • Page 6 (Tick Test): The footnote still runs onto the next page. Word is a blessing :)

  • Page 9 (Trade Size Rule): is that how it is spelled?

  • Page 12 (Ellis-Michaely-O'Hara Rule): "cp."?

  • Page 13 (Stacked Rule): I would not cite this if I were you, since everything not attributed otherwise is your own work; but of course you can do it this way

  • Page 14 (Stacked Rule): "cp."?

  • Page 19 (Architectural Overview): "over"

  • Page 24 (Residual Connections and Layer Normalization): what kind of character is that? :D

  • Page 45 (Glossary):

    Why are there always numbers after the explanations? Are these the pages where the term is used?
    If so, I find that unnecessary :D If you want to keep them, you could perhaps prefix them with "pp." or so

Update chapter on dataset/results📑

  • Find out timespans
  • Use different reasoning why professional customers are excluded
  • Extend the text to note that the unlabelled dataset also includes non-customer trades
  • Fill in gaps with regard to timespan
  • Fill in gaps with regard to trade initiator
  • Update visualizations
  • Update text on CBOE
  • Improve typesetting of graphics
  • Improve typesetting of images / tables

Allow unclassified in ClassicalClassifier🏦

As discussed with @CaroGrau, evaluate the percentage of trades that cannot be classified.

  • Add option to classical classifier to disable random assignment of unclassified trades and assign 0 instead
  • Create table in results notebook similar to Table 3 in Grauer et al.
  • Run results notebook with unclassified trades
  • Create new tests
  • Send results to @CaroGrau
  • Update docs and pass precommit hooks

Set up dataset 🔢

  • Request data from @CaroGrau
  • Load csv data into pandas data frame
  • Infer correct datatypes
  • Create sub-samples, e.g., 2015
  • Create train-test split. Consider leakage.
  • Clean up requirements / fix version
  • Create tests / assertions against @CaroGrau paper
  • Run adversarial validation to check for differences in the training and test set
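The leakage-aware split above can be sketched as a chronological cut rather than a random one; the column names, dates, and cutoff are assumptions:

```python
import pandas as pd

# Hedged sketch of a leakage-aware train-test split: order trades by
# timestamp and cut at a date instead of splitting at random, so no
# future information leaks into the training set. Column names and the
# cutoff date are assumptions.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2013-05-02", "2014-06-01", "2015-03-01", "2015-11-05"]
        ),
        "buy_sell": [1, -1, 1, -1],
    }
).sort_values("timestamp")

cutoff = pd.Timestamp("2015-01-01")
train = df[df["timestamp"] < cutoff]
test = df[df["timestamp"] >= cutoff]
print(len(train), len(test))  # -> 2 2
```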

Chapter on Data Preprocessing🏭

  • Collect notes on preprocessing applied by Grauer et al. This is now part of section on data construction.
  • Distinguish how data preprocessing differs from a paragraph on sample construction in the dataset chapter
  • Align chapter with the chapter on feature engineering. Feature engineering is tracked in #168. Will become part of the data preprocessing chapter.
  • Write a draft

Section on Feature Engineering🪄

  • Shorten prewritten text / remove redundant parts
  • Motivate feature sets
  • Do more research on what my models can learn e.g., linear models and why it makes sense to perform feature engineering
  • Discuss how missing values are treated
  • Discuss why we put so much emphasis on the feature set
  • Better motivate why feature engineering is required in the first place

Write expose

  • Finish research on SOTA methods for tabular data
  • Finish draft for structure
  • Write first draft of expose
  • Finalize expose
  • Send to @CaroGrau

Load and preprocess unlabelled ISE data🪄

  • Calculate feature day_vol
  • Remove labelled duplicates
  • Verify option characteristics etc. are the same for trades in the labelled dataset and unlabelled dataset
  • Calculate time-series correlation between the volume of professional customers and market makers → Currently not possible, as I don't have the ISE trade profile. Asked @CaroGrau via mail whether I should calculate the time-series correlation.
  • Save a joint dataset of unlabelled and labelled trades → I save unlabelled trades only.

Improve feature engineering notebook🤏

  • Change naming / location of result files, e.g., train set in a properly named subfolder in the feature engineering notebook
  • Switch between log_standardized and unscaled mode
  • Remove highly infrequent classes → keep as is
  • Fix calculation of proximity to quotes and midspread
  • Regenerate datasets
  • Remove outdated code / features

Pre-writing in Zettelkasten 🗄️

  • Proper notation for recursion in tick rule / reverse tick rule. See: https://math.stackexchange.com/questions/4640910/notation-when-accessing-sequence-elements-in-recusive-formula
  • Separate chapter for hybrid approaches through stacking? See questions.md
  • Separate chapter to discuss the trade initiator? See questions.md
  • Research if there are any other works that use majority voting or stacking
  • Check specialist / market maker at CBOE.
  • "Culminating in": indicate that Grauer et al. didn't invent this concept, but applied it most extensively
  • Add Murjajev to the concept of stacking (quote rule (ex) -> quote rule (nbbo))
  • Motivate depth rule with Glosten 1994 (found in Hagströmer) -> Why depth is informative
  • Uppercase $p$, $m$ etc.
  • Denote sells by $-1$ instead of $0$.
  • Add two ideas of LR algorithm as "intuition"
  • Better motivate the quote rule. See, e.g., Roll (1984) (found in Peterson and Fialkowski)
  • One sentence about the trade initiator -> technical limitation (see Peterson and Sirri, p. 263)

Fix midpoint / spread in `ClassicalClassifier` , feature engineering, and `effective spread`🐞

  • Update results_universal.ipynb. During the implementation of the effective spread, it became apparent that the results depend highly on a correct estimate of the midspread. Grauer et al. also incorporate this criterion. The midspread can be wrong if bid > ask, as it is calculated by (ask + bid) * 0.5. Implement a different calculation. → Done in #234 already.
  • Update ClassicalClassifier
  • Update feature engineering notebook → Do in #231
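The corrected midspread described above can be sketched as follows; the NaN convention for crossed quotes is an assumption about the fix:

```python
import math

# Hedged sketch of a corrected midspread: (ask + bid) / 2 is only
# meaningful when quotes are not crossed, so bid > ask is flagged as
# invalid (NaN) instead of silently yielding a midpoint.
def midspread(bid, ask):
    if bid > ask:  # crossed quotes: the midpoint is unreliable
        return float("nan")
    return 0.5 * (bid + ask)

print(midspread(9.0, 11.0))              # -> 10.0
print(math.isnan(midspread(11.0, 9.0)))  # -> True
```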

Evaluation 💯

Chapter on Self-Training⭕

  • Chapter on semi-supervised learning
  • Introduce notation for semi-supervised learning
  • Introduce self-training

Remove from feature set mode `none` the zero imputation 🐞

I noticed that, in none mode in the feature engineering notebook, we performed zero-imputation. This might lead to wrong results when classical rules are applied afterwards.

  • Fix imputation in feature engineering notebook
  • Reran result generation

Exploratory data analysis 🌋

  • Get an understanding of how options or trades are distributed
  • Check out these general tips for EDA
  • Recreate summary statistics from table 1
  • Check for correlated features (see, e.g., here and here)
  • Check for multi-collinearity
  • Check for necessary transformations, e.g., log transform
  • Perform EDA with, e.g., AutoViz, umap, or pandas-profiling
  • Is there a pattern for the missing values?
  • Are there disguised missing values? E.g., an ask of zero. Will discuss in #106.
  • Summarize the most important findings for the plots
  • Integrate findings into feature proposal from #30
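The check for disguised missing values can be sketched as a simple count of suspicious quote values; column names and the toy data are placeholders:

```python
import pandas as pd

# Hedged EDA check for disguised missing values, e.g., an ask of zero
# that really means "no quote". Column names and values are placeholders.
df = pd.DataFrame({"bid": [9.0, 8.0, 0.0], "ask": [10.0, 0.0, 9.5]})

disguised = (df[["bid", "ask"]] <= 0).sum()
print(disguised)  # per-column counts of suspicious (<= 0) quote values
```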

Check writing style / format / content 👔

Check for the following mistakes:

  • Cref, units, spaces between units (\SI{10}{\percent}), plot ranges, suppress titles, [H]. See this guide
  • Is everything rounded to the right precision?
  • Are page numbers correct?
  • Make sure the ToC is not too fragmented / chapters have an equal length (feedback from @CaroGrau)
  • Does the text follow the Gopen rules?
  • Does the text contain inconsistent capitalization?
  • Does the text contain overly long sentences?
  • Does the text contain deeply nested sentences?
  • Check for tense
  • Check for inconsistent decimal point
  • Check for inappropriate adjectives
  • Check for correct citations
  • Is the notation in formulas consistent?
  • Is the notation in formulas complete?
  • Is the economic intuition always clear?
  • Check if document is printable in copyshop
  • Check your spelling and grammar e. g., with Grammarly
  • Check academic grammar with this tool
  • Check grammar with MS office
  • Check for active voice
  • Within your text, two headlines must not be next to each other, instead, add a separating sentence that introduces the topic.
  • paper should have an outline.
  • paper should have an abstract. (The abstract should be placed before the table of contents.)
  • Formulae are followed by a punctuation, e.g. 2+2 = 4.
  • Check the number of decimal places. A number such as 1.23456 might be correct, but given possible
    perturbations and errors of the original data, it is common to restrict oneself to roughly one to three decimal places
    (e.g. 1.23). This is done by rounding correctly. This does not affect your original R output; copy this “as it is”.
  • Explain all the variables (especially in formulae) you use.
  • Variables must be in italic, such as 𝑥 instead of x.
  • Do not put a colon “:” in the line before formulae.
  • Mathematical explanations are not clear/comprehensible.
  • Longer equations should be placed in a separate line – either aligned to the left or centered.
  • The usage of variable names is not coherent. This issue results e.g. from using some variable 𝑒 as an error term and then as a time series.
  • Replace “×” in formulae by “∙” as this is more common. See .\check_formalia.py.
  • Check if some of your figures have a low resolution or appear pixelated.
  • Check your captions beneath figures. Correct examples are: Fig. 1. Some text. Figure 2. Some other text.
    Pay attention that the text starts with a capital letter and the sentence is accompanied by a punctuation.
  • Check if figures/tables are not referenced in the text.
  • When referencing figures and tables in the text, it is more common to use a capital letter such as “Equation (3)”, “Figure 1” and “Table 2”. See .\check_formalia.py.
  • The first line of your table (the one with the column names) should always be printed in bold.
  • Do some non-trivial definitions/explanations need a reference?
  • When citing websites, add the date when accessed.
  • When adding books in your bibliography, add publisher and location of publishing.
  • When citing from textbooks, please also include the page numbers in the reference.
  • Are the references in the text in a correct format?
  • Kindly fix the capitalization in your bibliography in order to make it consistent.
  • Check if your introduction lacks a good, coherent motivation.
  • Add a summarizing sentence in your conclusion that tells which model, after all, you recommend.
  • Add examples where noted.
  • Check if you need to add further evaluations?
  • Check if your explanations are not clear/comprehensible.
  • Check if the usage of capitalization in your headlines is inconsistent. Either use ALWAYS capital initial letters such as in “Table of Contents” or ALWAYS an initial capital letter followed by small ones such as “Table of contents”.
  • Before brackets, there must be a space. Not, “ordinary least squares(OLS)”, but “ordinary least squares (OLS)”.
  • Check with numbers or sentence punctuations (point, comma, etc.) for incorrect spacing.
    In most cases, a footnote at the end of a sentence follows the punctuation as this example shows.
  • Check if spacing after paragraphs is consistent
  • Check if language is biased
  • Check if paragraphs are created consistently
  • Check for consistency between paragraphs
  • Check if hyphenation is correct in babel
  • Check if graphics are consistent to the "Ten Simple Rules for Better Figures"
  • Check citation counts with zotero addon
  • Check validity of doi with zotero plugin / fetch valid dois
  • Check all links in thesis, if they can be resolved via script
  • Check if words appear with and w/o hyphen
  • Check if all abbreviations in the text are wrapped in \gls{}
  • Check if there are unnecessary abbreviations. Done automatically with #154.
  • Make sure terms from supervised ML are used consistently. See, e.g., here.
  • Set up consistent plotting early on, e.g., style. E.g., see SciencePlots.
  • Run LaTeX cleaner from here.
  • Make pdf archivable see here. or here.
  • Check this guide for visualizations and use HSL color space.
  • Polish up the most important plots and findings for use in paper
  • Do final check with writeful
  • Run biber for incomplete entries. See comment of @lxndrblz in #173.
  • Check if I included some American words 🇺‍🇸 https://codewordsolver.com/american-british-english-translator/ or auto-convert with python https://github.com/orsinium-labs/eng/tree/master/eng
  • Run biber --validate_datamodel thesis.bcf
  • Check if links refer to the correct labels. Are there duplicate labels?

Add ability to resume studies⏯️

Long studies take much longer than the 4-hour compute allotment, so I need the ability to stop and resume studies.
This became apparent when implementing #209.

  • Use database, as done here.
  • Update name of files
  • Update defaults for parameters

Extend result generation🏁

  • Add statistical tests to effective spread
  • Add "All" column
  • Use a more robust approach to calculate the location relative to the quote
  • Add "cline" / group by basic rules, hybrid rules, and ML → Do manually
  • Line wrapping in columns
  • Remove manual editing, e.g., remove "issue_type"
  • Load multiple result sets, e.g., gradient boosting, Transformer, etc.
  • Made all notebooks flexible enough to handle different datasets and subsets.
  • Store all datasets with a consistent identifier in wandb
  • Delete outdated datasets from Google Cloud

Edit in annotations from review 2👩‍💻

Detailed comments

  • Page 2 (Related Work (3 p)): "the tick rule, quote rule" -- I know the introduction is still to come, but will you briefly say something about the individual rules there? I know you explain them all in more detail in the theory part, but if one hears of them here for the first time, one is thrown in at the deep end (imo the introduction should therefore at least explain in a few short sentences what rule-based approaches are / how they work).

  • Page 2 (Related Work (3 p)):

    Consistently for options traded at the International Securities Exchange (ISE), and CBOE classical rules like the popular Lee and Ready algorithm only achieve accuracies of 62.53 % or 62.03 % and are thus significantly smaller than in the stock market.

    I don't quite understand this sentence. Could it be that the comma before the "and" has to go, and a comma belongs after "CBOE"?

  • Page 3 (Related Work (3 p)):

    Here comes a somewhat general paragraph that compares the rule-based approaches with the ML approaches and explains what your work tries to achieve. Something like this could also be inserted, at least briefly, at the beginning of the related work part (before the rule-based section), so that one gets a rough overview before going into the details.

    Of course, this again depends on what will be in your introduction; that will be very interesting.

    I could also add here that, imo, it is important that the introduction is aligned with this content (essentially prepares the reader for the related work). In that context, it can be explained roughly what the individual approaches are, how they compare to / relate to each other, and how they have been used in recent years (not viewed historically, but briefly outlining the whole as an introduction, as motivation for why / where this thesis comes in).

  • Page 4 (Related Work (3 p)): What about Transformers in this area? Have they not been used at all yet? If so, I would put a short disclaimer somewhere here, because you intend to use them as well. If they are not mentioned anywhere here but Transformers are used later, that could confuse the reader as to the state of the literature.

  • Page 5 (Rule-Based Approaches): "next section." -- Does this mean Section 3.1? Or Section 4? I would personally write out the concrete section here (but that is usually a matter of preference).

  • Page 7 (Tick Test): "in" -- I would write "by" instead of "in" here.

  • Page 7 (Tick Test): "," -- I would omit the comma here.

  • Page 7 (Tick Test): "donated" -- I guess you mean "denoted" here.

  • Page 7 (Tick Test): I would put a comma here.

  • Page 8 (Depth Rule): "the" -- Is the "the" needed here?

  • Page 8 (Depth Rule): A comma could go here again. In English it is usually optional whether a comma goes in such places or not. But then make sure it stays consistent throughout your thesis.

  • Page 9 (Trade Size Rule): "the" -- See above regarding "the".

  • Page 9 (Trade Size Rule): "the" -- Here perhaps "their" instead of "the" (again, I think, a matter of preference).

  • Page 11 (Ellis-Michaely-O'Hara Rule):

    By analysing miss-classified trades with regard to the proximity of the trade to the quotes, they observe, that the quote rule and by extension of the LR algorithm performs particularly well at classifying trades executed at the bid and the ask price but trail the performance of the tick rule for trades inside or outside the spread (Ellis et al., 2000:pp. 535–536).

    The sentence is somewhat long & hard to read (especially the part "and by extention of the LR algorithm..."; here I am not quite sure what exactly is meant. Do you mean both the quote rule and (by extension) also the LR algorithm perform ...? Or do you mean the quote rule (by extension of the LR algorithm) performs ...?)

  • Page 13 (Stacked Rule): For consistency, a comma could go here again.

  • Page 13 (Stacked Rule): "," -- I would put a period here and start a new sentence.

  • Page 13 (Stacked Rule): "Chakrabarty, Li, et al." -- Is there a reason why you write two names and then et al. here, rather than one name and then et al.?

Generate ISE / CBOE supervised results of Gradient Boosting🐈

  • Run supervised training on fbv/thesis/ise_supervised_log_standardized:latest for classical features
  • Run supervised training on fbv/thesis/cboe_supervised_log_standardized:latest for classical features
  • Run supervised training on fbv/thesis/ise_supervised_log_standardized:latest for classical-size features
  • Run supervised training on fbv/thesis/cboe_supervised_log_standardized:latest for classical-size features
  • Run supervised training on fbv/thesis/ise_supervised_log_standardized:latest for ml features
  • Run supervised training on fbv/thesis/cboe_supervised_log_standardized:latest for ml features
  • Generate result sets, e.g., cboe_gbm_supervised_test

Add visualizations of hyperparameter search space🌔

  • Manually create optuna visualizations, as I couldn't adjust style / wrong size
  • strip outputs from notebooks → added pre-commit hook
  • Add to document

Fixed some errors along the way 🐛:

  • CBOE (ML) results were generated from an incomplete run. → results regenerated and are slightly improved now
  • Proximity to quotes was calculated wrongly in the results tables, so that too many samples ended up in unknown.

Edit in annotations from review🎒

Annotations from @lxndrblz .

Detailed comments

  • Page 2 (Related Work (3 p)): "are" -- were?

  • Page 2 (Related Work (3 p)): "62.5 % or 62.0 %" -- Name the smaller number first?

  • Page 2 (Related Work (3 p)): "reported" -- A different word?

  • Page 2 (Related Work (3 p)): "bid or ask." -- Perhaps insert an explanation here?

  • Page 2 (Related Work (3 p)):

    surpassing primary rules by more than 10.0 %, at the cost of data efficiency.

    Maybe explain what this means?

  • Page 3 (Related Work (3 p)): "important" -- relevant

  • Page 3 (Related Work (3 p)):

    Rosenthal (2012, pp. 481–482)

    Rather put the source at the end of the sentence?

  • Page 3 (Related Work (3 p)): "thus" -- I'm not sure whether it is really used like this within a sentence. I mostly see it at the beginning of a sentence.

  • Page 3 (Related Work (3 p)):

    Rosenthal (2012, p. 15)

    Wrong citation? Rosenthal surely belongs inside the parentheses.

  • Page 4 (Related Work (3 p)): "of" -- by?

  • Page 4 (Related Work (3 p)): "To the best of our knowledge," -- Dangling modifier.

  • Page 5 (Rule-Based Approaches): "that sign trades on a trade-by-trade basis" -- What is meant by this?

  • Page 6 (Quote Rule):

    By definition, the quote rule cannot classify trades at the midpoint of the quoted spread.

    Isn't that contradictory to the previous definition of m?

  • Page 6 (Quote Rule): "to couple" -- coupling?

  • Page 7 (Tick Test): "Equation 2" -- Equation 3?

  • Page 7 (Tick Test): "bracketed" -- Constrained

  • Page 7 (Tick Test):

    In practice, Grauer et al. (2022, pp. 29–32) observe higher accuracies for the reverse tick test on a sample of option trades, but both cannot compete with quote-based approaches and calls for more sophisticated approaches. These findings contradict the results from the stock market (Lee & Ready, 1991, p. 737).

    Please explain what this means. The last sentence seems rather lost.

  • Page 9 (Hybrid Rules):

    Popular variants include the Lee-Ready (LR) algorithm, the EMO rule, and the Chakrabarty-Li-Nguyen-Van-Ness (CLNV) method.

    Is there a source for this as well?

  • Page 11 (Lee and Ready Algorithm): "its parts." -- its subparts

  • Page 11 (Lee and Ready Algorithm): "cp." -- Why cp?

  • Page 12 (Ellis-Michaely-O'Hara Rule): "caused" -- causes?

  • Page 12 (Chakrabarty-Li-Nguyen-Van-Ness Method):

    (Chakrabarty et al., 2012, p. 3809)

    Why is the source right in the middle here?

  • Page 13 (Stacked Rule): "trend" -- I would choose a different word.

  • Page 13 (Stacked Rule): "realize" -- unleash/unfold

  • Page 13 (Stacked Rule): "A obvious question is" -- I would omit this.

  • Page 15 (Architectural Overview): "Transformer" -- Why italic here and not above?

  • Page 15 (Architectural Overview):

    At times we fall back to the Transformer for machine translations, to develop a deeper understanding of the architecture and its components.

    Can go, since this part is still about the background and not your own work.

  • Page 18 (Architectural Overview): "observation" -- anderes Wort?

  • Page 19 (Positional Encoding):

    We visualize the positional encoding in Figure 4 with an embedding dimension of 96 and 64 tokens. One can see the alternating pattern between even and odd columns and the unique pattern for each token's position.

    Basierend auf welchen Daten wurde das visualisiert oder gilt das Schaubild allgemein gültig?

  • Page 19 (Positional Encoding): "zero-centred, and" -- Komma kann weg?

  • Page 20 (Positional Encoding): "lement-wisely" -- per element?

  • Page 20 (Position-wise Feed-Forward Networks): "To retain general information on the task" -- Dangling modifier

  • Page 21 (Position-wise Feed-Forward Networks): "linearities to the network." -- Der Umbruch sieht komisch aus.

  • Page 21 (Residual Connections and Layer Normalization):

    (He et al., 2015, pp. 1–2)

    Die Quelle kann weg oder da es ja um die Arbeit von Vaswani geht?

  • Page 22 (Residual Connections and Layer Normalization): "calculated with the statistic" -- Wo gehört das dazu?

  • Page 22 (Residual Connections and Layer Normalization):

    Until now it remains unclear, how the layer normalization intertwines with the sublayers and the residual connections.

    Kann weg.

  • Page 23 (Transformer Networks For Tabular Data):

    Both architectures are depicted in Figure 6a and Figure 6b, respectively

    Würde die zugehörige Grafik auf der selben Seite platzieren.

  • Page 24 (TabTransformer): "To maintain the overall embedding dimension of de," -- Noch ein Dangling Modifier.

  • Page 24 (TabTransformer): "probabilities" -- probabilities outputs -> Dann wäre es gleich wie beim Input.

  • Page 25 (TabTransformer):

    Somepalli et al. (2021, p. 2).

    Huh?

  • Page 34 (Outlook (0.5 p=67.5 p)):

    He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition.

    Maybe run a biber check once to find missing fields. Is this a book?

  • Page 37 (Glossary): "Item" -- Formatting.
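For reference, the sinusoidal positional encoding discussed in the comments above can be sketched in a few lines. This is a minimal sketch of the standard Vaswani et al. (2017) formulation; the function name and the dimensions (embedding dimension 96, 64 tokens) are taken from the figure description in the comment.

```python
import numpy as np

def positional_encoding(num_tokens: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding (Vaswani et al., 2017).

    Even columns use sine, odd columns cosine, so every position gets
    a unique, zero-centred pattern -- the alternation visible in the figure.
    """
    positions = np.arange(num_tokens)[:, None]            # (num_tokens, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model / 2)
    angles = positions / np.power(10_000, dims / d_model)
    pe = np.zeros((num_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)   # even columns
    pe[:, 1::2] = np.cos(angles)   # odd columns
    return pe

# dimensions from the figure described above
pe = positional_encoding(num_tokens=64, d_model=96)
```

Plotting `pe` as a heat map reproduces the alternating even/odd pattern the comment refers to.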

Optional model training 🍭

  • Calculate cardinality dynamically in Classifier
  • Improve gpu utilization of TabTransformer
  • Try out moving window approach, e.g., for CatBoost
  • Get PyTorch 2.0 running once stable. See here.
  • Look into callbacks of PyTorch Lightning / huggingface transformers
  • Set-up PyTorch profiler as shown here
  • Try out other samplers besides TPESampler, like RandomSampler. See how it affects the results.
  • Look into implementing transformers using xformers
  • Let torch.tensors and np.array share memory. Could be useful for pre-training. See here.
  • Add decay hyperparameter for gradient boosting / transformer
  • Run feature selection in CatBoost. See here. A sample could be fine.
  • Try out target encoding
  • Add frequency encoding
# frequency encoding, found at: https://www.kaggle.com/competitions/ieee-fraud-detection/discussion/108575
# map each card1 value to how often it occurs in the column
counts = df['card1'].value_counts().to_dict()
df['card1_counts'] = df['card1'].map(counts)
  • Study feature interactions:
# assumes `interactions` holds rows of (feature index 1, feature index 2, strength),
# e.g. from CatBoost's model.get_feature_importance(type='Interaction')
feature_interaction = [[X.columns[int(row.iloc[0])], X.columns[int(row.iloc[1])], row.iloc[2]]
                       for _, row in interactions.iterrows()]
feature_interaction_df = pd.DataFrame(feature_interaction, columns=['feature1', 'feature2', 'interaction_strength'])
feature_interaction_df.head(10)
  • Compare different feature scaling approaches for the neural net, e.g., min-max normalization, z-score normalization, robust scaling, quantile transformation. See here and here.
  • Study effects of quantization. How can one assist quantization with feature engineering? For technical background, see here and here.
  • Try out nicer progress bar. See here.
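A smoothed target encoding, as suggested in the list above, could look as follows. This is a minimal sketch, not part of the repository: the helper name, the smoothing scheme (blending each category's mean target with the global mean, weighted by category frequency), and the toy columns are assumptions.

```python
import pandas as pd

def target_encode(train: pd.DataFrame, col: str, target: str,
                  smoothing: float = 10.0) -> pd.Series:
    """Smoothed target encoding: rare categories are pulled towards
    the global target mean to limit overfitting (hypothetical helper)."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(['mean', 'count'])
    weight = stats['count'] / (stats['count'] + smoothing)
    encoding = weight * stats['mean'] + (1 - weight) * global_mean
    return train[col].map(encoding)

# toy data: option type vs. a binary buy indicator
df = pd.DataFrame({'option_type': ['C', 'P', 'C', 'P', 'C'],
                   'buy': [1, 0, 1, 1, 0]})
df['option_type_te'] = target_encode(df, 'option_type', 'buy')
```

In practice the encoding should be fit on the training split only (or out of fold) to avoid target leakage.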
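To make the feature-scaling comparison above concrete, here is a small sketch in plain NumPy; the lognormal toy data and the injected outlier are assumptions chosen to mimic a skewed feature such as trade size.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.concatenate([rng.lognormal(0.0, 1.0, 1000), [1e4]])  # skewed, one outlier

# z-score normalization: mean/std are both distorted by the outlier
z = (x - x.mean()) / x.std()

# robust scaling: centre by median, scale by interquartile range
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)

# min-max normalization: the outlier squashes all other values towards 0
minmax = (x - x.min()) / (x.max() - x.min())
```

Comparing the three transformed arrays shows why a robust or quantile-based scaler is usually preferred for heavy-tailed inputs to a neural net.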

Issues from code review 🐛

  • Removes outdated notebook
  • Unifies some cells
  • Fixes import sorting
  • Removes Google Colab badges
  • Completes doc strings
  • Unified use of | instead of Union[...] or Optional[...]
  • Replaces some lengthy code with list comprehensions

Addressed in other issues:

  • try out moving window approach. See #97.
  • try out learning rate scheduling. See #7.
  • change progress bar. See #97.
  • update variable name of logits. See #118.
