
thesis's Introduction


thesis

Overview

This repository contains all the resources for my thesis on option trade classification at Karlsruhe Institute of Technology.

notes 📜: see the references folder; download Obsidian from obsidian.md to easily browse the notes.
schedule ⌚: link to tasks and milestones.
experiments 🧪: link to Weights & Biases (requires login).
computing resources ☄️: link to GCP (requires login) and to bwHPC (requires login).
document 🎓: see releases.

Development

Set up git pre-commit hooks 🐙

Pre-commit hooks run checks before each commit to avoid committing error-prone code. The checks are defined in .pre-commit-config.yaml. Install and run them using:

pip install .[dev]
pre-commit install
pre-commit run --all-files

Run tests 🧯

Tests can be run using tox. Just type:

tox

Acknowledgement

The authors acknowledge support by the state of Baden-Württemberg through bwHPC.

Our implementation is based on:

Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2021). Revisiting Deep Learning Models for Tabular Data. Advances in Neural Information Processing Systems, 34, 18932–18943.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: Unbiased boosting with categorical features. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 32, 6639–6649.
Rubachev, I., Alekberov, A., Gorishniy, Y., & Babenko, A. (2022). Revisiting pretraining objectives for tabular deep learning (arXiv:2207.03208). arXiv. http://arxiv.org/abs/2207.03208

thesis's People

Contributors

dependabot[bot], github-actions[bot], karelze, pre-commit-ci[bot]


thesis's Issues

Set up baseline 🧸

  • Calculate some stats from @CaroGrau paper
  • Implement classical rules
  • Convert classical rules to sklearn classifier
  • Document code
  • Clean up requirements / fix version
  • Set up a hyperparameter search with optuna as in this notebook.
  • Set up a simple model, e.g., logistic regression, lightgbm, or catboost
  • Fill missing with -1
  • Plot learning curves as done here.
  • Set up simple robustness checks as in this notebook.
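For converting classical rules into an sklearn-style classifier, a minimal sketch of the tick rule with a fit/predict interface might look as follows; the class name and the buy default on the first trade are assumptions:

```python
# Hedged sketch: the tick rule written with the sklearn fit/predict
# interface in mind, so it can later be wrapped as a proper
# BaseEstimator/ClassifierMixin. Class name and the buy default on the
# first trade are assumptions.
class TickRuleClassifier:
    """Sign a trade +1 (buy) if its price rose versus the previous
    trade, -1 (sell) if it fell; carry the last signal on zero ticks."""

    def fit(self, X=None, y=None):
        # The tick rule needs no training; fit exists only for API parity.
        return self

    def predict(self, prices):
        labels, last, prev = [], 1, None
        for p in prices:
            if prev is not None and p != prev:
                last = 1 if p > prev else -1
            labels.append(last)
            prev = p
        return labels

clf = TickRuleClassifier().fit()
print(clf.predict([10.0, 10.5, 10.5, 10.2]))  # -> [1, 1, 1, -1]
```

Wrapping it this way keeps the rule drop-in compatible with sklearn tooling such as pipelines and metrics.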

Data preprocessing and feature engineering 🚜

  • Write proposal for feature sets based on EDA / literature research
  • Write script for feature engineering
  • Create more features from quote and price data. See, e.g., Rosenthal or Prado
  • Combine relative bid and ask into one measure. Look at the distribution first. Look at the information density feature found in the Rosenthal paper.
  • Perform adversarial validation on newly created feature sets, e.g., with min-max scaling, with/without log-transform, etc., and get feature importances
  • Create features that are hard to learn for neural nets and gradient boosting machines
  • When loading data, verify wandb hashes
  • Add economic intuition to each feature. Which paper suggests the feature, and why does it make sense? (Feedback from @CaroGrau.) Research why a log-transform on prices makes sense from a theoretical perspective.
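One hedged way to fold the relative bid and ask into a single measure is a signed proximity-to-quotes feature; the function name and the NaN convention for crossed quotes are assumptions:

```python
# Hedged sketch of combining relative bid and ask into one measure: a
# signed proximity to quotes that is 0 at the midpoint, +1 at the ask,
# and -1 at the bid; values outside [-1, 1] flag trades outside the
# spread. Function name and the NaN convention are assumptions.
def proximity_to_quotes(price, bid, ask):
    mid = 0.5 * (bid + ask)
    half_spread = 0.5 * (ask - bid)
    if half_spread <= 0:  # crossed or locked quotes: undefined
        return float("nan")
    return (price - mid) / half_spread

print(proximity_to_quotes(10.0, 9.0, 11.0))  # midpoint trade -> 0.0
print(proximity_to_quotes(11.0, 9.0, 11.0))  # trade at the ask -> 1.0
```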

Set-up writing environment

  • Set-up obsidian vault
  • Renew Grammarly subscription
  • Migrate old notes on transformer, GBMs etc. to obsidian
  • Set-up GitHub action to built document
  • Write checklist / script to test for common mistakes like long sentences, nested sentences, repeat main learning... (#4)

Model training and optimization 🎯

  • Decide whether I want to treat option trade classification as a probabilistic classification problem or not
  • Bundle training in a Docker container that can run in any pod or on SCC infrastructure
  • Add parametrized script to run studies
  • Get access to BwUniCluster2.0
  • Write script to start studies
  • Free memory with gc.collect()
  • Set up pre-commit hooks, e.g., mypy
  • Implement gradient boosting using catboost
  • Implement TabTransformer using TabSurvey
  • Try out TabTransformer closer to the original paper without einops, using default PyTorch implementations for attention (see, e.g., here).
  • Write custom DataSet for PyTorch
  • Improve training performance of TabTransformer
  • Look into data pipes / data loader 2 for PyTorch
  • Implement save callback for PyTorch models
  • Use multiple GPUs in PyTorch (DataParallel). See here.
  • Implement FTTransformer
  • Add timing code
  • Implement TabNet using TabSurvey
  • Set up weights and biases integration as shown here or here
  • Set up test cases for gbms
  • Set up test cases for classical classifier
  • Set up test cases for TabTransformer.
  • Set up test cases for FTTransformer. See here
  • Set up test cases for TabNet
  • Simplify objective code through train() and test() method as done here.
  • Add weighted loss for neural networks
  • Adjust early stopping of neural networks to work with accuracy
  • Migrate heroku db
  • Save completed study objects to gcs and track in wandb
  • Track saved models in wandb
  • Visualize learning curves for best CatBoost model
  • Research, if early stopping in neural nets should also be done based on the accuracy
  • Add code to obtain feature importances from trained models
  • Add shared embeddings to TabTransformer. See paper or here.
  • Address the problem of how to handle the high dimensionality of categorical variables
  • Research if there is a broader theory / concept behind decay, e.g., exponential smoothing, weighted regression, etc.
  • Visualize decay parameter and find optimal decay factor
  • Study samples / probabilities where the prediction is wrong
  • Fully understand what the target value should be for gradient boosting: -1/1, or 0/1 as in neural nets?
  • Add code to study learning curves, e.g., in wandb
  • Add code for pre-training
  • Experiment with learning rate scheduling
  • Add code for attention visualization. First viz done in #85, but still have to research the best approach to combine maps over different attention heads and layers. Finally decided for a method proposed by Chefer et al. See here and here.
  • Get slurm running
  • Go through the Google playbook
  • Define rounds for what I want to optimize
  • Do searches more systematically. See this article here, here, and here.
  • Extend experiment tracking as shown here.
  • Normalize data
  • Set up configuration for training
  • Set up a concrete action plan for how to improve training
  • PyTorch 2.0 integration
  • Differentiate into exploration and exploitation phase
  • Think about folding the validation set into the training set and retrain the best configuration
  • Check out retraining
  • Change early stopping criterion
  • Set up option to fix some hyperparams through a config or so
  • Verify hyperparameter search space
  • Use batch size finder (Implemented in #125)
  • Add some option to generate results fast
  • Implement a retraining
  • Replace early stopping with checkpointing?
  • Check if logits is the right word in code
  • What samples does the model get wrong?
  • Add training curves to wandb
  • Check in wandb if the hyperparameter search space is chosen optimally
  • Rerun studies with different initializations. See how it affects the results
  • Add code to average results from different initializations
  • Add code for visualizations, e.g., hyperparameter search space, influence of randomness, etc.
  • Do code review with @pheusel or @lxndrblz
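For averaging results from different initializations, a minimal sketch could aggregate accuracies over seeds and report a mean with a spread; the seed values and accuracies below are placeholders:

```python
from statistics import mean, stdev

# Placeholder accuracies from reruns of the same configuration with
# different random seeds; real values would come from the study logs.
acc_by_seed = {0: 0.741, 42: 0.738, 1337: 0.744}

accs = list(acc_by_seed.values())
summary = f"{mean(accs):.3f} +/- {stdev(accs):.3f}"
print(summary)  # -> 0.741 +/- 0.003
```

Reporting mean and standard deviation over seeds makes the effect of random initialization visible in the result tables.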

Run studies for `SelfTrainingClassifier`🅾️

  • Add support to insert unlabelled trades chronologically
  • Regenerate unlabelled feature sets
  • Verify predicted probabilities are high enough
  • Run on feature set classical
  • Run on feature set classical-size
  • Run on feature set ml
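A minimal sketch of sklearn's `SelfTrainingClassifier`, where unlabelled trades are marked with -1 in the target; the toy data and the 0.6 threshold are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy feature matrix; -1 in y marks unlabelled trades that the
# self-training loop may pseudo-label when it is confident enough.
X = np.array([[0.0], [0.1], [0.9], [1.0], [0.05], [0.95]])
y = np.array([0, 0, 1, 1, -1, -1])

clf = SelfTrainingClassifier(LogisticRegression(), threshold=0.6)
clf.fit(X, y)
pred = clf.predict(np.array([[0.02], [0.98]]))
print(pred)  # -> [0 1]
```

The `threshold` controls how confident the base learner must be before an unlabelled trade is pseudo-labelled, which matters for the "verify predicted probabilities are high enough" item above.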

Print-ready tables🖨️

  • Typeset my auto-generated tables
  • Title case index and columns
  • Add proper captions
  • Indent first column as done here. → not possible with pandas. Do manually.
  • Show best in class in bold
  • Fix warning in universal result notebook
  • Decide which columns to include

Add review comments from Christian

  • Page 2 (Related Work (3 p)): Introductory text?

  • Page 2 (Related Work (3 p)): Why not make this Section 2.1?

  • Page 2 (Related Work (3 p)): Space

  • Page 2 (Related Work (3 p)): Space

  • Page 3 (Related Work (3 p)): Section 2.2?

  • Page 4 (Related Work (3 p)): very nice chapter

  • Page 6 (Quote Rule): .

  • Page 6 (Quote Rule): is the comma superfluous?

  • Page 6 (Tick Test): The footnote still runs onto the next page. Word is a blessing :)

  • Page 9 (Trade Size Rule): is that how it is spelled?

  • Page 12 (Ellis-Michaely-O'Hara Rule): "cp."?

  • Page 13 (Stacked Rule): I would not cite this if I were you, since everything not attributed otherwise is your own work; but of course you can do it this way

  • Page 14 (Stacked Rule): "cp."?

  • Page 19 (Architectural Overview): "over"

  • Page 24 (Residual Connections and Layer Normalization): what kind of character is that? :D

  • Page 45 (Glossary):

    Why are there always numbers after the explanations? Are these the pages where the term is used?
    If so, I find that unnecessary :D If you want to keep them, you could perhaps prefix them with "pp." or so

Update chapter on dataset/results📑

  • Find out timespans
  • Use different reasoning why professional customers are excluded
  • Extend the text to note that the unlabelled dataset also includes non-customer trades
  • Fill in gaps with regard to timespan
  • Fill in gaps with regard to trade initiator
  • Update visualizations
  • Update text on CBOE
  • Improve typesetting of graphics
  • Improve typesetting of images / tables

Allow unclassified in ClassicalClassifier🏦

As discussed with @CaroGrau, evaluate the percentage of trades that cannot be classified.

  • Add option to classical classifier to disable random assignment of unclassified trades and assign 0 instead
  • Create table in results notebook similar to Table 3 in Grauer et al.
  • Run results notebook with unclassified trades
  • Create new tests
  • Send results to @CaroGrau
  • Update docs and pass precommit hooks

Set up dataset 🔢

  • Request data from @CaroGrau
  • Load csv data into pandas data frame
  • Infer correct datatypes
  • Create sub-samples, e.g., 2015
  • Create train-test split. Consider leakage.
  • Clean up requirements / fix version
  • Create tests / assertions against @CaroGrau paper
  • Run adversarial validation to check for differences in the training and test set
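The leakage-aware split above can be sketched as a chronological cut rather than a random one; the column names, dates, and cutoff are assumptions:

```python
import pandas as pd

# Hedged sketch of a leakage-aware train-test split: order trades by
# timestamp and cut at a date instead of splitting at random, so no
# future information leaks into the training set. Column names and the
# cutoff date are assumptions.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2013-05-02", "2014-06-01", "2015-03-01", "2015-11-05"]
        ),
        "buy_sell": [1, -1, 1, -1],
    }
).sort_values("timestamp")

cutoff = pd.Timestamp("2015-01-01")
train = df[df["timestamp"] < cutoff]
test = df[df["timestamp"] >= cutoff]
print(len(train), len(test))  # -> 2 2
```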

Chapter on Data Preprocessing🏭

  • Collect notes on preprocessing applied by Grauer et al. This is now part of section on data construction.
  • Distinguish how data preprocessing differs from a paragraph on sample construction in the dataset chapter
  • Align chapter with the chapter on feature engineering. Feature engineering is tracked in #168. Will become part of the data preprocessing chapter.
  • Write a draft

Section on Feature Engineering🪄

  • Shorten prewritten text / remove redundant parts
  • Motivate feature sets
  • Do more research on what my models can learn e.g., linear models and why it makes sense to perform feature engineering
  • Discuss how missing values are treated
  • Discuss why we put so much emphasis on the feature set
  • Better motivate why feature engineering is required in the first place

Write expose

  • Finish research on SOTA methods for tabular data
  • Finish draft for structure
  • Write first draft of expose
  • Finalize expose
  • Send to @CaroGrau

Load and preprocess unlabelled ISE data🪄

  • Calculate feature day_vol
  • Remove labelled duplicates
  • Verify option characteristics etc. are the same for trades in the labelled dataset and unlabelled dataset
  • Calculate time-series correlation between the volume of professional customers and market makers → Currently not possible, as I don't have the ISE trade profile. Asked @CaroGrau via mail whether I should calculate the time-series correlation.
  • Save a joint dataset of unlabelled and labelled trades → I save unlabelled trades only.

Improve feature engineering notebook🤏

  • Change naming / location of result files, e.g., train set in a properly named subfolder in the feature engineering notebook
  • Switch between log_standardized and unscaled mode
  • Remove highly infrequent classes → keep as is
  • Fix calculation of proximity to quotes and midspread
  • Regenerate datasets
  • Remove outdated code / features

Pre-writing in Zettelkasten 🗄️

  • Proper notation for recursion in tick rule / reverse tick rule. See: https://math.stackexchange.com/questions/4640910/notation-when-accessing-sequence-elements-in-recusive-formula
  • Separate chapter for hybrid approaches through stacking? See questions.md
  • Separate chapter to discuss the trade initiator? See questions.md
  • Research if there are any other works that use majority voting or stacking
  • Check specialist / market maker at CBOE.
  • "Culminating in": indicate that Grauer et al. didn't invent this concept, but applied it most extensively
  • Add Murjajev to the concept of stacking (quote rule (ex) -> quote rule (nbbo))
  • Motivate depth rule with Glosten 1994 (found in Hagströmer) -> Why depth is informative
  • Uppercase $p$, $m$ etc.
  • Denote sells by $-1$ instead of $0$.
  • Add two ideas of LR algorithm as "intuition"
  • Better motivate the quote rule. See, e.g., Roll (1984) (found in Peterson and Fialkowski)
  • One sentence about the trade initiator -> technical limitation (see Peterson and Sirri, p. 263)

Fix midpoint / spread in `ClassicalClassifier` , feature engineering, and `effective spread`🐞

  • Update results_universal.ipynb. During the implementation of the effective spread, it became apparent that the results depend highly on a correct estimate of the midspread. Grauer et al. also incorporate this criterion. The midspread can be wrong if bid > ask, as it is calculated by (ask + bid) * 0.5. Implement a different calculation. → Done in #234 already.
  • Update ClassicalClassifier
  • Update feature engineering notebook → Do in #231
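The corrected midspread described above can be sketched as follows; the NaN convention for crossed quotes is an assumption about the fix:

```python
import math

# Hedged sketch of a corrected midspread: (ask + bid) / 2 is only
# meaningful when quotes are not crossed, so bid > ask is flagged as
# invalid (NaN) instead of silently yielding a midpoint.
def midspread(bid, ask):
    if bid > ask:  # crossed quotes: the midpoint is unreliable
        return float("nan")
    return 0.5 * (bid + ask)

print(midspread(9.0, 11.0))              # -> 10.0
print(math.isnan(midspread(11.0, 9.0)))  # -> True
```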

Evaluation 💯

Chapter on Self-Training⭕

  • Chapter on semi-supervised learning
  • Introduce notation for semi-supervised learning
  • Introduce self-training

Remove from feature set mode `none` the zero imputation 🐞

I noticed that, in none mode in the feature engineering notebook, we performed zero-imputation. This might lead to wrong results when classical rules are applied afterwards.

  • Fix imputation in feature engineering notebook
  • Reran result generation

Exploratory data analysis 🌋

  • Get an understanding of how options or trades are distributed
  • Check out these general tips for EDA
  • Recreate summary statistics from table 1
  • Check for correlated features (see, e.g., here and here)
  • Check for multi-collinearity
  • Check for necessary transformations, e.g., log transform
  • Perform EDA with, e.g., AutoViz, umap, or pandas-profiling
  • Is there a pattern for the missing values?
  • Are there disguised missing values? E.g., an ask of zero. Will discuss in #106.
  • Summarize the most important findings for the plots
  • Integrate findings into feature proposal from #30
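The check for disguised missing values can be sketched as a simple count of suspicious quote values; column names and the toy data are placeholders:

```python
import pandas as pd

# Hedged EDA check for disguised missing values, e.g., an ask of zero
# that really means "no quote". Column names and values are placeholders.
df = pd.DataFrame({"bid": [9.0, 8.0, 0.0], "ask": [10.0, 0.0, 9.5]})

disguised = (df[["bid", "ask"]] <= 0).sum()
print(disguised)  # per-column counts of suspicious (<= 0) quote values
```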

Check writing style / format / content 👔

Check for the following mistakes:

  • Cref, units, spaces between units (\SI{10}{\percent}), plot ranges, suppress titles, [H]. See this guide
  • Is everything rounded to the right precision?
  • Are page numbers correct?
  • Make sure the ToC is not too fragmented / chapters have an equal length (feedback from @CaroGrau)
  • Does the text follow the Gopen rules?
  • Does the text contain inconsistent capitalization?
  • Does the text contain overly long sentences?
  • Does the text contain deeply nested sentences?
  • Check for tense
  • Check for inconsistent decimal point
  • Check for inappropriate adjectives
  • Check for correct citations
  • Is the notation in formulas consistent?
  • Is the notation in formulas complete?
  • Is the economic intuition always clear?
  • Check if document is printable in copyshop
  • Check your spelling and grammar e. g., with Grammarly
  • Check academic grammar with this tool
  • Check grammar with MS office
  • Check for active voice
  • Within your text, two headlines must not be next to each other, instead, add a separating sentence that introduces the topic.
  • paper should have an outline.
  • paper should have an abstract. (The abstract should be placed before the table of contents.)
  • Formulae are followed by a punctuation, e.g. 2+2 = 4.
  • Check the number of decimal places. A number such as 1.23456 might be correct, but given possible
    perturbations and errors of the original data, it is common to restrict oneself to roughly one to three decimal places
    (e.g. 1.23). This is done by rounding correctly. This does not affect your original R output; copy this “as it is”.
  • Explain all the variables (especially in formulae) you use.
  • Variables must be in italic, such as 𝑥 instead of x.
  • Do not put a colon “:” in the line before formulae.
  • Mathematical explanations are not clear/comprehensible.
  • Longer equations should be placed in a separate line – either aligned to the left or centered.
  • The usage of variable names is not coherent. This issue results e.g. from using some variable 𝑒 as an error term and then as a time series.
  • Replace “×” in formulae by “∙” as this is more common. See .\check_formalia.py.
  • Check if some of your figures have a low resolution or appear pixelated.
  • Check your captions beneath figures. Correct examples are: Fig. 1. Some text. Figure 2. Some other text.
    Pay attention that the text starts with a capital letter and the sentence is accompanied by a punctuation.
  • Check if figures/tables are not referenced in the text.
  • When referencing figures and tables in the text, it is more common to use a capital letter such as “Equation (3)”, “Figure 1” and “Table 2”. See .\check_formalia.py.
  • The first line of your table (the one with the column names) should always be printed in bold.
  • Do some non-trivial definitions/explanations need a reference?
  • When citing websites, add the date when accessed.
  • When adding books in your bibliography, add publisher and location of publishing.
  • When citing from textbooks, please also include the page numbers in the reference.
  • Are the references in the text in a correct format?
  • Kindly fix the capitalization in your bibliography in order to make it consistent.
  • Check if your introduction lacks a good, coherent motivation.
  • Add a summarizing sentence in your conclusion that tells which model, after all, you recommend.
  • Add examples where noted.
  • Check if you need to add further evaluations?
  • Check if your explanations are not clear/comprehensible.
  • Check if the usage of capitalization in your headlines is inconsistent. Either use ALWAYS capital initial letters such as in “Table of Contents” or ALWAYS an initial capital letter followed by small ones such as “Table of contents”.
  • Before brackets, there must be a space. Not, “ordinary least squares(OLS)”, but “ordinary least squares (OLS)”.
  • Check with numbers or sentence punctuations (point, comma, etc.) for incorrect spacing.
    In most cases, a footnote at the end of a sentence follows the punctuation as this example shows.
  • Check if spacing after paragraphs is consistent
  • Check if language is biased
  • Check if paragraphs are created consistently
  • Check for consistency between paragraphs
  • Check if hyphenation is correct in babel
  • Check if graphics are consistent to the "Ten Simple Rules for Better Figures"
  • Check citation counts with zotero addon
  • Check validity of doi with zotero plugin / fetch valid dois
  • Check all links in thesis, if they can be resolved via script
  • Check if words appear with and w/o hyphen
  • Check if all abbreviations in the text are wrapped in \gls{}
  • Check if there are unnecessary abbreviations. Done automatically with #154.
  • Make sure terms from supervised ML are used consistently. See, e.g., here.
  • Set up consistent plotting early on, e.g., style. E.g., see SciencePlots.
  • Run LaTeX cleaner from here.
  • Make pdf archivable see here. or here.
  • Check this guide for visualizations and use HSL color space.
  • Polish up the most important plots and findings for use in paper
  • Do final check with writeful
  • Run biber for incomplete entries. See comment of @lxndrblz in #173.
  • Check if I included some American words 🇺‍🇸 https://codewordsolver.com/american-british-english-translator/ or auto-convert with python https://github.com/orsinium-labs/eng/tree/master/eng
  • Run biber --validate_datamodel thesis.bcf
  • Check if links refer to the correct labels. Are there duplicate labels?

Add ability to resume studies⏯️

Long studies take much longer than the 4-hour compute allotment, so I need the ability to stop and resume studies.
This became apparent when implementing #209.

  • Use database, as done here.
  • Update name of files
  • Update defaults for parameters

Extend result generation🏁

  • Add statistical tests to effective spread
  • Add "All" column
  • Use a more robust approach to calculate the location relative to the quote
  • Add "cline" / group by basic rules, hybrid rules, and ML → Do manually
  • Line wrapping in columns
  • Remove manual editing, e.g., remove "issue_type"
  • Load multiple result sets, e.g., gradient boosting, Transformer, etc.
  • Made all notebooks flexible enough to handle different datasets and subsets.
  • Store all datasets with a consistent identifier in wandb
  • Delete outdated datasets from Google Cloud

Edit in annotations from review 2👩‍💻

Detailed comments

  • Page 2 (Related Work (3 p)): "the tick rule, quote rule" -- I know the introduction is still to come, but will you briefly say something about the individual rules there? I know you explain them all in more detail in the theory part, but if one hears of them here for the first time, one is thrown in at the deep end (imo the introduction should therefore at least explain in a few short sentences what rule-based approaches are / how they work).

  • Page 2 (Related Work (3 p)):

    Consistently for options traded at the International Securities Exchange (ISE), and CBOE classical rules like the popular Lee and Ready algorithm only achieve accuracies of 62.53 % or 62.03 % and are thus significantly smaller than in the stock market.

    I don't quite understand this sentence. Could it be that the comma before the "and" has to go, and a comma belongs after "CBOE"?

  • Page 3 (Related Work (3 p)):

    Here comes a somewhat general paragraph that compares the rule-based approaches with the ML approaches and explains what your work tries to achieve. Something like this could also be inserted, at least briefly, at the beginning of the related work part (before the rule-based section), so that one gets a rough overview before going into the details.

    Of course, this again depends on what will be in your introduction; that will be very interesting.

    I could also add here that, imo, it is important that the introduction is aligned with this content (essentially prepares the reader for the related work). In that context, it can be explained roughly what the individual approaches are, how they compare to / relate to each other, and how they have been used in recent years (not viewed historically, but briefly outlining the whole as an introduction, as motivation for why / where this thesis comes in).

  • Page 4 (Related Work (3 p)): What about Transformers in this area? Have they not been used at all yet? If so, I would put a short disclaimer somewhere here, because you intend to use them as well. If they are not mentioned anywhere here but Transformers are used later, that could confuse the reader as to the state of the literature.

  • Page 5 (Rule-Based Approaches): "next section." -- Does this mean Section 3.1? Or Section 4? I would personally write out the concrete section here (but that is usually a matter of preference).

  • Page 7 (Tick Test): "in" -- I would write "by" instead of "in" here.

  • Page 7 (Tick Test): "," -- I would omit the comma here.

  • Page 7 (Tick Test): "donated" -- I guess you mean "denoted" here.

  • Page 7 (Tick Test): I would put a comma here.

  • Page 8 (Depth Rule): "the" -- Is the "the" needed here?

  • Page 8 (Depth Rule): A comma could go here again. In English it is usually optional whether a comma goes in such places or not. But then make sure it stays consistent throughout your thesis.

  • Page 9 (Trade Size Rule): "the" -- See above regarding "the".

  • Page 9 (Trade Size Rule): "the" -- Here perhaps "their" instead of "the" (again, I think, a matter of preference).

  • Page 11 (Ellis-Michaely-O'Hara Rule):

    By analysing miss-classified trades with regard to the proximity of the trade to the quotes, they observe, that the quote rule and by extension of the LR algorithm performs particularly well at classifying trades executed at the bid and the ask price but trail the performance of the tick rule for trades inside or outside the spread (Ellis et al., 2000:pp. 535–536).

    The sentence is somewhat long & hard to read (especially the part "and by extention of the LR algorithm..."; here I am not quite sure what exactly is meant. Do you mean both the quote rule and (by extension) also the LR algorithm perform ...? Or do you mean the quote rule (by extension of the LR algorithm) performs ...?)

  • Page 13 (Stacked Rule): For consistency, a comma could go here again.

  • Page 13 (Stacked Rule): "," -- I would put a period here and start a new sentence.

  • Page 13 (Stacked Rule): "Chakrabarty, Li, et al." -- Is there a reason why you write two names and then et al. here, rather than one name and then et al.?

Generate ISE / CBOE supervised results of Gradient Boosting🐈

  • Run supervised training on fbv/thesis/ise_supervised_log_standardized:latest for classical features
  • Run supervised training on fbv/thesis/cboe_supervised_log_standardized:latest for classical features
  • Run supervised training on fbv/thesis/ise_supervised_log_standardized:latest for classical-size features
  • Run supervised training on fbv/thesis/cboe_supervised_log_standardized:latest for classical-size features
  • Run supervised training on fbv/thesis/ise_supervised_log_standardized:latest for ml features
  • Run supervised training on fbv/thesis/cboe_supervised_log_standardized:latest for ml features
  • Generate result sets, e.g., cboe_gbm_supervised_test

Add visualizations of hyperparameter search space🌔

  • Manually create optuna visualizations, as I couldn't adjust style / wrong size
  • strip outputs from notebooks → added pre-commit hook
  • Add to document

Fixed some errors along the way 🐛:

  • CBOE (ML) results were generated from an incomplete run. → results regenerated and are slightly improved now
  • Proximity to quotes was calculated wrongly in the results tables, so that too many samples ended up in unknown.

Edit in annotations from review🎒

Annotations from @lxndrblz .

Detailed comments

  • Page 2 (Related Work (3 p)): "are" -- were?

  • Page 2 (Related Work (3 p)): "62.5 % or 62.0 %" -- Name the smaller number first?

  • Page 2 (Related Work (3 p)): "reported" -- A different word?

  • Page 2 (Related Work (3 p)): "bid or ask." -- Perhaps insert an explanation here?

  • Page 2 (Related Work (3 p)):

    surpassing primary rules by more than 10.0 %, at the cost of data efficiency.

    Maybe explain what this means?

  • Page 3 (Related Work (3 p)): "important" -- relevant

  • Page 3 (Related Work (3 p)):

    Rosenthal (2012, pp. 481–482)

    Rather put the source at the end of the sentence?

  • Page 3 (Related Work (3 p)): "thus" -- I'm not sure whether it is really used like this within a sentence. I mostly see it at the beginning of a sentence.

  • Page 3 (Related Work (3 p)):

    Rosenthal (2012, p. 15)

    Wrong citation? Rosenthal surely belongs inside the parentheses.

  • Page 4 (Related Work (3 p)): "of" -- by?

  • Page 4 (Related Work (3 p)): "To the best of our knowledge," -- Dangling modifier.

  • Page 5 (Rule-Based Approaches): "that sign trades on a trade-by-trade basis" -- What is meant by this?

  • Page 6 (Quote Rule):

    By definition, the quote rule cannot classify trades at the midpoint of the quoted spread.

    Isn't that contradictory to the previous definition of m?

  • Page 6 (Quote Rule): "to couple" -- coupling?

  • Page 7 (Tick Test): "Equation 2" -- Equation 3?

  • Page 7 (Tick Test): "bracketed" -- Constrained

  • Page 7 (Tick Test):

    In practice, Grauer et al. (2022, pp. 29–32) observe higher accuracies for the reverse tick test on a sample of option trades, but both cannot compete with quote-based approaches and calls for more sophisticated approaches. These findings contradict the results from the stock market (Lee & Ready, 1991, p. 737).

    Please explain what this means. The last sentence seems rather lost.

  • Page 9 (Hybrid Rules):

    Popular variants include the Lee-Ready (LR) algorithm, the EMO rule, and the Chakrabarty-Li-Nguyen-Van-Ness (CLNV) method.

    Is there a source for this as well?

  • Page 11 (Lee and Ready Algorithm): "its parts." -- its subparts

  • Page 11 (Lee and Ready Algorithm): "cp." -- Why cp?

  • Page 12 (Ellis-Michaely-O'Hara Rule): "caused" -- causes?

  • Page 12 (Chakrabarty-Li-Nguyen-Van-Ness Method):

    (Chakrabarty et al., 2012, p. 3809)

    Why is the source right in the middle here?

  • Page 13 (Stacked Rule): "trend" -- I would choose a different word.

  • Page 13 (Stacked Rule): "realize" -- unleash/unfold

  • Page 13 (Stacked Rule): "A obvious question is" -- I would omit this.

  • Page 15 (Architectural Overview): "Transformer" -- Why italic here and not above?

  • Page 15 (Architectural Overview):

    At times we fall back to the Transformer for machine translations, to develop a deeper understanding of the architecture and its components.

    Can go, since this part is still about the background and not your own work.

  • Page 18 (Architectural Overview): "observation" -- anderes Wort?

  • Page 19 (Positional Encoding):

    We visualize the positional encoding in Figure 4 with an embedding dimension of 96 and 64 tokens. One can see the alternating pattern between even and odd columns and the unique pattern for each token's position.

    Basierend auf welchen Daten wurde das visualisiert oder gilt das Schaubild allgemein gültig?

  • Page 19 (Positional Encoding): "zero-centred, and" -- Komma kann weg?

  • Page 20 (Positional Encoding): "lement-wisely" -- per element?

  • Page 20 (Position-wise Feed-Forward Networks): "To retain general information on the task" -- Dangling modifier

  • Page 21 (Position-wise Feed-Forward Networks): "linearities to the network." -- Der Umbruch sieht komisch aus.

  • Page 21 (Residual Connections and Layer Normalization):

    (He et al., 2015, pp. 1–2)

    Die Quelle kann weg oder da es ja um die Arbeit von Vaswani geht?

  • Page 22 (Residual Connections and Layer Normalization): "calculated with the statistic" -- Wo gehört das dazu?

  • Page 22 (Residual Connections and Layer Normalization):

    Until now it remains unclear, how the layer normalization intertwines with the sublayers and the residual connections.

    Kann weg.

  • Page 23 (Transformer Networks For Tabular Data):

    Both architectures are depicted in Figure 6a and Figure 6b, respectively

    Würde die zugehörige Grafik auf der selben Seite platzieren.

  • Page 24 (TabTransformer): "To maintain the overall embedding dimension of de," -- Noch ein Dangling Modifier.

  • Page 24 (TabTransformer): "probabilities" -- probabilities outputs -> Dann wäre es gleich wie beim Input.

  • Page 25 (TabTransformer):

    Somepalli et al. (2021, p. 2).

    Huh?

  • Page 34 (Outlook (0.5 p=67.5 p)):

    He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition.

    Maybe run a biber check once to find missing fields. Is this a book?

  • Page 37 (Glossary): "Item" -- Formatting.
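For reference, the sinusoidal positional encoding discussed in the comments above can be sketched in a few lines. This is a minimal sketch of the standard Vaswani et al. (2017) formulation; the function name and the dimensions (embedding dimension 96, 64 tokens) are taken from the figure description in the comment.

```python
import numpy as np

def positional_encoding(num_tokens: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding (Vaswani et al., 2017).

    Even columns use sine, odd columns cosine, so every position gets
    a unique, zero-centred pattern -- the alternation visible in the figure.
    """
    positions = np.arange(num_tokens)[:, None]            # (num_tokens, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model / 2)
    angles = positions / np.power(10_000, dims / d_model)
    pe = np.zeros((num_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)   # even columns
    pe[:, 1::2] = np.cos(angles)   # odd columns
    return pe

# dimensions from the figure described above
pe = positional_encoding(num_tokens=64, d_model=96)
```

Plotting `pe` as a heat map reproduces the alternating even/odd pattern the comment refers to.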

Optional model training 🍭

  • Calculate cardinality dynamically in Classifier
  • Improve gpu utilization of TabTransformer
  • Try out moving window approach, e.g., for CatBoost
  • Get PyTorch 2.0 running once stable. See here.
  • Look into callbacks of PyTorch Lightning / huggingface transformers
  • Set-up PyTorch profiler as shown here
  • Try out other samplers besides TPESampler, like RandomSampler. See how it affects the results.
  • Look into implementing transformers using xformers
  • Let torch.tensors and np.array share memory. Could be useful for pre-training. See here.
  • Add decay hyperparameter for gradient boosting / transformer
  • Run feature selection in CatBoost. See here. A sample could be fine.
  • Try out target encoding
  • Add frequency encoding
# frequency encoding, found at: https://www.kaggle.com/competitions/ieee-fraud-detection/discussion/108575
# map each card1 value to how often it occurs in the column
counts = df['card1'].value_counts().to_dict()
df['card1_counts'] = df['card1'].map(counts)
  • Study feature interactions:
# assumes `interactions` holds rows of (feature index 1, feature index 2, strength),
# e.g. from CatBoost's model.get_feature_importance(type='Interaction')
feature_interaction = [[X.columns[int(row.iloc[0])], X.columns[int(row.iloc[1])], row.iloc[2]]
                       for _, row in interactions.iterrows()]
feature_interaction_df = pd.DataFrame(feature_interaction, columns=['feature1', 'feature2', 'interaction_strength'])
feature_interaction_df.head(10)
  • Compare different feature scaling approaches for the neural net, e.g., min-max normalization, z-score normalization, robust scaling, quantile transformation. See here and here.
  • Study effects of quantization. How can one assist quantization with feature engineering? For technical background, see here and here.
  • Try out nicer progress bar. See here.
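A smoothed target encoding, as suggested in the list above, could look as follows. This is a minimal sketch, not part of the repository: the helper name, the smoothing scheme (blending each category's mean target with the global mean, weighted by category frequency), and the toy columns are assumptions.

```python
import pandas as pd

def target_encode(train: pd.DataFrame, col: str, target: str,
                  smoothing: float = 10.0) -> pd.Series:
    """Smoothed target encoding: rare categories are pulled towards
    the global target mean to limit overfitting (hypothetical helper)."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(['mean', 'count'])
    weight = stats['count'] / (stats['count'] + smoothing)
    encoding = weight * stats['mean'] + (1 - weight) * global_mean
    return train[col].map(encoding)

# toy data: option type vs. a binary buy indicator
df = pd.DataFrame({'option_type': ['C', 'P', 'C', 'P', 'C'],
                   'buy': [1, 0, 1, 1, 0]})
df['option_type_te'] = target_encode(df, 'option_type', 'buy')
```

In practice the encoding should be fit on the training split only (or out of fold) to avoid target leakage.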
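To make the feature-scaling comparison above concrete, here is a small sketch in plain NumPy; the lognormal toy data and the injected outlier are assumptions chosen to mimic a skewed feature such as trade size.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.concatenate([rng.lognormal(0.0, 1.0, 1000), [1e4]])  # skewed, one outlier

# z-score normalization: mean/std are both distorted by the outlier
z = (x - x.mean()) / x.std()

# robust scaling: centre by median, scale by interquartile range
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)

# min-max normalization: the outlier squashes all other values towards 0
minmax = (x - x.min()) / (x.max() - x.min())
```

Comparing the three transformed arrays shows why a robust or quantile-based scaler is usually preferred for heavy-tailed inputs to a neural net.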

Issues from code review 🐛

  • Removes outdated notebook
  • Unifies some cells
  • Fixes import sorting
  • Removes Google Colab badges
  • Completes doc strings
  • Unified use of | instead of Union[...] or Optional[...]
  • Replaces some lengthy code with list comprehensions

Addressed in other issues:

  • try out moving window approach. See #97.
  • try out learning rate scheduling. See #7.
  • change progress bar. See #97.
  • update variable name of logits. See #118.
