GitHub repo with various ML/AI/DS resources that I find useful. I'll populate it with links to articles, libraries, and other resources that I come across. Hoping for more or less regular, ongoing updates.
- How to avoid machine learning pitfalls: a guide for academic researchers: https://arxiv.org/pdf/2108.02497.pdf
- Tabular Data: Deep Learning is Not All You Need: https://arxiv.org/abs/2106.03253
- pytorch-widedeep, deep learning for tabular data IV: Deep Learning vs LightGBM. A thorough comparison between DL algorithms and LightGBM for tabular data for classification and regression problems: https://jrzaurin.github.io/infinitoml/2021/05/28/pytorch-widedeep_iv.html
- SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training: https://github.com/somepago/saint (repo)
- SAINT paper: https://arxiv.org/abs/2106.01342
- Regularization is all you Need: Simple Neural Nets can Excel on Tabular Data: https://arxiv.org/abs/2106.11189
- Revisiting Deep Learning Models for Tabular Data: https://arxiv.org/abs/2106.11959v1
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data: https://arxiv.org/abs/2005.08314v1
- Gradient Boosting Neural Networks: GrowNet. Paper: https://arxiv.org/abs/2002.07971, Code: https://github.com/sbadirli/GrowNet
- XGBoost documentation: https://xgboost.readthedocs.io/en/latest/
- Every Model Learned by Gradient Descent Is Approximately a Kernel Machine: https://arxiv.org/abs/2012.00152
- The quest for adaptivity: https://francisbach.com/quest-for-adaptivity/
- Automatic Frankensteining: Creating Complex Ensembles Autonomously: https://epubs.siam.org/doi/abs/10.1137/1.9781611974973.83
- vecstack is a handy little library that implements stacking transformations on your train and test data. It has both a functional interface and a scikit-learn-style fit/transform interface: https://github.com/vecxoz/vecstack
- RecoTour III: Variational Autoencoders for Collaborative Filtering with Mxnet and Pytorch: https://jrzaurin.github.io/infinitoml/2020/05/15/mult-vae.html
- Michael Jahrer's famous Porto Seguro Kaggle competition solution: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629
- Kaggle Tabular Playground Series February 2021 - 1st place solution writeup: https://www.kaggle.com/c/tabular-playground-series-feb-2021/discussion/222745
- The Shapley Taylor Interaction Index: https://arxiv.org/abs/1902.05622
- Automated Machine Learning: Methods, Systems, Challenges. Probably the single best monograph on AutoML. Published in 2019, so not quite the cutting edge, but still very useful. https://www.amazon.com/Automated-Machine-Learning-Challenges-Springer-ebook/dp/B07S3MLGFW/
- Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools: https://arxiv.org/abs/1908.05557
- AutoML: A Survey of the State-of-the-Art: https://arxiv.org/abs/1908.00709
- Can AutoML outperform humans? An evaluation on popular OpenML datasets using AutoML Benchmark: https://arxiv.org/abs/2009.01564
- H2O Driverless AI documentation: https://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html
- hyperopt-sklearn: https://github.com/hyperopt/hyperopt-sklearn
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929
- Scaling Vision Transformers: "As a result, we successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy. The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class." https://arxiv.org/abs/2106.04560
- Evaluating Large Language Models Trained on Code: https://arxiv.org/abs/2107.03374
- How exactly does word2vec work?: https://www.semanticscholar.org/paper/How-exactly-does-word-2-vec-work-Meyer/49edbe35390224dc0c19aefe4eb28312e70b7e79
- Attention Is All You Need: https://arxiv.org/abs/1706.03762
- Big Bird: Transformers for Longer Sequences: https://arxiv.org/abs/2007.14062
- Boosting algorithms for a session-based, context-aware recommender system in an online travel domain: https://doi.org/10.1145/3359555.3359557
- Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data. An analysis of homophily among politicians on Twitter: http://pablobarbera.com/static/barbera_twitter_ideal_points.pdf
- Leadership Communication and Power: Measuring Leadership in the U.S. House of Representatives from Social Media Data: https://preprints.apsanet.org/engage/apsa/article-details/60c239b28214c646e0a61589
- Final Report: National Security Commission on Artificial Intelligence: https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report-Digital-1.pdf
- Kaggle: https://www.kaggle.com
- EvalAI: https://eval.ai
- Zindi: https://zindi.africa
- DrivenData: https://www.drivendata.org
- CodaLab: https://competitions.codalab.org/
- NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs: https://arxiv.org/abs/2106.12144v1
- Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author): https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full
- A from-scratch tour of Bitcoin in Python: http://karpathy.github.io/2021/06/21/blockchain/
- Install CUDA 11.2, cuDNN 8.1.0, PyTorch v1.8.0 (or v1.9.0), and Python 3.9 on RTX3090 for deep learning: https://medium.com/analytics-vidhya/install-cuda-11-2-cudnn-8-1-0-and-python-3-9-on-rtx3090-for-deep-learning-fcf96c95f7a1
- Autonomy 2.0: Why is self-driving always 5 years away?: https://arxiv.org/abs/2107.08142v1