
torch-control's Introduction

Deep Learning Sandbox

Author: Bogdan Dzyubak, PhD

Email: [email protected]

Date: 02/26/2024

Repository: torch-control

Purpose:

The purpose of this project is to explore a wide variety of neural networks and training/inference/preprocessing methods. To that end, I am forking repositories with state-of-the-art architectures, improving interfaces, and adding ways to mix-and-match architectures and training/inference methods. My main background is in medical image analysis. Consequently, to expand horizons, I will be applying image analysis models to other computer vision tasks. I also have a great interest in exploring Natural Language Models, the cutting edge of AI.

The repository is fairly recent, started in March 2024. Currently, it can serve as code samples, with more thorough experimentation, model complexity, and MLOps to be added over the subsequent months.

This repo is in PyTorch. For TensorFlow, see: tensorflow-sandbox

Installation

Install the following prerequisites:

  1. Anaconda (>=2023)
  2. Version control and git (e.g. GitKraken)
  3. Use run_setup_all.py to install all or some (with command line arguments) of the environments required to run projects (see Repository Organization)
  4. In PyCharm settings, mark the submodules torch-control/utils and torch-control/nnUnet as Sources. For running from the command line, these paths are set up by run_setup_all.py, but PyCharm overrides the system settings
  5. MLFlow:
    1. Install git if not installed already and add it to PATH (GitKraken does not appear to add a callable git executable)
    2. Configure the mlflow runs directory by changing the environment variable e.g. MLFLOW_TRACKING_URI=D:\Models\mlruns via the control panel on Windows or .bashrc update on Linux.
    3. To display logs, navigate to the mlruns folder in the terminal and run: mlflow ui --port 8080
    4. Then access via browser: localhost:8080 (a minimal logging sketch is shown after this list)
  6. Docker:
    1. Either install Docker Desktop (which includes the dependencies above), or install Engine/CLI/Compose separately to avoid bloat: https://docs.docker.com/engine/
    2. To install Docker Desktop to a custom location, go to the download directory and run the following in the command prompt: start /w "" "Docker Desktop Installer.exe" install --installation-dir=D:\Docker
    3. Install gcc with apt-get on Linux, or MinGW on Windows (https://dev.to/gamegods3/how-to-install-gcc-in-windows-10-the-easier-way-422j). MinGW must be installed in the default location or it will be missing files
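
As a quick check of the MLflow setup in step 5, here is a minimal logging sketch. It assumes the MLFLOW_TRACKING_URI environment variable has been set as described above; the experiment, parameter, and metric names are placeholders.

import mlflow

# The tracking URI is read from MLFLOW_TRACKING_URI; it can also be set explicitly,
# e.g. mlflow.set_tracking_uri("file:///D:/Models/mlruns").
mlflow.set_experiment("installation-smoke-test")  # placeholder experiment name

with mlflow.start_run(run_name="setup-check"):
    mlflow.log_param("batch_size", 16)
    mlflow.log_metric("val_accuracy", 0.0)

# The run should then show up in the browser after starting: mlflow ui --port 8080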

Repository Organization

This repository is a top-level controller for running training/inference on a variety of forked AI repos to compare performance of architectures/training methods. The /projects folder contains entrypoints for experiments on the following topics:

  1. Computer vision:

    1. Use the cv environment created by run_setup_all.py
    2. Project Kanban
  2. Natural Language Processing:

    1. Use the nlp environment created by run_setup_all.py
    2. Project Kanban
  3. Machine Learning

    1. Use the ml environment created by run_setup_all.py
    2. Project Kanban

Utils is a submodule repository containing base-level library code for interacting with models, the OS, plotting, etc. It can be imported by other repositories such as tensorflow-sandbox, or forked by users separately from torch-control.

Available experiments

  1. [NLP] Movie Sentiment Analysis

    a) Fine-tuned DistilBERT on the Kaggle movie sentiment analysis dataset (a minimal fine-tuning sketch appears after this list)

    b) WIP: Fine-tune other common networks for comparison

    c) WIP: Evaluate freezing all but sentiment analysis-head layers.

    d) WIP: Combine with IMDB reviews dataset - compare training on one, validation on the other, then randomly split for cross-val.

  2. [CV] Blood Vessel Segmentation

    a) Use the following script to download and organize data

    b) Train nnUnet by calling "nnUNet/run_training.py 501 2d" (Dataset ID, architecture template)

  3. [ML] Time series forecasting:

    a) Use the following script for hyperparameter optimization and model fitting

    b) WIP: Add MLOps for model/hyperparameter/data tracking

    c) WIP: Further improve model hyperparameters and engineered features.

    d) WIP: Add cross validation with datasets from the other companies available.

  4. [MLOps] - MLflow, Docker, Cloud:

    a) Version data, training runs, models with MLFlow - implemented for ML and NLP

    b) Build docker container - implemented for ML

    c) TODO: Deploy to AWS
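
For the NLP experiment (1a), a minimal fine-tuning sketch using the Hugging Face Trainer is shown below. The toy in-memory dataset and column names are placeholders; the actual entrypoints live under projects/NaturalLanguageProcessing.

from datasets import Dataset
from transformers import (DistilBertForSequenceClassification, DistilBertTokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5)  # 5 sentiment classes in the Kaggle dataset

# Toy placeholder data; in the real experiment this comes from the Kaggle movie sentiment CSV.
train_data = Dataset.from_dict({"text": ["A thoroughly enjoyable film.", "A dull, lifeless mess."],
                                "label": [4, 0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-sentiment", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_data.map(tokenize, batched=True),
)
trainer.train()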

Operating Systems Notes

  1. This project was developed on Windows. It attempts to be OS agnostic - no hardcoded slashes, reliance on python tools instead of system tools - but testing primarily happens on Windows, so Linux patches will probably be needed.

  2. The project aims to work with both CPU-only and GPU-capable setups. It is tested on an RTX 3060 GPU with 12 GB of memory. GPU memory allocation is static. If you run into an out-of-memory issue, reduce the batch size.

Bug reporting/feature requests

Feel free to raise an issue or, indeed, contribute to solving one (!) on: https://github.com/bdzyubak/torch-control/issues

Testing Installation:

  1. Computer Vision:
    1. TODO - add integration test that does not require download of a large dataset.
  2. Natural Language Processing:
    1. projects/NaturalLanguageProcessing/LLMs_tutorials/distilbert_question_answering.py
  3. Machine Learning:
    1. projects/MachineLearning/semi_supervised_breast_cancer_classification/semi_supervised_svm.py

Testing and Release Process:

  1. Unit:
    1. Run pytest on the following folder: tests/unit.
    2. Test coverage is a WIP
  2. The master branch is a stable beta where unit tests should all pass and features are reference compatible after every merge. Due to the single user nature of this repo, currently a Release branch is not planned.


torch-control's Issues

Build docker container from mlflow and validate it

Make a pipeline to fetch an artifact from MLflow, build a docker container, and validate against the performance recorded on the same data to make sure containerization did not have unexpected consequences.
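
A hedged sketch of what the fetch-and-build part could look like, assuming the model was logged to MLflow as an artifact named "model"; the run ID and image name are placeholders, and the validation step is outlined in comments only.

import mlflow.models

run_id = "<run-id-from-the-mlflow-ui>"   # placeholder
model_uri = f"runs:/{run_id}/model"

# Build a serving image from the logged model (Python wrapper around `mlflow models build-docker`).
mlflow.models.build_docker(model_uri=model_uri, name="torch-control-model")

# Validation idea: run the container, post the recorded validation data to its /invocations
# endpoint, and compare the returned predictions/metrics to the values logged in MLflow.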

Implement production performance monitoring for power consumption model

Post launch, the model can be subject to data drift, affecting input data and relationships, or concept drift, affecting prediction based on similar data. Implement metrics for input data and model performance for the power consumption model. Start by implementing in an IDE environment; extend to docker service in a subsequent issue.
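
A minimal sketch of data-drift metrics, assuming pandas DataFrames with the training-time and production feature values; the column selection is a placeholder.

import pandas as pd
from scipy.stats import ks_2samp

def feature_drift_report(train_df: pd.DataFrame, prod_df: pd.DataFrame) -> pd.DataFrame:
    # Per-feature Kolmogorov-Smirnov test; small p-values suggest the input distribution has drifted.
    rows = []
    for col in train_df.select_dtypes("number").columns:
        stat, p_value = ks_2samp(train_df[col].dropna(), prod_df[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value})
    return pd.DataFrame(rows)

# Concept drift would additionally need a rolling error metric (e.g. rolling RMSE) on production
# predictions versus eventual ground truth, compared to the validation error logged at training time.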

Add ability to run loading/verification in non-parallel mode

I run into an issue where data loading crashes for no apparent reason in both of the following cases.

nnunetv2/experiment_planning/plan_and_preprocess_api.py->
verify_dataset_integrity(join(nnUNet_raw, dataset_name), num_processes)

nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py
next(self.dataloader_train)

Until that can be addressed more directly, need a feature to turn off parallelization for loading.

Implement Image Classification on dermaMNIST

As I have mostly done image segmentation in the past, let's do a classification project!

dermaMNIST is a publicly available dataset of small RGB images of skin tumors with multi-class disease labels. Unlike the handwritten digit dataset (MNIST), which is easy, the benchmark accuracy published in Nature with resnet50 is only 0.73. https://www.nature.com/articles/s41597-022-01721-8/tables/4

Let's see if we can do better.
pip install medmnist
from medmnist import DermaMNIST
train_dataset = DermaMNIST(split="train", download=True)  # assumed medmnist API: split/download kwargs

Movie review sentiment analysis

Fine-tune Distilbert for movie sentiment review on the following dataset:
https://www.kaggle.com/competitions/sentiment-analysis-on-movie-reviews/data

The IMDB dataset is also interesting for sentiment review. Potentially, implement as a separate experiment and then cross validate training on one or both. Implement the other common networks and compare performance. For those that come with out-of-the-box sentiment analysis, evaluate performance without fine-tuning.

Add MLOps

The number of models, datasets, and experiments is getting substantial enough that there is potential for log and artifact confusion. Need to add MLOps to associate data-config-model-results combos. Going to use MLflow due to its popularity and free availability.

Implement semi-supervised learning using Decision Tree

The efficacy would be interesting to compare to the existing SVM implementation. #11 The DT must have limited depth so that the output prediction confidence is <1. Ideally, a more difficult dataset with more features and accuracy of <0.8 would be used to explore this.
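
A minimal sketch using scikit-learn's SelfTrainingClassifier with a depth-limited decision tree; the breast cancer dataset and the 70% unlabeled fraction are stand-ins for illustration (-1 marks unlabeled samples, per the sklearn convention).

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hide most labels to simulate a mostly-unlabeled dataset.
rng = np.random.default_rng(0)
y_semi = y.copy()
y_semi[rng.random(len(y)) < 0.7] = -1

# Limiting depth keeps predicted probabilities below 1.0, so the confidence threshold is meaningful.
base_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
model = SelfTrainingClassifier(base_tree, threshold=0.9).fit(X, y_semi)
print(model.score(X, y))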

Drop augmented datapoints with variable labels

In issue #14, the dataset came with heavy resampling, where "A series that was amazing" and "A series that was horrible" would each be resampled down to as little as one letter, with labels being inherited. This causes "A" and "A series" to have highly variable labels, which would get in the way of training.

Enhancement: Implement code that runs on augmented data, finds datapoints with the same inputs and variable labels, and drops them. If we augmented the data ourselves and saved it to disk (vs augmenting on the fly), we need to keep track of augmented versus real cases. The implemented code could use the boolean parameter indicating whether the data was augmented to mask on only the generated cases. In the specific sentiment task with pregenerated data, we can detect augmented strings as being fully contained in other strings.

Experiment: See if removing augmented datasets with variable labels improves validation performance.
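
A rough pandas sketch of the enhancement described above, assuming a DataFrame with placeholder columns text, label, and is_augmented.

import pandas as pd

def drop_conflicting_augmented(df: pd.DataFrame) -> pd.DataFrame:
    # Inputs that appear with more than one distinct label.
    label_counts = df.groupby("text")["label"].nunique()
    conflicting = set(label_counts[label_counts > 1].index)
    # Drop only the augmented rows among the conflicting inputs; keep the real cases.
    mask = df["text"].isin(conflicting) & df["is_augmented"]
    return df[~mask]

# With pregenerated augmentation, is_augmented can be derived by flagging phrases that are
# fully contained in another, longer phrase in the dataset.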

Implement customizable classification head for LLM fine tuning

Distilbert fine tuning with the default classification head resulted in underfitting (~0.6 train accuracy) compared to tuning all parameters (~0.8 train accuracy, but 0.6 val accuracy). Larger classification heads may yield a better fit. It is also worth trying a few classification head structures.

  1. Identify which classification head structures may be helpful
  2. Implement code to replace the default untrained classification head with a head that has a variable number of parameters and a choice of architecture
  3. Ensure the layer freezing code in the model definition handles freezing all other layers but keeping the new head trainable
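
A minimal sketch of items 2 and 3 for DistilBERT, assuming the Hugging Face DistilBertForSequenceClassification layout; the head sizes and dropout are placeholders.

import torch.nn as nn
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=5)

# 2. Replace the default single-linear classifier with a larger head. The forward pass only
#    calls self.classifier on the pooled output, so a Sequential drop-in works.
hidden = model.config.dim  # 768 for distilbert-base
model.classifier = nn.Sequential(
    nn.Linear(hidden, 256), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(256, model.config.num_labels))

# 3. Freeze the pretrained transformer; only the (new) head and pre_classifier stay trainable.
for param in model.distilbert.parameters():
    param.requires_grad = False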

Add the IMDB dataset to the sentiment analysis task data

The IMDB is a popular dataset of movie reviews which contains a review and a positive/negative sentiment. https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

As a first experiment, evaluate whether models trained on the original sentiment analysis dataset (https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data) can predict IMDB sentiment. Secondarily, evaluate pooling the data together and training/validating on both datasets. This will require quantizing the original dataset's 5-point reviews to positive/negative, as I don't have the data labeling budget to reliably expand the IMDB labels. Neutral reviews may have to be dropped.

I would expect the models trained on the original sentiment dataset to generalize poorly due to the data-augmentation in it - see comment in the original issue: #14. Training on both datasets should improve generalizability but may underfit due to the variability in training data labeling and the augmentation.
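
A rough sketch of the label quantization and pooling, assuming the Kaggle dataset's 0-4 Sentiment column (2 = neutral) and the IMDB CSV's positive/negative sentiment column; file paths and column names are assumptions based on the public datasets.

import pandas as pd

kaggle = pd.read_csv("train.tsv", sep="\t")     # assumed columns: Phrase, Sentiment (0-4)
imdb = pd.read_csv("IMDB Dataset.csv")          # assumed columns: review, sentiment (positive/negative)

# Quantize the 0-4 ratings to binary sentiment and drop the neutral class (2).
kaggle = kaggle[kaggle["Sentiment"] != 2].copy()
kaggle["label"] = (kaggle["Sentiment"] > 2).astype(int)
imdb["label"] = (imdb["sentiment"] == "positive").astype(int)

pooled = pd.concat(
    [kaggle.rename(columns={"Phrase": "text"})[["text", "label"]],
     imdb.rename(columns={"review": "text"})[["text", "label"]]],
    ignore_index=True)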

Implement use of formal parameter search to log neatly with MLFlow

When a formal parameter search like sklearn.GridSearchCV() is used, mlflow will log the individual runs with hyperparameter variations as children of the parent experiment. This is much neater, especially when using mlflow autologging.

Implement in projects\MachineLearning\energy_use_time_series_forecasting\time_series_forecasting_energy_use.py, and test with CV/LLM when it becomes relevant.
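
A minimal sketch of the intended pattern, using a generic sklearn regressor as a stand-in for the energy-use model; with autologging enabled, the GridSearchCV fit is logged as a parent run with nested child runs for the best hyperparameter combinations.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

mlflow.sklearn.autolog()

X, y = load_diabetes(return_X_y=True)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}

with mlflow.start_run(run_name="grid-search-parent"):
    search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
    search.fit(X, y)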

Set up nnUnet

nnUnet is a state-of-the-art image segmentation architecture that automatically configures hyperparameters to the GPU memory and dataset properties. I was able to obtain a major out-of-the-box improvement in organ segmentation accuracy compared to established Unet-based models at one of the companies I worked for. So, it makes sense to have the nnUnet model available in the toolchain to compare to other state-of-the-art models within medical imaging, and in other fields.

Paper:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.

Fine-tune most popular LLMs for movie sentiment analysis

Fine tune 3 (or more) popular models and compare performance to DistilBERT for the movie sentiment analysis task.

Some choices:
GPT-3
LaMDA
Turing-NLG
XGen
Llama 2 (7 billion)
Gemini

Pick based on suitability for sentiment analysis task, popularity and affordability of tuning.

Implement sentiment chatbot using several public models

Compare the sentiment prediction of 5 top models on the HuggingFace Model Hub. Use custom-generated tricky phrases and see if any models stand out. This is a precursor experiment to gain familiarity with model interactions and the Hub. The resulting code should be clean and readable and may eventually serve as a tutorial for others.
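
A minimal sketch of the comparison using the transformers pipeline API; the model list and test phrases below are placeholders, not the final top-5 selection.

from transformers import pipeline

# Example Hub models; the actual list would be picked from the Model Hub by popularity.
model_names = [
    "distilbert-base-uncased-finetuned-sst-2-english",
    "nlptown/bert-base-multilingual-uncased-sentiment",
]

tricky_phrases = ["Not bad at all.", "I expected it to be terrible, and it delivered."]

for name in model_names:
    classifier = pipeline("sentiment-analysis", model=name)
    for phrase, prediction in zip(tricky_phrases, classifier(tricky_phrases)):
        print(f"{name} | {phrase} -> {prediction}")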

Evaluate fine tuning only the sentiment head of sentiment analysis models

LLM models mostly consist of pretrained language interpretation weights. A small head consisting of a couple of layers is used to customize the model for a given task - sentiment analysis, sentence completion, information extraction. If all weights are tuned, there is a substantial risk of overfitting on the small amount of training data and losing the ability of the model to interpret language generally. Implement code to freeze all layers except the initialized detector head and only fine-tune the head.

Start with DistilBERT where I have already fine tuned all layers (and did see substantial overfitting after the first epoch). Extend to the other models that will be implemented in #15.

Improve power consumption prediction model

The initial model had trouble predicting extremes.
projects\MachineLearning\energy_use_time_series_forecasting\time_series_forecasting_energy_use.py

  1. Improve prediction at extremes.
  2. Add other metrics for better human readability.
    No specific RMSE target exists at this time, so just report results.
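
A small sketch of additional, more readable metrics, including error on the extreme deciles where the initial model struggled; the decile cutoffs are placeholders.

import numpy as np

def report_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    mape = float(np.mean(np.abs(err) / np.abs(y_true)) * 100)
    # Error at the extremes: bottom and top 10% of true consumption values.
    lo, hi = np.quantile(y_true, [0.1, 0.9])
    extreme = (y_true <= lo) | (y_true >= hi)
    rmse_extremes = float(np.sqrt(np.mean(err[extreme] ** 2)))
    return {"rmse": rmse, "mae": mae, "mape_percent": mape, "rmse_extremes": rmse_extremes}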

Error importing Bert from transformers

The following code that would be used to fine tune the Google Bert model gives an error:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

cannot import name 'COMMON_SAFE_ASCII_CHARACTERS' from 'charset_normalizer.constant'

nnUnet Training Crashes on Random Batches

Training crashes during the first epoch, validation, or second epoch with an unclear message - worker is no longer alive. Batch sampling is random, so this is tricky to reproduce. Further, data loading is parallelized, so a debug breakpoint cannot be used to see what is wrong with the data. The crash happens at this stage which could be either data loading or training itself:

nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py
self.train_step(next(self.dataloader_train))

Resolve and add tools to debug this in the future.

Run iterative Semi-Supervised labeling experiment

Use or simulate a dataset with only 30% of the data labeled. Run iterative semi-supervised labeling by the model to illustrate (expected) gain in performance vs training on only the original 30% of the labels.

Add oversampling/augmentation to mitigate class imbalance

The dermaMNIST dataset has a substantial class imbalance issue. This causes poor generalization - 0.75 accuracy in validation with >0.95 in training. Code needs to be added to upsample low-data classes, and augment training data to improve generalizability.

Related to the dermaMNIST accuracy. Once done, post performance results with improved validation accuracy here.
#20
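
A minimal PyTorch sketch of class-balanced oversampling for the dermaMNIST loader, assuming the medmnist dataset exposes integer labels via .labels; augmentation transforms would be added to the dataset separately.

import numpy as np
import torch
from medmnist import DermaMNIST
from torch.utils.data import DataLoader, WeightedRandomSampler

train_dataset = DermaMNIST(split="train", download=True)  # a torchvision-style transform would be added here
labels = np.asarray(train_dataset.labels).flatten()

# Inverse-frequency weights: rare classes are sampled proportionally more often.
class_counts = np.bincount(labels)
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(weights=torch.as_tensor(sample_weights, dtype=torch.double),
                                num_samples=len(labels), replacement=True)
train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)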

Shared dependencies installed to conda base

Should instead be installed to each target environment. The issue is that the following command is not working correctly, at least on Windows:
conda run -n [env_name] pip install [dependency]

Set nnUnet default num of preprocessing to 1

Preprocessing crashes when num_preprocessing is not set to 1. Suspect an issue with system configuration on Windows, as I have not seen these crashes on Unix. Preprocessing only happens once, so speeding it up is not a high priority. The workaround is acceptable.
