bdzyubak / torch-control

A top-level repo for evaluating natively available models

License: MIT License

Python 100.00%

torch-control's Issues

Evaluate fine tuning only the sentiment head of sentiment analysis models

LLMs consist mostly of pretrained language-interpretation weights. A small head of a couple of layers customizes the model for a given task: sentiment analysis, sentence completion, information extraction. If all weights are tuned, there is a substantial risk of overfitting on the small amount of training data and losing the model's ability to interpret language generally. Implement code to freeze all layers except the initialized head and fine-tune only the head.

Start with DistilBERT where I have already fine tuned all layers (and did see substantial overfitting after the first epoch). Extend to the other models that will be implemented in #15.
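A minimal sketch of the freezing step, assuming a PyTorch model whose head parameters can be picked out by name prefix. The toy `TinyClassifier` and the `head_prefixes` argument are hypothetical stand-ins; for Hugging Face's `DistilBertForSequenceClassification` the head submodules are named `pre_classifier` and `classifier`, so those prefixes would be passed instead.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a pretrained encoder plus a small task head."""

    def __init__(self, vocab_size=100, hidden=16, num_labels=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, hidden))
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, token_ids):
        # Mean-pool the token representations, then classify.
        return self.classifier(self.encoder(token_ids).mean(dim=1))


def freeze_all_but_head(model, head_prefixes=("classifier",)):
    """Disable gradients for every parameter whose name does not start with a head prefix."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(tuple(head_prefixes))
        if param.requires_grad:
            trainable.append(name)
    return trainable


model = TinyClassifier()
trainable_names = freeze_all_but_head(model)
# Hand only the still-trainable parameters to the optimizer.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```

Returning the list of trainable names makes it easy to log or assert that only the head is being updated before a long training run starts.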

Implement semi-supervised learning using Decision Tree

It would be interesting to compare the efficacy to the existing SVM implementation (#11). The decision tree must have limited depth so that the predicted confidence can be <1. Ideally, a more difficult dataset with more features and an accuracy of <0.8 would be used to explore this.
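A rough sketch of what this could look like with scikit-learn's `SelfTrainingClassifier` wrapped around a depth-limited `DecisionTreeClassifier`. The synthetic dataset, the 70% of labels hidden, the depth of 3, and the 0.9 confidence threshold are all illustrative assumptions, not values from the existing SVM experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Hide most of the training labels; scikit-learn marks unlabeled samples with -1.
rng = np.random.default_rng(0)
y_partial = y_train.copy()
y_partial[rng.random(len(y_train)) < 0.7] = -1

# A shallow tree with a minimum leaf size tends to leave impure leaves,
# so predict_proba can fall below 1 and the confidence threshold is meaningful.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0)
model = SelfTrainingClassifier(tree, threshold=0.9, max_iter=10)
model.fit(X_train, y_partial)

accuracy = model.score(X_test, y_test)
confidence = model.predict_proba(X_test).max(axis=1)
```

An unrestricted tree would grow until every leaf is pure, report a confidence of 1 for every sample, and pseudo-label everything in the first iteration, which is why the depth limit matters here.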

Drop augmented datapoints with variable labels

In issue #14, the dataset came heavily resampled: "A series that was amazing" and "A series that was horrible" would each be cut down to as little as one letter, with the label inherited from the full phrase. As a result, fragments such as "A" and "A series" have highly variable labels, which would get in the way of training.

Enhancement: Implement code that runs on augmented data, finds datapoints with the same inputs and variable labels, and drops them. If we augmented the data ourselves and saved it to disk (vs augmenting on the fly), we need to keep track of augmented versus real cases. The implemented code could use the boolean parameter indicating whether the data was augmented to mask on only the generated cases. In the specific sentiment task with pregenerated data, we can detect augmented strings as being fully contained in other strings.

Experiment: See if removing augmented datapoints with variable labels improves validation performance.
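A minimal sketch of the enhancement, assuming each datapoint is a dict with `text`, `label`, and a boolean `augmented` field (these field names are hypothetical). The second helper illustrates the substring heuristic for pregenerated data where no such flag was saved.

```python
from collections import defaultdict


def drop_conflicting_augmented(rows):
    """Drop augmented rows whose text appears with more than one distinct label.

    Real (non-augmented) rows are always kept.
    """
    labels_by_text = defaultdict(set)
    for row in rows:
        labels_by_text[row["text"]].add(row["label"])

    return [
        row for row in rows
        if not (row["augmented"] and len(labels_by_text[row["text"]]) > 1)
    ]


def flag_substring_augmented(texts):
    """Heuristic: a string fully contained in a different string is treated as augmented."""
    unique = set(texts)
    return [any(t != other and t in other for other in unique) for t in texts]


rows = [
    {"text": "A series that was amazing", "label": 4, "augmented": False},
    {"text": "A series that was horrible", "label": 0, "augmented": False},
    {"text": "A series", "label": 4, "augmented": True},
    {"text": "A series", "label": 0, "augmented": True},
    {"text": "amazing", "label": 4, "augmented": True},
]
cleaned = drop_conflicting_augmented(rows)
flags = flag_substring_augmented([row["text"] for row in rows])
```

The substring check is quadratic in the number of unique strings, so on a large dataset it would likely need an index or a per-sentence grouping instead of the brute-force comparison shown here.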

Movie review sentiment analysis

Fine-tune DistilBERT for movie review sentiment analysis on the following dataset:
https://www.kaggle.com/competitions/sentiment-analysis-on-movie-reviews/data

The IMDB dataset is also interesting for sentiment review. Potentially, implement as a separate experiment and then cross validate training on one or both. Implement the other common networks and compare performance. For those that come with out-of-the-box sentiment analysis, evaluate performance without fine-tuning.

Implement use of formal parameter search to log neatly with MLFlow

When a formal parameter search such as sklearn.model_selection.GridSearchCV is used, MLflow logs the individual runs with hyperparameter variations as children of the parent run. This is much neater, especially when using MLflow autologging.

Implement in projects\MachineLearning\energy_use_time_series_forecasting\time_series_forecasting_energy_use.py, and test with CV/LLM when it becomes relevant.

Add the IMDB dataset to the sentiment analysis task data

IMDB is a popular dataset of movie reviews, each paired with a positive/negative sentiment label. https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

As a first experiment, evaluate whether models trained on the original sentiment analysis dataset (https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data) can predict IMDB sentiment. Secondarily, evaluate pooling the data and training/validating on both datasets. This will require collapsing the original dataset's 5-point labels to positive/negative, as I don't have the data labeling budget to reliably expand the IMDB labels to five points. Neutral reviews may have to be dropped.
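A small sketch of the label collapsing step. The original dataset scores phrases from 0 (negative) to 4 (positive) with 2 as neutral; treating the "somewhat" classes 1 and 3 as plain negative and positive is an assumption that may be worth revisiting, and the function names are hypothetical.

```python
def binarize_sentiment(label):
    """Map the 5-point scale (0=negative ... 4=positive) to 0/1, or None for neutral."""
    if label in (0, 1):
        return 0
    if label in (3, 4):
        return 1
    return None  # neutral (2) has no counterpart in the binary IMDB labels


def to_binary_dataset(examples):
    """Keep (text, binary_label) pairs and drop neutral reviews."""
    out = []
    for text, label in examples:
        binary = binarize_sentiment(label)
        if binary is not None:
            out.append((text, binary))
    return out


examples = [("awful", 0), ("a bit dull", 1), ("it exists", 2), ("quite fun", 3), ("superb", 4)]
binary_examples = to_binary_dataset(examples)
```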

I would expect the models trained on the original sentiment dataset to generalize poorly due to the data-augmentation in it - see comment in the original issue: #14. Training on both datasets should improve generalizability but may underfit due to the variability in training data labeling and the augmentation.

Implement sentiment chatbot using several public models

Compare the sentiment prediction of 5 top models on the HuggingFace Model Hub. Use custom-generated tricky phrases and see if any models stand out. This is a precursor experiment to gain familiarity with model interactions and the Hub. The resulting code should be clean and readable and may eventually serve as a tutorial for others.

Upload Docker container to AWS as demo

Fine tune the LLM model to give answers prefaced with "Bogdan says: " and upload to an AWS instance as a demo.

Add ability to run loading/verification in non-parallel mode

I ran into an issue where data loading crashes for no apparent reason in both of the following cases.

nnunetv2/experiment_planning/plan_and_preprocess_api.py->
verify_dataset_integrity(join(nnUNet_raw, dataset_name), num_processes)

nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py
next(self.dataloader_train)

Until that can be addressed more directly, a feature is needed to turn off parallelization for loading.
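The general shape of such a switch, as a sketch: run the per-item work in-process when the process count is 1, and only spin up a pool otherwise. `run_over_items` and `check_case` are hypothetical helpers, not part of nnU-Net; the real change would have to be wired in wherever nnU-Net creates its worker pool.

```python
from multiprocessing import Pool


def run_over_items(func, items, num_processes=1):
    """Apply func to each item, staying in the current process when num_processes <= 1."""
    if num_processes <= 1:
        # Serial path: exceptions carry the real traceback and breakpoints work.
        return [func(item) for item in items]
    with Pool(num_processes) as pool:
        return pool.map(func, items)


def check_case(case_id):
    """Hypothetical per-case integrity check."""
    if not case_id:
        raise ValueError("empty case identifier")
    return case_id.upper()


if __name__ == "__main__":
    print(run_over_items(check_case, ["case_001", "case_002"], num_processes=1))
```

With the serial path, a failing case raises directly in the main process instead of surfacing as a dead worker, which is the main point of the feature.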

Run iterative Semi-Supervised labeling experiment

Use or simulate a dataset with only 30% of the data labeled. Run iterative semi-supervised labeling by the model to illustrate (expected) gain in performance vs training on only the original 30% of the labels.

nnUnet Training Crashes on Random Batches

Training crashes during the first epoch, validation, or second epoch with an unclear message - worker is no longer alive. Batch sampling is random, so this is tricky to reproduce. Further, data loading is parallelized, so a debug breakpoint cannot be used to see what is wrong with the data. The crash happens at this stage which could be either data loading or training itself:

nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py
self.train_step(next(self.dataloader_train))

Resolve and add tools to debug this in the future.

Shared dependencies installed to conda base

Shared dependencies should instead be installed to each target environment. The issue is that the following command is not working correctly, at least on Windows:
conda run -n [env_name] pip install [dependency]
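One variant that may be worth testing (not verified here to fix the Windows behavior) is to invoke pip through the target environment's own interpreter with `python -m pip`, so that whichever `pip` happens to be first on the PATH is not used. The helper names below are hypothetical.

```python
import subprocess


def build_pip_install_command(env_name, dependency):
    """Build a conda command that runs pip via the target environment's Python."""
    return ["conda", "run", "-n", env_name, "python", "-m", "pip", "install", dependency]


def install_into_env(env_name, dependency):
    """Run the install and raise CalledProcessError if conda or pip fails."""
    subprocess.run(build_pip_install_command(env_name, dependency), check=True)


command = build_pip_install_command("my_env", "numpy")
```

Passing the command as a list avoids shell quoting differences between Windows and Unix.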

Improve power consumption prediction model

The initial model had trouble predicting extremes.
projects\MachineLearning\energy_use_time_series_forecasting\time_series_forecasting_energy_use.py

Improve prediction at extremes.
Add other metrics for better human readability.
No specific RMSE target exists at this time, so just report results.
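As a sketch of the "other metrics" item: MAE is in the same units as the target, and MAPE expresses the error as a percentage, both of which tend to be easier to read than RMSE alone. The function name and the sample numbers are illustrative only.

```python
import math


def regression_report(y_true, y_pred):
    """Return RMSE and MAE (same units as the target) and MAPE (percent)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined for zero targets, so those points are skipped.
    pct = [abs(e) / abs(t) for t, e in zip(y_true, errors) if t != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")
    return {"rmse": rmse, "mae": mae, "mape_percent": mape}


report = regression_report([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
```

Because RMSE squares the errors, it is dominated by the extremes the model struggles with; reporting MAE next to it shows how much of the error comes from those extremes.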

Set nnUnet default num of preprocessing to 1

Preprocessing crashes with num_preprocessing not set to 1. Suspect an issue with system configuration on Windows as I have not seen these crashes on Unix. Preprocessing only happens once, so speeding it up is not high priority. Workaround is acceptable.

Detect bounding boxes containing text in images

Typical OCR tools like TrOCR work on textlines, which need to be extracted first. This could be done algorithmically or via a model.

Goal: Develop textline extraction.

Method: Fine tune ViT or Yolo on task 1 labels of SROIE which include bounding box coordinates.

Deliverable: 90% accuracy for predicting bounding boxes within 10% rmse.

Fine-tune most popular LLMs for movie sentiment analysis

Fine tune 3 (or more) popular models and compare performance to DistilBERT for the movie sentiment analysis task.

Some choices:
GPT-3
LaMDA
Turing-NLG
XGen
Llama 2 (7 billion)
Gemini

Pick based on suitability for sentiment analysis task, popularity and affordability of tuning.

Implement customizable classification head for LLM fine tuning

DistilBERT fine tuning with the default classification head resulted in underfitting (~0.6 train accuracy) compared to tuning all parameters (~0.8 train accuracy, but 0.6 val accuracy). Larger classification heads may yield a better fit. It is also worth trying a few classification head structures.

Identify which classification head structures may be helpful
Implement code to replace the default untrained classification head with a head that has a variable number of parameters and a choice of architecture
Ensure the layer freezing code in the model definition handles freezing all other layers but keeping the new head trainable
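A possible starting point for the second task: a small factory that builds an MLP head with a configurable number and width of hidden layers. `build_classification_head` is a hypothetical helper, and the layer sizes are illustrative. Swapping it into a model (for example by assigning it to the model's `classifier` attribute) depends on how that model names and calls its head, so that part is left as an assumption.

```python
import torch
from torch import nn


def build_classification_head(in_features, num_labels, hidden_sizes=(), dropout=0.1):
    """Build an MLP head; an empty hidden_sizes gives a single linear layer."""
    layers = []
    width = in_features
    for hidden in hidden_sizes:
        layers += [nn.Linear(width, hidden), nn.ReLU(), nn.Dropout(dropout)]
        width = hidden
    layers.append(nn.Linear(width, num_labels))
    return nn.Sequential(*layers)


head = build_classification_head(768, num_labels=5, hidden_sizes=(512, 128))
logits = head(torch.randn(4, 768))  # batch of 4 pooled sentence embeddings
num_params = sum(p.numel() for p in head.parameters())
```

Counting the parameters per configuration gives a simple axis for comparing head sizes against train and validation accuracy.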

Summarize specific values from extracted text

OCR tools will extract text line by line with a typical problem - e.g. receipts or clinical workflow - having much unnecessary text.

Goal: Set up a tool to summarize important values e.g. patient name or store name from text chunks that do not have clear field identifiers.

Method: A large free summarization model. Try ChatGPT and others that can be fine tuned. Fine-tuning itself would be a separate issue.

Data: SROIE Receipt dataset. Ideally, include test cases which do/don't have a clear label for the target field. Leave negative cases (e.g. no store/total in image) for a separate issue. OK to use a subset of easy images at this stage.

Deliverable: Code that accurately summarizes store name and total amount with <10% error in 90% of images.

Build pipeline to extract structured text from images

The goal of this project is to get a pipeline which is able to extract desired fields from an image practicing LLM fine-tuning/prompt engineering.

Approach: Use image OCR (optical character recognition) to extract unstructured text. Use LLM to summarize desired fields from unstructured text.

Dataset: Found a receipts dataset on Kaggle (https://www.kaggle.com/datasets/trainingdatapro/ocr-receipts-text-detection) which has the desired characteristics:

Some fields that are nearly always present e.g. store
This field may sometimes be absent, which needs to be handled by the pipeline
There are many fields with varying frequency of occurrence, so the list of fields to look for can be varied to produce problems of different complexities.

Add oversampling/augmentation to mitigate class imbalance

The dermaMNIST dataset has a substantial class imbalance issue. This causes poor generalization - 0.75 accuracy in validation with >0.95 in training. Code needs to be added to upsample low-data classes, and augment training data to improve generalizability.

Related to the dermaMNIST accuracy. Once done, post performance results with improved validation accuracy here.
#20
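A minimal sketch of random oversampling, assuming the labels are available as a flat list of class ids. It returns indices rather than data so it can feed a sampler or subset wrapper; the function name and toy class counts are hypothetical.

```python
import random
from collections import Counter


def oversample_indices(labels, seed=0):
    """Return indices that repeat minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(idxs) for idxs in by_class.values())

    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        # Draw the shortfall with replacement from the same class.
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    rng.shuffle(balanced)
    return balanced


labels = [0] * 8 + [1] * 3 + [2] * 1
indices = oversample_indices(labels)
counts = Counter(labels[i] for i in indices)
```

This should be applied to the training split only, and paired with augmentation so the repeated minority samples are not exact copies; oversampling validation data would distort the reported accuracy.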

Evaluate - Fine tuning the entire LLM network vs default classifier head vs bigger head

The typical approach is to freeze all layers except for a small default head, e.g. 768 channels for DistilBERT. For this issue, test how training the full network (assuming it fits in memory) compares to training only the default head and to training a larger custom head.

Metrics:
validation accuracy
training accuracy
training time

Adapt GPU memory target to system memory

Currently, the target GPU memory footprint is hardcoded and needs to be overridden via inputs.

Add feature to adapt this to system memory automatically.
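One way this could be sketched: query the device's total memory through PyTorch and derive the target from it. The 0.9 fraction and 0.5 GB reserve are illustrative assumptions, and both function names are hypothetical; how the resulting value is passed into the planner is not covered here.

```python
def gpu_memory_target_gb(total_bytes, fraction=0.9, reserve_gb=0.5):
    """Derive a target footprint from device memory, leaving headroom for other processes."""
    total_gb = total_bytes / 1024**3
    return max(0.0, total_gb * fraction - reserve_gb)


def detect_total_gpu_bytes(device_index=0):
    """Return the total memory of a CUDA device, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device_index).total_memory


total = detect_total_gpu_bytes()
# Fall back to the existing hardcoded or user-supplied value when no GPU is detected.
target_gb = gpu_memory_target_gb(total) if total is not None else None
```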

Add MLOps

The number of models, datasets, and experiments is getting substantial enough that there is potential for log and artifact confusion. Need to add MLOps to associate data-config-model-results combos. Going to use MLFlow due to popularity and free nature.

Build docker container from mlflow and validate it

Make a pipeline to fetch an artifact from MLflow, build a docker container, and validate it against the recorded performance on the same data to make sure containerization did not have unexpected consequences.

Implement Image Classification on dermaMNIST

As I have mostly done image segmentation in the past, let's do a classification project!

dermaMNIST is a publicly available dataset of small RGB images of skin tumors with multi-class disease labels. Unlike the handwritten digit dataset, which is easy, the benchmark accuracy published in Scientific Data (a Nature Portfolio journal) with resnet50 is only 0.73. https://www.nature.com/articles/s41597-022-01721-8/tables/4

Let's see if we can do better.
pip install medmnist
from medmnist import DermaMNIST

Implement production performance monitoring for power consumption model

Post launch, the model can be subject to data drift, affecting input data and relationships, or concept drift, affecting prediction based on similar data. Implement metrics for input data and model performance for the power consumption model. Start by implementing in an IDE environment; extend to docker service in a subsequent issue.
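For the input-data side, one commonly used drift metric is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against a reference sample. The sketch below uses equal-width bins over the reference range; the bin count, the smoothing constant, and the synthetic data are illustrative assumptions.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two samples of one feature, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def histogram(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_frac, cur_frac = histogram(reference), histogram(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


reference = [float(x % 100) for x in range(1000)]  # stand-in for training-time inputs
shifted = [v + 30.0 for v in reference]            # stand-in for drifted production inputs
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift. Model performance would be tracked separately, for example as a rolling error over periods where the true consumption has become available.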

Error importing Bert from transformers

The following code, which would be used to fine-tune the Google BERT model, gives an error:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

cannot import name 'COMMON_SAFE_ASCII_CHARACTERS' from 'charset_normalizer.constant

Set up nnUnet

nnUnet is a state-of-the-art image segmentation architecture that automatically configures hyperparameters to the GPU memory and dataset properties. I was able to obtain a major out-of-the-box improvement in organ segmentation accuracy compared to established Unet-based models at one of the companies I worked for. So, it makes sense to have the nnUnet model available in the toolchain to compare to other state-of-the-art models within medical imaging, and in other fields.

Paper:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.

Implement time series power consumption regression analysis

There is an interesting dataset of hourly power consumption across multiple utilities available on Kaggle. A baseline model fails to fit the extremes of power consumption. Implement and improve the tutorial. Then, in separate issues, improve performance.
https://www.kaggle.com/code/robikscube/time-series-forecasting-with-machine-learning-yt/notebook
