
Simple Transparent End-To-End Automated Machine Learning Pipeline for Supervised Learning in Tabular Binary Classification Data

Home Page: https://urbslab.github.io/STREAMLINE/

License: GNU General Public License v3.0

Python 3.90% Shell 0.01% Jupyter Notebook 96.09%
automl-pipeline binary-classification data-science data-visualization feature-selection imputation machine-learning model-application statistical-analysis supervised-learning

streamline's Introduction


Overview

STREAMLINE is an end-to-end automated machine learning (AutoML) pipeline that empowers anyone to easily train, interpret, and apply a variety of predictive models as part of a rigorous and optionally customizable data mining analysis. It is programmed in Python 3 using many common libraries including Pandas and scikit-learn.

The schematic below summarizes the automated STREAMLINE analysis pipeline with individual elements organized into 9 phases.

(Schematic of the 9-phase STREAMLINE analysis pipeline)

  • Detailed documentation of STREAMLINE is available here.

  • A simple demonstration of STREAMLINE on example biomedical data is available in our ready-to-run Google Colab Notebook here.

  • A video tutorial playlist covering all aspects of STREAMLINE is available here.

YouTube Overview of STREAMLINE


Pipeline Design

The goal of STREAMLINE is to provide an easy and transparent framework to reliably learn predictive associations from tabular data, with a particular focus on the needs of biomedical data applications. The design of this pipeline is meant not only to pick the best-performing algorithm/model for a given dataset, but also to leverage the different algorithm perspectives (i.e. biases, strengths, and weaknesses) to gain a broader understanding of the associations in that data.

The overall development of this pipeline focused on:

  1. Automation and ease of use
  2. Optimizing modeling performance
  3. Capturing complex associations in data (e.g. feature interactions)
  4. Enhancing interpretability of output throughout the analysis
  5. Avoiding and detecting common sources of bias
  6. Reproducibility (see STREAMLINE parameter settings)
  7. Run mode flexibility (accommodates users with different levels of expertise)
  8. Extensibility (more advanced users can easily add their own scikit-learn-compatible modeling algorithms to STREAMLINE)

See the About (FAQs) to gain a deeper understanding of STREAMLINE with respect to its overall design, what it includes, what it can be used for, and implementation highlights that differentiate it from other AutoML tools.
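As a loose illustration of this multi-algorithm philosophy (this is not STREAMLINE's implementation; the dataset, model choices, and metric below are all illustrative assumptions), a few scikit-learn classifiers can be compared on the same data to contrast their perspectives:

```python
# Illustrative sketch only: comparing several scikit-learn classifiers on one
# dataset to see how different algorithm biases perform on the same problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for a real tabular dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
}

# Cross-validated AUC per algorithm; disagreements between models can hint at
# linear vs. non-linear structure in the data.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

STREAMLINE automates and greatly extends this kind of comparison (imputation, feature selection, tuning, statistical comparison, etc.); the sketch only conveys the underlying idea of examining multiple algorithm perspectives.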

Current Limitations

  • At present, STREAMLINE is limited to supervised learning on tabular, binary classification data. We are currently expanding STREAMLINE to multi-class and regression outcome data.

  • STREAMLINE also does not automate feature extraction from unstructured data (e.g. text, images, video, time-series data), or handle more advanced aspects of data cleaning or feature engineering that would likely require domain expertise for a given dataset.

  • As STREAMLINE is currently in its 'beta' release, we recommend users first check that they have downloaded the most recent release of STREAMLINE before use. We are actively updating this software as feedback is received.

Publications and Citations

The most recent publication on STREAMLINE (release Beta 0.3.4), with benchmarking on simulated data and an application to obstructive sleep apnea risk prediction as a clinical outcome, is available as a preprint on arXiv here.

The first publication, detailing the initial implementation of STREAMLINE (release Beta 0.2.4) and applying it to simulated benchmark data, can be found here, or as a preprint on arXiv here.

See citations for more information on citing STREAMLINE, as well as publications applying STREAMLINE and publications on algorithms developed in our research group and incorporated into STREAMLINE.


Installation and Use

STREAMLINE can be run using a variety of modes balancing ease of use and efficiency.

  • Google Colab Notebook: runs serially on Google Cloud (best for beginners)
  • Jupyter Notebook: runs serially on your local machine
  • Command Line: runs locally or on a computing cluster
    • Locally, serially
    • Locally, with CPU cores in parallel
    • On a CPU computing cluster (HPC), in parallel (best for efficiency)
      • All phases can be run from a single command (with a job monitor/submitter running on the head node until completion)
      • Each phase can be run separately in sequence

See the documentation for requirements, installation, and use details for each.

Basic installation instructions for Google Colab and local use are given below.

Google Colab

There is no local installation or additional steps required to run STREAMLINE on Google Colab.

Just have a Google Account and open this Colab link to run the demo (takes ~6-7 min): https://colab.research.google.com/drive/14AEfQ5hUPihm9JB2g730Fu3LiQ15Hhj2?usp=sharing

Local

Install STREAMLINE for local use with the following commands:

git clone --single-branch https://github.com/UrbsLab/STREAMLINE
cd STREAMLINE
pip install -r requirements.txt

Your STREAMLINE package is now ready to use from the STREAMLINE folder, either via the included Jupyter Notebook file or from the command line.


Other Information

Demonstration Data

Included with this pipeline is a folder named DemoData containing two small datasets used to demonstrate the pipeline. All run modes are preconfigured to run automatically on these datasets, so new users can easily test STREAMLINE.

List of Run Parameters

A complete list of STREAMLINE Parameters can be found here.


Disclaimer

We make no claim that this is the best or only viable way to assemble an ML analysis pipeline for a given classification problem, nor that the included ML modeling algorithms will yield the best performance possible. We intend many expansions/improvements to this pipeline in the future. We welcome feedback, suggestions, and contributions for improvement.


Contact

We welcome ideas, suggestions on improving the pipeline, code-contributions, and collaborations!

  • For general questions, or to discuss potential collaborations (applying or extending STREAMLINE), contact Ryan Urbanowicz at [email protected].

  • For questions about the code base, installing/running STREAMLINE, bug reports, or other troubleshooting issues, contact Harsh Bandhey at [email protected].

Other STREAMLINE Tutorial Videos on YouTube

  • A Brief Introduction to Automated Machine Learning
  • A Detailed Walkthrough
  • Input Data
  • Run Parameters
  • Running in Google Colab Notebook
  • Running in Jupyter Notebook
  • Running From Command Line


Acknowledgements

The development of STREAMLINE benefited from feedback across multiple biomedical research collaborators at the University of Pennsylvania, Fox Chase Cancer Center, Cedars Sinai Medical Center, and the University of Kansas Medical Center.

The bulk of the coding was completed by Ryan Urbanowicz, Robert Zhang, and Harsh Bandhey. Special thanks to Yuhan Cui, Pranshu Suri, Patryk Orzechowski, Trang Le, Sy Hwang, Richard Zhang, Wilson Zhang, and Pedro Ribeiro for their code contributions and feedback.

We also thank the following collaborators for their feedback on application of the pipeline during development: Shannon Lynch, Rachael Stolzenberg-Solomon, Ulysses Magalang, Allan Pack, Brendan Keenan, Danielle Mowery, Jason Moore, and Diego Mazzotti.

Funding supporting this work comes from NIH grants: R01 AI173095, U01 AG066833, and P01 HL160471.

streamline's People

Contributors

raptor419, ryanurbs


streamline's Issues

Options to Customize Training and Testing Set Proportions in 10-fold Cross-Validation

Hello,

Just wanted to say thanks for the fantastic tool you've created. It's been a huge help as I dive into programming.

I'm new to this field and find it challenging to set an option for the training and testing set proportions. Specifically, I'm looking to set the training set as 30% and the testing set as 70% of the overall data, and I also want to use the 10-fold cross-validation option. Could you please guide me to the specific option?

Thanks!

Best regards,
Minyoung
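For context, outside of STREAMLINE the split described in this question can be set up directly with scikit-learn. The following is a generic sketch (not a STREAMLINE run parameter; the dataset and all names/values are illustrative): hold out a fixed test proportion, then run stratified 10-fold cross-validation on the training portion.

```python
# Generic scikit-learn sketch: 30% train / 70% test, then 10-fold CV on train.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic binary classification data as a stand-in for real tabular data
X, y = make_classification(n_samples=500, random_state=0)

# Training set = 30% of the data, testing set = 70%, stratified by class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.3, stratify=y, random_state=0)

# Stratified 10-fold cross-validation on the training set only
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=cv)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Whether STREAMLINE exposes an equivalent run parameter is a question for the maintainers; the sketch only shows how the two mechanisms (a fixed train/test proportion and k-fold CV) combine in plain scikit-learn.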

Issue with Permutation Importance - "ValueError: assignment destination is read-only"

Hello,

Thank you for providing such a great tool. It has been incredibly helpful in my research. However, I recently encountered an issue after downloading the latest version.

When performing analysis, I encountered the following error during the phase 5 modeling, specifically: "ValueError: assignment destination is read-only."

I suspected a parallelization issue and modified the code by setting run_parallel=False, but the problem persists. Could you please provide any assistance or insights into resolving this issue?

Here's the code snippet I used:
from streamline.runners.model_runner import ModelExperimentRunner

model_exp = ModelExperimentRunner(
    output_path, experiment_name, algorithms=algorithms,
    exclude=exclude, class_label=class_label,
    instance_label=instance_label, scoring_metric=primary_metric,
    metric_direction=metric_direction,
    training_subsample=training_subsample,
    use_uniform_fi=use_uniform_FI, n_trials=n_trials,
    timeout=timeout, save_plots=False,
    do_lcs_sweep=do_lcs_sweep, lcs_nu=lcs_nu, lcs_n=lcs_N,
    lcs_iterations=lcs_iterations,
    lcs_timeout=lcs_timeout, resubmit=False)

model_exp.run(run_parallel=False)

The error details are as follows:
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [24], in <cell line: 13>()
1 from streamline.runners.model_runner import ModelExperimentRunner
2 model_exp = ModelExperimentRunner(
3 output_path, experiment_name, algorithms=algorithms,
4 exclude=exclude, class_label=class_label,
(...)
11 lcs_iterations=lcs_iterations,
12 lcs_timeout=lcs_timeout, resubmit=False)
---> 13 model_exp.run(run_parallel=False)

File /N/slate/minycho/tools/python/STREAMLINE/streamline/runners/model_runner.py:238, in ModelExperimentRunner.run(self, run_parallel)
236 job_list.append((job_obj, copy.deepcopy(model)))
237 else:
--> 238 job_obj.run(model)
239 if run_parallel and run_parallel != "False" and not self.run_cluster:
240 # run_jobs(job_list)
241 Parallel(n_jobs=num_cores)(
242 delayed(model_runner_fn)(job_obj, model
243 ) for job_obj, model in tqdm(job_list))

File /N/slate/minycho/tools/python/STREAMLINE/streamline/modeling/modeljob.py:83, in ModelJob.run(self, model)
81 self.algorithm = model.small_name
82 logging.info('Running ' + str(self.algorithm) + ' on ' + str(self.train_file_path))
---> 83 ret = self.run_model(model)
85 # Pickle all evaluation metrics for ML model training and evaluation
86 pickle.dump(ret, open(self.full_path
87 + '/model_evaluation/pickled_metrics/'
88 + self.algorithm + 'CV' + str(self.cv_count) + "_metrics.pickle", 'wb'))

File /N/slate/minycho/tools/python/STREAMLINE/streamline/modeling/modeljob.py:149, in ModelJob.run_model(self, model)
144 self.export_best_params(self.full_path + '/models/' + self.algorithm +
145 '_usedparams' + str(self.cv_count) + '.csv',
146 model.params)
148 if self.uniform_fi:
--> 149 results = permutation_importance(model.model, x_train, y_train, n_repeats=10, random_state=self.random_state,
150 scoring=self.scoring_metric)
151 self.feature_importance = results.importances_mean
152 else:

File ~/.local/lib/python3.10/site-packages/sklearn/inspection/_permutation_importance.py:258, in permutation_importance(estimator, X, y, scoring, n_repeats, n_jobs, random_state, sample_weight, max_samples)
254 scorer = _MultimetricScorer(scorers=scorers_dict)
256 baseline_score = _weights_scorer(scorer, estimator, X, y, sample_weight)
--> 258 scores = Parallel(n_jobs=n_jobs)(
259 delayed(_calculate_permutation_scores)(
260 estimator,
261 X,
262 y,
263 sample_weight,
264 col_idx,
265 random_seed,
266 n_repeats,
267 scorer,
268 max_samples,
269 )
270 for col_idx in range(X.shape[1])
271 )
273 if isinstance(baseline_score, dict):
274 return {
275 name: _create_importances_bunch(
276 baseline_score[name],
(...)
280 for name in baseline_score
281 }

File ~/.local/lib/python3.10/site-packages/sklearn/utils/parallel.py:63, in Parallel.__call__(self, iterable)
58 config = get_config()
59 iterable_with_config = (
60 (_with_config(delayed_func, config), args, kwargs)
61 for delayed_func, args, kwargs in iterable
62 )
---> 63 return super().__call__(iterable_with_config)

File ~/.local/lib/python3.10/site-packages/joblib/parallel.py:1863, in Parallel.__call__(self, iterable)
1861 output = self._get_sequential_output(iterable)
1862 next(output)
-> 1863 return output if self.return_generator else list(output)
1865 # Let's create an ID that uniquely identifies the current call. If the
1866 # call is interrupted early and that the same instance is immediately
1867 # re-used, this id will be used to prevent workers that were
1868 # concurrently finalizing a task from the previous call to run the
1869 # callback.
1870 with self._lock:

File ~/.local/lib/python3.10/site-packages/joblib/parallel.py:1792, in Parallel._get_sequential_output(self, iterable)
1790 self.n_dispatched_batches += 1
1791 self.n_dispatched_tasks += 1
-> 1792 res = func(*args, **kwargs)
1793 self.n_completed_tasks += 1
1794 self.print_progress()

File ~/.local/lib/python3.10/site-packages/sklearn/utils/parallel.py:123, in _FuncWrapper.__call__(self, *args, **kwargs)
121 config = {}
122 with config_context(**config):
--> 123 return self.function(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/sklearn/inspection/_permutation_importance.py:62, in _calculate_permutation_scores(estimator, X, y, sample_weight, col_idx, random_state, n_repeats, scorer, max_samples)
60 X_permuted[X_permuted.columns[col_idx]] = col
61 else:
---> 62 X_permuted[:, col_idx] = X_permuted[shuffling_idx, col_idx]
63 scores.append(_weights_scorer(scorer, estimator, X_permuted, y, sample_weight))
65 if isinstance(scores[0], dict):

ValueError: assignment destination is read-only

Your help on this matter would be greatly appreciated.

Thank you,
Min
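For context on the error itself: NumPy raises this ValueError whenever code assigns into an array whose write flag is disabled, which can happen when joblib hands memory-mapped (read-only) arrays to workers. A minimal, STREAMLINE-independent reproduction (all names here are illustrative):

```python
# Minimal reproduction of "assignment destination is read-only": in-place
# assignment into a NumPy array whose write flag is disabled.
import numpy as np

X = np.arange(6, dtype=float).reshape(3, 2)
X.setflags(write=False)  # simulate a read-only (e.g. memory-mapped) array

try:
    # Same style of in-place column assignment that permutation_importance does
    X[:, 0] = X[::-1, 0]
except ValueError as e:
    print(e)  # assignment destination is read-only
```

A common general workaround in such situations is to pass a writable copy of the data (e.g. `np.array(X, copy=True)`) to the failing call; whether and where that applies inside STREAMLINE is best confirmed by the maintainers.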
