galaxy-ml's Introduction

Galaxy-ML

Galaxy-ML is a web-based framework for building end-to-end machine learning pipelines, with special support for biomedical data. Under unified scikit-learn APIs, cutting-edge machine learning libraries are combined to provide thousands of different pipelines suitable for various needs. In the form of Galaxy tools, Galaxy-ML provides scalable, reproducible and transparent machine learning computations.

Key features

  • simple web UI
  • little or no coding required
  • fast model deployment and model selection, specialized in hyperparameter tuning using GridSearchCV
  • high level of parallel and automated computation

Supported modules

A typical machine learning pipeline is composed of a main estimator/model and optional preprocessing component(s).

Model
  • scikit-learn

    • sklearn.ensemble
    • sklearn.linear_model
    • sklearn.naive_bayes
    • sklearn.neighbors
    • sklearn.svm
    • sklearn.tree
  • xgboost

    • XGBClassifier
    • XGBRegressor
  • mlxtend

    • StackingCVClassifier
    • StackingClassifier
    • StackingCVRegressor
    • StackingRegressor
  • Keras (Deep learning models are re-implemented to fully support sklearn APIs. Parameters, including layer sub-parameters, can be swapped or searched. Callbacks are supported.)

    • KerasGClassifier
    • KerasGRegressor
    • KerasGBatchClassifier (works best with online data generators, processing images, genomic sequences and so on)
  • BinarizeTargetClassifier/BinarizeTargetRegressor

  • IRAPSClassifier

Preprocessor
  • scikit-learn
    • sklearn.preprocessing
    • sklearn.feature_selection
    • sklearn.decomposition
    • sklearn.kernel_approximation
    • sklearn.cluster
  • imbalanced-learn
    • imblearn.under_sampling
    • imblearn.over_sampling
    • imblearn.combine
  • skrebate
    • ReliefF
    • SURF
    • SURFstar
    • MultiSURF
    • MultiSURFstar
  • TDMScaler
  • DyRFE/DyRFECV
  • Z_RandomOverSampler
  • GenomeOneHotEncoder
  • ProteinOneHotEncoder
  • FastaDNABatchGenerator
  • FastaRNABatchGenerator
  • FastaProteinBatchGenerator
  • GenomicIntervalBatchGenerator
  • GenomicVariantBatchGenerator
  • ImageDataFrameBatchGenerator
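
The pipeline structure described above (optional preprocessors followed by a main estimator, tuned with GridSearchCV) can be sketched with plain scikit-learn. This is only an illustration, not Galaxy-ML's own code: the synthetic dataset, the step names and the parameter grid are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic binary classification data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Two preprocessing components followed by the main estimator
pipeline = Pipeline(steps=[
    ("scaler", RobustScaler()),
    ("selector", SelectKBest(score_func=f_classif)),
    ("classifier", SVC()),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {
    "selector__k": [5, 10],
    "classifier__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid=param_grid,
                      scoring="roc_auc", cv=3, n_jobs=1)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Any estimator or preprocessor from the lists above that follows the scikit-learn API should fit into the same structure.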

Installation

APIs for models, preprocessors and utils implemented in Galaxy-ML can be installed separately.

Installing using anaconda (recommended)
conda install -c bioconda -c conda-forge Galaxy-ML
Installing using pip
pip install -U Galaxy-ML
Installing from source
python setup.py install
Using source code in place
pip install -e .

To install Galaxy-ML tools in Galaxy, please refer to https://galaxyproject.org/admin/tools/add-tool-from-toolshed-tutorial/.

Running the tests

Before running the tests, run the following commands:

conda create --name galaxy_ml python=3.9
conda activate galaxy_ml
pip install -e .
pip install nose nose-htmloutput pytest
cd galaxy_ml

To run all tests and generate an HTML report:

nosetests ./tests --with-html --html-file=./report.html

To run the tests in a specific file (e.g., test_keras_galaxy.py) and generate an HTML report:

nosetests ./tests/test_keras_galaxy.py --with-html --html-file=./report.html

To run a specific test in a specific file (e.g., the test_multi_dimensional_output test in test_keras_galaxy.py) and generate an HTML report:

nosetests ./tests/test_keras_galaxy.py:test_multi_dimensional_output --with-html --html-file=./report.html

Examples for using Galaxy-ML custom models

# handle imports
from keras.models import Sequential
from keras.layers import Dense, Activation
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV
from galaxy_ml.keras_galaxy_models import KerasGClassifier


# build a DNN classifier
model = Sequential()
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(1, activation='sigmoid'))
config = model.get_config()

classifier = KerasGClassifier(config, random_state=42)


# clone a classifier
clf = clone(classifier)


# Get parameters
params = clf.get_params()


# Set parameters
new_params = dict(
    epochs=60,
    lr=0.01,
    layers_1_Dense__config__kernel_initializer__config__seed=999,
    layers_0_Dense__config__kernel_initializer__config__seed=999
)
clf.set_params(**new_params)


# model evaluation using GridSearchCV
# X, y are the training features and labels (not defined here)
grid = GridSearchCV(clf, param_grid={}, scoring='roc_auc', cv=5, n_jobs=2)
grid.fit(X, y)

Example for using Galaxy-ML to persist a sklearn/keras model

from galaxy_ml.model_persist import (dump_model_to_h5,
                                     load_model_from_h5)

# save_path and path_to_hdf5 are placeholders for real file paths

# dump model to hdf5
dump_model_to_h5(model, save_path,
                 store_hyperparameter=True)

# load model from hdf5
model = load_model_from_h5(path_to_hdf5)

Performance comparison

Galaxy-ML's HDF5 saving utils are faster than pickle for large, array-rich models. In the run below, both dumping and loading were quicker with HDF5, at nearly the same file size.

Loading model using pickle...
(1.2471628189086914 s)

Dumping model using pickle...
(3.6942389011383057 s)
File size: 930712861

Dumping model to hdf5...
(3.006715774536133 s)
File size: 930729696

Loading model from hdf5...
(0.6420958042144775 s)

Pipeline(memory=None,
         steps=[('robustscaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('kneighborsclassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=1, n_neighbors=100, p=2,
                                      weights='uniform'))],
         verbose=False)
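
Timings like the pickle figures above could be collected with a small helper such as the following. This is a rough sketch using only the standard library: the function name and the stand-in object are hypothetical, it measures an in-memory round trip rather than writing to disk, and it does not cover the HDF5 side.

```python
import pickle
import time


def time_pickle_round_trip(obj):
    """Measure how long pickle takes to serialize and restore obj in memory."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    dump_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = pickle.loads(payload)
    load_seconds = time.perf_counter() - start

    return {
        "dump_seconds": dump_seconds,
        "load_seconds": load_seconds,
        "size_bytes": len(payload),
        "restored": restored,
    }


# Hypothetical stand-in for a large fitted model
stand_in_model = {"weights": [float(i) for i in range(100_000)]}
result = time_pickle_round_trip(stand_in_model)

print("Dumping model using pickle... (%s s)" % result["dump_seconds"])
print("Loading model using pickle... (%s s)" % result["load_seconds"])
print("Size in bytes: %d" % result["size_bytes"])
```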

Publication

Gu Q, Kumar A, Bray S, Creason A, Khanteymoori A, Jalili V, et al. (2021) Galaxy-ML: An accessible, reproducible, and scalable machine learning toolkit for biomedicine. PLoS Comput Biol 17(6): e1009014. https://doi.org/10.1371/journal.pcbi.1009014

galaxy-ml's People

Contributors

anuprulez, bgruening, dependabot[bot], kxk302, mbargull, qiagu

galaxy-ml's Issues

Messing up of the pipeline / final estimator option

Sometimes a final estimator is output as a pipeline, which is not too bad.
Sometimes a pipeline is output using the final estimator option, which is very bad.

Maybe we can make it a little smarter: when there is only one component in the pipeline, output the final estimator; otherwise output the pipeline.
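
The proposed rule could look roughly like this. It is a sketch, not the tool's actual code: `choose_output` is a hypothetical name, and the only assumption about the input is that it exposes a `steps` list of `(name, estimator)` tuples, as scikit-learn's `Pipeline` does.

```python
from types import SimpleNamespace


def choose_output(pipeline):
    """Return the lone estimator for a one-step pipeline, else the pipeline itself."""
    steps = pipeline.steps
    if len(steps) == 1:
        # Only one component: hand back the final estimator directly
        return steps[0][1]
    return pipeline


# Stand-ins for real pipelines (assumption: only the .steps attribute matters here)
single = SimpleNamespace(steps=[("svc", "final estimator")])
multi = SimpleNamespace(steps=[("scaler", "preprocessor"), ("svc", "final estimator")])

print(choose_output(single))
print(choose_output(multi) is multi)
```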

Thanks Jonah for pointing out the issue.

Save training loss history to .csv file

Enable saving training loss history to .csv file (From https://help.galaxyproject.org/t/csvlogger-callback-with-create-deep-learning-model/10447)

Hi,

I am using the Create deep learning model tool (with an optimizer, loss function and fit parameters) and I would like to get an indication about the convergence of the loss function as a function of the epoch.

I thought that information would come in the CSVLogger but that does not seem to be the case, is this the correct option?

Can someone point me to where I can find the training loss history (and not simply the final value)?

J34ni
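
One possible way to get a per-epoch loss table, sketched here outside of Galaxy: Keras's `fit` returns a `History` object whose `history` attribute is a dict mapping each metric name to a list with one value per epoch, and such a dict can be written to CSV with the standard library. The function name and the sample values below are hypothetical.

```python
import csv
import os
import tempfile


def write_history_csv(history, path):
    """Write a dict of per-epoch metric lists to a CSV file, one row per epoch."""
    metrics = list(history)
    n_epochs = len(history[metrics[0]]) if metrics else 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["epoch"] + metrics)
        for epoch in range(n_epochs):
            writer.writerow([epoch] + [history[name][epoch] for name in metrics])


# Hypothetical values shaped like History.history after three epochs
example_history = {"loss": [0.9, 0.5, 0.3], "accuracy": [0.6, 0.8, 0.9]}
csv_path = os.path.join(tempfile.mkdtemp(), "training_history.csv")
write_history_csv(example_history, csv_path)

with open(csv_path, newline="") as handle:
    written_rows = list(csv.reader(handle))
print(written_rows)
```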

Fix get_scoring method

Currently, get_scoring() in galaxy_ml/utils.py assumes 'secondary_scoring' is a comma-separated string. It seems Galaxy now provides a list instead. get_scoring() must be revised so it can handle both a list and a string (for backward compatibility). Right now, the fix (converting the list to a comma-separated string) happens in galaxytools. After get_scoring() is revised, the fix in galaxytools should be removed.
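
The list-or-string handling could be isolated in a small helper along these lines. This is a sketch of the idea only, not the actual get_scoring() code: the helper name is hypothetical, and it assumes the list items are already strings.

```python
def normalize_secondary_scoring(value):
    """Return scorer names as a list, given either a list or a comma-separated string."""
    if value is None:
        return []
    if isinstance(value, str):
        # Old behaviour: a single comma-separated string
        items = value.split(",")
    else:
        # New behaviour: a list (or other iterable) of names
        items = list(value)
    # Strip whitespace and drop empty entries
    return [item.strip() for item in items if item.strip()]


print(normalize_secondary_scoring("accuracy, f1"))
print(normalize_secondary_scoring(["accuracy", "f1"]))
```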

V0.9.0 issues

  • BayesSearchCV doesn't work with sklearn v0.24.x; waiting for the release of scikit-optimize v0.90.

  • TensorFlow gives a numpy error in the `keras_model_config` tool. A bugfix is pending; waiting for TF v0.24.2.

  • Test cases are needed for imbalanced-learn in hdf5 persistence.

Format of Search CV parameters

As reported by one of our students, the tabular dataset returned by the search CV tool populates the parameter list differently from what is documented in many earlier ML tutorials. Do we know what needs to be changed in the tool parameter selection to make the search CV tool run? I remember we used to get a list of all hyperparameters and could then select those whose values needed to be optimized.

[Image: search_cv1]

I tried a few ways to make it work, such as changing the tool's version and using different options in "Choose a parameter name (with current value)", but the jobs are failing. Do you know what the correct setting should be? I will then update the tutorials accordingly. I am not sure whether this is a bug or whether it has been deliberately formatted like this (above image).

The hyperparameters were listed previously as:

[Image: searchcv2]

link to the history: https://usegalaxy.eu/u/kumara/h/age-prediction

It could be related to the method here:

def get_search_params(estimator):

I think some parameters that should be listed currently are not:

[Image: no_param_list]

ping @qiagu Any ideas or hints on how to fix this?

Thanks a lot!

About Missing column selector and header option in the ml_visualization_ex tool

Having those options in the visualization tool may make it crowded. Instead, there are existing tools that can handle header and column selection. table_compute (https://toolshed.g2.bx.psu.edu/repos/iuc/table_compute) is one of them.

Another tip:
The ml_visualization_ex tool can output a PDF image in the job working directory. Modify the code in the plotly layout part to generate personalized images for presentation or publication.
