mercury-ml's Introduction

mercury-ml

In ancient Roman mythology, the god Mercury was known as the messenger of the gods. Wearing winged shoes and a winged hat, he zipped between Mount Olympus and the kingdoms of men and saw to it that the will of the gods was known.

We've chosen mercury-ml as the name of this package because we see its role as very similar.

Recent developments in Machine Learning and Data Processing tools have led to a myriad of excellent open source libraries, each of which provides well-developed and transparent APIs. Things become more complicated when functions from different libraries need to be strung together to form a machine learning workflow. mercury-ml is "a messenger of the gods" that enables this to happen. It seeks to break down a Machine Learning project into its typical generic components (such as read data, transform data, fit model, evaluate model etc.) and offers a generic modular structure where implementations for specific methods and technologies can slot in.

These broken down components can then be chained together into a coherent, easily configurable workflow for fitting, evaluating and (coming soon!) serving Machine Learning models.

Components

The package is first split into two broad sections:

  • common: functions and classes that are useful regardless of the machine learning engine used
  • engine-specific sections: functions and classes that are specific to a machine learning engine (currently this includes h2o and keras)

Within these sections there is a further subdivision into three APIs:

  • providers: The individual (modular) building-blocks used to build up a workflow
  • aliases: Centralised containers through which the desired providers can be fetched
  • tasks: Small predefined chunks of work, mostly stringing together a handful of logical steps to be executed by various providers

You can interact with mercury-ml via any or all of these APIs. They deliver different levels of abstraction depending on what you need. You can also easily mix in your own custom providers.

To understand the purpose and function of each individual provider, alias container and task, please refer to the mercury-ml API documentation.

Dependencies

Since mercury-ml acts as a facilitator for workflows based on various different packages, its dependencies are determined by which functions are used. The core dependencies have been kept to a handful of packages.

Core dependencies
  • python>=3.5
  • pandas
  • numpy
  • scikit-learn
Workflows using Keras:
  • tensorflow or tensorflow-gpu
  • keras
  • Pillow
Workflows using H2O:
  • h2o
Workflows using H2O Sparkling:
  • pyspark
  • h2o-pysparkling-{spark-version} (e.g. if you installed pyspark==2.4, you should install h2o-pysparkling-2.4)
Remote storage dependencies:
  • AWS S3: boto3
  • GCP Cloud Storage: google-cloud-storage
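
For the H2O Sparkling entry above, the package name follows the major.minor version of the installed pyspark. A minimal sketch of that naming rule is shown below; pysparkling_package_name is a hypothetical helper written for illustration, not part of mercury-ml.

```python
def pysparkling_package_name(pyspark_version):
    """Hypothetical helper: map a pyspark version string to the matching
    h2o-pysparkling package name (major.minor only)."""
    major, minor = pyspark_version.split(".")[:2]
    return "h2o-pysparkling-{}.{}".format(major, minor)


print(pysparkling_package_name("2.4.3"))  # h2o-pysparkling-2.4
```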

Installation

mercury-ml can be installed from PyPI with pip install mercury-ml. This will also install the core dependencies.

Usage

mercury-ml aims to offer simplified access to functionality at different levels of abstraction.

Below are four simple examples that each do the same thing: save a Keras model to S3. They do so at different levels of abstraction:

  1. Without using mercury-ml (i.e. directly using the underlying dependencies)
  2. Using the providers API
  3. Using the aliases API
  4. Using the tasks API (in conjunction with the aliases API)

Each of these examples is perfectly valid, though in certain circumstances one may make more sense than another.

For more complete examples, please see the examples directory in this repository.

Parameterization:

Let's assume we have the following inputs:

model = ... # assume a fitted Keras model fetched here
filename = "my_model"
local_dir = "./local_models"
extension = ".h5"
remote_dir = "my-bucket/remote-model"
1. Example via directly accessing the underlying libraries (i.e. without using mercury-ml)

Using the underlying libraries rather than using the mercury-ml APIs makes sense when you want the maximum flexibility to configure how these libraries are used.

import os
import boto3

# save model
os.makedirs(local_dir, exist_ok=True)

# build the full file name once, so the extension is not appended twice
full_filename = filename + extension
local_path = os.path.join(local_dir, full_filename)
model.save(local_path)

# copy to s3
session = boto3.Session()
s3 = session.resource("s3")

# "my-bucket/remote-model" -> bucket "my-bucket", key prefix "remote-model"
s3_bucket, s3_partial_key = remote_dir.split("/", 1)
s3_key = s3_partial_key + "/" + full_filename

# open the file in a context manager so the handle is closed after the upload
with open(local_path, "rb") as f:
    s3.Object(s3_bucket, s3_key).put(Body=f)
2. Example via providers

Using the providers API makes the most sense if you want to hardcode the providers you want to use. For example, in the code snippet below, you can only use model_saving.save_keras_hdf5 and from_disk.copy_from_disk_to_s3. If you want to save the model in a different format, or copy it to a different store, you must change your code to do so.

from mercury_ml.tensorflow import model_saving
from mercury_ml.common.artifact_copying import from_disk
import os

# save model
path = model_saving.save_keras_hdf5(model=model,
                                    filename=filename,
                                    local_dir=local_dir,
                                    extension=extension)

# copy to s3
from_disk.copy_from_disk_to_s3(source_dir=local_dir,
                               target_dir=remote_dir,
                               filename=os.path.basename(path))
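
If none of the available formats fits, a custom provider can take the place of save_keras_hdf5. The sketch below assumes that a model-saving provider is simply a function that accepts the same keyword arguments as above and returns the path it wrote to. save_pickle is a hypothetical name, not part of mercury-ml, and pickle only works for objects that support it.

```python
import os
import pickle


def save_pickle(model, filename, local_dir, extension=".pkl"):
    """Hypothetical custom provider: pickle `model` to
    local_dir/filename+extension and return the resulting path."""
    os.makedirs(local_dir, exist_ok=True)
    path = os.path.join(local_dir, filename + extension)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path
```

Because it mirrors the signature used above, such a function could be passed anywhere a model-saving provider is expected, for example path = save_pickle(model=model, filename=filename, local_dir=local_dir).
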
3. Example via aliases

Using the aliases API makes the most sense when you want to steer your workflow via a configuration file. The alias containers are just light-weight classes that allow you to access various similar providers from a single location. For example, the function used above, model_saving.save_keras_hdf5, can be accessed via a container as ModelSavers.save_hdf5. Using the getattr function, this can also be accessed as getattr(ModelSavers, "save_hdf5"), allowing us to easily parameterize this in a config.

from mercury_ml.tensorflow import ModelSavers
from mercury_ml.common import ArtifactCopiers
import os

config = {
    "save_model": "save_hdf5",
    "copy_model": "copy_from_disk_to_s3"
}

save_model = getattr(ModelSavers, config["save_model"])
copy_from_local_to_remote = getattr(ArtifactCopiers, config["copy_model"])

# save model
path = save_model(model=model,
                  filename=filename,
                  local_dir=local_dir,
                  extension=extension)

# copy to s3
copy_from_local_to_remote(source_dir=local_dir,
                          target_dir=remote_dir,
                          filename=os.path.basename(path)
                          )
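
To make the idea of an alias container concrete, here is a minimal sketch of how such a container could be structured. SketchModelSavers and the two providers are hypothetical stand-ins written for this illustration; they are not the mercury-ml implementation, and the providers only report where they would write instead of touching the disk.

```python
def save_as_text(model, filename, local_dir, extension=".txt"):
    """Hypothetical provider: reports the target path, writes nothing."""
    return "text", "{}/{}{}".format(local_dir, filename, extension)


def save_as_json(model, filename, local_dir, extension=".json"):
    """Hypothetical provider: reports the target path, writes nothing."""
    return "json", "{}/{}{}".format(local_dir, filename, extension)


class SketchModelSavers:
    """Hypothetical alias container: one short alias per provider."""
    save_text = staticmethod(save_as_text)
    save_json = staticmethod(save_as_json)


# The alias is a plain string, so it can come from any config file.
sketch_config = {"save_model": "save_json"}

save_model = getattr(SketchModelSavers, sketch_config["save_model"])
result = save_model(model=None, filename="my_model", local_dir="./local_models")
print(result)
```

With this layout, a misspelled alias in the config fails immediately with an AttributeError when getattr is called, rather than later in the workflow.
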
4. Example via tasks (in conjunction with aliases)

Using the tasks API makes sense when you want a single function that defines a small workflow involving more than one provider and multiple steps. For example, the store_model task below is injected with a save_model and a copy_from_local_to_remote provider, and uses these providers first to save a model locally and then to copy it to a remote location (in this example, to S3).

from mercury_ml.common.tasks import store_model
from mercury_ml.tensorflow import ModelSavers
from mercury_ml.common import ArtifactCopiers

# config is the dictionary defined in the aliases example above
save_model = getattr(ModelSavers, config["save_model"])
copy_from_local_to_remote = getattr(ArtifactCopiers, config["copy_model"])

# save model and copy to s3
store_model(save_model=save_model,
            copy_from_local_to_remote=copy_from_local_to_remote,
            model=model,
            filename=filename,
            local_dir=local_dir,
            remote_dir=remote_dir,
            extension=extension
            )
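
The injection pattern described above can be sketched without any mercury-ml imports. store_model_sketch below is a hypothetical reimplementation for illustration only; the real store_model may differ in its details and return value. The two fake providers just record how they were called.

```python
import os


def store_model_sketch(save_model, copy_from_local_to_remote, model,
                       filename, local_dir, remote_dir, extension):
    """Hypothetical task: save locally, then copy to a remote location,
    using whichever providers were injected."""
    path = save_model(model=model, filename=filename,
                      local_dir=local_dir, extension=extension)
    copy_from_local_to_remote(source_dir=local_dir, target_dir=remote_dir,
                              filename=os.path.basename(path))
    return path


calls = []  # records the order in which the fake providers run


def fake_save(model, filename, local_dir, extension):
    calls.append("save")
    return os.path.join(local_dir, filename + extension)


def fake_copy(source_dir, target_dir, filename):
    calls.append(("copy", target_dir, filename))


stored_path = store_model_sketch(fake_save, fake_copy, model=None,
                                 filename="my_model",
                                 local_dir="local_models",
                                 remote_dir="my-bucket/remote-model",
                                 extension=".h5")
```

Since the task only depends on the keyword arguments of its providers, swapping S3 for another store means injecting a different copy provider, not changing the task.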

Data in mercury-ml

It is worth saying a few words about how mercury-ml deals with data as this forms the backbone of how it is able to facilitate robust machine learning workflows. There are three concepts to understand:

  1. DataWrapper. An instance of the DataWrapper class wraps an underlying structure (for example a Pandas DataFrame, Spark DataFrame, Numpy Array or Keras ImageDataGenerator) into an object that has the following characteristics:
    • It has the attributes underlying (which gives you direct access to the data structure that has been wrapped) and field_names, which is a list with the names of the fields in the underlying data (field_names is not always relevant, and may be set to None).
    • It has various functions that transform from one DataWrapper to another. For example, PandasDataWrapper.to_numpy() will yield an instance of NumpyDataWrapper
  2. DataSet. An instance of the DataSet class is a container for various instances of DataWrapper. A DataSet will typically consist of DataWrappers for full_data, index, features and targets but this is not predefined. It also contains some functionality that facilitates the transformation into a DataSet with DataWrappers of a different type.
  3. DataBunch. An instance of the DataBunch class is essentially just a container that holds one or more DataSet instances. A DataBunch will typically consist of train, valid and test DataSets.
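
The nesting of the three concepts, and the wrapper-to-wrapper conversion, can be sketched with plain Python stand-ins. Everything below is hypothetical and only mirrors the description above: the Sketch* classes are not mercury-ml classes, a list of dicts stands in for a DataFrame, a list of rows stands in for an array, and plain dicts stand in for DataSet and DataBunch.

```python
class SketchMatrixWrapper:
    """Hypothetical wrapper around a list of rows."""

    def __init__(self, underlying, field_names=None):
        self.underlying = underlying
        self.field_names = field_names


class SketchRecordsWrapper:
    """Hypothetical wrapper around a list of dict records."""

    def __init__(self, underlying, field_names=None):
        self.underlying = underlying
        self.field_names = field_names

    def to_matrix(self):
        # Convert to another wrapper type, keeping the field names.
        rows = [[record[name] for name in self.field_names]
                for record in self.underlying]
        return SketchMatrixWrapper(underlying=rows,
                                   field_names=self.field_names)


records = [{"field1_num": 1.0, "field2_num": 2.0},
           {"field1_num": 3.0, "field2_num": 4.0}]
features = SketchRecordsWrapper(records,
                                field_names=["field1_num", "field2_num"])

data_set = {"features": features}   # stand-in for a DataSet
data_bunch = {"train": data_set}    # stand-in for a DataBunch

matrix = data_bunch["train"]["features"].to_matrix()
print(matrix.underlying)
```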

Example usage:

As an example of how this works, let's create a DataBunch for model training that uses Pandas DataFrames as inputs:

import pandas as pd
from mercury_ml.common.data_wrappers.pandas import PandasDataWrapper
from mercury_ml.common.data_set import DataSet
from mercury_ml.common.data_bunch import DataBunch

path_to_input_data = "./example_data_train.csv"

full_data_columns = ["ID", "field1_num", "field2_num", "field3_factor", "field4_target"]  # all columns relevant to training
index_columns = ["ID"]  # the columns that make up the unique index
features_columns = ["field1_num", "field2_num", "field3_factor"]  # the columns to be used as features when training
targets_columns = ["field4_target"]  # the target columns; must be among full_data_columns

df = pd.read_csv(path_to_input_data, usecols=full_data_columns)

train_data_set = DataSet(data_wrappers_dict={
    "full_data": PandasDataWrapper(underlying=df, field_names=full_data_columns),
    "index": PandasDataWrapper(underlying=df[index_columns], field_names=index_columns),
    "features": PandasDataWrapper(underlying=df[features_columns], field_names=features_columns),
    "targets": PandasDataWrapper(underlying=df[targets_columns], field_names=targets_columns)
})

data_bunch = DataBunch(data_sets_dict={
    "train": train_data_set
})

You could also add additional DataSets to the DataBunch, either when initially constructing:

data_bunch = DataBunch(data_sets_dict={
    "train": train_data_set,
    "valid": valid_data_set,
    "test": test_data_set
})

Or afterwards via DataBunch.add_data_set:

from mercury_ml.common.data_bunch import DataBunch
data_bunch = DataBunch()
data_bunch.add_data_set(data_set_name="train", data_set=train_data_set)
data_bunch.add_data_set(data_set_name="valid", data_set=valid_data_set)
data_bunch.add_data_set(data_set_name="test", data_set=test_data_set)

mercury-ml's People

Contributors

gabrieloexle, karlschriek, sbunzel

mercury-ml's Issues

Develop TensorFlow serving implementations

TensorFlow Serving client-side capabilities are currently not very clearly documented. We should aim to create an implementation that simplifies interacting with the REST or gRPC APIs for actions such as get_model_status, get_model_meta_data, hot_reload_model_config, predict and so forth.

This should go under mercury_ml.tensorflow.serving.

Installation hint [h2o-pysparkling] doesn't work

The installation hint [h2o-pysparkling] doesn't work due to the erroneous reference to the sparkling water package as h2o-pysparkling.

I can see that the guidelines on using these hints were recently removed from the readme. My suggestion would be to fix the h2o-pysparkling hint to install pyspark==2.4 and h2o-pysparkling-2.4 and explain this in the readme. Or else, just don't offer that extra installation hint.

Create Dockerfile with environment for unit tests

mercury-ml potentially uses a very wide variety of different systems and technologies. This makes it difficult to set up an environment wherein all unit tests can be run. We should set up a Dockerfile (or multiple Dockerfiles if necessary) that allows us to do this.

Develop a PyTorch implementation

A PyTorch implementation should be added at mercury_ml.pytorch and should broadly follow the structure of mercury_ml.h2o or mercury_ml.keras.

Develop a SciKit-Learn implementation

SciKit-Learn is one of the most widely used libraries for machine learning and is especially useful for smaller scale projects that don't have to run on GPU architecture or on distributed clusters.

An implementation should be added at mercury_ml.sklearn and should broadly follow the structure of mercury_ml.h2o or mercury_ml.keras.

Add pymongo to dependencies

It seems that pymongo is not in the list of Core dependencies and will not be automatically installed with pip. I guess it should be added in the setup.py

ImageDataGenerator.fit() not called

If any of the following augmentation parameters are specified:

  • featurewise_center
  • featurewise_std_normalization
  • zca_whitening

then after creating generator = ImageDataGenerator, the method generator.fit(data) should be called once to calculate the required statistics (e.g. for featurewise_std_normalization). See the first example at https://keras.io/preprocessing/image/

In mercury-ml, this call is missing, e.g. in common/providers/source_reading/disk.py, lines 58-71.

So either the method fit() should be called somewhere, or these three parameters should be deprecated.

Develop a TensorFlow implementation

TensorFlow is currently only supported indirectly via Keras. An implementation that directly uses TensorFlow should be added at mercury_ml.tensorflow and should broadly follow the structure of mercury_ml.h2o or mercury_ml.keras.
