
mleap's Introduction

MLeap Logo

Gitter Build Status Maven Central

Deploying machine learning data pipelines and algorithms should not be a time-consuming or difficult task. MLeap allows data scientists and engineers to deploy machine learning pipelines from Spark and Scikit-learn to a portable format and execution engine.

Documentation

Documentation is available at https://combust.github.io/mleap-docs/.

Read Serializing a Spark ML Pipeline and Scoring with MLeap to gain a full sense of what is possible.

Introduction

Using the MLeap execution engine and serialization format, we provide a performant, portable and easy-to-integrate production library for machine learning data pipelines and algorithms.

For portability, we build our software on the JVM and only use serialization formats that are widely adopted.

We also provide a high level of integration with existing technologies.

Our goals for this project are:

  1. Allow Researchers/Data Scientists and Engineers to continue to build data pipelines and train algorithms with Spark and Scikit-Learn
  2. Extend Spark/Scikit/TensorFlow by providing ML Pipelines serialization/deserialization to/from a common framework (Bundle.ML)
  3. Use the MLeap Runtime to execute your pipeline and algorithm without dependencies on Spark or Scikit-learn (NumPy, Pandas, etc.)

Overview

  1. Core execution engine implemented in Scala
  2. Spark, PySpark and Scikit-Learn support
  3. Export a model with Scikit-learn or Spark and execute it using the MLeap Runtime (without dependencies on the Spark context or sklearn/numpy/pandas)
  4. Choose from two portable serialization formats, JSON or Protobuf (see the sketch after this list)
  5. Implement your own custom data types and transformers for use with MLeap data frames and transformer pipelines
  6. Extensive test coverage with full parity tests for Spark and MLeap pipelines
  7. Optional Spark transformer extension to extend Spark's default transformer offerings
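
Regarding item 4, here is a minimal sketch of choosing the serialization format at save time. It assumes a fitted MLeap Transformer named pipeline and the MleapSupport implicits; the format(...) setter on the bundle writer is used as in the MLeap docs:

import ml.combust.bundle.BundleFile
import ml.combust.bundle.serializer.SerializationFormat
import ml.combust.mleap.runtime.MleapSupport._
import scala.util.Using

// pipeline is assumed to be a fitted MLeap Transformer (see the examples below)
Using(BundleFile("jar:file:/tmp/pipeline-json.zip")) { bf =>
  pipeline.writeBundle.format(SerializationFormat.Json).save(bf).get // human-readable JSON
}

Using(BundleFile("jar:file:/tmp/pipeline-proto.zip")) { bf =>
  pipeline.writeBundle.format(SerializationFormat.Protobuf).save(bf).get // compact Protobuf
}

The format is recorded inside the bundle, so it is picked up automatically at load time.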

Unified Runtime

Dependency Compatibility Matrix

Other versions besides those listed below may also work (especially more recent Java versions for the JRE), but these are the configurations that MLeap is tested against.

| MLeap Version | Spark Version | Scala Version | Java Version | Python Version | XGBoost Version | TensorFlow Version |
|---------------|---------------|---------------|--------------|----------------|-----------------|--------------------|
| 0.23.1 | 3.4.0 | 2.12.18 | 11 | 3.7, 3.8 | 1.7.6 | 2.10.1 |
| 0.23.0 | 3.4.0 | 2.12.13 | 11 | 3.7, 3.8 | 1.7.3 | 2.10.1 |
| 0.22.0 | 3.3.0 | 2.12.13 | 11 | 3.7, 3.8 | 1.6.1 | 2.7.0 |
| 0.21.1 | 3.2.0 | 2.12.13 | 11 | 3.7 | 1.6.1 | 2.7.0 |
| 0.21.0 | 3.2.0 | 2.12.13 | 11 | 3.6, 3.7 | 1.6.1 | 2.7.0 |
| 0.20.0 | 3.2.0 | 2.12.13 | 8 | 3.6, 3.7 | 1.5.2 | 2.7.0 |
| 0.19.0 | 3.0.2 | 2.12.13 | 8 | 3.6, 3.7 | 1.3.1 | 2.4.1 |
| 0.18.1 | 3.0.2 | 2.12.13 | 8 | 3.6, 3.7 | 1.0.0 | 2.4.1 |
| 0.18.0 | 3.0.2 | 2.12.13 | 8 | 3.6, 3.7 | 1.0.0 | 2.4.1 |
| 0.17.0 | 2.4.5 | 2.11.12, 2.12.10 | 8 | 3.6, 3.7 | 1.0.0 | 1.11.0 |
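
For example, an sbt build matching the 0.23.1 row of the matrix might pin versions like this (a sketch; the Spark coordinate is the standard spark-mllib artifact):

// build.sbt: versions taken from the 0.23.1 row above
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "ml.combust.mleap" %% "mleap-runtime" % "0.23.1",
  "ml.combust.mleap" %% "mleap-spark"   % "0.23.1",
  "org.apache.spark" %% "spark-mllib"   % "3.4.0" % Provided
)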

Setup

Link with Maven or SBT

SBT

libraryDependencies += "ml.combust.mleap" %% "mleap-runtime" % "0.23.1"

Maven

<dependency>
    <groupId>ml.combust.mleap</groupId>
    <artifactId>mleap-runtime_2.12</artifactId>
    <version>0.23.1</version>
</dependency>

For Spark Integration

SBT

libraryDependencies += "ml.combust.mleap" %% "mleap-spark" % "0.23.1"

Maven

<dependency>
    <groupId>ml.combust.mleap</groupId>
    <artifactId>mleap-spark_2.12</artifactId>
    <version>0.23.1</version>
</dependency>

PySpark Integration

Install MLeap from PyPI

$ pip install mleap

Using the Library

For more complete examples, see our other Git repository: MLeap Demos

Create and Export a Spark Pipeline

The first step is to create our pipeline in Spark. For our example we will manually build a simple Spark ML pipeline.

import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.bundle.SparkBundleContext
import org.apache.spark.ml.feature.{Binarizer, StringIndexer}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import scala.util.Using

  val datasetName = "./examples/spark-demo.csv"

  val dataframe: DataFrame = spark.sqlContext.read.format("csv")
    .option("header", true)
    .load(datasetName)
    .withColumn("test_double", col("test_double").cast("double"))

  // Use out-of-the-box Spark transformers like you normally would
  val stringIndexer = new StringIndexer().
    setInputCol("test_string").
    setOutputCol("test_index")

  val binarizer = new Binarizer().
    setThreshold(0.5).
    setInputCol("test_double").
    setOutputCol("test_bin")

  val pipelineEstimator = new Pipeline()
    .setStages(Array(stringIndexer, binarizer))

  val pipeline = pipelineEstimator.fit(dataframe)

  // then serialize pipeline
  val sbc = SparkBundleContext().withDataset(pipeline.transform(dataframe))
  Using(BundleFile("jar:file:/tmp/simple-spark-pipeline.zip")) { bf =>
    pipeline.writeBundle.save(bf)(sbc).get
  }

The dataset used for training can be found at examples/spark-demo.csv in this repository.

Spark pipelines are not meant to be run outside of Spark. They require a DataFrame and therefore a SparkContext to run. These are expensive data structures and libraries to include in a project. With MLeap, there is no dependency on Spark to execute a pipeline. MLeap dependencies are lightweight and we use fast data structures to execute your ML pipelines.

PySpark Integration

Import the MLeap library in your PySpark job

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

See the PySpark Integration section of python/README.md for more.

Create and Export a Scikit-Learn Pipeline

import pandas as pd

from mleap.sklearn.pipeline import Pipeline
from mleap.sklearn.preprocessing.data import FeatureExtractor, LabelEncoder, ReshapeArrayToN1
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame(['a', 'b', 'c'], columns=['col_a'])

categorical_features = ['col_a']

feature_extractor_tf = FeatureExtractor(input_scalars=categorical_features,
                                        output_vector='imputed_features',
                                        output_vector_items=categorical_features)

# Label Encoder for x1 Label
label_encoder_tf = LabelEncoder(input_features=feature_extractor_tf.output_vector_items,
                                output_features='{}_label_le'.format(categorical_features[0]))

# Reshape the output of the LabelEncoder to N-by-1 array
reshape_le_tf = ReshapeArrayToN1()

# Vector Assembler for x1 One Hot Encoder
one_hot_encoder_tf = OneHotEncoder(sparse=False)
one_hot_encoder_tf.mlinit(prior_tf = label_encoder_tf, 
                          output_features = '{}_label_one_hot_encoded'.format(categorical_features[0]))

one_hot_encoder_pipeline_x0 = Pipeline([
                                         (feature_extractor_tf.name, feature_extractor_tf),
                                         (label_encoder_tf.name, label_encoder_tf),
                                         (reshape_le_tf.name, reshape_le_tf),
                                         (one_hot_encoder_tf.name, one_hot_encoder_tf)
                                        ])

one_hot_encoder_pipeline_x0.mlinit()
one_hot_encoder_pipeline_x0.fit_transform(data)
one_hot_encoder_pipeline_x0.serialize_to_bundle('/tmp', 'mleap-scikit-test-pipeline', init=True)

# array([[ 1.,  0.,  0.],
#        [ 0.,  1.,  0.],
#        [ 0.,  0.,  1.]])

Load and Transform Using MLeap

Because we export Spark and Scikit-learn pipelines to a standard format, we can use either our Spark-trained pipeline or our Scikit-learn pipeline from the previous steps to demonstrate usage of MLeap in this section. The choice is yours!

import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import scala.util.Using
// load the Spark pipeline we saved in the previous section
val bundle = Using(BundleFile("jar:file:/tmp/simple-spark-pipeline.zip")) { bundleFile =>
  bundleFile.loadMleapBundle().get
}.get

// create a simple LeapFrame to transform
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}
import ml.combust.mleap.core.types._

// MLeap makes extensive use of monadic types like Try
val schema = StructType(StructField("test_string", ScalarType.String),
  StructField("test_double", ScalarType.Double)).get
val data = Seq(Row("hello", 0.6), Row("MLeap", 0.2))
val frame = DefaultLeapFrame(schema, data)

// transform the dataframe using our pipeline
val mleapPipeline = bundle.root
val frame2 = mleapPipeline.transform(frame).get
val data2 = frame2.dataset

// get data from the transformed rows and make some assertions
assert(data2(0).getDouble(2) == 1.0) // string indexer output
assert(data2(0).getDouble(3) == 1.0) // binarizer output

// the second row
assert(data2(1).getDouble(2) == 2.0)
assert(data2(1).getDouble(3) == 0.0)

Documentation

For more detail, please see our documentation, where you can learn to:

  1. Implement custom transformers that will work with Spark, MLeap and Scikit-learn
  2. Implement custom data types to transform with Spark and MLeap pipelines
  3. Transform with blazing fast speeds using optimized row-based transformers
  4. Serialize MLeap data frames to various formats like Avro, JSON, and a custom binary format
  5. Implement new serialization formats for MLeap data frames
  6. Work through several demonstration pipelines which use real-world data to create predictive pipelines
  7. Use the supported Spark transformers
  8. Use the supported Scikit-learn transformers
  9. Use the custom transformers provided by MLeap

Contributing

  • Write documentation
  • Write a tutorial/walkthrough for an interesting ML problem
  • Contribute an Estimator/Transformer from Spark
  • Use MLeap at your company and tell us what you think
  • Make a feature request or report a bug on GitHub
  • Make a pull request for an existing feature request or bug report
  • Join the discussion of how to get MLeap into Spark as a dependency. Talk with us on Gitter (see the link at the top of this README)

Building

Please ensure you have sbt 1.9.3, Java 11, and Scala 2.12.18 installed.

  1. Initialize the git submodules: git submodule update --init --recursive
  2. Run sbt compile

Thank You

Thank you to Swoop for supporting the XGBoost integration.

Contributors Information

Past contributors

License

See LICENSE and NOTICE file in this repository.

Copyright 20 Combust, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


mleap's Issues

ConfigException$Missing on SparkBundleContext initialization

Dear sir or Madam,
When I try to save the model using:
val sbc = SparkBundleContext().withDataset(pipeline.transform(df))
I get the following exception:
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'ml'

I am using iheart/ficus for configuration setting, which is built on top of the typesafe.config library.
So at the beginning of my program, I have code like:
val conf = ConfigFactory.load()
val settings = new Settings(conf)
which will read configuration from reference.conf and application.conf.

When I test the code in spark-shell, without using typesafe.config library, the code works.

How can I continue to use typesafe.config on my own without breaking MLeap?

Thanks.
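
For anyone hitting the same error while building a fat jar: one common cause is an assembly merge strategy that drops MLeap's reference.conf, where the 'ml' settings live. A hedged sbt-assembly sketch (assuming sbt-assembly is in use; adapt to your build tool):

// build.sbt: concatenate all reference.conf files instead of keeping only one
assembly / assemblyMergeStrategy := {
  case "reference.conf" => MergeStrategy.concat
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}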

Use nio FileSystem to serialize Bundle.ML and update format slightly

We want to use NIO FileSystem objects to serialize Bundle.ML; this will make it much more versatile and simplify the code a great deal. We will also make some small tweaks to how we serialize Bundle.ML root-level components.

  1. NIO FileSystem objects for serializing
  2. bundle.json should only include version, uid, serialization format
  3. root-level transformer should be in a folder called root, next to bundle.json
  4. get rid of custom attributes on the Bundle
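
For illustration, the JDK's zip FileSystem already supports this style of serialization; a minimal sketch of writing the proposed layout (the bundle.json contents here are placeholders):

import java.net.URI
import java.nio.file.{FileSystems, Files}
import java.util.Collections

// open (or create) the zip file as an NIO FileSystem
val fs = FileSystems.newFileSystem(
  URI.create("jar:file:/tmp/bundle.zip"),
  Collections.singletonMap("create", "true"))
try {
  // bundle.json at the top level: version, uid, serialization format only
  Files.write(fs.getPath("/bundle.json"),
    """{"uid": "...", "version": "...", "format": "json"}""".getBytes("UTF-8"))
  // the root-level transformer lives in /root, next to bundle.json
  Files.createDirectories(fs.getPath("/root"))
} finally {
  fs.close()
}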

Coalesce and StringMap transformers for MLeap

Coalesce transformer takes in multiple columns and chooses the first non-null value. Supports only doubles and nullable doubles.

StringMap takes in a string and outputs a double using a user-defined map.

Spark support for these two transformers will come with a later ticket.
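
A rough sketch of the described semantics (shapes are hypothetical, not the final MLeap API):

// Coalesce: the first non-null value across the input columns
def coalesce(values: Seq[Option[Double]]): Option[Double] =
  values.collectFirst { case Some(v) => v }

coalesce(Seq(None, Some(1.5), Some(2.0))) // Some(1.5)

// StringMap: a user-defined string -> double lookup
case class StringMapModel(labels: Map[String, Double]) {
  def apply(label: String): Double = labels(label)
}

StringMapModel(Map("high" -> 1.0, "low" -> 0.0))("high") // 1.0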

Schema of the deployed model

Hi,

I just tried MLeap, which is really awesome. However, I was wondering whether there is a way to get the schema of the exported model in a format such as PMML, so that we can have a better overview of what types of features the model is using and other information like that?

Thanks
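
MLeap does not export PMML, but the shape of a loaded model can be inspected at runtime; a minimal sketch, assuming the inputSchema/outputSchema accessors available on the MLeap runtime's Transformer in recent releases:

import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import scala.util.Using

val pipeline = Using(BundleFile("jar:file:/tmp/simple-spark-pipeline.zip")) { bf =>
  bf.loadMleapBundle().get.root
}.get

// print the fields the pipeline consumes and produces
println(pipeline.inputSchema)
println(pipeline.outputSchema)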

Initial support for scikit-learn

Build out a Python module to serialize Scikit-learn + Pandas transformer pipelines to MLeap. We do not need to support deserialization to start.

  1. Support Bundle.ML JSON format
  2. Support several feature transformers, regression algorithms and classifiers
  3. Make sure decision tree serialization is working properly
  4. Publish the module to PyPI when we release MLeap 0.5.0
  5. Add documentation for the Python SK Learn integration
  6. Create some notebooks showing SK Learn usage

deserialize the bundle model and got Bundle[Nothing]

When I deserialize a bundle model (a simple Random Forest model) from a zip file, like
val bundle = BundleFile("jar:file:/home//userA/rf.zip").load().get
I get a bundle of type ml.combust.bundle.dsl.Bundle[Nothing],

and then when I access root, I get the following exception:
java.lang.ClassCastException: ml.combust.mleap.runtime.transformer.Pipeline cannot be cast to scala.runtime.Nothing$

Thanks for help in advance!
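
For reference, the loadMleapBundle helper shown earlier in this README sidesteps the inference problem, because it fixes the bundle's type parameters; a minimal sketch:

import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.runtime.frame.Transformer
import scala.util.Using

// loadMleapBundle yields a Bundle[Transformer], so root is a usable Transformer
val pipeline: Transformer = Using(BundleFile("jar:file:/home//userA/rf.zip")) { bf =>
  bf.loadMleapBundle().get.root
}.get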

Support Conversion of Product classes to/from DefaultLeapFrame

Spark has a nice feature that lets you build a Dataset from a case class. We should support this as well.

MleapReflection provides many tools that will be needed for this task.
I am thinking we should support the following conversions:

  1. (case class) -> DefaultLeapFrame w/ 1 row
  2. Seq(case class) -> DefaultLeapFrame w/ n rows
  3. DefaultLeapFrame -> (case class), extracts first row (throw error if more than one?)
  4. DefaultLeapFrame -> Seq(case class), extracts all rows into a Seq of a case class

These conversions should be implicit and included in the MleapSupport trait for easy usage.
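
Hypothetical usage of the proposed conversions (toLeapFrame and to[...] are placeholder names, not an existing API):

import ml.combust.mleap.runtime.frame.DefaultLeapFrame

case class Passenger(name: String, fare: Double)

// 2. Seq(case class) -> DefaultLeapFrame with n rows (hypothetical toLeapFrame)
val frame: DefaultLeapFrame =
  Seq(Passenger("alice", 7.25), Passenger("bob", 71.28)).toLeapFrame

// 4. DefaultLeapFrame -> Seq(case class) (hypothetical to[...])
val passengers: Seq[Passenger] = frame.to[Seq[Passenger]]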

Need a workable example

Hi,

Just want to say this project is pretty cool and thanks for your effort!

We are looking for some solution to train models offline using Spark, yet score online in real time. This is exactly what we need.

I was trying to follow the minimal doc in the wiki page. https://github.com/combust-ml/mleap/wiki/Setting-up-a-Spark-2.0-notebook-with-MLeap-an-Toree

The page seems unfinished. Is there any unit test class or example that I can follow to use MLeap?

Also, a couple of corrections.

  • page title should be "... MLeap and Toree"
  • in build & install toree, it should be "pip install toree-0.2.0.dev1.tar.gz"
  • also, it should be SPARK_HOME=... jupyter toree install

Again. Thanks!

Nice Java Interface

Currently working with MLeap from Java can be a pain. Let's make the interface nicer.

Include an optional schema file for MLeap pipelines

Include an optional schema.json file in the root bundle. Only include this if there is enough information to accurately describe the input and output schemas.

For Spark-trained pipelines, we will have to include the DataFrame used to train the pipeline while we serialize the model. SparkBundleContext already has an optional DataFrame for this purpose.

Standardize Serialization format with Spark

Standardizing ML Pipeline Serialization

Currently there is a large array of serialization formats for machine learning models:

  1. PMML is an XML-based format primarily targeting the JVM for executing ML models
  2. Scikit-learn relies on Python pickling to export models
  3. Spark has a serialization format based on Parquet and JSON
  4. Various other libraries such as Caffe, Torch, MLDB, etc. have their own custom file formats they use to store models with

We propose a serialization format that is highly extensible, portable across languages and platforms, open source, and with reference implementations in both Scala and Rust. We call this serialization format Bundle.ML.

Key Features

  1. It should be easy for developers to add custom transformers in Scala, Java, Python, C, Rust, or any other language
  2. The serialization format should be flexible and meet state-of-the-art performance requirements. This means being able to serialize arbitrarily-large random forest, linear, or neural network models.
  3. Serialization should be optimized for ML Transformers and Pipelines as seen in Scikit-learn and Spark, but it should also support non-pipeline based frameworks such as H2O
  4. Serialization should be accessible for all environments and platforms, including low-level languages like C, C++ and Rust
  5. Provide a common, extensible serialization format for any technology to integrate with via custom transformers or core transformers
  6. Serialization/deserialization should be possible with as many technologies as possible to make models truly portable between different platforms; i.e., we should be able to train a pipeline with Scikit-learn and then execute it in Spark.

Add imputer transformer to MLeap

We should add imputer to MLeap based on the Spark transformer.

  1. MLeap core model (sketched after this list)
  2. MLeap runtime transformer
  3. MLeap transformer NodeOp
  4. Spark transformer NodeOp
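
A rough sketch of piece (1), mirroring Spark Imputer's behavior of replacing missing values with a surrogate computed at fit time (illustrative shape, not the final API):

// surrogate is the mean or median computed during fit;
// missingValue defaults to NaN, as in Spark's Imputer
case class ImputerModel(surrogate: Double, missingValue: Double = Double.NaN) {
  def apply(value: Double): Double =
    if (value.isNaN || value == missingValue) surrogate else value
}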

Tricky Spark transformers

This epic is for Spark transformers that are rather tricky for one reason or another to adapt to MLeap.

This usually is because multiple data frames may be involved in the transform process. MLeap will have to come up with a solution to this as we move forward.

Spark transformer params missing after deserializing

I have started experimenting with serializing and deserializing spark pipelines. However, I have noticed that when I override default params, they are missing after deserialization. I have narrowed down the cause to how the OpNode#load method is implemented, and specifically the use of .copy(model.extractParamMap()).

I am not very familiar with this copy API provided by Spark, so I cannot figure out whether this is a Spark bug or a misuse of the API. The only solution I've thought of so far is to explicitly get and set each param (as done in OpModel#load).

Here is a reproducible case, using Binarizer as an example:

import java.io.File

import org.apache.spark.ml.feature.Binarizer
import ml.combust.mleap.spark.SparkSupport._

val bin = new Binarizer("bin")
  .setInputCol("in")
  .setOutputCol("out")
  .setThreshold(0.5)

val path = new File(...)
bin.serializeToBundle(path)

val bin2 = path.deserializeBundle()._2.asInstanceOf[Binarizer]
assert(bin.getInputCol == bin2.getInputCol)
assert(bin.getOutputCol == bin2.getOutputCol)
assert(bin.getThreshold == bin2.getThreshold) //fails
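
A sketch of the suggested workaround, rebuilding the transformer and carrying each param over explicitly instead of relying on copy(extractParamMap()) (illustrative only):

import org.apache.spark.ml.feature.Binarizer

// hypothetical load body: set each param by hand so none are lost
def loadBinarizer(uid: String, model: Binarizer): Binarizer =
  new Binarizer(uid)
    .setInputCol(model.getInputCol)
    .setOutputCol(model.getOutputCol)
    .setThreshold(model.getThreshold)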

Allow for saving meta data into the Bundle file

Allow users to store arbitrary metadata in the bundle file.

This can be useful for:

  1. Quick training summary statistics
  2. Information about labels in the model that could be used later
  3. Descriptions, notes, etc.

mleap custom spark estimator

How can I integrate custom Spark transformers and estimators into MLeap?

I am thinking of preprocessing steps, NaN cleaning, ...
