microsoft / synapseml Goto Github PK

View Code? Open in Web Editor NEW

5.0K 146.0 815.0 142.01 MB

Simple and Distributed Machine Learning

Home Page: http://aka.ms/spark

License: MIT License

Shell 0.75% Scala 85.30% Python 10.83% R 0.02% JavaScript 1.98% Dockerfile 0.78% CSS 0.34%

spark pyspark azure scala microsoft ml machine-learning databricks cognitive-services lightgbm

synapseml's Issues

Get rid of `getEnv` and all `System.getenv`

More generally switch to using sys instead of System.

Specifically: src/core/env/src/main/scala/EnvironmentUtils.scala

`/tmp` cleanups

There are still a pile of stuff left in /tmp, all directories. The offending name patterns that need to be clean up are:

SavedModels-<N> (I just cleaned almost 19 thousand of these)
MML-Test-<N> (about 700)
MML-Test-<N>powerBI.parquet (about 160)

Most of these are empty directories, so perhaps there is some broken cleanup that removes files but not the directories. The last pattern is the only one that has files left in it.

Missing Power BI URL breaks local tests

This should probably be turned off by default in the tests and only run if the required parameters are set.

Add support for missing value cleaner

Add an Estimator that computes a missing value replacement value, such as mean, median or mode, for training data. The Estimator then produces a Model that can be applied to replace missing values.

The missing value cleaner should support one or more input columns. Different types should be supported as follows:

Floating point numbers: mean, median, mode
Integer numbers: median, mode
Strings and categoricals: model
Vectors and other composite types: not supported

The missing value cleaner should be a PipelineStage so it is compatible with SparkML pipelines.

Improve CNTKTrainer Style

src/cntk-train/src/main/scala/Builder

Should use immutable classes with constructor arguments instead of setFoos.
Rename printOutput -> runWithOutput
Go over bogus quotes in CommandBuilders.scala

Python AutoGen does not append "_" to nested classes.

I have a program...
ALS.scala:
class ALS {}
@InternalWrapper
class ALSModel {}

I have methods in a program called
ALSModel.py

In _ALS.py
def _ALS(self):
def ALSModel(self):

Which is conflicting with the name in my provided .py.

I want to know how to get a DNN model using mmlspark

I has seen many examples with loading DNN model,But I want to know how to get a DNN model using mmlspark?

Need PySpark method to access the raw CNTK model

This is needed to inspect the neural network models, especially to get the shape of input and output layers. There appears to be a Scala method for this purpose, but it's not available in PySpark API.

Add pre-trained DNNs for text data

Add support for pre-trained DNN models that can be used to extract features from free-form text data, such as word embedding vectors.

These features could then be used as inputs, for example, for document classification models.

Reduce Memory Consumption of CNTKModel

Install script action fails on HDInsight cluster

Hi folks,

I recently provisioned an HDInsight Spark 2.1 cluster and tried to install MMLSpark using the script action URI and instructions, as I've previously done for MMLSpark 0.6 without any issues. The script action fails, showing the following "Debug information" in Azure Portal:

{
    "href": "http://10.0.0.23:8080/api/v1/clusters/mawahwasb3/requests/40",
    "tasks": [
        {
            "href": "http://10.0.0.23:8080/api/v1/clusters/mawahwasb3/requests/40/tasks/156",
            "Tasks": {
                "attempt_cnt": 1,
                "command": "ACTIONEXECUTE",
                "command_detail": "run_customscriptaction ACTIONEXECUTE",
                "end_time": 1503679947306,
                "error_log": "/var/lib/ambari-agent/data/errors-156.txt",
                "exit_code": 1,
                "host_name": "hn0-mawahw.3ejwtsbjuzpurdzrrmda4wm3nd.gx.internal.cloudapp.net",
                "id": "156",
                "output_log": "/var/lib/ambari-agent/data/output-156.txt",
                "request_id": "40",
                "role": "run_customscriptaction",
                "stage_id": "0",
                "start_time": 1503679944176,
                "status": "FAILED",
                "stderr": null,
                "stdout": null,
                "structured_out": null
            }
        }, [...and all other tasks "COMPLETED"]

The last few lines printed to the mentioned output log (/var/lib/ambari-agent/data/output-156.txt) are:

Setting up ocl-icd-libopencl1:amd64 (2.2.8-1) ...
Setting up libhwloc-plugins (1.11.2-3) ...
Processing triggers for libc-bin (2.23-0ubuntu9) ...
[azureml_327951dc2df6f88e104edcd22c5f680e] ('Start downloading script locally: ', u'https://mmlspark.azureedge.net/buildartifacts/0.7/install-mmlspark.sh')
Fromdos line ending conversion successful
('Unexpected error:', "('Execution of custom script failed with exit code', 1)")
Removing temp location of the script

And the last few lines of the mentioned error log (/var/lib/ambari-agent/data/errors-156.txt) are:

/tmp/tmpyhsA9w: line 50: CNTK_WHEELS[$env]: Unknown conda env for CNTK: azureml_327951dc2df6f88e104edcd22c5f680e
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/custom_actions/scripts/run_customscriptaction.py", line 194, in <module>
    ExecuteScriptAction().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 329, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/custom_actions/scripts/run_customscriptaction.py", line 179, in actionexecute
    ExecuteScriptAction.execute_bash_script(bash_script, scriptpath, scriptparams)
  File "/var/lib/ambari-agent/cache/custom_actions/scripts/run_customscriptaction.py", line 149, in execute_bash_script
    raise Exception("Execution of custom script failed with exit code",exitcode)
Exception: ('Execution of custom script failed with exit code', 1)

Is there any other info I can provide to help identify the problem? Thanks in advance for your help!

Removed issue

Add doc page describing example datasets

Add a doc page that describes the source and properties of example datasets available on CDN, similar to this page.

Need to Organize Repo Structure

we should organize our additional pipeline stages into a project: src/stages
all other additions to spark can be their own projects either in core, or in src

show RMSE in the flight delay sample notebook.

In GPU VM template, allow user to specify existing storage account

For example, user might have a very large dataset of images on an existing storage account. It'd be time-consuming to copy it over to newly created blob storage every time user spins up a cluster.

Docker image with Python 2.7

It'd be handy if we can use Python 2.7 out of the box. Currently, the image on Docker Hub (https://hub.docker.com/r/microsoft/mmlspark/) has only Python 3.5.

GPU Work

README.md link to the 401 notebook as a demonstration of the GPU functionality.
Possibly figure out a way to still include the 401 notebook so it's visible.
Add to the release procedure an item for testing the 401 notebook.

Finding the mapping of string class labels to integer indices used by a TrainedClassifierModel

Hi folks,

I created a TrainedClassifierModel using a training dataset in which the label column is string-valued. I've noticed that when I apply the TrainedClassifierModel's transform method to a validation dataset, the resulting scored_labels column is integer-valued: presumably those are integer indices for the predicted labels. To produce a human-interpretable result, I'd like to map the scored_labels values back to their corresponding strings.

How can I find the label-to-index mapping that the TrainedClassifierModel has learned? Is it exposed as an attribute?
Suppose that I manually mapped my string-valued labels to consecutive integer indices beginning at zero, then used that integer column as my label column during training. Will MMLSpark adopt my integer-valued labels as its own class indices, or is there a potential for permutation?
- If the former is true, then I would know exactly how to map scored_labels indices back to strings.
- The latter may be true e.g. if MMLSpark simply assigns indices to labels in the order it encounters them.

Thanks!

Remove dependence on brainscript for GPU training

make the CNTKModel API completely in python and make it operate on python CNTK computation graphs and trainers

Pl update coordinate for MMLSpark in Databricks section

Looks like the coordinate structure has changed. In the Databricks section on the README.md it is still showing the old structure.

MMLSpark model.save changes dfs working directory

With the following diagnostic code,

hdi_wd = get_hdispark_working_dir()
print("AFTER STATS TRANSFORM, BEFORE WRITE")
print(hdi_wd)

# save model
model.save("outputs/aot-mmlspark.model")
hdi_wd = get_hdispark_working_dir()
print("AFTER WRITE")

AFTER STATS TRANSFORM, BEFORE WRITE
wasb://snip/testhdi_1505503317753
Running HDI/Spark job in wasb://snip
AFTER WRITE

Regular spark model:
Running HDI/Spark job in wasb://snip/testhdi_1505505091869
BEFORE WRITE
wasb://snip/testhdi_1505505091869
Running HDI/Spark job in wasb://snip/testhdi_1505505091869
AFTER WRITE
wasb://snip/testhdi_1505505091869

are there in any scala examples

Are there in any scala examples, or scala notebooks, databricks or Apache Zeppelin etc

Provide sample on how to deploy a model in Spark

MMLSpark provides a way to save the model. It would be nice to take this to the next step to show how to deploy the model to Spark using one of the methods - Spark UDF and/or Livy endpoint. It woudl be great to update one of the notebook to demostrate this.

Fix sparklyr wrappers regressions for new version of sparklyr

The newest version of sparklyr does not have ml_model function anymore. The fix would be in the code generation.

Make ImageReader fail fast if file path is invalid

Currently, if the path to images is incorrect, the ImageReader fails lazily when it tries to read images. This makes debugging hard because the failure might happen much later in the pipeline during some different operation.

Add check to ImageReader to validate the correctness and existence of the file path upon instantiation, for better debugging experience.

CNTKTrain does not clean up after itself

This occurs when you run the tests/ ./runme

Add a notebook to demonstrate `TextPreprocessing`

Can not install R packages on DataBricks, different expected folder structor.

On Databricks to add an R library I enter two items. URL and package name.

https://mmlspark.azureedge.net
mmlspark-0.10.dev24+145.g6c0cdfcc6.zip

It then creates a URL
https://mmlspark.azureedge.net/web/packages/mmlspark-0.10.dev24%2B145.g6c0cdfcc6.zip

But, my R package is at...
https://mmlspark.azureedge.net/rrr/mmlspark-0.10.dev24+145.g6c0cdfcc6.zip

Ensure CNTK model has no race conditions for old streaming API

Support cross-validation based FindBestModel

Add a new Estimator - FindBestModelCV - that is like FindBestModel, but uses cross-validation instead of evaluation against dataset.

FindBestModelCV would take a list of un-trained model or models, do cross-validation for each of them using the same fold splits, and compare metrics. It would then train the best model against full data, and return that as output Model.

Additionally, FindBestModelCV should produce a table of metrics for all sweeps.

`ZipIterator` issues

The current implementation of ZipIterator has a bunch of issues:

If it's possible, it would be much better to implement a plain
iterator that returns the entries as lazy values that actually do the
reading when needed. This would make it possible to drop the
sampling completely, and use something like
.filter(_ => r.nextDouble < someRatio) instead of baking it in.
Reading the quick description, it's not clear to me that it always
returns the same elements (ie, the setSeed(0)) -- but maybe this is
idiomatic and shouldn't be documented?
Also, there is the known algorithm that returns N random elements,
maybe it's also useful to do that? (This would be easy if the first
point is done.)
The implementation is not too great -- it looks like there are too
many vars, and the return inside the while loop is making it
hard to follow. Again, doing the first point would make all of this
complexity go away.

UDFtransformer should not extend HasInputCol and HasInputCols

This is a bit confusing and will cause complications if both parameters are independently set. Try extending the latter and adding custom setters in scala and py side

Notebooks for R wrappers

Notebooks demonstrating R code.

Additional R wrappers work

Documentation
Tests (this will also require adding R to the stuff that gets installed on dev environments.)

Featurize needs API review

The featurize estimator has a strange API, namely it takes a map from a string to a sequence of strings as a parameter.
the featurize estimator uses assemble features which for some reason is an estimator (should be a transform)
featurize should have a helper function that maps a column types to featurization pipelines. The logic of the estimator should be very simple and just compose a pipeline out of these pipelines

Try to running the sample notebook results error

Hi, I just include the package using

pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
 --repositories=https://mmlspark.azureedge.net/maven

However, when I run the sample notebook example 302, the following code

import mmlspark
import numpy as np
from mmlspark import toNDArray

IMAGE_PATH = "datasets/CIFAR10/test"
images = spark.readImages(IMAGE_PATH, recursive = True, sampleRatio = 0.1).cache()
images.printSchema()
print(images.count())

results error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-1-4aed556d1de7> in <module>()
      6 images = spark.readImages(IMAGE_PATH, recursive = True, sampleRatio = 0.1).cache()
      7 images.printSchema()
----> 8 print(images.count())

/home/wonglab/spark_install/spark/python/pyspark/sql/dataframe.py in count(self)
    378         2
    379         """
--> 380         return int(self._jdf.count())
    381 
    382     @ignore_unicode_prefix

/home/wonglab/spark_install/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/home/wonglab/spark_install/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/home/wonglab/spark_install/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o45.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.UnsatisfiedLinkError: org.opencv.imgcodecs.Imgcodecs.imdecode_0(JI)J

Any idea? Thanks in advance!

Add LightGBM learners to MMLSpark

Add LightGBM learners to MMLSpark as spark Estimators and Transformers
We can generate Java wrappers through SWIG

Document Pip install

Is this currently possible
If so how
This might make the lives of those using pycharm easier

Code Gen does not autogenerate Evaluators

When trying to add a new Evaluator, the python class is not autogenerated.

Support scoring of TensorFlow-based neural network models

Add support for scoring Tensorflow-based DNN models, similar to CNTK models with options for selecting a pre-trained network and output layer for featurization.

Find the problem with ComputeModelStatistics in NB 104

Find why the two commented cells with ComputeModelStatistics take so much time and fix it in some better way.

Maven Dependency

I added the following dep to my poml.xml

com.microsoft.ml.spark mmlspark 0.6 test

and also added the repo

azureedge.net MS Azure Maven Repo https://mmlspark.azureedge.net/maven

Still maven cannot find the artifact...

Extend DefaultParamsWritable to complex params

right now we cannot serialize custom params without custom code

Add support for Spark 2.2

How to read CIFAR images with scala in mmlspark

Currently, example 301 which evaluates pre-trained CNTK model with CIFAR10 images, is totally written by python. This example use pickle.load to read cifar-10-batches-py/test_batch and then parallelize to distributed RDD, however, I cannot directly use pickle to read data in scala code application.

I tried spark.readImages in mmlspark, but it seems cannot deal with cifar-10-batches-bin data well. And I finally choose cookie-datasets to read cifar data in scala (the master branch of cookie-datasets still used spark-1.5, and I upgrade it to spark-2.1 with necessary changes)

BTW, since you only have python examples, and I have already interpreted 101 and part 301 examples code to scala, I'm not sure whether you want this part of example codes?

Make CNTKModel support Multi-in Multi-out models

right now it only supports single in single out

Train Classifier and Train Regressor need API review

Relying on metadata and magic numbers not idiomatic and makes it hard to parameterize these pipeline stages
our automatic compute model statistics modules rely on this metadata making the two unnecessarily coupled and not friendly to the rest of the ecosystem
metadata wrangling makes code unnecessarily complex
the two share a huge amount of code and both should just output pipeline models
num features default not set idiomatically
verify train classifier's test needs to be run in a specific directory

microsoft / synapseml Goto Github PK

synapseml's Issues

Recommend Projects

Recommend Topics

Recommend Org