
synapseml's Issues

`/tmp` cleanups

There is still a pile of stuff left in /tmp, all of it directories. The offending name patterns that need to be cleaned up are:

  • SavedModels-<N> (I just cleaned almost 19 thousand of these)
  • MML-Test-<N> (about 700)
  • MML-Test-<N>powerBI.parquet (about 160)

Most of these are empty directories, so perhaps there is some broken cleanup that removes files but not the directories. The last pattern is the only one that has files left in it.
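
Until the root cause of the broken cleanup is found, a minimal manual-cleanup sketch for these patterns (assuming everything matching them under /tmp is safe to delete):

import glob
import os
import shutil

# the offending directory-name patterns from above; MML-Test-* also matches
# the MML-Test-<N>powerBI.parquet directories
patterns = ["SavedModels-*", "MML-Test-*"]

for pattern in patterns:
    for path in glob.glob(os.path.join("/tmp", pattern)):
        if os.path.isdir(path):
            shutil.rmtree(path)  # removes the directory and any leftover files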

Add support for missing value cleaner

Add an Estimator that computes a replacement value for missing data, such as the mean, median, or mode, from the training data. The Estimator then produces a Model that can be applied to replace missing values.

The missing value cleaner should support one or more input columns. Different types should be supported as follows:

  • Floating point numbers: mean, median, mode
  • Integer numbers: median, mode
  • Strings and categoricals: mode
  • Vectors and other composite types: not supported

The missing value cleaner should be a PipelineStage so it is compatible with SparkML pipelines.
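
For comparison, newer Spark versions ship a similar built-in Estimator/Model pair, pyspark.ml.feature.Imputer; a minimal sketch of the fit/transform behavior this issue asks for (column names and the train_df/test_df DataFrames are placeholders):

from pyspark.ml.feature import Imputer

# Estimator: learns the replacement value (here the median) from training data
imputer = Imputer(strategy="median",
                  inputCols=["age", "income"],
                  outputCols=["age_filled", "income_filled"])

# Model: applies the learned replacement values to any dataset
model = imputer.fit(train_df)
cleaned = model.transform(test_df)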

Improve CNTKTrainer Style

src/cntk-train/src/main/scala/Builder

  • Should use immutable classes with constructor arguments instead of setFoos (see the sketch after this list).
  • Rename printOutput -> runWithOutput
  • Go over bogus quotes in CommandBuilders.scala
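
A minimal sketch of the first point's style shift (sketched in Python for brevity; the real Builder code is Scala, and gpu_count is a placeholder field):

from dataclasses import dataclass

# before: a mutable builder configured through setFoo-style calls
class MutableTrainConfig:
    def __init__(self):
        self.gpu_count = None

    def setGpuCount(self, n):
        self.gpu_count = n

# after: an immutable class that takes its configuration as constructor arguments
@dataclass(frozen=True)
class TrainConfig:
    gpu_count: int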

Python AutoGen does not append "_" to nested classes.

I have a Scala file, ALS.scala:

class ALS {}
@InternalWrapper
class ALSModel {}

and I provide a hand-written wrapper file, ALSModel.py.

The generated _ALS.py contains:

class _ALS: ...
class ALSModel: ...

The generated ALSModel is missing the "_" prefix, so it conflicts with the class name in my provided .py file.

Need PySpark method to access the raw CNTK model

This is needed to inspect the neural network models, especially to get the shape of input and output layers. There appears to be a Scala method for this purpose, but it's not available in the PySpark API.
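
As a stopgap, PySpark wrapper objects generally keep a handle to the underlying JVM object in _java_obj, so the Scala method could in principle be reached through py4j. A hedged sketch; getModel is a hypothetical accessor name, not a confirmed MMLSpark API:

# reach the Scala side through the py4j handle of the fitted wrapper
java_model = cntk_model._java_obj   # cntk_model: a fitted CNTK model wrapper
raw_model = java_model.getModel()   # hypothetical accessor; exact name unconfirmed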

Add pre-trained DNNs for text data

Add support for pre-trained DNN models that can be used to extract features from free-form text data, such as word embedding vectors.

These features could then be used as inputs, for example, for document classification models.

Install script action fails on HDInsight cluster

Hi folks,

I recently provisioned an HDInsight Spark 2.1 cluster and tried to install MMLSpark using the script action URI and instructions, as I've previously done for MMLSpark 0.6 without any issues. The script action fails, showing the following "Debug information" in Azure Portal:

{
    "href": "http://10.0.0.23:8080/api/v1/clusters/mawahwasb3/requests/40",
    "tasks": [
        {
            "href": "http://10.0.0.23:8080/api/v1/clusters/mawahwasb3/requests/40/tasks/156",
            "Tasks": {
                "attempt_cnt": 1,
                "command": "ACTIONEXECUTE",
                "command_detail": "run_customscriptaction ACTIONEXECUTE",
                "end_time": 1503679947306,
                "error_log": "/var/lib/ambari-agent/data/errors-156.txt",
                "exit_code": 1,
                "host_name": "hn0-mawahw.3ejwtsbjuzpurdzrrmda4wm3nd.gx.internal.cloudapp.net",
                "id": "156",
                "output_log": "/var/lib/ambari-agent/data/output-156.txt",
                "request_id": "40",
                "role": "run_customscriptaction",
                "stage_id": "0",
                "start_time": 1503679944176,
                "status": "FAILED",
                "stderr": null,
                "stdout": null,
                "structured_out": null
            }
        }, [...and all other tasks "COMPLETED"]

The last few lines printed to the mentioned output log (/var/lib/ambari-agent/data/output-156.txt) are:

Setting up ocl-icd-libopencl1:amd64 (2.2.8-1) ...
Setting up libhwloc-plugins (1.11.2-3) ...
Processing triggers for libc-bin (2.23-0ubuntu9) ...
[azureml_327951dc2df6f88e104edcd22c5f680e] ('Start downloading script locally: ', u'https://mmlspark.azureedge.net/buildartifacts/0.7/install-mmlspark.sh')
Fromdos line ending conversion successful
('Unexpected error:', "('Execution of custom script failed with exit code', 1)")
Removing temp location of the script

And the last few lines of the mentioned error log (/var/lib/ambari-agent/data/errors-156.txt) are:

/tmp/tmpyhsA9w: line 50: CNTK_WHEELS[$env]: Unknown conda env for CNTK: azureml_327951dc2df6f88e104edcd22c5f680e
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/custom_actions/scripts/run_customscriptaction.py", line 194, in <module>
    ExecuteScriptAction().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 329, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/custom_actions/scripts/run_customscriptaction.py", line 179, in actionexecute
    ExecuteScriptAction.execute_bash_script(bash_script, scriptpath, scriptparams)
  File "/var/lib/ambari-agent/cache/custom_actions/scripts/run_customscriptaction.py", line 149, in execute_bash_script
    raise Exception("Execution of custom script failed with exit code",exitcode)
Exception: ('Execution of custom script failed with exit code', 1)

Is there any other info I can provide to help identify the problem? Thanks in advance for your help!

Need to Organize Repo Structure

  • We should organize our additional pipeline stages into a project: src/stages.
  • All other additions to Spark can be their own projects, either in core or in src.

GPU Work

  • Link to the 401 notebook from README.md as a demonstration of the GPU functionality.
  • Possibly figure out a way to still include the 401 notebook so it's visible.
  • Add to the release procedure an item for testing the 401 notebook.

Finding the mapping of string class labels to integer indices used by a TrainedClassifierModel

Hi folks,

I created a TrainedClassifierModel using a training dataset in which the label column is string-valued. I've noticed that when I apply the TrainedClassifierModel's transform method to a validation dataset, the resulting scored_labels column is integer-valued: presumably those are integer indices for the predicted labels. To produce a human-interpretable result, I'd like to map the scored_labels values back to their corresponding strings.

  • How can I find the label-to-index mapping that the TrainedClassifierModel has learned? Is it exposed as an attribute? (For the plain Spark ML analogue, see the sketch after this list.)
  • Suppose that I manually mapped my string-valued labels to consecutive integer indices beginning at zero, then used that integer column as my label column during training. Will MMLSpark adopt my integer-valued labels as its own class indices, or is there a potential for permutation?
    • If the former is true, then I would know exactly how to map scored_labels indices back to strings.
    • The latter may be true e.g. if MMLSpark simply assigns indices to labels in the order it encounters them.
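
For context, plain Spark ML exposes this mapping through the fitted StringIndexer's labels attribute; a minimal sketch of the recovery I'm after (assuming MMLSpark uses a StringIndexer-like mapping internally, which is a guess; train_df and scored_df are placeholders):

from pyspark.ml.feature import StringIndexer, IndexToString

# fit an indexer on the string label column; labels[i] is the string for index i
indexer_model = StringIndexer(inputCol="label", outputCol="label_idx").fit(train_df)
print(indexer_model.labels)

# map integer predictions back to human-readable strings
to_string = IndexToString(inputCol="scored_labels", outputCol="scored_label_str",
                          labels=indexer_model.labels)
readable = to_string.transform(scored_df)

Note that plain StringIndexer orders labels by descending frequency by default, not by order of appearance, so a manual zero-based mapping is not guaranteed to survive re-indexing.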

Thanks!

MMLSpark model.save changes DFS working directory

With the following diagnostic code (get_hdispark_working_dir is a user helper that reports the current DFS working directory),

hdi_wd = get_hdispark_working_dir()
print("AFTER STATS TRANSFORM, BEFORE WRITE")
print(hdi_wd)

# save model
model.save("outputs/aot-mmlspark.model")
hdi_wd = get_hdispark_working_dir()
print("AFTER WRITE")
print(hdi_wd)

I get:

MMLSpark model:

AFTER STATS TRANSFORM, BEFORE WRITE
wasb://snip/testhdi_1505503317753
Running HDI/Spark job in wasb://snip
AFTER WRITE

Regular Spark model:

Running HDI/Spark job in wasb://snip/testhdi_1505505091869
BEFORE WRITE
wasb://snip/testhdi_1505505091869
Running HDI/Spark job in wasb://snip/testhdi_1505505091869
AFTER WRITE
wasb://snip/testhdi_1505505091869

Note that after the MMLSpark save, the reported working directory has dropped its testhdi suffix, while the regular Spark model's save leaves it unchanged.

Provide sample on how to deploy a model in Spark

MMLSpark provides a way to save the model. It would be nice to take this to the next step and show how to deploy the model to Spark using one of the methods: a Spark UDF and/or a Livy endpoint. It would be great to update one of the notebooks to demonstrate this.
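
As a starting point, a minimal scoring sketch, assuming the saved artifact loads as a standard Spark ML PipelineModel (the actual MMLSpark load API may differ; the path and incoming_df are placeholders):

from pyspark.ml import PipelineModel

# load the previously saved model and score incoming data with it
model = PipelineModel.load("path/to/saved-model")  # placeholder path
scored = model.transform(incoming_df)
scored.show()

Wrapping this in a Spark UDF or calling it through a Livy endpoint would follow the same load-then-transform pattern.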

Make ImageReader fail fast if file path is invalid

Currently, if the path to images is incorrect, the ImageReader fails lazily when it tries to read images. This makes debugging hard because the failure might happen much later in the pipeline during some different operation.

Add a check to ImageReader that validates the correctness and existence of the file path upon instantiation, for a better debugging experience.
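
For illustration, a hedged sketch of such an eager check using the Hadoop FileSystem API through PySpark's JVM gateway (validate_image_path is a hypothetical helper, not the actual ImageReader API):

def validate_image_path(spark, path):
    # fail fast with a clear error instead of a late task failure mid-pipeline
    hpath = spark._jvm.org.apache.hadoop.fs.Path(path)
    fs = hpath.getFileSystem(spark._jsc.hadoopConfiguration())
    if not fs.exists(hpath):
        raise ValueError("Image path does not exist: %s" % path)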

Support cross-validation based FindBestModel

Add a new Estimator, FindBestModelCV, that is like FindBestModel but uses cross-validation instead of evaluation against a held-out dataset.

FindBestModelCV would take a list of untrained models, run cross-validation for each of them using the same fold splits, and compare metrics. It would then train the best model on the full data and return that as the output Model.

Additionally, FindBestModelCV should produce a table of metrics for all sweeps.
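
For illustration, a minimal sketch of the intended behavior built on pyspark.ml.tuning.CrossValidator, using a fixed seed so every candidate sees the same fold splits (the candidate estimators and train_df are placeholders):

from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

candidates = [LogisticRegression(), RandomForestClassifier()]
evaluator = BinaryClassificationEvaluator()

results = []
for est in candidates:
    cv = CrossValidator(estimator=est,
                        estimatorParamMaps=ParamGridBuilder().build(),
                        evaluator=evaluator,
                        numFolds=3,
                        seed=0)  # same seed => same fold splits for each candidate
    cv_model = cv.fit(train_df)
    results.append((est, cv_model.avgMetrics[0]))

# pick the best candidate (higher is better for this evaluator)
# and retrain it on the full training data
best_est, best_metric = max(results, key=lambda t: t[1])
best_model = best_est.fit(train_df)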

`ZipIterator` issues

The current implementation of ZipIterator has a bunch of issues:

  1. If it's possible, it would be much better to implement a plain
    iterator that returns the entries as lazy values that actually do the
    reading when needed. This would make it possible to drop the
    sampling completely, and use something like
    .filter(_ => r.nextDouble < someRatio) instead of baking it in.

  2. Reading the quick description, it's not clear to me that it always
    returns the same elements (i.e., the setSeed(0)) -- but maybe this is
    idiomatic and shouldn't be documented?

  3. Also, there is the known algorithm that returns exactly N random
    elements (reservoir sampling); maybe it's also useful to do that?
    (This would be easy if the first point is done; see the sketch after
    this list.)

  4. The implementation is not too great -- it looks like there are too
    many vars, and the return inside the while loop makes it hard to
    follow. Again, doing the first point would make all of this
    complexity go away.
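
On point 3, a minimal sketch of that algorithm (reservoir sampling), which keeps exactly N uniformly chosen elements from an iterator of unknown length:

import random

def reservoir_sample(iterator, n, seed=0):
    # keep exactly n uniformly random elements from a stream of unknown length
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(iterator):
        if i < n:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < n:
                sample[j] = item       # replace with probability n/(i+1)
    return sample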

Additional R wrappers work

  • Documentation
  • Tests (this will also require adding R to the stuff that gets installed on dev environments).

Featurize needs API review

  • The Featurize estimator has a strange API: it takes a map from a string to a sequence of strings as a parameter.
  • The Featurize estimator uses AssembleFeatures, which for some reason is an Estimator (it should be a Transformer).
  • Featurize should have a helper function that maps column types to featurization pipelines (see the sketch after this list); the logic of the estimator should then be very simple, just composing a pipeline out of these pipelines.
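
A hedged sketch of what that helper's shape could be (the stage choices are illustrative, not the actual Featurize logic):

from pyspark.sql.types import DoubleType, IntegerType, StringType
from pyspark.ml.feature import StringIndexer

def featurization_stages(field):
    # map a StructField's type to the pipeline stages that featurize that column
    if isinstance(field.dataType, StringType):
        return [StringIndexer(inputCol=field.name, outputCol=field.name + "_idx")]
    if isinstance(field.dataType, (DoubleType, IntegerType)):
        return []  # numeric columns can pass straight to the final assembler
    raise ValueError("Unsupported column type: %s" % field.dataType)

# the estimator would then just flatten these per-column stages into one pipeline:
# stages = [s for f in df.schema.fields for s in featurization_stages(f)]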

Trying to run the sample notebook results in an error

Hi, I just included the package using

pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
 --repositories=https://mmlspark.azureedge.net/maven

However, when I run sample notebook 302, the following code

import mmlspark
import numpy as np
from mmlspark import toNDArray

IMAGE_PATH = "datasets/CIFAR10/test"
images = spark.readImages(IMAGE_PATH, recursive = True, sampleRatio = 0.1).cache()
images.printSchema()
print(images.count())

results in this error:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-1-4aed556d1de7> in <module>()
      6 images = spark.readImages(IMAGE_PATH, recursive = True, sampleRatio = 0.1).cache()
      7 images.printSchema()
----> 8 print(images.count())

/home/wonglab/spark_install/spark/python/pyspark/sql/dataframe.py in count(self)
    378         2
    379         """
--> 380         return int(self._jdf.count())
    381 
    382     @ignore_unicode_prefix

/home/wonglab/spark_install/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/home/wonglab/spark_install/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/home/wonglab/spark_install/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o45.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.UnsatisfiedLinkError: org.opencv.imgcodecs.Imgcodecs.imdecode_0(JI)J

Any idea? Thanks in advance!

Document Pip install

  • Is this currently possible?
  • If so, how?
  • This might make the lives of those using PyCharm easier.

Maven Dependency

I added the following dependency to my pom.xml:

<dependency>
    <groupId>com.microsoft.ml.spark</groupId>
    <artifactId>mmlspark</artifactId>
    <version>0.6</version>
    <scope>test</scope>
</dependency>

and also added the repo:

<repository>
    <id>azureedge.net</id>
    <name>MS Azure Maven Repo</name>
    <url>https://mmlspark.azureedge.net/maven</url>
</repository>

Still, Maven cannot find the artifact...

How to read CIFAR images with Scala in mmlspark

Currently, example 301, which evaluates a pre-trained CNTK model on CIFAR-10 images, is written entirely in Python. The example uses pickle.load to read cifar-10-batches-py/test_batch and then parallelizes the data into a distributed RDD; however, I cannot use pickle directly to read the data from a Scala application.

I tried spark.readImages in mmlspark, but it does not seem to handle the cifar-10-batches-bin data well, so I finally chose cookie-datasets to read the CIFAR data in Scala (the master branch of cookie-datasets still targets Spark 1.5, and I upgraded it to Spark 2.1 with the necessary changes).
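
For reference, the cifar-10-batches-bin layout is simple: each record is 1 label byte followed by 3072 pixel bytes (three 32x32 channel planes in R, G, B order). A minimal parsing sketch (NumPy here for brevity; the same record layout applies from Scala):

import numpy as np

def read_cifar10_bin(path):
    # each record: 1 label byte + 3072 pixel bytes (3 planes of 32x32, R/G/B)
    raw = np.fromfile(path, dtype=np.uint8).reshape(-1, 3073)
    labels = raw[:, 0]
    images = raw[:, 1:].reshape(-1, 3, 32, 32)
    return labels, images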

BTW, since you only have Python examples, and I have already ported example 101 and part of example 301 to Scala, would you be interested in that example code?

Train Classifier and Train Regressor need API review

  • Relying on metadata and magic numbers is not idiomatic and makes it hard to parameterize these pipeline stages.
  • Our automatic compute-model-statistics modules rely on this metadata, making the two unnecessarily coupled and unfriendly to the rest of the ecosystem.
  • The metadata wrangling makes the code unnecessarily complex.
  • The two share a huge amount of code, and both should just output pipeline models.
  • The numFeatures default is not set idiomatically.
  • Verify whether TrainClassifier's test needs to be run in a specific directory.
