microsoft / synapseml
Simple and Distributed Machine Learning
Home Page: http://aka.ms/spark
License: MIT License
More generally, switch to using `sys` instead of `System`. Specifically: src/core/env/src/main/scala/EnvironmentUtils.scala
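A minimal sketch of the suggested change, preferring Scala's `sys` wrappers over direct `java.lang.System` calls (the exact call sites in EnvironmentUtils.scala are assumptions):

```scala
// Before: Java-style access, returns null when the variable is missing
val home: String = System.getenv("HOME")

// After: idiomatic Scala, missing values are explicit Options
val homeOpt: Option[String] = sys.env.get("HOME")
val osName: String = sys.props.getOrElse("os.name", "unknown")
```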
There is still a pile of stuff left in /tmp, all of it directories. The offending name patterns that need to be cleaned up are:

- `SavedModels-<N>` (I just cleaned almost 19 thousand of these)
- `MML-Test-<N>` (about 700)
- `MML-Test-<N>powerBI.parquet` (about 160)

Most of these are empty directories, so perhaps there is some broken cleanup that removes files but not the directories. The last pattern is the only one that has files left in it.
This should probably be turned off by default in the tests and only run if the required parameters are set.
Add an Estimator that computes a replacement value for missing values (such as the mean, median, or mode) from the training data. The Estimator then produces a Model that can be applied to replace missing values.
The missing value cleaner should support one or more input columns, and different column types should be supported.
The missing value cleaner should be a PipelineStage so it is compatible with SparkML pipelines.
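A minimal sketch of what such a pair could look like; the name `MissingValueCleaner`, its params, and the mean-only numeric strategy are assumptions for illustration, not the eventual API:

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.{ParamMap, StringArrayParam}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{avg, coalesce, col, lit}
import org.apache.spark.sql.types.StructType

class MissingValueCleaner(override val uid: String)
    extends Estimator[MissingValueCleanerModel] {
  def this() = this(Identifiable.randomUID("MissingValueCleaner"))

  final val inputCols =
    new StringArrayParam(this, "inputCols", "columns to clean")
  def setInputCols(value: Array[String]): this.type = set(inputCols, value)

  // Compute one replacement value (here: the mean) per input column.
  override def fit(dataset: Dataset[_]): MissingValueCleanerModel = {
    val replacements = $(inputCols).map { c =>
      c -> dataset.select(avg(col(c))).head.getDouble(0)
    }.toMap
    new MissingValueCleanerModel(uid, replacements)
  }
  override def copy(extra: ParamMap): MissingValueCleaner = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
}

class MissingValueCleanerModel(override val uid: String,
                               val replacements: Map[String, Double])
    extends Model[MissingValueCleanerModel] {
  // Replace nulls with the values learned at fit() time.
  override def transform(dataset: Dataset[_]): DataFrame =
    replacements.foldLeft(dataset.toDF) { case (df, (c, v)) =>
      df.withColumn(c, coalesce(col(c), lit(v)))
    }
  override def copy(extra: ParamMap): MissingValueCleanerModel =
    new MissingValueCleanerModel(uid, replacements)
  override def transformSchema(schema: StructType): StructType = schema
}
```

Since both classes extend SparkML's Estimator/Model, they are PipelineStages and compose with existing pipelines for free.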
src/cntk-train/src/main/scala/Builder: the `setFoo`s. `printOutput` -> `runWithOutput` in CommandBuilders.scala.
I have a program with ALS.scala:

```scala
class ALS {}
@InternalWrapper
class ALSModel {}
```

I have methods in a program called ALSModel.py. In the generated _ALS.py:

```python
def _ALS(self):
def ALSModel(self):
```

which is conflicting with the name in my provided .py.
I have seen many examples of loading a DNN model, but I want to know how to get a DNN model using mmlspark?
This is needed to inspect the neural network models, especially to get the shape of the input and output layers. There appears to be a Scala method for this purpose, but it's not available in the PySpark API.
Add support for pre-trained DNN models that can be used to extract features from free-form text data, such as word embedding vectors.
These features could then be used as inputs, for example, for document classification models.
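For illustration, a rough sketch of the intended end-to-end usage; it substitutes Spark ML's built-in Word2Vec (trained on the input corpus) for the pre-trained DNN embeddings this issue actually requests, and all column names are placeholders:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{Tokenizer, Word2Vec}

// Free-form text -> tokens -> embedding vectors -> classifier.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val embeddings = new Word2Vec()      // stand-in for pre-trained DNN features
  .setInputCol("words").setOutputCol("features").setVectorSize(100)
val classifier = new LogisticRegression().setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(tokenizer, embeddings, classifier))
// val model = pipeline.fit(trainingDocs)  // DataFrame with "text" and "label"
```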
Hi folks,
I recently provisioned an HDInsight Spark 2.1 cluster and tried to install MMLSpark using the script action URI and instructions, as I've previously done for MMLSpark 0.6 without any issues. The script action fails, showing the following "Debug information" in Azure Portal:
```
{
"href": "http://10.0.0.23:8080/api/v1/clusters/mawahwasb3/requests/40",
"tasks": [
{
"href": "http://10.0.0.23:8080/api/v1/clusters/mawahwasb3/requests/40/tasks/156",
"Tasks": {
"attempt_cnt": 1,
"command": "ACTIONEXECUTE",
"command_detail": "run_customscriptaction ACTIONEXECUTE",
"end_time": 1503679947306,
"error_log": "/var/lib/ambari-agent/data/errors-156.txt",
"exit_code": 1,
"host_name": "hn0-mawahw.3ejwtsbjuzpurdzrrmda4wm3nd.gx.internal.cloudapp.net",
"id": "156",
"output_log": "/var/lib/ambari-agent/data/output-156.txt",
"request_id": "40",
"role": "run_customscriptaction",
"stage_id": "0",
"start_time": 1503679944176,
"status": "FAILED",
"stderr": null,
"stdout": null,
"structured_out": null
}
}, [...and all other tasks "COMPLETED"]
```
The last few lines printed to the mentioned output log (/var/lib/ambari-agent/data/output-156.txt) are:
```
Setting up ocl-icd-libopencl1:amd64 (2.2.8-1) ...
Setting up libhwloc-plugins (1.11.2-3) ...
Processing triggers for libc-bin (2.23-0ubuntu9) ...
[azureml_327951dc2df6f88e104edcd22c5f680e] ('Start downloading script locally: ', u'https://mmlspark.azureedge.net/buildartifacts/0.7/install-mmlspark.sh')
Fromdos line ending conversion successful
('Unexpected error:', "('Execution of custom script failed with exit code', 1)")
Removing temp location of the script
```
And the last few lines of the mentioned error log (/var/lib/ambari-agent/data/errors-156.txt) are:
```
/tmp/tmpyhsA9w: line 50: CNTK_WHEELS[$env]: Unknown conda env for CNTK: azureml_327951dc2df6f88e104edcd22c5f680e
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/custom_actions/scripts/run_customscriptaction.py", line 194, in <module>
ExecuteScriptAction().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 329, in execute
method(env)
File "/var/lib/ambari-agent/cache/custom_actions/scripts/run_customscriptaction.py", line 179, in actionexecute
ExecuteScriptAction.execute_bash_script(bash_script, scriptpath, scriptparams)
File "/var/lib/ambari-agent/cache/custom_actions/scripts/run_customscriptaction.py", line 149, in execute_bash_script
raise Exception("Execution of custom script failed with exit code",exitcode)
Exception: ('Execution of custom script failed with exit code', 1)
```
Is there any other info I can provide to help identify the problem? Thanks in advance for your help!
Add a doc page that describes the source and properties of example datasets available on CDN, similar to this page.
For example, a user might have a very large dataset of images on an existing storage account. It'd be time-consuming to copy it over to newly created blob storage every time the user spins up a cluster.
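A sketch of what attaching an existing account can look like today via standard Hadoop/WASB configuration, rather than copying the data; the account, container, and key below are placeholders:

```scala
// Point Spark at an existing Azure Blob storage account instead of the
// cluster's default storage.
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.myexistingaccount.blob.core.windows.net",
  "<storage-account-key>")
val path = "wasbs://mycontainer@myexistingaccount.blob.core.windows.net/images"
val imageFiles = spark.sparkContext.binaryFiles(path)  // or an image reader
```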
It'd be handy if we could use Python 2.7 out of the box. Currently, the image on Docker Hub (https://hub.docker.com/r/microsoft/mmlspark/) has only Python 3.5.
README.md: link to the 401 notebook as a demonstration of the GPU functionality.
Hi folks,
I created a TrainedClassifierModel using a training dataset in which the label column is string-valued. I've noticed that when I apply the TrainedClassifierModel's transform method to a validation dataset, the resulting scored_labels column is integer-valued: presumably those are integer indices for the predicted labels. To produce a human-interpretable result, I'd like to map the scored_labels values back to their corresponding strings. Is there a recommended way to map the scored_labels indices back to strings? Thanks!
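One standard Spark ML approach is IndexToString; a sketch, assuming the integer indices follow a StringIndexer-style ordering of the original labels (whether TrainClassifier uses exactly that ordering internally is an assumption):

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Recover the index -> string mapping by fitting an indexer on the
// original string labels (assumes the same ordering the trainer used).
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(trainingData)

val converter = new IndexToString()
  .setInputCol("scored_labels")
  .setOutputCol("scored_label_strings")
  .setLabels(indexer.labels)

val readable = converter.transform(scoredValidationData)  // adds string column
```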
Looks like the coordinate structure has changed. The Databricks section of the README.md still shows the old structure.
With the following diagnostic code,

```python
hdi_wd = get_hdispark_working_dir()
print("AFTER STATS TRANSFORM, BEFORE WRITE")
print(hdi_wd)
# save model
model.save("outputs/aot-mmlspark.model")
hdi_wd = get_hdispark_working_dir()
print("AFTER WRITE")
print(hdi_wd)
```
the output for the mmlspark model is:

```
AFTER STATS TRANSFORM, BEFORE WRITE
wasb://snip/testhdi_1505503317753
Running HDI/Spark job in wasb://snip
AFTER WRITE
```

Regular spark model:

```
Running HDI/Spark job in wasb://snip/testhdi_1505505091869
BEFORE WRITE
wasb://snip/testhdi_1505505091869
Running HDI/Spark job in wasb://snip/testhdi_1505505091869
AFTER WRITE
wasb://snip/testhdi_1505505091869
```
Are there any Scala examples, or Scala notebooks, for Databricks, Apache Zeppelin, etc.?
MMLSpark provides a way to save the model. It would be nice to take this to the next step and show how to deploy the model to Spark using one of the methods, such as a Spark UDF and/or a Livy endpoint. It would be great to update one of the notebooks to demonstrate this.
The newest version of sparklyr no longer has the ml_model function. The fix would be in the code generation.
Currently, if the path to images is incorrect, the ImageReader fails lazily when it tries to read the images. This makes debugging hard because the failure might happen much later in the pipeline, during some unrelated operation.
Add a check to ImageReader that validates the existence of the file path upon instantiation, for a better debugging experience.
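A minimal sketch of such an eager check using the Hadoop FileSystem API (the helper name and where it would hook into ImageReader are assumptions):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Fail fast with a clear message instead of erroring lazily mid-pipeline.
def validateImagePath(spark: SparkSession, path: String): Unit = {
  val hadoopPath = new Path(path)
  val fs = hadoopPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
  require(fs.exists(hadoopPath), s"Image path does not exist: $path")
}
```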
This occurs when you run the tests (./runme).
On Databricks, to add an R library I enter two items: a URL and a package name.
https://mmlspark.azureedge.net
mmlspark-0.10.dev24+145.g6c0cdfcc6.zip
It then creates a URL
https://mmlspark.azureedge.net/web/packages/mmlspark-0.10.dev24%2B145.g6c0cdfcc6.zip
But, my R package is at...
https://mmlspark.azureedge.net/rrr/mmlspark-0.10.dev24+145.g6c0cdfcc6.zip
Add a new Estimator, FindBestModelCV, that is like FindBestModel but uses cross-validation instead of evaluation against a dataset.
FindBestModelCV would take a list of one or more untrained models, do cross-validation for each of them using the same fold splits, and compare metrics. It would then train the best model on the full data and return that as the output Model.
Additionally, FindBestModelCV should produce a table of metrics for all sweeps.
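A rough sketch of the core fit logic under those requirements; the function name, signature, and simplified types are assumptions, with MLUtils.kFold supplying identical fold splits for every candidate:

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.{DataFrame, SparkSession}

def findBestModelCV(spark: SparkSession,
                    candidates: Seq[Estimator[_ <: Model[_]]],
                    evaluator: Evaluator,
                    data: DataFrame,
                    numFolds: Int = 3): Model[_] = {
  // Identical splits for all candidates, so metrics are comparable.
  val folds = MLUtils.kFold(data.rdd, numFolds, seed = 0)
  val metrics = candidates.map { est =>
    val foldScores = folds.map { case (trainRDD, testRDD) =>
      val train = spark.createDataFrame(trainRDD, data.schema)
      val test = spark.createDataFrame(testRDD, data.schema)
      evaluator.evaluate(est.fit(train).transform(test))
    }
    est -> foldScores.sum / foldScores.length
  }
  // `metrics` is also the raw material for the requested table of sweeps.
  val (best, _) = if (evaluator.isLargerBetter) metrics.maxBy(_._2)
                  else metrics.minBy(_._2)
  best.fit(data)  // train the winning candidate on the full dataset
}
```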
The current implementation of ZipIterator has a bunch of issues:

- If it's possible, it would be much better to implement a plain iterator that returns the entries as lazy values that actually do the reading when needed. This would make it possible to drop the sampling completely, and use something like `.filter(_ => r.nextDouble < someRatio)` instead of baking it in.
- Reading the quick description, it's not clear to me that it always returns the same elements (i.e., the setSeed(0)) -- but maybe this is idiomatic and shouldn't be documented?
- Also, there is the known algorithm that returns N random elements; maybe it's also useful to do that? (This would be easy if the first point is done; see the sketch after this list.)
- The implementation is not too great -- it looks like there are too many `var`s, and the `return` inside the `while` loop is making it hard to follow. Again, doing the first point would make all of this complexity go away.
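A sketch of that known algorithm (reservoir sampling), assuming the first point is done so the zip entries arrive as a plain lazy Iterator:

```scala
import scala.reflect.ClassTag
import scala.util.Random

// Reservoir sampling: one pass, uniformly keeps n elements of the stream.
def sampleN[T: ClassTag](items: Iterator[T], n: Int, seed: Long = 0): Array[T] = {
  val rng = new Random(seed)
  val reservoir = new Array[T](n)
  var count = 0
  items.foreach { item =>
    if (count < n) reservoir(count) = item
    else {
      val j = rng.nextInt(count + 1)   // keep item with probability n/(count+1)
      if (j < n) reservoir(j) = item
    }
    count += 1
  }
  if (count < n) reservoir.take(count) else reservoir
}
```

Each element ends up in the reservoir with equal probability, in a single pass and O(n) memory, which fits the lazy-iterator design above.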
This is a bit confusing and will cause complications if both parameters are set independently. Try extending the latter and adding custom setters on both the Scala and Python sides.
Notebooks demonstrating R code.
Hi, I just included the package using:

```
pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.5 \
        --repositories=https://mmlspark.azureedge.net/maven
```
However, when I run the sample notebook example 302, the following code

```python
import mmlspark
import numpy as np
from mmlspark import toNDArray

IMAGE_PATH = "datasets/CIFAR10/test"
images = spark.readImages(IMAGE_PATH, recursive = True, sampleRatio = 0.1).cache()
images.printSchema()
print(images.count())
```
results in the following error:

```
Py4JJavaError Traceback (most recent call last)
<ipython-input-1-4aed556d1de7> in <module>()
6 images = spark.readImages(IMAGE_PATH, recursive = True, sampleRatio = 0.1).cache()
7 images.printSchema()
----> 8 print(images.count())
/home/wonglab/spark_install/spark/python/pyspark/sql/dataframe.py in count(self)
378 2
379 """
--> 380 return int(self._jdf.count())
381
382 @ignore_unicode_prefix
/home/wonglab/spark_install/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/home/wonglab/spark_install/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/home/wonglab/spark_install/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o45.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.UnsatisfiedLinkError: org.opencv.imgcodecs.Imgcodecs.imdecode_0(JI)J
```
Any idea? Thanks in advance!
Add LightGBM learners to MMLSpark as Spark Estimators and Transformers. We can generate the Java wrappers through SWIG.
When trying to add a new Evaluator, the Python class is not autogenerated.
Add support for scoring TensorFlow-based DNN models, similar to CNTK models, with options for selecting a pre-trained network and an output layer for featurization.
Find out why the two commented cells with ComputeModelStatistics take so much time, and fix it in some better way.
I added the following dependency to my pom.xml:

```xml
<dependency>
  <groupId>com.microsoft.ml.spark</groupId>
  <artifactId>mmlspark</artifactId>
  <version>0.6</version>
  <scope>test</scope>
</dependency>
```

and also added the repo:

```xml
<repository>
  <id>azureedge.net</id>
  <name>MS Azure Maven Repo</name>
  <url>https://mmlspark.azureedge.net/maven</url>
</repository>
```

Still Maven cannot find the artifact...
Right now we cannot serialize custom params without custom code.
Currently, example 301, which evaluates a pre-trained CNTK model with CIFAR10 images, is written entirely in Python. The example uses pickle.load to read cifar-10-batches-py/test_batch and then parallelizes it to a distributed RDD; however, I cannot directly use pickle to read the data in a Scala application.
I tried spark.readImages in mmlspark, but it seems it cannot deal with the cifar-10-batches-bin data well. I finally chose cookie-datasets to read the CIFAR data in Scala (the master branch of cookie-datasets still uses Spark 1.5, and I upgraded it to Spark 2.1 with the necessary changes).
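For reference, the CIFAR-10 binary format is simple enough to read directly without an extra library; a sketch, with the file path as a placeholder (per the CIFAR-10 docs each record is 1 label byte followed by 32x32x3 pixel bytes):

```scala
// Each CIFAR-10 binary record is 3073 bytes: 1 label byte, then
// 3072 pixel bytes (32x32 image, channel planes R, G, B).
val recordLength = 1 + 32 * 32 * 3
val records = spark.sparkContext
  .binaryRecords("cifar-10-batches-bin/test_batch.bin", recordLength)
val labeled = records.map { bytes =>
  val label = bytes(0) & 0xFF                 // class index 0-9
  val pixels = bytes.slice(1, recordLength)   // raw RGB planes
  (label, pixels)
}
```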
BTW, since you only have Python examples, and I have already ported the 101 and part of the 301 example code to Scala, I'm not sure whether you want this example code?