salesforce / transmogrifai Goto Github PK

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning

Home Page: https://transmogrif.ai

License: BSD 3-Clause "New" or "Revised" License

Scala 98.96% Shell 0.02% Java 0.26% Jupyter Notebook 0.76%

ml automl transformations estimators dsl pipelines machine-learning scala salesforce einstein

transmogrifai's Introduction

TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation, and an API that enforces compile-time type-safety, modularity, and reuse. Through automation, it achieves accuracies close to hand-tuned models with almost 100x reduction in time.

Use TransmogrifAI if you need a machine learning library to:

Build production ready machine learning applications in hours, not months
Build machine learning models without getting a Ph.D. in machine learning
Build modular, reusable, strongly typed machine learning workflows

To understand the motivation behind TransmogrifAI check out these:

Open Sourcing TransmogrifAI: Automated Machine Learning for Structured Data, a blog post by @snabar
Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions, a talk by @tovbinm
Low Touch Machine Learning, a talk by @leahmcguire

Skip to Quick Start and Documentation.

Predicting Titanic Survivors with TransmogrifAI

The Titanic dataset is an often-cited dataset in the machine learning community. The goal is to build a machine learnt model that will predict survivors from the Titanic passenger manifest. Here is how you would build the model using TransmogrifAI:

import com.salesforce.op._
import com.salesforce.op.readers._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._

// Read Titanic data as a DataFrame
val passengersData = DataReaders.Simple.csvCase[Passenger](path = pathToData).readDataset().toDF()

// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

// Automated model selection
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(pred).train()

println("Model summary:\n" + model.summaryPretty())

Model summary:

Evaluated Logistic Regression, Random Forest models with 3 folds and AuPR metric.
Evaluated 3 Logistic Regression models with AuPR between [0.6751930383321765, 0.7768725281794376]
Evaluated 16 Random Forest models with AuPR between [0.7781671467343991, 0.8104798040316159]

Selected model Random Forest classifier with parameters:
|-----------------------|--------------|
| Model Param           |     Value    |
|-----------------------|--------------|
| modelType             | RandomForest |
| featureSubsetStrategy |         auto |
| impurity              |         gini |
| maxBins               |           32 |
| maxDepth              |           12 |
| minInfoGain           |        0.001 |
| minInstancesPerNode   |           10 |
| numTrees              |           50 |
| subsamplingRate       |          1.0 |
|-----------------------|--------------|

Model evaluation metrics:
|-------------|--------------------|---------------------|
| Metric Name | Hold Out Set Value |  Training Set Value |
|-------------|--------------------|---------------------|
| Precision   |               0.85 |   0.773851590106007 |
| Recall      | 0.6538461538461539 |  0.6930379746835443 |
| F1          | 0.7391304347826088 |  0.7312186978297163 |
| AuROC       | 0.8821603927986905 |  0.8766642291593114 |
| AuPR        | 0.8225075757571668 |   0.850331080886535 |
| Error       | 0.1643835616438356 | 0.19682151589242053 |
| TP          |               17.0 |               219.0 |
| TN          |               44.0 |               438.0 |
| FP          |                3.0 |                64.0 |
| FN          |                9.0 |                97.0 |
|-------------|--------------------|---------------------|

Top model insights computed using correlation:
|-----------------------|----------------------|
| Top Positive Insights |      Correlation     |
|-----------------------|----------------------|
| sex = "female"        |   0.5177801026737666 |
| cabin = "OTHER"       |   0.3331391338844782 |
| pClass = 1            |   0.3059642953159715 |
|-----------------------|----------------------|
| Top Negative Insights |      Correlation     |
|-----------------------|----------------------|
| sex = "male"          |  -0.5100301587292186 |
| pClass = 3            |  -0.5075774968534326 |
| cabin = null          | -0.31463114463832633 |
|-----------------------|----------------------|

Top model insights computed using CramersV:
|-----------------------|----------------------|
|      Top Insights     |       CramersV       |
|-----------------------|----------------------|
| sex                   |    0.525557139885501 |
| embarked              |  0.31582347194683386 |
| age                   |  0.21582347194683386 |
|-----------------------|----------------------|

While this may seem a bit too magical, for those who want more control, TransmogrifAI also provides the flexibility to completely specify all the features being extracted and all the algorithms being applied in your ML pipeline. Visit our docs site for full documentation, getting started, examples, faq and other information.

Adding TransmogrifAI into your project

You can simply add TransmogrifAI as a regular dependency to an existing project. Start by picking TransmogrifAI version to match your project dependencies from the version matrix below (if not sure - take the stable version):

TransmogrifAI Version	Spark Version	Scala Version	Java Version
0.7.1 (unreleased, master), 0.7.0 (stable)	2.4	2.11	1.8
0.6.1, 0.6.0, 0.5.3, 0.5.2, 0.5.1, 0.5.0	2.3	2.11	1.8
0.4.0, 0.3.4	2.2	2.11	1.8

For Gradle in build.gradle add:

repositories {
    jcenter()
    mavenCentral()
}
dependencies {
    // TransmogrifAI core dependency
    compile 'com.salesforce.transmogrifai:transmogrifai-core_2.11:0.7.0'

    // TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
    // compile 'com.salesforce.transmogrifai:transmogrifai-models_2.11:0.7.0'
}

For SBT in build.sbt add:

scalaVersion := "2.11.12"

resolvers += Resolver.jcenterRepo

// TransmogrifAI core dependency
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0"

// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
// libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-models" % "0.7.0"

Then import TransmogrifAI into your code:

// TransmogrifAI functionality: feature types, feature builders, feature dsl, readers, aggregators etc.
import com.salesforce.op._
import com.salesforce.op.aggregators._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.readers._

// Spark enrichments (optional)
import com.salesforce.op.utils.spark.RichDataset._
import com.salesforce.op.utils.spark.RichRDD._
import com.salesforce.op.utils.spark.RichRow._
import com.salesforce.op.utils.spark.RichMetadata._
import com.salesforce.op.utils.spark.RichStructType._

Quick Start and Documentation

Visit our docs site for full documentation, getting started, examples, faq and other information.

See scaladoc for the programming API.

Authors

Kevin Moore @jauntbox
Kin Fai Kan @kinfaikan
Leah McGuire @leahmcguire
Matthew Tovbin @tovbinm
Max Ovsiankin @maxov
Michael Loh @mikeloh77
Michael Weil @michaelweilsalesforce
Shubha Nabar @snabar
Vitaly Gordon @vitalyg
Vlad Patryshev @vpatryshev

Internal Contributors (prior to release)

Chris Rupley @crupley
Chris Wu @cjwooo
Eric Wayman @ericwayman
Felipe Oliveira @feliperazeek
Gera Shegalov @gerashegalov
Jean-Marc Soumet @ajmssc
Marco Vivero @marcovivero
Mario Rodriguez @mrodriguezsfiq
Mayukh Bhaowal @mayukhb
Minh-An Quinn @minhanquinn
Nicolas Drizard @nicodri
Oleg Gusak @ogusak
Patrick Framption @tricktrap
Ryle Goehausen @ryleg
Sanmitra Ijeri @sanmitra
Sky Chen @almandsky
Sophie Xiaodan Sun @sxd929
Till Bergmann @tillbe
Xiaoqian Liu @wingsrc

License

BSD 3-Clause © Salesforce.com, Inc.

transmogrifai's People

Contributors

Stargazers

Watchers

Forkers

ajayborra daminisatya knowscieng kishorebt mattprivman donaldrivard sudjoshi sxd929 ebottabi samdpark its2mc jonnydubowsky sfofthings seratch sumit33k ravitezu tovbinm shaunstanislauslau frankensteinai himsmittal samangel93 dreadlord1984 mohnkhan vigaeatery egorafanasenko antonioiba vpm238 codeaudit xyuan tonyle9 wwjiang007 bylake chetan009 experimenti spencerx udacitysimon kumar-asista kckrithika wangqiaoshi oopsoutofmemory im-elvis dailyactie bgfurfeature gokulsfdc sravanthi-konduru ruparelmetarya fosukg hongyunnchen nkhuyu wei-he ouyangshourui ghostflare76 kmader jierfei1007 acrushdjn hal2001 simsausaurabh tillbe neelshah18 jbdatascience dnuang rameshoswal xiongzheng pranishd1 andykung77 gfeldman simpledatalabsinc greensuse nanaakwasiabayieboateng dreamforcehackathon shadowkun eban143 thomasgui76 chengtaoyuan chenhuimin spirit-dongdong cdalzell devhttps dichoto merlintang almandsky qibaoyuan gnanam336 earlbabson pizzaeueu skasturi jiangming1 geogubd mysonhushu huang-z-h boxianlai winnerineast wh-forker yuzhanggdut shiyufeng03 patrickliu95 cfmcgrady hatleon wuboui zhuohuwu0603

transmogrifai's Issues

Remove or document Direct Output Committers limitations

Describe the bug
Direct output committer and vanilla committers are susceptible to list inconsistency on s3. It only works with layers on top of s3 such as S3Guard or EMRFS as long as one relies on _SUCCESS files.

To Reproduce
Run on pure s3 :)

Expected behavior
Use the default committer. Just refer users to options out there and how to configure them

Logs or screenshots
N/A

Additional context
N/A

Create new BinaryClassificationEvaluator that computes relationship between binned probabilities and outcomes

Problem
We would like a new Evaluator for binary classification which bins the probabilities predicted for the data and compares it to the average outcome of values that fell in that bin.

Solution
This estimator should extend the OpBinaryClassificationEvaluatorBase class. It should return a new class extending EvaluationMetrics which contains the bins, the counts in each bin, the average outcome of records in each bin and the brier score https://en.wikipedia.org/wiki/Brier_score. An
example of this kind of estimator can be found here: https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/evaluators/OpBinaryClassificationEvaluator.scala

Build & tests parallelism - Part II

Problem
Currently our build & tests are only parallelized by core and non-code projects.

Solution
Parallelize the build further by using Gradle test regex (e.g. by namespace) and adding more parallel steps in TravisCI matrix and CircleCI workflow.

For instance CircleCI 2.0 supports balancing tests by timings - https://circleci.com/docs/2.0/parallelism-faster-jobs/#splitting-by-timings-data

Alternatives
?! propose your own ;)

Issue running on Databricks

I'm having issue accessing the spark context when running on databricks. I've imported all libraries as part of the examples in the titanic snippet but now luck

See screenshot

Is it possible to do multi-label classification with TransmogrifAI?

Problem
This is more a question, and a feature request if the answer to the question is no.

Is it possible to do multi-label classification with TransmogrifAI?

Solution
I should be able to generate models that map features to vectors by assigning 1 or 0 to each element of the vector.

Alternatives
NA

Additional context
Note that this is different from multi-class classification (the Iris example)

Multi-class classification

makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time [1]

In the other hand multi-label classification

can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these. [1]

References:

transmogrifai: command not found

Problem
Follow the tutorial to create own.
transmogrifai gen --input data/cadata.csv --id houseId --response medianHouseValue --overwrite --auto HouseObject RealEstateApp
While, gain the transmogrifai: command not found
Solution
I do this
export OP_HOME="/path/to/TransmogrifAI/"
alias op="java -cp $OP_HOME/cli/build/libs/* com.salesforce.op.cli.CLI"
in ~/.bashrc

Alternatives
Describe alternatives you've considered. A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context about the feature request here.

Quick Start Examples (Titanic) Does not Build with 0.3.4 branch

Describe the bug
When I execute Titanic Binary Classification example
./gradlew compileTestScala installDist it can not download need jar package，

I hope you can provide a stable jar package address。。。

thanks

Allow adding/getting metadata on the model level

Problem
We would like to be able to store some types of metadata of the level of the model, e.g instance holdout splits. Currently there is no easy way to store metadata with the trained model, except if not through OpParams, stage metadata or manually if a separate file.

Solution
We would like to add two methods to OpWorkflowModel class to allow saving/loading metadata with the model:

class OpWorkflowModel { 
   def addMetadata[T <: Product](name: String, value: T): this.type
   def getMetadata[T <: Product](name: String): Try[T]
}

Alternatives
Store metadata in custom map in OpParams or in a separate file.

Additional context
N/A

Validator does not check for number of label classes and number of rows for each split

Describe the bug
None of the implementations of OpValidator.createTrainValidationSplits does not check that:

number of label classes is at least 2
splits are not empty

To Reproduce
Run OpCrossValidation or OpTrainValidationSplit of data with a single row.

Expected behavior
Validate that number of label classes if at least 2 and splits are not empty. Throw a clean and informative error in each of the cases.

Can somebody help me understand the outputs of the scripts?

Problem
The meaning of the output of the ./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.OpTitanicSimple -Dargs="pwd/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv" is not clear

Solution
Could somebody help me through it chunk by chunk?

Alternatives
Build in more interpretable result presentation in the script.

Additional context
Can't paste the result here.... way too long....

Replace AvroInOut with spark-avro or at least upgrade Hadoop API

Problem
AvroInOut works great but it uses the old org.apache.hadoop.mapred Hadoop API

Solution
Consider replacing it with spark-avro or at least use the newer org.apache.hadoop.mapreduce Hadoop API

Alternatives
NA

Additional context
This change would require extensive testing to make sure we have the AvroInOut functionality the same.
Starting from adding unit tests and also running some real Spark jobs.

TransmogirfAI build issues

Alternatively you can disable this build feature completely by removing this line https://github.com/salesforce/TransmogrifAI/blob/master/build.gradle#L26
from your build.gradle file.

I have commented line 26 and tried to build again:. Now the build fails at line 164.
The trace is given below:

FAILURE: Build failed with an exception.

Where:
Build file '/home/cuda/TransmogrifAI/build.gradle' line: 164
What went wrong:
A problem occurred evaluating root project 'transmogrifai'.

Could not get unknown property 'createVersionProperties' for root project 'transmogrifai' of type org.gradle.api.Project.

In Line 164 , I find the following
Line 164:
if (System.getenv("CI") != 'true') {
jar.dependsOn(createVersionProperties)
}

I set the envirnment variable CI to false. Then I get the following error at line 231:

shadowJar.dependsOn(createVersionProperties)

The variable createVersionProperties is is required by the script. Please help me to find out the most viable solution to work around this problem,.

Can I use the best model for prediction by test data?

Problem
I I got the best model through the transmogriAI,But I don't konwn how to prediction the test set?

Built failed when running TitanicDataset! But my JAVA version is right

I failed when running the TitanicDataset with the command:

./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.OpTitanicSimple -Dargs="\
`pwd`/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv"

My JAVA version is already 1.8. And I install the package completely the same as the Installation document. But the procedure returns:

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':sparkSubmit'.
> A problem occurred starting process 'command 'C:\spark-2.2.2-bin-hadoop2.7/bin/spark-submit''

The message is:

 Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':sparkSubmit'.
        at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeActions(ExecuteActionsTaskExecuter.java:110)
        at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.execute(ExecuteActionsTaskExecuter.java:77)
        at org.gradle.api.internal.tasks.execution.OutputDirectoryCreatingTaskExecuter.execute(OutputDirectoryCreatingTaskExecuter.java:51)
        at org.gradle.api.internal.tasks.execution.SkipUpToDateTaskExecuter.execute(SkipUpToDateTaskExecuter.java:59)
        at org.gradle.api.internal.tasks.execution.ResolveTaskOutputCachingStateExecuter.execute(ResolveTaskOutputCachingStateExecuter.java:54)
        at org.gradle.api.internal.tasks.execution.ValidatingTaskExecuter.execute(ValidatingTaskExecuter.java:59)
        at org.gradle.api.internal.tasks.execution.SkipEmptySourceFilesTaskExecuter.execute(SkipEmptySourceFilesTaskExecuter.java:101)
        at org.gradle.api.internal.tasks.execution.FinalizeInputFilePropertiesTaskExecuter.execute(FinalizeInputFilePropertiesTaskExecuter.java:44)
        at org.gradle.api.internal.tasks.execution.CleanupStaleOutputsExecuter.execute(CleanupStaleOutputsExecuter.java:91)
        at org.gradle.api.internal.tasks.execution.ResolveTaskArtifactStateTaskExecuter.execute(ResolveTaskArtifactStateTaskExecuter.java:62)
        at org.gradle.api.internal.tasks.execution.SkipTaskWithNoActionsExecuter.execute(SkipTaskWithNoActionsExecuter.java:59)
        at org.gradle.api.internal.tasks.execution.SkipOnlyIfTaskExecuter.execute(SkipOnlyIfTaskExecuter.java:54)
        at org.gradle.api.internal.tasks.execution.ExecuteAtMostOnceTaskExecuter.execute(ExecuteAtMostOnceTaskExecuter.java:43)
        at org.gradle.api.internal.tasks.execution.CatchExceptionTaskExecuter.execute(CatchExceptionTaskExecuter.java:34)
        at org.gradle.api.internal.tasks.execution.EventFiringTaskExecuter$1.run(EventFiringTaskExecuter.java:51)
        at org.gradle.internal.operations.DefaultBuildOperationExecutor$RunnableBuildOperationWorker.execute(DefaultBuildOperationExecutor.java:300)
        at org.gradle.internal.operations.DefaultBuildOperationExecutor$RunnableBuildOperationWorker.execute(DefaultBuildOperationExecutor.java:292)
        at org.gradle.internal.operations.DefaultBuildOperationExecutor.execute(DefaultBuildOperationExecutor.java:174)
        at org.gradle.internal.operations.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:90)
        at org.gradle.internal.operations.DelegatingBuildOperationExecutor.run(DelegatingBuildOperationExecutor.java:31)
        at org.gradle.api.internal.tasks.execution.EventFiringTaskExecuter.execute(EventFiringTaskExecuter.java:46)
        at org.gradle.execution.taskgraph.LocalTaskInfoExecutor.execute(LocalTaskInfoExecutor.java:42)
        at org.gradle.execution.taskgraph.DefaultTaskExecutionGraph$BuildOperationAwareWorkItemExecutor.execute(DefaultTaskExecutionGraph.java:277)
        at org.gradle.execution.taskgraph.DefaultTaskExecutionGraph$BuildOperationAwareWorkItemExecutor.execute(DefaultTaskExecutionGraph.java:262)
        at org.gradle.execution.taskgraph.DefaultTaskPlanExecutor$ExecutorWorker$1.execute(DefaultTaskPlanExecutor.java:135)
        at org.gradle.execution.taskgraph.DefaultTaskPlanExecutor$ExecutorWorker$1.execute(DefaultTaskPlanExecutor.java:130)
        at org.gradle.execution.taskgraph.DefaultTaskPlanExecutor$ExecutorWorker.execute(DefaultTaskPlanExecutor.java:200)
        at org.gradle.execution.taskgraph.DefaultTaskPlanExecutor$ExecutorWorker.executeWithWork(DefaultTaskPlanExecutor.java:191)
        at org.gradle.execution.taskgraph.DefaultTaskPlanExecutor$ExecutorWorker.run(DefaultTaskPlanExecutor.java:130)
        at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
        at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
        at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)
Caused by: org.gradle.process.internal.ExecException: A problem occurred starting process 'command 'C:\spark-2.2.2-bin-hadoop2.7/bin/spark-submit''
        at org.gradle.process.internal.DefaultExecHandle.execExceptionFor(DefaultExecHandle.java:231)
        at org.gradle.process.internal.DefaultExecHandle.setEndStateInfo(DefaultExecHandle.java:209)
        at org.gradle.process.internal.DefaultExecHandle.failed(DefaultExecHandle.java:355)
        at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:85)
        at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
        ... 3 more
Caused by: net.rubygrapefruit.platform.NativeException: Could not start 'C:\spark-2.2.2-bin-hadoop2.7/bin/spark-submit'
        at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:27)
        at net.rubygrapefruit.platform.internal.WindowsProcessLauncher.start(WindowsProcessLauncher.java:22)
        at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
        at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
        ... 4 more
Caused by: java.io.IOException: Cannot run program "C:\spark-2.2.2-bin-hadoop2.7/bin/spark-submit" (in directory "C:\spark-2.2.2-bin-hadoop2.7"): CreateProcess error=193, %1 is not a valid Win32 application
        at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
        ... 7 more
Caused by: java.io.IOException: CreateProcess error=193, %1 is not a valid Win32 application
        ... 8 more

I wonder what goes wrong? Thanks!

Missing com.salesforce.hw.iris.Iris class in iris example

Describe the bug
A clear and concise description of what the bug is.
hi , all.
I started with the example of iris, but i can't find "com.salesforce.hw.iris.Iris".

To Reproduce
Minimal set of steps or code snippet to reproduce the behavior

Expected behavior
A clear and concise description of what you expected to happen.

Logs or screenshots
If applicable, add logs or screenshots to help explain your problem.

Additional context
Add any other context about the problem here.

Build failed while running titanic code

Describe the bug
java.lang.ClassNotFoundException: com.salesforce.hw.OpTitanicSimple

To Reproduce
./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.OpTitanicSimple -Dargs="
pwd/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv"

Expected behavior
Successful build and run through this use case with claimed result

Logs or screenshots
Main class:
com.salesforce.hw.OpTitanicSimple
Arguments:
/Users/tsu001/Documents/fun_stuff/TransmogrifAI/helloworld/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv
System properties:
(spark.repl.local.jars,)
(spark.driver.memory,4G)
(SPARK_SUBMIT,true)
(spark.serializer,org.apache.spark.serializer.KryoSerializer)
(spark.app.name,op-helloworld:com.salesforce.hw.OpTitanicSimple)
(spark.jars,file:/Applications/spark-2.2.2-bin-hadoop2.7/)
(spark.submit.deployMode,client)
(spark.master,local[4])
Classpath elements:
file:/Applications/spark-2.2.2-bin-hadoop2.7/

java.lang.ClassNotFoundException: com.salesforce.hw.OpTitanicSimple
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:466)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:566)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:499)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:374)
at org.apache.spark.util.Utils$.classForName(Utils.scala:233)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:732)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

FAILURE: Build failed with an exception.

What went wrong:
Execution failed for task ':sparkSubmit'.

Process 'command '/Applications/spark-2.2.2-bin-hadoop2.7/bin/spark-submit'' finished with non-zero exit value 101

Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
Get more help at https://help.gradle.org

BUILD FAILED in 24s

Additional context
This is hello world.... I can't run it through :(

Parquet readers

Problem
Missing implementations for Parquet readers.

Solution
Implement Parquet data readers: simple, aggregate and conditional.

Alternatives
N/A

FeatureBuilder.fromDataFrame should automatically infer advanced types

Problem
Currently FeatureBuilder.fromDataFrame only infers a set of primitive TransmogrifAI types directly mapped from dataframe schema, such as Text, Real etc. But more advanced types, such as PickList, Email, Phone etc. are not inferred, because it requires some knowledge of actual data in the dataframe.

Solution
Estimate value distribution for each feature and deduct a more appropriate column type. Potentially reuse logic from CSVSchemaUtils.infer, SmartTextVectorizer and RawFeatureFilter.

Alternatives
N/A

Update scala-graph library to eliminate scalacheck from runtime

Problem
features module at the moment is using scala-graph 1.11.2 library, which has transitive dependency on scalacheck testing library, but not in the test or optional scope. Probably it was a minor mistake in the early days of scala-graph development, which was fixed quite long ago. From the end-user perspective this might be suspicious at minimum, e.g. my first question during dependency troubleshooting and spark job assembly building was: what this test dependency is doing in the runtime?

Solution
The aforementioned fix is available since 1.11.4, while the last version in the 1.11 branch is 1.11.5.
Solution might be as simple as bumping the scalaGraphVersion, thus reducing dependency tree a bit and simplifying the assembly building for the users.

Upgrade Apache Spark to 2.4

As you may already know, Apache Spark 2.4.0 was released on Nov 3.

https://spark.apache.org/releases/spark-release-2-4-0.html

Are you already planning to upgrade Spark version? I am interested in working on the upgrade task.

Investigate which classes require registration with Kryo

Problem
We might have some unregistered classes with Kryo that compromise performance.

Solution
Investigate which classes require registration with Kryo by set spark.kryo.registrationRequired=true in TestSparkContext. Then go over failing tests and see if any classes need to be registered in OpKryoRegistrator.

Additional context
Only do this once we upgrade to Spark 2.3 - #44

Non fatal URISyntaxException when running a workflow on spark-shell

Describe the bug
When executing a workflow interactively using spark-shell, I get a lot of URISyntaxExceptions from the Spark ExecutorClassLoader.

ERROR ExecutorClassLoader: Failed to check existence of class <root>.package on REPL class server at spark://192.168.178.56:56688/classes
java.net.URISyntaxException: Illegal character in path at index 37: spark://192.168.178.56:56688/classes/<root>/package.class

Note that the workflow still finishes correctly, so it's more an annoyance than a serious issue.

To Reproduce
Copy the following snippet, which is the TitanicMini example as REPL version into a file titanic-mini.sc

import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.readers.DataReaders
import com.salesforce.op.stages.impl.classification._
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry.OpLogisticRegression
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, LogManager}

case class Passenger(
  id: Long,
  survived: Double,
  pClass: Option[Long],
  name: Option[String],
  sex: Option[String],
  age: Option[Double],
  sibSp: Option[Long],
  parCh: Option[Long],
  ticket: Option[String],
  fare: Option[Double],
  cabin: Option[String],
  embarked: Option[String]
)

implicit val s = spark
import s.implicits._
LogManager.getLogger("com.salesforce.op").setLevel(Level.ERROR)
val passengersData = DataReaders.Simple.csvCase[Passenger](Some("TitanicPassengersTrainData.csv"), key = _.id.toString).readDataset().toDF()
val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
val featureVector = features.transmogrify()
val checkedFeatures = survived.sanityCheck(featureVector, checkSample = 1.0, removeBadFeatures = true)
val prediction = BinaryClassificationModelSelector.
  withCrossValidation(modelTypesToUse = Seq(OpLogisticRegression)). 
  setInput(survived, checkedFeatures).getOutput()
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(prediction).train()
println(model.summaryPretty())

Run spark-shell and load the example:

wget https://raw.githubusercontent.com/salesforce/TransmogrifAI/master/helloworld/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv
$SPARK_HOME/bin/spark-shell --packages com.salesforce.transmogrifai:transmogrifai-core_2.11:0.4.0
...
scala> :load titanic-mini.sc

Expected behavior
Execution of the example without errors.

Logs or screenshots

...
18/09/30 11:27:45 ERROR ExecutorClassLoader: Failed to check existence of class <root>.package on REPL class server at spark://192.168.178.56:56688/classes
java.net.URISyntaxException: Illegal character in path at index 37: spark://192.168.178.56:56688/classes/<root>/package.class
	at java.net.URI$Parser.fail(URI.java:2848)
	at java.net.URI$Parser.checkChars(URI.java:3021)
	at java.net.URI$Parser.parseHierarchical(URI.java:3105)
	at java.net.URI$Parser.parse(URI.java:3053)
	at java.net.URI.<init>(URI.java:588)
	at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:327)
	at org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
	at org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
...

Additional context
Seen on MacOS and Linux. Only tested on a local cluster.

Could not find com.salesforce.transmogrifai:transmogrifai-core_2.11:0.4.1-SNAPSHOT

Describe the bug
I'm getting an error while following the instructions on https://docs.transmogrif.ai/en/stable/examples/Bootstrap-Your-First-Project.html

To Reproduce

╭─noah at MacBook-Pro in ~/Projects/TransmogrifAI  (62aed6e ✘)
╰─± git checkout 0.4.0
HEAD is now at 62aed6e... Update Running-from-Spark-Shell.md
╭─noah at MacBook-Pro in ~/Projects/TransmogrifAI  (62aed6e ✘)
╰─± ./gradlew cli:shadowJar

> Task :utils:scalaStyle
Found 0 warnings
Found 0 errors

> Task :features:scalaStyle
Found 0 warnings
Found 0 errors

> Task :readers:scalaStyle
Found 0 warnings
Found 0 errors

> Task :core:scalaStyle
Found 0 warnings
Found 0 errors

> Task :cli:scalaStyle
Found 0 warnings
Found 0 errors

Deprecated Gradle features were used in this build, making it incompatible with Gradle 5.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/4.10/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 55s
29 actionable tasks: 21 executed, 8 up-to-date
╭─noah at MacBook-Pro in ~/Projects/TransmogrifAI  (62aed6e ✘)
╰─± alias transmogrifai="java -cp `pwd`/cli/build/libs/\* com.salesforce.op.cli.CLI"
╭─noah at MacBook-Pro in ~/Projects/TransmogrifAI  (62aed6e ✘)
╰─± transmogrifai gen --input `pwd`/test-data/PassengerDataAll.csv \
  --id passengerId --response survived \
  --schema `pwd`/test-data/PassengerDataAll.avsc \
  --answers cli/passengers.answers Titanic --overwrite
Starting simple project generation
Generating 'Titanic' in '/Users/noah/Projects/TransmogrifAI/./titanic/' with template 'simple'
log4j:WARN No appenders could be found for logger (org.reflections.Reflections).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/.project'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/.settings/org.eclipse.buildship.core.prefs'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/gradle.properties'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/README.md'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/.gitignore'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/gradle/wrapper/gradle-wrapper.properties'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/src/main/scala/com/salesforce/app/Titanic.scala'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/build.gradle'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/src/main/avro/PassengerDataAll.avsc'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/settings.gradle'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/spark.gradle'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/src/main/scala/com/salesforce/app/Features.scala'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/gradlew'
  Created '/Users/noah/Projects/TransmogrifAI/./titanic/gradle/wrapper/gradle-wrapper.jar'
Done.
To get started, read the README.md file in the directory you just created
╭─noah at MacBook-Pro in ~/Projects/TransmogrifAI  (62aed6e ✘)
╰─± cd titanic
╭─noah at MacBook-Pro in ~/Projects/TransmogrifAI/titanic  (62aed6e ✘)
╰─± ./gradlew compileTestScala installDist
> Task :compileJava FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Could not resolve all files for configuration ':compileClasspath'.
> Could not find com.salesforce.transmogrifai:transmogrifai-core_2.11:0.4.1-SNAPSHOT.
  Searched in the following locations:
    - https://jcenter.bintray.com/com/salesforce/transmogrifai/transmogrifai-core_2.11/0.4.1-SNAPSHOT/maven-metadata.xml
    - https://jcenter.bintray.com/com/salesforce/transmogrifai/transmogrifai-core_2.11/0.4.1-SNAPSHOT/transmogrifai-core_2.11-0.4.1-SNAPSHOT.pom
    - https://jcenter.bintray.com/com/salesforce/transmogrifai/transmogrifai-core_2.11/0.4.1-SNAPSHOT/transmogrifai-core_2.11-0.4.1-SNAPSHOT.jar
    - https://repo.maven.apache.org/maven2/com/salesforce/transmogrifai/transmogrifai-core_2.11/0.4.1-SNAPSHOT/maven-metadata.xml
    - https://repo.maven.apache.org/maven2/com/salesforce/transmogrifai/transmogrifai-core_2.11/0.4.1-SNAPSHOT/transmogrifai-core_2.11-0.4.1-SNAPSHOT.pom
    - https://repo.maven.apache.org/maven2/com/salesforce/transmogrifai/transmogrifai-core_2.11/0.4.1-SNAPSHOT/transmogrifai-core_2.11-0.4.1-SNAPSHOT.jar
  Required by:
      project :

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 1s
2 actionable tasks: 2 executed


╭─noah at MacBook-Pro in ~/Projects/TransmogrifAI/titanic  (62aed6e ✘)
╰─± scala -version
Scala code runner version 2.12.7 -- Copyright 2002-2018, LAMP/EPFL and Lightbend, Inc.
╭─noah at MacBook-Pro in ~/Projects/TransmogrifAI/titanic  (62aed6e ✘)
╰─± apache-spark -version
zsh: command not found: apache-spark
╭─noah at MacBook-Pro in ~/Projects/TransmogrifAI/titanic  (62aed6e ✘)
╰─± spark-shell -version
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.3:4040
Spark context available as 'sc' (master = local[*], app id = local-1538195242400).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Expected behavior
I expected the example to compile

Logs or screenshots
If applicable, add logs or screenshots to help explain your problem.

Additional context
Add any other context about the problem here.

More test specs

Problem
Currently there is no standard way for users to test out their readers, features, workflows and apps.

Solution
Similarly to OpTransformerSpec and OpEstimatorSpec we provide test facilities to allow testing readers, features, workflows and apps: OpReaderSpec, OpFeatureSpec, OpWorkflowSpec and OpAppSpec accordingly.

Alternatives
N/A

Error: Could not find or load main class com.salesforce.op.cli.CLI

I am working on windows...... for building I used in Windows command prompt -
gradlew.bat cli:shadowJar
then for alias i used-
DOSKEY transmogrifai=java -cp $OP_HOME/cli/build/libs/* com.salesforce.op.cli.CLI

But when i am trying to

generate using transmogrifai

I am getting error as -
Error: Could not find or load main class com.salesforce.op.cli.CLI

Any help will be appreciated and please let me know if i did anything wrong

Geolocation to Country transformer

Problem
We would like to be able to treat GeoLocation values as categorical features.

Solution
Add unary transformer to convert Geolocation values into Country.

Alternatives
N/A

XGBoost error code 255

Hi I am trying to use the new XGBoost support in master (latest commit d0785f0) but I am facing the following issue:
Here the code (BinaryClassification of Titanic Dataset=passengersData, targetColumn is Survived)

val (saleprice, features) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = targetColumn)
      
val featureVector = features.transmogrify()

val checkedFeatures = saleprice.sanityCheck(featureVector, removeBadFeatures = true)

val prediction = BinaryClassificationModelSelector.withCrossValidation(modelTypesToUse = Seq(
        OpXGBoostClassifier
      )).setInput(saleprice, checkedFeatures).getOutput()

val wf = new OpWorkflow()

val model = wf.setInputDataset(passengersData).setResultFeatures(passengerId, checkedFeatures,prediction).train()

val results = "Model summary:\n" + model.summaryPretty()
println(results)

Attached the log and the error
logxg.txt

Use TransmogrifAI with CDH 6.0

I want to use TransmogrifAI with CDH6.0, but encounter some problems. After installation of TransmogrifAI following the quick start doc, I submit the Titanic example with spark-submit command, a classnotfound exception is thrower, which said class com.fasterxml.jackson.module.scala.modifiers.EithModule was not found. Then I found TransmogrifAI depends on jackson-module-scala_2.11-2.6.5.jar, and the class exists in this jar. But CDH depends on jackson-module-scala_2.11-2.9.5.jar, which not contains this class. So what should I do to tackle the problem and run the demo with CDH? Thxs!

Titanic evaluate throws an error on 0.3.4 and master branches

@chatbotnlp Sep 19 15:14
Hello team , I am trying to run simple titanic example Train and score method are working fine but when i try to run Evaluate method i am getting error.

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'probability' given input columns: [parch, key, name_2-stagesApplied_OPVector_00000000000f, age_1-stagesApplied_OPVector_00000000000d, age-cabin-embarked-name-pClass-parch-

full log https://pastebin.com/mhWWht6k

Response & Predictor traits

Problem
Currently there is no type checks available to allow response / predictor specific feature engineering. For instance for response feature of type FeatureLike[RealNN] it makes sense to have feature.calibrate shortcut.

Solution
Add Response / Predictor traits to FeatureLike

Run in sbt

Describe the bug
I make a new Scala project using sbt. I simply use the exact code in the OpTitanicSimple example

I get this exception:

java.lang.AbstractMethodError

on line

val prediction =
            BinaryClassificationModelSelector.withTrainValidationSplit(
                modelTypesToUse = Seq(OpLogisticRegression)
            ).setInput(survived, finalFeatures).getOutput()

To Reproduce
Here's my build.sbt file:

name := "auto-titanic"

version := "0.1"

// This is important for Transmorgrif and Spark
scalaVersion := "2.11.12"

resolvers += Resolver.jcenterRepo
resolvers += Resolver.mavenCentral

// TransmogrifAI core dependency
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.4.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.2"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.3.2"

Additionally, I also change the SparkContext line to:

implicit val spark = SparkSession.builder.config("spark.master", "local").getOrCreate()

Java version is 1.8.0_181

Expected behavior
To run normally as in the provided example

Logs or screenshots
If applicable, add logs or screenshots to help explain your problem.

Additional context
Add any other context about the problem here.

Illegal character in path at index 2: ..\test-data\PassengerData.avro

I am working on windows
when i am running OpWorkflowRunnerLocalTest, there comes out a exception
\utils\src\main\scala\com\salesforce\op\utils\io\avro\AvroInOut.scala
function: selectExistingPaths
my filepath is D:\SparkDir\SalesForceFromGitHub\test-data\PassengerDataAll.avro
but " val firstPath = paths.head.split("/")"
when make a change ：" val firstPath = paths.head.split("\\")"
it passed

FAILURE: Build failed with an exception

After that i want to try one of classificiation model which is here:

https://docs.transmogrif.ai/en/stable/examples/Titanic-Binary-Classification.html

But during running command :

cd helloworld
./gradlew compileTestScala installDist
./gradlew -q sparkSubmit -Dmain=com.salesforce.hw.OpTitanicSimple -Dargs="
pwd/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv"
I am getting this error:

FAILURE: Build failed with an exception.

What went wrong:
Execution failed for task ':sparkSubmit'.

A problem occurred starting process 'command 'null/bin/spark-submit''

Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
Get more help at https://help.gradle.org

BUILD FAILED in 2s
I tried to find out but couldn't get any idea about this , How can i solve this issue ?

Cannot find getRawFeatureDistribution() in any files

I've seen a definition about getRawFeatureDistribution() method in https://scaladoc.transmogrif.ai/index.html#package. But I can't find any related code in current commit. How can I get raw feature distribution for now?

Find good JVM implementation of DBSCAN and wrap it into TransmogrifAI

Problem
We would like to offer DBSCAN as an algorithm in TransmogrifAI but there is no spark implementation.

Solution
Find an implementation of DBSCAN (https://en.wikipedia.org/wiki/DBSCAN) in the JVM with good support and edge case coverage and wrap it into TransmogrifAI. See https://docs.transmogrif.ai/en/stable/developer-guide/index.html#wrapping-a-non-serializable-external-library and https://docs.transmogrif.ai/en/stable/developer-guide/index.html#writing-your-own-estimator

Alternatives
If you feel you can cover all the edge cases you could also write a new implementation of DBSCAN

Build & tests parallelism - Part I

Problem
Currently our build & tests are sequential, perhaps it would be better to have it parallelized.

Solution

Parallelize build & tests using Gradle (if not possible, leverage CircleCI parallelism)
Evaluate performance difference

Alternatives
N/A

java.lang.NoClassDefFoundError: org/apache/spark/ml/classification/LinearSVCParams

Describe the bug
I am trying to reproduce the Titanic example from intellij idea. When I try to execute the below code, it give me the following error...

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ml/classification/LinearSVCParams
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector$.defaultModelsAndParams(BinaryClassificationModelSelector.scala:87)
	at com.salesforce.op.stages.impl.selector.ModelSelectorFactory$class.selector(ModelSelectorFactory.scala:73)
	at com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector$.selector(BinaryClassificationModelSelector.scala:47)
	at com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector$.withTrainValidationSplit(BinaryClassificationModelSelector.scala:195)
	at org.formcept.integration.transmogrifAITest$.main(transmogrifAITest.scala:69)
	at org.formcept.integration.transmogrifAITest.main(transmogrifAITest.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ml.classification.LinearSVCParams
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 30 more

To Reproduce

package org.fakeOrg.FakeComponent
import com.salesforce.op._
import com.salesforce.op.evaluators.Evaluators
import com.salesforce.op.features._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.readers.DataReaders
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry.OpLogisticRegression
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object transmogrifAITest {
  case class Passenger(id: Int, survived: Int, pClass: Option[Int], name: Option[String], sex: Option[String],
                       age: Int, sibSp: Option[Int], parCh: Option[Int], ticket: Option[String],
                       fare: Option[Double], cabin: Option[String], embarked: Option[String])
  def main(args: Array[String]): Unit = {
    implicit val spark = SparkSession.builder.config(new SparkConf()).master("local[*]").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._
// Define features using the OP types based on the data
    val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse
    val pClass = FeatureBuilder.PickList[Passenger].extract(_.pClass.map(_.toString).toPickList).asPredictor
    val name = FeatureBuilder.Text[Passenger].extract(_.name.toText).asPredictor
    val sex = FeatureBuilder.PickList[Passenger].extract(_.sex.map(_.toString).toPickList).asPredictor
    val age = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor
    val sibSp = FeatureBuilder.Integral[Passenger].extract(_.sibSp.toIntegral).asPredictor
    val parCh = FeatureBuilder.Integral[Passenger].extract(_.parCh.toIntegral).asPredictor
    val ticket = FeatureBuilder.PickList[Passenger].extract(_.ticket.map(_.toString).toPickList).asPredictor
    val fare = FeatureBuilder.Real[Passenger].extract(_.fare.toReal).asPredictor
    val cabin = FeatureBuilder.PickList[Passenger].extract(_.cabin.map(_.toString).toPickList).asPredictor
    val embarked = FeatureBuilder.PickList[Passenger].extract(_.embarked.map(_.toString).toPickList).asPredictor
    // TRANSFORMED FEATURES
    // Do some basic feature engineering using knowledge of the underlying dataset
    val familySize = sibSp + parCh + 1
    val estimatedCostOfTickets = familySize * fare
    val pivotedSex = sex.pivot()
    val normedAge = age.fillMissingWithMean().zNormalize()
    val ageGroup = age.map[PickList](_.value.map(v => if (v > 18) "adult" else "child").toPickList)
    val passengerFeatures = Seq(pClass, name, age, sibSp, parCh, ticket, cabin, embarked, familySize, estimatedCostOfTickets, pivotedSex, ageGroup).transmogrify()
    // Optionally check the features with a sanity checker
    val sanityCheck = true
    val finalFeatures = if (sanityCheck) survived.sanityCheck(passengerFeatures) else passengerFeatures
    // Define the model we want to use (here a simple logistic regression) and get the resulting output
    val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(modelTypesToUse = Seq(OpLogisticRegression)).setInput(survived, finalFeatures).getOutput()
    val evaluator = Evaluators.BinaryClassification().setLabelCol(survived).setPredictionCol(prediction)
    // WORKFLOW
    // Define a way to read data into our Passenger class from our CSV file
    val trainDataReader = DataReaders.Simple.csvCase[Passenger](path = Option("/**fakePath**/titanic.csv"), key = _.id.toString)
    // Define a new workflow and attach our data reader
    val workflow = new OpWorkflow().setResultFeatures(survived, prediction).setReader(trainDataReader)
    // Fit the workflow to the data
    val fittedWorkflow = workflow.train()
    println(s"Summary: ${fittedWorkflow.summary()}")
    // Manifest the result features of the workflow
    println("Scoring the model")
    val (dataframe, metrics) = fittedWorkflow.scoreAndEvaluate(evaluator = evaluator)
    println("Transformed dataframe columns:")
    dataframe.columns.foreach(println)
    println("Metrics:")
    println(metrics)
  }
}

My build.sbt looks like this...

//Spark dependecies
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-catalyst_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-launcher_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "2.1.0"

// TransmogrifAI core dependency
resolvers += Resolver.jcenterRepo
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.4.0"
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-features" % "0.4.0"
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-readers" % "0.4.0"

IS THERE SOMETHING THAT I AM MISSING?

Dataframe Encoders for TransmogrifAI types

Problem
Currently TransmogrifAI implements a bunch of custom functions to encode/decode TransmogrifAI type to/from Spark dataframe native types (see FeatureSparkTypes, FeatureTypeSparkConverter and FeatureTypeFactory). This method requires applying converters each time values are encoded/decoded to/from a Spark dataframe.

Solution
We need to have a proper implementation of org.apache.spark.sql.Encoder to handle TransmogrifAI types efficiently.

Alternatives
N/A

Additional context
Ideally we should also avoid boxing/unboxing into TransmogrifAI but this would require a major refactoring. This is up for a discusion.

View model's predictions and probabilities of each prediction class

Problem
Hi currently using the Model score functionality I can only see predictions but not the associated probability

Solution
I would like to view predictions and probabilities of each prediction class.
One example of the desired functionality is described here:
https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html

Thanks.
Alberto.

Data Readers with Auto Schema and Auto Class

Problem
Actually DataReaders need a case class or type to load the data inside a data frame, requesting the developer to code upfront the case class representing the schema of the data .

Solution
Enable the option to automatically infer schema and class in the data loaders.

Alternatives
Add a way to pass the case class at runtime, at the moment it is only possible at compile time.

java.lang.NoClassDefFoundError: com/fasterxml/jackson/module/scala/modifiers/EitherModule

We're using TransmogrifAI in a project which also pulls in a newer version of Jackson from another dependency. When we do that, we see the exception

[error] Exception in thread "main" java.lang.NoClassDefFoundError: com/fasterxml/jackson/module/scala/modifiers/EitherModule
[error]         at java.lang.ClassLoader.defineClass1(Native Method)
[error]         at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
[error]         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
[error]         at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
[error]         at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
[error]         at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
[error]         at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
[error]         at java.security.AccessController.doPrivileged(Native Method)
[error]         at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
[error]         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[error]         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
[error]         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[error]         at java.lang.ClassLoader.defineClass1(Native Method)
[error]         at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
[error]         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
[error]         at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
[error]         at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
[error]         at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
[error]         at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
[error]         at java.security.AccessController.doPrivileged(Native Method)
[error]         at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
[error]         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[error]         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
[error]         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[error]         at com.salesforce.op.utils.json.JsonLike$class.toJson(JsonUtils.scala:179)

EitherModule was moved in this commit.

No feature type tag mapping for Spark type IntegerType

Describe the bug
spark sql type integer is not mapped to featuretype
in the source code,type is mapped as bellow.
def featureTypeTagOf(sparkType: DataType, isNullable: Boolean): WeakTypeTag[_ <: FeatureType] = sparkType match {
case DoubleType if !isNullable => weakTypeTag[types.RealNN]
case DoubleType => weakTypeTag[types.Real]
// add this line and tweak method fromSparkFn in FeatureTypeSparkConverter.scala can fix the bug
// case IntegerType => weakTypeTag[types.Integral]

case LongType => weakTypeTag[types.Integral]
case ArrayType(StringType, _) => weakTypeTag[types.TextList]
case StringType => weakTypeTag[types.Text]
case BooleanType => weakTypeTag[types.Binary]
......
case _ => throw new IllegalArgumentException(s"No feature type tag mapping for Spark type $sparkType")

}
and change line 153 ( in method fromSparkFn) in FeatureTypeSparkConverter.scala to:
case wt if wt <:< weakTypeOf[t.Integral] => (value: Any) => {
var result: Any = FeatureTypeDefaults.Integral.value
if (value == null) {result = FeatureTypeDefaults.Integral.value}
else if (value.isInstanceOf[Integer]) {result = Some(value.asInstanceOf[Int].toLong)}
else {result = Some(value.asInstanceOf[Long])}
result
}

To Reproduce
package com.jd.spark
import com.salesforce.op._
import com.salesforce.op.readers._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
case class Passenger(
id: Int,
survived: Int,
pClass: Option[Int],
name: Option[String],
sex: Option[String],
age: Option[Double],
sibSp: Option[Int],
parCh: Option[Int],
ticket: Option[String],
fare: Option[Double],
cabin: Option[String],
embarked: Option[String])
object TestSalesForce {
def main(args: Array[String]) = {
implicit val spark = SparkSession.builder.config(new SparkConf().setMaster("local")).enableHiveSupport()
.getOrCreate()
import spark.implicits._

// Read Titanic data as a DataFrame
val passengersData = DataReaders.Simple.csvCase[Passenger](path = Option("D:\\JDdev\\TransmogrifAI\\test-data\\PassengerDataAll.csv")).readDataset().toDF()
 // .selectExpr("id", "cast (survived as double)survived", "pClass", "name", "sex", "age", "sibSp", "parCh", "ticket", "fare", "cabin", "embarked")
// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")

// Automated feature engineering

val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

// Automated model selection
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(checkedFeatures).train()
val num=1
val lnum=num.asInstanceOf[Long]
//val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(Array(pred._1,pred._2,pred._3):_*).train()
var score = model.score().cache()
val testf=score.schema("label").dataType
val field1 = score.schema.fields(0).name
val field2 = score.schema.fields(1).name
var test = score.withColumnRenamed(s"$field2", "vec")
//test=test.withColumn("label", $"0.0")
val code = (arg: String) => {
  0.0
}
val addCol = udf(code)
test = test.withColumn("label", addCol(test(s"$field1")))
test=test.select("label", "vec")
test.printSchema()
test.write.format("orc").save("test/orc")
test.write.format("json").save("test/json")
test.write.format("libsvm").save("test/libsvm")

//println("Model summary:\n" + model.summaryPretty())

}
}

Expected behavior
spark sql Integer type will map to feature type integal

Logs or screenshots

Exception in thread "main" java.lang.IllegalArgumentException: No feature type tag mapping for Spark type IntegerType
at com.salesforce.op.features.FeatureSparkTypes$.featureTypeTagOf(FeatureSparkTypes.scala:216)
at com.salesforce.op.features.FeatureBuilder$$anonfun$1.apply(FeatureBuilder.scala:199)
at com.salesforce.op.features.FeatureBuilder$$anonfun$1.apply(FeatureBuilder.scala:196)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at com.salesforce.op.features.FeatureBuilder$.fromDataFrame(FeatureBuilder.scala:196)
at com.jd.spark.TestSalesForce$.main(TestSalesForce.scala:33)
at com.jd.spark.TestSalesForce.main(TestSalesForce.scala)

Additional context
map from int to feature type

Can't use the cloned project

When I am trying to follow the example in the Dev guide I am stuck at the step below:

Build the TransmogrifAI CLI by running:

./gradlew cli:shadowJar
alias transmogrifai="java -cp pwd/cli/build/libs/* com.salesforce.op.cli.CLI"

getting following error during build. Can you please help?

c:\Program Files\Git\transmogrifai>gradlew cli:shadowJar alias transmogrifai="ja
va -cp pwd/cli/build/libs/* com.salesforce.op.cli.CLI"
Downloading https://services.gradle.org/distributions/gradle-4.10-bin.zip
..........................................................................

Welcome to Gradle 4.10!

Here are the highlights of this release:

Incremental Java compilation by default
Periodic Gradle caches cleanup
Gradle Kotlin DSL 1.0-RC3
Nested included builds
SNAPSHOT plugin versions in the plugins {} block

For more details see https://docs.gradle.org/4.10/release-notes.html

Starting a Gradle Daemon (subsequent builds will be faster)

FAILURE: Build failed with an exception.

What went wrong:
Task 'alias' not found in root project 'transmogrifai'.

Add Helloworld to CI

Problem
Currently Helloworld project is not CI verified and has to be tested manually prior to each release.

Solution
Add a CI build for Helloworld project.

Alternatives
NA

Model Load from a brand new workflow

Problem
Actually in order to score a model that is a result of an automl selection I have first to train the model, save it to disk, load it and score against the scoring dataset and all of this it is performed in the same method

Solution
Once a model is saved, it should be possible to load it without the need of the original workflow that created the model itself.

Alternatives
Actually I do not find alternatives.

Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:

I am trying to work with one binary classification example same as titanic problem demo, But i am getting Null value error , I have already checked the csv file there are no null values but still it is showing null values.
My data looks like this main file code is:

/*
 * Copyright (c) 2017, Salesforce.com, Inc.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * 1. Redistributions of source code must retain the above copyright notice,
 * this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright notice,
 * this list of conditions and the following disclaimer in the documentation
 * and/or other materials provided with the distribution.
 *
 * 3. Neither the name of Salesforce.com nor the names of its contributors may
 * be used to endorse or promote products derived from this software without
 * specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */

package com.salesforce.hw

import com.salesforce.op._
import com.salesforce.op.evaluators.Evaluators
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.readers.DataReaders
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector
import com.salesforce.op.stages.impl.classification.ClassificationModelsToTry._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * Define a case class corresponding to our data file (nullable columns must be Option types)
 *
 * @param id       passenger id
 * @param survived 1: survived, 0: did not survive
 * @param pClass   passenger class
 * @param name     passenger name
 * @param sex      passenger sex (male/female)
 * @param age      passenger age (one person has a non-integer age so this must be a double)
 * @param sibSp    number of siblings/spouses traveling with this passenger
 * @param parCh    number of parents/children traveling with this passenger
 * @param ticket   ticket id string
 * @param fare     ticket price
 * @param cabin    cabin id string
 * @param embarked location where passenger embarked
 */
case class Passenger
(
  id: Option[Int],
  blue_ca_1: Option[Int],
  blue_ca_2: Option[Int],
  blue_ca_3: Option[Int],
  blue_ca_4: Option[Int],
  blue_ca_5: Option[Int],
  blue_ca_6: Option[Int],
  blue_ca_7: Option[Int],
  blue_ca_8: Option[Int],
  blue_ca_9: Option[Int],
  blue_ca_10: Option[Int],
  blue_ca_11: Option[Int],
  blue_ca_12: Option[Int],
  blue_ca_13: Option[Int],
  blue_ca_14: Option[Int],
  blue_ca_15: Option[Int],
  blue_ca_16: Option[Int],
  blue_ca_17: Option[Int],
  outcome: Int
)

/**
 * A simplified TransmogrifAI example classification app using the Titanic dataset
 */
object OpTitanicSimple {

  /**
   * Run this from the command line with
   * ./gradlew sparkSubmit -Dmain=com.salesforce.hw.OpTitanicSimple -Dargs=/full/path/to/csv/file
   */
  def main(args: Array[String]): Unit = {
    if (args.isEmpty) {
      println("You need to pass in the CSV file path as an argument")
      sys.exit(1)
    }
    val csvFilePath = args(0)
    println(s"Using user-supplied CSV file path: $csvFilePath")

    // Set up a SparkSession as normal
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName.stripSuffix("$"))
    implicit val spark = SparkSession.builder.config(conf).getOrCreate()

    ////////////////////////////////////////////////////////////////////////////////
    // RAW FEATURE DEFINITIONS
    /////////////////////////////////////////////////////////////////////////////////

    // Define features using the OP types based on the data
    val blue_ca_1 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_1.toIntegral).asPredictor
    val blue_ca_2 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_2.toIntegral).asPredictor
    val blue_ca_3 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_3.toIntegral).asPredictor
    val blue_ca_4 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_4.toIntegral).asPredictor
    val blue_ca_5 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_5.toIntegral).asPredictor
    val blue_ca_6 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_6.toIntegral).asPredictor
    val blue_ca_7 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_7.toIntegral).asPredictor
    val blue_ca_8 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_8.toIntegral).asPredictor
    val blue_ca_9 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_9.toIntegral).asPredictor
    val blue_ca_10 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_10.toIntegral).asPredictor
    val blue_ca_11 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_11.toIntegral).asPredictor
    val blue_ca_12 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_12.toIntegral).asPredictor
    val blue_ca_13 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_13.toIntegral).asPredictor
    val blue_ca_14 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_14.toIntegral).asPredictor
    val blue_ca_15 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_15.toIntegral).asPredictor
    val blue_ca_16 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_16.toIntegral).asPredictor
    val blue_ca_17 = FeatureBuilder.Integral[Passenger].extract(_.blue_ca_17.toIntegral).asPredictor
    val outcome = FeatureBuilder.RealNN[Passenger].extract(_.outcome.toRealNN).asResponse

    ////////////////////////////////////////////////////////////////////////////////
    // TRANSFORMED FEATURES
    /////////////////////////////////////////////////////////////////////////////////

    // Do some basic feature engineering using knowledge of the underlying dataset

    // Define a feature of type vector containing all the predictors you'd like to use
    val passengerFeatures = Seq(
      blue_ca_1, blue_ca_2, blue_ca_3, blue_ca_4, blue_ca_5, blue_ca_6,
      blue_ca_7, blue_ca_8, blue_ca_9, blue_ca_10, blue_ca_11,
      blue_ca_12, blue_ca_13, blue_ca_14, blue_ca_15, blue_ca_16, blue_ca_17
    ).transmogrify()

    // Optionally check the features with a sanity checker
    val sanityCheck = true
    val finalFeatures = if (sanityCheck) outcome.sanityCheck(passengerFeatures) else passengerFeatures

    // Define the model we want to use (here a simple logistic regression) and get the resulting output
    val (prediction, rawPrediction, prob) =
      BinaryClassificationModelSelector.withTrainValidationSplit()
        .setModelsToTry(LogisticRegression)
        .setInput(outcome, finalFeatures).getOutput()

    val evaluator = Evaluators.BinaryClassification()
      .setLabelCol(outcome)
      .setRawPredictionCol(rawPrediction)
      .setPredictionCol(prediction)
      .setProbabilityCol(prob)

    ////////////////////////////////////////////////////////////////////////////////
    // WORKFLOW
    /////////////////////////////////////////////////////////////////////////////////

    import spark.implicits._ // Needed for Encoders for the Passenger case class
    // Define a way to read data into our Passenger class from our CSV file
    val trainDataReader = DataReaders.Simple.csvCase[Passenger](
      path = Option(csvFilePath),
      key = _.id.toString
    )

    // Define a new workflow and attach our data reader
    val workflow =
      new OpWorkflow()
        .setResultFeatures(outcome, rawPrediction, prob, prediction)
        .setReader(trainDataReader)

    // Fit the workflow to the data
    val fittedWorkflow = workflow.train()
    println(s"Summary: ${fittedWorkflow.summary()}")

    // Manifest the result features of the workflow
    println("Scoring the model")
    val (dataframe, metrics) = fittedWorkflow.scoreAndEvaluate(evaluator = evaluator)

    println("Transformed dataframe columns:")
    dataframe.columns.foreach(println)
    println("Metrics:")
    println(metrics)
  }
}

And Passenger file ( variables definition file ) looks like this:

{
  "type" : "record",
  "name" : "Passenger",
  "namespace" : "com.salesforce.hw.tpo",
  "fields" : [ {
    "name" : "blue_ca_1",
    "type" : [ "double", "null" ]
  }, {
    "name" : "blue_ca_2",
    "type" : [ "double", "null" ],
    "default": 0
  }, {
    "name" : "blue_ca_3",
    "type" : [ "double", "null" ]
  }, {
    "name" : "blue_ca_4",
    "type" : [ "double", "null" ]
  }, {
    "name" : "blue_ca_5",
    "type" : [ "double", "null" ]
  }, {
    "name" : "blue_ca_6",
    "type" : [ "double", "null" ]
  }, {
    "name" : "blue_ca_7",
    "type" : [ "double", "null" ]
  }, {
    "name" : "blue_ca_8",
    "type" : [ "double", "null" ]
  }, {
    "name" : "blue_ca_9",
    "type" : [ "double", "null" ]
  }, {
    "name" : "blue_ca_10",
    "type" : [ "double", "null" ]
  }, {
    "name" : "blue_ca_11",
    "type" : [ "int", "null" ]
  }, {
    "name" : "blue_ca_12",
    "type" : [ "int", "null" ]
  }, {
    "name" : "blue_ca_13",
    "type" : [ "int", "null" ]
  }, {
    "name" : "blue_ca_14",
    "type" : [ "int", "null" ]
  }, {
    "name" : "blue_ca_15",
    "type" : [ "int", "null" ]
  }, {
    "name" : "blue_ca_16",
    "type" : [ "int", "null" ]
  }, {
    "name" : "blue_ca_17",
    "type" : [ "int", "null" ]
  }, {
    "name" : "outcome",
    "type" : [ "int", "null" ]
  } ]
}

Here are all the files https://github.com/monk1337/TransmogrifAI-Auto-ml

I Have two questions :

response variables should be cast in real ( float ) ? because outcome will be probability of two classes which will be float values so should i do this ? :

val outcome = FeatureBuilder.RealNN[Passenger].extract(_.outcome.toRealNN).asResponse

or this :

    val outcome = FeatureBuilder.Integral[Passenger].extract(_.outcome.toIntegral).asResponse

Second thing , How to solve that Null value issue and run this successfully ?

Thanks in advance !

Evaluate integration with Sparser library

Problem
Json parsing is slow and we should allow a better way.

Solution
Evaluate Sparser - https://dawn.cs.stanford.edu/2018/08/07/sparser/

Implement a JSON reader using Spark
Implement a JSON reader using Sparser
Evaluate performance

Alternatives
?!

Factor out all common spark/hadoop properties

Problem
We currently largely overlapping spark.gradle files especially in terms of spark properties.

$ git ls-files | grep spark.gradle
gradle/spark.gradle
helloworld/gradle/spark.gradle
templates/simple/spark.gradle

Solution
Provide a way to have a single spark.gradle or at least a single spark-transmogrifai.conf file with common properties that is passed via --properties-file to Spark.

Alternatives

common properties file
refactored spark.gradle

Additional context
DRY

sanitychecker seems do not work for realmap featuretype

Describe the bug
sanitychecker seems do not work for realmap featuretype

Logs or screenshots

18/10/08 10:57:58 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: requirement failed: Vector column 99 has multiple null indicator fields: ArrayBuffer((99,({
  "parentFeatureName" : [ "delv_way_cd_count" ],
  "parentFeatureType" : [ "com.salesforce.op.features.types.RealMap" ],
  "grouping" : "99",
  "indicatorValue" : "NullIndicatorValue",
  "index" : 87,
  "nullIndicator" : true
},87)), (99,({
  "parentFeatureName" : [ "pay_mode_cd_count" ],
  "parentFeatureType" : [ "com.salesforce.op.features.types.RealMap" ],
  "grouping" : "99",
  "indicatorValue" : "NullIndicatorValue",
  "index" : 126557,
  "nullIndicator" : true
},126557)))
java.lang.IllegalArgumentException: requirement failed: Vector column 99 has multiple null indicator fields: ArrayBuffer((99,({
  "parentFeatureName" : [ "delv_way_cd_count" ],
  "parentFeatureType" : [ "com.salesforce.op.features.types.RealMap" ],
  "grouping" : "99",
  "indicatorValue" : "NullIndicatorValue",
  "index" : 87,
  "nullIndicator" : true
},87)), (99,({
  "parentFeatureName" : [ "pay_mode_cd_count" ],
  "parentFeatureType" : [ "com.salesforce.op.features.types.RealMap" ],
  "grouping" : "99",
  "indicatorValue" : "NullIndicatorValue",
  "index" : 126557,
  "nullIndicator" : true
},126557)))
	at com.salesforce.op.stages.impl.preparators.SanityChecker$$anonfun$makeColumnStatistics$2.apply(SanityChecker.scala:286)
	at com.salesforce.op.stages.impl.preparators.SanityChecker$$anonfun$makeColumnStatistics$2.apply(SanityChecker.scala:284)
	at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
	at com.salesforce.op.stages.impl.preparators.SanityChecker.makeColumnStatistics(SanityChecker.scala:284)
	at com.salesforce.op.stages.impl.preparators.SanityChecker.fitFn(SanityChecker.scala:652)
	at com.salesforce.op.stages.base.binary.BinaryEstimator.fit(BinaryEstimator.scala:107)
	at com.salesforce.op.stages.base.binary.BinaryEstimator.fit(BinaryEstimator.scala:61)
	at com.salesforce.op.utils.stages.FitStagesUtil$$anonfun$20.apply(FitStagesUtil.scala:265)
	at com.salesforce.op.utils.stages.FitStagesUtil$$anonfun$20.apply(FitStagesUtil.scala:264)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at com.salesforce.op.utils.stages.FitStagesUtil$.com$salesforce$op$utils$stages$FitStagesUtil$$fitAndTransformLayer(FitStagesUtil.scala:264)
	at com.salesforce.op.utils.stages.FitStagesUtil$$anonfun$17.apply(FitStagesUtil.scala:227)
	at com.salesforce.op.utils.stages.FitStagesUtil$$anonfun$17.apply(FitStagesUtil.scala:225)
	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
	at com.salesforce.op.utils.stages.FitStagesUtil$.fitAndTransformDAG(FitStagesUtil.scala:225)
	at com.salesforce.op.OpWorkflow.fitStages(OpWorkflow.scala:384)
	at com.salesforce.op.OpWorkflow.train(OpWorkflow.scala:335)

Additional context
as far as I am concerned,map featuretype are for sparse features (for most shallow ML tasks in real world.features are sparse)

Quick Start Examples (Titanic) Does not Build with master

Describe the bug
When using titanic dataset (Bootstrap Your First Project page) ,
./gradlew compileTestScala installDist
command is trying to pull *core-3.50-snapshot jar, which is not available. This causes the build to fail.

To Reproduce
Run the steps mentioned in Bootstrap Your First Project on master branch. It would fail for ./gradlew compileTestScala installDist
It works for 0.3.4 branch only.
Expected behavior
For developers : it should work on master
Possible solution : Use shadowJar to generate core jar which will have the desired dependencies and then use the local jars for building by adding flatDir path in the gradle template.

Logs or screenshots
If applicable, add logs or screenshots to help explain your problem.

Additional context
Add any other context about the problem here.

salesforce / transmogrifai Goto Github PK

transmogrifai's Introduction

TransmogrifAI

Predicting Titanic Survivors with TransmogrifAI

Adding TransmogrifAI into your project

Quick Start and Documentation

Authors

Internal Contributors (prior to release)

License

transmogrifai's People

Contributors

Stargazers

Watchers

Forkers

transmogrifai's Issues

References:

Recommend Projects

Recommend Topics

Recommend Org