eharmony / aloha
A Scala-based feature generation and modeling framework
Home Page: http://eharmony.github.io/aloha
License: MIT License
Add a JSON function like the one on line 282 of https://github.com/eHarmony/aloha/blob/master/aloha-vw-jni/src/main/scala/com/eharmony/aloha/models/vw/jni/VwJniModel.scala
This will make it easier to create H2O models
Documentation written for a first-time user who doesn't know why this is cool and maybe doesn't know Java or Scala.
Clean up links, fill in appropriate pages
When doing contextual bandits it's essential to record the probability with which the action was shown. To do this we should make use of explore-java to get these probabilities
Add the ability to do arg max models. This naive implementation would identify one (or more) parameters that can vary during score time. At training time these parameters are known, but at test time they are combinatorially tried one at a time and the set that produces the max (or min) is returned.
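A minimal sketch of the naive approach described above, written in plain Scala rather than against Aloha's model API. The names `ArgMaxSketch`, `combinations`, and `argMax` are hypothetical, and `score` stands in for whatever the wrapped model would compute for one assignment of the free parameters.

```scala
object ArgMaxSketch {
  /** Cartesian product of the candidate values for each free parameter. */
  def combinations[A](candidates: Seq[Seq[A]]): Seq[Seq[A]] =
    candidates.foldRight(Seq(Seq.empty[A])) { (values, acc) =>
      for (v <- values; rest <- acc) yield v +: rest
    }

  /** Scores every combination and returns the best one with its score,
    * or None when there is no combination to try. */
  def argMax[A](candidates: Seq[Seq[A]])(score: Seq[A] => Double): Option[(Seq[A], Double)] = {
    val scored = combinations(candidates).map(c => (c, score(c)))
    if (scored.isEmpty) None else Some(scored.maxBy(_._2))
  }
}
```

An arg min falls out of the same function by negating `score`. The number of combinations is the product of the candidate list sizes, so this only suits a small number of free parameters.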
Allow VW multi-class predictions to specify a class mapping. This will enable the models to output different prediction class types.
Class types: Int (if not supplied), Boolean, Long, Double, String.
classLookup(cbAction).toIndex + 1 (this will be a HashMap constructed from the array).
Make sure to reuse code for determining type of JSON array / prediction class type.
type T = typeOf[class lookup]
Double => Int => T => B, where B is the model output type
// If the prediction is not an Int
// (doubleVal.isValidInt), throw exception at predict time
// throw vs return error?
val classIndex: Double => Int = (vwPred: Double) => {
  if (!vwPred.isValidInt) throw new Exception("...")
  vwPred.toInt - 1
}
val classLookup: Int => T =
  (vwPredClass: Int) => lookup(vwPredClass)
val tToB: T => B = TypeCoercion[T, B] getOrElse {
  throw new DeserializationException(
    s"Couldn't find conversion to ${RefInfoOps.toString[B]}"
  )
}
val finalizerFn: Double => B =
  classIndex andThen classLookup andThen tToB
If it exists, copy classLookup in spec to classLookup in model.
When creating an Aloha VW model and embedding the native model as a base64-encoded string, VFS2 is forced. This should not be the case; the format should be specified by the caller.
As part of this also check the H2O branch to make sure the same thing isn't happening over there.
This would be a tool that ingests a number of business rules specified w.r.t. features in a dataset and a set of other models. It would return a model decision tree that encodes all those business rules and leads to either predetermined outcomes or to the other models specified.
Update the POM to add Sonatype required plugins / info.
A user should be able to specify features that are based on the actions. This is tricky as it requires the model to know all possible actions and represent these features differently.
Create a separate page for Aloha's philosophy on errors and how it deals with them. Include SemanticsUdfException, ErrorSwallowingModel, and information about the Model API for apply, scoreAsEither, and score.
Create an easy, generic, fast way to convert Option[A] to Seq[(String, Double)]. This should NOT be an implicit conversion but rather an enrichment (pimp) of Option through the use of a value class and an implicit method. The implicit method is chosen instead of an implicit value class because value classes can't be embedded in a trait, and we want the pimp to be in a stackable trait architecture. That way com.eharmony.aloha.feature.BasicFunctions can be updated so that the standard import of import com.eharmony.aloha.feature.BasicFunctions._ can be used to get the new functionality.
The value class will go in the package object for the regression model.
package com.eharmony.aloha.models

package object reg {
  // Value class that adds toKv to Option[A].
  final class OptToKv[A](val a: Option[A]) extends AnyVal {
    def toKv(implicit f: A => Double): Seq[(String, Double)] =
      a.map(v => ("", f(v))).toList
  }
}
The implicit conversion will go in RegressionModelValueToTupleConversions.
implicit def toKv[A](a: Option[A]): OptToKv[A] = new OptToKv(a)
Then, specifications like the following can be written in Aloha models:
"features": {
  "height_cm": "Option(180).toKv",
  "weight_lbs": "Option(${profile.weight}).filter(_ <= 200).toKv"
}
This would create the regression model features for someone over 200 lbs as:
IndexedSeq(
  List(("height_cm", 180.0)),
  Nil
)
This is exactly what we want. Don't screw up type inference but make the conversion fast and pretty painless for the model authors.
I'm trying to serialize the Aloha VW JNI model after updating to the latest code and I'm seeing the following exception:
java.io.NotSerializableException: org.apache.commons.vfs2.provider.local.LocalFile
This is being triggered by a call to SparkContext.broadcast of an Aloha VW JNI model. After looking at the code, the culprit appears to be the vwModel object. Can this please be changed so that it is serializable again?
Currently, there is a lot of serialization code in VW that was complicated to write. This should be moved to aloha-core so that it can be reused in aloha-h2o.
Update model JSON formats with JSON specifics for the h2o model.
Remove the notion of a ScoreProto and replace that with a recursive data type that allows representation of all the subscores.
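One possible shape for such a recursive type, as a sketch only. `Score`, its fields, and `flatten` are hypothetical names, and the value is fixed to `Double` for brevity; a real replacement would have to cover every model output type.

```scala
/** Hypothetical recursive score: the value (if one was produced), any errors,
  * and the scores of the submodels that were consulted. */
final case class Score(
    modelId: Long,
    value: Option[Double],
    errors: Seq[String] = Nil,
    subscores: Seq[Score] = Nil) {

  /** All scores in the tree in depth-first order, this one first. */
  def flatten: Seq[Score] = this +: subscores.flatMap(_.flatten)
}
```

Because each `Score` carries its own subscores, arbitrarily nested model hierarchies can be represented without a separate flat list.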
In com.eharmony.aloha.dataset.RowCreatorBuilder in aloha-core, make the error message on line 100 use the failure information from the logging statement above it.
Create a JSON entity and matching Selector type that allows the JSON author to provide a function that provides an index of the child to traverse. This may look something like:
{
  "type": "constant",
  "selector": "${reg.hour_of_day}",
  "children": [ 3, 4, 5 ]
}
Add the ability to build R Aloha models. This should have all the same support that VW currently has.
I recently ran into an issue where the random node selector wasn't working because I was using the same features at two different levels of multiple model decision trees. This problem is avoidable if Aloha takes care of the salts automatically.
This can be done by having a flag which defaults to true called automaticSalts. If this is set to true, then Aloha will append the depth in the model decision tree to the set of features for the hash. This will prevent the same set of features from being used twice.
This feature should be extended to CategoricalDistribution models, and any other place randomization is needed.
In order for this to work, something about the model should be included in the salt. An idea that came up is a hash of the JSON.
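A rough sketch of the idea, assuming the hash is computed with the standard library's `scala.util.hashing.MurmurHash3`; Aloha's actual hashing code may differ. `AutoSaltSketch`, `pickChild`, and `modelSalt` are hypothetical names, with `modelSalt` standing in for something like a hash of the model JSON.

```scala
import scala.util.hashing.MurmurHash3

object AutoSaltSketch {
  /**
   * Picks one of `numChildren` branches from a hash of the features.
   * The tree depth and a model-level salt are mixed into the seed, so the
   * same features hashed at another depth or in another model start from
   * a different seed.
   */
  def pickChild(features: Seq[Any], depth: Int, modelSalt: Int, numChildren: Int): Int = {
    require(numChildren > 0, "numChildren must be positive")
    val seed = MurmurHash3.mix(modelSalt, depth)
    val h = MurmurHash3.orderedHash(features, seed)
    // floorMod keeps the index in [0, numChildren) even when the hash is negative.
    java.lang.Math.floorMod(h, numChildren)
  }
}
```

The result is deterministic for a given (features, depth, modelSalt) triple, which keeps scoring reproducible while decorrelating the choices made at different levels.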
The bag function in com.eharmony.aloha.feature.BagOfWords doesn't sum the weights associated with a duplicate key. Essentially it's doing the map but not the reduce phase of a word count.
Need to aggregate these values.
Also check skipGrams and nGrams for similar problems.
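The missing reduce phase could look roughly like the following. This is a standalone sketch, not the actual signature of `bag` in `BagOfWords`; `BagSketch` and its whitespace tokenization are assumptions made for illustration.

```scala
object BagSketch {
  /** Map phase: one (token, 1.0) pair per occurrence.
    * Reduce phase: sum the weights of pairs that share a token. */
  def bag(text: String): Map[String, Double] = {
    val pairs = text.split("\\s+").toSeq.filter(_.nonEmpty).map(t => (t, 1.0))
    pairs.groupBy(_._1).map { case (token, kvs) => token -> kvs.map(_._2).sum }
  }
}
```

For example, `BagSketch.bag("to be or not to be")` yields a weight of 2.0 for "to" and "be" and 1.0 for "or" and "not", instead of emitting duplicate keys.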
Currently, at least in VwContextualBanditRowCreator, when label data is missing, covariate data is emitted but labels are not.
Need to figure out a solution to this.
In the exploration models we only want to include the sub scores which contributed to the actual score. This will give us a better idea of what actually happened.
Use an async Ajax request to Maven Central to automatically pull the latest Aloha release.
We still need protos on our side. This implementation may make sense in aloha-proto.
Make it something like:
Couldn't create dataset. Errors:
0) CsvRowCreator.Producer: Object is missing required member 'imports'
Add a detailed description of every model type and how to construct them by hand.
Detailed description of Aloha's inner workings assuming knowledge of Java/Scala.
All model types which contain other models must recursively call close on their contained models.
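A minimal sketch of the recursive close, using hypothetical stand-in types rather than Aloha's own model trait. `SketchModel`, `closeSelf`, `LeafModel`, and `TreeModel` are illustrative names only.

```scala
import java.io.Closeable

/** Hypothetical stand-in for a model that may hold native resources. */
trait SketchModel extends Closeable {
  def submodels: Seq[SketchModel]

  /** Releases this model's own resources; a no-op unless overridden. */
  protected def closeSelf(): Unit = ()

  /** Closes every contained model first, then this one. */
  final def close(): Unit = {
    submodels.foreach(_.close())
    closeSelf()
  }
}

/** Leaf that records whether it was closed, standing in for a native-backed model. */
final class LeafModel extends SketchModel {
  @volatile var closed = false
  val submodels: Seq[SketchModel] = Nil
  override protected def closeSelf(): Unit = { closed = true }
}

/** Container model whose only job here is to hold submodels. */
final class TreeModel(val submodels: Seq[SketchModel]) extends SketchModel
```

Making `close` final in the base trait means a container model cannot forget to propagate the call; it only overrides `closeSelf` when it owns resources of its own.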
Currently, the com.github.github maven-site-plugin is slow because each resource is uploaded individually and the uploads are rate-limited. Look for an alternative plugin that uploads all at once.
For instance: mojohaus
Suspect that 'missing data' field may be the culprit.
Add the ability to build H2O Aloha models. This should have all the same support that VW currently has.
Show examples of how Aloha can be used today with both R and sklearn.
The aloha-cli bash script doesn't process quoted strings correctly. Quoted strings with spaces are exploded into multiple arguments.
Only the shell script needs to be fixed, though unit tests might have to be added.
It was always defaulting to Vfs2 and wasn't respecting the via flag.
VwJniModel serialization only works when the model is serialized and deserialized on the same machine. This appears to be because the serialization only captures the VW model file's location and not the contents of the file.
Clean up all the Scaladoc warnings and ensure proper doc coverage.
Some types of models are not serializable and therefore cannot be used with Spark. The one I see is Exception in thread "main" java.io.NotSerializableException: com.eharmony.aloha.util.rand.HashedCategoricalDistribution.
For this ticket please go through all model types and ensure they are all serializable.
Add the ability to build sklearn Aloha models. This should have all the same support that VW currently has.
I was working with Aloha and receiving the error below every now and then after altering the spec file. A more descriptive error message would be helpful.
"Couldn't create writer for dataset type csv. Error: No applicable producer found. Given Producer"
For documentation purposes
A new release of VW should be coming shortly. This needs to be integrated into Aloha.
In com.eharmony.aloha.models.tree.decision.ModelDecisionTree: this can cause a memory leak because, if the submodel is a VW JNI model, never closing the model leaves memory on the native side undeleted.