airbnb / aerosolve

A machine learning package built for humans.

Home Page: http://airbnb.github.io/aerosolve/

License: Apache License 2.0

Languages: Java 48.18%, Thrift 0.41%, Scala 51.20%, Shell 0.20%

aerosolve's Introduction

aerosolve

Machine learning for humans.

What is it?

A machine learning library designed from the ground up to be human friendly. It is different from other machine learning libraries in the following ways:

A Thrift-based feature representation that enables pairwise ranking loss and a single-context, multiple-item representation.
A feature transform language that gives the user a lot of control over the features.
Human-friendly, debuggable models.
Separate, lightweight Java inference code.
Scala code for training.
Simple image content analysis code suitable for ordering or ranking images.

This library is meant to be used with sparse, interpretable features such as those that commonly occur in search (search keywords, filters) or pricing (number of rooms, location, price). It is less interpretable on problems with very dense features that humans cannot readily interpret, such as raw pixels or audio samples.

There are a few reasons to focus on interpretability:

Your corpus is new and not fully defined, and you want more insight into it.
Having interpretable models lets you iterate quickly. You can figure out where the model disagrees most and gain insight into what kinds of new features are needed.
Debugging noisy features. By plotting the feature weights you can discover buggy features, or fit them to splines and discover features that are unexpectedly complex (which usually indicates overfitting).
You can discover relationships between different variables and your target prediction. For example, for the Airbnb demand model, plotting graphs of reviews and 3-star reviews is more interpretable than many nested if-then-else rules.

How to get started?

The artifacts for aerosolve are hosted on Bintray. If you use Maven, SBT, or Gradle, you can point to Bintray as a repository and fetch the artifacts automatically.

Check out the Image Impressionism Demo, where you can learn how to teach the algorithm to paint in the pointillism style.

There is also an Income Prediction Demo based on a popular machine learning benchmark.

Feature Representation

This section dives into the Thrift-based feature representation.

Features are grouped into logical groups called feature families. Grouping lets us express transformations on an entire feature family at once, or interact two different feature families to create a new feature family.

There are three kinds of features per FeatureVector:

stringFeatures - a map of feature family to binary feature strings. For example, "GEO" -> { "San Francisco", "CA", "USA" }
floatFeatures - a map of feature family to feature name and value. For example, "LOC" -> { "Latitude" : 37.75, "Longitude" : -122.43 }
denseFeatures - a map of feature family to a dense array of floats. Rarely used outside the image content analysis code.
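
As a rough illustration of the three maps above, here is a minimal plain-Java sketch. The class FeatureVectorSketch and its helper methods are hypothetical stand-ins written for this example; they are not the Thrift-generated FeatureVector class, whose exact accessors should be checked in the generated code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the three maps described above;
// NOT the Thrift-generated FeatureVector class.
class FeatureVectorSketch {
    // feature family -> set of binary feature strings
    final Map<String, Set<String>> stringFeatures = new HashMap<>();
    // feature family -> (feature name -> value)
    final Map<String, Map<String, Double>> floatFeatures = new HashMap<>();
    // feature family -> dense array of floats
    final Map<String, List<Double>> denseFeatures = new HashMap<>();

    // Adds one binary string feature to a family, creating the family if needed.
    void addString(String family, String feature) {
        stringFeatures.computeIfAbsent(family, k -> new HashSet<>()).add(feature);
    }

    // Sets one named float feature in a family, creating the family if needed.
    void addFloat(String family, String name, double value) {
        floatFeatures.computeIfAbsent(family, k -> new HashMap<>()).put(name, value);
    }

    // Builds the "GEO" and "LOC" families used as examples in the text.
    static FeatureVectorSketch sample() {
        FeatureVectorSketch fv = new FeatureVectorSketch();
        fv.addString("GEO", "San Francisco");
        fv.addString("GEO", "CA");
        fv.addString("GEO", "USA");
        fv.addFloat("LOC", "Latitude", 37.75);
        fv.addFloat("LOC", "Longitude", -122.43);
        return fv;
    }

    public static void main(String[] args) {
        FeatureVectorSketch fv = sample();
        System.out.println(fv.stringFeatures);
        System.out.println(fv.floatFeatures);
    }
}
```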

Example Representation

Examples are the basic unit of creating training data and scoring. A single example is composed of:

context - a FeatureVector that occurs once in the example. It could hold the features representing a search session, e.g. "Keyword" -> "Free parking"
example(0..N) - a repeated list of FeatureVectors that represent the items being scored. These can correspond to documents in a search session, e.g. "LISTING CITY" -> "San Francisco"

The reasons for having this structure are:

having one context for hundreds of items saves a lot of space during RPCs or even on disk
you can compute the transforms for the context once, then apply the transformed context repeatedly in conjunction with each item
having a list of items allows the use of list-based loss functions, such as pairwise ranking loss or domination loss, where we evaluate multiple items at once
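
The "one context, many items" idea can be sketched as follows. This is an illustration only: ExampleSketch, combine, combineAll, and single are hypothetical names, the sketch handles string features only for brevity, and the second city is made-up sample data. It does not reproduce the library's own context/item combination logic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of one shared context combined with many items.
class ExampleSketch {
    // Returns a copy of the item with the context families merged in.
    // Neither input map is modified.
    static Map<String, Set<String>> combine(Map<String, Set<String>> context,
                                            Map<String, Set<String>> item) {
        Map<String, Set<String>> out = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : item.entrySet()) {
            out.put(e.getKey(), new HashSet<>(e.getValue()));
        }
        for (Map.Entry<String, Set<String>> e : context.entrySet()) {
            out.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
        }
        return out;
    }

    // The context is stored once and reused for every item.
    static List<Map<String, Set<String>>> combineAll(Map<String, Set<String>> context,
                                                     List<Map<String, Set<String>>> items) {
        List<Map<String, Set<String>>> out = new ArrayList<>();
        for (Map<String, Set<String>> item : items) {
            out.add(combine(context, item));
        }
        return out;
    }

    // Convenience helper: a feature map holding a single family and feature.
    static Map<String, Set<String>> single(String family, String feature) {
        Map<String, Set<String>> m = new HashMap<>();
        m.put(family, new HashSet<>(Collections.singleton(feature)));
        return m;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> context = single("Keyword", "Free parking");
        List<Map<String, Set<String>>> items = new ArrayList<>();
        items.add(single("LISTING CITY", "San Francisco"));
        items.add(single("LISTING CITY", "Oakland")); // made-up second item
        System.out.println(combineAll(context, items));
    }
}
```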

Feature Transform language

This section dives into the feature transform language.

Feature transforms are applied with a separate transformer module that is decoupled from the model. This allows the user to break transforms apart or to transform data ahead of scoring time. For example, in an application the items in a corpus may be transformed ahead of time and stored, while the context is not known until runtime. At runtime, one can then transform the context and combine it with each transformed item to get the final feature vector that is fed to the models.

Feature transforms allow us to modify FeatureVectors on the fly. This lets engineers iterate on feature engineering rapidly and in a controlled way.

Here are some examples of feature transforms that are commonly used:

List transform. A meta transform that specifies other transforms to be applied.
Cross transform. Operates only on stringFeatures. Allows interactions between two different string feature families, e.g. "Keyword" cross "LISTING CITY" creates the new feature family "Keyword_x_city" -> "Free parking^San Francisco"
Multiscale grid transform. Constructs multiple nested grids for 2D coordinates. Useful for modelling geography.
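
To make the cross transform concrete, here is a minimal sketch that pairs every feature of one string family with every feature of another, joined by "^" as in the example above. CrossSketch, cross, and apply are hypothetical names for this illustration; the real transform is driven by a config, which is not shown here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of crossing two string feature families.
class CrossSketch {
    // Every left feature paired with every right feature, joined by '^'.
    static Set<String> cross(Set<String> left, Set<String> right) {
        Set<String> out = new TreeSet<>();
        for (String a : left) {
            for (String b : right) {
                out.add(a + "^" + b);
            }
        }
        return out;
    }

    // Writes the crossed family into the map; does nothing if either input is missing.
    static void apply(Map<String, Set<String>> stringFeatures,
                      String family1, String family2, String outputFamily) {
        Set<String> left = stringFeatures.get(family1);
        Set<String> right = stringFeatures.get(family2);
        if (left == null || right == null) {
            return;
        }
        stringFeatures.put(outputFamily, cross(left, right));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> features = new HashMap<>();
        Set<String> keyword = new HashSet<>();
        keyword.add("Free parking");
        Set<String> city = new HashSet<>();
        city.add("San Francisco");
        features.put("Keyword", keyword);
        features.put("LISTING CITY", city);
        apply(features, "Keyword", "LISTING CITY", "Keyword_x_city");
        System.out.println(features.get("Keyword_x_city"));
    }
}
```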

Please see the corresponding unit tests for what these transforms do, what kinds of features they operate on, and what kind of config they expect.
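
The idea behind nested grids for 2D coordinates can be sketched as below: the same point is assigned to one cell per grid resolution, and each cell becomes a string feature. The class name, the bucket sizes, and the output string format are all assumptions made for this illustration and may differ from what the actual multiscale grid transform emits; the unit tests remain the reference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one grid-cell feature per bucket size for a 2D point.
class MultiscaleGridSketch {
    // For each bucket size, floor-divide both coordinates to find the cell index.
    // The "[bucket]=(cx,cy)" format is an assumption made for this example.
    static List<String> cells(double x, double y, double[] bucketSizes) {
        List<String> out = new ArrayList<>();
        for (double bucket : bucketSizes) {
            long cx = (long) Math.floor(x / bucket);
            long cy = (long) Math.floor(y / bucket);
            out.add("[" + bucket + "]=(" + cx + "," + cy + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Latitude/longitude from the floatFeatures example, at two resolutions.
        for (String cell : cells(37.75, -122.43, new double[] {1.0, 0.1})) {
            System.out.println(cell);
        }
    }
}
```

Coarse cells let a model generalize over a wide area, while fine cells let it capture local effects; emitting both gives the model the choice.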

Models

This section covers debuggable models.

Although there are several models in the model directory, only two are the main debuggable models. The rest are experimental or sub-models that create transforms for the interpretable models.

Linear model. Supports hinge, logistic, epsilon-insensitive regression, and ranking loss functions. Only operates on stringFeatures. The label for the task is stored in a special feature family and specified by rank_key in the config. See the linear model unit tests for how to set up the models. Note that in conjunction with quantization and crosses, the "linear" model can become extremely complex: it is not a regular linear model, and can be thought of as a bushy, very wide decision tree with millions of branches.
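
A linear model over binary string features amounts to summing one weight per active feature. The sketch below shows that scoring step together with a standard hinge loss. LinearSketch, the "family:feature" weight key, and the sample weights are assumptions made for this illustration and are not the library's model format.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of scoring binary string features with a linear model.
class LinearSketch {
    // Sums the weight of every active (family, feature) pair; unknown pairs add 0.
    // Weights are keyed by "family:feature", a simplification chosen for this sketch.
    static double score(Map<String, Set<String>> stringFeatures, Map<String, Double> weights) {
        double sum = 0.0;
        for (Map.Entry<String, Set<String>> e : stringFeatures.entrySet()) {
            for (String feature : e.getValue()) {
                sum += weights.getOrDefault(e.getKey() + ":" + feature, 0.0);
            }
        }
        return sum;
    }

    // Standard hinge loss for a label in {-1, +1}.
    static double hingeLoss(double label, double score) {
        return Math.max(0.0, 1.0 - label * score);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> features = new HashMap<>();
        Set<String> geo = new HashSet<>();
        geo.add("San Francisco");
        geo.add("CA");
        geo.add("USA");
        features.put("GEO", geo);

        Map<String, Double> weights = new HashMap<>();
        weights.put("GEO:CA", 0.5);   // made-up weights
        weights.put("GEO:USA", 0.25);

        double s = score(features, weights);
        System.out.println("score = " + s + ", hinge loss = " + hingeLoss(1.0, s));
    }
}
```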

Spline model. A general additive piecewise linear spline model. Training is done at a higher resolution, specified by num_buckets, between the min and max of a feature's range. At the end of each iteration we attempt to project the piecewise linear spline onto a lower-dimensional function such as a polynomial spline with Dirac delta endpoints. If the RMSE of the projection is above a threshold, we leave the spline in its high-resolution piecewise linear form. This allows us to debug the spline model for features that are buggy or unexpectedly complex (e.g. jumping up and down when we expect some kind of smoothness).
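
The high-resolution piecewise linear form can be pictured as a set of weights at evenly spaced points between a feature's min and max, with linear interpolation in between. The sketch below evaluates such a function. It is a generic illustration with hypothetical names and made-up weights, and it covers neither training nor the projection step.

```java
// Hypothetical sketch: evaluate a piecewise linear function defined by
// weights at evenly spaced knots between min and max.
class PiecewiseLinearSketch {
    // Assumes weights.length >= 2 and max > min. Inputs outside [min, max] are clamped.
    static double evaluate(double[] weights, double min, double max, double x) {
        int n = weights.length;
        double clamped = Math.max(min, Math.min(max, x));
        // Position of x measured in knot units, in [0, n - 1].
        double t = (clamped - min) / (max - min) * (n - 1);
        // Index of the left knot; capped so that i + 1 stays in range at x == max.
        int i = Math.min((int) Math.floor(t), n - 2);
        double frac = t - i;
        return weights[i] + frac * (weights[i + 1] - weights[i]);
    }

    public static void main(String[] args) {
        double[] weights = {0.0, 1.0, 4.0}; // made-up values at x = 0, 1, 2
        System.out.println(evaluate(weights, 0.0, 2.0, 0.5));
        System.out.println(evaluate(weights, 0.0, 2.0, 1.5));
    }
}
```

Plotting the weights of such a function per feature is what makes jumps or unexpected wiggles visible.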

Boosted stumps model - small compact model. Not very interpretable but at small sizes useful for feature selection.
Decision tree model - in memory only. Mostly used to generate transforms for the linear or spline model.
Maxout neural network model. Experimental and mostly used as a comparison baseline.
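
For readers unfamiliar with stumps, a decision stump is a one-level decision rule, and a boosted stumps model sums the responses of many of them. The sketch below shows only the scoring side under simple assumptions: each stump fires when a float feature is at or above a threshold. StumpSketch and its fields are hypothetical and do not mirror the library's model records.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of scoring with a sum of decision stumps.
class StumpSketch {
    final String family;
    final String name;
    final double threshold;
    final double weight;

    StumpSketch(String family, String name, double threshold, double weight) {
        this.family = family;
        this.name = name;
        this.threshold = threshold;
        this.weight = weight;
    }

    // Returns the stump's weight if the feature exists and is >= threshold, else 0.
    double respond(Map<String, Map<String, Double>> floatFeatures) {
        Map<String, Double> fam = floatFeatures.get(family);
        if (fam == null) {
            return 0.0;
        }
        Double value = fam.get(name);
        return (value != null && value >= threshold) ? weight : 0.0;
    }

    // The model score is the sum of all stump responses.
    static double score(List<StumpSketch> stumps, Map<String, Map<String, Double>> floatFeatures) {
        double sum = 0.0;
        for (StumpSketch s : stumps) {
            sum += s.respond(floatFeatures);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> features = new HashMap<>();
        Map<String, Double> loc = new HashMap<>();
        loc.put("Latitude", 37.75);
        features.put("LOC", loc);

        List<StumpSketch> stumps = new ArrayList<>();
        stumps.add(new StumpSketch("LOC", "Latitude", 30.0, 0.5)); // made-up stumps
        stumps.add(new StumpSketch("LOC", "Latitude", 40.0, 2.0));
        System.out.println(score(stumps, features));
    }
}
```

Because each stump names exactly one feature, the set of stumps that survive training doubles as a list of selected features.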

IDE

If you use IntelliJ, build the project first so that the Thrift classes are available. To fix the Spark compile error inside IntelliJ, press Command+;, click Dependencies, and change the scope of the related libraries, such as org.apache.spark and org.apache.hadoop:hadoop-common, from test to compile. We keep these as testCompile in the Gradle config to reduce the jar file size.

Support

Hackpad

Dev group

User group

In the wild

Organizations and projects using aerosolve can list themselves here.


aerosolve's Issues

combineContextAndItems(Example examples)

While trying to use aerosolve for the Kaggle Airbnb competition, I noticed that in com.airbnb.aerosolve.core.transforms.Transformer.java
only string features are copied from the context to the examples.
Is this behavior intentional?
In general, is the distinction between context and item features only a memory optimization, or does it affect the algorithm's ranking task?
Thanks in advance for any advice.

missing FeatureVector.java

FeatureVector.java is missing, so the project can't build.

training compiled with Scala 2.11

I think it would be great to have the training module compiled with Scala 2.11, because version 2.10 is relatively old.

Pull historical price data from apartments at a certain location

Hi, I am a potential house renter. I want to sell my house and buy a few apartments in my city, Qinhuangdao, China. Could I pull the historical price data on Airbnb for my city and analyze it myself in order to make a final decision? I would very much appreciate your help. Thank you!

Blocking issue with running Aerosolve demos

Hi, I'm a Stanford MS student trying to run the image impressionism and income classification demos. When running gradle shadowjar --info, I get multiple errors of the following type during the execution of the task :core:compileJava:

/Users/ei5h4/Documents/aerosolve/core/build/gen-java/com/airbnb/aerosolve/core/ModelRecord.java:1075: error: method hashCode in class Object cannot be applied to given types;
      hashCode = hashCode * 8191 + org.apache.thrift.TBaseHelper.hashCode(featureWeight);
                                                                ^
  required: no arguments
  found: double
  reason: actual and formal argument lists differ in length

My thrift version is 0.10.0. I tried downloading and installing an older version of thrift (0.9.0) from source since this demo is old and might rely on an older thrift (just a hypothesis). But that turned out to have some roadblocks as well since the older thrift uses some C code namespace tr1 that is no longer supported by C++11 on my OSX El Capitan. So I couldn't verify if thrift is the issue or something else. Basically I thought the hashCode function in the error above might have a changed signature from 0.9.0 to 0.10.0.

I think anyone else attempting to build the demo will run into this issue as well. Really hope to get this running on my machine soon. Aerosolve is super exciting!

TwentyNews demo DebugTransform job fail

I hope that someone could help me.
I have this issue when running sh job_runner.sh DebugTransform

ERROR Runner: Exception on job DebugTransform : com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'delete_string.fields'

Maybe this empty class causes it:

package com.airbnb.aerosolve.core.transforms;

// TODO: remove this once all configs have migrated over to the new transform names
public class DeleteStringFeatureColumnTransform extends DeleteStringFeatureFamilyTransform {}

Thanks a lot

Income prediction demo job_runner.sh fails because it refers to an old library file

Running sh job_runner.sh MakeTraining in /aerosolve/demo/income_prediction returns:

/aerosolve/demo/income_prediction/build/libs/income_prediction-0.1.3-all.jar does not exist, skipping.
java.lang.ClassNotFoundException: com.airbnb.aerosolve.demo.IncomePrediction.JobRunner

income_prediction-0.1.3 is now income_prediction-0.1.6

Error when running gradle shadowjar --info

:core:compileJava (Thread[Task worker for ':',5,main]) started.

Task :core:compileJava
Putting task artifact state for task ':core:compileJava' into context took 0.0 secs.
Executing task ':core:compileJava' (up-to-date check took 0.006 secs) due to:
Task has failed previously.
All input files are considered out-of-date for incremental task ':core:compileJava'.
Compiling with JDK Java compiler API.
/Users/maximebodereau/Documents/Projects/Ux AI/aerosolve/core/src/main/java/com/airbnb/aerosolve/core/util/Weibull.java:13: error: cannot find symbol
public WeibullBuilder defaultBuilder() {
^
symbol: class WeibullBuilder
location: class Weibull
1 error

:core:compileJava (Thread[Task worker for ':',5,main]) completed. Took 0.262 secs.

FAILURE: Build failed with an exception.

What went wrong:
Execution failed for task ':core:compileJava'.

java.lang.NoSuchFieldError: pid

Incorrect link on airbnb.io

If you go to http://airbnb.io/projects/aerosolve/ and click on the big red GitHub button, you are actually redirected to the airpal repository. I know this issue is not related to aerosolve itself, but I didn't know where to tell you best.

Income Prediction Error sh job_runner.sh TrainModel

I got this error while trying to run sh job_runner.sh TrainModel

I ran it within the demo/income_prediction folder.
Everything was built with Gradle successfully.

sh job_runner.sh MakeTraining and sh job_runner.sh MakeTesting worked and produced output files.

Running:
Scala 2.11.8
Spark 2.0.1

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.rdd.RDD.coalesce$default$3(IZ)Lscala/math/Ordering;
	at com.airbnb.aerosolve.training.AdditiveModelTrainer$.sgdTrain(AdditiveModelTrainer.scala:416)
	at com.airbnb.aerosolve.training.AdditiveModelTrainer$.train(AdditiveModelTrainer.scala:262)
	at com.airbnb.aerosolve.training.AdditiveModelTrainer$.trainAndSaveToFileEarlySample(AdditiveModelTrainer.scala:786)
	at com.airbnb.aerosolve.training.AdditiveModelTrainer$.trainAndSaveToFile(AdditiveModelTrainer.scala:768)
	at com.airbnb.aerosolve.training.TrainingUtils$.trainAndSaveToFile(TrainingUtils.scala:192)
	at com.airbnb.aerosolve.demo.IncomePrediction.IncomePredictionPipeline$.trainModel(IncomePredictionPipeline.scala:93)
	at com.airbnb.aerosolve.demo.IncomePrediction.JobRunner$$anonfun$main$1.apply(JobRunner.scala:41)
	at com.airbnb.aerosolve.demo.IncomePrediction.JobRunner$$anonfun$main$1.apply(JobRunner.scala:32)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at com.airbnb.aerosolve.demo.IncomePrediction.JobRunner$.main(JobRunner.scala:32)
	at com.airbnb.aerosolve.demo.IncomePrediction.JobRunner.main(JobRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala

Could you provide Shiny Model Debug code?

Please

Can't find file: com.airbnb.aerosolve.core.FeatureVector

Hello,
In the file FeatureVectorGen, lines 3 and 4, Example and FeatureVector can't be found. Is this an error?
Thanks.

import com.airbnb.aerosolve.core.Example;
import com.airbnb.aerosolve.core.FeatureVector;

Your project airbnb aerosolve is using buggy third-party libraries [WARNING]

Hi, there!

We are a research team working on third-party library analysis. We have found that some widely used third-party libraries in your project have major or critical bugs, which will degrade the quality of your project. We highly recommend that you update those libraries to newer versions.

We have attached the buggy third-party libraries and the corresponding Jira issue links below for more detailed information. We analyzed the API calls related to the following libraries and found one library for which your project uses an API call that might invoke methods with a history of bugs.

1. commons-codec commons-codec
version: 1.4
API call in your project: org.apache.commons.codec.binary.Base64.setInitialBuffer(byte[],int,int)

Jira issues:
Base64InputStream#read(byte[]) incorrectly returns 0 at end of any stream which is multiple of 3 bytes long
version:1.4
ArrayIndexOutOfBoundsException when doing multiple reads() on encoding Base64InputStream
version:1.4
Base64 encoding issue for larger avi files
version:1.4
org.apache.commons.codec.net.URLCodec.ESCAPE_CHAR isn't final but should be
version:1.2;1.3;1.4
org.apache.commons.codec.language.RefinedSoundex.US_ENGLISH_MAPPING should be package protected MALICIOUS_CODE
version:1.4
org.apache.commons.codec.language.Soundex.US_ENGLISH_MAPPING should be package protected MALICIOUS_CODE
version:1.4
Caverphone encodes names starting and ending with "mb" incorrectly.
version:1.4
All links to fixed bugs in the "Changes Report" http://commons.apache.org/codec/changes-report.html point nowhere; e.g. http://issues.apache.org/jira/browse/34157. Looks as if all JIRA tickets were renumbered.
version:1.1;1.2;1.3;1.4
Regression: Base64.encode(chunk=true) has bug when input length is multiple of 76
version:1.4
DigestUtils: MD5 checksum is not calculated correctly on linux64-platforms
version:1.3;1.4
new Base64().encode() appends a CRLF; and chunks results into 76 character lines
version:1.4
Base64 encode() method is no longer thread-safe; breaking clients using it as a shared BinaryEncoder
version:1.4
Base64 default constructor behaviour changed to enable chunking in 1.4
version:1.4
Base64InputStream causes NullPointerException on some input
version:1.4
Base64.encodeBase64String() shouldn't chunk
version:1.4
2. org.apache.commons commons-lang3
version: 3.4
Jira issues:
TypeUtils.ParameterizedType#equals doesn't work with wildcard types
version:3.3.2;3.4
DateUtilsTest.testLang530 fails for some timezones
version:3.4
StringUtils.stripAccents from "Ł" and "ł"
version:3.4
No release notes for version 3.4
version:3.4
JsonToStringStyle doesn't handle chars and objects correctly
version:3.4
ReflectionToStringBuilder doesn't throw IllegalArgumentException when the constructor's object param is null
version:3.4
StrLookup.systemPropertiesLookup() no longer reacts on changes on system properties
version:3.4
StringUtils#capitalize: Javadoc says toTitleCase; code uses toUpperCase
version:3.4
Multiple calls of org.apache.commons.lang3.concurrent.LazyInitializer.initialize() are possible
version:3.4;3.5
EnumUtils *BitVector issue with more than 32 values Enum
version:3.4
StringUtils#equals fails with Index OOBE on non-Strings with identical leading prefix
version:3.4
There are no tests for CharSequenceUtils.regionMatches
version:3.4
ArrayUtils.removeAll(Object array; int... indices) should do the clone; not its callers
version:3.4
TypeUtils.isAssignable throws NullPointerException when fromType has type variables and toType generic superclass specifies type variable
version:3.4
FastDateFormat does not support the week-year component (uppercase 'Y')
version:3.4
ordinalIndexOf("abc"; "ab"; 1) gives incorrect answer of -1 (correct answer should be 0)
version:3.4
Fix implementation of StringUtils.getJaroWinklerDistance()
version:3.4
parseDateStrictly does't pass specified locale
version:3.4
ClassUtils.getClass(ClassLoader; String) fails for "void"
version:3.4
NumberUtils.isNumber bug
version:3.4
FastDateFormat doesn't respect summer daylight in localized strings
version:3.4
StringUtils#normalizeSpace does not trim the string anymore
version:3.4
DiffBuilder: Add null check on fieldName when appending Object or Object[]
version:3.4
FastDatePrinter Memory allocation regression
version:3.4
SerializationUtils.ClassLoaderAwareObjectInputStream should use static initializer to initialize primitiveTypes map.
version:3.2;3.3;3.4
NumberUtils.isNumber and NumberUtils.createNumber resolve inconsistently
version:3.4
ArrayUtils.contains returns false for instances of subtypes
version:3.4
CompareToBuilder.append(Object;Object;Comparator) method is too big to be inlined
version:3.4
StrBuilder#replaceAll ArrayIndexOutOfBoundsException
version:3.2.1;3.4;3.5
NumberUtils#createNumber() returns positive BigDecimal when negative Float is expected
version:3.x

Sincerely~
FDU Software Engineering Lab
March 14th, 2019

Outdated jar version in image_impressionism/README.md

Version in the following command should be updated:

aerosolve/demo/image_impressionism/README.md
91: spark-shell --master local --jars build/libs/image_impressionism-0.1.7-all.jar

what does "node_query" in search_template.conf mean?

Hi, I'm confused about the 'node_query' parameter. Would you please provide example data or SQL? Thanks a lot.

README.md for income_prediction refers to old lib 0.1.3

Incorrect jar name

In demo/income_prediction/job_runner.sh, the version number for the income prediction jar file is old. As a result, there is a ClassNotFoundException when trying to find the JobRunner class.

Existing:
build/libs/income_prediction-0.1.3-all.jar \

Should be:
build/libs/income_prediction-0.1.6-all.jar \

Please apply this change and commit to the git repo.

Thanks!

What does the parameter "lossOnly" mean in AdditiveModelTrainer?

Linear model "Only operates on string features"?

The README says that the Linear model "only operates on string features", but the demo using the linear model (image_impressionism), appears to use both string features and float features. Is this sentence in the Readme incorrect or am I missing something?