spark-glmnet

glmnet - "Lasso and Elastic-Net Regularized Generalized Linear Models"

A Scala implementation of "Lasso and Elastic-Net Regularized Generalized Linear Models" for Spark MLlib, based on "Regularization Paths for Generalized Linear Models via Coordinate Descent" by Jerome Friedman, Trevor Hastie and Rob Tibshirani of Stanford University (http://web.stanford.edu/~hastie/Papers/glmnet.pdf). The algorithm is typically referred to as "glmnet": a generalized linear model with elastic net regularization. Elastic net combines the ridge (L2) and lasso (L1) regularization penalties, with alpha controlling the mix between them and lambda controlling the overall penalty strength. This algorithm is generally faster than traditional methods such as linear regression and is particularly well suited for "fat" datasets (many more features than observations).
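
As an illustration of the coordinate descent update described in the paper, here is a minimal sketch for the squared-error (linear regression) case. It assumes every feature column is standardized (mean 0, mean square 1) and the response is centered. The names `ElasticNetSketch`, `softThreshold` and `cycle` are hypothetical and are not part of this project's API; the code is a plain-Scala sketch, not the distributed implementation.

```scala
object ElasticNetSketch {
  // Soft-thresholding operator: S(z, gamma) = sign(z) * max(|z| - gamma, 0).
  def softThreshold(z: Double, gamma: Double): Double =
    math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

  // One full cycle of coordinate descent over all coefficients.
  // Assumes standardized columns in x and a centered y (hypothetical helper).
  def cycle(x: Array[Array[Double]], y: Array[Double], beta: Array[Double],
            alpha: Double, lambda: Double): Array[Double] = {
    val n = x.length
    val b = beta.clone()
    for (j <- b.indices) {
      // z accumulates x_ij times the partial residual, which leaves out
      // feature j's own contribution to the fitted value.
      var z = 0.0
      for (i <- 0 until n) {
        var fitted = 0.0
        for (k <- b.indices if k != j) fitted += x(i)(k) * b(k)
        z += x(i)(j) * (y(i) - fitted)
      }
      // The lasso part (lambda * alpha) thresholds the coefficient;
      // the ridge part (lambda * (1 - alpha)) shrinks it.
      b(j) = softThreshold(z / n, lambda * alpha) / (1.0 + lambda * (1.0 - alpha))
    }
    b
  }
}
```

With alpha = 1 the update reduces to the lasso (pure soft-thresholding), and with alpha = 0 it reduces to ridge (pure proportional shrinkage). In practice the cycle is repeated until the coefficients stop changing.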

Spark MLlib

This code is fully integrated with Spark MLlib and is being submitted as an addition to MLlib. It performs K-fold cross validation, selects the alpha/lambda combination with the highest accuracy, and returns a model trained with that combination.

glmnet executes the following process:

1. The user sets up the search parameters:
   1.1 An array of alpha values.
   1.2 The number of lambda values; the default is 100 (glmnet automatically generates the series of lambda values).
   1.3 The number of folds (K) for cross validation.
2. On each fold:
   2.1 Using coordinate descent, generate a model for each combination of alpha and lambda from that fold's training data.
   2.2 Test all models on that fold's test data and save the accuracies.
3. Average the accuracies across the folds for each alpha/lambda combination, and choose the combination with the highest average accuracy.
4. Train on all of the data using the alpha/lambda combination from step 3 to produce the final (best) model.
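
The selection process above can be sketched in plain Scala as follows. This is only an illustration under several assumptions: `GlmnetCvSketch`, `selectBest`, `fit` and `score` are hypothetical names rather than this project's API, the lambda values are passed in explicitly instead of being generated automatically, rows are assigned to folds by index, and the loop iterates over combinations first and folds second, which yields the same averages as iterating fold by fold.

```scala
object GlmnetCvSketch {
  // Hypothetical stand-ins: `fit` trains a model for one (alpha, lambda) pair,
  // and `score` returns its accuracy on held-out rows (higher is better).
  def selectBest[Row, Model](
      data: Seq[Row],
      alphas: Seq[Double],
      lambdas: Seq[Double],
      k: Int,
      fit: (Seq[Row], Double, Double) => Model,
      score: (Model, Seq[Row]) => Double): (Double, Double, Model) = {
    // Assign row i to fold i % k.
    val indexed = data.zipWithIndex
    val grid = for (a <- alphas; l <- lambdas) yield (a, l)

    // Steps 2 and 3: train and test on each fold, then average per combination.
    val meanScores = grid.map { case (a, l) =>
      val foldScores = (0 until k).map { f =>
        val train = indexed.collect { case (r, i) if i % k != f => r }
        val test = indexed.collect { case (r, i) if i % k == f => r }
        score(fit(train, a, l), test)
      }
      ((a, l), foldScores.sum / k)
    }

    // Step 3: keep the combination with the highest average accuracy.
    val ((bestAlpha, bestLambda), _) = meanScores.maxBy(_._2)

    // Step 4: retrain on all of the data with the winning combination.
    (bestAlpha, bestLambda, fit(data, bestAlpha, bestLambda))
  }
}
```

Passing `fit` and `score` as functions keeps the sketch independent of any particular model class; in a Spark setting they would wrap the training and evaluation calls on distributed data.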

Developers

Mike Bowles
Jake Belew
Ben Burford

Instructions for setting up the project in Eclipse

$ git clone git@github.com:jakebelew/spark-glmnet.git
(Create an Eclipse project)
$ cd spark-glmnet
(Note: if this is your first time running SBT this may take a while.)
$ sbt
> eclipse with-source=true
> exit
(In Eclipse, import project)
(To use the glmnet project's log4j file, which displays linear regression information more clearly:)
Project -> Properties -> Java Build Path -> Source -> Add Folder -> src/main/resources (select, OK) -> OK

Running the cross-validation example in Eclipse

Run org.apache.spark.examples.ml.LinearRegressionCrossValidatorExample in Eclipse.
* It will generate training data and apply the glmnet algorithm.
* It will run the data in K=2 folds, with alpha = 0.2 and 0.3, and 100 lambda values.
* It will choose the “Best fit” combination of alpha and lambda and generate a model on the entire training data set using the chosen alpha and lambda.
* It will generate test data and run the resulting model on the test data and display the accuracy of that model on the test data.
