
py-earth's Introduction


A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines algorithm, in the style of scikit-learn. The py-earth package implements Multivariate Adaptive Regression Splines using Cython and provides an interface that is compatible with scikit-learn's Estimator, Predictor, Transformer, and Model interfaces. For more information about Multivariate Adaptive Regression Splines, see the references below.

Now With Missing Data Support!

The py-earth package now supports missingness in its predictors. Just set allow_missing=True when constructing an Earth object.

Requesting Feedback

If there are other features or improvements you'd like to see in py-earth, please send me an email or open or comment on an issue. In particular, please let me know if any of the following are important to you:

  1. Improved speed
  2. Exporting models to additional formats
  3. Support for shared memory multiprocessing during fitting
  4. Support for cyclic predictors (such as time of day)
  5. Better support for categorical predictors
  6. Better support for large data sets
  7. Iterative reweighting during fitting

Installation

Make sure you have numpy and scikit-learn installed. Then do the following:

git clone https://github.com/scikit-learn-contrib/py-earth.git
cd py-earth
sudo python setup.py install

Usage

import numpy
from pyearth import Earth
from matplotlib import pyplot
    
#Create some fake data
numpy.random.seed(0)
m = 1000
n = 10
X = 80*numpy.random.uniform(size=(m,n)) - 40
y = numpy.abs(X[:,6] - 4.0) + 1*numpy.random.normal(size=m)
    
#Fit an Earth model
model = Earth()
model.fit(X,y)
    
#Print the model
print(model.trace())
print(model.summary())
    
#Plot the model
y_hat = model.predict(X)
pyplot.figure()
pyplot.plot(X[:,6],y,'r.')
pyplot.plot(X[:,6],y_hat,'b.')
pyplot.xlabel('x_6')
pyplot.ylabel('y')
pyplot.title('Simple Earth Example')
pyplot.show()

Other Implementations

I am aware of the following implementations of Multivariate Adaptive Regression Splines:

  1. The R package earth (coded in C by Stephen Milborrow): http://cran.r-project.org/web/packages/earth/index.html
  2. The R package mda (coded in Fortran by Trevor Hastie and Robert Tibshirani): http://cran.r-project.org/web/packages/mda/index.html
  3. The Orange data mining library for Python (uses the C code from 1): http://orange.biolab.si/
  4. The xtal package (uses Fortran code written in 1991 by Jerome Friedman): http://www.ece.umn.edu/users/cherkass/ee4389/xtalpackage.html
  5. MARSplines by StatSoft: http://www.statsoft.com/textbook/multivariate-adaptive-regression-splines/
  6. MARS by Salford Systems (also uses Friedman's code): http://www.salford-systems.com/products/mars
  7. ARESLab (written in Matlab by Gints Jekabsons): http://www.cs.rtu.lv/jekabsons/regression.html

The R package earth was the most useful to me in understanding the algorithm, particularly because of Stephen Milborrow's thorough and easy-to-read vignette (http://www.milbo.org/doc/earth-notes.pdf).

References

  1. Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67. http://www.jstor.org/stable/10.2307/2241837
  2. Stephen Milborrow. Derived from mda:mars by Trevor Hastie and Rob Tibshirani. (2012). earth: Multivariate Adaptive Regression Spline Models. R package version 3.2-3. http://CRAN.R-project.org/package=earth
  3. Friedman, J. (1993). Fast MARS. Stanford University Department of Statistics, Technical Report No 110. https://statistics.stanford.edu/sites/default/files/LCS%20110.pdf
  4. Friedman, J. (1991). Estimating functions of mixed ordinal and categorical variables using adaptive splines. Stanford University Department of Statistics, Technical Report No 108. http://media.salford-systems.com/library/MARS_V2_JHF_LCS-108.pdf
  5. Stewart, G.W. Matrix Algorithms, Volume 1: Basic Decompositions. (1998). Society for Industrial and Applied Mathematics.
  6. Bjorck, A. Numerical Methods for Least Squares Problems. (1996). Society for Industrial and Applied Mathematics.
  7. Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning (2nd Edition). (2009). Springer Series in Statistics.
  8. Golub, G., & Van Loan, C. Matrix Computations (3rd Edition). (1996). Johns Hopkins University Press.

References 7, 2, 1, 3, and 4 contain discussions likely to be useful to users of py-earth. References 1, 2, 6, 5, 8, 3, and 4 were useful during the implementation process.

py-earth's People

Contributors

1pakch, aleon1138, colcarroll, jcrudy, mattlewissf, mehdidc


py-earth's Issues

installation on windows

Hi,

I am finding it hard to install this package on Windows.
I have Anaconda, numpy, and Cygwin installed.
The installation stalls with "c:....\gcc.exe failed with exit status 1".

thanks
Madhu

Optimize reorderxby with BLAS for speed

The _utilreorderxby function is a bottleneck for large sample sizes. Using BLAS for the row copies and swaps might speed it up. Or, is there a faster algorithm?

Pass order vectors around instead of reordering data

Currently, the entire data set is reordered for each variable at each iteration. This process is currently the biggest performance bottleneck by far. It would be better if all variable orders were determined upfront and stored, then passed into valid_knots and best_knot. This requires modification to those two methods. It will probably make both methods slower, but hopefully not as slow as reordering the data all the time.
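
The idea above could look roughly like the following. This is a NumPy-only sketch, not py-earth's Cython code; the helper name precompute_orders is hypothetical, and the ascending sort direction is an assumption.

```python
import numpy as np

def precompute_orders(X):
    """Hypothetical helper: column j of the result sorts X[:, j] in ascending order."""
    # One argsort per column, done once up front; a stable sort keeps ties deterministic
    return np.argsort(X, axis=0, kind="stable")

rng = np.random.default_rng(0)
X = rng.uniform(-40, 40, size=(8, 3))
y = rng.normal(size=8)
orders = precompute_orders(X)

# Inside the knot search, index through the stored order
# instead of physically reordering X and y for each variable
j = 1
x_sorted = X[orders[:, j], j]
y_sorted = y[orders[:, j]]
```

The trade-off is the one described above: every access goes through an index vector, which is slower per element than reading contiguous, already-sorted data, but the sort itself is paid only once.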

Add PMML support

The ability to read and write models in PMML would do a lot to improve portability.

Pyearth stuck in fitting

Dear Jason,

I just tried a couple of fitting examples on which to apply pyearth.

This is the code:
import numpy
from pyearth import Earth
from matplotlib import pyplot

# Create some fake data
numpy.random.seed(seed=0)

N = 100
X = numpy.linspace(-10, 10, N)
numpy.random.shuffle(X)
y = numpy.zeros(X.shape)
sigma = 1
for i in range(y.size):
    # Noise drawn with mean 1 and standard deviation sigma
    r = numpy.random.normal(1, sigma)
    y[i] = numpy.sin(X[i]) + r

# Fit an Earth model
model = Earth()
print('fitting')
model.fit(X, y)
print('done')

print("\n*** SUMMARY ***")
print(model.summary())

# Plot the model
y_hat = model.predict(X)
pyplot.figure()
pyplot.plot(X, y, 'r.')
pyplot.plot(X, y_hat, 'b.')
pyplot.show()

It generates a noisy sine function and tries to fit it. There is an issue I'd like to point out. If the number of samples is not large enough, e.g. N = 100, the algorithm may crash and raise a segmentation fault. Whether it succeeds depends on the data: if I do not set the random seed, it sometimes crashes and sometimes does not.

I hope this information may help you!

Best,

Marco

py-earth fails for multi-column regression

py-earth fails when fitting a multi-column [m x n] label matrix with two-dimensional training data. Supporting this could be considered an enhancement.

model.fit(X_train, y_train)
File "/Library/Python/2.7/site-packages/pyearth/earth.py", line 331, in fit
X, y, sample_weight = self._scrub(X, y, sample_weight)
File "/Library/Python/2.7/site-packages/pyearth/earth.py", line 262, in _scrub
y = y.reshape(y.shape[0])
ValueError: total size of new array must be unchanged
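
The reshape in the traceback can only succeed when y has exactly one element per row. The following NumPy-only sketch reproduces that behaviour and shows one possible workaround, which is an assumption on my part rather than anything py-earth does: treating each label column as its own 1-D target.

```python
import numpy as np

m, n_outputs = 5, 2
y_single = np.zeros((m, 1))
y_multi = np.zeros((m, n_outputs))

# A single-column label matrix flattens cleanly to shape (m,)
flat = y_single.reshape(y_single.shape[0])

# A multi-column label matrix has m * n_outputs elements, so the same reshape raises
try:
    y_multi.reshape(y_multi.shape[0])
    reshape_failed = False
except ValueError:
    reshape_failed = True

# Possible workaround (assumption): split the label matrix into 1-D targets
columns = [y_multi[:, k] for k in range(y_multi.shape[1])]
```

Each element of columns could then be passed to its own Earth().fit(X_train, column) call, at the cost of fitting one model per output.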

The coefficients aren't coming out right

y <- abs(x6 - 4) + E

Gives this result:

Forward Pass

iter  parent  var  knot  mse         terms  gcv      rsq    grsq
0     -       -    -     150.123858  1      150.425  0.000  0.000
1     0       6    431   0.911382    3      0.928    0.994  0.994
2     0       9    981   0.903521    5      0.935    0.994  0.994
Stopping Condition: 2

Pruning Pass

iter  bf  terms  mse     gcv      rsq    grsq
0     -   5      0.90    0.935    0.994  0.994
1     4   4      0.91    0.933    0.994  0.994
2     3   3      0.91    0.928    0.994  0.994
3     2   2      140.46  141.875  0.064  0.057
4     1   1      150.12  150.425  0.000  0.000
Selected iteration: 2

Earth Model

(Intercept)    -0.022567565678
h(x6-4.0982)   -0.0736813881288
h(4.0982-x6)   -0.0328571125806
h(x9+38.6904)  -0.0066944613208
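
For reference, the noiseless target can be written exactly with two hinge terms, abs(x6 - 4) = h(x6 - 4) + h(4 - x6) with h(z) = max(0, z), so coefficients near 1 on both hinges would be expected. The NumPy check below compares that representation with the coefficients printed above; the x9 term is left out, and the evaluation grid is an arbitrary choice.

```python
import numpy as np

def h(z):
    """Hinge function used in MARS basis terms: max(0, z)."""
    return np.maximum(0.0, z)

x6 = np.linspace(-40.0, 40.0, 201)
target = np.abs(x6 - 4.0)

# Exact representation of the noiseless target with unit coefficients
expected = 1.0 * h(x6 - 4.0) + 1.0 * h(4.0 - x6)

# Prediction implied by the printed coefficients (x9 term omitted)
reported = (-0.022567565678
            - 0.0736813881288 * h(x6 - 4.0982)
            - 0.0328571125806 * h(4.0982 - x6))

max_error_expected = np.max(np.abs(expected - target))
max_error_reported = np.max(np.abs(reported - target))
```

The printed coefficients are far from the target even though the trace reports rsq = 0.994, which suggests the problem is in how the coefficients are computed or displayed rather than in the knot selection.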

Support non-diagonal weight matrices

This can be accomplished by premultiplying basis functions by the Cholesky square root of the weight matrix (a generalization of how diagonal weight matrices are currently handled). See equation (7) in [1].
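
A small NumPy sketch of the premultiplication, assuming a symmetric positive definite weight matrix W. The basis matrix B here is random stand-in data, not actual MARS basis functions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 50, 3
B = rng.normal(size=(m, k))   # stand-in for the basis function matrix
y = rng.normal(size=m)

# Build a symmetric positive definite, non-diagonal weight matrix
A = rng.normal(size=(m, m))
W = A @ A.T + m * np.eye(m)

# W = L @ L.T, so (y - B b)^T W (y - B b) = ||L.T @ (y - B b)||^2,
# and ordinary least squares on the premultiplied problem solves the weighted one
L = np.linalg.cholesky(W)
beta_chol, *_ = np.linalg.lstsq(L.T @ B, L.T @ y, rcond=None)

# Reference solution from the weighted normal equations
beta_normal = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
```

For a diagonal W the Cholesky factor is just the element-wise square root of the weights, which recovers the diagonal case mentioned above.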

[1] Green, P. J. (2011). Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and some Robust and Resistant Alternatives. Journal of The Royal Statistical Society, Series B (Methodological), 46(2), 149–192.

Examples failing, or Not Compatible with Windows

Both of the usage examples result in the same error for me..

import numpy
import cProfile
from pyearth import Earth
from matplotlib import pyplot

numpy.random.seed(2)
m = 1000
n = 10
X = 80*numpy.random.uniform(size=(m,n)) - 40
y = numpy.abs(X[:,6] - 4.0) + 1*numpy.random.normal(size=m)
model = Earth(max_degree = 1)
model.fit(X,y)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\site-packages\pyearth\earth.py", line 312, in fit
self.forward_pass(X, y)
File "C:\Python27\lib\site-packages\pyearth\earth.py", line 383, in forward_pass
forward_passer = ForwardPasser(X, y, **args)
File "_forward.pyx", line 67, in pyearth._forward.ForwardPasser.init (pyearth/_forward.c:3146)
File "_forward.pyx", line 96, in pyearth._forward.ForwardPasser.init_linear_variables (pyearth/_forward.c:3698)
ValueError: Buffer dtype mismatch, expected 'INT_t' but got 'long long'

Add user specified linear terms

Currently the ForwardPasser automatically decides whether a variable should enter linearly. There should also be a way for users to specify that a variable should enter linearly.

Traceback in import pyearth when using sklearn.__version__ = "0.16-git"

I saw this issue on Ubuntu 12.04 (LTS Kernel) with a source install of sklearn (version 0.16-git). The setup process completes correctly, but I see the following traceback when importing pyearth:

>>> import pyearth
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyearth/__init__.py", line 8, in <module>
    from .earth import Earth
  File "/usr/local/lib/python2.7/dist-packages/pyearth/earth.py", line 5, in <module>
    from sklearn.utils.validation import assert_all_finite, safe_asarray
ImportError: cannot import name safe_asarray

It looks like sklearn.utils.validation.safe_asarray is an internal tool to sklearn, and as mentioned on their website, is not guaranteed to be stable between releases.
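
One way to tolerate the missing name is a guarded import with a local fallback. The sketch below is an assumption about what a fix could look like, not the change that was actually made in pyearth; the fallback handles dense input only, whereas the scikit-learn helper, as far as I know, also passed sparse matrices through.

```python
import numpy as np

try:
    # Present in older scikit-learn releases only
    from sklearn.utils.validation import safe_asarray
except ImportError:
    def safe_asarray(X, dtype=None, order=None):
        """Minimal stand-in (assumption): dense conversion via numpy.asarray."""
        return np.asarray(X, dtype=dtype, order=order)

X = safe_asarray([[1, 2], [3, 4]], dtype=np.float64)
```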

Earth.fit can't handle non-continuous X columns

It produces nans. I'm guessing this has to do with adjacent knot candidates being exactly equal, which generally only happens with non-continuous distributions, but might also happen with zero-padded data.
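
If the guess about tied knot candidates is right, the situation is easy to reproduce in isolation. The NumPy sketch below counts exactly equal neighbours in a sorted discrete column and shows that deduplicating with numpy.unique yields strictly increasing candidates. It only illustrates the suspected cause and is not a fix taken from py-earth.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=20).astype(float)   # non-continuous column

# Adjacent values in sorted order that are exactly equal
x_sorted = np.sort(x)
n_tied = int(np.sum(np.diff(x_sorted) == 0))

# Deduplicated candidates: sorted and strictly increasing
candidate_knots = np.unique(x)
```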

Change the way xlabels are specified

The xlabels argument is currently treated as a hyperparameter. It should probably be treated differently. Issues:

  1. In the Pandas case, the labels come with the data and should therefore be set in fit.
  2. In the Pandas case, column order may be less relevant than column name.

At the very least, I should make it easy to set xlabels in the fit method.

Implement upfront data validation/conversion

Currently there is no checking on input data. If you put in data in the wrong format, the first time you know about it might be when you get a segfault or, worse, an incorrect model.
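
A minimal sketch of what such a check might do before any data reaches the compiled code. The function name validate_inputs and the specific rules (float64 conversion, promoting a 1-D X to a single column, rejecting NaN and infinity) are assumptions for illustration.

```python
import numpy as np

def validate_inputs(X, y):
    """Hypothetical helper: convert to float64 and check shapes and finiteness."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if X.ndim == 1:
        X = X.reshape(-1, 1)   # treat a 1-D X as a single predictor
    if X.ndim != 2:
        raise ValueError("X must be 2-dimensional, got shape %r" % (X.shape,))
    if y.ndim != 1:
        raise ValueError("y must be 1-dimensional, got shape %r" % (y.shape,))
    if X.shape[0] != y.shape[0]:
        raise ValueError("X has %d rows but y has %d" % (X.shape[0], y.shape[0]))
    if not (np.isfinite(X).all() and np.isfinite(y).all()):
        raise ValueError("X and y must not contain NaN or infinity")
    return X, y

X_ok, y_ok = validate_inputs([1, 2, 3], [4, 5, 6])
```

Failing early with a ValueError is much easier to diagnose than a segfault, and rejecting non-finite values would need to be relaxed for predictors when allow_missing=True is in use.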

pip fails to install at same time of numpy

If neither numpy nor pyearth is installed and we try to install both with:

$ pip install -r requirements.txt

where requirements.txt:

numpy
scipy
-e git+https://github.com/jcrudy/py-earth.git#egg=py-earth

This throws "ImportError: no module named numpy" when installing pyearth. Refer to this issue and this issue to see how pip works in this case.

Build fails because of missing file

I might be missing something, but setup.py references examples/vFunctionExample.py, which is not in the repository:

'scripts':['examples/vFunctionExample.py'],
