partofthething / ace

Python package for performing the Alternating Conditional Expectation (ACE) regression

License: MIT License

ace's Introduction

The ace Package

ace is an implementation of the Alternating Conditional Expectation (ACE) algorithm [Breiman85], which can be used to find otherwise difficult-to-find relationships between predictors and responses and as a multivariate regression tool.

The code for this project, as well as the issue tracker, etc. is hosted on GitHub. The documentation is hosted at http://partofthething.com/ace.

What is it?

ACE can be used for a variety of purposes. With it, you can:

build easy-to-evaluate surrogate models of data. For example, if you are optimizing input parameters to a complex and long-running simulation, you can feed the results of a parameter sweep into ACE to get a model that will instantly give you predictions of results of any combination of input within the parameter range.

expose interesting and meaningful relations between predictors and responses from complicated data sets. For instance, if you have survey results from 1000 people and you want to see how one answer is related to a bunch of others, ACE will help you.

The fascinating thing about ACE is that it is a non-parametric multivariate regression tool. This means that it doesn't make any assumptions about the functional form of the data. You may be used to fitting polynomials or lines to data. Well, ACE doesn't do that. It uses an iteration with a variable-span scatterplot smoother (implementing local least squares estimates) to figure out the structure of your data. As you'll see, that turns out to be a powerful difference.
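To make the iteration concrete, here is a minimal sketch of the alternating loop. It is not this package's implementation: the function names are made up for illustration, and a crude fixed-span running mean stands in for the variable-span SuperSmoother. The idea is to smooth the current response transform against each predictor in turn, then smooth the sum of the predictor transforms against the response, and repeat.

```python
import numpy as np

def running_mean_smooth(x, z, span=0.3):
    """Crude fixed-span smoother (an assumed stand-in for the SuperSmoother):
    average z over the neighbours of each point, taken in order of x."""
    n = len(x)
    order = np.argsort(x)
    half = max(1, int(span * n / 2))
    z_sorted = z[order]
    smooth_sorted = np.empty(n)
    for k in range(n):
        lo, hi = max(0, k - half), min(n, k + half + 1)
        smooth_sorted[k] = z_sorted[lo:hi].mean()
    out = np.empty(n)
    out[order] = smooth_sorted  # put results back in the original order
    return out

def ace_sketch(xs, y, n_iter=20):
    """Alternate between updating each phi_i(x_i) and theta(y)."""
    theta = (y - y.mean()) / y.std()
    phis = [np.zeros(len(y)) for _ in xs]
    for _ in range(n_iter):
        for i, x in enumerate(xs):
            # Smooth what the other transforms leave unexplained against x_i.
            partial = theta - sum(p for j, p in enumerate(phis) if j != i)
            phis[i] = running_mean_smooth(x, partial)
            phis[i] -= phis[i].mean()
        # Smooth the sum of predictor transforms against y, then standardize.
        theta = running_mean_smooth(y, sum(phis))
        theta = (theta - theta.mean()) / theta.std()
    return phis, theta

# Synthetic data: a strongly nonlinear relation between x and y.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, 200)
y = np.exp(3.0 * x) + rng.normal(0.0, 0.1, 200)

phis, theta = ace_sketch([x], y)
lin_corr = np.corrcoef(x, y)[0, 1]            # correlation of the raw data
ace_corr = np.corrcoef(phis[0], theta)[0, 1]  # correlation of the transforms
print(lin_corr, ace_corr)
```

The transformed variables should correlate more strongly than the raw ones, which is the sense in which ACE finds structure without assuming a functional form.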

Installing it

ace is available in the Python Package Index, and can be installed simply with the following.

On Linux:

sudo pip install ace

On Windows, use:

pip install ace

Directly from source:

git clone git@github.com:partofthething/ace.git
cd ace
python setup.py install

Note

If you don't have git, you can just download the source directly from here.

You can verify that the installation completed successfully by running the automated test suite in the install directory:

python -m unittest discover -bv

Using it

To use, get some sample data:

from ace.samples import wang04
x, y = wang04.build_sample_ace_problem_wang04(N=200)

and run:

from ace import model
myace = model.Model()
myace.build_model_from_xy(x, y)
myace.eval([0.1, 0.2, 0.5, 0.3, 0.5])

For some plotting (matplotlib required), try:

from ace import ace
ace.plot_transforms(myace.ace, fname = 'mytransforms.pdf')
myace.ace.write_transforms_to_file(fname = 'mytransforms.txt')

Note that you could alternatively have loaded your data from a whitespace delimited text file:

myace.build_model_from_txt(fname = 'myinput.txt')

Warning

The more data points ACE is given as input, the better the results will be. Be careful with fewer than about 50 data points.

Demo

A combination of various functions with noise is shown below:

Given just those points and zero knowledge of the underlying functions, ACE comes back with this:

A longer version of this demo is available in the Sample ACE Problems section.

Other details

This implementation of ACE isn't as fast as the original FORTRAN version, but it can still crunch through a problem with 5 independent variables having 1000 observations each in about 15 seconds. Not bad.

ace also contains a pure-Python implementation of Friedman's SuperSmoother [Friedman82], the variable-span smoother mentioned above. This can be useful on its own for smoothing scatterplot data.

History

The ACE algorithm was published in 1985 by Breiman and Friedman [Breiman85], and the original FORTRAN source code is available from Friedman's webpage.

Motivation

Before this package, the ACE algorithm had only been available in Python by using the rpy2 module to load the acepack package of the R statistical language. This package is a pure-Python rewrite of the ACE algorithm based on the original publication, using modern software practices. This package is slower than the original FORTRAN code, but it is easier to understand. This package should be suitable for medium-weight data and as a learning tool.

For the record, it is also quite easy to run the original FORTRAN code in Python using f2py.

About the Author

This package was originated by Nick Touran, a nuclear engineer specializing in reactor physics. He was exposed to ACE by his thesis advisor, Professor John Lee, and used it in his Ph.D. dissertation to evaluate objective functions in a multidisciplinary design optimization study of nuclear reactor cores [Touran12].

License

This package is released under the MIT License, reproduced here.

References

Breiman85: L. BREIMAN and J. H. FRIEDMAN, "Estimating optimal transformations for multiple regression and correlation," Journal of the American Statistical Association, 80, 580 (1985). [Link1]
Friedman82: J. H. FRIEDMAN and W. STUETZLE, "Smoothing of scatterplots," ORION-003, Stanford University, (1982). [Link2]
Touran12: N. TOURAN, "A Modal Expansion Equilibrium Cycle Perturbation Method for Optimizing High Burnup Fast Reactors," Ph.D. dissertation, Univ. of Michigan, (2012). [The Thesis]
Wang04: D. WANG and M. MURPHY, "Estimating optimal transformations for multiple regression using the ACE algorithm," Journal of Data Science, 2, 329 (2004). [Link3]

ace's Issues

wang04 random seed too big

from ace import ace
import numpy as np
np.__version__

'1.10.2'

from ace.samples import wang04
x, y = wang04.build_sample_ace_problem_wang04(N=200)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-448739fc9080> in <module>()
----> 1 from ace.samples import wang04
      2 x, y = wang04.build_sample_ace_problem_wang04(N=200)

/Users/paulperry/anaconda/lib/python2.7/site-packages/ace/samples/wang04.py in <module>()
      7 from ace import ace
      8 
----> 9 numpy.random.seed(9287349087)
     10 
     11 def build_sample_ace_problem_wang04(N=100):

mtrand.pyx in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7781)()

ValueError: Seed must be between 0 and 4294967295

The validation problems have invalid imports

The "validation" problems have invalid imports.

Specifically, line 11 of smoother_friedman82.py should be:

from .. import smoother

This is consistent with the standards for both Python 2.7 and Python 3.5.

'Model' object has no attribute 'x'

This is a problem with the README.rst documentation.

import matplotlib.pyplot as plt
%matplotlib inline
from ace import ace
ace.plot_transforms(myace, fname = 'mytransforms.pdf')

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-3ea400bca898> in <module>()
      2 get_ipython().magic(u'matplotlib inline')
      3 from ace import ace
----> 4 ace.plot_transforms(myace, fname = 'mytransforms.pdf')

/Users/paulperry/anaconda/lib/python2.7/site-packages/ace/ace.pyc in plot_transforms(ace_model, fname)
    241     plt.rcParams.update({'font.size': 8})
    242     plt.figure()
--> 243     numCols = len(ace_model.x) / 2 + 1
    244     for i in range(len(ace_model.x)):
    245         plt.subplot(numCols, 2, i + 1)

AttributeError: 'Model' object has no attribute 'x'

Changing it to the following works:
ace.plot_transforms(myace.ace, fname = 'mytransforms.pdf')

validate_smoothers.py relies on "mace" a package that doesn't appear to exist

I'm not sure if this is an error on my part or from something else, but I cannot find the package "mace" on PyPI.

This script doesn't appear to ever run, though, so I'm not sure it is too important.

random seed is too large

The random seed used in many of the files is too large when using windows 64-bit.

    numpy.random.seed(9287349087)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy\random\mtrand\mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
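One possible workaround, sketched here as an assumption rather than as the fix adopted by the package, is to fold the seed into the range quoted in the error message before passing it to numpy.random.seed. The helper name safe_seed is hypothetical.

```python
import numpy as np

MAX_SEED = 2**32 - 1  # upper bound quoted in the error message (4294967295)

def safe_seed(seed):
    """Fold a non-negative integer into the range 0..MAX_SEED."""
    return seed % (MAX_SEED + 1)

seed = safe_seed(9287349087)  # the seed from the traceback above
np.random.seed(seed)          # now within the accepted range
print(seed)
```

Any fixed seed inside the range would serve equally well; the modulo only keeps the link to the original value explicit.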

Deriving the transformation equations/forms

I am not sure if this is an issue, or a request for improvement/documentation.

In your ACE implementation, is there a way to expose directly the equations/forms for the transformations? If not, what would you recommend, running OLS in pairs (e.g. phi0-x0, phi1-x1, and so on)?
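As a rough sketch of the pairwise-fit idea raised in the question, one could fit a low-order polynomial to each (x_i, phi_i) pair by least squares. The arrays below are synthetic stand-ins for a predictor and its transform, not output from this package, and the polynomial degree is an assumption that would have to be chosen by inspecting each transform.

```python
import numpy as np

# Synthetic stand-in for one predictor and its (centered) transform.
rng = np.random.default_rng(0)
x0 = np.sort(rng.uniform(-1.0, 1.0, 200))
phi0 = x0**2 - np.mean(x0**2) + rng.normal(0.0, 0.01, x0.size)

# Least-squares fit of a quadratic to the (x0, phi0) pairs.
coeffs = np.polyfit(x0, phi0, deg=2)  # highest power first
phi0_fit = np.polyval(coeffs, x0)

# The residual size indicates whether the assumed form is adequate.
rms_residual = np.sqrt(np.mean((phi0 - phi0_fit) ** 2))
print(coeffs, rms_residual)
```

A large residual would suggest trying a different functional form for that particular transform.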

divide by zero encountered in double scalar

Hi there.
I am able to run the Wang example without problems.
However, when I try to adapt to use with my own data I get an error message, attached below:

Do you know what might be causing this? I am including my data to allow you to try and replicate the issue. I realize it is not a lot of data points, but I suspect that has nothing to do with the issue.

data_final.txt

Use ACE in a predictive sense?

Following up on a closed issue, I have more questions.
I went back and reviewed the literature papers; if I understand those examples correctly, I think that one could use ACE in a predictive sense in a couple of ways:

use the magnitude of the transforms as a measure of the strength of the relationship between the original independent predictors and the target, even though as you say, there are no functional forms for the transforms
predict the target given new measurements of the predictors, using the inverse relationship between theta and Y; e.g., using the example from Wang and Murphy:

or in pictorial way using my example from here:

Originally posted by @mycarta in #11 (comment)
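The second idea can be sketched as follows, under stated assumptions: the transforms are available as tabulated values on sorted grids, theta is monotonic in y so that it can be inverted, and linear interpolation is accurate enough. The function and the toy transforms are hypothetical illustrations and are not part of the package API.

```python
import numpy as np

def predict_from_transforms(x_new, x_grids, phi_grids, y_grid, theta_grid):
    """Predict y for one new observation from tabulated transforms.

    x_new: one value per predictor. Each x grid must be sorted ascending,
    and theta must be monotonic in y so the inverse lookup is well defined.
    """
    # Evaluate each phi_i at the new x_i and sum to estimate theta(y).
    theta_hat = sum(np.interp(xv, xg, pg)
                    for xv, xg, pg in zip(x_new, x_grids, phi_grids))
    # Invert theta(y) by interpolating y as a function of theta.
    order = np.argsort(theta_grid)
    return np.interp(theta_hat, theta_grid[order], y_grid[order])

# Toy transforms for y = exp(x1 + 2*x2): phi1 = x1, phi2 = 2*x2, theta = log(y).
x1_grid = np.linspace(0.0, 1.0, 50)
x2_grid = np.linspace(0.0, 1.0, 50)
y_grid = np.linspace(1.0, np.exp(3.0), 500)

y_pred = predict_from_transforms(
    [0.5, 0.25],
    [x1_grid, x2_grid],
    [x1_grid, 2.0 * x2_grid],
    y_grid,
    np.log(y_grid),
)
print(y_pred)
```

In the toy case the sum of the predictor transforms is 1.0, so the prediction should land close to exp(1).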

ACE giving wiggly results for Wang's test problem

The excellent test problem in [Wang, "Estimating Optimal Transformations for Multiple Regression Using the ACE Algorithm"] is acting up a little. On one hand, the basic shapes of the various components are being recovered. On the other hand, they are very noisy and not nearly as good as in the paper. This suggests something is slightly wrong in ACE still.

ACE algorithm is too slow

The initial implementation of ACE that's active does not make use of the update capability of the fixed-span smoother. This makes ACE really slow, as confirmed by this profiling result (made with gprof2dot):

Next step is to just modify the fixed-span smoother to update intelligently. Should be easy.
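The updating idea can be illustrated with a plain moving-window mean. This is a simplification: the actual smoother fits local lines, and the function names here are made up. Instead of recomputing each window from scratch, the updating version adds the points that enter the window and subtracts the ones that leave.

```python
import numpy as np

def running_mean_naive(z, half):
    """Recompute every window from scratch: O(n * window) work."""
    n = len(z)
    return np.array([z[max(0, k - half):min(n, k + half + 1)].mean()
                     for k in range(n)])

def running_mean_updating(z, half):
    """Slide the window and update a running total: O(n) work."""
    n = len(z)
    out = np.empty(n)
    lo, hi = 0, min(n, half + 1)  # window for the first point
    total = z[lo:hi].sum()
    for k in range(n):
        new_lo, new_hi = max(0, k - half), min(n, k + half + 1)
        while hi < new_hi:  # points entering on the right
            total += z[hi]
            hi += 1
        while lo < new_lo:  # points leaving on the left
            total -= z[lo]
            lo += 1
        out[k] = total / (hi - lo)
    return out

rng = np.random.default_rng(1)
z = rng.normal(size=500)
naive = running_mean_naive(z, half=25)
fast = running_mean_updating(z, half=25)
print(np.max(np.abs(naive - fast)))
```

The same bookkeeping extends to local linear fits by keeping running sums of x, y, x*y and x*x rather than a single total.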

ACE sometimes gives negative Maximal Correlation values

Maximal Correlation (MC) values should always be between 0 and 1. However, when I calculate the MC values of x1 and x2 with y for values of
x1 = [ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.]
x2 = [ 2., 5., 9., 7., 4., 8., 1., 6., 3., 10.]
y = [ 3., 9., 11., 8., 4., 15., 14., 20., 30., 32.]
I get a negative MC between x2 and y.

Running the same problem using the R library acepack yields an MC value within the proper range.

Python calculation:

def ACE(x, y):
    ''' 
    Output MCs: Maximal Correlations (MCs) for each variable x 
    Input x: list of 1D numpy arrays, one for each input variable
    Input y: 1D numpy array of responses
    '''
    ace_solver = ace.ACESolver()
    ace_solver.specify_data_set(x, y)
    ace_solver.solve()
    MCs = []  # maximal correlations, one per input variable
    for i in range(len(x)):
        (MC, Pval) = stats.pearsonr( ace_solver.x_transforms[i], ace_solver.y_transform )
        MCs.append( MC )
    return(MCs)

from ace import ace
from scipy import stats
import numpy as np
x = [np.array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]), np.array([ 2.,  5.,  9.,  7.,  4.,  8.,  1.,  6.,  3., 10.])]
y = np.array([ 3.,  9., 11.,  8.,  4., 15., 14., 20., 30., 32.])
MCs = ACE(x, y)
print('MCs = ', MCs)

yields
MCs = [0.9523, -0.0577]

Meaning the Maximal Correlation value between x2 and y is -0.058.

R acepack calculation:

library(acepack)
x1 = 1:10
x2 = c(2.,  5.,  9.,  7.,  4.,  8.,  1.,  6.,  3., 10.)
x <- cbind(x1, x2)
y = c( 3.,  9., 11.,  8.,  4., 15., 14., 20., 30., 32.)
ace_model = ace(x, y)
MC = cor(ace_model$tx, ace_model$ty)

yields MC values of

x1 0.9427068
x2 0.3442552

Giving a positive Maximal Correlation value between x2 and y of 0.344

ACE not giving proper numbers

The ACE algorithm is running and converging, but it doesn't converge to the right answer. What is the issue?

There's a good chance that I'm not applying the conditional expectations correctly. Right now I'm just using the supersmoother S(y|x) as a direct replacement for E(y|x). The FORTRAN code is too unreadable to see how it's supposed to be done.

Plotting in the ace.py:227 accepts only int while the output len(myace.ace.x)/2 is float

I ran the Wang sample and hit this issue because the plotting code computes
num_cols = len(ace_model.x) / 2 + 1, which evaluates to 3.5.
I then removed one of the arrays in x, so I now have 4 x samples, but plotting still threw the error because the result is treated as a float.

ValueError: Number of rows must be a positive integer, not 3.0

PS.: "Successfully installed ace-0.3.2"
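A likely cause is that / performs true division in Python 3 and therefore always returns a float. A minimal sketch of one possible fix, using floor division, is shown below. The helper name is hypothetical, and this is not necessarily how the package resolved the issue.

```python
def subplot_grid_rows(n_transforms, n_cols=2):
    """Number of subplot rows as an integer, using floor division."""
    return n_transforms // n_cols + 1

# With true division, 5 / 2 + 1 gives 3.5 and 4 / 2 + 1 gives 3.0 (both floats).
rows_for_five = subplot_grid_rows(5)
rows_for_four = subplot_grid_rows(4)
print(rows_for_five, rows_for_four)
```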