GPy

The Gaussian processes framework in Python.

Status

Branch                       travis-ci.org    ci.appveyor.com    coveralls.io        codecov.io
Default branch (devel)       travis-devel     appveyor-devel     coveralls-devel     codecov-devel
Deployment branch (deploy)   travis-deploy    appveyor-deploy    coveralls-deploy    codecov-deploy

What's new:

From now on we keep track of changes in CHANGELOG.md. If you want your changes to show up there, follow the guidelines below; in particular, tag your commits using the gitchangelog commit message format.

Contributing to GPy

We welcome any contributions to GPy; after all, it is an open source project. We use GitHub pull requests for contributions.

For an in-depth description of pull requests, please visit https://help.github.com/articles/using-pull-requests/ .

Steps to a successful contribution:

  1. Fork GPy: https://help.github.com/articles/fork-a-repo/
  2. Make your changes to the source in your fork.
  3. Make sure the guidelines are met.
  4. Set up tests for your code. We use unit tests in the testing subfolder of GPy. There is a good chance that a framework is already set up to test a new model in model_tests.py or a new kernel in kernel_tests.py; have a look at the source and you may be able to add your model (or kernel, or other component) as an additional test in the appropriate file. There are more frameworks for testing the other bits and pieces; head over to the testing folder and have a look. A minimal example test is sketched after this list.
  5. Create a pull request to the devel branch in GPy, see above.
  6. The tests will run on your pull request. In the comments section we can discuss the changes and help you with any problems; just let us know about them there.
  7. Once the pull request is accepted, your awesome new feature will be in the next GPy release :)
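
The minimal example test mentioned in step 4 might look like this (a hypothetical sketch, not an existing test in model_tests.py or kernel_tests.py; it only assumes the public GPy.models.GPRegression, GPy.kern.RBF and checkgrad API):

import numpy as np
import GPy

def test_my_new_model_gradients():
    np.random.seed(0)
    X = np.random.rand(20, 2)
    Y = np.sin(X[:, :1]) + 0.05 * np.random.randn(20, 1)
    m = GPy.models.GPRegression(X, Y, GPy.kern.RBF(2))
    # checkgrad() compares analytic and numerical gradients and returns True on success
    assert m.checkgrad()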

For any further questions/suggestions head over to the issues section in GPy.

Pull Request Guidelines

  • Check your code with PEP8 or pylint, and try to keep lines at most 80 columns wide.
  • Separate commits per smallest concern.
  • Each functionality/bugfix commit should contain code, tests, and doc.
  • We are using gitchangelog to keep track of changes and log new features, so if you want your changes to show up in the changelog, make sure you follow the gitchangelog commit message format (see the example messages after this list).
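
For reference, some example commit messages in the gitchangelog format (the subjects are made up; the pattern is ACTION: [AUDIENCE:] subject, with ACTION one of new/chg/fix and AUDIENCE one of dev/usr/pkg/test/doc):

new: usr: added an example for the periodic kernel
fix: dev: corrected the lengthscale gradient in my new kernel
chg: pkg: CHANGELOG update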

Support and questions to the community

Ask questions using the issues section.

Updated Structure

We have pulled the core parameterization out of GPy into a separate package called paramz, which contains the pure gradient-based model optimization.

If you installed GPy with pip, just upgrade the package using:

$ pip install --upgrade GPy

If you have the development version of GPy (installed using the develop or -e option), just install the dependencies by running

$ python setup.py develop

again, in the GPy installation folder.

A warning: this usually works, but sometimes distutils/setuptools opens a whole can of worms here, especially when compiled extensions are involved. If that is the case, it is best to clean the repo and reinstall.

Supported Platforms:

Python 3.9 and higher

Citation

@Misc{gpy2014,
  author =   {{GPy}},
  title =    {{GPy}: A Gaussian process framework in python},
  howpublished = {\url{http://github.com/SheffieldML/GPy}},
  year = {since 2012}
}

Pronunciation:

We like to pronounce it 'g-pie'.

Getting started: installing with pip

GPy requires a recent version (1.3.0 or later) of scipy, so we strongly recommend using the Anaconda Python distribution. With Anaconda you can install GPy as follows:

conda update scipy

Then, if needed, try:

sudo apt-get update
sudo apt-get install python3-dev
sudo apt-get install build-essential   
conda update anaconda

And finally,

pip install gpy

We've also had luck with Enthought. Install scipy 1.3.0 (or later) and then pip install GPy:

pip install gpy

If you'd like to install from source, or want to contribute to the project (e.g. by sending pull requests via GitHub), read on.

Troubleshooting installation problems

If you're having trouble installing GPy via pip install GPy, here is a probable solution:

git clone https://github.com/SheffieldML/GPy.git
cd GPy
git checkout devel
python setup.py build_ext --inplace
pytest .

Direct downloads

Download badges: PyPI version, source, Windows, MacOSX.

Saving models in a consistent way across versions:

Pickle is heavily dependent on class structure, so it behaves inconsistently across Python (and GPy) versions. Pickling is meant to serialize models within the same environment, not to store models on disk to be used later on.

To save a model, it is best to save its m.param_array to disk (using numpy's np.save). Additionally, save the script that creates the model. In that script you create the model with initialize=False as a keyword argument, with the data loaded as normal. You then set the model parameters via m.param_array[:] = loaded_params, using the previously saved parameters, and initialize the model with m.initialize_parameter(), which makes the model usable. Be aware that up to this point the model is in an inconsistent state and cannot be used to produce any results.

import numpy as np
import GPy

# let X, Y be data loaded above
# Model creation:
m = GPy.models.GPRegression(X, Y)
m.optimize()
# 1: Saving a model:
np.save('model_save.npy', m.param_array)
# 2: loading a model
# Model creation, without initialization:
m_load = GPy.models.GPRegression(X, Y, initialize=False)
m_load.update_model(False) # do not call the underlying expensive algebra on load
m_load.initialize_parameter() # Initialize the parameters (connect the parameters up)
m_load[:] = np.load('model_save.npy') # Load the parameters
m_load.update_model(True) # Call the algebra only once
print(m_load)

For Admins and Developers:

Running unit tests:

The new way of running the tests uses pytest and coverage:

Ensure pytest and coverage are installed:

pip install pytest coverage

Run the tests under coverage from the root directory of the repository:

coverage run travis_tests.py

Create coverage report in htmlcov/

coverage html

The coverage report is located in htmlcov/index.html

Legacy: using nosetests

Ensure nose is installed via pip:

pip install nose

Run nosetests from the root directory of the repository:

nosetests -v GPy/testing

or from within IPython

import GPy; GPy.tests()

or using setuptools

python setup.py test

Compiling documentation:

The documentation is stored in doc/, is written in the reStructuredText format, and is compiled with the Sphinx Python documentation generator.

The Sphinx documentation is available here: http://sphinx-doc.org/latest/contents.html

Installing dependencies:

To compile the documentation, first ensure that Sphinx is installed. On Debian-based systems, this can be achieved as follows:

sudo apt-get install python-pip
sudo pip install sphinx

Compiling documentation:

The documentation can be compiled as follows:

cd doc
sphinx-apidoc -o source/ ../GPy/
make html

alternatively:

cd doc
sphinx-build -b html -d build/doctrees -D graphviz_dot='<path to dot>' source build/html

The HTML files are then stored in doc/build/html

Commit new patch to devel

If you want to merge a branch into devel, make sure the following steps are carried out:

  • Create a local branch from the pull request and merge the current devel in.
  • Look through the changes on the pull request.
  • Check that tests are there and are checking code where applicable.
  • [optional] Make changes if necessary and commit and push to run tests.
  • [optional] Repeat the above until tests pass.
  • [optional] bump up the version of GPy using bumpversion. The configuration is done, so all you need is bumpversion [major|minor|patch].
  • Update the changelog using gitchangelog: gitchangelog > CHANGELOG.md
  • Commit the changelog changes as a silent update: git commit -m "chg: pkg: CHANGELOG update" CHANGELOG.md
  • Push the changes into devel.

A usual workflow should look like this:

$ git fetch origin
$ git checkout -b <pull-origin>-devel origin/<pull-origin>-devel
$ git merge devel
$ coverage run travis_tests.py

Make changes so that the tests cover corner cases (if statements, None arguments, etc.). Then we are ready to make the last changes for the changelog and versioning:

$ git commit -am "fix: Fixed tests for <pull-origin>"
$ bumpversion patch # [optional]
$ gitchangelog > CHANGELOG.md
$ git commit -m "chg: pkg: CHANGELOG update" CHANGELOG.md

Now we can merge the pull request into devel:

$ git checkout devel
$ git merge --no-ff <pull-origin>-devel
$ git push origin devel

This will update the devel branch of GPy.

Deploying GPy

Deployment is fully automated. All you need to do is create a pull request from devel to deploy, wait for the tests to finish (successfully!), and merge the pull request. This will update the package on PyPI for all platforms fully automatically.

Funding Acknowledgements

Current support for the GPy software comes through the following projects.

Previous support for the GPy software came from the following projects:

  • BBSRC Project No BB/K011197/1 "Linking recombinant gene sequence to protein product manufacturability using CHO cell genomic resources"
  • EU FP7-KBBE Project Ref 289434 "From Data to Models: New Bioinformatics Methods and Tools for Data-Driven Predictive Dynamic Modelling in Biotechnological Applications"
  • BBSRC Project No BB/H018123/2 "An iterative pipeline of computational modelling and experimental design for uncovering gene regulatory networks in vertebrates"
  • Erasysbio "SYNERGY: Systems approach to gene regulation biology through nuclear receptors"

gpy's People

Contributors

adamian, adhaka, ajgpitch, alansaul, alessandratosi, alexgrig, beckdaniel, bobturneruk, bwengals, cdguarnizo, ebilionis, ekalosak, esiivola, frb-yousefi, jameshensman, jamesmcm, javdrher, jayanthkoushik, jbect, kolanich, lawrennd, lionfish0, martinbubel, mikecroucher, msbauer, mzwiessele, nfusi, ric70x7, thangbui, zhenwendai


gpy's Issues

Memory blows up when running optimizer

Greetings,

I am trying to use GP_regression on a relatively small dataset (1832 instances, 17 features) but every time I run the optimize function on a model the memory blows up to the point it starts to swap (I am using an Intel i5, 4GB RAM, Ubuntu 11.10 32bit). This happens with all optimizers. The only constraint I am using is the "constrain_positive" on all parameters.

I managed to replicate this issue using this code: https://gist.github.com/beckdaniel/5489270

I tried to track down the point where the memory starts to increase. I believe this is happening in the "_set_params_transformed" method of the "parameterised" class, which is called by both the optimizer objective function and its derivative. If I comment out both calls to "_set_params_transformed" in the "objective_function" and "objective_function_gradients" methods of the "model" class, the memory stops increasing.

I will continue to investigate this, but I believe I should open this issue so that maybe you can give more insight into why this is happening.

checkgrad output is ugly

Integrate the checkgrad output with the model printing.

It's not necessary to print the words 'ratio', 'numerical' etc on each row.

testing module missing from setup.py

Teo had a problem: when he was trying to import GPy, it couldn't find the testing module. It appears this is missing from the list of modules in setup.py. I thought it would discover anything with an __init__.py as a module, but apparently not. I believe this would break a new install on a new machine, so it should probably be fixed in master?

Alan

How to include examples?

Should we include the examples in the module? Then we can do
import GPy
GPy.examples.foo()

This isn't done currently.

ImportError: No module named SGD

When I use GPy-0.2 on my laptop I can't optimize a GP regression model (ImportError: No module named SGD ; arising from get_optimizer line 216). I have checked and it turns out that the .pyc file is not generated. I have tried to generate it manually but it does not help.

Is that a problem of my machine or an issue in GPy?

Thanks for your tips...

Gaussian.py, scaling and offset variables

It looks to me like these are currently called _mean and _std, which is misleading, because they aren't necessarily the _mean and the _std. Can we rename these _scale and _offset?

Need to Discuss Examples Provision

We need to discuss how we provide examples. Importantly, I think they shouldn't be just a brain dump or a test of a new feature. They should be there for end users to understand the code. But we need to decide whether to include them as a module or whatever ...

"Cross terms" for psi2 statistics

Adding together kernel functions (kernparts) brings up some extra interaction terms when computing the psi2 matrix. We're not computing these right now.

Note that they're not needed for a single covariance function combined with white noise, just when you're combining, say, rbf with linear (see the sketch below).
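
A sketch of the missing cross term in the usual notation, taking the per-point definition \psi_2 = \mathbb{E}_{q(x)}[k(x,Z)^\top k(x,Z)] (an illustration, not GPy's exact implementation):

\psi_2^{(k_1 + k_2)}
  = \mathbb{E}_{q(x)}\!\big[(k_1(x,Z) + k_2(x,Z))^\top (k_1(x,Z) + k_2(x,Z))\big]
  = \psi_2^{(1)} + \psi_2^{(2)}
    + \mathbb{E}_{q(x)}\!\big[k_1(x,Z)^\top k_2(x,Z) + k_2(x,Z)^\top k_1(x,Z)\big]

The last expectation is the cross term that is currently not computed. For a white-noise part it vanishes, since white noise is zero except exactly at the inducing inputs, which has measure zero under a continuous q(x); that is why a single covariance function plus white noise does not need it.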

TNC max_iters keyword

Need consistency in the optimizer interface. For example, max_iters can be passed but is ignored by (for example) tnc. At the very least we should throw a warning that it's being ignored (and perhaps translate it to a sensible number of function evaluations).

BGPLVM clang++

BGPLVM clang++ inline code not working/compiling on Mac OS.

Report:

  • Installed newest versions of clang, gcc and scipy.
  • Reinstalled GPy (rm `find GPy/GPy -name '*.pyc' -type f` and reimported GPy)

Nothing has helped so far; is this a known bug with weave?

Extend mdot for diagonal matrices

It would be cool if mdot could handle diagonal matrices well, i.e. computing

diag_A[:, None] * B   # diag_A holds the diagonal entries of A

instead of

np.dot(A, B)

Not sure if it's worth the overhead of checking whether a matrix is diagonal if this case is not hit frequently, though (it is in my code!). A rough sketch follows.
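
A self-contained numpy illustration of the proposed shortcut (purely illustrative, not mdot's actual implementation):

import numpy as np

diag_A = np.random.rand(4)           # diagonal entries of A
A = np.diag(diag_A)                  # the full diagonal matrix
B = np.random.rand(4, 3)

full = np.dot(A, B)                  # builds and multiplies the full matrix
fast = diag_A[:, None] * B           # scales each row of B by the matching diagonal entry
assert np.allclose(full, fast)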

sympykern fails randomly

I have an idea for fixing this:

Weave accepts a bunch of arguments. One of them is the code you'd like to run, the other is "support code", where you can define functions and stuff.

In our sympykern, the covariance function and its gradients are passed as support code.

Weave first hashes the code to see if it's already compiled. If it's only hashing the "code" and not the "support code", there's our bug.

To fix, define the covariance functions in the "code", by concatenating the code and support code.

Oh look Alan is at the top of the assignees list :)

James.

GP model predict

The predict function returns variances that have dimension model.D even if (due to the covariance definition) the output variances are identical in all output dimensions. This will lead to memory issues when GPLVM variances are required for models fitted to very high-dimensional outputs. Can we return (as the MATLAB code does) a num_data*1 vector of variances for this case (see the sketch below)?
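
A tiny numpy illustration of the requested reduction (hypothetical; var_full stands in for the (num_data, D) array predict currently returns, with all columns identical):

import numpy as np

var_full = np.tile(np.random.rand(5, 1), (1, 10))   # (num_data, D), identical columns
var_compact = var_full[:, :1]                       # the proposed (num_data, 1) return value
assert np.allclose(var_full, var_compact)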

Some Confusion in variable names in Likelihoods

It looks to me like the likelihoods are using Y as the variable that comes out of the GP, and data as the data as provided by the user. This clashes with the way we do this in mathematical notation, where Y is the data as provided and F is the intermediate variable that the GP models. I think we need to think about what the right naming is (I've been looking at Gaussian.py, so apologies if this is a special case; although even if it is, we need to make it consistent).

I'd like to see the following. Y is the data as presented by the user and F is the data as modelled by the GP internally. Would there be a problem with this?

Unit Tests for Kernels

When someone completes a new kernel, we need a set of unit tests that ensure all parts of it are functional (gradients, psi2 statistics [if implemented], etc.). The psi2 statistics could be checked approximately by sampling (see the sketch below), and then we could do gradient checks (see the old matlab code for the tests I did there).
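
A rough, self-contained sketch of checking a psi2-style statistic by sampling (pure numpy, unit-variance and unit-lengthscale RBF kernel; purely illustrative, not an existing GPy test):

import numpy as np

def rbf(X, Z):
    # RBF kernel with variance 1 and lengthscale 1
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 2))                    # inducing inputs
mu, var = np.zeros(2), 0.1 * np.ones(2)        # q(x) = N(mu, diag(var))

# Monte Carlo estimate of psi2 = E_q(x)[ k(x, Z)^T k(x, Z) ]
samples = mu + np.sqrt(var) * rng.normal(size=(100000, 2))
K = rbf(samples, Z)                            # (S, M)
psi2_mc = K.T @ K / len(samples)               # (M, M)
# Compare psi2_mc against the kernel's analytic psi2 (up to Monte Carlo error).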

latent plots for GPLVM

GPLVM currently raises NotImplementedError when we call plot_latent.

Nice features, please:

  • passing some labels to plot different classes with different markers.
  • shading the background to represent uncertainty in the projected output.

_log_likelihood_gradients_transformed

Previously, extract_gradients read as if its main role was to combine prior and likelihood gradients. In the new naming scheme this is _log_likelihood_gradients_transformed, which is really a different thing (getting the transformed gradients rather than the real ones). We need to put some thought into how to deal with this, perhaps with two functions (one for combining prior and likelihood gradients and another for doing the transformations?).

build fails (on Travis) due to plot_ARD

@nfusi wrote some lovely code to plot the significance of each input. Unfortunately, in Travis, there's no $DISPLAY set, so pylab things don't work.

The usual fix for no display is to set matplotlib in pdf mode, but that would be annoying for users.

Thoughts?

prod_orthogonal is inefficient

prod_orthogonal repeatedly computes the kernel matrices for each part: once in K(), once in dK_dtheta, etc. A caching scheme would make this much faster (and admittedly more complex).

New GP model

We should integrate EP_GP and GP_regression models into a single one. That way it will be easier to keep them both up to date.

Since the log marginal likelihood for an EP model can be written as the log likelihood of a regression model for a new variable Y* = v_tilde/tau_tilde, with a covariance matrix K* = K + diag(1./tau_tilde), plus a normalization term (see the sketch below), we can use most of the GP_regression code and just add the functions needed to call the EP algorithm.
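
In LaTeX, a sketch of the identity being referred to, in standard EP site-parameter notation (an illustration, not code from GPy):

\log Z_{\mathrm{EP}}
  = \log \mathcal{N}\!\left(\tilde{\mu} \,\middle|\, 0,\; K + \tilde{\Sigma}\right)
    + \text{normalization terms},
\qquad
\tilde{\mu} = \tilde{v}/\tilde{\tau},
\qquad
\tilde{\Sigma} = \operatorname{diag}(1/\tilde{\tau})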

Then we can also implement sparse _GP_regression and sparse_EP_GP into the same model.

For consistency between GP_regression and sparse_GP_regression, and also to make the differences from EP clearer, beta should be explicit in the non-sparse regression rather than being part of the kernel.

I'll open a branch called newGP for this.

objects are unpickleable

This is due to model.optimization_runs, which contains instances of GPy.inference.optimization.optimizer.

We need to find a smart way of dealing with that.

tied and fixed params at kernel level

The computation of gradients does not work when the kernel's parameters are tied or fixed.

import numpy as np
import GPy

K = GPy.kern.rbf(5, ARD=True)
K.tie_param('[01]')
K.constrain_fixed('2')

X = np.random.rand(5,5)
Y = np.ones((5,1))

m = GPy.models.GP_regression(X,Y,K)
m.checkgrad()

move to PHP

I think GPy is now stable enough to consider a move to PHP as the main language in GPy. Yes, it will be nearly impossible to move the entire project to the same language. Parts of model.py will need to be in Objective-C, and probably some parts of the inference package will have to be written in perl 3. But, yeah, pretty much everything else can be done with PHP.
Writing models in a text editor and running them using ipython is a bit of a pain, so I would suggest we move to a web-based enterprise-class form with clickable elements. All the plots of the posterior distribution can then be generated server-side and sent via email to the user who requested them.

I think these changes will significantly reduce the time to market of our modelling work, and will help us to evolve intuitive platforms that drive compelling convergence.

in linalg, we should make use of scipy's C/F ordered choices

scipy provides get_lapack_funcs (and get_blas_funcs). We can use these to automagically pick the correct (f|c)lapack routine: dpotri, dpotrf, etc. At the moment, it's all a bit voodoo.

A quick %timeit makes me think we can gain some performance too.
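
A small sketch of how this could look (illustrative only, not GPy's actual linalg code):

import numpy as np
from scipy.linalg import get_lapack_funcs

A = np.random.rand(5, 5)
A = A @ A.T + 5 * np.eye(5)                    # make it positive definite

# picks dpotrf for float64 input, spotrf for float32, etc.
(potrf,) = get_lapack_funcs(('potrf',), (A,))
L, info = potrf(A, lower=True)
assert info == 0                               # 0 means the factorization succeeded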

PCA initialization in GPLVM.py affects model.likelihood.Y

Initialization of GPLVM via PCA currently affects the actual data matrix (both model.likelihood.Y and model.likelihood.data). I believe this is because the mean is being subtracted from the matrix that is passed in (if it isn't zero-mean in the base case), but this is passed by reference?? I think Python can be pretty sneaky in generating these types of bugs (see the sketch below). I'm guessing that's the problem, but since I'm not 100% sure I haven't edited it.
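
A generic numpy illustration of the suspected aliasing problem (not GPy's actual GPLVM code):

import numpy as np

Y = np.random.rand(10, 3)
Y_ref = Y                          # a model keeping a reference, not a copy
Y_ref -= Y_ref.mean(axis=0)        # in-place centring silently changes the caller's Y

Y2 = np.random.rand(10, 3)
Y2_centred = Y2 - Y2.mean(axis=0)  # non-destructive: new array, Y2 is untouched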

gradcheck by param name

It's a bit of a pain gradchecking individual gradients in models with a lot of parameters (usually an interesting setting in which many models become unstable). We should be able to only gradcheck parameters matching a string.

The interface should be something like m.gradcheck('rbf', verbose=True)
