alexanderfabisch / gmr
Gaussian Mixture Regression
Home Page: https://alexanderfabisch.github.io/gmr/
License: BSD 3-Clause "New" or "Revised" License
In the R package flexmix, I can specify a model for the priors, rather than just a vector of numbers. The model can depend on some arbitrary set of associated variables, called concomitant variables.
https://cran.r-project.org/web/packages/flexmix/vignettes/mixture-regressions.pdf
The idea is that there are some explanatory variables which have some information guiding the prior distribution.
A general model class of finite mixtures of regression models is considered in the following. The
mixture is assumed to consist of K components where each component follows a parametric
distribution. Each component has a weight assigned which indicates the a-priori probability
for an observation to come from this component and the mixture distribution is given by the
weighted sum over the K components. If the weights depend on further variables, these are
referred to as concomitant variables.
flexmix is designed to be extensible, but extending it requires some level of expertise. Out of the box, it has only a single concomitant model form. It would be good to have a similar capability in Python.
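To make the idea concrete, here is a minimal numpy sketch (not part of gmr or flexmix) of a multinomial-logit gating function, the concomitant model form flexmix ships by default: the mixture weights become a softmax of a linear function of the concomitant variables.

```python
import numpy as np

def concomitant_weights(Z, coef):
    """Mixture weights from concomitant variables via multinomial logit.

    Z    : (n_samples, n_concomitant) concomitant variables.
    coef : (n_concomitant, n_components) gating coefficients.
    Returns a (n_samples, n_components) row-stochastic weight matrix.
    """
    logits = Z @ coef
    logits -= logits.max(axis=1, keepdims=True)  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)      # each row sums to 1

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))      # 5 observations, 2 concomitant variables
coef = rng.normal(size=(2, 3))   # gating for 3 mixture components
weights = concomitant_weights(Z, coef)
```

Each row of `weights` is a valid prior vector that now varies per observation, which is exactly the extra flexibility flexmix's concomitant model provides.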
sorry ?
Hey there,
I have survey data where each person makes multiple observations across different brands.
Is it possible to use this library to extract each person's coefficients/class membership, and the adjusted r2 scores for each class?
Usually, I'd use a software called Latent Gold and was hoping this might be the python way of performing the same analysis.
It's possible my thinking is backward, but I've long considered regression from an "imputation of missing values" point of view (e.g., Schneider, 2001; see Sec. 2). Given that, the docstrings for module functions mvn.condition():
Lines 524 to 528 in 53185ff
...and mvn.regression_coefficients():
Lines 489 to 493 in 53185ff
...seem confusing to me because you refer to the "input feature" as the feature I would associate with the missing data (dependent variables), and the "output feature" as the feature I would associate with the available data (independent variables).
What's more, it appears you might, at least implicitly, share my view since, when you call mvn.condition() from the MVN.predict() and MVN.condition() methods, you "invert" the indices.
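For reference, the standard conditioning formulas that both views share, in a small self-contained numpy sketch (independent of gmr's actual implementation): given the observed dimensions, the remaining ("missing") dimensions have an affine conditional mean and a reduced covariance.

```python
import numpy as np

def condition_gaussian(mean, cov, obs_idx, x_obs):
    """Conditional distribution of the remaining dims of N(mean, cov),
    given that the dims in obs_idx take the values x_obs."""
    n = len(mean)
    mis_idx = np.setdiff1d(np.arange(n), obs_idx)
    mu_o, mu_m = mean[obs_idx], mean[mis_idx]
    S_oo = cov[np.ix_(obs_idx, obs_idx)]
    S_mo = cov[np.ix_(mis_idx, obs_idx)]
    S_mm = cov[np.ix_(mis_idx, mis_idx)]
    K = S_mo @ np.linalg.inv(S_oo)        # the regression coefficients
    cond_mean = mu_m + K @ (x_obs - mu_o)
    cond_cov = S_mm - K @ S_mo.T
    return cond_mean, cond_cov

mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
m, C = condition_gaussian(mean, cov, np.array([0]), np.array([2.0]))
```

In the imputation reading, `obs_idx` holds the available (independent) dims and the returned distribution describes the missing (dependent) ones, which is the naming the docstrings seem to invert.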
I recently submitted a paper to the journal of open source software. Submitted papers are typically tightly coupled with Zenodo releases. Here is the latest release of GMR: https://zenodo.org/record/4889867 and here is the paper review issue: openjournals/joss-reviews#3054
In addition to myself, Zenodo also lists you, @jfsantos and @mralbu, as authors. That is why I need to check with you both whether you want to become authors of the JOSS paper as well. Both author lists should be as similar as possible.
@jfsantos The number of lines of code that you contributed and that survived into the latest version of gmr is 8, so I suggest it would not make sense to include you as an author, if that is OK with you. Do you want to remain an author of the Zenodo release, or should I remove your name there? I would be fine with either option.
@mralbu You contributed the sklearn interface and I adopted your idea for doing faster batch mean predictions. So I would ask you whether you want to become co-author of the paper. Have a look at the latest article proof here: openjournals/joss-reviews#3054 (comment) . The current state of the review is that both reviewers accepted the paper and we are currently only discussing the authorship details.
It seems that I've run into a bug: for a GMM with a single feature, is_in_confidence_region always returns False, and so sample_confidence_region never terminates. I first noticed this with a conditioned GMM, but one constructed by hand shows the same behavior. Example:
import gmr
gmm = gmr.GMM(n_components=1, priors=[1], means=[[0]], covariances=[[[1]]])
gmm.sample(1)                          # works fine
gmm.sample_confidence_region(1, 1.0)   # never returns
gmm.is_in_confidence_region([0], 1.0)  # False
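As a workaround sketch until this is fixed, the check can be done directly: a point lies inside the alpha-confidence region iff its squared Mahalanobis distance is at most the chi-square quantile with n_features degrees of freedom. This is my reading of what the check should compute, not gmr's actual code:

```python
import numpy as np
from scipy.stats import chi2

def in_confidence_region(x, mean, cov, alpha):
    """True if x lies inside the alpha-confidence ellipsoid of N(mean, cov)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = x - np.atleast_1d(mean)
    # Squared Mahalanobis distance of x from the mean:
    maha_sq = d @ np.linalg.solve(np.atleast_2d(cov), d)
    return maha_sq <= chi2.ppf(alpha, df=len(x))

inside = in_confidence_region([0.0], mean=[0.0], cov=[[1.0]], alpha=0.95)
```

With alpha=1.0 the chi-square quantile is infinite, so every point should be reported as inside, which is what the example above expects for the single-feature case.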
I'm trying to predict a next-step position (using latitude and longitude as attributes and target). I've tried the following:
pred = gmm.predict(len(X)+i, np.array([X[(num-1)+i]]))
where the first value is 10 and the second is array([[41.4051453, 2.1776344]]) with shape (1, 2).
However, I get this error:
AttributeError: 'list' object has no attribute 'shape'
What am I doing wrong?
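For what it's worth, that AttributeError usually means a plain Python list was passed where a numpy array is expected. A tiny sketch of the difference (and note that, as I understand gmr's API, the first argument of GMM.predict should be an array of input feature indices, not a count):

```python
import numpy as np

# A plain Python list has no .shape attribute, which triggers the error:
indices_list = [0, 1]
has_shape = hasattr(indices_list, "shape")   # False

# Converting to an ndarray fixes the AttributeError. gmm.predict(indices, X)
# then receives the indices of the input features (here 0 and 1, i.e.
# latitude and longitude) and a 2D query array:
indices = np.asarray(indices_list)
X_query = np.array([[41.4051453, 2.1776344]])
```

So `gmm.predict(np.array([0, 1]), X_query)` would be the shape of call to aim for, rather than passing an integer as the first argument.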
Once the model is trained, GMM.means has shape (n_components, n_features), but GMM.covariances seems to have shape (n_training_points, n_features, n_features).
Could it be that, even though len(covariances) == n_training_points, only the first n_components entries are meaningful? After reading the code, it seems the algorithms use only the first entries and ignore the rest.
Hi!
Thanks a lot for maintaining this great library! It is really convenient and elegant.
I wonder if it is possible to save a fitted model to a file?
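gmr does not ship a dedicated save function as far as I know, but since a fitted GMM is fully described by its priors, means, and covariances, pickling works. A sketch using an in-memory buffer (a dict stands in for the model's parameters here; a real gmr.GMM instance can be pickled the same way, assuming its attributes are plain numpy arrays):

```python
import io
import pickle
import numpy as np

# Stand-in for a fitted model's parameters:
params = {
    "priors": np.array([0.4, 0.6]),
    "means": np.zeros((2, 3)),
    "covariances": np.stack([np.eye(3)] * 2),
}

# Serialize and deserialize (replace the buffer with open(path, "wb"/"rb")
# to persist to disk):
buf = io.BytesIO()
pickle.dump(params, buf)
buf.seek(0)
restored = pickle.load(buf)
```

Saving the three arrays with `np.savez` and reconstructing the GMM from them would be an alternative that avoids pickle's versioning caveats.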
Thank you for your excellent work. I want to know if there is a more flexible GMR, similar to sklearn, that can restrict the covariance type to 'shared', 'spherical', or 'diag'. Sometimes we don't need a 'full' covariance matrix. I look forward to your reply.
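gmr does not currently expose a covariance_type option, but to illustrate what the restricted families mean, here is a sketch (the helper name is hypothetical) of projecting a full covariance onto the diag and spherical families as sklearn defines them:

```python
import numpy as np

def constrain_covariance(cov, kind):
    """Project a full covariance matrix onto a restricted family,
    mimicking sklearn's covariance_type options."""
    cov = np.asarray(cov, dtype=float)
    if kind == "diag":
        return np.diag(np.diag(cov))       # keep only per-feature variances
    if kind == "spherical":
        return np.mean(np.diag(cov)) * np.eye(cov.shape[0])  # one shared variance
    return cov                             # "full": unchanged

constrained = constrain_covariance(np.array([[1.0, 0.3],
                                             [0.3, 3.0]]), "spherical")
```

In an EM implementation, such a projection would be applied to each component's covariance after every M-step; 'shared' (sklearn's 'tied') would instead average the covariances across components.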
Hi,
The main part of this code is present, or rewritten, in the GMM implementation in scikit-learn. Is there any reason to continue developing it instead of modifying the sklearn version?
Hi, thank you for making this package. However, I fail to see how I can do a GMM regression with a dependent variable y.
I would like to perform a specific kind of sampling, and I'm not sure what is the best way to go about it. Say I have two variables (1d arrays) X and Y, and I have a GMM trained on the [X Y] dataset. Now I'd like to generate values for Y based on an array of values of X, but instead of just getting the mean I'd like to obtain multiple (let's say N) values sampled according to the mixture distribution. One way to accomplish this is as follows:
Y_sampled = np.empty((len(X), N))
for i in range(len(X)):
    Y_sampled[i, :] = gmm.condition([0], X[i]).sample(N)
However, this requires a loop over all values of X (which predict avoids). Is there a better/more performant way to get this same result?
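One possible answer, sketched in plain numpy rather than against gmr's API (so the parameter layout below is an assumption): the per-component conditional covariances do not depend on x, and the conditional means are affine in x, so both can be computed for the whole X array at once, leaving only a cheap per-component sampling step.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_conditional(priors, means, covs, x_idx, y_idx, X, n_samples, rng):
    """Draw n_samples of Y per row of X from p(Y | X) under a GMM.

    priors: (K,), means: (K, D), covs: (K, D, D), X: (n, len(x_idx)).
    Returns an array of shape (n, n_samples, len(y_idx)).
    """
    n, K = len(X), len(priors)
    gains, c_means, c_covs = [], [], []
    log_w = np.empty((n, K))
    for k in range(K):
        S_xx = covs[k][np.ix_(x_idx, x_idx)]
        S_yx = covs[k][np.ix_(y_idx, x_idx)]
        G = S_yx @ np.linalg.inv(S_xx)
        gains.append(G)
        # Conditional covariance is x-independent; conditional means for ALL
        # rows of X are computed in one matrix product (affine in x):
        c_covs.append(covs[k][np.ix_(y_idx, y_idx)] - G @ S_yx.T)
        c_means.append(means[k][y_idx] + (X - means[k][x_idx]) @ G.T)
        log_w[:, k] = np.log(priors[k]) + multivariate_normal.logpdf(
            X, mean=means[k][x_idx], cov=S_xx)
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)      # posterior component weights per x
    # Vectorized component choice per (row, sample) via inverse-CDF sampling:
    comp = (rng.random((n, n_samples, 1)) > np.cumsum(w, axis=1)[:, None, :]).sum(-1)
    out = np.empty((n, n_samples, len(y_idx)))
    for k in range(K):
        rows, cols = np.nonzero(comp == k)
        L = np.linalg.cholesky(c_covs[k] + 1e-12 * np.eye(len(y_idx)))
        z = rng.standard_normal((len(rows), len(y_idx)))
        out[rows, cols] = c_means[k][rows] + z @ L.T
    return out

rng = np.random.default_rng(0)
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.stack([np.eye(2)] * 2)
Y_sampled = sample_conditional(priors, means, covs, [0], [1],
                               X=np.array([[0.0], [3.0]]), n_samples=4, rng=rng)
```

The only remaining Python-level loop is over the K components, not over the rows of X, so this should scale much better than conditioning per sample.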
TODO
Hi.
I use my own dataset with gmr. My training set train has 188318 rows and 14 columns, my test set has 122000 rows and 14 columns, and my label y_train has shape (188318,).
Then, following the regression example you provide:
gmm.from_samples(train)
Y = gmm.predict(np.array([0]), y_train[:, np.newaxis])
I'm not sure why it returns NaN values. Usually we fit the model using train and y_train, then predict using the test data, right?
Hi,
I ran into an issue when using my own data for training; the error message is as follows. I also checked my data and confirmed that there are no NaNs in it:
Traceback (most recent call last):
File "C:/Users/FJL/Downloads/gmr-master/gmr-master/examples/Test2.py", line 43, in <module>
initial_means = kmeansplusplus_initialization(X_train, n_components, random_state)
File "C:\Users\FJL\Downloads\gmr-master\gmr-master\gmr\gmm.py", line 46, in kmeansplusplus_initialization
i = _select_next_center(X, centers, random_state, selected_centers, all_indices)
File "C:\Users\FJL\Downloads\gmr-master\gmr-master\gmr\gmm.py", line 58, in _select_next_center
return random_state.choice(all_indices, size=1, p=selection_probability)[0]
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Hi,
When I fit a GMM to data, I sometimes get the following error:
Traceback (most recent call last):
File "test_nan_problem.py", line 9, in <module>
model.from_samples(frame.values)
File "build/bdist.linux-x86_64/egg/gmr/gmm.py", line 94, in from_samples
File "build/bdist.linux-x86_64/egg/gmr/gmm.py", line 160, in to_responsibilities
File "build/bdist.linux-x86_64/egg/gmr/mvn.py", line 105, in to_probability_density
File "/usr/lib/python2.7/dist-packages/scipy/linalg/decomp_cholesky.py", line 81, in cholesky
check_finite=check_finite)
File "/usr/lib/python2.7/dist-packages/scipy/linalg/decomp_cholesky.py", line 20, in _cholesky
a1 = asarray_chkfinite(a)
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1022, in asarray_chkfinite
"array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs
The occurrence of the error depends on the data and on the GMM parameters; e.g. with the same data, the error may not occur when I use a different random state.
I am working on master branch, with python2.7 and numpy version 1.11.0. The code to reproduce the error is
import pandas as pd
from gmr import GMM
import random, time
frame = pd.read_csv("data.txt", sep=" ")
random_state = 1578569639
model = GMM(n_components=7, random_state=random_state)
model.from_samples(frame.values)
I attached the data file: data.txt
Best,
Dennis
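A common cause of this error is a component covariance collapsing to near-singularity during EM. The usual mitigation (which gmr may or may not expose as an option) is to add a small ridge to the covariance diagonal before factorizing; a sketch:

```python
import numpy as np

def regularized_cholesky(cov, eps=1e-6):
    """Cholesky factor of cov with a small ridge added to the diagonal,
    guarding against near-singular covariances produced by EM."""
    cov = np.asarray(cov, dtype=float)
    return np.linalg.cholesky(cov + eps * np.eye(cov.shape[0]))

# A singular covariance (rank 1) that plain cholesky would reject:
singular = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
L = regularized_cholesky(singular)
```

This explains why the failure depends on the random state: some initializations happen to put a component on a degenerate subset of the data, driving its covariance singular.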
This package certainly has potential, but there are quite a few issues that need to be addressed before I think it is ready for publication in JOSS. Starting from the top.
setup.py should use the install_requires keyword argument of setup rather than requires. I think you need 2 dependencies for this package: numpy and scipy. You might also have a dependency on matplotlib, since a number of functions require matplotlib. Scikit-learn is also required for some examples.
I'm not sure how, but my first run of the tests failed. Reinstalling using pip install -e . seemed to fix this. It might be an error from my virtual environment, but you might want to look into this.
============================= test session starts ==============================
platform linux -- Python 3.8.7+, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/sam/tmp/gmr
collected 49 items
gmr/tests/test_gmm.py ............................FFF [ 63%]
gmr/tests/test_mvn.py .................. [100%]
=================================== FAILURES ===================================
________________________ test_extract_mvn_negative_idx _________________________
def test_extract_mvn_negative_idx():
gmm = GMM(n_components=2, priors=0.5 * np.ones(2), means=np.zeros((2, 2)),
covariances=[np.eye(2)] * 2)
> assert_raises(ValueError, gmm.extract_mvn, -1)
E AttributeError: 'GMM' object has no attribute 'extract_mvn'
gmr/tests/test_gmm.py:427: AttributeError
________________________ test_extract_mvn_idx_too_high _________________________
def test_extract_mvn_idx_too_high():
gmm = GMM(n_components=2, priors=0.5 * np.ones(2), means=np.zeros((2, 2)),
covariances=[np.eye(2)] * 2)
> assert_raises(ValueError, gmm.extract_mvn, 2)
E AttributeError: 'GMM' object has no attribute 'extract_mvn'
gmr/tests/test_gmm.py:433: AttributeError
______________________________ test_extract_mvns _______________________________
def test_extract_mvns():
gmm = GMM(n_components=2, priors=0.5 * np.ones(2),
means=np.array([[1, 2], [3, 4]]), covariances=[np.eye(2)] * 2)
> mvn0 = gmm.extract_mvn(0)
E AttributeError: 'GMM' object has no attribute 'extract_mvn'
gmr/tests/test_gmm.py:439: AttributeError
=============================== warnings summary ===============================
../venv/lib/python3.8/site-packages/nose/importer.py:12
/home/sam/tmp/venv/lib/python3.8/site-packages/nose/importer.py:12: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
from imp import find_module, load_module, acquire_lock, release_lock
gmr/tests/test_gmm.py: 12 warnings
/home/sam/tmp/venv/lib/python3.8/site-packages/gmr/gmm.py:175: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
dtype=np.float) / self.n_components
gmr/tests/test_gmm.py: 1204 warnings
gmr/tests/test_mvn.py: 5 warnings
/home/sam/tmp/venv/lib/python3.8/site-packages/gmr/mvn.py:8: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
inv = np.ones(n_features, dtype=np.bool)
gmr/tests/test_mvn.py::test_unscented_transform_linear_transformation
gmr/tests/test_mvn.py::test_unscented_transform_linear_combination
gmr/tests/test_mvn.py::test_unscented_transform_projection_to_more_dimensions
gmr/tests/test_mvn.py::test_unscented_transform_quadratic
/home/sam/tmp/venv/lib/python3.8/site-packages/gmr/mvn.py:316: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
D = np.maximum(D, np.finfo(np.float).eps)
-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED gmr/tests/test_gmm.py::test_extract_mvn_negative_idx - AttributeError:...
FAILED gmr/tests/test_gmm.py::test_extract_mvn_idx_too_high - AttributeError:...
FAILED gmr/tests/test_gmm.py::test_extract_mvns - AttributeError: 'GMM' objec...
================= 3 failed, 46 passed, 1226 warnings in 5.11s ==================
Hi @AlexanderFabisch ,
By going through your algorithm I have noticed that the model fails to produce predictions when the magnitude of one of the variables exceeds 1000.
In my case this is not too much of an issue, as my variables can be scaled. But it may be worth taking a look at.
Thanks again for your work, GMR should definitely be part of Sklearn...
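Until the root cause is found, standardizing the data before fitting (and un-scaling predictions afterwards) is a reasonable workaround. A self-contained sketch of that preprocessing (the class is hypothetical; sklearn's StandardScaler does the same job):

```python
import numpy as np

class Standardizer:
    """Z-scores columns; keeps the statistics so predictions can be unscaled."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        self.std_[self.std_ == 0.0] = 1.0   # guard against constant columns
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

    def inverse_transform(self, Z):
        return Z * self.std_ + self.mean_

rng = np.random.default_rng(0)
X = rng.normal(loc=5000.0, scale=300.0, size=(100, 2))  # large-magnitude data
scaler = Standardizer().fit(X)
Z = scaler.transform(X)                  # fit the GMM on Z instead of X
X_back = scaler.inverse_transform(Z)     # map predictions back to data scale
```

Large raw magnitudes inflate the determinants and condition numbers of the covariance matrices, which plausibly triggers the numerical failure reported here.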
Hi, Alexander!
First of all, thanks for this package, it is really awesome! In particular, I appreciate the posterior sampling features.
Do you think a scikit-learn RegressorMixin could be a good additional feature?
I think it could be useful for integration with Pipelines and other scikit-learn tooling.
I made an attempt at it on this branch: GMMRegression
Please let me know if I can be of help implementing it.
Relevant examples at the end of this pull request: #28 from @mralbu
Did some additional comparisons with an experiment using sklearn.mixture.GaussianMixture machinery for fitting the regressor.
from sklego.mixture import GMMRegressor

np.set_printoptions(precision=4)

np.random.seed(2)
scores = []
for _ in range(10):
    gmr = GMMRegressor(n_components=2)
    gmr.fit(X, y)
    scores.append(gmr.score(X, y))
print(np.array(scores))
>> [0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478]

np.random.seed(2)
scores = []
for _ in range(10):
    gmr = GMMRegressor(n_components=2, init_params='random', max_iter=100)
    gmr.fit(X, y)
    scores.append(gmr.score(X, y))
print(np.array(scores))
>> [0.8157 0.8061 0.8221 0.8152 0.8221 0.8192 0.8479 0.8282 0.8251 0.7792]

Maybe using internal sklearn.mixture machinery might help ease numerical issues, though it would introduce sklearn as a hard dependency and might be out of scope. On the other hand, it would enable the introduction of other regressors such as BayesianGMMRegressor in an easy way, and would have familiar parameters (the same as in sklearn.mixture.GaussianMixture). Do you think exploring the use of sklearn.mixture inner workings would be interesting for gmr?
init_params="kmeans++" drastically improves the stability of our results; it is not the default initialization though.
Hello!
Could you provide references for Mixtures of Experts Regression? Like a book/paper to refer algorithm from which it was implemented.
Thanks!
Here is an example of a faster implementation:
Should be float, but is int:
https://github.com/AlexanderFabisch/gmr/blob/master/gmr/gmm.py#L612
This is probably a very basic mistake. But I can't seem to run the example (after installing with pip).
I get the error:
File "/Users/Harald/anaconda/lib/python3.4/site-packages/gmr/__init__.py", line 1, in <module>
from mvn import MVN, plot_error_ellipse
ImportError: No module named 'mvn'
I tried cloning the complete repository and run the examples from the source but that didn't help either.
Hi Alex,
I want to solve a multivariate regression problem using a mixture density network. I need a reference to an open-source, downloadable dataset.
Thanks in advance