polsys / ennemi

Easy Nearest Neighbor Estimation of Mutual Information

Home Page: https://polsys.github.io/ennemi/

License: MIT License
The MI estimation code queries the KD tree like this:

```python
x_grid.query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True)
```

Here the `eps - 1e-12` tweak is necessary because `cKDTree` returns points with distance less than *or equal to* the radius parameter, while the algorithm expects a strictly smaller distance. The tweak does the job well enough, but I'm concerned that it might be brittle in some (extreme) cases.
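For context, a minimal runnable sketch of the same call; the data and the `eps` value here are made up for illustration:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
tree = cKDTree(x)
eps = 0.5

# query_ball_point counts points with distance <= r (Chebyshev metric here),
# so the radius is shrunk slightly to approximate a strict "< eps" count.
counts = tree.query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True)
```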
There are practically four options; one of them would be changing `cKDTree` to allow strict inequality. That is not practical, because we would need to wait ~2 years before requiring that SciPy version.

I would prefer Option 1 (keeping the current tweak), unless any issues crop up. Filing this issue so that the question is documented; I'll revisit the decision later, and then update the code accordingly.
Given that many researchers use MATLAB for data analysis, we could have a guide for using `ennemi` with MATLAB. Basically:

- Exporting data from MATLAB and loading it with `pandas`
- Running `ennemi`
- Saving the results and loading them in MATLAB

A sketch of this round trip is shown below. A stretch goal could be to publish a MATLAB package for these tasks, but surfacing all features might become too complicated.
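A minimal sketch of the round trip, assuming the data was exported from MATLAB as a CSV file; the file and column names are made up:

```python
import pandas as pd
from ennemi import estimate_mi

# Load data exported from MATLAB, e.g. with writetable().
data = pd.read_csv("measurements.csv")

# Estimate MI between y and two covariates at a few lags.
mi = estimate_mi(data["y"], data[["x1", "x2"]], lag=[0, 1, 2])

# Save the result matrix; MATLAB can read it back with readtable().
pd.DataFrame(mi).to_csv("mi_results.csv")
```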
Consider the case where there are several variables and lags. At some lag value, there are no observations of a variable (e.g. because the mask overlaps with the measurement interval somehow).

Expected behavior: I would expect `ennemi` to return `NaN` for that single variable/lag combination.

Actual behavior: the whole estimation fails due to the `N < k` validation check.
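A hypothetical sketch of the kind of call that hits this; the data and mask are invented to make the point:

```python
import numpy as np
from ennemi import estimate_mi

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)

# A mask that leaves fewer valid observations than k.
mask = np.zeros(100, dtype=bool)
mask[:3] = True

# Instead of returning NaN for the offending variable/lag combination,
# the whole call fails the N < k validation check.
estimate_mi(y, x, lag=[0, 1], mask=mask, k=3)
```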
Add a "real-world" example of using the package. It should go through all the steps of basic data analysis. Preferably this would use real data, but simulated data is fine in the first version.
As described in doi:10.1016/S0167-2789(02)00432-3, transfer entropy is a measure of dependency that incorporates the direction of information flow. Crucially, it can be expressed as conditional MI with a multidimensional second variable, as the identity below shows. Therefore it should be relatively easy to implement. I haven't looked too deeply into the research, and there may still be some problems with accuracy and interpretation.
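For reference, the identity in question, written with a single history term for simplicity (the general form conditions on longer histories):

```latex
T_{X \to Y} = I(Y_t ; X_{t-1} \mid Y_{t-1})
```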
I don't see this feature as necessary for 1.0, where the focus is on using MI for correlation detection. However, I'm very open to its inclusion in the future. If you think this feature would be useful, please comment on this issue. Related: #31.
There is currently no way of seeing the progress of a large, parallel estimation task. Add a callback parameter to `estimate_mi` and others. The `multiprocessing` library has support for this, although we must then do ourselves some of the work currently done by the parallel `map`.
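A sketch of one way to wire this up, assuming a hypothetical `callback` parameter that receives the completed and total task counts; `_estimate_single` is a stand-in for one estimation task:

```python
from multiprocessing import Pool

def _estimate_single(task):
    # Stand-in for a single MI estimation task.
    return sum(task)

def estimate_with_progress(tasks, callback=None):
    results = []
    with Pool() as pool:
        # Unlike map, imap yields results as they finish, which gives us
        # a natural place to report progress between tasks.
        for i, result in enumerate(pool.imap(_estimate_single, tasks), start=1):
            if callback is not None:
                callback(i, len(tasks))
            results.append(result)
    return results
```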
This is a meta-issue tracking the things that should be done before the release.

- Mark the status as `Beta` in PyPI metadata
- Rename the `master` branch to `main`, fix references in CI, Sonar and docs
- Add a `py.typed` file so that our annotations are visible to users (https://mypy.readthedocs.io/en/stable/installed_packages.html)
- Replace `#type:ignore` comments with a mypy configuration file
- Mention @morrme in the release notes

Draft release notes: https://gist.github.com/polsys/f0757723b73194a0bccb9043e7c75e47
Originally, conditioning variables were always continuous. #87 will add support for discrete-discrete variables and discrete conditions. However, this leaves many cases unimplemented:
In table form:
| | No condition | Discrete | Continuous | Mixed |
|---|---|---|---|---|
| Discrete-discrete | 🚧 | 🚧 | ❌ | ❌ |
| Discrete-continuous | ✔️ | ❌ | ✔️ | ❌ |
| Continuous-continuous | ✔️ | ❌ | ✔️ | ❌ |
While it would be nicest to support everything, this leads to a combinatorial explosion in the algorithms. Moreover, the user interface gets more and more complicated.

Therefore most of the cases will remain unimplemented until there is sufficient demand.
This is a meta-issue tracking the things that should be done before the release.

- Use the `master` -> `main` migration tool of GitHub (need to update references in CI and docs as well)

Draft release notes: https://gist.github.com/polsys/f0757723b73194a0bccb9043e7c75e47
Ross (DOI: 10.1371/journal.pone.0087357) has described a variant of the algorithm where one of the variables is discrete. The derivation only covers the unconditional case, but conditioning should be straightforward to add.

This could be surfaced as a `discrete_y` parameter on `estimate_mi`, assuming that this is the typical case of interest.
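A usage sketch of the proposal; nothing here is implemented yet, and the name `discrete_y` is only this issue's suggestion:

```python
import numpy as np
from ennemi import estimate_mi

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)          # discrete y: class labels
measurements = rng.normal(size=500) + labels   # continuous x

# Proposed: discrete_y=True would select the Ross estimator.
mi = estimate_mi(labels, measurements, discrete_y=True)
```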
The `pandas` support should be completely optional; currently we don't verify this automatically. Execute the integration tests in two phases:

- First run the tests without `pandas` installed,
- then install `pandas` and run the rest.

Try and see how much faster Numba makes the estimation code. I think there is a lot of potential, since the algorithm does a lot of tight loops in Python code.
It would be good to make the integration optional for those who do not have Numba installed. This would mean separate CI test legs for the compiled and interpreted versions. A sketch of the kind of loop that should benefit is below.
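A sketch of a tight counting loop under Numba's `@njit` nopython mode; the function is illustrative, not taken from the codebase:

```python
import numpy as np
from numba import njit

@njit
def count_within(distances, eps):
    # Count points strictly closer than eps. In pure Python this loop is
    # slow; in nopython mode it compiles to a simple machine-code loop.
    count = 0
    for d in distances:
        if d < eps:
            count += 1
    return count

# The first call triggers JIT compilation; subsequent calls are fast.
print(count_within(np.random.rand(10_000), 0.5))
```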
Following #23, it could make sense to create a command-line interface for common use cases. This could include e.g. a progress bar and a time estimate. I anticipate users creating these kinds of methods by themselves; at least I have a script like that of my own.

This could be released as an example script or a separate package. That way we would avoid a dependency on a CLI package.
Implement differential entropy estimation using the nearest-neighbor algorithm. I don't think this is strictly necessary/useful, but it's good to include for completeness. I haven't seen too many packages doing nearest-neighbor estimation of information theoretic measures.
Conditional entropy should also be available, with arbitrarily many conditioning variables. The method signature would be like `estimate_entropy(variables, k=3, cond=None)`. If multiple variables are given, they would be estimated separately. Pandas data types should be supported as usual.
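A usage sketch of the proposed signature; not implemented at the time of writing, and the behavior shown simply follows the description above:

```python
import numpy as np
from ennemi import estimate_entropy  # proposed API

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))

# Each of the three columns would get its own entropy estimate.
h = estimate_entropy(data, k=3)

# Conditional entropy via the cond parameter.
h_cond = estimate_entropy(data, k=3, cond=rng.normal(size=1000))
```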
NumPy is adding type annotations in 1.20 (due December?). We can then stop ignoring it in `mypy` runs. It looks like SciPy 1.6.0 (in December as well?) may support the annotations too. It appears that pandas is not shipping its annotations yet either.
Make the `cond_lag` parameter effectively two-dimensional, or rather have one dimension more than `lag`.
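A hypothetical shape sketch of what this could mean; the semantics here are only this issue's proposal:

```python
import numpy as np

lag = [0, 1, 2]  # three lags for x

# One row per lag value, one column per conditioning variable:
# cond_lag would have one dimension more than lag.
cond_lag = np.array([
    [0, 1],
    [1, 2],
    [2, 3],
])  # shape (len(lag), number of conditioning variables)
```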
Not something I'm currently planning, but could be an interesting, small research project.
High values of `k` provide good accuracy for low MI values, and vice versa. Based on an estimate done with the default value (e.g. `k=3`), approximate the best value to use and rerun the estimation. Because the effect of `k` is not terribly large, I think the second estimate could be more accurate.

- Tabulate the estimation error as a function of `k`, `n`, and correlation for some distributions (perhaps not only Gaussian).
- Limit `k` to `1 <= k <= 20` or similar so that the execution time does not blow up.

I assume that this would not be beneficial in all situations, and anyway it changes the results and execution time dramatically. Therefore the behavior should be opt-in, e.g. by `k="auto"`. A two-pass sketch is below.
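A sketch of the two-pass idea, with a made-up heuristic standing in for the simulated lookup table:

```python
from ennemi import estimate_mi

def _choose_k(rough_mi):
    # Placeholder heuristic: use a larger k when the rough estimate
    # suggests low MI, capped at 20 so the execution time does not blow
    # up. The real mapping would come from simulations over k, n, and
    # correlation.
    return max(1, min(20, round(10.0 / (1.0 + abs(rough_mi)))))

def estimate_mi_auto(y, x):
    rough = float(estimate_mi(y, x, k=3))  # first pass with the default k
    return estimate_mi(y, x, k=_choose_k(rough))
```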
Extend the `parallel` parameter to take the number of processes.

This case is probably useful; it would be good to "advertise" it more. Currently it is only included in the API docs. The tutorial may need some refactoring; a Table of Contents at minimum.
I'm especially interested in seeing if the unconditional MI case can be improved.
Use the existing `pip` functionality to simplify the CI script and the development install instructions. It should be possible to install all development-time dependencies with `pip install -e .[dev]`.
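This only needs an extras entry in the packaging metadata; a sketch with an illustrative dependency list (the actual tools may differ):

```python
# setup.py (sketch)
from setuptools import setup

setup(
    name="ennemi",
    extras_require={
        # Installed by `pip install -e .[dev]` on top of the runtime deps.
        "dev": ["pytest", "mypy", "pandas"],
    },
)
```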
Python 3.9 will be released in October. There will probably be associated NumPy/SciPy updates, and if we support Numba, certainly that too.

We need to do these things:
Part of #25 meta-issue.
This version might remove the need for special annotations on classes. Hopefully it would also enable JITting more of the code; some of the remaining issues seem to be bugs in Numba. To make fixes more likely, we should report any minimal reproducers. Additionally, support for the `axis` argument of `ndarray.max` would make more of the code work in `nopython` mode.
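For illustration, the limitation and its current workaround; this snippet is not from the codebase, it just demonstrates that the `axis` argument does not compile in nopython mode as of the Numba version discussed:

```python
import numpy as np
from numba import njit

@njit
def row_max(a):
    # a.max(axis=1) would fail to compile in nopython mode, so the
    # reduction has to be written out row by row instead.
    out = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        out[i] = a[i].max()
    return out

print(row_max(np.arange(6.0).reshape(2, 3)))
```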
Part of #25 meta-issue.
I want to trial Numba in production, and hence would like to include it in Alpha 2. The library does not seem complete/stable enough for the conditional MI code path, but the unconditional MI path works and has significant performance benefits.

- Add an `extras_require` entry for Numba
- Test with `NUMBA_DISABLE_JIT` and with a clean virtual environment

Test with the oldest supported Python, the oldest supported versions of libraries, and the oldest OS available on CI. This could be an opportunity to retry simplifying the matrix definition (inclusions instead of exclusions).
It would also be nice to have tests without SciPy and friends installed, although that might complicate the test code too much compared to the benefit.
Verify that no `TODO` comments remain in the code. The comments should represent work that needs to be done before 1.0.0. If any such comments remain, they should include a reference to a tracking issue.
This is a meta-issue tracking the things that should be done before the release.

- Mark the status as `Stable` in PyPI metadata
- Use the `master` -> `main` migration tool of GitHub (need to update references in CI and docs as well)
- Mention `Watch releases` in the README

Draft release notes: https://gist.github.com/polsys/f0757723b73194a0bccb9043e7c75e47
Write down (in the README, maybe) the plan for support. Users will then know what level of support to expect. Things I have in mind:
Related to #49. This also includes updating the README when I am no longer at INAR.
It could be useful to specify a lag for the conditioning variable in `pairwise_mi()`. This would enable some useful cases (conditioning on earlier observations) that now must be done manually. The lag should be a scalar in this case.
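A hypothetical usage sketch; the `cond_lag` parameter on `pairwise_mi` is only proposed here:

```python
import numpy as np
from ennemi import pairwise_mi

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 4))
temperature = rng.normal(size=500)

# Condition every pairwise estimate on the previous observation of
# the conditioning variable.
result = pairwise_mi(data, cond=temperature, cond_lag=1)
```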
This would be for completeness: no need to use another library for this use case. Performance shouldn't be a first priority here; it can be improved later.

- Add a `discrete_x` parameter to `estimate_mi`
- Add a `discrete` parameter (an array) to `pairwise_mi`

Related: #75 (maybe add another section to docs).
Describe some basic things about contributing:
It should be possible to override the `estimate_mi` behavior of estimating each `x` variable separately. Regression models usually include more than one covariate, and this would allow the MI estimation of such models. The n-dimensional estimation infrastructure should make the implementation straightforward.

How should this be represented in the interface?

- A `multidim_x` parameter. This would change the interpretation of the `x` variable. A bit ugly, and it would not support several multidimensional variables, but easy to implement.
- Making the `x` parameter three-dimensional. This would be the most flexible but awfully complex to use. It might also be ambiguous, and I've already fought enough with `numpy` on 1D/2D interpretation.

On the other hand, the interpretation of MI becomes more difficult in this case: it is no longer analogous with Pearson correlation.
The estimation method should be extensible to MI between arbitrarily many variables.
Because the interpretation of multivariate MI is hard, I do not see this as a useful feature. Therefore this is currently not planned for inclusion. If you think this feature would be useful, please comment on this issue.
The basic case is really fast and needs no parallelism, but large-dimensional and conditional cases could benefit from parallelization. Consider extending the parallelism heuristic with a "performance score" tailored for each of these cases. Related to #22.
List the error messages of the package and how to fix them.
Add an `estimate_pairwise_mi` method that takes in an array of variables and outputs an MI matrix such that `result[i,j] = estimate_mi(vars[i], vars[j])`. The user can already implement this method using `estimate_mi`, but our implementation would offer stronger parallelization.
The method should support the same `k` and conditioning parameters as `estimate_mi`. A reference implementation of the semantics is sketched below.
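The proposed method would be equivalent to this plain loop, just with stronger parallelization inside the library; the names are illustrative:

```python
import numpy as np
from ennemi import estimate_mi

def estimate_pairwise_mi_reference(variables):
    # Reference semantics: result[i, j] = MI between variables i and j.
    n = len(variables)
    result = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            result[i, j] = estimate_mi(variables[i], variables[j])
    return result
```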
Implement these two steps that are always done before analysis:

- Rescale the variables to a comparable scale
- Add low-amplitude noise to break ties between equal values
This should be opt-outable by a parameter (`preprocess=True`), and implemented in both public MI methods. For `estimate_entropy`, only the noise step must be included. For reproducibility, a fixed random seed should be used.
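A minimal sketch of the two steps under those assumptions; the scale of the noise and the seed are illustrative:

```python
import numpy as np

def preprocess(x, seed=2020):
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    x = np.asarray(x, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0)          # rescale to unit variance
    return x + rng.normal(scale=1e-10, size=x.shape)  # low-amplitude noise
```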
This is not planned for 1.0, but recorded here for completeness. If somebody wants to try this and can demonstrate it to be beneficial, I'm open to including it in the package, preferably as an optional component.
The current implementation runs on any platform supported by NumPy without any need for compilation. In Python code, it is much harder to violate memory safety or hit platform/compiler-specific issues. Because of this, I see #25 as the primary way of speeding up the code.
Allow all masks that can be broadcast to the shape of `x`. This would enable specifying a separate mask for each `x` variable, or even some more advanced scenarios.

For `estimate_entropy` with `multidim=True`, the masks should be combined, or two-dimensional masks should be disallowed.
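A sketch of what a broadcastable mask could look like; the data is invented and the semantics simply follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))

# One mask column per x variable: exclude the first ten observations of
# the first variable only, keeping the other variables untouched.
mask = np.ones((100, 3), dtype=bool)
mask[:10, 0] = False
```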
Related to #59. Should be done in 1.1, as NumPy is also dropping 3.6 support by then. Not going to drop support in 1.0, because at least Ubuntu 18.04 LTS uses 3.6 as its default interpreter. This should also include a bump of the minimum supported NumPy/SciPy versions.

With `from __future__ import annotations`, we should get a performance benefit and simplify the annotations (will still need to check whether e.g. `list` becomes a synonym for `typing.List` already in 3.7).
`list` is a synonym, but apparently `mypy` does not recognize it until 3.9; keeping the current style.

Go through all method documentation strings and make sure they are consistent and clear. It looks like I have sometimes forgotten to update them as I have made code changes.
- `k=N`.
- `k<N`, I don't want to introduce special code for this test only.
- The "`k` after mask" test does not take lags into account. (#38)

`estimate_mi` and `pairwise_mi` return `DataFrame`s when passed ones. Investigate how to annotate this case. Note that this might need a hard dependency on `pandas`. (Related to #48.)
This would let us remove some manual annotations in the `pandas` tests, and in any annotated user code (which I'm skeptical about).
Originally posted by @polsys in #91 (comment)
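One way the annotation could look, assuming `typing.overload` and a hard dependency on `pandas` for the stubs; a sketch, not the library's actual code:

```python
from typing import overload
import numpy as np
import pandas as pd

@overload
def estimate_mi(y: pd.DataFrame, x: pd.DataFrame) -> pd.DataFrame: ...
@overload
def estimate_mi(y: np.ndarray, x: np.ndarray) -> np.ndarray: ...

def estimate_mi(y, x):
    # The actual implementation dispatches on the input type.
    ...
```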
Make sure that all user-facing methods work well with `numpy.ma.masked_array` types. Masked observations should be excluded from the estimations. The existing masking infrastructure would keep working alongside this.
As the conditional entropy estimation uses the chain rule `H(X|Y) = H(X,Y) - H(Y)`, it is possible that the errors in the two entropy estimates do not cancel out. Check the derivation of conditional entropy, based on Kraskov et al., for this.

If there is little room for improvement, the current algorithm may be good performance-wise. The marginal distances for `Y` need to be computed only once. On the other hand, this is less expensive than the computation of the joint entropies.
There should be a few larger tests documenting and verifying real-world use cases. These should not be counted in code coverage. Depending on their run time, it might make sense to exclude these from PR runs.
This is a good practice.
Add an integration test that uses discrete data in conjunction with `pandas`. There are currently no unit tests for this case. Toying with the feature might still uncover some bugs too 🙂
Not necessarily generated from docstrings, but something like that.