
ennemi's Introduction

ennemi

(Easy Nearest Neighbor Estimation of Mutual Information)


This Python 3 package provides simple methods for estimating mutual information of continuous and discrete variables. With one method call, you can estimate several variable pairs and time lags at once. Both unconditional and conditional MI (including multivariate condition) are supported. The interface is aimed at practical, non-linear correlation analysis.

The package uses the nearest neighbor methods described in:

The latest source code on GitHub might be less stable than the released version on PyPI. You can see the roadmap of planned additions on the Milestones page. See also the support statement. You can also follow the development by clicking Watch releases on the GitHub page.

Getting started

This package requires Python 3.8 or higher, and it is tested to work on the latest versions of Ubuntu, macOS and Windows. The only hard dependencies are reasonably recent versions of NumPy and SciPy; Pandas is strongly suggested for more enjoyable data analysis.

This package is available on PyPI:

pip install ennemi

(If your machine has multiple Python installations, you may need to run e.g. pip3.)

For documentation, please see https://polsys.github.io/ennemi.
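
A minimal usage sketch (the data here is synthetic and the call follows the basic estimate_mi interface; see the documentation above for the full parameter list):

    import numpy as np
    from ennemi import estimate_mi

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = x + rng.normal(scale=0.5, size=1000)

    # Estimate the mutual information between y and x
    print(estimate_mi(y, x))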

Building

In addition to the runtime dependencies, the tests require pandas, while pytest and mypy are required for building the project. All of these can be installed with pip's "extras" syntax.

To install the package in development mode, clone this repository and execute

pip install -e .[dev]

in the repository root folder.

All methods, including tests, are type annotated and checked with mypy. The CI script runs the check automatically on each pushed commit. To run the check yourself, execute

python -m mypy ennemi/ tests/unit tests/integration tests/pandas

in the repository root (configuration is stored in the mypy.ini file).

Please also see the contribution guidelines.

Citing

This work is licensed under the MIT license. In short, you are allowed to use, modify and distribute it as you wish, as long as the original copyright notice is preserved. This software is provided "as is", functioning to the extent of passing the unit and integration tests in the tests directory.

This package is archived on Zenodo. The DOI 10.5281/zenodo.3834018 always resolves to the latest version of ennemi. For reproducibility, you should cite the exact version of the package you have used. To do so, use the DOI given on the Releases page or on Zenodo.

If you want to cite an article (although you should still mention the version number you used), the reference is:

@article{ennemi,
  title = {ennemi: Non-linear correlation detection with mutual information},
  journal = {SoftwareX},
  volume = {14},
  pages = {100686},
  year = {2021},
  doi = {10.1016/j.softx.2021.100686},
  author = {Petri Laarne and Martha A. Zaidan and Tuomo Nieminen}
}

This package is maintained by Petri Laarne, and was initially developed at the Institute for Atmospheric and Earth System Research (INAR), University of Helsinki. Please feel free to contact me at [email protected] with any questions or suggestions related to this project.

ennemi's People

Contributors

morrme, polsys


ennemi's Issues

Support discrete-discrete MI

This would be for completeness: no need to use another library for this use-case. Performance shouldn't be a first priority here; it can be improved later.

  • Implement naive algorithm for discrete-discrete MI
  • Implement naive algorithm for conditional d-d MI
  • Implement discrete entropy estimation (with integration tests)
  • Add discrete_x parameter to estimate_mi
  • Add discrete parameter (array) to pairwise_mi
  • Validate that lags/normalization etc. still work
  • Extend warnings for continuous data to new methods
  • Add some simple benchmark
  • Update feature list in docs

Related: #75 (maybe add another section to docs).

Discrete (or mixed) conditions

Originally, conditioning variables were always continuous. #87 will add support for discrete-discrete variables and discrete conditions. However, this leaves many cases unimplemented:

  • Continuous variables, discrete condition (interpreting condition as continuous might work, though)
  • Discrete variables, continuous condition (shouldn't be too hard to implement)
  • Mixed discrete-continuous condition (how hard, and how necessary?)

In table form:

                        No condition   Discrete   Continuous   Mixed
Discrete-discrete       🆕             🆕         ❌           ❌
Discrete-continuous     ✔️             ❌         ✔️           ❌
Continuous-continuous   ✔️             ❌         ✔️           ❌

While it would be nicest to support everything, this leads to a combinatorial explosion in the algorithms. Moreover, the user interface gets more and more complicated.

Therefore, most of these cases will remain unimplemented until there is sufficient demand.

Add integration test cases

There should be a few larger tests documenting and verifying real-world use cases. These should not be counted in code coverage. Depending on their run time, it might make sense to exclude these from PR runs.

  • A data analysis case with data loaded from the disk
  • Coupled Lorenz systems as in the Frenzel & Pompe paper (DOI: 10.1103/PhysRevLett.99.204101)
  • Truncated distribution, with numerically evaluated MI.

Add more unit test cases

  • To get the code coverage to 100%, there must be a test of conditional MI with k=N.
    • The driver requires k < N; I don't want to introduce special code for this test only.
  • There is not yet a driver test of conditional MI with multidimensional condition. (#42)
  • More distributions with analytically known MI, see Darbellay & Vajda (DOI: 10.1109/18.825848) and Kraskov et al. (DOI: 10.1103/PhysRevE.69.066138) (#42)
  • The "too large k after mask" test does not take lags into account. (#38)
  • Other corner cases with confusing error messages? (#38)

Reconsider the eps-(1e-12) tweak in algorithms

The MI estimation code queries the KD tree like this:

x_grid.query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True)

Here the eps - 1e-12 tweak is necessary because cKDTree returns points with distance less than or equal to the parameter, while the algorithm expects strictly smaller distance. This tweak does the job well enough, but I'm concerned that it might be brittle in some (extreme) cases.

There are practically four options:

  1. Ignore. Because we expect unit variance data, there should be no floating point precision loss. There would be misbehavior only at some singular points, but we support that case only implicitly.
  2. Use a better tweak: find the next smallest floating point value. Robust, but might be too technical; needs testing for edge cases.
  3. Wait for cKDTree to allow strict inequality. Not practical because we would need to wait ~2 years before requiring that SciPy version.
  4. Use Algorithm 2 of Kraskov et al., which uses non-strict inequality. Might affect results in some cases, needs verification.

I would prefer Option 1, unless any issues crop up. Filing this issue so that the question is documented; I'll revisit the decision later, and then update the code accordingly.
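
For reference, Option 2 would essentially be a one-line change (a sketch, not the current implementation):

    import numpy as np

    eps = 0.25
    # The largest double strictly smaller than eps; this would replace
    # the ad-hoc eps - 1e-12 in the query above
    strict_eps = np.nextafter(eps, -np.inf)
    assert strict_eps < eps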

Automatic selection of k

Not something I'm currently planning, but could be an interesting, small research project.

High values of k provide good accuracy for low MI values, and vice versa. Based on an estimate done with the default value (e.g. k=3), approximate the best value to use and rerun the estimation. Because the effect of k is not terribly large, I think the second estimate could be more accurate.

  1. Measure error (mean square, mean absolute?) as the function of k, n, and correlation for some distributions (perhaps not only Gaussian).
  2. Fit a function for the best parameter. The value may need to be constrained to 1 <= k <= 20 or similar so that the execution time does not blow up.
  3. Verify that this improves the results in realistic scenarios.

I assume that this would not be beneficial in all situations, and in any case it changes the results and execution time dramatically. Therefore the behavior should be opt-in, e.g. via k="auto".

Incorporate NumPy/SciPy into type checking

NumPy is adding type annotations in 1.20 (due in December?). We can then stop ignoring it in mypy runs. It looks like SciPy 1.6.0 (also due in December?) may support annotations as well. Pandas does not appear to be releasing its annotations yet either.

Go through TODO comments

Verify that no TODO comments remain in the code. The comments should represent work that needs to be done before 1.0.0. If any such comments remain, they should include a reference to a tracking issue.

Properly annotate DataFrame returns

estimate_mi and pairwise_mi return a DataFrame when passed one. Investigate how to annotate this case. Note that this might need a hard dependency on pandas. (Related to #48.)

This would let us remove some manual annotations in the pandas tests, and it would benefit any annotated user code (though I'm skeptical that such code exists).

Originally posted by @polsys in #91 (comment)
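
One possible shape for this, sketched with typing.overload (the parameter lists are simplified and not the actual signature):

    from typing import overload
    import numpy as np
    import pandas as pd

    @overload
    def estimate_mi(y: pd.DataFrame, x: pd.DataFrame) -> pd.DataFrame: ...
    @overload
    def estimate_mi(y: np.ndarray, x: np.ndarray) -> np.ndarray: ...

    def estimate_mi(y, x):
        # Single implementation; mypy checks callers against the overloads above
        ...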

1.1.0 development cycle

This is a meta-issue tracking the things that should be done before the release.

At the start of development cycle

  • Update the version number
  • Set a target date for the release
  • Track the master->main migration tool of GitHub (need to update references in CI and docs as well)

Release

  • Review and fix Sonar issues
  • Merge the documentation branch
  • Publish a release, verify that PyPI push succeeds
  • Update the release notes with the Zenodo DOI
  • Check the Zenodo metadata, update description if necessary
  • Set new baseline for Sonar

Draft release notes: https://gist.github.com/polsys/f0757723b73194a0bccb9043e7c75e47

Add a CI leg with minimum requirements

Oldest supported Python, oldest supported versions of libraries, oldest OS available on CI. This could be an opportunity to retry simplifying the matrix definition (inclusions instead of exclusions).

It would also be nice to have tests without SciPy and friends installed, although that might complicate the test code too much compared to the benefit.

Add pandas test for discrete data

Add an integration test that uses discrete data in conjunction with pandas. There are currently no unit tests for this case. Toying with the feature might still uncover some bugs too 😇
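
A rough sketch of such a test (the discrete_y parameter name follows the proposal in the discrete-continuous issue below; the data is made up):

    import numpy as np
    import pandas as pd
    from ennemi import estimate_mi

    def test_discrete_y_with_pandas():
        rng = np.random.default_rng(0)
        x = rng.normal(size=500)
        data = pd.DataFrame({"x": x, "y": (x > 0).astype(int)})

        # The discrete y column should share information with the continuous x
        mi = estimate_mi(data["y"], data["x"], discrete_y=True)
        assert np.all(np.asarray(mi) > 0.1)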

Improve the parallelism heuristics

  • Test the scenario with many small estimation tasks; does the batching work?
  • Extend the parallel parameter to take the number of processes
  • Do more measurements of multiprocessing overhead, preferably with both laptops and desktops, and Windows too (where process overhead is usually larger)

Describe plan for future support

Write down (in the README, maybe) the plan for support. The users then know what level of support to expect. Things I have in mind:

  • Once feature complete, the development activity will slow down
    • New features when needed, provided that someone will implement them
  • Bug fixes will be done; what is the timespan for releases?
  • New Python/NumPy/SciPy/OS versions will be tested and supported
    • Need to think about documenting the supported versions
  • Support for old Python/NumPy/SciPy versions will be dropped on some schedule
    • Probably best to follow what NumPy does
    • Support should be dropped in minor releases; those should include other changes too

Related to #49. This also includes updating the README when I am no longer at INAR.

Support multidimensional covariates

It should be possible to override the estimate_mi behavior of estimating each x variable separately. Regression models usually include more than one covariate, and this would allow the MI estimation of such models. The n-dimensional estimation infrastructure should make the implementation straightforward.

How should this be represented in the interface?

  • Add a multidim_x parameter. This would change the interpretation of the x variable. A bit ugly and would not support several multidimensional variables, but easy to implement.
  • Make the x parameter three-dimensional. This would be the most flexible but awfully complex to use. This might also be ambiguous, and I've already fought enough with numpy on 1D/2D interpretation.
  • Create two separate methods. Clear but leads to duplication.
  • Change the interface altogether. Something like a builder, where 1D/2D covariates would be added one by one or in batches. Probably un-Pythonically complex for the basic use case.

On the other hand, the interpretation of MI becomes more difficult in this case. It is no longer analogous to Pearson correlation.

Consider support for numpy masked arrays

Make sure that all user-facing methods work well with numpy.ma.masked_array types. Masked observations should be excluded from the estimations. The existing masking infrastructure would work in addition to this.

Use extras_require for development dependencies

Use the existing pip functionality to simplify the CI script and development install instructions. It should be possible to install all development-time dependencies with pip install -e .[dev].
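
In setup.py this could look roughly as follows (a sketch only; dependency names taken from the Building section above, version pins omitted):

    from setuptools import setup

    setup(
        name="ennemi",
        install_requires=["numpy", "scipy"],
        extras_require={
            # Everything needed for tests and type checking
            "dev": ["pandas", "pytest", "mypy"],
        },
    )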

Check if conditional entropy has estimation bias

As the conditional entropy estimation uses the chain rule H(X|Y) = H(X,Y) - H(Y), it is possible that the errors in the two entropy estimates do not cancel out. Check the derivation of conditional entropy, based on Kraskov et al., for this.

If there is little room for improvement, the current algorithm may be good performance-wise: the marginal distances for Y need to be computed only once, and this is less expensive than the computation of joint entropies.

Parallelize the entropy estimation

The basic case is really fast and needs no parallelism, but large-dimensional and conditional cases could benefit from parallelization. Consider extending the parallelism heuristic with a "performance score" tailored for each of these cases. Related to #22.

Drop support for Python 3.6

Related to #59. Should be done in 1.1 as NumPy is also dropping 3.6 support by then. Not going to drop support in 1.0 because at least Ubuntu 18.04 LTS uses 3.6 as its default interpreter. This should also include a bump of the minimum supported NumPy/SciPy versions.

  • By using from __future__ import annotations, we should get a performance benefit and simplify the annotations (will still need to check if e.g. list becomes a synonym for typing.List already in 3.7)
  • Other changes we can take advantage of?

Support two-dimensional masks

Allow all masks that can be broadcast to the shape of x. This would enable specifying a separate mask for each x variable, or even some more advanced scenarios.

For estimate_entropy with multidim=True, the masks should either be combined or two-dimensional masks should be disallowed.

Test the package without pandas

The pandas support should be completely optional; currently we don't verify this automatically. Execute the integration tests in two phases:

  • First, the applicable tests without pandas installed,
  • Then install (oldest supported) pandas and run the rest.

1.0.0 development cycle

This is a meta-issue tracking the things that should be done before the release.

At the start of development cycle

  • Update the version number, set development status to Stable in PyPI metadata
  • Set a target date for the release
  • Track the master->main migration tool of GitHub (need to update references in CI and docs as well)

Release

  • Address any review comments from the journal
  • Review and fix remaining Sonar issues
  • Mention the stable status in docs (index, README, DESCRIPTION)
  • Link to release notes / Watch releases in README
  • Copy-edit the README more
  • Merge the documentation branch, if applicable
  • Publish a release, verify that PyPI push succeeds
  • Update the release notes with the Zenodo DOI
  • Check the Zenodo metadata, update description if necessary
  • Mention funding in Zenodo metadata?
  • Set new baseline for Sonar

Draft release notes: https://gist.github.com/polsys/f0757723b73194a0bccb9043e7c75e47

Add progress callbacks

There is currently no way of seeing the progress of a large, parallel estimation task. Add a callback parameter to estimate_mi and the other methods. The multiprocessing library has support for this, although we would then have to take over some of the work currently done by the parallel map.
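
A minimal sketch of the mechanism (illustrative only; the real task objects and callback signature would differ):

    from multiprocessing import Pool

    def estimate_one(task):
        ...  # placeholder for one variable/lag estimation

    def run_all(tasks, callback=None):
        results = []
        with Pool() as pool:
            # imap_unordered yields results as they finish, so progress can be
            # reported per task; the real code would have to track result order
            for i, result in enumerate(pool.imap_unordered(estimate_one, tasks), start=1):
                results.append(result)
                if callback is not None:
                    callback(i, len(tasks))
        return results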

Implement initial support for Numba

Part of #25 meta-issue.

I want to trial Numba in production, and hence would like to include it in Alpha 2. The library does not yet seem complete/stable enough for the conditional MI code path, but the unconditional MI path works and shows significant performance benefits.

  • Implement Numba acceleration for the unconditional MI code path
  • Try to see which parts of the conditional MI code can be JIT-compiled
  • Make the acceleration optional; only enable it if Numba is installed in the current environment (see the sketch after this list)
  • Add an extras_require entry for Numba
  • Run the CI both with and without Numba installed (note that coverage is not yet supported)
  • Add instructions for troubleshooting, NUMBA_DISABLE_JIT and using a clean virtual environment
  • Determine the minimum version of Numba to support
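
The optional-acceleration item could use the usual pattern of a no-op fallback decorator (a sketch; the decorated function here is made up):

    try:
        from numba import njit
    except ImportError:
        # Numba not installed: replace njit with a decorator that does nothing
        def njit(*args, **kwargs):
            if len(args) == 1 and callable(args[0]):
                return args[0]
            return lambda func: func

    @njit(cache=True)
    def _count_closer_than(distances, eps):
        count = 0
        for d in distances:
            if d < eps:
                count += 1
        return count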

Add discrete-continuous case to tutorial

This case is probably useful; it would be good to "advertise" it more. Currently it is only included in the API docs. The tutorial may need some refactoring, at minimum a table of contents.

Implement the algorithm in a compiled language

This is not planned for 1.0, but recorded here for completeness. If somebody wants to try this and can demonstrate it to be beneficial, I'm open to including it in the package, preferably as an optional component.

The current implementation runs on any platform supported by NumPy without any need for compilation. In Python code, it is much harder to violate memory safety or hit platform/compiler-specific issues. Because of this, I see #25 as the primary way of speeding up the code.

Test with Python 3.9

Will be released in October. There will probably be associated NumPy/SciPy updates, and if we support Numba, certainly that too.

We need to do these things:

  • Add Python 3.9 to the CI script
  • Consider the timeline for dropping Python 3.6 (see what NumPy does)

Implement transfer entropy estimation

As described in doi:10.1016/S0167-2789(02)00432-3, transfer entropy is a measure of dependency that incorporates the direction of information flow. Crucially, it can be expressed as conditional MI with a multidimensional second variable. Therefore it should be relatively easy to implement. I haven't looked too deeply into the research, and there may still be some problems with accuracy and interpretation.

I don't see this feature as necessary for 1.0, where the focus is on using MI for correlation detection. However, I'm very open to its inclusion in the future. If you think this feature would be useful, please comment on this issue. Related: #31.
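
For reference, in the simplest first-order case the relation mentioned above reads roughly (see the paper cited above for the general form with longer histories):

    TE(X -> Y) = I(Y_t ; X_{t-1} | Y_{t-1})

so the estimator only needs conditional MI with lagged, and in general multidimensional, arguments.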

Allow completely missing data when drop_nan=True

Consider the case where there are several variables and lags. At some lag value, there are no observations of a variable (e.g. because the mask overlaps with the measurement interval somehow).

Expected behavior: I would expect ennemi to return NaN for that single variable/lag combination.

Actual behavior: the whole estimation fails due to the N < k validation check.

Add automatic data preprocessing

Implement these two steps that are always done before analysis:

  • Scale variables to zero mean and unit variance
  • Add low-amplitude noise

This should be possible to opt out of via a parameter (preprocess=True), and it should be implemented in both public MI methods. For estimate_entropy, only the noise step is needed. For reproducibility, a fixed random seed should be used.
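
The two steps would amount to roughly the following (a sketch with an arbitrary noise amplitude; the actual defaults and parameter names are to be decided):

    import numpy as np

    def preprocess(x, noise_scale=1e-10, seed=0):
        x = np.asarray(x, dtype=float)
        # Scale to zero mean and unit variance
        x = (x - np.mean(x)) / np.std(x)
        # Add low-amplitude noise, with a fixed seed for reproducibility
        rng = np.random.default_rng(seed)
        return x + rng.normal(scale=noise_scale, size=x.shape)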

Add utility method for creating pairwise MI matrices

Add an estimate_pairwise_mi method that takes in an array of variables and outputs an MI matrix such that result[i,j] = estimate_mi(vars[i], vars[j]). The user can already implement this using estimate_mi, but our implementation would offer stronger parallelization.

  • Should probably return a symmetric matrix while the calculation produces a triangular matrix.
  • Should have the same parameters for k and conditioning as estimate_mi.
  • What about lags? Should it be possible to lag one axis?
  • With no lag, the diagonal could be excluded as auto-MI is usually not useful.
  • Consider an optional parameter for normalization.
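
For comparison, the hand-rolled version with estimate_mi would look roughly like this (a serial, unoptimized sketch):

    import numpy as np
    from ennemi import estimate_mi

    def pairwise_mi_matrix(variables):
        n = len(variables)
        result = np.full((n, n), np.nan)
        for i in range(n):
            for j in range(i + 1, n):
                mi = np.asarray(estimate_mi(variables[i], variables[j])).item()
                # MI is symmetric, so fill both triangles
                result[i, j] = result[j, i] = mi
        return result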

Support the discrete-continuous case

Ross (DOI: 10.1371/journal.pone.0087357) has described a variant of the algorithm where one of the variables is discrete. The derivation only contains the unconditional case, but conditioning should be straightforward to add.

This could be surfaced as a discrete_y parameter on estimate_mi, assuming that this is the typically interesting case.

Go through docstrings

Go through all method documentation strings and make sure they are consistent and clear. It looks like I have sometimes forgotten to update them as I have made code changes.

Consider support for Numba

Try Numba and see how much faster it makes the estimation code. I think there is a lot of potential, since the algorithm runs a lot of tight loops in Python code.

It would be good to make the integration optional for those who do not have Numba installed. This would mean separate CI test legs for compiled and interpreted versions.

Support multivariate mutual information

The estimation method should be extensible to MI between arbitrarily many variables.

Because the interpretation of multivariate MI is hard, I do not see this as a useful feature. Therefore this is currently not planned for inclusion. If you think this feature would be useful, please comment on this issue.

Track Numba 0.51

Part of #25 meta-issue.

This version might remove the need for special annotations on classes. Hopefully it will also enable JIT compilation of more of the code; some of the remaining issues seem to be bugs in Numba. To make fixes more likely, we should report minimal reproducers for them. Additionally, support for the axis argument of ndarray.max would allow more of the code to work in nopython mode.

Add guide for MATLAB users

Given that many researchers use MATLAB for data analysis, we could have a guide for using ennemi with MATLAB. Basically:

  • Exporting data from MATLAB and loading it with pandas
  • Exporting ennemi results and loading them in MATLAB
  • Example code for common tasks

A stretch goal could be to publish a MATLAB package for these tasks, but surfacing all features might become too complicated.

Beta 1 development cycle

This is a meta-issue tracking the things that should be done before the release.

At the start of development cycle

  • Update the version number, set development status to Beta in PyPI metadata
  • Rename master branch to main, fix references in CI, Sonar and docs
    • Apparently this change would break GitHub Pages, so deferring this for now
    • I'd still want to change this; both for inclusivity and because the "17 commits to master since this release" message is confusing to non-technical users
  • Set a target date for the release
  • Add a py.typed file so that our annotations are visible to users (https://mypy.readthedocs.io/en/stable/installed_packages.html)
  • Replace #type:ignore comments by a mypy configuration file

Release

  • Fix Sonar issues
  • Add a CI step to check the package build
  • Mention the beta status in docs (index, README, DESCRIPTION)
  • Mention "and contributors" in LICENSE and Zenodo
  • Add email address to README
  • Merge the documentation branch
  • Review docs, remove warning about beta1 being still unreleased
  • Consider simplifying the acronym to "Easy nearest neighbor [...]"
  • Mention @morrme in release notes
  • Publish a release, verify that PyPI push succeeds
  • Update the release notes with the Zenodo DOI
  • Check the Zenodo metadata, update description if necessary
  • Set new baseline for Sonar

Draft release notes: https://gist.github.com/polsys/f0757723b73194a0bccb9043e7c75e47

Add a case study to documentation

Add a "real-world" example of using the package. It should go through all the steps of basic data analysis. Preferably this would use real data, but simulated data is fine in the first version.

Add a method for progress visualization

Following #23, it could make sense to create a command-line interface for common use cases. This could include e.g. a progress bar and a time estimate. I anticipate users creating these kinds of methods by themselves.

This could be released as an example script or a separate package. That way we would avoid a dependency on a CLI package.

Add method for entropy estimation

Implement differential entropy estimation using the nearest-neighbor algorithm. I don't think this is strictly necessary/useful, but it's good to include for completeness. I haven't seen too many packages doing nearest-neighbor estimation of information theoretic measures.

Conditional entropy should also be available, with arbitrarily many conditioning variables. The method signature would be like estimate_entropy(variables, k=3, cond=None). If multiple variables are given, they would be estimated separately. Pandas data types should be supported as usual.
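
Usage would then look something like this (a sketch against the proposed signature above):

    import numpy as np
    from ennemi import estimate_entropy

    rng = np.random.default_rng(0)
    x = rng.normal(size=(1000, 2))

    # Marginal entropy of each column, and the same conditioned on another variable
    print(estimate_entropy(x))
    print(estimate_entropy(x, k=3, cond=rng.normal(size=1000)))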

Consider lag for condition in pairwise_mi()

It could be useful to specify a lag for the conditioning variable in pairwise_mi(). This would enable some useful cases (conditioning on earlier observations) that now must be done manually. The lag should be a scalar in this case.
