polsys / ennemi

Easy Nearest Neighbor Estimation of Mutual Information

Home Page: https://polsys.github.io/ennemi/

License: MIT License
The MI estimation code queries the KD tree like this:

```python
x_grid.query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True)
```

Here the `eps - 1e-12` tweak is necessary because `cKDTree` returns points with distance less than *or equal to* the radius parameter, while the algorithm expects a strictly smaller distance. The tweak does the job well enough, but I'm concerned that it might be brittle in some (extreme) cases.
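For context, a minimal runnable sketch of the same call; the data and the `eps` value here are made up for illustration:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
tree = cKDTree(x)
eps = 0.5

# query_ball_point counts points with distance <= r (Chebyshev metric here),
# so the radius is shrunk slightly to approximate a strict "< eps" count.
counts = tree.query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True)
```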
There are practically four options; one of them would be changing `cKDTree` to allow strict inequality. That is not practical, because we would need to wait ~2 years before requiring that SciPy version.

I would prefer Option 1 (keeping the current tweak), unless any issues crop up. Filing this issue so that the question is documented; I'll revisit the decision later, and then update the code accordingly.
Given that many researchers use MATLAB for data analysis, we could have a guide for using `ennemi` with MATLAB. Basically:

- Exporting data from MATLAB and loading it with `pandas`
- Running `ennemi`
- Saving the results and loading them in MATLAB

A sketch of this round trip is shown below. A stretch goal could be to publish a MATLAB package for these tasks, but surfacing all features might become too complicated.
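A minimal sketch of the round trip, assuming the data was exported from MATLAB as a CSV file; the file and column names are made up:

```python
import pandas as pd
from ennemi import estimate_mi

# Load data exported from MATLAB, e.g. with writetable().
data = pd.read_csv("measurements.csv")

# Estimate MI between y and two covariates at a few lags.
mi = estimate_mi(data["y"], data[["x1", "x2"]], lag=[0, 1, 2])

# Save the result matrix; MATLAB can read it back with readtable().
pd.DataFrame(mi).to_csv("mi_results.csv")
```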
Consider the case where there are several variables and lags. At some lag value, there are no observations of a variable (e.g. because the mask overlaps with the measurement interval somehow).

Expected behavior: I would expect `ennemi` to return `NaN` for that single variable/lag combination.

Actual behavior: the whole estimation fails due to the `N < k` validation check.
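A hypothetical sketch of the kind of call that hits this; the data and mask are invented to make the point:

```python
import numpy as np
from ennemi import estimate_mi

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)

# A mask that leaves fewer valid observations than k.
mask = np.zeros(100, dtype=bool)
mask[:3] = True

# Instead of returning NaN for the offending variable/lag combination,
# the whole call fails the N < k validation check.
estimate_mi(y, x, lag=[0, 1], mask=mask, k=3)
```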
Add a "real-world" example of using the package. It should go through all the steps of basic data analysis. Preferably this would use real data, but simulated data is fine in the first version.
As described in doi:10.1016/S0167-2789(02)00432-3, transfer entropy is a measure of dependency that incorporates the direction of information flow. Crucially, it can be expressed as conditional MI with a multidimensional second variable, as the identity below shows. Therefore it should be relatively easy to implement. I haven't looked too deeply into the research, and there may still be some problems with accuracy and interpretation.
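For reference, the identity in question, written with a single history term for simplicity (the general form conditions on longer histories):

```latex
T_{X \to Y} = I(Y_t ; X_{t-1} \mid Y_{t-1})
```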
I don't see this feature as necessary for 1.0, where the focus is on using MI for correlation detection. However, I'm very open to its inclusion in the future. If you think this feature would be useful, please comment on this issue. Related: #31.
There is currently no way of seeing the progress of a large, parallel estimation task. Add a callback parameter to `estimate_mi` and others. The `multiprocessing` library has support for this, although we must then do ourselves some of the work currently done by the parallel `map`.
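A sketch of one way to wire this up, assuming a hypothetical `callback` parameter that receives the completed and total task counts; `_estimate_single` is a stand-in for one estimation task:

```python
from multiprocessing import Pool

def _estimate_single(task):
    # Stand-in for a single MI estimation task.
    return sum(task)

def estimate_with_progress(tasks, callback=None):
    results = []
    with Pool() as pool:
        # Unlike map, imap yields results as they finish, which gives us
        # a natural place to report progress between tasks.
        for i, result in enumerate(pool.imap(_estimate_single, tasks), start=1):
            if callback is not None:
                callback(i, len(tasks))
            results.append(result)
    return results
```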
This is a meta-issue tracking the things that should be done before the release.

- Mark the status as `Beta` in PyPI metadata
- Rename the `master` branch to `main`, fix references in CI, Sonar and docs
- Add a `py.typed` file so that our annotations are visible to users (https://mypy.readthedocs.io/en/stable/installed_packages.html)
- Replace `#type:ignore` comments with a mypy configuration file
- Mention @morrme in the release notes

Draft release notes: https://gist.github.com/polsys/f0757723b73194a0bccb9043e7c75e47
Originally, conditioning variables were always continuous. #87 will add support for discrete-discrete variables and discrete conditions. However, this leaves many cases unimplemented:
In table form:
| | No condition | Discrete | Continuous | Mixed |
|---|---|---|---|---|
| Discrete-discrete | 🚧 | 🚧 | ❌ | ❌ |
| Discrete-continuous | ✔️ | ❌ | ✔️ | ❌ |
| Continuous-continuous | ✔️ | ❌ | ✔️ | ❌ |
While it would be nicest to support everything, this leads to a combinatorial explosion in the algorithms. Moreover, the user interface gets more and more complicated.

Therefore most of the cases will remain unimplemented until there is sufficient demand.
This is a meta-issue tracking the things that should be done before the release.

- Use the `master` -> `main` migration tool of GitHub (need to update references in CI and docs as well)

Draft release notes: https://gist.github.com/polsys/f0757723b73194a0bccb9043e7c75e47
Ross (DOI: 10.1371/journal.pone.0087357) has described a variant of the algorithm where one of the variables is discrete. The derivation only covers the unconditional case, but conditioning should be straightforward to add.

This could be surfaced as a `discrete_y` parameter on `estimate_mi`, assuming that this is the typical case of interest.
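A usage sketch of the proposal; nothing here is implemented yet, and the name `discrete_y` is only this issue's suggestion:

```python
import numpy as np
from ennemi import estimate_mi

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)          # discrete y: class labels
measurements = rng.normal(size=500) + labels   # continuous x

# Proposed: discrete_y=True would select the Ross estimator.
mi = estimate_mi(labels, measurements, discrete_y=True)
```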
The `pandas` support should be completely optional; currently we don't verify this automatically. Execute the integration tests in two phases:

- First run the tests without `pandas` installed,
- then install `pandas` and run the rest.

Try and see how much faster Numba makes the estimation code. I think there is a lot of potential, since the algorithm does a lot of tight loops in Python code.
It would be good to make the integration optional for those who do not have Numba installed. This would mean separate CI test legs for the compiled and interpreted versions. A sketch of the kind of loop that should benefit is below.
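A sketch of a tight counting loop under Numba's `@njit` nopython mode; the function is illustrative, not taken from the codebase:

```python
import numpy as np
from numba import njit

@njit
def count_within(distances, eps):
    # Count points strictly closer than eps. In pure Python this loop is
    # slow; in nopython mode it compiles to a simple machine-code loop.
    count = 0
    for d in distances:
        if d < eps:
            count += 1
    return count

# The first call triggers JIT compilation; subsequent calls are fast.
print(count_within(np.random.rand(10_000), 0.5))
```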
Following #23, it could make sense to create a command-line interface for common use cases. This could include e.g. a progress bar and a time estimate. I anticipate users creating these kinds of methods by themselves; at least I have a script like that of my own.

This could be released as an example script or a separate package. That way we would avoid a dependency on a CLI package.
Implement differential entropy estimation using the nearest-neighbor algorithm. I don't think this is strictly necessary/useful, but it's good to include for completeness. I haven't seen too many packages doing nearest-neighbor estimation of information theoretic measures.
Conditional entropy should also be available, with arbitrarily many conditioning variables. The method signature would be like `estimate_entropy(variables, k=3, cond=None)`. If multiple variables are given, they would be estimated separately. Pandas data types should be supported as usual.
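A usage sketch of the proposed signature; not implemented at the time of writing, and the behavior shown simply follows the description above:

```python
import numpy as np
from ennemi import estimate_entropy  # proposed API

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))

# Each of the three columns would get its own entropy estimate.
h = estimate_entropy(data, k=3)

# Conditional entropy via the cond parameter.
h_cond = estimate_entropy(data, k=3, cond=rng.normal(size=1000))
```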
NumPy is adding type annotations in 1.20 (due December?). We can then stop ignoring it in `mypy` runs. It looks like SciPy 1.6.0 (in December as well?) may support the annotations too. It appears that pandas is not shipping its annotations yet either.
Make the `cond_lag` parameter effectively two-dimensional, or rather have one dimension more than `lag`.
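A hypothetical shape sketch of what this could mean; the semantics here are only this issue's proposal:

```python
import numpy as np

lag = [0, 1, 2]  # three lags for x

# One row per lag value, one column per conditioning variable:
# cond_lag would have one dimension more than lag.
cond_lag = np.array([
    [0, 1],
    [1, 2],
    [2, 3],
])  # shape (len(lag), number of conditioning variables)
```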
Not something I'm currently planning, but could be an interesting, small research project.
High values of `k` provide good accuracy for low MI values, and vice versa. Based on an estimate done with the default value (e.g. `k=3`), approximate the best value to use and rerun the estimation. Because the effect of `k` is not terribly large, I think the second estimate could be more accurate.

- Tabulate the estimation error as a function of `k`, `n`, and correlation for some distributions (perhaps not only Gaussian).
- Limit `k` to `1 <= k <= 20` or similar so that the execution time does not blow up.

I assume that this would not be beneficial in all situations, and anyway it changes the results and execution time dramatically. Therefore the behavior should be opt-in, e.g. by `k="auto"`. A two-pass sketch is below.
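A sketch of the two-pass idea, with a made-up heuristic standing in for the simulated lookup table:

```python
from ennemi import estimate_mi

def _choose_k(rough_mi):
    # Placeholder heuristic: use a larger k when the rough estimate
    # suggests low MI, capped at 20 so the execution time does not blow
    # up. The real mapping would come from simulations over k, n, and
    # correlation.
    return max(1, min(20, round(10.0 / (1.0 + abs(rough_mi)))))

def estimate_mi_auto(y, x):
    rough = float(estimate_mi(y, x, k=3))  # first pass with the default k
    return estimate_mi(y, x, k=_choose_k(rough))
```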
Extend the `parallel` parameter to take the number of processes.

This case is probably useful; it would be good to "advertise" it more. Currently it is only included in the API docs. The tutorial may need some refactoring; a Table of Contents at minimum.
I'm especially interested in seeing if the unconditional MI case can be improved.
Use the existing `pip` functionality to simplify the CI script and the development install instructions. It should be possible to install all development-time dependencies with `pip install -e .[dev]`.
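This only needs an extras entry in the packaging metadata; a sketch with an illustrative dependency list (the actual tools may differ):

```python
# setup.py (sketch)
from setuptools import setup

setup(
    name="ennemi",
    extras_require={
        # Installed by `pip install -e .[dev]` on top of the runtime deps.
        "dev": ["pytest", "mypy", "pandas"],
    },
)
```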
Python 3.9 will be released in October. There will probably be associated NumPy/SciPy updates, and if we support Numba, certainly that too.

We need to do these things:
Part of #25 meta-issue.
This version might remove the need for special annotations on classes. Hopefully it would also enable JITting more of the code; some of the remaining issues seem to be bugs in Numba. To make fixes more likely, we should report any minimal reproducers. Additionally, support for the `axis` argument of `ndarray.max` would make more of the code work in `nopython` mode.
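For illustration, the limitation and its current workaround; this snippet is not from the codebase, it just demonstrates that the `axis` argument does not compile in nopython mode as of the Numba version discussed:

```python
import numpy as np
from numba import njit

@njit
def row_max(a):
    # a.max(axis=1) would fail to compile in nopython mode, so the
    # reduction has to be written out row by row instead.
    out = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        out[i] = a[i].max()
    return out

print(row_max(np.arange(6.0).reshape(2, 3)))
```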
Part of #25 meta-issue.
I want to trial Numba in production, and hence would like to include it in Alpha 2. The library does not seem complete/stable enough for the conditional MI code path, but the unconditional MI path works and has significant performance benefits.

- Add an `extras_require` entry for Numba
- Test with `NUMBA_DISABLE_JIT` and with a clean virtual environment

Test with the oldest supported Python, the oldest supported versions of libraries, and the oldest OS available on CI. This could be an opportunity to retry simplifying the matrix definition (inclusions instead of exclusions).
It would also be nice to have tests without SciPy and friends installed, although that might complicate the test code too much compared to the benefit.
Verify that no `TODO` comments remain in the code. The comments should represent work that needs to be done before 1.0.0. If any such comments remain, they should include a reference to a tracking issue.
This is a meta-issue tracking the things that should be done before the release.

- Mark the status as `Stable` in PyPI metadata
- Use the `master` -> `main` migration tool of GitHub (need to update references in CI and docs as well)
- Mention `Watch releases` in the README

Draft release notes: https://gist.github.com/polsys/f0757723b73194a0bccb9043e7c75e47
Write down (in the README, maybe) the plan for support. Users will then know what level of support to expect. Things I have in mind:
Related to #49. This also includes updating the README when I am no longer at INAR.
It could be useful to specify a lag for the conditioning variable in `pairwise_mi()`. This would enable some useful cases (conditioning on earlier observations) that now must be done manually. The lag should be a scalar in this case.
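A hypothetical usage sketch; the `cond_lag` parameter on `pairwise_mi` is only proposed here:

```python
import numpy as np
from ennemi import pairwise_mi

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 4))
temperature = rng.normal(size=500)

# Condition every pairwise estimate on the previous observation of
# the conditioning variable.
result = pairwise_mi(data, cond=temperature, cond_lag=1)
```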
This would be for completeness: no need to use another library for this use case. Performance shouldn't be a first priority here; it can be improved later.

- Add a `discrete_x` parameter to `estimate_mi`
- Add a `discrete` parameter (an array) to `pairwise_mi`

Related: #75 (maybe add another section to docs).
Describe some basic things about contributing:
It should be possible to override the `estimate_mi` behavior of estimating each `x` variable separately. Regression models usually include more than one covariate, and this would allow the MI estimation of such models. The n-dimensional estimation infrastructure should make the implementation straightforward.

How should this be represented in the interface?

- A `multidim_x` parameter. This would change the interpretation of the `x` variable. A bit ugly, and it would not support several multidimensional variables, but easy to implement.
- Making the `x` parameter three-dimensional. This would be the most flexible but awfully complex to use. It might also be ambiguous, and I've already fought enough with `numpy` on 1D/2D interpretation.

On the other hand, the interpretation of MI becomes more difficult in this case: it is no longer analogous with Pearson correlation.
The estimation method should be extensible to MI between arbitrarily many variables.
Because the interpretation of multivariate MI is hard, I do not see this as a useful feature. Therefore this is currently not planned for inclusion. If you think this feature would be useful, please comment on this issue.
The basic case is really fast and needs no parallelism, but large-dimensional and conditional cases could benefit from parallelization. Consider extending the parallelism heuristic with a "performance score" tailored for each of these cases. Related to #22.
List the error messages of the package and how to fix them.
Add an `estimate_pairwise_mi` method that takes in an array of variables and outputs an MI matrix such that `result[i,j] = estimate_mi(vars[i], vars[j])`. The user can already implement this method using `estimate_mi`, but our implementation would offer stronger parallelization.
The method should support the same `k` and conditioning parameters as `estimate_mi`. A reference implementation of the semantics is sketched below.
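The proposed method would be equivalent to this plain loop, just with stronger parallelization inside the library; the names are illustrative:

```python
import numpy as np
from ennemi import estimate_mi

def estimate_pairwise_mi_reference(variables):
    # Reference semantics: result[i, j] = MI between variables i and j.
    n = len(variables)
    result = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            result[i, j] = estimate_mi(variables[i], variables[j])
    return result
```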
Implement these two steps that are always done before analysis:

- Rescale the variables to a comparable scale
- Add low-amplitude noise to break ties between equal values
This should be opt-outable by a parameter (`preprocess=True`), and implemented in both public MI methods. For `estimate_entropy`, only the noise step must be included. For reproducibility, a fixed random seed should be used.
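A minimal sketch of the two steps under those assumptions; the scale of the noise and the seed are illustrative:

```python
import numpy as np

def preprocess(x, seed=2020):
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    x = np.asarray(x, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0)          # rescale to unit variance
    return x + rng.normal(scale=1e-10, size=x.shape)  # low-amplitude noise
```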
This is not planned for 1.0, but recorded here for completeness. If somebody wants to try this and can demonstrate it to be beneficial, I'm open to including it in the package, preferably as an optional component.
The current implementation runs on any platform supported by NumPy without any need for compilation. In Python code, it is much harder to violate memory safety or hit platform/compiler-specific issues. Because of this, I see #25 as the primary way of speeding up the code.
Allow all masks that can be broadcast to the shape of `x`. This would enable specifying a separate mask for each `x` variable, or even some more advanced scenarios.

For `estimate_entropy` with `multidim=True`, the masks should be combined, or two-dimensional masks should be disallowed.
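A sketch of what a broadcastable mask could look like; the data is invented and the semantics simply follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))

# One mask column per x variable: exclude the first ten observations of
# the first variable only, keeping the other variables untouched.
mask = np.ones((100, 3), dtype=bool)
mask[:10, 0] = False
```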
Related to #59. Should be done in 1.1, as NumPy is also dropping 3.6 support by then. Not going to drop support in 1.0, because at least Ubuntu 18.04 LTS uses 3.6 as its default interpreter. This should also include a bump of the minimum supported NumPy/SciPy versions.

With `from __future__ import annotations`, we should get a performance benefit and simplify the annotations (will still need to check whether e.g. `list` becomes a synonym for `typing.List` already in 3.7).
`list` is a synonym, but apparently `mypy` does not recognize it until 3.9; keeping the current style.

Go through all method documentation strings and make sure they are consistent and clear. It looks like I have sometimes forgotten to update them as I have made code changes.
- `k=N`.
- `k<N`, I don't want to introduce special code for this test only.
- The "`k` after mask" test does not take lags into account. (#38)

`estimate_mi` and `pairwise_mi` return `DataFrame`s when passed ones. Investigate how to annotate this case. Note that this might need a hard dependency on `pandas`. (Related to #48.)
This would let us remove some manual annotations in the `pandas` tests, and in any annotated user code (which I'm skeptical about).
Originally posted by @polsys in #91 (comment)
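One way the annotation could look, assuming `typing.overload` and a hard dependency on `pandas` for the stubs; a sketch, not the library's actual code:

```python
from typing import overload
import numpy as np
import pandas as pd

@overload
def estimate_mi(y: pd.DataFrame, x: pd.DataFrame) -> pd.DataFrame: ...
@overload
def estimate_mi(y: np.ndarray, x: np.ndarray) -> np.ndarray: ...

def estimate_mi(y, x):
    # The actual implementation dispatches on the input type.
    ...
```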
Make sure that all user-facing methods work well with `numpy.ma.masked_array` types. Masked observations should be excluded from the estimations. The existing masking infrastructure would keep working alongside this.
As the conditional entropy estimation uses the chain rule `H(X|Y) = H(X,Y) - H(Y)`, it is possible that the errors in the two entropy estimates do not cancel out. Check the derivation of conditional entropy, based on Kraskov et al., for this.

If there is little room for improvement, the current algorithm may be good performance-wise. The marginal distances for `Y` need to be computed only once. On the other hand, this is less expensive than the computation of the joint entropies.
There should be a few larger tests documenting and verifying real-world use cases. These should not be counted in code coverage. Depending on their run time, it might make sense to exclude these from PR runs.
This is a good practice.
Add an integration test that uses discrete data in conjunction with `pandas`. There are currently no unit tests for this case. Toying with the feature might still uncover some bugs too 🙂
Not necessarily generated from docstrings, but something like that.