inform's People

Contributors

dglmoore, gvalentini85, jakehanson


inform's Issues

Excess Entropy

Implement excess entropy in line with Feldman and Crutchfield [1, 2]. In particular, we will be implementing the one-dimensional excess entropy using the mutual information approach.

Proposed API

EXPORT double inform_excess_entropy(int const *series, size_t n, size_t m, int b, size_t k,
    inform_error *err);
EXPORT double *inform_local_excess_entropy(int const *series, size_t n, size_t m, int b, size_t k,
    double *ee, inform_error *err);

Example Usage

#define N 9

int xs[N] = {0,0,1,0,0,1,0,0,1};
inform_error err = INFORM_SUCCESS;
inform_excess_entropy(xs, 1, N, 2, 3, &err); // == 1.5

Compilation Fails with C++ Compiler

It looks like the transfer entropy header has an extra

#ifdef __cplusplus
}
#endif

which causes the compilation to fail when using a C++ compiler.

Copying distributions result in uninitialized values

There appears to be a bug in the inform_dist_copy code that results in an incorrect copy. Cursory investigation suggests that we have the source and destination arguments in memcpy swapped, so it is copying the destination into the source rather than the other way around.

Add tests to check for this.
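
For reference, a sketch of the corrected copy direction, assuming inform_dist keeps its counts in a heap-allocated histogram array of size entries (the field names below are assumptions about the internal layout):

#include <stdint.h>
#include <string.h>

inform_dist *inform_dist_copy(inform_dist const *src, inform_dist *dst)
{
    if (src == NULL || dst == NULL || dst->size != src->size)
        return NULL;
    // memcpy copies from its second argument into its first, so the
    // destination's histogram must come first
    memcpy(dst->histogram, src->histogram, src->size * sizeof(uint32_t));
    dst->counts = src->counts;
    return dst;
}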

Partial Information Decomposition

Implement partial information decomposition as described by Williams and Beer.

Proposed API

Implementing PID is going to take some effort on our part. I think the best approach will be to have a structure which represents a redundancy lattice of a given size:

typedef struct inform_pid_lattice inform_pid_lattice;

Each node in the lattice will have information about which nodes are below it and what the node's PI-function evaluates to. A single function call will take the various time series and construct the lattice:

inform_pid_lattice *inform_pid(int const *stimulus, int const *responses, 
    size_t l, size_t n, size_t m, int b, inform_pid_lattice *lattice, inform_error *err);
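
For concreteness, each node in such a lattice might carry something like the following (these field names are hypothetical, not a committed design):

typedef struct inform_pid_source
{
    size_t *name;                      // the source variables grouped into this node
    size_t size;                       // how many sources the node groups together
    struct inform_pid_source **below;  // nodes immediately below this one in the lattice
    size_t n_below;                    // number of nodes below
    double imin;                       // the redundancy (I_min) evaluated at this node
    double pi;                         // the node's partial information (PI-function) value
} inform_pid_source;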

Example Usage

int stimulus[4]    = {0, 0, 1, 1};
int responses[8] = {0, 0, 1, 1, 0, 1, 0, 1};

inform_pid_lattice *pid = inform_pid(stimulus, responses, 2, 1, 4, 2, NULL, NULL);

Can't compute block entropy when k > 31

This seems like an overflow problem where the base b is multiplied k times without any check in block_entropy.c.

I got the issue using PyInform's block entropy function, but the issue clearly seems to be due to Inform.

Code:

import numpy as np
from pyinform.blockentropy import block_entropy

x = (np.random.random([100]) > .5).astype(np.uint8)
for k in range(1, 50):
    print(k, block_entropy(x, k))

Output:

1 0.9953784388202257
2 1.9878129812393763
...
31 6.129283016944973

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-311-af5262631fbd> in <module>
      3 x = (np.random.random([100]) > .5).astype(np.uint8)
      4 for k in range(1, 50):
----> 5     print(k, block_entropy(x, k))

~\AppData\Roaming\Python\Python36\site-packages\pyinform\blockentropy.py in block_entropy(series, k, local)
    109         _local_block_entropy(data, c_ulong(n), c_ulong(m), c_int(b), c_ulong(k), out, byref(e))
    110     else:
--> 111         ai = _block_entropy(data, c_ulong(n), c_ulong(m), c_int(b), c_ulong(k), byref(e))
    112 
    113     error_guard(e)

OSError: exception: access violation writing 0x00000217586956EC

A better solution would be to use the biggest int type available, or at least raise an appropriate error message.
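
A sketch of such a guard (the exact placement in block_entropy.c and the use of INFORM_EENCODE as the error tag are assumptions on my part):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hedged sketch: before b^k is used as a histogram size, verify that it fits
// in the integer type actually used for that size (size_t here, purely for
// illustration). Assumes b >= 1.
static bool check_block_size(int b, size_t k, inform_error *err)
{
    size_t states = 1;
    for (size_t i = 0; i < k; ++i)
    {
        if (states > SIZE_MAX / (size_t) b)
        {
            if (err) *err = INFORM_EENCODE;
            return false;
        }
        states *= (size_t) b;
    }
    return true;
}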

Obviously memory and computational complexity are always going to be limiting factors here. Any suggestion for working around this? Curve-fitting has been suggested here, but in my case I don't think that block entropy converges fast enough to a "fittable" curve (referring to the fact that it is supposed to converge to a straight line with a slope corresponding to the entropy rate, as k goes to infinity).

Transfer entropy is missing

The only time-series specific information metric that has been implemented is active information. To be useful, transfer entropy is a minimum requirement.

Information Flow

Implement information flow as proposed by Ay and Polani.

Proposed API

double inform_information_flow(int const *src, int const *dst, int const *back, size_t l,
    size_t n, size_t m, int b, inform_error *err);

Example Usage

Pending

Black Boxing Function

When analyzing complex systems, it is often interesting to consider information processing in compound subsystems, e.g. transfer entropy between two groups of nodes instead of between two individual nodes. Instead of implementing that directly, we might want to provide a black boxing facility which essentially collapses the state of a group of nodes into a single encoded value.

Proposed API

EXPORT int* inform_black_box(int const *series, size_t l, size_t n, size_t m, int const *b,
    int const *r, int const *s, int *box, inform_error *err);

Example Usage

int series[20] = {
    1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
    0, 0, 1, 1, 1, 0, 1, 1, 0, 0
};
int box[8];
inform_error err = INFORM_SUCCESS;
inform_black_box(series, 2, 1, 10, (int[]){2,2}, (int[]){2, 1}, (int[]){1,0}, box, &err);
// box == {1, 12, 10, 9, 4, 14, 15, 3}
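
At its core, the black box is a mixed-radix encoding of the joint state. A minimal sketch of that step (ignoring the history and future lengths r and s) might be:

// Hedged sketch: encode the joint state of l nodes, with per-node bases b,
// as a single integer in mixed radix. History/future handling is omitted.
static int encode_state(int const *state, int const *b, size_t l)
{
    int code = 0;
    for (size_t i = 0; i < l; ++i)
        code = code * b[i] + state[i];
    return code;
}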

Information Measures on Floating-point Arrays

The current implementation of Inform implements the basic information measures using histogram-based, empirical probability distributions, inform_dist. However, sometimes users have a priori probability distributions which they would like to analyze. This can be done by approximating the probability distribution with inform_dist_approximate, but this approach has two downsides. First, it puts some burden on the user, requiring them to manage allocated memory. Second, there is both a computational cost, from memory allocation and function calls, and typically a loss of precision.

For these reasons, we should implement a suite of information measures that operate on floating-point arrays instead of inform_dist instances.
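
One possible shape for such an API; these names and signatures are purely hypothetical and only meant to illustrate the idea:

EXPORT double inform_shannon_pmf(double const *pmf, size_t n, double base);
EXPORT double inform_relative_entropy_pmf(double const *p, double const *q,
    size_t n, double base);
EXPORT double inform_mutual_info_pmf(double const *joint, size_t n, size_t m,
    double base);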

Uniform Distributions

Create a function to construct uniform distributions.

Proposed API

EXPORT inform_dist* inform_dist_uniform(size_t n);

Example Usage

inform_dist *dist = inform_dist_uniform(5);
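
A minimal sketch of one possible implementation, assuming the usual inform_dist_alloc/inform_dist_set accessors:

inform_dist *inform_dist_uniform(size_t n)
{
    if (n == 0) return NULL;
    inform_dist *dist = inform_dist_alloc(n);
    if (dist == NULL) return NULL;
    // one observation per event yields a uniform empirical distribution
    for (size_t i = 0; i < n; ++i)
        inform_dist_set(dist, i, 1);
    return dist;
}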

Predictive Information

Implement predictive information in line with Bialek, Nemenman and Tishby [1, 2]. This quantity is closely related to the excess entropy of Feldman and Crutchfield [3, 4].

Predictive information, as defined by Bialek et al., is the mutual information between a finite past and the semi-infinite future. Our implementation will use finite pasts and futures.

Note that we can think of active information as a special case of predictive information.

Proposed API

EXPORT double inform_predictive_info(int const *series, size_t n, size_t m, int b, size_t kpast,
    size_t kfuture, inform_error *err);
EXPORT double *inform_local_predictive_info(int const *series, size_t n, size_t m, int b, size_t kpast,
    size_t kfuture, double *pi, inform_error *err);

Example Usage

#define N 9

int xs[N] = {0,0,1,0,0,1,0,0,1};
inform_error err = INFORM_SUCCESS;
inform_predictive_info(xs, 1, N, 2, 3, 3, &err); // == 1.5

Thoughts on Implementations

We might consider implementing excess entropy in terms of predictive information, e.g.

double inform_excess_entropy(int const *series, size_t n, size_t m, int b, size_t k,
    inform_error *err)
{
    return inform_predictive_info(series, n, m, b, k, k, err);
}

Fix Unit Testing Library

The unit testing library, found in the test\unit\unit.h header, provides an entry point that does not let the user run individual test suites. If a suite is provided on the command line, the argument is effectively ignored and all tests are run. The make test command does exactly this, so all of the tests are run multiple times, increasing the time it takes to run them.

This is a negligible problem since the tests are so fast, but it'd be nice to make it go away.

Unexpected NaN for reasonable history lengths

The Bug
The active information algorithm fails and returns an untagged NaN when the requested history length is 29 or greater.

Confirmed on Platforms

  • OS X 10.11 (Apple LLVM 7.3.0)
  • Debian 8 (GCC 4.9.2).

Sample Code

#include "inform/inform.h"
#include "assert.h"

uint64_t* random_series(size_t size, uint64_t base);

int main()
{
    uint64_t *series = random_series(1000, 2);
    assert(!isnan(inform_active_info(series, 1000, 2, 28))); // succeeds
    assert(!isnan(inform_active_info(series, 1000, 2, 29))); // fails
}

There is probably a 32-bit integer somewhere that is overflowing.

Possible Fixes

  • Fix the (hypothetical) overflow
  • Make the NaN return explicit and tag it with an error code.

Add Conditional Entropy

As a generalization of entropy rate, it would be useful to be able to compute the conditional entropy between two timeseries.

Proposed API:

double inform_conditional_entropy(int const *xs, int const *ys, size_t n,
    int bx, int by, double b, inform_error *err);

double *inform_local_conditional_entropy(int const *xs, int const *ys, size_t n,
    int bx, int by, double b, double *lce, inform_error *err);

Example Usage:

#define N 9

int xs[N] = {0,0,1,1,1,1,0,0,0};
int ys[N] = {1,0,0,1,0,0,1,0,0};

inform_error err = INFORM_SUCCESS;
inform_conditional_entropy(xs, ys, N, 2, 2, 2.0, &err); // == 0.899985
inform_conditional_entropy(ys, xs, N, 2, 2, 2.0, &err); // == 0.972765

double lce[N];
inform_local_conditional_entropy(xs, ys, N, 2, 2, 2.0, lce, &err); // == lce
// lce == {1.322, 0.737, 0.415, 2.000, 0.415, 0.415, 1.322, 0.737, 0.737}
inform_local_conditional_entropy(ys, xs, N, 2, 2, 2.0, lce, &err); // == lce
// lce == {0.585, 1.000, 1.000, 1.585, 1.000, 1.000, 0.585, 1.000, 1.000}
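
As a sanity check on the averages above: xs gives p(x=0) = 5/9 and p(x=1) = 4/9, with p(y=0|x=0) = 3/5 and p(y=0|x=1) = 3/4, so H(Y|X) = (5/9)·H(2/5) + (4/9)·H(1/4) ≈ (5/9)(0.971) + (4/9)(0.811) ≈ 0.899985; averaging the local values in lce gives the same number.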

Logarithm bases should be arguments to the entropy functions

At the moment the various entropy functions assume base-2 logarithms. This is fine for boolean networks, where each node has 2 states; however, for networks whose states are non-boolean, the resulting entropies are not easily comparable to the boolean case. When interpreting the results, we have to keep the base in mind.

One method of resolving this issue is to compute the entropies using the same base as the network uses for its states, e.g. if the state of a node can be one of four values, use base-4 logarithms.

Remove Arbitrary-base Logarithms

The current API conflates two different notions: the base of the time series and the base of the logarithm. The base of the time series is always used as the base of the logarithm, which is somewhat confusing.

This change will not change the public API, but it will change the return values of all time series measures.

Relative Entropy

One useful value to compute is relative entropy. We already have relative entropy implemented on distributions (commit 2715539). It would be nice to construct a distribution from each of two timeseries and compute the relative entropy in one fell swoop.

Note that a local measure of relative entropy may not be so well defined on timeseries. Unlike the other local measures so far implemented, averaging the local relative entropy in the naive way will not generally return the global relative entropy. This is because the average is to be taken of the posterior distribution, not the joint distribution. This point should be discussed further.

Proposed API:

double inform_relative_entropy(int const *xs, int const *ys, size_t n,
    int bx, int by, double b, inform_error *err);

Example Usage:

#define N 9

int xs[N] = {0,0,1,1,1,1,0,0,0};
int ys[N] = {1,0,0,1,0,0,1,0,0};

inform_error err = INFORM_SUCCESS;
inform_relative_entropy(xs, ys, N, 2, 2, 2.0, &err); // == 0.038330
inform_relative_entropy(ys, xs, N, 2, 2, 2.0, &err); // == 0.037010
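
For reference, the empirical distributions here are p = (5/9, 4/9) for xs and q = (2/3, 1/3) for ys, so D(p||q) = (5/9)·log2(5/6) + (4/9)·log2(4/3) ≈ 0.038330 and D(q||p) = (2/3)·log2(6/5) + (1/3)·log2(3/4) ≈ 0.037010, matching the expected values above.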

Add accumulation functionality for distributions

The inform_dist API could stand to have a few additional functions.

Proposed API

// Accumulate observations from an array of events and return the number of observations made.
EXPORT size_t inform_dist_accumulate(inform_dist *dist, size_t *events, size_t n);

Example Usage

inform_dist *dist = inform_dist_alloc(2);
size_t events[5] = {0,1,1,0,1};
size_t n = inform_dist_accumulate(dist, events, 5);
if (n != 5)
{
    fprintf(stderr, "invalid event at index %ld\n", n);
}
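
A sketch of one possible implementation, consistent with the example above (the inform_dist_size and inform_dist_tick accessors are assumed to behave as their names suggest):

size_t inform_dist_accumulate(inform_dist *dist, size_t *events, size_t n)
{
    if (dist == NULL || events == NULL) return 0;
    for (size_t i = 0; i < n; ++i)
    {
        // stop at the first out-of-range event and report how many were recorded
        if (events[i] >= inform_dist_size(dist))
            return i;
        inform_dist_tick(dist, events[i]);
    }
    return n;
}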

Separable Information

Implement separable information as presented by Lizier, Prokopenko and Zomaya.

Separable information for node x in a system with causal sources ys is the sum of the active information storage of x and the transfer entropy from each source in ys to x.

Proposed API

EXPORT double inform_separable_info(int const *srcs, int const *dst, size_t l, size_t n, size_t m,
    int b, size_t k, inform_error *err);
EXPORT double *inform_local_separable_info(int const *srcs, int const *dst, size_t l, size_t n, size_t m,
    int b, size_t k, double *sep, inform_error *err);

Example Usage

#define L 2
#define N 1
#define M 9

int xs[N * M] = {0,1,0,1,0,1,0,0,1};
int ys[L * N * M] = {
    0,0,1,0,1,1,1,0,1,
    1,0,1,1,0,1,1,1,0
};

inform_error err = INFORM_SUCCESS;
inform_separable_info(ys, xs, L, N, M, 2, 2, &err); // 0.699514

New to Inform - Have a few questions

Dear Developers,

I currently use code from your colleagues called JIDT/IDTxl; I'm sure you are familiar with it. My code works well, but it has a prohibitively high computation time and memory footprint (even after I have parallelized it across multiple cores). I am looking for potential alternatives, and your code is implemented in C, which gives me hope that it might be a good alternative for my tasks.

I am interested in computing Entropy, Mutual Information and Transfer Entropy for 3D matrices of [Channel x Time x Trial], where Trial stands for repetitions of the same experiment.

I have read through your documentation and some parts of the source code, and still have unanswered questions. Would you be so kind as to answer them, or direct me to the correct place to ask these questions:

  • Is it currently possible to use real-valued data? All examples seem to use integer time series.
  • In the source code I have only found histogram-based estimators. Are there currently other estimators available (such as Kraskov)? Is the histogram estimator bias-corrected?
  • What exactly does block entropy for k>1 do? Does it split the time steps into subsets of length k, or does it sweep the time steps with a window of length k?
  • I am not able to figure out from the documentation what an initial condition is. Could you explain this concept or direct me to the literature? Is it the same as what I call a Trial? In that case, is it possible, for example, to find the mutual information between two variables for which only one time step, but many trials, are given?
  • Transfer Entropy operates with lags. Questions of interest are "what is TE for X->Y at lag=7" or "what is TE for X->Y given all lags={1,2,3,4,5}". Can a lag parameter be provided? What is the current convention?
  • JIDT provides multivariate TE estimators, which allow (to some extent) the elimination of spurious connections such as those due to common ancestry or an intermediate link. Is such functionality present or foreseen in the near future?
  • For TE and MI, another super valuable feature is a test against zero. Currently, JIDT performs such tests and returns p-values along with the estimates, allowing the user to assess whether there is any relationship between the variables above chance. Is such functionality implemented or intended?

In principle, I would be interested in contributing, if I can achieve my goals with your toolbox given a few weeks of coding.

Best,
Aleksejs

Various issues with the documentation

List of various issues I found in the documentation of inform while developing the R wrapper (rinform):

  • many time series measures point to a wiki page, Transfer Entropy lacks this link although there is a wiki page (https://en.wikipedia.org/wiki/Transfer_entropy)
  • transfer entropy: "between an information source and destination" -> missing "a destination"
  • transfer entropy: "interest have more just those two" -> missing "more than"
  • around the whole documentation: links to the inform github are sometimes broken, e.g. the "yet!" link https://github.com/elife-asu/issues/24 should be #24
  • predictive info, examples: "between a the current time step" should be "between the current time step"
  • the following ref is missing the journal name: "[Hoel2017] Hoel, E.P. (2017) "When the map is better than the territory". 19 (5): 188. doi:10.3390/e19050188."
  • reference [Ay2008] is not defined in the documentation
  • the equation for entropy rate is missing a negative sign

Multivariate Mutual Information

The current implementation of mutual information can compute the mutual information between two time-series. However, some applications, such as integration measures, require calculating the mutual information between more than two variables.

Proposed API Change

EXPORT double inform_mutual_info(int const *series, size_t l, size_t n, size_t m, int b,
    inform_error *err);
EXPORT double *inform_local_mutual_info(int const *series, size_t l, size_t n, size_t m, int b,
    double *mi, inform_error *err);

Example Usage

#include <assert.h>
#include <inform/mutual_info.h>
#include <stdio.h>

int main()
{
    int series[15] = {
        0, 1, 1, 0, 1,
        1, 1, 0, 1, 1,
        0, 1, 1, 0, 1,
    };

    inform_error err = INFORM_SUCCESS;
    double mi = inform_mutual_info(series, 3, 1, 5, 2, &err);
    assert(inform_succeeded(&err));
    printf("%0.6lf\n", mi); // 7.828819
}

Create a "Transition Probability Matrix" Function

Implementations of Effective Information would benefit from a function which computes transition probability matrices: a matrix encoding the probability of state transitions.

Proposed API

EXPORT double *inform_tpm(int const *series, size_t n, size_t m, int b, double *tpm,
    inform_error *err);

Example Usage

#include <inform/tpm.h>
#include <stdio.h>

int main()
{
    int series[13] = {0,0,1,0,1,0,0,1,0,1,0,0,1};
    double tpm[4];
    inform_error err = INFORM_SUCCESS;
    inform_tpm(series, 1, 13, 2, tpm, &err);
    if (inform_failed(&err))
    {
        fprintf(stderr, "an error occurred (%d)", err);
    }
    else
    {
        for (size_t i = 0; i < 2; ++i)
        {
            for (size_t j = 0; j < 2; ++j)
            {
                printf("%0.3lf ", tpm[2 * i + j]);
            }
            printf("\n");
        }
    }
}

This should produce the following output:

0.375 0.625
1.000 0.000

where the (i,j)th element is the probability that the system transitions to the jth state given that it is currently in the ith state.

Extrapolating to Infinite History Lengths

One feature that may be of value is a method for extrapolating the finite-history estimates of active information, transfer entropy, etc... to their full infinite-history counterparts. This is a feature that none of the related projects, e.g. JIDT, seem to have.

Proposed Approach

The obvious method for doing this is to implement a suite of curve-fitting functions which, given a sequence of values parameterized by k, would fit a closed-form curve. Taking the limit as k → ∞ would then give an approximation of the infinite-history form of the various measures.
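
As a rough illustration only (an assumption about one possible approach, not a committed design): if the finite-k estimates converge roughly geometrically, Aitken's delta-squared process can extrapolate the k → ∞ limit from the last three values.

#include <math.h>
#include <stddef.h>

static double extrapolate_limit(double const *x, size_t n)
{
    if (n == 0) return NAN;
    if (n < 3) return x[n - 1];
    double a = x[n - 3], b = x[n - 2], c = x[n - 1];
    double denom = a - 2 * b + c;
    if (fabs(denom) < 1e-12) return c; // already (numerically) converged
    return c - (c - b) * (c - b) / denom;
}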

Rename inform_shannon

The inform_shannon function computes the Shannon entropy of an inform_dist. The name is a bit too ambiguous and doesn't really follow the inform_shannon_* naming convention used for the other measures. We should really rename inform_shannon to inform_shannon_entropy.

Missing unit tests

The current release of inform is missing a dedicated unit test for:

  • the conditional entropy function
  • the conditional mutual information function

both computed on probability distributions.

See files:

  • "test/unittests/shannon/multivariate.c"
  • "test/unittests/shannon/univariate.c"

Integrated Spatiotemporal Patterns

Implement algorithms for computing time series integration based on the work of Biehl, Ikegami and Polani.

One of the problems with measures of integration such as integrated spatiotemporal patterns (ISTP) is that their computational complexity is super-exponential in the number of sources. This is a result of the fact that one has to consider every possible partitioning scheme for the sources. As an approximation, we will have to take a "level" argument which specifies how many different partitions we wish to consider.

For example, level = 0 will mean that we wish to consider every possible partitioning, level = 1 will mean we consider only the finest partition, and level = 2 will mean we consider the two finest levels of partitioning. We will also allow negative levels, e.g. level = -1 will consider the coarsest (non-trivial) partitioning.

Proposed API

EXPORT double *inform_istp(int const *series, size_t l, size_t n, size_t m,
    int const *b, int level, double *istp, inform_error *err);

Example Usage

int series[18] = {0,0,1,0,1,0,0,1,0,  1,0,0,1,0,1,1,0,1};
double istp[9];
inform_error err = INFORM_SUCCESS;
inform_istp(series, 2, 1, 9, (int[]){2,2}, 1, istp, &err);
// istp == { 0.58,-1.41, 1.17, 0.58, 1.17, 0.58, 0.58, 1.17, 0.58}

Add OS X binaries

The v0.0.3 and v0.0.4 binary releases only contain Windows (MSVC) and Linux 64-bit binaries. In order for the wrapper projects to support OS X, we need to include OS X binaries in future releases.

Generalized Local Information Measures

Premise: what follows makes sense when metrics are computed from a dataset with multiple initial conditions (i.e., a set of time series). Disregard otherwise.

It would be nice to have a way to tell the library to compute probabilities (or distributions) as a function of time (in contrast to using the entire time series) in local variants of metrics like AI or TE.
By "probabilities as a function of time" I mean that, if I'm computing AI at time step i with history length k, then the probabilities are computed only over entries (i-k, ..., i-1, i, i+1) of each time series in the dataset.

Effective Information

Implement cause, effect and effective information as described by Hoel, Albantakis and Tononi.

Proposed API

EXPORT double inform_effective_info(double const *tpm, double const *intervention, size_t n,
    inform_error *err);

Example Usage

int series[10] = {0,1,1,0,1,0,0,1,0,1};

inform_error err = INFORM_SUCCESS;
double *tpm = inform_tpm(series, 1, 10, 2, NULL, &err);
assert(inform_succeeded(&err));

inform_effective_info(tpm, (double[2]){0.25, 0.75}, 2, &err); // 0.471407
assert(inform_succeeded(&err));

Fix Potential Memory Leaks

Several of the time series measures perform multiple allocations. The difficulty is that if one of those allocations fails, then we have to free all of the successful allocations. We are not doing that at present.
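
The standard remedy is the goto-cleanup idiom. A sketch with hypothetical buffer names (INFORM_ENOMEM is assumed to be the appropriate out-of-memory tag):

#include <stdlib.h>

static double *compute_with_workspace(size_t n, inform_error *err)
{
    double *result = NULL;
    int *states = malloc(n * sizeof(int));
    int *histories = malloc(n * sizeof(int));
    if (states == NULL || histories == NULL)
    {
        if (err) *err = INFORM_ENOMEM;
        goto cleanup;
    }

    result = malloc(n * sizeof(double));
    if (result == NULL)
    {
        if (err) *err = INFORM_ENOMEM;
        goto cleanup;
    }

    // ... fill result using states and histories ...

cleanup:
    free(histories); // free(NULL) is a no-op, so partial failures are safe
    free(states);
    return result;
}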

Cross Entropy

We have already defined mutual information and relative entropy between time series and between distributions. Let's round out the basic measures with cross entropy!

Proposed API

EXPORT double inform_cross_entropy(const int *p, const int *q, size_t n, size_t m, int b,
    inform_error *err);

Example Usage

int p[10] = {0,1,1,0,1,0,0,1,0,0}; // counts: (6, 4)
int q[10] = {1,1,1,0,1,1,0,0,0,1}; // counts: (4, 6)
inform_error err = INFORM_SUCCESS;
inform_cross_entropy(p, q, 1, 10, 2, &err); // 1.087943
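
For reference, p has empirical probabilities (0.6, 0.4) and q has (0.4, 0.6), so the expected value is −0.6·log2(0.4) − 0.4·log2(0.6) ≈ 1.087943, matching the comment above.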

Black boxing encoding errors

From a discussion with @jakehanson, we realize that inform_black_box could result in integer overflow in some pathological situations. For example, if the user tries to encode the state of a 32-node boolean network with history length 1, and future length 0, the resulting encoded state will almost necessarily overflow a signed 32-bit integer.

There isn't much we can do about this aside from checking, based on the function parameters, whether we are in such a pathological situation and, if so, returning an INFORM_EENCODE error.

Setup documentation

We've so far been documenting the API using Doxygen-style source comments. We need to go through the process of setting up some documentation generation solution using either Doxygen or AsciiDoc.

OSError when computing transfer entropy

Hi, I'm trying to compute the transfer entropy using PyInform with background processes. The data I have is quite large (many random boolean networks with many initial conditions, trying to reproduce the results in this paper) and above a certain amount of data I get this kind of error on my Windows 10 laptop:

OSError: exception: access violation reading 0x0000025CD86EF000

and on linux (on my uni's high performance computing cluster) I get the following kind of error:

207277 Segmentation fault python inform_transfer_entropy_issue.py

I have written a bit of code to reproduce this error:

import numpy as np
from pyinform import transfer_entropy

init_conditions = [np.random.randint(0,2,(400,250)).astype(bool) for _ in range(4480)]

sources = [states[:, 1] for states in init_conditions]
recievers = [states[:, 2] for states in init_conditions]
apparent_transfer = transfer_entropy(sources, recievers, k=13)
conditions = None
other_inputs = [3, 4]
conditions = []
for other_inp in other_inputs:
    conditions.append([states[:, other_inp] for states in init_conditions])
complete_transfer = transfer_entropy(sources, recievers, k=13, condition=conditions)

Note that the error is only triggered on the line that computes the transfer entropy with conditions (the last line). Also note that I am generating (mock) data for 4480 initial conditions (the amount of initial conditions for a random boolean network in practice), but it even happens when it is just for 430 initial conditions. I have assumed this is some kind of bug. But I also want to make sure I'm going about this correctly. For the conditions argument, if I have multiple background processes and multiple initial conditions, is the third dimension of the multi-dim array the different initial conditions (like I have set it up in the code example)? Is the function simply not able to handle large amounts of initial conditions? If my usage makes sense and this is an issue with inform, is there some workaround I can use?

I would appreciate any help you can provide.

Function-Call Overhead

The entropy measures are written for maximum code reuse. The advantage of this is that it makes maintenance easier, but it comes with a cost. For example, inform_shannon uses inform_shannon_si in a tight loop. The function-call overhead means we lose performance.

Maintainability is a higher priority than performance, but the gain in performance (~2x) is worth the mild loss in maintainability.

Incorrect Results for Complete Transfer Entropy with Multiple Initial Conditions

Description

Complete transfer entropy is designed in such a way that if a background process is informationally equivalent to the source node, then TE=0. For example, inform_transfer_entropy behaves as expected in the following example:

int const xs[9] = {0,1,1,1,1,0,0,0,0};
int const ys[9] = {0,0,1,1,1,1,0,0,0};
int const *back = xs;
double te =  inform_transfer_entropy(xs, ys, back, 1, 1, 9, 2, 2, NULL);
assert(te == 0.0);

However, when we move up to multiple initial conditions, something goes awry:

int const xs[18] = {1,0,0,0,0,1,1,1,1,
                    1,1,1,1,0,0,0,1,1};                                              
int const ys[18] = {0,0,1,1,1,1,0,0,0,                                               
                    1,0,0,0,0,1,1,1,0};  
int const *back = xs;
double te =  inform_transfer_entropy(xs, ys, back, 1, 2, 9, 2, 2, NULL);
assert(te != 0.0);
// te ~ 0.536413

Continuously-Valued Timeseries

This is a discussion post. Please feel free to comment and contribute to the discussion even if you are not directly involved in the development of inform or its wrapper libraries.

The Problem

The various information measures are really designed around discrete-valued timeseries data. In reality, most data are continuous in nature, and up to this point our go-to approach has been to bin.

At this point we've implemented several binning procedures (see 1355d68). Binning works fine for some problems (e.g. if the system has a natural threshold), but when it is applied artificially it can introduce hefty bias. The problem gets worse when you attempt to compare two different timeseries. Should they be binned in the same way, e.g. uniform bin sizes, specific number of bins, etc...?

Possible Solutions

All of the information measures are built around probability distributions. The timeseries measures simply construct empirical probability distributions and call an information measure on the distribution. "All" that must be done to accommodate continuously-valued distributions is to attempt to infer the distribution from the data.

Machine learning is more or less built around inferring probability distributions and then making some sort of decision from them. Consequently, there are easily dozens of algorithms for inferring distributions from continuously-valued observations. One simple example of such an algorithm, kernel density estimation, has been around since the 1950s.
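
As a purely illustrative sketch of that kind of estimator (not a proposed addition to the API), a Gaussian kernel density estimate at a point x looks like:

#include <math.h>
#include <stddef.h>

// Toy sketch of a Gaussian kernel density estimate with bandwidth h;
// bandwidth selection and the subsequent entropy estimate are left out.
static double kde_gaussian(double x, double const *samples, size_t n, double h)
{
    static double const pi = 3.14159265358979323846;
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
    {
        double u = (x - samples[i]) / h;
        sum += exp(-0.5 * u * u);
    }
    return sum / (n * h * sqrt(2.0 * pi));
}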

Usefulness

This would likely be useful to @dglmoore and @colemathis as the systems that we deal with are either continuously-valued, or are so discrete that treating them as continuous is more memory efficient than otherwise. Would this be useful to anyone else? If so, we can prioritize it over some of the other new features that we are considering.

Acknowledgments

The JIDT project, written and maintained by the estimable Joe Lizier, implements such an approach. The work produced a paper which describes the three inference algorithms they've implemented.

Also, thank you @hbsmith and @colemathis for pointing out JIDT.

Alternative Entropies

This is a discussion post. Please feel free to comment and contribute to the discussion even if you are not directly involved in the development of inform or its wrapper libraries.

Premise

Claude Shannon introduced his measure of entropy in his 1948 paper A Mathematical Theory of Communication. Since then, several new measures of entropy have been developed (see Renyi, 1961 and Tsallis, 1988 for notable examples). Each of these is actually a family of entropy measures parameterized by at least one continuous parameter, and each tends toward Shannon's measure in some limit of those parameters. They also admit divergences which tend toward the Kullback–Leibler divergence in the same limit.
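
For concreteness, a sketch of the Rényi entropy of order alpha for a probability mass function (purely illustrative, not part of the current API):

#include <math.h>
#include <stddef.h>

// H_alpha(p) = log2(sum_i p_i^alpha) / (1 - alpha), which tends to the
// Shannon entropy as alpha -> 1.
static double renyi_entropy(double const *p, size_t n, double alpha)
{
    if (alpha == 1.0) // the Shannon limit
    {
        double h = 0.0;
        for (size_t i = 0; i < n; ++i)
            if (p[i] > 0.0) h -= p[i] * log2(p[i]);
        return h;
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        if (p[i] > 0.0) sum += pow(p[i], alpha);
    return log2(sum) / (1.0 - alpha);
}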

Question

These alternative entropy measures do have a place in our toolbox. The question is whether it is worthwhile to put some effort into implementing them. To my knowledge, Renyi and Tsallis entropies are not implemented in any of the standard information toolkits. This could be because they are not useful, in which case implementing them would be a waste of a lot of thought and energy, and would drive us a little closer to carpal tunnel. Or it could be that no one has considered using them and treasures are waiting to be uncovered.

Let us know what you think!

Add Intervention Distribution to Information Flow

Description

To properly discuss causal structure, we need to be able to apply generic intervention distributions. The current implementation of inform_effective_info admits this possibility, but inform_information_flow does not. We should rectify this situation.

Proposed Resolution

EXPORT double inform_information_flow(int const *src, int const *dst,
    int const *back, double const *inter, size_t l_src, size_t l_dst,
    size_t l_back, size_t n, size_t m, int b, inform_error *err);

Distributions for Probabilities

Sometimes we'd like to be able to use the functionality of Inform given a known probability distribution rather than one inferred from time series data. To that end, we need a function to construct a distribution which reproduces the original probabilities up to some tolerance.

This has already been done in PyInform by @jakehanson.

Proposed API

EXPORT inform_dist *inform_dist_estimated(double *probs, size_t n, double tol);

Example Usage

double probs[3] = {0.5, 0.2, 0.3};
inform_dist *dist = inform_dist_estimated(probs, 3, 1e-6);
assert(dist->counts == 10);
assert(inform_dist_get(dist, 0) == 5);
assert(inform_dist_get(dist, 1) == 2);
assert(inform_dist_get(dist, 2) == 3);
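
One possible approach, consistent with the example above though not necessarily what PyInform does (inform_dist_set is assumed to behave as its name suggests): search for the smallest denominator that reproduces every probability within the tolerance.

#include <math.h>
#include <stddef.h>
#include <stdint.h>

inform_dist *inform_dist_estimated(double *probs, size_t n, double tol)
{
    size_t const max_denom = (size_t) ceil(1.0 / tol);
    for (size_t N = 1; N <= max_denom; ++N)
    {
        int ok = 1;
        for (size_t i = 0; i < n && ok; ++i)
        {
            double count = round(probs[i] * (double) N);
            ok = fabs(count / (double) N - probs[i]) <= tol;
        }
        if (ok)
        {
            inform_dist *dist = inform_dist_alloc(n);
            if (dist == NULL) return NULL;
            // record the integer counts that reproduce the probabilities
            for (size_t i = 0; i < n; ++i)
                inform_dist_set(dist, i, (uint32_t) round(probs[i] * (double) N));
            return dist;
        }
    }
    return NULL;
}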

Is Inform active or not? What about kNN-based estimators?

Good day,

My question is whether Inform is still an active project. Would you be willing to consider implementing kNN-based estimators of entropy and mutual information, i.e. Kozachenko-Leonenko and Kraskov-Stögbauer-Grassberger?

Thanks.

Complete Transfer Entropy

Implement complete transfer entropy to parallel the currently implemented apparent transfer entropy inform_transfer_entropy. See Lizier, Prokopenko and Zomaya for more information.

Proposed API

EXPORT double inform_transfer_entropy(int const *ys, int const *xs, int const *vs, size_t l,
    size_t n, size_t m, int b, size_t k, inform_error *err);
EXPORT double *inform_local_transfer_entropy(int const *ys, int const *xs, int const *vs, size_t l,
    size_t n, size_t m, int b, size_t k, double *te, inform_error *err);

Example Usage

int xs[8] = {0,1,0,0,1,1,0,1};
int ys[8] = {0,1,1,0,0,1,0,0};
int zs[8] = {1,1,0,0,0,1,0,0};

inform_error err = INFORM_SUCCESS;
inform_transfer_entropy(ys, xs, zs, 1, 1, 8, 2, 2, &err); // 0.333333

Add Artifacts to Continuous Integration Builds

Description

We have to build on each of the target platforms, Linux, Windows and OS X. This requires having access to a computer with that operating system installed. We should set up the continuous integration builds to save the compiled binaries so that we can use those for releases. This will make the release process easier.
