pymccorrelation's Introduction

pymccorrelation

A tool to calculate correlation coefficients for data, using bootstrapping and/or perturbation to estimate the uncertainties on the correlation coefficient. This was initially a python implementation of the Curran (2014) method for calculating uncertainties on Spearman's Rank Correlation Coefficient, but has since been expanded. Curran's original C implementation is MCSpearman (ASCL entry).

Currently the following correlation coefficients can be calculated (with bootstrapping and/or perturbation):

  • Pearson's r
  • Spearman's rank correlation coefficient
  • Kendall's tau

Kendall's tau can also be calculated when some of the data are left/right censored, following the method described by Isobe+ 1986.

Requirements

  • python3
  • scipy
  • numpy

Installation

pymccorrelation is available via PyPI and can be installed with:

pip install pymccorrelation

Usage

pymccorrelation exports a single function to the user (also called pymccorrelation).

from pymccorrelation import pymccorrelation

[... load your data ...]

The correlation coefficient can be one of pearsonr, spearmanr, or kendallt.

For example, to compute Pearson's r for a sample, using 1000 bootstrap iterations to estimate the uncertainties:

res = pymccorrelation(data['x'], data['y'],
                      coeff='pearsonr',
                      Nboot=1000)

The output, res, is a tuple of length 2, whose elements are:

  • numpy array with the correlation coefficient (Pearson's r, in this case) percentiles (by default 16%, 50%, and 84%)
  • numpy array with the p-value percentiles (by default 16%, 50%, and 84%)

The percentile ranges can be adjusted using the percentiles keyword argument.
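For example, assuming percentiles accepts a sequence of percentile values (check the docstring to confirm), the 5th/50th/95th percentiles could be requested with:

# request 5/50/95 percentiles instead of the default 16/50/84
res = pymccorrelation(data['x'], data['y'],
                      coeff='pearsonr',
                      Nboot=1000,
                      percentiles=(5, 50, 95))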

Additionally, if the full posterior distribution is desired, it can be obtained by setting the return_dist keyword argument to True. In that case, res becomes a tuple of length 4:

  • numpy array with the correlation coefficient percentiles (as above)
  • numpy array with the p-value percentiles (as above)
  • numpy array with the full set of correlation coefficient values from the bootstrapping
  • numpy array with the full set of p-values computed from the bootstrapping
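For example, a minimal sketch of unpacking all four elements:

r_perc, p_perc, r_dist, p_dist = pymccorrelation(data['x'], data['y'],
                                                 coeff='pearsonr',
                                                 Nboot=1000,
                                                 return_dist=True)
# r_dist and p_dist each hold one value per bootstrap iteration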

Please see the docstring for the full set of arguments, including how to provide measurement uncertainties (necessary for perturbation) and how to mark censored data.
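As a hedged sketch of a perturbation run, assuming the keyword names dx, dy, and Nperturb for the uncertainties and the number of perturbation iterations (confirm these against the docstring):

# perturb each point within its uncertainty, 1000 times
# (dx, dy, and Nperturb are assumed keyword names; check the docstring)
res = pymccorrelation(data['x'], data['y'],
                      dx=data['x_err'], dy=data['y_err'],
                      coeff='spearmanr',
                      Nperturb=1000)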

Citing

If you use this script as part of your research, I encourage you to cite the following papers:

  • Curran 2014: Describes the technique and application to Spearman's rank correlation coefficient
  • Privon+ 2020: First use of this software, as pymcspearman.

Please also cite scipy and numpy.

If your work uses Kendall's tau with censored data please also cite:

  • Isobe+ 1986: Censoring of data when computing Kendall's rank correlation coefficient.

pymccorrelation's People

Contributors

  • privong

pymccorrelation's Issues

Mock dataset to test recoverability

From Ilsang:

I am really curious whether the output correlation coefficient recovers the intrinsic correlation coefficient within a certain confidence interval. The correlation coefficient is defined as cov(X,Y)/sqrt(var(X)*var(Y)), so you can easily generate random data points with a given covariance matrix and know what the input correlation coefficient is. If you simulate measuring the correlation coefficient of such a dataset (using your program) with different sample sizes and measurement errors, how reliably can you recover the input correlation coefficient?

https://scipy-cookbook.readthedocs.io/items/CorrelatedRandomSamples.html
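For reference, a minimal numpy sketch of the cookbook approach: draw samples with a known input correlation coefficient and check that the sample correlation recovers it:

import numpy as np

rho_in = 0.7  # intrinsic (input) correlation coefficient
cov = np.array([[1.0, rho_in],
                [rho_in, 1.0]])

rng = np.random.default_rng(42)
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=1000).T

# the sample correlation should recover rho_in within sampling noise
print(np.corrcoef(x, y)[0, 1])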

Add pearson

Implement the Pearson correlation coefficient, after finishing the single-interface change (#2).

De-duplicate bootstrapping / perturbation code

Instead of separate pymcspearman() and pymckendalltau() functions, provide a single pymccorrelation() function where the user can specify the correlation coefficient to calculate:

  • Spearman's Rank Correlation Coefficient
  • Kendall Rank Correlation Coefficient (optionally considering censored data)
  • others...

Convergence of bootstrapping results, `nan` values for small sample sizes

If the sample size is small (or the number of bootstraps is large), the correlation coefficients can be undefined and return nan values. The use of np.percentile() then returns nan from pymccorrelation(). If there are many nan values, this probably suggests the bootstrapping is not well converged. When building the mock dataset to check recovery (#4), the convergence of the bootstrapping would be good to consider as well.

Ultimately, decide if nanpercentile() should be used, optionally with a warning if the size of the dataset is too small for reliable bootstrap error estimation.

There is probably statistics literature about this too...
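A minimal sketch of the proposed behavior (the helper name and warning wording here are illustrative, not the package's API):

import warnings
import numpy as np

def summarize_bootstrap(coeffs, percentiles=(16, 50, 84)):
    """Summarize bootstrap coefficients, ignoring undefined (nan) draws."""
    coeffs = np.asarray(coeffs, dtype=float)
    nan_frac = np.isnan(coeffs).mean()
    if nan_frac > 0:
        warnings.warn(f"{nan_frac:.1%} of bootstrap iterations returned nan; "
                      "the dataset may be too small for reliable error estimation.")
    return np.nanpercentile(coeffs, percentiles)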

Speed-up Kendall's tau w/ censoring

The implementation of Kendall's tau for censored data is slow, likely due in part to the nested for loops. The code should be profiled to see whether it can be sped up (by vectorization or other approaches).
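For the uncensored case, the pairwise loops can be replaced with numpy broadcasting; a sketch of the idea (tau-a, assuming no ties, and not the package's censored implementation):

import numpy as np

def kendall_tau_vectorized(x, y):
    """Kendall's tau-a via sign outer products instead of nested loops."""
    x, y = np.asarray(x), np.asarray(y)
    # n x n matrices of pairwise difference signs (O(n^2) memory)
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    n = len(x)
    # concordant pairs contribute +1, discordant -1; each pair is counted twice
    return (sx * sy).sum() / (n * (n - 1))

Whether a similar broadcasting trick extends cleanly to the censored (Isobe+ 1986) case would need to be checked against the existing loops.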

More flexible error distributions

As of now, the library doesn't allow non-symmetric uncertainties, or values at a boundary (where the uncertainty is one-sided).
These cases are extremely common in, e.g., astrophysics fitting results. I've implemented this locally and it's a very minor change. I think a proper implementation, up to the standards of such an important library, would be very beneficial to the community.
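One way to perturb asymmetric uncertainties is a split normal: draw a standard normal deviate and scale it by the upper or lower error depending on its sign. A hedged sketch (not the library's implementation; the function name is illustrative):

import numpy as np

def perturb_asymmetric(values, err_lo, err_hi, rng=None):
    """Perturb each value with a split normal: err_hi above, err_lo below."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    z = rng.standard_normal(values.shape)
    scale = np.where(z >= 0, err_hi, err_lo)
    return values + z * scale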

Simplify function calls

In functions such as pymcspearman(), having both bootstrap=True and Nboot=10000 is redundant. Eliminate the flag argument and default to Nboot=None; non-None values will enable bootstrapping or perturbation.
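A minimal sketch of the proposed signature (illustrative only):

def pymcspearman(x, y, Nboot=None, Nperturb=None):
    # bootstrapping runs only when Nboot is not None;
    # perturbation runs only when Nperturb is not None
    if Nboot is not None:
        ...  # bootstrap path
    if Nperturb is not None:
        ...  # perturbation path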

Minor misspelling on README

Hey, just a minor misspelling in the README, both here and on the PyPI page, that could give some trouble in the future, so I thought I'd give you a heads up.
In the example that computes Pearson's r for a sample using 1000 bootstrapping iterations, you have the snippet:
res = pymccorrelation(data['x'], data['y]',
coeff='pearsonr',
Nboot=1000)

The data for the y variable is typed as data['y]', with the closing quote on the wrong side; it should be data['y']. If copied and pasted directly, this simple error can cause some trouble until it is fixed.
That's all, thanks for the package, it really helped me.

Feature request: tests of data drawn from distributions

Apply the Monte Carlo (bootstrapping and perturbation) methods to statistical tests comparing data with empirical or parametric probability distributions (see the sketch after the lists below).

Tests to Consider for Implementation

Top Priority

  • Anderson-Darling Test (scipy)
  • Kolmogorov-Smirnov Test (data with distribution [scipy], data with other data [scipy])
  • Shapiro-Wilk (scipy).
    • The scipy implementation does not handle censored data, but an algorithm for that is given in Algorithm AS R94, Appl. Statist. (1995), Vol. 44, No. 4.

Lower priority

  • Fligner-Killeen (equality of variance, scipy)
  • Mood’s median test (scipy)
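As an illustration of the idea, a bootstrapped two-sample Kolmogorov-Smirnov test built on scipy (a sketch of the approach, not a proposed API):

import numpy as np
from scipy import stats

def bootstrap_ks(a, b, nboot=1000, percentiles=(16, 50, 84), rng=None):
    """Bootstrap the two-sample KS statistic and p-value."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty((nboot, 2))
    for i in range(nboot):
        ra = rng.choice(a, size=len(a), replace=True)
        rb = rng.choice(b, size=len(b), replace=True)
        res = stats.ks_2samp(ra, rb)
        out[i] = (res.statistic, res.pvalue)
    # percentiles of the KS statistic and of the p-value
    return (np.percentile(out[:, 0], percentiles),
            np.percentile(out[:, 1], percentiles))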

Warn user if Nboot > Npermutations

If the number of bootstraps requested is larger than the number of distinct possible resamples, duplicate coefficient/p-value results will be returned. This will lead to over-representation of those values in the probability distribution.

This check can be done at the start, when data lengths are checked for consistency:

# do some checks on input array lengths and ensure the necessary data
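A sketch of such a check: drawing n items with replacement (order ignored) admits C(2n-1, n) distinct resamples, so a larger Nboot guarantees duplicates. The function name here is illustrative:

import warnings
from scipy.special import comb

def check_nboot(n, Nboot):
    """Warn if Nboot exceeds the number of distinct bootstrap resamples."""
    ndistinct = comb(2 * n - 1, n, exact=True)
    if Nboot > ndistinct:
        warnings.warn(f"Nboot={Nboot} exceeds the {ndistinct} distinct "
                      f"resamples of a sample of size {n}; duplicates are "
                      "guaranteed and will be over-represented.")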
