
pingouin's Introduction



Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. For a full list of available functions, please refer to the API documentation.

  1. ANOVAs: N-ways, repeated measures, mixed, ancova
  2. Pairwise post-hoc tests (parametric and non-parametric) and pairwise correlations
  3. Robust, partial, distance and repeated measures correlations
  4. Linear/logistic regression and mediation analysis
  5. Bayes Factors
  6. Multivariate tests
  7. Reliability and consistency
  8. Effect sizes and power analysis
  9. Parametric/bootstrapped confidence intervals around an effect size or a correlation coefficient
  10. Circular statistics
  11. Chi-squared tests
  12. Plotting: Bland-Altman plot, Q-Q plot, paired plot, robust correlation...

Pingouin is designed for users who want simple yet exhaustive statistical functions.

For example, the ttest_ind function of SciPy returns only the T-value and the p-value. By contrast, the ttest function of Pingouin returns the T-value, the p-value, the degrees of freedom, the effect size (Cohen's d), the 95% confidence intervals of the difference in means, the statistical power and the Bayes Factor (BF10) of the test.
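As a rough illustration of that difference, the sketch below calls SciPy's ttest_ind and then derives by hand two of the extra quantities mentioned above (the degrees of freedom and Cohen's d). The pooled-standard-deviation formula for Cohen's d is the usual textbook one and is an assumption here, not a description of Pingouin's internals.

```python
import numpy as np
from scipy import stats

# Same simulated data as the quick-start example
np.random.seed(123)
mean, cov, n = [4, 5], [(1, .6), (.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T

# SciPy returns only the T-value and the p-value
res = stats.ttest_ind(x, y)
t_val, p_val = res.statistic, res.pvalue

# Extra quantities computed by hand (assumed textbook formulas)
dof = x.size + y.size - 2
pooled_sd = np.sqrt(((x.size - 1) * x.var(ddof=1)
                     + (y.size - 1) * y.var(ddof=1)) / dof)
cohen_d = (x.mean() - y.mean()) / pooled_sd

print(f"T = {t_val:.3f}, p = {p_val:.3f}, dof = {dof}, d = {cohen_d:.3f}")
```

With equal group sizes, the T-value and Cohen's d differ only by a factor of sqrt(2 / n), which is why a single function can report both at little extra cost.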

Documentation

Chat

If you have questions, please ask them in GitHub Discussions.

Installation

Dependencies

The main dependencies of Pingouin are :

In addition, some functions require :

Pingouin is a Python 3 package and is currently tested for Python 3.8-3.11.

User installation

Pingouin can be easily installed using pip

pip install pingouin

or conda

conda install -c conda-forge pingouin

New releases are frequent so always make sure that you have the latest version:

pip install --upgrade pingouin

Development

To build and install from source, clone this repository or download the source archive and decompress the files

cd pingouin
python -m build            # optional, build a wheel and sdist
pip install .              # install the package
pip install --editable .   # or editable install
pytest                     # test the package

Quick start

Click on the link below and navigate to the notebooks/ folder to run a collection of interactive Jupyter notebooks showing the main functionalities of Pingouin. There is no need to install Pingouin beforehand: the notebooks run in a Binder environment.

10 minutes to Pingouin

1. T-test

import numpy as np
import pingouin as pg

np.random.seed(123)
mean, cov, n = [4, 5], [(1, .6), (.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T

# T-test
pg.ttest(x, y)
Output
T dof alternative p-val CI95% cohen-d BF10 power
-3.401 58 two-sided 0.001 [-1.68 -0.43] 0.878 26.155 0.917

2. Pearson's correlation

pg.corr(x, y)
Output
n r CI95% p-val BF10 power
30 0.595 [0.3 0.79] 0.001 69.723 0.950

3. Robust correlation

# Introduce an outlier
x[5] = 18
# Use the robust biweight midcorrelation
pg.corr(x, y, method="bicor")
Output
n r CI95% p-val power
30 0.576 [0.27 0.78] 0.001 0.933

4. Test the normality of the data

The pingouin.normality function works with lists, arrays, or pandas DataFrame in wide or long-format.

print(pg.normality(x))                                    # Univariate normality
print(pg.multivariate_normality(np.column_stack((x, y)))) # Multivariate normality
Output
W pval normal
0.615 0.000 False
(False, 0.00018)

5. One-way ANOVA using a pandas DataFrame

# Read an example dataset
df = pg.read_dataset('mixed_anova')

# Run the ANOVA
aov = pg.anova(data=df, dv='Scores', between='Group', detailed=True)
print(aov)
Output
Source SS DF MS F p-unc np2
Group 5.460 1 5.460 5.244 0.023 0.029
Within 185.343 178 1.041 nan nan nan

6. Repeated measures ANOVA

pg.rm_anova(data=df, dv='Scores', within='Time', subject='Subject', detailed=True)
Output
Source SS DF MS F p-unc ng2 eps
Time 7.628 2 3.814 3.913 0.023 0.04 0.999
Error 115.027 118 0.975 nan nan nan nan

7. Post-hoc tests corrected for multiple-comparisons

# FDR-corrected post hocs with Hedges'g effect size
posthoc = pg.pairwise_tests(data=df, dv='Scores', within='Time', subject='Subject',
                             parametric=True, padjust='fdr_bh', effsize='hedges')

# Pretty printing of table
pg.print_table(posthoc, floatfmt='.3f')
Output
Contrast A B Paired Parametric T dof alternative p-unc p-corr p-adjust BF10 hedges
Time August January True True -1.740 59.000 two-sided 0.087 0.131 fdr_bh 0.582 -0.328
Time August June True True -2.743 59.000 two-sided 0.008 0.024 fdr_bh 4.232 -0.483
Time January June True True -1.024 59.000 two-sided 0.310 0.310 fdr_bh 0.232 -0.170

8. Two-way mixed ANOVA

# Compute the two-way mixed ANOVA
aov = pg.mixed_anova(data=df, dv='Scores', between='Group', within='Time',
                     subject='Subject', correction=False, effsize="np2")
pg.print_table(aov)
Output
Source SS DF1 DF2 MS F p-unc np2 eps
Group 5.460 1 58 5.460 5.052 0.028 0.080 nan
Time 7.628 2 116 3.814 4.027 0.020 0.065 0.999
Interaction 5.167 2 116 2.584 2.728 0.070 0.045 nan

9. Pairwise correlations between columns of a dataframe

import pandas as pd
np.random.seed(123)
z = np.random.normal(5, 1, 30)
data = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
pg.pairwise_corr(data, columns=['X', 'Y', 'Z'], method='pearson')
Output
X Y method alternative n r CI95% p-unc BF10 power
X Y pearson two-sided 30 0.366 [0.01 0.64] 0.047 1.500 0.525
X Z pearson two-sided 30 0.251 [-0.12 0.56] 0.181 0.534 0.272
Y Z pearson two-sided 30 0.020 [-0.34 0.38] 0.916 0.228 0.051

10. Pairwise T-test between columns of a dataframe

data.ptests(paired=True, stars=False)
Pairwise T-tests, with T-values on the lower triangle and p-values on the upper triangle
        X       Y      Z
X       -   0.226  0.165
Y  -1.238       -  0.658
Z  -1.424  -0.447      -

11. Multiple linear regression

pg.linear_regression(data[['X', 'Z']], data['Y'])
Linear regression summary
names coef se T pval r2 adj_r2 CI[2.5%] CI[97.5%]
Intercept 4.650 0.841 5.530 0.000 0.139 0.076 2.925 6.376
X 0.143 0.068 2.089 0.046 0.139 0.076 0.003 0.283
Z -0.069 0.167 -0.416 0.681 0.139 0.076 -0.412 0.273

12. Mediation analysis

pg.mediation_analysis(data=data, x='X', m='Z', y='Y', seed=42, n_boot=1000)
Mediation summary
path coef se pval CI[2.5%] CI[97.5%] sig
Z ~ X 0.103 0.075 0.181 -0.051 0.256 No
Y ~ Z 0.018 0.171 0.916 -0.332 0.369 No
Total 0.136 0.065 0.047 0.002 0.269 Yes
Direct 0.143 0.068 0.046 0.003 0.283 Yes
Indirect -0.007 0.025 0.898 -0.069 0.029 No

13. Contingency analysis

data = pg.read_dataset('chi2_independence')
expected, observed, stats = pg.chi2_independence(data, x='sex', y='target')
stats
Chi-squared tests summary
test lambda chi2 dof p cramer power
pearson 1.000 22.717 1.000 0.000 0.274 0.997
cressie-read 0.667 22.931 1.000 0.000 0.275 0.998
log-likelihood 0.000 23.557 1.000 0.000 0.279 0.998
freeman-tukey -0.500 24.220 1.000 0.000 0.283 0.998
mod-log-likelihood -1.000 25.071 1.000 0.000 0.288 0.999
neyman -2.000 27.458 1.000 0.000 0.301 0.999

Integration with Pandas

Several functions of Pingouin can be used directly as pandas DataFrame methods. Try for yourself with the code below:

import pingouin as pg

# Example 1 | ANOVA
df = pg.read_dataset('mixed_anova')
df.anova(dv='Scores', between='Group', detailed=True)

# Example 2 | Pairwise correlations
data = pg.read_dataset('mediation')
data.pairwise_corr(columns=['X', 'M', 'Y'], covar=['Mbin'])

# Example 3 | Partial correlation matrix
data.pcorr()

The functions that are currently supported as pandas method are:

Development

Pingouin was created and is maintained by Raphael Vallat, a postdoctoral researcher at UC Berkeley, mostly during his spare time. Contributions are more than welcome so feel free to contact me, open an issue or submit a pull request!

To see the code or report a bug, please visit the GitHub repository.

This program is provided with NO WARRANTY OF ANY KIND. Pingouin is still under heavy development and there are likely hidden bugs. Always double-check the results with another statistical software package.

Contributors

How to cite Pingouin?

If you want to cite Pingouin, please use the publication in JOSS:

Acknowledgement

Several functions of Pingouin were inspired by R or Matlab toolboxes, including:

pingouin's People

Contributors

adamnarai, arthurpaulino, dominicchm, ejolly, ganshengt, gedeck, getzze, jajcayn, jinwx, joelfner, julibeg, kncrabtree, kraktus, legrandnico, leicas, michalkahle, puncocharm, qbarthelemy, raphaelvallat, remrama, sappelhoff, sjg2203, smathot, spaak, swallan, systole-docs, tirkarthi, viktorwase, vojtech-filipec, xfz329


pingouin's Issues

BF10 in Anova

Hi Raphael,

I'm wondering if there is any possibility to compute BF10 from your implementations of ANOVA alongside p-values?

Wish you a nice day,
Clément.

Rmcorr Issues

Hi,

Noticed some issues while trying rmcorr.

  • If the column you pass is non-numeric, it just results in an assertion error with no error message. The R library's rmcorr() returns a message saying: 'Measure 1' and 'Measure 2' must be numeric. Maybe return a message like this?

  • In ancova, bw = ss_slopes.sum() / ss.sum() (and ss_c = ss_slopes.sum() * bw) returns NaN if any of the values in ss_slopes is NaN. Basically, sum() on a numpy array == NaN if it has any NaNs in the array. As a result, rmcorr or ancova result turns out to be NaN. I think this should be changed to np.nansum(ss_slopes) instead. This will ignore NaNs but still sum the remaining values.
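The NaN propagation described in the second point can be reproduced with a few lines of NumPy. The arrays below are made-up values, and the names ss_slopes, ss and bw are only borrowed from the expression quoted above; this is a sketch of the behaviour, not the actual ancova code.

```python
import numpy as np

# Hypothetical sums of squares, one of which is NaN
ss_slopes = np.array([1.5, np.nan, 2.5])
ss = np.array([2.0, 3.0, 5.0])

# ndarray.sum() propagates the NaN to the final result
bw_sum = ss_slopes.sum() / ss.sum()

# np.nansum() treats NaN as zero and sums the remaining values
bw_nansum = np.nansum(ss_slopes) / ss.sum()

print(bw_sum, bw_nansum)
```

Whether silently skipping NaN values is statistically appropriate here is a separate question; the sketch only shows the numerical difference between the two calls.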

Thank you!

Shapiro-Wilk test should have option to return W statistic

Currently pingouin.normality() silently discards the W statistic returned by scipy.stats.shapiro(). I would love to have an option to retrieve these stats, as I need to present them together with the p values when drafting a scientific publication. I wonder if the function could not be adjusted to return a DataFrame with numerous information like most of the other pingouin functions?

Type hints

A great addition in Python 3 are type hints, which can make development less error-prone if an IDE is used that supports this feature.

Here's an example from one of my projects that's using type hints (and keyword-only arguments, for that matter):

https://github.com/hoechenberger/questplus/blob/9d5a6b9fc7bafd000dd82c1737be274ea31ee3de/questplus/qp.py#L10-L20

And another one:
https://github.com/hoechenberger/questplus/blob/9d5a6b9fc7bafd000dd82c1737be274ea31ee3de/questplus/psychometric_function.py#L7-L13

I think type hints could be a welcome addition to pingouin as well.
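For readers unfamiliar with the syntax, here is a minimal sketch of a function that combines type hints with a keyword-only argument. The function mean_difference is hypothetical and is not part of pingouin's API.

```python
import math
from typing import Sequence


def mean_difference(x: Sequence[float], y: Sequence[float], *,
                    paired: bool = False) -> float:
    """Return mean(x) - mean(y) (hypothetical helper for illustration)."""
    if paired and len(x) != len(y):
        raise ValueError("x and y must have the same length when paired=True.")
    return math.fsum(x) / len(x) - math.fsum(y) / len(y)


print(mean_difference([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], paired=True))
```

An IDE or a static checker such as mypy can use these annotations to flag, for example, a string passed where a sequence of floats is expected.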

Two-Way ANOVA: 'subtract' did not contain a loop with signature matching types

Good morning.

I'm trying to do a two-way ANOVA. My dataset is looking like this:

gender success d t v 
M      TRUE    1 1 2
M      FALSE   1 2 7
F      TRUE    1 2 1
M      TRUE    3 5 4
F      FALSE   2 1 6

Gender and Success are my between variables.

So I'm doing:
pg.anova(dv=v, between=["gender", "success"], data=output, detailed=False)

And I got this error:

Traceback (most recent call last):

  File "<ipython-input-93-35b88e82087c>", line 8, in <module>
    print(pg.anova(dv=v, between=["gender", "success"], data=output, detailed=False), "\n")

  File "C:\Users\poire\Anaconda3\lib\site-packages\pingouin\parametric.py", line 801, in anova
    export_filename=export_filename)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pingouin\parametric.py", line 939, in anova2
    ss_resid = np.sum(data.groupby([fac1, fac2]).apply(lambda x:

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 701, in apply
    return self._python_apply_general(f)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 707, in _python_apply_general
    self.axis)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\groupby\ops.py", line 190, in apply
    res = f(group)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pingouin\parametric.py", line 940, in <lambda>
    (x - x.mean())**2))[0]

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 2030, in f
    level=level)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1930, in _combine_series_frame
    return self._combine_match_columns(other, func, level=level)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\frame.py", line 5116, in _combine_match_columns
    return ops.dispatch_to_series(left, right, func, axis="columns")

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1157, in dispatch_to_series
    new_data = expressions.evaluate(column_op, str_rep, left, right)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py", line 208, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py", line 123, in _evaluate_numexpr
    result = _evaluate_standard(op, op_str, a, b)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py", line 68, in _evaluate_standard
    return op(a, b)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1144, in column_op
    for i in range(len(a.columns))}

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1144, in <dictcomp>
    for i in range(len(a.columns))}

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1583, in wrapper
    result = safe_na_op(lvalues, rvalues)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1533, in safe_na_op
    lambda x: op(x, rvalues))

  File "pandas/_libs/algos.pyx", line 690, in pandas._libs.algos.arrmap

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1533, in <lambda>
    lambda x: op(x, rvalues))

TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')

If I remove either gender or success, the code is working well.
Is there any fix?

Thanks,
Clément.
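The dtype('<U32') in the error message means NumPy was asked to subtract strings. One possible cause, and this is only an assumption about the dataset above, is that the dependent variable was read in as text rather than as numbers. The sketch below reproduces the same kind of TypeError with plain NumPy and shows that converting to float avoids it.

```python
import numpy as np

# Hypothetical dependent variable stored as strings instead of numbers
v = np.array(["2", "7", "1", "4", "6"])

raised = False
try:
    v - v  # subtraction is not defined for string arrays
except TypeError as err:
    raised = True
    print("TypeError:", err)

# After conversion to float, deviations from the mean can be computed
v_num = v.astype(float)
deviations = v_num - v_num.mean()
print(deviations)
```

In a pandas DataFrame, checking data.dtypes before running the ANOVA is a quick way to see whether a column was parsed as object/string.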

Posthoc test for non parametric test.

Hi,
I like the simplicity of the library; running a Friedman test on a pandas DataFrame with one line is especially nice.

However, due to the lack of post-hoc tests I have to use another library such as scikit-posthocs.
It would be nice to be able to use pingouin for that as well!

Best,

Antoine

qqplot() should allow for NaN removal

Currently, if the passed iterable contains one or more NaN values, qqplot() will return a basically empty plot. There should be the option to automatically remove NaN values before plotting (might even be the default).

module 'numpy' has no attribute 'format_float_positional'

I try to run the example

import numpy as np
import pingouin as pg

np.random.seed(123)
mean, cov, n = [4, 5], [(1, .6), (.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T

# T-test
pg.ttest(x, y)

The error is module 'numpy' has no attribute 'format_float_positional'

Python 3.6.7
statsmodels-0.10.1
pingouin-0.2.8
numpy-1.16.4

Potentially add option to retrieve outliers removed by correlation.shepherd()

correlation.shepherd() removes outliers before calculating Spearman's rho. Ideally this function would return a list of removed outliers, just like correlation.skipped() does.

Edit: I haven't looked into the literature, but I would assume that the degrees of freedom of Shepherd's pi are N – N_outliers – 1, right? It could be helpful if the correlation functions were also to return the dof, then :)

Effect size for (Welch) ANOVA

  1. Add more effect size options to pingouin.anova and pingouin.rm_anova
  2. Add a measure of effect size for pingouin.welch_anova (but which one -- is partial eta-squared suitable?)

welch_anova docs

Both Wikipedia and Real Statistics mention that the Welch's test is useful when we have samples of different sizes, but pingouin's documentation does not. Wouldn't it be interesting to contemplate that piece of information?

wilcoxon p-value

There's a line (245) in the wilcoxon function of nonparametric.py:

pval *= .5 if tail == 'one-sided' else pval

I think this is intended to work as

pval = pval*.5 if tail == 'one-sided' else pval

but is actually doing

pval *= (.5 if tail == 'one-sided' else pval)

Easy fix if I'm right! :)
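The precedence issue can be checked in isolation. In the sketch below, halve_buggy and halve_fixed are hypothetical helper names wrapping the two statements quoted above; the conditional expression binds to the right-hand side of `*=`, so a two-sided p-value ends up multiplied by itself.

```python
def halve_buggy(pval, tail):
    # Parsed as: pval *= (.5 if tail == 'one-sided' else pval)
    pval *= .5 if tail == 'one-sided' else pval
    return pval


def halve_fixed(pval, tail):
    # Halve only for one-sided tests, otherwise leave unchanged
    pval = pval * .5 if tail == 'one-sided' else pval
    return pval


for tail in ('one-sided', 'two-sided'):
    print(tail, halve_buggy(0.04, tail), halve_fixed(0.04, tail))
```

Both versions agree for one-sided tests; only the two-sided branch differs, where the buggy version returns the squared p-value.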

Also, far less important but while I'm here:
pingouin docs for wilcoxon says "A continuity correction is applied by default..." but this is not the case, right?? Scipy wilcoxon's default is False, and pingouin also explicitly passes that.

Circular correlation for uniform marginals

According to Jammalamadaka & Sengupta (2001, pg. 177) the circular means are not well defined if the marginal distribution of one of the angles is uniform. This leads to wrong estimates of circular correlations in these cases. There is an alternative formulation of the circular correlation coefficient that deals with this problem (see equation 8.2.4 in Jammalamadaka & Sengupta, 2001). This implementation can also lead to slightly different values for the coefficient in other cases, though. This is the implementation:

import numpy as np
from scipy import stats
from scipy.stats import norm


def circcorr(x, y, tail='two-sided'):
    """Circular correlation coefficient and p-value for angles in radians."""
    x = np.asarray(x)
    y = np.asarray(y)

    # Check size
    if x.size != y.size:
        raise ValueError('x and y must have the same length.')

    n = x.size

    # Sine of the deviations from the circular means
    x_sin = np.sin(x - stats.circmean(x))
    y_sin = np.sin(y - stats.circmean(y))

    # Numerator: resultant length of (x - y) minus resultant length of (x + y)
    num = np.abs(np.sum(np.exp((x - y) * 1j))) - np.abs(np.sum(np.exp((x + y) * 1j)))
    denom = 2 * np.sqrt(np.sum(x_sin**2) * np.sum(y_sin**2))

    r = num / denom

    # Compute T- and p-values
    tval = np.sqrt((n * (x_sin**2).mean() * (y_sin**2).mean())
                   / np.mean(x_sin**2 * y_sin**2)) * r

    # Approximately distributed as a standard normal
    pval = 2 * norm.sf(abs(tval))
    pval = pval / 2 if tail == 'one-sided' else pval

    return r, pval

I'm not sure if this should be added to pingouin, whether as an extra function or an alternative to circ_corrcc with a kwarg.

References:

  • Jammalamadaka, S. R., & Sengupta, A. (2001). Topics in circular statistics (Vol. 5). world scientific.
     

import fails, package points to external directory

importing pingouin returns the following:


ModuleNotFoundError Traceback (most recent call last)
in ()
1 import pandas as pd
2 import numpy as np
----> 3 import pingouin

~/.local/lib/python3.6/site-packages/pingouin/__init__.py in ()
1 # Import pingouin objects
----> 2 from .utils import *
3 from .bayesian import *
4 from .effsize import *
5 from .multicomp import *

~/.local/lib/python3.6/site-packages/pingouin/utils.py in ()
3 import numpy as np
4 from six import string_types
----> 5 from pingouin.external.tabulate import tabulate
6 import pandas as pd
7

ModuleNotFoundError: No module named 'pingouin.external'

Request: Support for MultiIndex columns

I typically end up having MultiIndex columns in my DataFrames during analysis, which currently does not seem to be (well) supported by pingouin. Here's an example when trying to use pairwise_corr() on MultiIndex columns:

In[31]: import pandas as pd
   ...: 
   ...: 
   ...: results_a = list(range(5))
   ...: results_b = list(range(5))
   ...: 
   ...: columns = pd.MultiIndex.from_tuples([('Result', 'a'),
   ...:                                      ('Result', 'b')])
   ...: data = pd.DataFrame(dict(a=results_a, b=results_b))
   ...: data.columns = columns
   ...: 
   ...: print(data)
   ...: 
  Result   
       a  b
0      0  0
1      1  1
2      2  2
3      3  3
4      4  4
In[32]: import pingouin
   ...: pingouin.pairwise_corr(data, columns=[('Result', 'a'),
   ...:                                       ('Result', 'b')])
   ...: 
Traceback (most recent call last):
  File "/Users/hoechenberger/miniconda3/envs/QUOTE/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-32-f548ba39714e>", line 3, in <module>
    ('Result', 'b')])
  File "/Users/hoechenberger/miniconda3/envs/QUOTE/lib/python3.7/site-packages/pingouin/pairwise.py", line 818, in pairwise_corr
    assert comb[0] in keys
AssertionError

In this specific case, this behavior is caused because np.intersect1d(keys, columns) returns array(['Result', 'a', 'b'], dtype='<U6'), which then yields the combinations [('Result', 'a'), ('Result', 'b'), ('a', 'b')]; of course, these don't make much sense.
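The flattening can be reproduced without pandas. In the sketch below, keys is written as a plain list of tuples rather than a MultiIndex, which is a simplification; the point is only that np.intersect1d works on flattened arrays, so each tuple is split into its individual labels.

```python
from itertools import combinations

import numpy as np

# Column labels of the MultiIndex example, written as plain tuples
keys = [('Result', 'a'), ('Result', 'b')]
columns = [('Result', 'a'), ('Result', 'b')]

# The tuples become a 2-D string array that is flattened before intersecting
flat = np.intersect1d(keys, columns)
print(flat)

# Pairwise combinations are then built from the individual labels
combs = list(combinations(flat.tolist(), 2))
print(combs)
```

A MultiIndex-aware approach would have to compare whole tuples, for instance with ordinary Python list membership, instead of going through a NumPy string array.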

Would be nice if pingouin could handle MultiIndex columns directly.

conda package

Thanks for creating this amazing project! Only found out about it by accident :)

At any rate, I just created a conda recipe and proposed it for inclusion in the conda-forge channel, so users would be able to install pingouin and all dependencies simply by doing conda install -c conda-forge pingouin

I wanted to ask you if you had any interest in

  • getting included as a recipe maintainer in the conda-forge recipe, and
  • adding a conda badge as well as
  • amending the installation instructions

once the PR to conda-forge is accepted and the packages become available? (Meaning, I could do a PR for all of that.)

Two way within-subject ANOVA

Hello,
One small problem: rm_anova collapses multiple observations for each subject and condition to their mean (which is the correct behavior), but if you have two within-subject factors it runs rm_anova2, which does not. As a result, running rm_anova with two within-subject variables returns wrong values. If you agree, I can submit a pull request. As far as I can see, the fix is to collapse the values to a mean and THEN run rm_anova2.
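The collapsing step proposed here could look like the following pandas sketch. The column names and values are invented for illustration, and this is not the code used inside rm_anova or rm_anova2.

```python
import pandas as pd

# Hypothetical long-format data: two observations per subject and cell
df = pd.DataFrame({
    'Subject': [1] * 8,
    'Time':    ['pre'] * 4 + ['post'] * 4,
    'Cond':    ['A', 'A', 'B', 'B'] * 2,
    'Score':   [1.0, 3.0, 4.0, 6.0, 2.0, 2.0, 5.0, 7.0],
})

# Collapse repeated observations to one mean per subject and cell
collapsed = (df.groupby(['Subject', 'Time', 'Cond'], as_index=False)['Score']
               .mean())
print(collapsed)
```

The resulting frame has exactly one row per subject and combination of the two within-subject factors, which is the layout a two-way repeated measures ANOVA expects.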

SyntaxError when importing on python < 3.6

Hey,

Thanks for the great package.

The documentation of Pingouin states that it is strongly recommended that python >= 3.6 is used, which seems to imply that the package is usable with other versions. This is not the case.
Importing Pingouin on python < 3.6 fails, because this line uses f-strings, which were only introduced in python 3.6.

AFAIK, the file linked above is the only place where f-strings are used.
I'm not sure whether you want to create a work-around for this, but it threw me for a loop when I saw the error. I'll leave it up to you to decide whether removing these f-strings is a good choice, they are the way forward after all.
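For reference, replacing an f-string with str.format is usually mechanical. The strings below are illustrative only and are not taken from the pingouin source.

```python
name, version = "pingouin", "0.2.8"

# Python >= 3.6 only: f-string (a SyntaxError on older interpreters)
msg_fstring = f"{name} version {version}"

# Equivalent form that also parses on earlier Python 3 releases
msg_format = "{} version {}".format(name, version)

print(msg_fstring == msg_format)
```

Because an f-string fails at parse time on older interpreters, the whole module fails to import, not just the line that uses it.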

On a related note, installing pingouin using python 2.7 does not work, because the matching matplotlib version is not available. So it seems to me that python 2.7 support can be dropped from the readme as well.

error: Setup script exited with
Matplotlib 3.0+ does not support Python 2.x, 3.0, 3.1, 3.2, 3.3, or 3.4.
Beginning with Matplotlib 3.0, Python 3.5 and above is required.

Use pandas_flavor to extend pandas functionality

Currently, pingouin imports pandas and monkey-patches the DataFrame to add new methods. The "proper" way to achieve this, however, is to use the pandas extension API, most probably via pandas_flavor (pyjanitor, for example, does this too)

One simply needs to import pandas_flavor and add a decorator:

# pingouin/pairwise.py

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def pairwise_ttests(dv=None, between=None, within=None, subject=None,
                    data=None, parametric=True, alpha=.05, tail='two-sided',
                    padjust='none', effsize='hedges', return_desc=False,
                    export_filename=None):
...

Usage from the user's side:

import pandas as pd
import pingouin as pg

data = pd.DataFrame(...)
data.pairwise_ttests(...)

Feature request - option for not testing interaction in pairwise ttests

Hello,

Might be a bit specific, but I was wondering whether this would be worth considering. I'm currently using it as a cheap way to post-hoc test everything and am not very interested in interaction effects, as in some cases they would make no sense. However, it seems that running these tests

1a. is memory hungry, probably because it checks all interactions
1b. could maybe reuse or make better use of temporary variables; it currently exhausts my 16 GB of RAM the instant it reaches the interaction effects for a dataset with around 300,000 rows (and 6 columns, for each group I'm checking).

Normality function (Shapiro-Wilk test) and NaNs

At the moment the normality function does not account properly for NaNs in the data.

normality([np.nan, 1, 2])
(False, nan)

normality([np.nan, 1, 2, 10])
(True, 1.0)

Though a p value of 1 is suspicious anyway, it would make sense to provide a warning message or an argument to remove the NaNs.

normality([np.nan, 1, 2, 10], remove_na=True)
(True, 0.194)
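Until such an option exists, the missing values can be dropped by hand before testing. The sketch below calls scipy.stats.shapiro directly on the cleaned array; it is a workaround under that assumption, not the behaviour of the normality function itself.

```python
import numpy as np
from scipy import stats

x = np.array([np.nan, 1, 2, 10])

# Drop missing values before running the Shapiro-Wilk test
x_clean = x[~np.isnan(x)]

w, p = stats.shapiro(x_clean)
print(f"n = {x_clean.size}, W = {w:.3f}, p = {p:.3f}")
```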

New feature: Pearson's chi-squared test.

This is a test that I use a lot to understand the relationship between categorical data.

SciPy has its own implementation, but we have to feed it with a contingency table. The contingency table is easy to build (using pandas.crosstab, for instance), but I think it would be an improvement for the user experience to simply call pg.pearson_chi2(x, y) or something like that.

SciPy's implementation returns the χ², the p-value, the degrees of freedom and a contingency table with the expected values (for which there would be no correlation between the columns). But I think it would be possible to infer more information from the test, just like you did with pg.ttest?
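The two-step SciPy workflow described above looks roughly like this. The data are invented, and the column names sex and target are only placeholders.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical data
df = pd.DataFrame({
    'sex':    ['F', 'F', 'F', 'M', 'M', 'M', 'F', 'M'],
    'target': [1, 1, 0, 0, 0, 1, 1, 0],
})

# Step 1: build the contingency table of observed counts
observed = pd.crosstab(df['sex'], df['target'])

# Step 2: run the test on the table
chi2, p, dof, expected = chi2_contingency(observed)

print(observed)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
```

A convenience wrapper would essentially hide step 1 and return the results as a DataFrame, optionally with an effect size and power estimate.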

Best regards,
Arthur Paulino

logistic regression is using regularization by default

Thanks for the awesome package! I noticed a small concern with the logistic regression implementation. Since pingouin relies on sklearn's LogisticRegression implementation, it's worth noting that it implements L2 regularization by default with a hyperparameter value C=1.

This results in coefficients being penalized by default, and I suspect (but am not 100% sure) that the standard errors and p-values are not totally correct/interpretable. I say this because estimating standard errors for regularized/penalized models is not trivial and the computation often involves incorporating the penalty term somehow, but even then may not be all that informative. See for example this stackoverflow discussion and links.

Here's an example jupyter notebook demonstrating the results different from the same analysis using glm in R (using the rpy2 package).

One possibility is setting the C=1e9 by default but letting it be overridden if desired. This effectively removes the effect of regularization as sklearn doesn't have a flag to disable it completely but this seems to be the accepted community solution.

Opened a pull request if you go with that solution!

Keyword-only arguments

The docs mention that one should ideally pass parameters as keyword arguments, not by position. I'm wondering why this isn't simply enforced? Python 3 supports keyword-only arguments. It's as simple as changing a function signature from

def f(foo=1, bar=2)

to

def f(*, foo=1, bar=2)

Now, positional arguments won't be accepted anymore.
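A short check of that behaviour, using the same toy signature:

```python
def f(*, foo=1, bar=2):
    return foo + bar


# Keyword arguments are accepted as usual
keyword_result = f(foo=3, bar=4)
print(keyword_result)

# A positional call raises a TypeError
positional_rejected = False
try:
    f(3, 4)
except TypeError as err:
    positional_rejected = True
    print("TypeError:", err)
```

Note that switching an existing public function to keyword-only arguments breaks any caller that currently passes those arguments by position.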

Empirical CDF shift functions

I sometimes plot shift functions as / via empirical cumulative distribution functions, to avoid the troubles that come with kernel-density estimates. Would be great to see this feature in Pingouin as well. I typically use statsmodels.distributions.empirical_distribution.ECDF to estimate the CDFs.
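For readers who have not used it, an empirical CDF can be computed in a few lines. The sketch below uses plain NumPy rather than the statsmodels ECDF class mentioned above, and the helper name ecdf is hypothetical.

```python
import numpy as np


def ecdf(values):
    """Return sorted values and their empirical cumulative probabilities."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y


xs, ps = ecdf([3.0, 1.0, 2.0, 4.0])
print(xs, ps)
```

Unlike a kernel-density estimate, this needs no bandwidth choice: each observation simply adds a step of height 1/n.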

Example (based on my own implementation) using a KDE:
[Figure: uni vs bi - Mean - SJ]

Same example using CDFs:
[Figure: uni vs bi - Mean - SJ - CDF]

(Yes, could use some visual tuning, I know :))

Add support for Aligned Rank Transformed data Anova

Good morning,

I had to work on non-normal and heteroscedastic variables and needed to compute interactions between those variables. Unfortunately, no method in your package allows this.
I found ART to be quite straightforward and easy to use. I think it would be very valuable to include ART in your library.
The current workaround is to run ARTool through rpy2, which is neither performant nor easy to do.

Here is the original paper: http://faculty.washington.edu/wobbrock/pubs/chi-11.06.pdf

And here is some R documentation:

Thanks, and have a nice day,
Clément.
