raphaelvallat / pingouin Goto Github PK

Statistical package in Python based on Pandas

License: GNU General Public License v3.0

Python 99.64% R 0.36%

pandas statistics anova effect-size multiple-comparisons cohens-d bayesian-statistics ttest correlations circular-statistics

pingouin's Introduction

Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. For a full list of available functions, please refer to the API documentation.

ANOVAs: N-ways, repeated measures, mixed, ancova
Pairwise post-hoc tests (parametric and non-parametric) and pairwise correlations
Robust, partial, distance and repeated measures correlations
Linear/logistic regression and mediation analysis
Bayes Factors
Multivariate tests
Reliability and consistency
Effect sizes and power analysis
Parametric/bootstrapped confidence intervals around an effect size or a correlation coefficient
Circular statistics
Chi-squared tests
Plotting: Bland-Altman plot, Q-Q plot, paired plot, robust correlation...

Pingouin is designed for users who want simple yet exhaustive statistical functions.

For example, the ttest_ind function of SciPy returns only the T-value and the p-value. By contrast, the ttest function of Pingouin returns the T-value, the p-value, the degrees of freedom, the effect size (Cohen's d), the 95% confidence intervals of the difference in means, the statistical power and the Bayes Factor (BF10) of the test.

Documentation

Link to documentation

Chat

If you have questions, please ask them in GitHub Discussions.

Installation

Dependencies

The main dependencies of Pingouin are :

In addition, some functions require :

Pingouin is a Python 3 package and is currently tested for Python 3.8-3.11.

User installation

Pingouin can be easily installed using pip

pip install pingouin

or conda

conda install -c conda-forge pingouin

New releases are frequent so always make sure that you have the latest version:

pip install --upgrade pingouin

Development

To build and install from source, clone this repository or download the source archive and decompress the files

cd pingouin
python -m build            # optional, build a wheel and sdist
pip install .              # install the package
pip install --editable .   # or editable install
pytest                     # test the package

Quick start

Click on the link below and navigate to the notebooks/ folder to run a collection of interactive Jupyter notebooks showing the main functionalities of Pingouin. No need to install Pingouin beforehand, the notebooks run in a Binder environment.

10 minutes to Pingouin

1. T-test

import numpy as np
import pingouin as pg

np.random.seed(123)
mean, cov, n = [4, 5], [(1, .6), (.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T

# T-test
pg.ttest(x, y)

Output

T	dof	alternative	p-val	CI95%	cohen-d	BF10	power
-3.401	58	two-sided	0.001	[-1.68 -0.43]	0.878	26.155	0.917

2. Pearson's correlation

pg.corr(x, y)

Output

n	r	CI95%	p-val	BF10	power
30	0.595	[0.3 0.79]	0.001	69.723	0.950

3. Robust correlation

# Introduce an outlier
x[5] = 18
# Use the robust biweight midcorrelation
pg.corr(x, y, method="bicor")

Output

n	r	CI95%	p-val	power
30	0.576	[0.27 0.78]	0.001	0.933

4. Test the normality of the data

The pingouin.normality function works with lists, arrays, or pandas DataFrames in wide or long format.

print(pg.normality(x))                                    # Univariate normality
print(pg.multivariate_normality(np.column_stack((x, y)))) # Multivariate normality

Output

W	pval	normal
0.615	0.000	False

(False, 0.00018)

5. One-way ANOVA using a pandas DataFrame

# Read an example dataset
df = pg.read_dataset('mixed_anova')

# Run the ANOVA
aov = pg.anova(data=df, dv='Scores', between='Group', detailed=True)
print(aov)

Output

Source	SS	DF	MS	F	p-unc	np2
Group	5.460	1	5.460	5.244	0.023	0.029
Within	185.343	178	1.041	nan	nan	nan

6. Repeated measures ANOVA

pg.rm_anova(data=df, dv='Scores', within='Time', subject='Subject', detailed=True)

Output

Source	SS	DF	MS	F	p-unc	ng2	eps
Time	7.628	2	3.814	3.913	0.023	0.04	0.999
Error	115.027	118	0.975	nan	nan	nan	nan

7. Post-hoc tests corrected for multiple-comparisons

# FDR-corrected post hocs with Hedges'g effect size
posthoc = pg.pairwise_tests(data=df, dv='Scores', within='Time', subject='Subject',
                             parametric=True, padjust='fdr_bh', effsize='hedges')

# Pretty printing of table
pg.print_table(posthoc, floatfmt='.3f')

Output

Contrast	A	B	Paired	Parametric	T	dof	alternative	p-unc	p-corr	p-adjust	BF10	hedges
Time	August	January	True	True	-1.740	59.000	two-sided	0.087	0.131	fdr_bh	0.582	-0.328
Time	August	June	True	True	-2.743	59.000	two-sided	0.008	0.024	fdr_bh	4.232	-0.483
Time	January	June	True	True	-1.024	59.000	two-sided	0.310	0.310	fdr_bh	0.232	-0.170

8. Two-way mixed ANOVA

# Compute the two-way mixed ANOVA
aov = pg.mixed_anova(data=df, dv='Scores', between='Group', within='Time',
                     subject='Subject', correction=False, effsize="np2")
pg.print_table(aov)

Output

Source	SS	DF1	DF2	MS	F	p-unc	np2	eps
Group	5.460	1	58	5.460	5.052	0.028	0.080	nan
Time	7.628	2	116	3.814	4.027	0.020	0.065	0.999
Interaction	5.167	2	116	2.584	2.728	0.070	0.045	nan

9. Pairwise correlations between columns of a dataframe

import pandas as pd
np.random.seed(123)
z = np.random.normal(5, 1, 30)
data = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
pg.pairwise_corr(data, columns=['X', 'Y', 'Z'], method='pearson')

Output

X	Y	method	alternative	n	r	CI95%	p-unc	BF10	power
X	Y	pearson	two-sided	30	0.366	[0.01 0.64]	0.047	1.500	0.525
X	Z	pearson	two-sided	30	0.251	[-0.12 0.56]	0.181	0.534	0.272
Y	Z	pearson	two-sided	30	0.020	[-0.34 0.38]	0.916	0.228	0.051

10. Pairwise T-test between columns of a dataframe

data.ptests(paired=True, stars=False)

Pairwise T-tests, with T-values on the lower triangle and p-values on the upper triangle

	X	Y	Z
X		0.226	0.165
Y	-1.238		0.658
Z	-1.424	-0.447

11. Multiple linear regression

pg.linear_regression(data[['X', 'Z']], data['Y'])

Linear regression summary

names	coef	se	T	pval	r2	adj_r2	CI[2.5%]	CI[97.5%]
Intercept	4.650	0.841	5.530	0.000	0.139	0.076	2.925	6.376
X	0.143	0.068	2.089	0.046	0.139	0.076	0.003	0.283
Z	-0.069	0.167	-0.416	0.681	0.139	0.076	-0.412	0.273

12. Mediation analysis

pg.mediation_analysis(data=data, x='X', m='Z', y='Y', seed=42, n_boot=1000)

Mediation summary

path	coef	se	pval	CI[2.5%]	CI[97.5%]	sig
Z ~ X	0.103	0.075	0.181	-0.051	0.256	No
Y ~ Z	0.018	0.171	0.916	-0.332	0.369	No
Total	0.136	0.065	0.047	0.002	0.269	Yes
Direct	0.143	0.068	0.046	0.003	0.283	Yes
Indirect	-0.007	0.025	0.898	-0.069	0.029	No

13. Contingency analysis

data = pg.read_dataset('chi2_independence')
expected, observed, stats = pg.chi2_independence(data, x='sex', y='target')
stats

Chi-squared tests summary

test	lambda	chi2	dof	cramer	power
pearson	1.000	22.717	1.000	0.274	0.997
cressie-read	0.667	22.931	1.000	0.275	0.998
log-likelihood	0.000	23.557	1.000	0.279	0.998
freeman-tukey	-0.500	24.220	1.000	0.283	0.998
mod-log-likelihood	-1.000	25.071	1.000	0.288	0.999
neyman	-2.000	27.458	1.000	0.301	0.999

Integration with Pandas

Several functions of Pingouin can be used directly as pandas DataFrame methods. Try for yourself with the code below:

import pingouin as pg

# Example 1 | ANOVA
df = pg.read_dataset('mixed_anova')
df.anova(dv='Scores', between='Group', detailed=True)

# Example 2 | Pairwise correlations
data = pg.read_dataset('mediation')
data.pairwise_corr(columns=['X', 'M', 'Y'], covar=['Mbin'])

# Example 3 | Partial correlation matrix
data.pcorr()

The functions that are currently supported as pandas method are:

Development

Pingouin was created and is maintained by Raphael Vallat, a postdoctoral researcher at UC Berkeley, mostly during his spare time. Contributions are more than welcome so feel free to contact me, open an issue or submit a pull request!

To see the code or report a bug, please visit the GitHub repository.

This program is provided with NO WARRANTY OF ANY KIND. Pingouin is still under heavy development and there are likely hidden bugs. Always double check the results with another statistical software.

Contributors

How to cite Pingouin?

If you want to cite Pingouin, please use the publication in JOSS:

Vallat, R. (2018). Pingouin: statistics in Python. Journal of Open Source Software, 3(31), 1026, https://doi.org/10.21105/joss.01026

Acknowledgement

Several functions of Pingouin were inspired by R or Matlab toolboxes, including:


pingouin's Issues

BF10 in Anova

Hi Raphael,

I'm wondering if there is any possibility to compute BF10 from your implementations of ANOVA alongside p-values?

Wish you a nice day,
Clément.

Rmcorr Issues

Hi,

Noticed some issues while trying rmcorr.

If the column you pass is non-numeric, it just results in an assertion error with no error message. The R function rmcorr() returns a message saying: 'Measure 1' and 'Measure 2' must be numeric. Maybe return a similar message?
In ancova, bw = ss_slopes.sum() / ss.sum() (and ss_c = ss_slopes.sum() * bw) returns NaN if any of the values in ss_slopes is NaN, because sum() on a NumPy array returns NaN as soon as the array contains a NaN. As a result, the rmcorr or ancova result turns out to be NaN. I think this should be changed to np.nansum(ss_slopes) instead, which ignores NaNs but still sums the remaining values.

Thank you!
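The NaN propagation described in this report can be reproduced in a few lines. This is only a sketch: the values in ss_slopes are made up for illustration and are not output from pingouin's ancova.

```python
import numpy as np

# Hypothetical sums of squares; one entry is NaN
ss_slopes = np.array([1.5, np.nan, 2.5])

total_plain = ss_slopes.sum()        # the NaN propagates through the sum
total_nansum = np.nansum(ss_slopes)  # the NaN is skipped (treated as zero)

print(total_plain)
print(total_nansum)
```

Note that np.nansum silently ignores missing values, so whether that is the desired behaviour depends on why the NaNs are there in the first place.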

Shapiro-Wilk test should have option to return W statistic

Currently pingouin.normality() silently discards the W statistic returned by scipy.stats.shapiro(). I would love to have an option to retrieve this statistic, as I need to present it together with the p-values when drafting a scientific publication. I wonder whether the function could be adjusted to return a DataFrame with more information, as most other pingouin functions do?

Wish list: power analysis

Hi Raphael,

I was thinking a great addition to the toolbox would be some python based tools for power analysis such as something similar to R's 'pwr'. https://cran.r-project.org/web/packages/pwr/pwr.pdf. Thanks again for your great work on this tool!

Best,
Alex

Add support for listwise deletion in pairwise_corr

Type hints

A great addition in Python 3 are type hints, which can make development less error-prone if an IDE is used that supports this feature.

Here's an example from one of my projects that's using type hints (and keyword-only arguments, for that matter):

https://github.com/hoechenberger/questplus/blob/9d5a6b9fc7bafd000dd82c1737be274ea31ee3de/questplus/qp.py#L10-L20

And another one:
https://github.com/hoechenberger/questplus/blob/9d5a6b9fc7bafd000dd82c1737be274ea31ee3de/questplus/psychometric_function.py#L7-L13

I think type hints could be a welcome addition to pingouin as well.
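For readers unfamiliar with the feature, a minimal sketch of what a type-hinted function looks like follows. The function mean_difference is a hypothetical helper written for this illustration; it is not part of pingouin.

```python
from typing import Sequence


def mean_difference(x: Sequence[float], y: Sequence[float],
                    paired: bool = False) -> float:
    """Return mean(x) - mean(y). Hypothetical helper, not part of pingouin."""
    if paired and len(x) != len(y):
        raise ValueError("x and y must have the same length when paired=True.")
    return sum(x) / len(x) - sum(y) / len(y)


diff = mean_difference([1.0, 2.0, 3.0], [0.0, 1.0, 2.0], paired=True)
print(diff)
```

The annotations are not enforced at runtime; they are metadata that an IDE or a static type checker can use to flag mismatched arguments.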

Two-Way ANOVA: 'Substract' did not contain a loop with signature matching types

Good morning.

I'm trying to do a two-way ANOVA. My dataset is looking like this:

gender success d t v 
M      TRUE    1 1 2
M      FALSE   1 2 7
F      TRUE    1 2 1
M      TRUE    3 5 4
F      FALSE   2 1 6

Gender and Success are my between variables.

So I'm doing:
pg.anova(dv=v, between=["gender", "success"], data=output, detailed=False)

And I got this error:

Traceback (most recent call last):

  File "<ipython-input-93-35b88e82087c>", line 8, in <module>
    print(pg.anova(dv=v, between=["gender", "success"], data=output, detailed=False), "\n")

  File "C:\Users\poire\Anaconda3\lib\site-packages\pingouin\parametric.py", line 801, in anova
    export_filename=export_filename)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pingouin\parametric.py", line 939, in anova2
    ss_resid = np.sum(data.groupby([fac1, fac2]).apply(lambda x:

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 701, in apply
    return self._python_apply_general(f)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 707, in _python_apply_general
    self.axis)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\groupby\ops.py", line 190, in apply
    res = f(group)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pingouin\parametric.py", line 940, in <lambda>
    (x - x.mean())**2))[0]

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 2030, in f
    level=level)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1930, in _combine_series_frame
    return self._combine_match_columns(other, func, level=level)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\frame.py", line 5116, in _combine_match_columns
    return ops.dispatch_to_series(left, right, func, axis="columns")

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1157, in dispatch_to_series
    new_data = expressions.evaluate(column_op, str_rep, left, right)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py", line 208, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py", line 123, in _evaluate_numexpr
    result = _evaluate_standard(op, op_str, a, b)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py", line 68, in _evaluate_standard
    return op(a, b)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1144, in column_op
    for i in range(len(a.columns))}

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1144, in <dictcomp>
    for i in range(len(a.columns))}

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1583, in wrapper
    result = safe_na_op(lvalues, rvalues)

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1533, in safe_na_op
    lambda x: op(x, rvalues))

  File "pandas/_libs/algos.pyx", line 690, in pandas._libs.algos.arrmap

  File "C:\Users\poire\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1533, in <lambda>
    lambda x: op(x, rvalues))

TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')

If I remove either gender or success, the code is working well.
Is there any fix?

Thanks,
Clément.
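The last line of the traceback shows NumPy trying to subtract arrays of dtype <U32, that is, strings. One possible cause, stated here as an assumption because the full dataset is not shown, is that the dependent-variable column was read as text. A minimal sketch of the symptom and of a conversion to a numeric dtype, using made-up values:

```python
import numpy as np

# Hypothetical dependent variable stored as strings instead of numbers
v_text = np.array(["2", "7", "1", "4", "6"])

try:
    v_text - v_text  # NumPy has no subtraction loop for string dtypes
    raised = False
except TypeError:
    raised = True

# After conversion to float, centering on the mean works as expected
v_num = v_text.astype(float)
centered = v_num - v_num.mean()
print(raised, centered)
```

Checking the column dtypes of the DataFrame before calling the ANOVA is a quick way to rule this out.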

Posthoc test for non parametric test.

Hi,
I like the simplicity of the library; running a Friedman test on a pandas DataFrame in one line is especially nice.

However, due to the lack of post-hoc tests, I have to use another library such as scikit-posthocs.
It would be nice to be able to use pingouin for that as well!

Best,

Antoine

Convert invalid dichotomous crosstabs

Make it 2x2. If b == 0 and c == 0, raise ValueError.

qqplot() should allow for NaN removal

Currently, if the passed iterable contains one or more NaN values, qqplot() will return a basically empty plot. There should be the option to automatically remove NaN values before plotting (might even be the default).

Epsilon and Mauchly interaction in rm_anova2 differ from JASP

module 'numpy' has no attribute 'format_float_positional'

I try to run the example

import numpy as np
import pingouin as pg

np.random.seed(123)
mean, cov, n = [4, 5], [(1, .6), (.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T

# T-test
pg.ttest(x, y)

The error is module 'numpy' has no attribute 'format_float_positional'

Python 3.6.7
statsmodels-0.10.1
pingouin-0.2.8
numpy-1.16.4

Contingency: Odds and risk ratios

Should we incorporate them on the chi2_exact? Or create a new function? Or just leave those metrics out?

Source to study: http://www.real-statistics.com/chi-square-and-f-distributions/effect-size-chi-square/
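For reference, both metrics follow directly from the four cells of a 2x2 table. The sketch below uses a hypothetical helper and made-up counts; it is not an existing Pingouin function, and it does not handle zero cells.

```python
def odds_and_risk_ratio(a, b, c, d):
    """Odds ratio and risk ratio for the 2x2 table [[a, b], [c, d]].

    Rows are exposed / unexposed, columns are event / no event.
    """
    odds_ratio = (a * d) / (b * c)
    risk_ratio = (a / (a + b)) / (c / (c + d))
    return odds_ratio, risk_ratio


# 20 of 100 exposed and 10 of 100 unexposed subjects had the event
odds, risk = odds_and_risk_ratio(20, 80, 10, 90)
print(odds, risk)
```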

Add power and effect size for chi-square test

New function: pingouin.power_chi2
Add Cramer's V and statistical power to pingouin.chi2

Contingency: Cochran's Q test

https://en.wikipedia.org/wiki/Cochran%27s_Q_test

Potentially add option to retrieve outliers removed by correlation.shepherd()

correlation.shepherd() removes outliers before calculating Spearman's rho. Ideally this function would return a list of removed outliers, just like correlation.skipped() does.

Edit: I haven't looked into the literature, but I would assume that the degrees of freedom of Shepherd's pi are N – N_outliers – 1, right? It could be helpful if the correlation functions also returned the dof, then :)

Effect size for (Welch) ANOVA

Add more effect size options to pingouin.anova and pingouin.rm_anova
Add a measure of effect size for pingouin.welch_anova (but which one -- is partial eta-squared suitable?)

Rewrite homoscedasticity function to support pandas wide and long format

See new version of normality()

Contingency: tests against other software

Validate contingency tests against implementations from R, SPSS and JASP

welch_anova docs

Both Wikipedia and Real Statistics mention that Welch's test is useful when the samples have different sizes, but pingouin's documentation does not. Wouldn't it be worth adding that piece of information?

wilcoxon p-value

There's a line (245) in the wilcoxon function of nonparametric.py:

pval *= .5 if tail == 'one-sided' else pval

I think this is intended to work as

pval = pval*.5 if tail == 'one-sided' else pval

but is actually doing

pval *= (.5 if tail == 'one-sided' else pval)

Easy fix if I'm right! :)

Also, far less important but while I'm here:
pingouin docs for wilcoxon says "A continuity correction is applied by default..." but this is not the case, right?? Scipy wilcoxon's default is False, and pingouin also explicitly passes that.
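The two readings of that line can be compared in plain Python. This sketch only demonstrates the operator precedence; the function names are hypothetical and the p-value is made up.

```python
def buggy(pval, tail):
    # Parsed as: pval *= (.5 if tail == 'one-sided' else pval)
    pval *= .5 if tail == 'one-sided' else pval
    return pval


def fixed(pval, tail):
    # Halve the p-value only for a one-sided test
    return pval * .5 if tail == 'one-sided' else pval


p = 0.04
print(buggy(p, 'one-sided'), fixed(p, 'one-sided'))
print(buggy(p, 'two-sided'), fixed(p, 'two-sided'))
```

Both versions agree for a one-sided test; for a two-sided test the first version squares the p-value instead of leaving it unchanged.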

Add support for pairwise deletion in pairwise_ttests

See Gitter chat

Add 95% CI for ttest function

See gitter chat.

Contingency: Cochran–Mantel–Haenszel test

pg.chi2_cmh also requires a stratify parameter.

Sources to study:

Circular correlation for uniform marginals

According to Jammalamadaka & Sengupta (2001, pg. 177) the circular means are not well defined if the marginal distribution of one of the angles is uniform. This leads to wrong estimates of circular correlations in these cases. There is an alternative formulation of the circular correlation coefficient that deals with this problem (see equation 8.2.4 in Jammalamadaka & Sengupta, 2001). This implementation can also lead to slightly different values for the coefficient in other cases, though. This is the implementation:

import numpy as np
from scipy import stats
from scipy.stats import norm


def circcorr(x, y, tail='two-sided'):
    x = np.asarray(x)
    y = np.asarray(y)

    # Check size
    if x.size != y.size:
        raise ValueError('x and y must have the same length.')

    n = x.size

    # Sine of the deviations from the circular means
    x_sin = np.sin(x - stats.circmean(x))
    y_sin = np.sin(y - stats.circmean(y))

    # Numerator and denominator of the correlation coefficient
    num = np.abs(np.sum(np.exp((x - y)*1j))) - np.abs(np.sum(np.exp((x + y)*1j)))
    denom = 2 * np.sqrt(np.sum(x_sin**2) * np.sum(y_sin**2))

    r = num / denom

    # Compute T- and p-values
    tval = np.sqrt((n * (x_sin**2).mean() * (y_sin**2).mean())
                   / np.mean(x_sin**2 * y_sin**2)) * r

    # Approximately distributed as a standard normal
    pval = 2 * norm.sf(abs(tval))
    pval = pval / 2 if tail == 'one-sided' else pval

    return r, pval

I'm not sure if this should be added to pingouin, whether as an extra function or an alternative to circ_corrcc with a kwarg.

References:

Jammalamadaka, S. R., & Sengupta, A. (2001). Topics in circular statistics (Vol. 5). world scientific.

import fails, package points to external directory

importing pingouin returns the following:

ModuleNotFoundError Traceback (most recent call last)
in ()
1 import pandas as pd
2 import numpy as np
----> 3 import pingouin

~/.local/lib/python3.6/site-packages/pingouin/__init__.py in ()
1 # Import pingouin objects
----> 2 from .utils import *
3 from .bayesian import *
4 from .effsize import *
5 from .multicomp import *

~/.local/lib/python3.6/site-packages/pingouin/utils.py in ()
3 import numpy as np
4 from six import string_types
----> 5 from pingouin.external.tabulate import tabulate
6 import pandas as pd
7

ModuleNotFoundError: No module named 'pingouin.external'

Request: Support for MultiIndex columns

I typically end up having MultiIndex columns in my DataFrames during analysis, which currently does not seem to be (well) supported by pingouin. Here's an example when trying to use pairwise_corr() on MultiIndex columns:

In[31]: import pandas as pd
   ...: 
   ...: 
   ...: results_a = list(range(5))
   ...: results_b = list(range(5))
   ...: 
   ...: columns = pd.MultiIndex.from_tuples([('Result', 'a'),
   ...:                                      ('Result', 'b')])
   ...: data = pd.DataFrame(dict(a=results_a, b=results_b))
   ...: data.columns = columns
   ...: 
   ...: print(data)
   ...: 
  Result   
       a  b
0      0  0
1      1  1
2      2  2
3      3  3
4      4  4

In[32]: import pingouin
   ...: pingouin.pairwise_corr(data, columns=[('Result', 'a'),
   ...:                                       ('Result', 'b')])
   ...: 
Traceback (most recent call last):
  File "/Users/hoechenberger/miniconda3/envs/QUOTE/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-32-f548ba39714e>", line 3, in <module>
    ('Result', 'b')])
  File "/Users/hoechenberger/miniconda3/envs/QUOTE/lib/python3.7/site-packages/pingouin/pairwise.py", line 818, in pairwise_corr
    assert comb[0] in keys
AssertionError

In this specific case, this behavior is caused because np.intersect1d(keys, columns) returns array(['Result', 'a', 'b'], dtype='<U6'), which then yields the combinations [('Result', 'a'), ('Result', 'b'), ('a', 'b')]; of course, these don't make much sense.

Would be nice if pingouin could handle MultiIndex columns directly.
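The flattening behaviour described above can be reproduced without pingouin. The sketch below uses plain lists of tuples in place of the DataFrame's MultiIndex, which is a simplification made for this illustration; the tuple-based membership test at the end is one possible alternative, not pingouin's actual fix.

```python
from itertools import combinations

import numpy as np

columns = [('Result', 'a'), ('Result', 'b')]
keys = [('Result', 'a'), ('Result', 'b')]  # stand-in for data.keys()

# intersect1d flattens its inputs, so each tuple is split into strings
common = np.intersect1d(keys, columns)
pairs = list(combinations(common.tolist(), 2))

# Comparing the tuples themselves keeps each column label intact
valid = [c for c in columns if c in keys]

print(common)
print(pairs)
print(valid)
```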

conda package

Thanks for creating this amazing project! Only found out about it by accident :)

At any rate, I just created a conda recipe and proposed it for inclusion in the conda-forge channel, so users would be able to install pingouin and all dependencies simply by doing conda install -c conda-forge pingouin

I wanted to ask you if you had any interest in

getting included as a recipe maintainer in the conda-forge recipe, and
adding a conda badge as well as
amending the installation instructions

once the PR to conda-forge is accepted and the packages become available? (Meaning, I could do a PR for all of that.)

Add support for three-way ANOVA

Using statsmodels:

Statsmodels fails with scipy 1.3.0

Need to wait for the next statsmodels release to fix that:
See statsmodels/statsmodels#5747 and statsmodels/statsmodels#5759

Functions that fail are:

ancovan
anova2 with unbalanced design

Two way within-subject ANOVA

Hello,
one small problem: rm_anova collapses multiple observations for each subject and condition to their mean (which is the correct behavior), but if you have two within-subject factors this function runs rm_anova2, which does not. As a result, running rm_anova with two within-subject factors returns wrong values. If you agree, I can submit a pull request; as far as I can see, the fix is to collapse the values to a mean and THEN run rm_anova2.
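The collapsing step proposed here could look roughly like the following. This is a sketch with made-up data and hypothetical column names (Subject, Time, Metric, Score); it shows only the aggregation, not the ANOVA itself.

```python
import pandas as pd

# Hypothetical long-format data: the (pre, A) cell has two observations
df = pd.DataFrame({
    'Subject': [1, 1, 1, 1, 1],
    'Time':    ['pre', 'pre', 'pre', 'post', 'post'],
    'Metric':  ['A', 'A', 'B', 'A', 'B'],
    'Score':   [2.0, 4.0, 3.0, 5.0, 6.0],
})

# One row per subject x Time x Metric cell, holding the cell mean
collapsed = df.groupby(['Subject', 'Time', 'Metric'],
                       as_index=False)['Score'].mean()
print(collapsed)
```

The collapsed table has exactly one value per subject and cell, which is the shape a two-way repeated measures ANOVA expects.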

SyntaxError when importing on python < 3.6

Hey,

Thanks for the great package.

The documentation of Pingouin states that it is strongly recommended that python >= 3.6 is used, which seems to imply that the package is usable with other versions. This is not the case.
Importing Pingouin on python < 3.6 fails, because this line uses f-strings, which were only introduced in python 3.6.

AFAIK, the file linked above is the only place where f-strings are used.
I'm not sure whether you want to create a work-around for this, but it threw me for a loop when I saw the error. I'll leave it up to you to decide whether removing these f-strings is a good choice, they are the way forward after all.

On a related note, installing pingouin using python 2.7 does not work, because the matching matplotlib version is not available. So it seems to me that python 2.7 support can be dropped from the readme as well.

error: Setup script exited with
Matplotlib 3.0+ does not support Python 2.x, 3.0, 3.1, 3.2, 3.3, or 3.4.
Beginning with Matplotlib 3.0, Python 3.5 and above is required.

Use pandas_flavor to extend pandas functionality

Currently, pingouin imports pandas and monkey-patches the DataFrame to add new methods. The "proper" way to achieve this, however, is to use the pandas extension API, most probably via pandas_flavor (pyjanitor, for example, does this too)

One simply needs to import pandas_flavor and add a decorator:

# pingouin/pairwise.py

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def pairwise_ttests(dv=None, between=None, within=None, subject=None,
                    data=None, parametric=True, alpha=.05, tail='two-sided',
                    padjust='none', effsize='hedges', return_desc=False,
                    export_filename=None):
...

Usage from the user's side:

import pandas as pd
import pingouin as pg

data = pd.DataFrame(...)
data.pairwise_ttests(...)

Feature request - option for not testing interaction in pairwise ttests

Hello,

Might be a bit specific, but I was wondering if that was an idea. I'm currently using it as a cheap way to posthoc test everything and am not super interested in interaction effects as for some cases it would make no sense. However, it seems that doing these tests is

1a. memory hungry, probably because it checks all interactions
1b. could maybe reuse / make better use of temporary variables; it currently exhausts my 16 GB of RAM the instant it reaches the interaction effects, for a dataset with around 300,000 rows (and 6 columns, for each group I'm checking).

Normality function (Shapiro-Wilk test) and NaNs

At the moment the normality function does not account properly for NaNs in the data.

normality([np.nan, 1, 2])
(False, nan)

normality([np.nan, 1, 2, 10])
(True, 1.0)

Though a p value of 1 is suspicious anyway, it would make sense to provide a warning message or an argument to remove the NaNs.

normality([np.nan, 1, 2, 10], remove_na=True)
(True, 0.194)
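A pre-processing step along the lines suggested here, warning about NaNs and optionally dropping them before the test is run, might look like this. The helper drop_nan is hypothetical and only covers flat lists of numbers.

```python
import math
import warnings


def drop_nan(values, remove_na=True):
    """Warn if NaNs are present and optionally remove them (hypothetical helper)."""
    n_nan = sum(1 for v in values if math.isnan(v))
    if n_nan:
        warnings.warn(f"{n_nan} NaN value(s) found in the data.")
    if remove_na:
        return [v for v in values if not math.isnan(v)]
    return list(values)


clean = drop_nan([float('nan'), 1, 2, 10])
print(clean)
```

The cleaned list can then be passed to the normality test in place of the raw data.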

Rmcorr plot

See https://rdrr.io/cran/rmcorr/man/plot.rmc.html

New feature: Pearson's chi-squared test.

This is a test that I use a lot to understand the relationship between categorical data.

SciPy has its own implementation, but we have to feed it with a contingency table. The contingency table is easy to build (using pandas.crosstab, for instance), but I think it would be an improvement for the user experience to simply call pg.pearson_chi2(x, y) or something like that.

SciPy's implementation returns the χ², the p-value, the degrees of freedom and a contingency table with the expected values (for which there would be no correlation between the columns). But I think it would be possible to infer more information from the test, just like you did with pg.ttest?

Best regards,
Arthur Paulino
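As an illustration of the extra information that can be derived from the same contingency table, the sketch below computes the Pearson chi-squared statistic, its degrees of freedom and Cramér's V in plain Python. The helper pearson_chi2 and the counts are made up for this example; no continuity correction is applied, and this is not the function that the request proposes.

```python
import math


def pearson_chi2(observed):
    """Chi-squared statistic, dof and Cramér's V for a table given as rows."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (obs - expected) ** 2 / expected
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    cramer_v = math.sqrt(chi2 / (n * min(len(row_tot) - 1, len(col_tot) - 1)))
    return chi2, dof, cramer_v


chi2, dof, cramer_v = pearson_chi2([[10, 20], [20, 10]])
print(chi2, dof, cramer_v)
```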

Contingency: exact tests

Tests currently planned to be supported:

Fisher's exact test
Barnard's test

logistic regression is using regularization by default

Thanks for the awesome package! I noticed a small concern with the logistic regression implementation. Since pingouin relies on sklearn's LogisticRegression implementation, it's worth noting that it implements L2 regularization by default with a hyperparameter value C=1.

This results in coefficients being penalized by default, and I suspect (but am not 100% sure) that the standard errors and p-values are not totally correct/interpretable. I say this because estimating standard errors for regularized/penalized models is not trivial, and the computation often involves incorporating the penalty term somehow, but even then may not be all that informative. See for example this stackoverflow discussion and links.

Here's an example jupyter notebook demonstrating the results different from the same analysis using glm in R (using the rpy2 package).

One possibility is setting the C=1e9 by default but letting it be overridden if desired. This effectively removes the effect of regularization as sklearn doesn't have a flag to disable it completely but this seems to be the accepted community solution.

Opened a pull request if you go with that solution!

Keyword-only arguments

The docs mention that one should ideally pass parameters as keyword arguments, not by position. I'm wondering why this isn't simply enforced? Python 3 supports keyword-only arguments. It's as simple as changing a function signature from

def f(foo=1, bar=2)

to

def f(*, foo=1, bar=2)

Now, positional arguments won't be accepted anymore.
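A short runnable sketch of the effect of that signature change, using the same placeholder function f:

```python
def f(*, foo=1, bar=2):
    """Every parameter after the bare * must be passed by keyword."""
    return foo + bar


result = f(foo=10, bar=5)  # keyword call is accepted

try:
    f(10, 5)               # positional call is rejected
    rejected = False
except TypeError:
    rejected = True

print(result, rejected)
```

The trade-off is that such a change breaks existing code that passes these arguments by position.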

Empirical CDF shift functions

I sometimes plot shift functions as / via empirical cumulative distribution functions, to avoid the troubles that come with kernel-density estimates. Would be great to see this feature in Pingouin as well. I typically use statsmodels.distributions.empirical_distribution.ECDF to estimate the CDFs.

Example (based on my own implementation) using a KDE:

Same example using CDFs:

(Yes, could use some visual tuning, I know :))
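For readers who have not worked with empirical CDFs, a minimal sketch of the estimation step is given below. It uses only the standard library instead of the statsmodels ECDF class mentioned above, the helper ecdf and the samples are made up, and it shows only how two ECDFs can be evaluated and compared at given points, not a full shift function.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the fraction of the sample <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


group_a = ecdf([1, 2, 3, 4])
group_b = ecdf([2, 3, 4, 5])

# Difference between the two ECDFs at a few evaluation points
gaps = [group_a(x) - group_b(x) for x in (1, 3, 5)]
print(gaps)
```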

Add support for Aligned Rank Transformed data Anova

Good morning,

I had to work on non-normal and heteroscedastic variables, but also had to compute interactions between those variables. Unfortunately, no method in your package allows this.
I found ART to be quite straightforward and easy to use. I think it could be very valuable to include ART in your library.
The current workaround is to run ARTool through rpy2, which is neither fast nor easy to do.

Here is the original paper: http://faculty.washington.edu/wobbrock/pubs/chi-11.06.pdf

And here is some R documentation:

Thanks, and have a nice day,
Clément.

T-test: The confidence intervals of the difference in means should be Inf on one of the two sides, as in the R t.test function.

Or should we automatically determine the 'greater' and 'less' using the sign of the T-value?
