
Strategies for analyzing the distribution of datasets and shifting the data towards a normal distribution, testing several manual transformations and the Box-Cox transformation.


Normality Tests, p-values, and data normalization

Introduction

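In statistical analysis there are usually three ways to run a normality contrast (normality test), where we want to find how close a given distribution is to a normal one: graphical representations, analytic methods, and hypothesis tests. When the data is not normal, a fourth step, normalization, transforms it towards a normal distribution.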

  • Graphical representation of the data distribution.
    • Histogram
    • Boxplot
  • Analytic methods.
    • Kurtosis
    • Skewness
  • Hypothesis tests.
    • Shapiro-Wilk test
    • D'Agostino's K-squared test
  • Normalization
    • Box-Cox power transformation.

Experiment 1: Normality Contrast

The aim is to analyze how the distribution of a dataset differs from a Normal Distribution.

Data source: Men and women's height and weight recordings.

Histogram and Normal curve

The data is represented as a histogram, together with a normal distribution that has the same mean and standard deviation as the data.

Fig. 1. The data already follows a normal distribution closely.
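The comparison behind Fig. 1 can be sketched numerically as follows. This is only an illustration: the `data` array is a synthetic stand-in (an assumption, not the repository's height and weight recordings), and the bin count of 20 is arbitrary.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the weight recordings (assumption, not the real data).
rng = np.random.default_rng(0)
data = rng.normal(loc=70.0, scale=10.0, size=200)

# Parameters of the reference normal curve: sample mean and standard deviation.
mu, sigma = data.mean(), data.std(ddof=1)

# Density-normalised histogram, so bar heights are comparable to a pdf.
density, edges = np.histogram(data, bins=20, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Normal pdf evaluated at the bin centers.
pdf = stats.norm.pdf(centers, loc=mu, scale=sigma)

for c, d, p in zip(centers, density, pdf):
    print(f"{c:7.2f}  hist={d:.4f}  normal={p:.4f}")
```

The `centers`, `density`, and `pdf` arrays can be handed to any plotting library to draw the bars and the overlaid curve.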

Quantile plot (Q-Q plot)

A quantile-quantile plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.

Fig. 2. Normal Q-Q plot for weights data.
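The points of a normal Q-Q plot can be obtained without plotting anything, using `scipy.stats.probplot`. The sketch below again uses a synthetic `data` array as a stand-in (an assumption); the closer the correlation coefficient `r` is to 1, the closer the points lie to a straight line.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the weights data (assumption, not the real data).
rng = np.random.default_rng(1)
data = rng.normal(loc=70.0, scale=10.0, size=200)

# probplot returns the theoretical quantiles, the ordered sample values,
# and a least-squares line fitted through those points.
(theoretical, ordered), (slope, intercept, r) = stats.probplot(data, dist="norm")

print(f"slope={slope:.2f}  intercept={intercept:.2f}  r={r:.4f}")
```

For normally distributed data, the slope is roughly the standard deviation and the intercept roughly the mean of the sample.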

Asymmetry and Kurtosis

The normality of a distribution can be assessed by looking at the skewness and kurtosis of the data distribution.

Skewness is a measure of the asymmetry of a distribution, and kurtosis is a measure of its peakedness, or more precisely of how heavy its tails are.

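In the present data, we observe a kurtosis of 0.293 and a skewness of -0.330, while the normal distribution following the data mean and standard deviation shows, as computed in this experiment, a kurtosis of -1.38 and a skewness of 0.35. For reference, a theoretical normal distribution has a skewness of 0 and an excess kurtosis of 0.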
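Both measures are available in `scipy.stats`. The following sketch compares a symmetric sample with a right-skewed one; both samples are synthetic (an assumption), so the printed values are not the ones reported above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
symmetric = rng.normal(size=5000)          # reference: normal sample
right_skewed = rng.exponential(size=5000)  # long tail to the right

for name, sample in [("normal", symmetric), ("exponential", right_skewed)]:
    skew = stats.skew(sample)
    # fisher=True (the default) reports excess kurtosis,
    # which is 0 for a normal distribution.
    kurt = stats.kurtosis(sample, fisher=True)
    print(f"{name:12s} skewness={skew:6.3f}  excess kurtosis={kurt:6.3f}")
```

Values near zero for both measures are consistent with normality; a large positive skewness indicates a long right tail.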

Hypothesis Contrast

The null hypothesis is that the data is normally distributed; the alternative hypothesis is that the data does not follow a normal distribution.

The p-value of a hypothesis test indicates the probability of obtaining a sample at least as far from normality as the observed one if the data comes from a normally distributed source.

It is important to consider that the bigger the data sample, the more reliable the p-values will be. However, the bigger the sample, the less sensitive parametric methods become to the lack of normality. For this reason, it is important to consider not only the p-values, but also the graphical representations and the size of the data sample.
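One way to see why the sample size matters is a small simulation: the same mild departure from normality may go undetected in a small sample and be flagged decisively in a large one. The sketch below assumes a Student's t distribution with 5 degrees of freedom as the "almost normal" source; the distribution, sample sizes, and helper name `normaltest_pvalue` are illustrative choices, not part of the original experiments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def normaltest_pvalue(n):
    """Draw n values from a heavy-tailed t(5) source and test them for normality."""
    sample = rng.standard_t(df=5, size=n)
    return stats.normaltest(sample).pvalue

p_small = normaltest_pvalue(30)
p_large = normaltest_pvalue(5000)

print(f"n=30:   p-value = {p_small:.4f}")
print(f"n=5000: p-value = {p_large:.2e}")
```

With thousands of observations the test has enough power to detect the heavier tails, even though a histogram of the same data may look close to normal.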

The following two common hypothesis tests are used to evaluate the data.

Shapiro-Wilk Test

The Shapiro-Wilk test does not reject the null hypothesis, since the p-value (0.254) is greater than 0.005.

> from scipy import stats
> print(stats.shapiro(data)) # returns (test statistic, p-value)
(0.9898967146873474, 0.2541850805282593)

D'Agostino's K-squared test

D'Agostino's K-squared test does not reject the null hypothesis either, since the p-value (0.119) is greater than 0.005.

> from scipy import stats
> print(stats.normaltest(data)) # returns (test statistic, p-value)
(4.257630726093381, 0.11897815632110796)
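The two tests can be wrapped in a small helper that applies the same decision rule to both. This is a minimal sketch: the function name `normality_report` and its `alpha` parameter are hypothetical, with the default threshold set to the 0.005 used in this document, and the input is a synthetic sample.

```python
import numpy as np
from scipy import stats

def normality_report(sample, alpha=0.005):
    """Run Shapiro-Wilk and D'Agostino's K-squared and apply p < alpha as the rejection rule."""
    results = {}
    for name, test in [("shapiro", stats.shapiro), ("dagostino", stats.normaltest)]:
        statistic, p_value = test(sample)  # both return (test statistic, p-value)
        results[name] = {
            "statistic": float(statistic),
            "p_value": float(p_value),
            "reject_normality": bool(p_value < alpha),
        }
    return results

# Synthetic, clearly non-normal sample (assumption, for illustration only).
rng = np.random.default_rng(4)
report = normality_report(rng.exponential(size=500))

for name, result in report.items():
    print(name, result)
```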

Experiment 2: Data Normalization

For the second experiment, a dataset that does not follow a normal distribution is manually transformed using a set of functions, and the Hypothesis Contrast is then run on each transformed sample to compare the degree of normalization achieved. Finally, the Box-Cox power transformation [2] is used to adjust the data distribution to a normal curve.

Data source: Solar Energy process data.

As Fig. 3 shows, the data does not follow a Normal Distribution: it is highly skewed to the right, and both Hypothesis Contrast tests return a p-value below 0.005, rejecting the null hypothesis.

Fig. 3. Evaluating normality of data distribution.

Manual data transformation (y = sqrt(x))

Fig. 4. Evaluating normality of transformed data distribution using *y = sqrt(x)*.

Manual data transformation (y = 1/x)

Fig. 5. Evaluating normality of transformed data distribution using *y = 1/x*.

Manual data transformation (y = Ln(x))

Fig. 6. Evaluating normality of transformed data distribution using *y = Ln(x)*.

Manual data transformation (y = x^2)

Fig. 7. Evaluating normality of transformed data distribution using *y = x^2*.
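The four manual transformations can be compared side by side by running the same normality test on each transformed sample. The sketch below uses a synthetic log-normal sample as a right-skewed, strictly positive stand-in for the solar energy data (an assumption), so the ranking it prints need not match Figs. 4 to 7.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive sample (assumption, not the real data).
rng = np.random.default_rng(5)
x = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# The manual transformations tried in this experiment, plus the raw data.
transforms = {
    "x": lambda v: v,
    "sqrt(x)": np.sqrt,
    "1/x": lambda v: 1.0 / v,
    "ln(x)": np.log,
    "x^2": np.square,
}

# Shapiro-Wilk p-value for each transformed sample.
p_values = {name: stats.shapiro(f(x)).pvalue for name, f in transforms.items()}

for name, p in sorted(p_values.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:8s} p-value = {p:.3e}")
```

Note that `sqrt(x)`, `1/x`, and `ln(x)` are only defined for positive values, so data containing zeros or negatives would need to be shifted first.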

Box-Cox power Transform

Fig. 8. Evaluating normality of transformed data distribution using Box-Cox transformation.
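A minimal sketch of the Box-Cox step with `scipy.stats.boxcox` follows. The input is again a synthetic log-normal sample (an assumption, not the solar energy data). When no `lmbda` is passed, SciPy estimates it by maximum likelihood, and the result can be checked against the textbook definition of the transform.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive sample (assumption, not the real data).
rng = np.random.default_rng(6)
x = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# With lmbda left unset, boxcox returns the transformed data and the
# lambda that maximises the log-likelihood.
transformed, lmbda = stats.boxcox(x)

# Box-Cox definition: y = (x**lmbda - 1) / lmbda if lmbda != 0, else ln(x).
manual = np.log(x) if lmbda == 0 else (x**lmbda - 1.0) / lmbda

print(f"estimated lambda = {lmbda:.3f}")
print(f"skewness before = {stats.skew(x):.3f}, after = {stats.skew(transformed):.3f}")
print(f"Shapiro-Wilk p-value after = {stats.shapiro(transformed).pvalue:.3f}")
```

Box-Cox requires strictly positive input, and a lambda close to 0 means the transform behaves almost like `ln(x)`.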

Conclusions

Not being able to assume normality mainly affects parametric hypothesis tests such as the t-test and ANOVA. Normality assumptions may also be present in machine learning models, e.g. linear regression assumes the residuals are normally distributed with zero mean [3].
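As an illustration of the linear regression case, the residuals of a fitted line can be checked with the same tests used above. The data here is simulated with Gaussian noise (an assumption), and `np.polyfit` stands in for whatever regression routine is actually used.

```python
import numpy as np
from scipy import stats

# Simulated linear relationship with Gaussian noise (assumption, for illustration).
rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Ordinary least-squares fit of a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Normality check on the residuals, not on the raw y values.
statistic, p_value = stats.shapiro(residuals)

print(f"slope={slope:.3f}  intercept={intercept:.3f}")
print(f"residual mean={residuals.mean():.2e}  Shapiro-Wilk p-value={p_value:.3f}")
```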

Since some statistical methods lack robustness against departures from normality, the presented methods are useful for assessing the normality of a data distribution. Furthermore, as shown in Fig. 8, Box-Cox is a family of transformations that helps to normalize a distribution, to correct non-linearity in the data, or to correct unequal variances [4].

Dependencies

pip install -r requirements.txt

Cite this work

J. Rico (2021) Normality Tests, p-values, and data normalization with Python.
[Source code](https://github.com/jvirico/normality-tests-pvalues-boxcoxtransformations)

References

[1] - Analysis Normality in Python.
[2] - Box-Cox Transformation.
[3] - Data need to be normally-distributed, and other myths of linear regression.
[4] - Transformación de Box-Cox.
