haskell / statistics Goto Github PK

A fast, high quality library for computing with statistics in Haskell.

Home Page: http://hackage.haskell.org/package/statistics

License: BSD 2-Clause "Simplified" License

Haskell 99.35% Python 0.36% Makefile 0.03% C 0.26%

statistics's Introduction

Statistics: efficient, general purpose statistics

This package provides the Statistics module, a Haskell library for working with statistical data in a space- and time-efficient way.

Where possible, we give citations and computational complexity estimates for the algorithms used.

Performance

This library has been carefully optimised for high performance. To obtain the best runtime efficiency, it is imperative to compile libraries and applications that use this library using a high level of optimisation.

Get involved!

Please report bugs via the github issue tracker.

Master git mirror:

git clone git://github.com/bos/statistics.git

There's also a Mercurial mirror:

hg clone https://bitbucket.org/bos/statistics

(You can create and contribute changes using either Mercurial or git.)

Authors

This library is written and maintained by Bryan O'Sullivan, [email protected].

statistics's People

Contributors

Stargazers

Watchers

statistics's Issues

Release 0.10.2.0

It's good moment to release 0.10.2.0 with all bug fixes nd tweaks

Int overflow(?) in wilcoxonMatchedPairCriticalValue

There is a bug in calculation of critical value for Wilcoxon-T test.

λ wilcoxonMatchedPairCriticalValue 35 0.05
Just 213
λ  wilcoxonMatchedPairCriticalValue 36 0.05
Just 2147483647   --  Which equal to Just (2^31-1)

It looks like at some point integer overflow occurs.

Also it cannot handle samples larger than 1023 at all since it uses 2**sampleSize in calculation of critical value which overflows to +∞ for sampleSize > 1023

Test probably should fall back to normal approximation at some point.

histogram_ fails to build with GHC HEAD

this is still minor, as it only applies to GHC HEAD, but this is a heads-up nevertheless, as it will apply to GHC 7.10. Something changed (possibly in vector or in GHC) and histogram_ fails to compile because it has trouble figuring out how to instantiate the s variable. I have a hack at https://ghc.haskell.org/trac/ghc/attachment/ticket/9190/statistics-0.11.0.3.patch to continue working, but not a proper patch at this moment.

Sometimes functions are not inlined

Sometimes functions in the Statistics.Sample are not inlined and GHC just uses generic version which leads to considerable slowdown. AFAIK function gets inlined only if fully saturated as declared.

variance xs = ...

some_function variance -- Variance won't be inlined there

There are two possible ways to deal with it. First is to declare functions with zero arguments using pointfree style or with lambdas variance = \x → .... Another is to add SPECIALIZE pragmas to get efficient versions for different types of vectors.

I'm not sure how does it affect real world code but benchmark suite is affected badly.

Failing tests

Two possibly different things at work here:

When running the tests with LANG=C (e.g. via a distribution's build system), the tests fail almost instantly:

Running 1 test suites...
Test suite tests: RUNNING...
Tests for all distributions:
  Tests for: BetaDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +tests: <stdout>: commitBuffer: invalid argument (invalid character)
Test suite tests: FAIL
Test suite logged to: dist/test/statistics-0.10.2.0-tests.log
0 of 1 test suites (0 of 1 test cases) passed.

Could this be from the infinity symbol?

When running it with my usual UTF-8 based locale, the "Quantile is CDF inverse" tests all failed:

Running 1 test suites...
Test suite tests: RUNNING...
Tests for all distributions:
  Tests for: BetaDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    PDF sanity: [OK, passed 100 tests]
    Quantile is CDF inverse: [Failed]
Falsifiable with seed 7414117163583754365, after 69 tests. Reason: BD {bdAlpha = 8.406843987606653, bdBeta = 0.4269461043937311}
1510.9852388084596
Quantile     = 0.9999952255644082
Probability  = 0.9852388084595987
Probability' = 0.9852388084596589
Error        = 6.0285110237146e-14
    quantile fails p<0||p>1: [OK, passed 100 tests]
  Tests for: CauchyDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    PDF sanity: [OK, passed 100 tests]
    Quantile is CDF inverse: [OK, passed 100 tests]
    quantile fails p<0||p>1: [OK, passed 100 tests]
  Tests for: ChiSquared:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    PDF sanity: [OK, passed 100 tests]
    Quantile is CDF inverse: [Failed]
Falsifiable with seed 4208265792671652991, after 51 tests. Reason: ChiSquared 92
144.62333347082813
Quantile     = 95.64486955846601
Probability  = 0.6233334708281291
Probability' = 0.6233334708281137
Error        = 1.532107773982716e-14
    quantile fails p<0||p>1: [OK, passed 100 tests]
  Tests for: ExponentialDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    PDF sanity: [OK, passed 100 tests]
    Quantile is CDF inverse: [OK, passed 100 tests]
    quantile fails p<0||p>1: [OK, passed 100 tests]
  Tests for: GammaDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    PDF sanity: [OK, passed 100 tests]
    Quantile is CDF inverse: [OK, passed 100 tests]
    quantile fails p<0||p>1: [OK, passed 100 tests]
  Tests for: NormalDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    PDF sanity: [OK, passed 100 tests]
    Quantile is CDF inverse: [Failed]
Falsifiable with seed -8041988097717865771, after 37 tests. Reason: ND {mean = 2.3201475383520176, stdDev = 497.49045920814746, ndPdfDenom = 1247.0236514103028, ndCdfDenom = 703.5577545633812}
124.31668308499577
Quantile     = -234.97997317574743
Probability  = 0.31668308499577336
Probability' = 0.31668308499575337
Error        = 1.9984014443252818e-14
    quantile fails p<0||p>1: [OK, passed 100 tests]
  Tests for: UniformDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    PDF sanity: [OK, passed 100 tests]
    Quantile is CDF inverse: [OK, passed 100 tests]
    quantile fails p<0||p>1: [OK, passed 100 tests]
  Tests for: StudentT:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    PDF sanity: [OK, passed 100 tests]
    Quantile is CDF inverse: [Failed]
Falsifiable with seed -8496255333126878786, after 44 tests. Reason: StudentT {studentTndf = 205.34925970355764}
28.50694788712507
Quantile     = 1.7437873772404973e-2
Probability  = 0.5069478871250688
Probability' = 0.5069478871246982
Error        = 3.7059244561987725e-13
    quantile fails p<0||p>1: [OK, passed 100 tests]
  Tests for: FDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [Failed]
Falsifiable with seed 5934505402744235409, after 90 tests. Reason: F {fDistributionNDF1 = 2.9744846725671688e16, fDistributionNDF2 = 3.974722746925967e16, _pdfFactor = 1.347450939929375e18}
Last elements: [1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    PDF sanity: [OK, passed 100 tests]
    Quantile is CDF inverse: [Failed]
Falsifiable with seed 8937924613887352287, after 28 tests. Reason: F {fDistributionNDF1 = 60027.0, fDistributionNDF2 = 21539.0, _pdfFactor = 461225.35985645605}
24.44359239191443
Quantile     = 0.9984277157803625
Probability  = 0.4435923919144287
Probability' = 0.4435923919164872
Error        = 2.058520021108734e-12
    quantile fails p<0||p>1: [OK, passed 100 tests]
  Tests for: BinomialDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    Prob. sanity: [OK, passed 100 tests]
    CDF is sum of prob.: [OK, passed 100 tests]
  Tests for: GeometricDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    Prob. sanity: [OK, passed 100 tests]
    CDF is sum of prob.: [OK, passed 100 tests]
  Tests for: HypergeometricDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    Prob. sanity: [OK, passed 100 tests]
    CDF is sum of prob.: [OK, passed 100 tests]
  Tests for: PoissonDistribution:
    C.D.F. sanity: [OK, passed 100 tests]
    CDF limit at +∞: [OK, passed 100 tests]
    CDF limit at -∞: [OK, passed 100 tests]
    CDF is nondecreasing: [OK, passed 100 tests]
    1-CDF is correct: [OK, passed 100 tests]
    Prob. sanity: [OK, passed 100 tests]
    CDF is sum of prob.: [OK, passed 100 tests]
  Unit tests:
    density (gammaDistr 150 1/150) 1 == 4.883311: [OK]
    density (studentT 0.3) 1.34 ≈ 0.0648215: [OK]
    density (studentT 1.0) 0.42 ≈ 0.27058: [OK]
    density (studentT 4.4) 0.33 ≈ 0.352994: [OK]
    cumulative (studentT 0.3) 3.34 ≈ 0.757146: [OK]
    cumulative (studentT 1.0) 0.42 ≈ 0.626569: [OK]
    cumulative (studentT 4.4) 0.33 ≈ 0.621739: [OK]
    density (fDistribution 1 3) 3.0 ≈ 0.05305164769729845 [got 0.0530516477147324]: [OK]
    density (fDistribution 2 2) 1.2 ≈ 0.206612 [got 0.20661157024793383]: [OK]
    density (fDistribution 10 12) 8.0 ≈ 0.0003856131792818928 [got 0.00038561318051585333]: [OK]
    cumulative (fDistribution 1 3) 3.0 ≈ 0.8183098861837906 [got 0.8183098861240833]: [OK]
    cumulative (fDistribution 2 2) 1.2 ≈ 0.545455 [got 0.5454545454545454]: [OK]
    cumulative (fDistribution 10 12) 8.0 ≈ 0.9993550986345141 [got 0.9993550986324504]: [OK]
Nonparametric tests:
  Mann-Whitney: [OK]
  Mann-Whitney: [OK]
  Mann-Whitney: [OK]
  Mann-Whitney: [OK]
  Mann-Whitney: [OK]
  Mann-Whitney: [OK]
  Mann-Whitney U Critical Values, m=1: [OK]
  Mann-Whitney U Critical Values, m=2, p=0.025: [OK]
  Mann-Whitney U Critical Values, m=6, p=0.05: [OK]
  Mann-Whitney U Critical Values, m=20, p=0.025: [OK]
  Wilcoxon Sum: [OK]
  Wilcoxon Sum: [OK]
  Wilcoxon Paired 0: [OK]
  Wilcoxon Paired 1: [OK]
  Wilcoxon Paired 2: [OK]
  Wilcoxon Paired 3: [OK]
  Wilcoxon Paired 4: [OK]
  Wilcoxon Paired 5: [OK]
  Sig 16, 35: [OK]
  Sig 16, 36: [OK]
  Wilcoxon critical values, p=0.05: [OK]
  Wilcoxon critical values, p=0.025: [OK]
  Wilcoxon critical values, p=0.01: [OK]
  Wilcoxon critical values, p=0.005: [OK]
  K-S D statistics: [OK]
  K-S 2-sample statistics: [OK]
  K-S probability: [OK]
fft:
  t_impulse: [OK, passed 100 tests]
  t_impulse_offset: [OK, passed 100 tests]
  ifft . fft = id: [OK, passed 100 tests]
  fft . ifft = id: [OK, passed 100 tests]
  idct . dct = id [up to scale]: [OK, passed 100 tests]
  dct . idct = id [up to scale]: [OK, passed 100 tests]
  DCT test for fromList [1.0,0.0]: [OK]
  DCT test for fromList [0.0,1.0]: [OK]
  DCT test for fromList [1.0,0.0,0.0,0.0]: [OK]
  DCT test for fromList [0.0,1.0,0.0,0.0]: [OK]
  DCT test for fromList [0.0,0.0,1.0,0.0]: [OK]
  DCT test for fromList [0.0,0.0,0.0,1.0]: [OK]
  IDCT test for fromList [1.0,0.0]: [OK]
  IDCT test for fromList [0.0,1.0]: [OK]
  IDCT test for fromList [1.0,0.0,0.0,0.0]: [OK]
  IDCT test for fromList [0.0,1.0,0.0,0.0]: [OK]
  IDCT test for fromList [0.0,0.0,1.0,0.0]: [OK]
  IDCT test for fromList [0.0,0.0,0.0,1.0]: [OK]
S.Function:
  Sort is sort: [OK, passed 100 tests]
KDE:
  integral(PDF) == 1: [OK, passed 100 tests]

         Properties    Test Cases   Total        
 Passed  102           52           154          
 Failed  6             0            6            
 Total   108           52           160          
Test suite tests: FAIL
Test suite logged to: dist/test/statistics-0.10.2.0-tests.log
0 of 1 test suites (0 of 1 test cases) passed.

Bootstrap is affected by bug in monad-par

Bug simonmar/monad-par#23 affects S.Resample.Bootstrap.bootstrapBCA and cause crashes in criterion when program is compiled with threading.

This issue track workarounds for this bug.

Bug in DCT

I think there is bug in DCT

dct [1,0] = [1, sqrt 2]

If defintion from wikipedia is used then result should be

[1 , 1 / sqrt 2]

Further inspection shows that every element except first is multiplied by 2. Either first element should be multiplied as well or none. Both convention are used in practice. For example fftw3 multiplies result by 2 http://www.fftw.org/fftw3_doc/1d-Real_002deven-DFTs-_0028DCTs_0029.html

P.S. I think documentation for these these function must include formulae of what actually computed. No amount of word could replace math notation. And because of multitude of conventions it's very important here.

Recently I've learned a nice trick with TeX to gif services and <<url>> haddock markup. Following code will generate inline picture which is generated using such service.

-- | Some function
--
-- <<http://tex.to.gif/$\sqrt{x^2+y^2}$>>
someFunction :: ...

There are some concerns. Service may not tolerate such use it may go away. Alternative is to generate pictures and paste them into documentation using data URI. It will require huge tens of kchars comments

Move Statistics.Function.Comparison to math-functions

I think it's worthwhile to migrate within function to math-functions. it's generally useful and not limited to statistics. I propose to put it in Numeric.MathFunctions.ApproxEq module since it deals with approximate equality.

Add chi square test

Inaccurate discription for Statistics.Distribution.complCumulative

"It's useful when one is interested in P(X≥x)" should be changed to "It's useful when one is interested in P(X>x)". This is rather misleading for discrete probability distribution.

logGammaL loses precision near zero

logGammaL loses precision near zero. For parameters less than 1e-3 it do not provide 14 digits promised by documentation any more. Precision of logGamma deteriorates much slower in this range and reach 1e-14 for x < 1e-4. I don't know whether this is property of algorithm or bug in implementation.

x    relative error

1.25e-6 2.335247745808626e-11
6.82e-5 4.5200406627037583e-13
2.46e-4 1.2419479169108893e-13
8.8e-4 7.069995230776526e-14
3.12e-3 7.69901613260452e-16

Gamma function value up to 20 decimal digits were computed using mpmath python library.

statistics 0.10.1.0: Setup.lhs: can't find source for Tests/Distribution

$ tar xf statistics-0.10.1.0.tar.gz
$ runhaskell Setup.lhs configure --enable-tests
$ runhaskell Setup.lhs build
[...]
Preprocessing test suite 'tests' for statistics-0.10.1.0...
Setup.lhs: can't find source for Tests/Distribution in tests,
dist/build/autogen

Uniform distribution parameters

How do I get a and b from a UniformDistribution?
There are no getters and its constructor is not exported!

Haddock error in statistics-0.10.3.0

The package fails to generate its documentation with the following error:

dist/build/tmp-25026/Statistics/Distribution/Transform.hs:33:3:
    parse error on input `}'

A complete build log that shows the versions of all other tools involved is available at http://hydra.cryp.to:8080/build/11948/nixlog/2/raw.

geometricMean doesn't need to get so big

Not sure if this is the intended behaviour, but:

> v = V.fromList [127.344,134.693,126.829,122.765,125.857,123.627,123.63,123.633,168.1,168.104,168.105,168.106,168.107,168.107,140.516,127.749,127.896,127.24,184.358,184.366,184.367,139.638,127.036,127.039,137.367,135.588,122.624,115.49,120.145,118.251,125.252,115.712,115.711,115.712,115.711,115.711,115.711,115.711,115.711,115.711,115.711,122.277,117.771,118.606,125.025,120.733,113.73,118.849,112.927,123.351,123.348,123.35,123.351,123.351,123.352,123.353,123.354,123.18,121.414,127.041,128.871,119.192,116.217,140.41,133.458,133.441,124.329,126.73,126.732,126.733,126.734,112.295,137.034,140.261,117.266,117.267,117.268,117.268,117.269,117.27,117.27,117.272,117.273,117.274,132.525,128.073,118.129,156.335,141.49,118.49,155.044,153.939,153.936,156.424,156.427,156.429,156.293,148.878,148.879,148.879,148.88,148.88,121.121,131.657,124.478,152.346,131.116,132.253,167.85,140.516,155.135,115.711,142.562,126.732,121.894,157.126,157.134,157.135,157.136,157.137,157.139,115.519,119.265,116.85,123.839,141.226,119.969,119.973,119.974,119.241,119.239,129.744,129.743,129.239,168.396,168.403,168.405,168.406,130.632,130.633,130.634,129.788,124.239,124.251,124.254,124.256,124.267,124.659,125.271,135.085,135.092,123.175,127.979,140.928,131.807,131.809,111.563,117.953,117.953,117.955,117.957,125.237,124.895,119.397,131.362,143.52,128.937,127.868,131.025,121.911,128.023,113.911,136.279,120.643,118.624,118.627,118.627,118.628,118.627,118.627,117.883,120.856,129.917,138.342,131.728,136.769,118.36,131.029,116.306,135.93,120.344,125.726,122.76,118.893,118.892,118.893,118.894,118.894,118.894,118.895,118.895,118.896,118.896,117.698,144.307,147.729,154.152,132.4,118.793,118.799,118.8,118.801,118.802,118.804,118.813,118.814,118.815,120.218,120.221,132.378,130.221,118.063,129.326,129.326,129.327,127.826,117.795,117.798,126.803,118.985,118.987,118.989,118.99,117.193,113.995,120.865,130.008,129.867,122.876,118.766,152.971,152.978,152.98,152.982,152.984,152.986,152.987,152.99,152.9,140.776,134.097,123.364,121.743,122.88,117.404,120.067,124.15,124.155,142.605,142.608,142.61,142.612,123.638,122.31,122.316,122.318,122.32,122.322,122.331,122.333,120.683,120.978,148.751,134.869,130.778,126.963,132.516,126.154,125.401,120.449,120.407,120.408,120.408,120.408,120.41,120.409,120.411,133.385,115.994,156.255,134.577,176.299,120.597,118.505,150.379,150.386,150.388,150.389,150.391,150.393,150.394,150.397,150.398,130.191,155.105,131.608,152.308,134.512,130.315,151.424,143.675,130.532,118.752,118.757,118.758,118.759,118.759,118.76,118.761,118.761,118.761,120.188,129.469,129.477,129.48,145.653,126.004,127.128,127.825,134.017,120.158,120.158,120.159,120.159,125.596,128.65,133.134,133.821,141.524,139.647,125.15,125.153,125.153,125.155,125.155,124.247,132.185,132.035,123.437,127.261,122.383,120.874,138.418,138.425,138.426,138.427,117.402,126.177,130.47,146.379,146.383,146.384,133.019,131.732,134.641,152.683,130.526,159.303,119.641,139.547,161.648,129.052,130.363]
> geometricMean v
Infinity
> exp . mean . map log $ v
130.01086640231853

Distribution's semantics for ∞/NaN arguments

It's not specified what should return cumulative and density when ∞ or NaN is passed as argument. There are two options for NaN: return NaN or throw error. For infinities it's obvious

cumulative d (+∞) = 1
cumulative d (-∞) = 0

Infinities are not contrived. They do arise sometimes. E.g. quantile d 1 = +∞ fordistributions with infinite support

Comparison to zero in ridders

There are still several issues with ridder. First documentation doesn't say whether error tolerance is absolute or relative. Line 89 suggest that it's absolute error but in the definition of ~= it's treated as relative error.

Lines 83 and 84 are especially troublesome. AFAIU it's meant to be approximate comparison to zero. But in my experience it's ill defined unless scale of the problem is known. For different problems 1 could be very big number or approximately zero. Here is illustration:

x ~= 0 ≡ abs (x - 0) <= abs (tol * x) ≡ abs x <= abs (tol * x)

If tol < 1 it just a comparison to zero. But since tol is absolute error it could be greater than 1 in which case condition always evaluates to True:

> ridders 1.1 (15e3,16e3) (cos . (* 0.0001))
Root 15000.0
> ridders 0.9 (15e3,16e3) (cos . (* 0.0001))
Root  15707.584948424925

Add Student-T and F-distribuions

Hello Bryan,

I want to use your library for building statistical hypothesis tests. However, for any serious hypothesis testing (e.g. when comparing normal-normal-distributions, when doing regression analysis, building confidence intervals, etc.), you need to have at least the following distributions:

Students t-distribution (for doing any kind of two-sample-tests)

F-distribution (for evaluating ANOVA-tables etc.)

Do you plan to include these distributions in your package? By chance, any time soon? That would be really great!! At the moment, you can't even build a simple t-test with the package, even though there is so much cool stuff in your package. I think it would enhance the usability of your package greatly!

What do you think?

All the best, Thomas

Migrated from bitbucket: https://bitbucket.org/bos/statistics/issue/6/adding-essential-distributions

Behavior of ContDistrib.quantile is underspecified

Behaviour of "quantile" is underspecified. Documentation doesn't state what should be returned when p is outside of [0,1] interval. I can see only two options. One is to return NaN and other is to call error. I prefer second variant since NaN chasing isn't particularly rewarding activity.

Should I documentation?

binomial distribution 'cumulative' fails for large n,n-k

I would expect behavior similar to octave:

octave> tic; a = binocdf(5000000,10000000,0.5); b = toc; [a b]
ans =

0.50012615720383635 0.00220894813537598

but statistics takes quite a while and fails:

ghci> cumulative (binomial 10000000 0.5) 5000000
NaN
(7.03 secs, 2081185088 bytes)

Non-generic findRoot look-alikes: new class?

findRoot uses a specific algorithm to calculate (roughly) the inverses of CDFs. More efficient algorithms should be available for some distributions. The uniform distribution is a trivial case of that situation. Would it make sense to add a class (or possibly more than one) for such distributions? Something vaguely like

class (Distribution d) => InvDist d where
  invCDF :: d -> Double -> Double

KDE produces negative densities

Here is simple test case:

kde 32 $ U.fromList [0,0,0,1]

It looks like negative values are artifact of FFT.

binomial distribution doesn't allow zero trials

the binomial distribution is defined for this case and is the sort of thing that should work. as far as I can tell the code already mostly works for this case, just the "binomial" function fails when given zero trials.

Thanks

Sam

Mvar Lock when used from Criterion with Threaded build

haskell/criterion#28

looks like the only use site of parallelism (in Criterion and Statistics) is at https://github.com/bos/statistics/blob/master/Statistics/Resampling/Bootstrap.hs#L86-L89

so this is likely triggering a bug in monad-par side,
i'll be trying to figure it out

Quantile calculations are wrong

Quantile calculation for gamma and χ² distributions are wrong. Following law is expected to hold for quantile and CDF:

(cumulative d . quantile d) ≈ id

But it doesn't hold and test suite easily finds failures.

>>> let d = chiSquared 52
>>> (cumulative d . quantile d)) 0.91
1.0
>>>  (quantile d) 0.91
1.2595523146049146e263

Ideally quantiles should be implemented using inverse incomplete gamma function.

Value at upper bound causes error in histogram_ and kde_

A value equal to the upper bound of histogram_ causes an array index out of bounds exception (or maybe an access violation, since unsafe operations are used). By extension, this error also occurs in kde_, since it calls histogram_.

import Statistics.Sample.Histogram
import Data.Vector.Unboxed as U
> histogram_ 10 0.0 1.0 (U.fromList [1.0]) :: U.Vector Double
fromList *** Exception: .\Data\Vector\Generic\Mutable.hs:576 (read): index out of bounds (10,10)

This happens because the calculated bin will be truncate 1.0/(1.0/10) = 10, which is just outside the bounds. Using a value of 1.0-epsilon in the list or a bound of 1.0+epsilon does not produce the error.

The reason I encountered this problem is because I was computing the kde of values known to be in the range [0...1].

To save future debugging time, could you add a note to the documentation of kde_ saying something like "Sample data not strictly inside the bounds will cause an error"?

And for histogram_, which does include such a warning:
"Sample data that falls above the upper bound will cause an error."
should be changed to the longer but slightly more correct:
"Sample data that falls above or on the upper bound will cause an error."

Add Kolmogorov-Smirnov tests

Add one complement for incomplete gamma function

It's required for accurate estimation of probability P(X>x) for large x for χ² and gamma distributions and consequently evaluate improbability of fit/statistical hypothesis.

This report is of low priority.

Read and Binary instances for ditributions does not check input validity

Both Read and Binary instances for distributions are derived automatically. This leads to number of problems.

It allows to create distributions in invalid state. Smart constructors do check whether input valid or not. Derived methid will blindly decode what they get.
It breaks compatibility. Several distributions have fields which cache expensive computation. Their number and meaning could (and did!) change over time. Each such change breaks data format compatibility. They shouldn't be serialized in first place and recalculated during deserialization.

TestType should offer choice between two-tailed, right-tail or left-tail test

At the moment the TestType is

data TestType = OneTailed
              | TwoTailed

I think it makes more sense to be

data TestType = LeftTailed
              | RightTailed
              | TwoTailed

so that e.g. in a t-test you can specify H1 to be mu /= 0, mu < 0 or mu > 0.

The workaround at the moment is to supply OneTailed to the test, and G.map negate v over the input sample.

Missing dependency on ghc-prim (for GHC.Generics)

There seems to be a dependency missing:

Statistics\Distribution\Beta.hs:25:8:
    Could not find module `GHC.Generics'
    It is a member of the hidden package `ghc-prim'.
    Perhaps you need to add `ghc-prim' to the build-depends in your .cabal file.
    Use -v to see a list of the files searched for.

Adding a dependency on ghc-prim to the cabal file indeed fixed this for me.

Probabilities should be given in the log domain; double's don't provide enough fidelity

For example, the function,

density :: d -> Double -> Double

should really be of type

density :: d -> Double -> LogFloat

That way, if I ask for the density at 1E-1000000000 on a normal distribution with mean and stddev 1, I still get a meaningful result instead of just 0.

I can't think of any good reason not to be using probabilities in the log domain, since all of them will be in the range [0,1].

I'm implementing a bunch of machine learning algorithms that require probability distributions, and I'd like to be able to use your library, but I won't be able to unless this is changed. Also, I've implemented some code for estimating these distributions from data, and would be able to add it to your library if this were changed as well.

Suboptimal bandwidth estimation in Statistics.KernelDensity

Function bandwidth use naive method of bandwidth estimation which provides reasonable results for unimodal distributions but systematically oversmoothes for distributions which significantly differ from normal especially bimodal ones. There are number of better performing method. Overview is given in paper "Brief survey of bandwidth selection for density estimation"

Attachments illustrate point. Black line is true distribution and red one is kernel density estimate.

P.S. Personally I don't need this at moment.

Migrated from bitbucket: https://bitbucket.org/bos/statistics/issue/3

partialSort slows down compilation

Modules S.Function and S.Quantile require a lot of time to compile. Further investigation showed that this is due to function partialSort. It compiles to some 200k line of GHC core and probably to a lot of object code as well.

Quite likely it's vector-algorithms fault.

S.RootFinding.ridders cannot find root for simple functions

Here is an example.

>>> ridders 1e-9 (-1) (1) id
Nothing

Tracing shows that dx becomes NaN whenever s becomes zero and then all other values become NaNs too.

P.S. Could you send me a paper with algorithm description. It's paywalled and my institution doesn't have subscription.

Statistical tests should return p-value

At the moment statistical tests just return whether null hypotheis is rejected or not. That's not sufficient. It's quite different to reject hypothesys at p=1e-18 and at p =0.03.

Test significance
Test p-value. Caller have that information still it's not nice to force him to track that info
Calculated p-value
Whatever else particular test wants to provide

Strange type signature of DCT

Signature of dct and idct both are :: Vector (Complex Double) → Vector Double. It's implied that dct is invertible but it's obviously not. Only n out 2n degrees of freedom survive transformation and this information couldn't be recovered. Probably it should be Vector Double → Vector Double

Also different variants of DCT exists wikipedia describe no less than five. It would be helpful to describe which one is actually implemented.

ContGen instance of NormalDistribution

HI,
I think that in ContGen instance of NormalDistribution
in Statistics/Distribution/Normal.hs,

stdDev d * (x - mean d)

should be

mean d + stdDev d * x

exceptions in kde

Maybe these should be two separate issues, but here goes:

> let f x y = kde x $ Data.Vector.Unboxed.replicate y 0.0
> f 2 1
(fromList [-0.1,0.1],fromList [6.255700138464551,3.744299861535448])
> f 1 1
*** Exception: Statistics.Math.log2: invalid input
> f 1 2
*** Exception: ./Data/Vector/Generic/Mutable.hs:576 (read): index out of bounds (-9223372036854775808,1)

Deprecated modules

Modules

Statistics.Math
Statistics.Test.NonParametric

are all deprecated. At some point they shall be removed. Just a reminder

Quantile for normal distribution works poorly for small p

Example

>>> quantile standard $ cumulative standard (-8)
-7.991573950493904
>>> quantile standard $ cumulative standard (-9)
-Infinity

Why the dependency on aeson?

This seems, prima facie, rather unnecessary, and it pulls in a lot of other stuff. Would it be possible to remove it?

Migrated from bitbucket: https://bitbucket.org/bos/statistics/issue/4/why-the-dependency-on-aeson

Drop S.Function.create

Function S.Function.create is not used anymore and vector provides generateM since 0.7.1. Latter is more general (works with any monad not only with PrimMonad) and fusible. I propose to drop create

histogram breaks when input vector has only 1 distinct value

It seems that when the input vector contains a single distinct value, histogram creates buckets with step size 0.

Statistics.Sample.Histogram> histogram 11 (fromList [81.0, 81.0] :: Vector
 Double) :: (Vector Double, Vector Int) 
(fromList [81.0,81.0,81.0,81.0,81.0,81.0,81.0,81.0,81.0,81.0,81.0],fromList *** 
Exception: ./Data/Vector/Generic/Mutable.hs:590 (read): index out of bounds (-92
23372036854775808,11)

Tag for 0.10.3.0 release is missing

Time for 0.10.0.0 ?

It has been a long time since last release and there are tons of changes and serious bug fixes.

Split special functions into separate package

I propose to move module Statistics.Math into separate package. It contains functions which are useful on they own and not only in context of statistics. As for package name it could be special-functions

Proposed module structure is: Numeric.Polynomial.Chebyshev for Chebyshev polinomials. It makes sense to put them into separate module since there are Lagrange polynomials etc. It could get too crowded in one module. Numeric.SpecFunction for everything else.

I could handle split but I could you create repository and give me commit rights?

I think after split module Statistics.Math should be marked deprecated but still export functions for one or two releases.

Commit ID changed for all commits.

@bos, after your last commit somehow all SHA1 identifiers of all commits changed. For example thoose are same commits:

Shimuuar@aaf7824
c9b2033

This completely breaks other peoples' forks

How can the dependency on binary ≥ 0.6.3.0 be satisfied?

My understanding is that binary-0.5.xis a core package in GHC (because bytestring depends on it). Now, in order to build statistics, it is necessary to update binary. Updating a core package can be hazardous, though, because a package that depends on both bytestring and statistics will end up being linked to two different versions of binary!

Given those circumstances, it seems to me like there is no safe way to build statistics with any of the currently released compilers. Am I missing something?

Add incomplete beta function

It's required for cumulative probability function for both Student T and Fisher F distributions. I think it would be worthwhile to add its inverse too. It's would useful for quantiles estimation since generic root finding algorithm isn't very robust.