
ruby-statistics's People

Contributors

dependabot[bot], dsounded, estebanz01, htwroclau, igas, jasoncaryallen, oliver-czulo, ylansegal


ruby-statistics's Issues

Implement/replace standard deviation and variance.

We need to do this urgently, because the descriptive_statistics gem calculates variance with the population formula rather than the sample formula, which is not accurate for statistical tests and distribution calculations.

It's causing #13
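For reference, the distinction at issue can be sketched in plain Ruby; the method names here are illustrative, not part of the gem's API:

```ruby
# Population variance divides by n; sample variance divides by (n - 1)
# (Bessel's correction), which is what statistical tests expect.
def population_variance(data)
  mean = data.sum.to_f / data.size
  data.sum { |x| (x - mean)**2 } / data.size
end

def sample_variance(data)
  mean = data.sum.to_f / data.size
  data.sum { |x| (x - mean)**2 } / (data.size - 1)
end

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
population_variance(data) # => 4.0
sample_variance(data)     # => 4.571428571428571 (32/7)
```

For small samples the two results diverge noticeably, which is why using the population formula on samples skews test statistics.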

ChiSquaredTest#goodness_of_fit hangs with BigDecimal input

This is the simplest example that seems to trigger the hanging behaviour. If the BigDecimal is cast to a float then it works as expected:

require 'statistics'

expected = [BigDecimal(0.2, 1)]
observed = [3.3]

chi_squared = Statistics::StatisticalTest::ChiSquaredTest
stats = chi_squared.goodness_of_fit(0.05, expected, observed)
# Observe that the process hangs at this point.
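Until the hang is fixed, one workaround (untested against the gem itself) is to coerce the BigDecimal inputs to Floats before calling goodness_of_fit, since the report above says plain Floats work:

```ruby
require 'bigdecimal'

# Coerce up front; note this trades BigDecimal's exactness for Float precision.
expected = [BigDecimal('0.2')]
observed = [3.3]

expected_as_floats = expected.map(&:to_f)
expected_as_floats # => [0.2]
```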

Add more discrete distributions

Due to time constraints, I was only able to add two discrete distributions. The idea is to add more common discrete distributions, like the negative binomial or the geometric distribution.

Standard Deviation

Excuse me, but I would like to know if there is a function for the standard deviation. I looked in the wiki and did not find one.

Thank you in advance.

Implement KS test for one sample

The idea is to implement the one-sample KS test: given a sample and a distribution object, validate whether the sample belongs to that distribution.

Something like this:

distribution = Distribution::StandardNormal.new
samples = distribution.random(elements: 1_000, seed: 100)
# This should return null hypothesis as true,
# so we assume that the sample belongs to the specified distribution.
StatisticalTest::KSTest.one_group(samples: samples, distribution: distribution)
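For reference, the one-sample KS statistic such an API would compute is D = max |F_n(x) - F(x)|, where F_n is the empirical CDF and F is the candidate distribution's CDF. A minimal sketch; the cdf lambda stands in for a distribution object's cumulative_function:

```ruby
# KS statistic: largest gap between the empirical CDF and the candidate CDF,
# checked on both sides of each jump of the empirical CDF.
def ks_statistic(samples, cdf)
  sorted = samples.sort
  n = sorted.size.to_f
  sorted.each_with_index.map do |x, i|
    f = cdf.call(x)
    [((i + 1) / n - f).abs, (f - i / n).abs].max
  end.max
end

# Toy check against the Uniform(0,1) CDF:
uniform_cdf = ->(x) { x.clamp(0.0, 1.0) }
ks_statistic([0.1, 0.4, 0.5, 0.9], uniform_cdf) # => 0.25
```

The resulting D would then be compared against the Kolmogorov critical value for the chosen alpha to accept or reject the null hypothesis.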

ChiSquaredTest#goodness_of_fit gives ZeroDivisionError

ChiSquaredTest#goodness_of_fit raises an error with some inputs. The shortest code with which I could reproduce the error is as follows:

irb(main):001:0> require 'statistics'
=> true
irb(main):002:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.01, 477, [481, 483, 482, 488, 478, 471, 477, 479, 475, 462])
        9: from /usr/bin/irb:23:in `<main>'
        8: from /usr/bin/irb:23:in `load'
        7: from /usr/lib/ruby/gems/2.7.0/gems/irb-1.2.1/exe/irb:11:in `<top (required)>'
        6: from (irb):2
        5: from /var/lib/gems/2.7.0/gems/ruby-statistics-3.0.0/lib/statistics/statistical_test/chi_squared_test.rb:28:in `goodness_of_fit'
        4: from /var/lib/gems/2.7.0/gems/ruby-statistics-3.0.0/lib/statistics/distribution/chi_squared.rb:14:in `cumulative_function'
        3: from /var/lib/gems/2.7.0/gems/ruby-statistics-3.0.0/lib/math.rb:49:in `lower_incomplete_gamma_function'
        2: from /var/lib/gems/2.7.0/gems/ruby-statistics-3.0.0/lib/math.rb:27:in `simpson_rule'
        1: from /var/lib/gems/2.7.0/gems/ruby-statistics-3.0.0/lib/math.rb:27:in `/'
ZeroDivisionError (divided by 0)

Perhaps the problem lies in lib/math.rb:49; it looks like it causes a zero division when x < 0.5.

Thank you in advance.
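For context, and only as a guess at the failure mode (not confirmed against the gem's lib/math.rb): a composite Simpson's rule divides the interval width by the number of subintervals, so an interval count that truncates to zero with Integer operands raises exactly this kind of ZeroDivisionError. A guarded sketch:

```ruby
# Composite Simpson's rule; n must be a positive even number of subintervals.
# If n were derived from the integration bound and truncated to 0, the
# division computing h would raise ZeroDivisionError for Integer operands.
def simpson_rule(a, b, n)
  raise ArgumentError, 'n must be a positive even integer' unless n.is_a?(Integer) && n.positive? && n.even?

  h = (b - a).to_f / n
  sum = yield(a) + yield(b)
  (1...n).each { |i| sum += (i.odd? ? 4 : 2) * yield(a + i * h) }
  sum * h / 3.0
end

# Simpson's rule is exact for polynomials up to degree 3:
simpson_rule(0.0, 1.0, 10) { |x| x * x } # ≈ 1/3
```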

P-values incorrectly reported for two sided tests.

Some two-sided tests are reporting wrong p-values, and hence wrong hypothesis evaluation results. According to https://support.minitab.com/en-us/minitab/18/help-and-how-to/statistics/basic-statistics/supporting-topics/basics/manually-calculate-a-p-value/ we should calculate the p-value with the following rules:

  1. If the test assumes a lower tail, p-value = CDF(t_score). We show this calculation as probability.
  2. If the test assumes an upper tail, p-value = 1 - CDF(t_score). We show this calculation as p_value for one_tail tests.
  3. If the test is two-sided, p-value = 2 * (1 - CDF(|t_score|)). Currently we do not take the absolute value of the t_score before computing the CDF; we use the raw t_score, which is yielding the weird results reported in the p_value field.
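The three rules translate directly into code; a sketch with a generic cdf lambda standing in for a distribution's cumulative_function:

```ruby
# Rules 1-3 from the issue. The two-sided case must use |t_score|.
def p_values(t_score, cdf)
  {
    lower_tail: cdf.call(t_score),                  # rule 1, shown as probability
    upper_tail: 1.0 - cdf.call(t_score),            # rule 2, p_value for one-tail tests
    two_sided: 2.0 * (1.0 - cdf.call(t_score.abs))  # rule 3
  }
end

# With a symmetric distribution, a score and its negation must give the same
# two-sided p-value. Standard normal CDF via Math.erf:
normal_cdf = ->(x) { 0.5 * (1.0 + Math.erf(x / Math.sqrt(2.0))) }
p_values(-1.96, normal_cdf)[:two_sided] # ≈ 0.05
```

Without the absolute value, a negative t_score makes 1 - CDF(t_score) larger than 0.5, and doubling it pushes the "p-value" above 1, which matches the out-of-range reports.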

Result of Distribution::ChiSquared.cumulative_function outside the [0..1] range

I work with the following two test contingency tables (observed1, observed2) with these resulting variables:

observed1 = [[388,51692],[119,45633]]
expected1 = [[269.8969662278191, 51810.10303377218], [237.10303377218088, 45514.89696622782]]
chi_score1 = 111.0839904758887
df1 = 1

observed2 = [[388,51692],[119,45633],[271,40040]]
expected2 = [[293.3065012342283, 51786.69349876577], [257.66818441759625, 45494.3318155824], [227.02531434817544, 40083.974685651825]]
chi_score2 = 114.36002831520479
df2 = 2

When calculated with Python, I get the following results when extracting the pvalue (replace observed with observed1 or observed2):

import scipy.stats as stats
# include above variables here
X2 = stats.chi2_contingency(observed, correction=False)[0]

# observed1 => pvalue = 5.671618200219206e-26
# observed2 => pvalue = 1.469045936431957e-25

When using Distribution::ChiSquared.cumulative_function from the ruby-statistics package, I get the following results (replace df with df1 or df2, and chi_score likewise):

probability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score)
p_value = 1.0 - probability

# observed1 => p_value = Infinity
# observed2 => p_value = 1.0000333200206515

While I cannot confirm that the Python module yields the correct values, its results seem more plausible, considering that a p-value should lie in the range [0..1].
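Whatever the root cause, values like these could be caught early by a basic property check: any CDF must return values in [0, 1] and be monotone non-decreasing. A sketch; the lambdas are stand-ins, not the gem's implementation:

```ruby
# Generic sanity check over a set of probe points.
def plausible_cdf?(xs, cdf)
  values = xs.sort.map { |x| cdf.call(x) }
  in_range = values.all? { |p| p >= 0.0 && p <= 1.0 }
  monotone = values.each_cons(2).all? { |a, b| b >= a - 1e-12 }
  in_range && monotone
end

# Stand-in CDF (exponential with rate 1), probed at the chi scores above:
exponential_cdf = ->(x) { x < 0 ? 0.0 : 1.0 - Math.exp(-x) }
plausible_cdf?([0.0, 1.0, 111.08, 114.36], exponential_cdf) # => true

# A function that escapes [0, 1] fails the check:
plausible_cdf?([0.0, 0.5, 2.0], ->(x) { x }) # => false
```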

Crystal programming support

It would be great if you could extend this library to the Crystal programming language, for a free performance boost and lower memory consumption.

Significance tests

Add a couple of significance tests, so we can have a pretty robust and interesting statistics gem :)

  • Chi-squared goodness-of-fit test.
  • Paired t-tests.
  • Wilcoxon rank-sum test.
  • Spearman's rank correlation (optional, but desired). (Extracted to #17)

p-values for T-test are not accurate enough

While building the paired t-test, I found that the p-values (1 - P(x <= X)) calculated using the Student's t distribution fall short of the expected values. A couple of tests show a difference in the second decimal place from the expected value:

I traced this back to the way Ruby handles decimal values: the numbers calculated by Minitab or R differ from the ones calculated by Ruby itself.

I'm logging this as a bug, but it's not related to the code itself (😅); I'm logging it for future reference.
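The rounding behaviour in question is a property of IEEE 754 Floats generally, not of this gem; it can be reproduced directly:

```ruby
require 'bigdecimal'

# 0.1 and 0.2 have no exact binary representation, so their Float sum
# misses 0.3 by a tiny amount that can compound over many operations.
sum = 0.1 + 0.2
sum == 0.3              # => false
(sum - 0.3).abs < 1e-15 # => true

# BigDecimal arithmetic stays exact for decimal inputs:
BigDecimal('0.1') + BigDecimal('0.2') == BigDecimal('0.3') # => true
```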

Write F-test/One way ANOVA

  • Implement the F-Test for two samples (Implemented as ANOVA F Score).
  • Implement the One way ANOVA for more than two samples.

Aliasing Distribution Causes Naming Collision in Main App

I have a Rails app that I just upgraded to Ruby 2.7.4, which appears to install ruby-statistics as a dependency. The application defines a model/namespace class Distribution. When running bundle exec rails console (or server), this gem loads before that class is defined and creates an alias named Distribution in the main app namespace, here:

if defined?(Statistics) && !(defined?(Distribution))

This causes the app to fail to load due to naming conflicts. Removing this line allows my app to load properly.
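One way to avoid the collision, sketched without reference to the gem's internals, is to skip the top-level alias entirely and refer to the gem's constants by fully qualified names; the Normal class below is a stand-in, not the gem's implementation:

```ruby
# Hypothetical host application: it owns the top-level Distribution constant.
class Distribution
  def self.kind
    'application model'
  end
end

# Gem code stays namespaced under Statistics; no top-level alias is created,
# so the host's Distribution is untouched.
module Statistics
  module Distribution
    class Normal
      attr_reader :mean, :sd

      def initialize(mean, sd)
        @mean = mean
        @sd = sd
      end
    end
  end
end

Distribution.kind                                 # => "application model"
Statistics::Distribution::Normal.new(0.0, 1.0).sd # => 1.0
```

An opt-in shorthand (e.g. only aliasing when the host explicitly asks for it) would preserve the convenience without the load-order hazard.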

Paired T test minimal sample size?

sample1 = [1.0, 2.0, 3.0]
sample2 = [2.0, 3.0, 4.0]
StatisticalTest::TTest.paired_test(alpha = 0.05, :one_tail, sample1, sample2)

=> TypeError: nil can't be coerced into Fixnum

sample1 = [1.0, 2.0, 3.0, 4.0]
sample2 = [2.0, 3.0, 4.0, 3.0]
StatisticalTest::TTest.paired_test(alpha = 0.05, :one_tail, sample1, sample2)

=> {:probability=>0.1955011094781139, :p_value=>0.8044988905218862, :alpha=>0.05, :null=>true, :alternative=>false, :confidence_level=>0.95}

It seems the smallest sample size that can be used is 4, since the three-element samples above raise an error. Is this intentional, or am I missing something?
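Worth noting: in the failing three-element example every pairwise difference is exactly -1.0, so the variance of the differences is zero; the error may stem from that degenerate case rather than from sample size, since the paired t statistic itself is well defined for n = 3 (df = n - 1 = 2). A manual sketch:

```ruby
# Paired t statistic: t = mean(d) / sqrt(var(d) / n), d = pairwise differences.
def paired_t_statistic(sample1, sample2)
  diffs = sample1.zip(sample2).map { |a, b| a - b }
  n = diffs.size
  mean = diffs.sum / n
  var = diffs.sum { |d| (d - mean)**2 } / (n - 1)
  if var.zero?
    # Degenerate case: every difference is identical (as in the failing example).
    return mean.zero? ? Float::NAN : (mean.positive? ? Float::INFINITY : -Float::INFINITY)
  end

  mean / Math.sqrt(var / n)
end

paired_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])           # => -Infinity (zero variance)
paired_t_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 3.0]) # => -1.0
```

For the four-element samples this gives t = -1.0 with df = 3, which is consistent with the probability of about 0.1955 reported above.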
