
ruby-statistics's People

Contributors

dependabot[bot], dsounded, estebanz01, htwroclau, igas, jasoncaryallen, oliver-czulo, ylansegal


ruby-statistics's Issues

Implement/replace standard deviation and variance.

We need to do this urgently, because the descriptive_statistics gem calculates variance with the population formula rather than the sample formula, which is not accurate for statistical tests and distribution calculations.

It's causing #13
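For reference, the distinction at issue can be sketched in plain Ruby; the method names here are illustrative, not part of the gem's API:

```ruby
# Population variance divides by n; sample variance divides by (n - 1)
# (Bessel's correction), which is what statistical tests expect.
def population_variance(data)
  mean = data.sum.to_f / data.size
  data.sum { |x| (x - mean)**2 } / data.size
end

def sample_variance(data)
  mean = data.sum.to_f / data.size
  data.sum { |x| (x - mean)**2 } / (data.size - 1)
end

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
population_variance(data) # => 4.0
sample_variance(data)     # => 4.571428571428571 (32/7)
```

For small samples the two results diverge noticeably, which is why using the population formula on samples skews test statistics.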

ChiSquaredTest#goodness_of_fit hangs with BigDecimal input

This is the simplest example that seems to trigger the hanging behaviour. If the BigDecimal is cast to a float then it works as expected:

require 'statistics'

expected = [BigDecimal(0.2, 1)]
observed = [3.3]

chi_squared = Statistics::StatisticalTest::ChiSquaredTest
stats = chi_squared.goodness_of_fit(0.05, expected, observed)
# Observe that the process hangs at this point.
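Until the hang is fixed, one workaround (untested against the gem itself) is to coerce the BigDecimal inputs to Floats before calling goodness_of_fit, since the report above says plain Floats work:

```ruby
require 'bigdecimal'

# Coerce up front; note this trades BigDecimal's exactness for Float precision.
expected = [BigDecimal('0.2')]
observed = [3.3]

expected_as_floats = expected.map(&:to_f)
expected_as_floats # => [0.2]
```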

Add more discrete distributions

Due to time constraints, I was only able to add two discrete distributions. The idea is to add more common discrete distributions, like the negative binomial or the geometric distribution.

Standard Deviation

Excuse me, but I would like to know if there is a function for the standard deviation. I looked in the wiki and did not find one.

Thank you in advance.

Implement KS test for one sample

The idea is to implement the one-sample KS test: given a sample and a distribution object, validate whether the sample belongs to that distribution.

Something like this:

distribution = Distribution::StandardNormal.new
samples = distribution.random(elements: 1_000, seed: 100)
# This should return null hypothesis as true,
# so we assume that the sample belongs to the specified distribution.
StatisticalTest::KSTest.one_group(samples: samples, distribution: distribution)
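For reference, the one-sample KS statistic such an API would compute is D = max |F_n(x) - F(x)|, where F_n is the empirical CDF and F is the candidate distribution's CDF. A minimal sketch; the cdf lambda stands in for a distribution object's cumulative_function:

```ruby
# KS statistic: largest gap between the empirical CDF and the candidate CDF,
# checked on both sides of each jump of the empirical CDF.
def ks_statistic(samples, cdf)
  sorted = samples.sort
  n = sorted.size.to_f
  sorted.each_with_index.map do |x, i|
    f = cdf.call(x)
    [((i + 1) / n - f).abs, (f - i / n).abs].max
  end.max
end

# Toy check against the Uniform(0,1) CDF:
uniform_cdf = ->(x) { x.clamp(0.0, 1.0) }
ks_statistic([0.1, 0.4, 0.5, 0.9], uniform_cdf) # => 0.25
```

The resulting D would then be compared against the Kolmogorov critical value for the chosen alpha to accept or reject the null hypothesis.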

ChiSquaredTest#goodness_of_fit gives ZeroDivisionError

ChiSquaredTest#goodness_of_fit raises an error with some inputs. The shortest code with which I could reproduce the error is as follows:

irb(main):001:0> require 'statistics'
=> true
irb(main):002:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.01, 477, [481, 483, 482, 488, 478, 471, 477, 479, 475, 462])
        9: from /usr/bin/irb:23:in `<main>'
        8: from /usr/bin/irb:23:in `load'
        7: from /usr/lib/ruby/gems/2.7.0/gems/irb-1.2.1/exe/irb:11:in `<top (required)>'
        6: from (irb):2
        5: from /var/lib/gems/2.7.0/gems/ruby-statistics-3.0.0/lib/statistics/statistical_test/chi_squared_test.rb:28:in `goodness_of_fit'
        4: from /var/lib/gems/2.7.0/gems/ruby-statistics-3.0.0/lib/statistics/distribution/chi_squared.rb:14:in `cumulative_function'
        3: from /var/lib/gems/2.7.0/gems/ruby-statistics-3.0.0/lib/math.rb:49:in `lower_incomplete_gamma_function'
        2: from /var/lib/gems/2.7.0/gems/ruby-statistics-3.0.0/lib/math.rb:27:in `simpson_rule'
        1: from /var/lib/gems/2.7.0/gems/ruby-statistics-3.0.0/lib/math.rb:27:in `/'
ZeroDivisionError (divided by 0)

Perhaps the problem lies in lib/math.rb:49; it looks like it causes a zero division when x < 0.5.

Thank you in advance.
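For context, and only as a guess at the failure mode (not confirmed against the gem's lib/math.rb): a composite Simpson's rule divides the interval width by the number of subintervals, so an interval count that truncates to zero with Integer operands raises exactly this kind of ZeroDivisionError. A guarded sketch:

```ruby
# Composite Simpson's rule; n must be a positive even number of subintervals.
# If n were derived from the integration bound and truncated to 0, the
# division computing h would raise ZeroDivisionError for Integer operands.
def simpson_rule(a, b, n)
  raise ArgumentError, 'n must be a positive even integer' unless n.is_a?(Integer) && n.positive? && n.even?

  h = (b - a).to_f / n
  sum = yield(a) + yield(b)
  (1...n).each { |i| sum += (i.odd? ? 4 : 2) * yield(a + i * h) }
  sum * h / 3.0
end

# Simpson's rule is exact for polynomials up to degree 3:
simpson_rule(0.0, 1.0, 10) { |x| x * x } # ≈ 1/3
```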

P-values incorrectly reported for two sided tests.

Some two-sided tests are reporting wrong p-values, and hence wrong hypothesis evaluation results. According to https://support.minitab.com/en-us/minitab/18/help-and-how-to/statistics/basic-statistics/supporting-topics/basics/manually-calculate-a-p-value/ we should calculate the p-value with the following rules:

  1. If the test assumes a lower tail, p-value = CDF(t_score). We show this calculation as probability.
  2. If the test assumes an upper tail, p-value = 1 - CDF(t_score). We show this calculation as p_value for one_tail tests.
  3. If the test is two-sided, p-value = 2 * (1 - CDF(|t_score|)). Currently we do not take the absolute value of the t_score before computing the CDF; we use the raw t_score, which is yielding the weird results reported in the p_value field.
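The three rules translate directly into code; a sketch with a generic cdf lambda standing in for a distribution's cumulative_function:

```ruby
# Rules 1-3 from the issue. The two-sided case must use |t_score|.
def p_values(t_score, cdf)
  {
    lower_tail: cdf.call(t_score),                  # rule 1, shown as probability
    upper_tail: 1.0 - cdf.call(t_score),            # rule 2, p_value for one-tail tests
    two_sided: 2.0 * (1.0 - cdf.call(t_score.abs))  # rule 3
  }
end

# With a symmetric distribution, a score and its negation must give the same
# two-sided p-value. Standard normal CDF via Math.erf:
normal_cdf = ->(x) { 0.5 * (1.0 + Math.erf(x / Math.sqrt(2.0))) }
p_values(-1.96, normal_cdf)[:two_sided] # ≈ 0.05
```

Without the absolute value, a negative t_score makes 1 - CDF(t_score) larger than 0.5, and doubling it pushes the "p-value" above 1, which matches the out-of-range reports.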

Result of Distribution::ChiSquared.cumulative_function outside the [0..1] range

I work with the following two test contingency tables (observed1, observed2) with these resulting variables:

observed1 = [[388,51692],[119,45633]]
expected1 = [[269.8969662278191, 51810.10303377218], [237.10303377218088, 45514.89696622782]]
chi_score1 = 111.0839904758887
df1 = 1

observed2 = [[388,51692],[119,45633],[271,40040]]
expected2 = [[293.3065012342283, 51786.69349876577], [257.66818441759625, 45494.3318155824], [227.02531434817544, 40083.974685651825]]
chi_score2 = 114.36002831520479
df2 = 2

When calculated with Python, I get the following results when extracting the pvalue (replace observed with observed1 or observed2):

import scipy.stats as stats
# include above variables here
X2 = stats.chi2_contingency(observed, correction=False)[0]

# observed1 => pvalue = 5.671618200219206e-26
# observed2 => pvalue = 1.469045936431957e-25

When using Distribution::ChiSquared.cumulative_function from the ruby-statistics package, I get the following results (replace df with df1 or df2, and chi_score likewise):

probability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score)
p_value = 1.0 - probability

# observed1 => p_value = Infinity
# observed2 => p_value = 1.0000333200206515

While I cannot confirm that the Python module yields the correct values, its results seem more plausible, considering that a p-value should lie in the range [0..1].
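Whatever the root cause, values like these could be caught early by a basic property check: any CDF must return values in [0, 1] and be monotone non-decreasing. A sketch; the lambdas are stand-ins, not the gem's implementation:

```ruby
# Generic sanity check over a set of probe points.
def plausible_cdf?(xs, cdf)
  values = xs.sort.map { |x| cdf.call(x) }
  in_range = values.all? { |p| p >= 0.0 && p <= 1.0 }
  monotone = values.each_cons(2).all? { |a, b| b >= a - 1e-12 }
  in_range && monotone
end

# Stand-in CDF (exponential with rate 1), probed at the chi scores above:
exponential_cdf = ->(x) { x < 0 ? 0.0 : 1.0 - Math.exp(-x) }
plausible_cdf?([0.0, 1.0, 111.08, 114.36], exponential_cdf) # => true

# A function that escapes [0, 1] fails the check:
plausible_cdf?([0.0, 0.5, 2.0], ->(x) { x }) # => false
```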

Crystal programming support

It would be great if you could extend this library to the Crystal programming language, for a free performance boost and lower memory consumption.

Significance tests

Add a couple of significance tests, so we can have a pretty robust and interesting statistics gem :)

  • Chi-squared goodness-of-fit test.
  • Paired t-tests.
  • Wilcoxon rank-sum test.
  • Spearman's rank correlation (optional, but desired). (Extracted to #17)

p-values for T-test are not accurate enough

While building the paired t-test, I found that the p-values (1 - P(x <= X)) calculated using the Student's t distribution fall short of the expected values. A couple of tests show a difference in the second decimal place from the expected value:

I traced this back to the way Ruby handles decimal values: the numbers calculated by Minitab or R differ from the ones calculated by Ruby itself.

I'm logging this as a bug, but it's not related to the code itself (😅); I'm logging it for future reference.
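The rounding behaviour in question is a property of IEEE 754 Floats generally, not of this gem; it can be reproduced directly:

```ruby
require 'bigdecimal'

# 0.1 and 0.2 have no exact binary representation, so their Float sum
# misses 0.3 by a tiny amount that can compound over many operations.
sum = 0.1 + 0.2
sum == 0.3              # => false
(sum - 0.3).abs < 1e-15 # => true

# BigDecimal arithmetic stays exact for decimal inputs:
BigDecimal('0.1') + BigDecimal('0.2') == BigDecimal('0.3') # => true
```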

Write F-test/One way ANOVA

  • Implement the F-Test for two samples (Implemented as ANOVA F Score).
  • Implement the One way ANOVA for more than two samples.

Aliasing Distribution Causes Naming Collision in Main App

I have a Rails app that I just upgraded to Ruby 2.7.4, which appears to install ruby-statistics as a dependency. The application defines a model/namespace class Distribution. When running bundle exec rails console (or server), this gem loads before that class is defined and creates an alias named Distribution in the main app namespace, here:

if defined?(Statistics) && !(defined?(Distribution))

This causes the app to fail to load due to naming conflicts. Removing this line allows my app to load properly.
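One way to avoid the collision, sketched without reference to the gem's internals, is to skip the top-level alias entirely and refer to the gem's constants by fully qualified names; the Normal class below is a stand-in, not the gem's implementation:

```ruby
# Hypothetical host application: it owns the top-level Distribution constant.
class Distribution
  def self.kind
    'application model'
  end
end

# Gem code stays namespaced under Statistics; no top-level alias is created,
# so the host's Distribution is untouched.
module Statistics
  module Distribution
    class Normal
      attr_reader :mean, :sd

      def initialize(mean, sd)
        @mean = mean
        @sd = sd
      end
    end
  end
end

Distribution.kind                                 # => "application model"
Statistics::Distribution::Normal.new(0.0, 1.0).sd # => 1.0
```

An opt-in shorthand (e.g. only aliasing when the host explicitly asks for it) would preserve the convenience without the load-order hazard.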

Paired T test minimal sample size?

sample1 = [1.0, 2.0, 3.0]
sample2 = [2.0, 3.0, 4.0]
StatisticalTest::TTest.paired_test(alpha = 0.05, :one_tail, sample1, sample2)

=> TypeError: nil can't be coerced into Fixnum

sample1 = [1.0, 2.0, 3.0, 4.0]
sample2 = [2.0, 3.0, 4.0, 3.0]
StatisticalTest::TTest.paired_test(alpha = 0.05, :one_tail, sample1, sample2)

=> {:probability=>0.1955011094781139, :p_value=>0.8044988905218862, :alpha=>0.05, :null=>true, :alternative=>false, :confidence_level=>0.95}

It seems the smallest sample size that can be used is 4, since the three-element samples above raise an error. Is this intentional, or am I missing something?
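Worth noting: in the failing three-element example every pairwise difference is exactly -1.0, so the variance of the differences is zero; the error may stem from that degenerate case rather than from sample size, since the paired t statistic itself is well defined for n = 3 (df = n - 1 = 2). A manual sketch:

```ruby
# Paired t statistic: t = mean(d) / sqrt(var(d) / n), d = pairwise differences.
def paired_t_statistic(sample1, sample2)
  diffs = sample1.zip(sample2).map { |a, b| a - b }
  n = diffs.size
  mean = diffs.sum / n
  var = diffs.sum { |d| (d - mean)**2 } / (n - 1)
  if var.zero?
    # Degenerate case: every difference is identical (as in the failing example).
    return mean.zero? ? Float::NAN : (mean.positive? ? Float::INFINITY : -Float::INFINITY)
  end

  mean / Math.sqrt(var / n)
end

paired_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])           # => -Infinity (zero variance)
paired_t_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 3.0]) # => -1.0
```

For the four-element samples this gives t = -1.0 with df = 3, which is consistent with the probability of about 0.1955 reported above.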
