montanaflynn / stats
A well tested and comprehensive Golang statistics library package with no dependencies.
License: MIT License
stats.Mode([]float64{5, 5, 3, 3, 4, 4, 2, 2, 1, 1}) should return []float64{}, just as stats.Mode([]float64{5, 3, 4, 2, 1}) does.
Hi, when I run Fisher's exact test, the printed p-value is different from the result's p-value.
For example, the printout shows p-value < 2.2e-16, but jk$p.value is 1.899826e-32.
Is there anything wrong in the function? Thank you.
> fm_dn
Down No
Yes 281 326
No 4074 13090
jk = janitor::fisher.test(fm_dn)
> jk
Fisher's Exact Test for Count Data
data: fm_dn
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.343308 3.271687
sample estimates:
odds ratio
2.769178
> jk$p.value
[1] 1.899826e-32
Hi,
First of all, congratulations and thanks for your work.
Using the function to calculate the mode, I've found that when the mode is a low value compared with the rest and the data array is relatively long, an incorrect result occurs:
Example:
var data = []float64{1, 2, 3, 4, 4, 4, 4, 4, 5, 3, 6, 7, 5, 0, 8, 8, 7, 6, 9, 9}
mode, _ := stats.Mode(data)
fmt.Printf("%v\n", mode)
// Result: [4 8]
As we can see, the result is incorrect: the function should return [4].
After analyzing the results and studying the function code, I've located the problem, and this is the fix:
File: https://github.com/montanaflynn/stats/blob/master/mode.go
package stats

// Mode gets the mode [most frequent value(s)] of a slice of float64s
func Mode(input Float64Data) (mode []float64, err error) {
	// Return the input if there's only one number
	l := input.Len()
	if l == 1 {
		return input, nil
	} else if l == 0 {
		return nil, EmptyInput
	}
	c := sortedCopyDif(input)
	// Traverse sorted array,
	// tracking the longest repeating sequence
	mode = make([]float64, 5)
	cnt, maxCnt := 1, 1
	for i := 1; i < l; i++ {
		switch {
		case c[i] == c[i-1]:
			cnt++
		case cnt == maxCnt && maxCnt != 1:
			mode = append(mode, c[i-1])
			cnt = 1
		case cnt > maxCnt:
			mode = append(mode[:0], c[i-1])
			maxCnt, cnt = cnt, 1
		// :: the fix - reset the counter ::
		default:
			cnt = 1
		// :: end fix ::
		}
	}
	switch {
	case cnt == maxCnt:
		mode = append(mode, c[l-1])
	case cnt > maxCnt:
		mode = append(mode[:0], c[l-1])
		maxCnt = cnt
	}
	// Since length must be greater than 1,
	// check for slices of distinct values
	if maxCnt == 1 {
		return Float64Data{}, nil
	}
	return mode, nil
}
I don't know if the solution convinces you, but it works. If it looks good to you, I can make the changes and submit a pull request to correct it soon.
It's in the README, but it's not in the package.
This package is used by Gitea, which is a package I'm currently working to get into Debian. This means I get to review all build dependencies. While working through your project, I saw that tags are used to mark releases, but they are not annotated.
Unannotated release tags end up causing headaches for packaging systems that monitor upstream activity (mostly for new releases), because the information is missing from git describe. To annotate a tag, it just needs the -a flag (git tag -a).
If you're willing to, it's possible to update the current tags (or just latest) with annotation. I've included some links [1] [2] that explain the process.
If you choose not to update tags, it would still be hugely appreciated if you could use annotated tags in the future.
[1] http://sartak.org/2011/01/replace-a-lightweight-git-tag-with-an-annotated-tag.html
[2] http://stackoverflow.com/questions/5002555/can-a-lightweight-tag-be-converted-to-an-annotated-tag
Nice package, I am using it right now, and I found an inconsistency while calculating the quartiles. Is there any reason why we must pass the data/input to calculate Quartile instead of using the instance? If there is no specific reason, I suggest adding a Quartiles
method on the Float64Data struct that takes no input and uses the current instance, like Mean(), Max(), etc.
Suggestion:
func (f Float64Data) Quartiles() (Quartiles, error) {
return Quartile(f)
}
If this is possible, I will make the MR.
When running the test suite on s390x and ppc64le architectures, I get the following output:
go test -compiler gc -ldflags '' github.com/montanaflynn/stats
--- FAIL: TestCorrelation (0.00s)
correlation_test.go:33: Correlation 0.9912407071619304 != 0.9912407071619302
correlation_test.go:47: Correlation 0.9912407071619304 != 0.9912407071619302
--- FAIL: TestOtherDataMethods (0.00s)
data_test.go:22: github.com/montanaflynn/stats.(Float64Data).Correlation-fm() => 0.2087547359760545 != 0.20875473597605448
data_test.go:22: github.com/montanaflynn/stats.(Float64Data).Pearson-fm() => 0.2087547359760545 != 0.20875473597605448
data_test.go:22: github.com/montanaflynn/stats.(Float64Data).Covariance-fm() => 7.381421553571428 != 7.3814215535714265
data_test.go:22: github.com/montanaflynn/stats.(Float64Data).CovariancePopulation-fm() => 6.458743859374999 != 6.458743859374998
--- FAIL: TestLinearRegression (0.00s)
regression_test.go:19: [{1 2.380000000000002} {2 3.080000000000001} {3 3.7800000000000002} {4 4.4799999999999995} {5 5.179999999999999}] != 2.3800000000000026
regression_test.go:23: [{1 2.380000000000002} {2 3.080000000000001} {3 3.7800000000000002} {4 4.4799999999999995} {5 5.179999999999999}] != 3.0800000000000014
regression_test.go:31: [{1 2.380000000000002} {2 3.080000000000001} {3 3.7800000000000002} {4 4.4799999999999995} {5 5.179999999999999}] != 4.479999999999999
regression_test.go:35: [{1 2.380000000000002} {2 3.080000000000001} {3 3.7800000000000002} {4 4.4799999999999995} {5 5.179999999999999}] != 5.179999999999998
--- FAIL: TestLogarithmicRegression (0.00s)
regression_test.go:94: [{1 2.152082236381168} {2 3.330555922249221} {3 4.019918836568675} {4 4.509029608117274} {5 4.8884133966836645}] != 2.1520822363811702
regression_test.go:98: [{1 2.152082236381168} {2 3.330555922249221} {3 4.019918836568675} {4 4.509029608117274} {5 4.8884133966836645}] != 3.3305559222492214
regression_test.go:102: [{1 2.152082236381168} {2 3.330555922249221} {3 4.019918836568675} {4 4.509029608117274} {5 4.8884133966836645}] != 4.019918836568674
regression_test.go:106: [{1 2.152082236381168} {2 3.330555922249221} {3 4.019918836568675} {4 4.509029608117274} {5 4.8884133966836645}] != 4.509029608117273
regression_test.go:110: [{1 2.152082236381168} {2 3.330555922249221} {3 4.019918836568675} {4 4.509029608117274} {5 4.8884133966836645}] != 4.888413396683663
FAIL
FAIL github.com/montanaflynn/stats 0.003s
I also opened a similar bug report for x/image with a bug that seems to be related to this one golang/go#21460
In addition, the tests also fail for i686 architectures, with a different output:
go test -compiler gc -ldflags '' github.com/montanaflynn/stats
--- FAIL: TestLogarithmicRegression (0.00s)
regression_test.go:94: [{1 2.1520822363811654} {2 3.3305559222492205} {3 4.019918836568676} {4 4.509029608117276} {5 4.888413396683665}] != 2.1520822363811702
regression_test.go:98: [{1 2.1520822363811654} {2 3.3305559222492205} {3 4.019918836568676} {4 4.509029608117276} {5 4.888413396683665}] != 3.3305559222492214
regression_test.go:102: [{1 2.1520822363811654} {2 3.3305559222492205} {3 4.019918836568676} {4 4.509029608117276} {5 4.888413396683665}] != 4.019918836568674
regression_test.go:106: [{1 2.1520822363811654} {2 3.3305559222492205} {3 4.019918836568676} {4 4.509029608117276} {5 4.888413396683665}] != 4.509029608117273
regression_test.go:110: [{1 2.1520822363811654} {2 3.3305559222492205} {3 4.019918836568676} {4 4.509029608117276} {5 4.888413396683665}] != 4.888413396683663
FAIL
FAIL github.com/montanaflynn/stats 0.005s
Note that this does not seem to be related to the issue mentioned above for x/image.
Example:
stats.Percentile([]float64{19.39}, 90)
returns the error: Input is outside of range.
I think it should return 19.39.
Since the public API isn't finalized I've been suggesting to simply clone and vendor into your projects but I'd like others to be able to take advantage of tools like godep, glide, gopkg.in to install stats into their projects.
How can we best release changes to stats? I'd like to automate it if possible, as of now I'm building the CHANGELOG.md and git tagging manually which is slow and error prone.
Does anyone have experience with releasing packages into the Golang ecosystem?
Hello, I wanted to practice some of the Go programming language while contributing to an open source package. I was able to write some functions and test them using the provided Makefile. However, I had trouble getting the packages for the changelog / documentation .md file updates:
go get github.com/davecheney/godoc2md
go get github.com/golangci/golangci-lint/cmd/golangci-lint
I ran the first command and got some errors, which I believe caused the errors when running the second command. I checked the first repository and it looks like it is no longer being developed.
I've never contributed in this way and don't have experience in making the edits to the markdown files, but am trying to learn more of all things git and software.
Even though I've spent a lot of time writing tests for stats, I think it could benefit from incorporating more mature test suites from other statistics tools as well. For instance, here's a NIST test suite used by GNU GSL which could be ported to Go and added as a test for stats.
I'm sure there are other test suites as well, let me know if you have suggestions or want to help with this!
I have a feeling it might be possible to use an interface to support both []float64 and []int data. However, I've not designed public interfaces or worked around the lack of generics myself, so I'll either have to do some research and hacking, or have the excellent community of Gophers help in this area or tell me my attempts will be futile. Either way, any feedback is appreciated!
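As a hedged sketch of one possibility: type parameters, added in Go 1.18 (long after this issue was opened), can support both slice types without an interface. This is illustrative only, not the library's API:

```go
package main

import "fmt"

// Number is a constraint covering the numeric element types the issue
// mentions. Sketch using Go 1.18+ type parameters, which did not exist
// when this issue was opened; the library itself only accepts float64.
type Number interface {
	~int | ~int64 | ~float64
}

// Mean works for any slice of Number without converting to []float64 first.
func Mean[T Number](input []T) (float64, error) {
	if len(input) == 0 {
		return 0, fmt.Errorf("input must not be empty")
	}
	var sum float64
	for _, v := range input {
		sum += float64(v)
	}
	return sum / float64(len(input)), nil
}

func main() {
	mi, _ := Mean([]int{1, 2, 3, 4})
	mf, _ := Mean([]float64{1.5, 2.5})
	fmt.Println(mi, mf) // 2.5 2
}
```

The same constraint could be reused for Median, Sum, and so on, keeping one implementation per function instead of one per element type.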
I think it could be great to have a Describe() function like pandas.describe().
Hello Flynn. I tried to use your autocorrelation function, but found a strange thing: it returns very similar, incorrect values for some lags.
Example:
We have the following sequence:
[22, 24, 25, 25, 28, 29, 34, 37, 40, 44, 51, 48, 47, 50, 51]
In this case we must obtain the following sequence of values of the autocorrelation function:
[1, 0.83174224, 0.65632458, 0.49105012, 0.27863962, 0.03102625, -0.16527446, -0.30369928, -0.40095465, -0.45823389, -0.45047733]
For lags 0,1,2,3... respectively.
But your function returns this:
[0, 0.8317422434367543, 0.8871917263325378, 0.8908883585255901, 0.8911348006717935,...]
for i := 0; i < lags; i++ {
	v := (data[0] - mean) * (data[0] - mean)
	for i := 1; i < len(data); i++ {
		delta0 := data[i-1] - mean
		delta1 := data[i] - mean
		q += (delta0*delta1 - q) / float64(i+1)
		v += (delta1*delta1 - v) / float64(i+1)
	}
	result = q / v
}
And in your function there is this strange loop over the lag range, in which the inner loop shadows the loop variable i and the value of result is overwritten on every iteration.
The percentile function implements neither the Nearest Rank method, nor the Weighted Percentile method, nor the NIST method, all defined here: http://en.wikipedia.org/wiki/Percentile
Which is the preferred method? I was thinking about fixing this myself; which would you prefer?
How would one sample from a normal distribution using this package?
Thanks in advance.
Using the library and getting error messages that Input is outside of range, which turned out to be because the data array had a single value (which I fully realise is not suitable for a percentile calc!). Would it be possible to change the error message to be more informative, or add a documentation item?
To reproduce:
package main

import (
	"fmt"

	"github.com/montanaflynn/stats"
)

func main() {
	points1 := []float64{1.0, 2.0, 3.0}
	points2 := []float64{123.0}
	sa1 := stats.LoadRawData(points1)
	sa2 := stats.LoadRawData(points2)
	if vPerc1, err := stats.Percentile(sa1, 90.0); err == nil {
		fmt.Printf("Calc'd 90th percentile for multivalue, it's %f\n", vPerc1)
	} else {
		fmt.Printf("Error calculating 90th percentile for multivalue, err: %s\n", err.Error())
	}
	if vPerc2, err := stats.Percentile(sa2, 90.0); err == nil {
		fmt.Printf("Calc'd 90th percentile for single value, it's %f\n", vPerc2)
	} else {
		fmt.Printf("Error calculating 90th percentile for single value, err: %s\n", err.Error())
	}
}
which gives output of:
Calc'd 90th percentile for multivalue, it's 2.500000
Error calculating 90th percentile for single value, err: Input is outside of range.
Would love to have some discussion on the Public API. Specifically I want to know if having types with methods in addition to the functions makes sense. I think it does but maybe having two ways to do something is confusing to some. Here's what I mean by two ways of doing things:
var data = []float64{1, 2, 3, 4, 4, 5}
median, _ := stats.Median(data)
fmt.Println(median) // 3.5
var d stats.Float64Data = data
median, _ = d.Median()
fmt.Println(median) // 3.5
Please share your thoughts with me!
Given a slice a := []float64{0, 300, 600},
stats.Percentile(a, 50)
should return 300. However, it returns 150.
Are there any plans to implement Spearman correlation? Thanks.
Hi, I wondered if I could contribute by adding single-pass descriptive stats for people working with large datasets. These would simply return mean, sdev, var, min, max, and correlation: all the things you have, but for situations where a Float64Data would be too big.
We use your library for our open-source photo app, in particular the DBSCAN implementation. Thanks for providing it!
While it works great for me, a developer reported issues with panics in dbscan.go, line 251. It seems w.Done() may be called too often, probably depending on the input data. I couldn't reproduce it with my local samples.
Our related code and GitHub issue:
Trace:
panic: sync: negative WaitGroup counter
goroutine 3326 [running]:
sync.(*WaitGroup).Add(0xc004761070, 0xffffffffffffffff)
/usr/local/go/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
/usr/local/go/src/sync/waitgroup.go:99
github.com/mpraski/clusters.(*dbscanClusterer).nearestWorker(0xc000cce3c0)
/go/pkg/mod/github.com/mpraski/[email protected]/dbscan.go:251 +0x230
created by github.com/mpraski/clusters.(*dbscanClusterer).startNearestWorkers
/go/pkg/mod/github.com/mpraski/[email protected]/dbscan.go:228
See: https://en.wikipedia.org/wiki/Percentile#The_weighted_percentile_method
My current code is as follows, but I don't know how to support the weighted percentile method.
func TestPercentile(t *testing.T) {
	values := []float64{4, 5, 3, 1, 2}
	percentiles := []float64{}
	for i := 1; i <= 100; i++ {
		percentile, err := stats.PercentileNearestRank(values, float64(i))
		if err != nil {
			panic(err)
		}
		percentiles = append(percentiles, percentile)
		fmt.Printf("%d%%: %f, ", i, percentile)
	}
	println()
	println()
	for f := 0.0; f <= 5; f += 0.1 {
		index := sort.SearchFloat64s(percentiles, f+0.00000001)
		fmt.Printf("%f: %d%%, ", f, index)
	}
	println()
	println()
}
Could you please tag v0.3.0 so that vgo can find it and use it?
See: https://research.swtch.com/vgo
Thank you.
Thank you for this great package.
I added support for a string and an io.Reader in LoadRawData() so it supports whitespace-separated strings, i.e.:
stats.LoadRawData("1.1 2 3.0 4 5")
// or
stats.LoadRawData(os.Stdin)
Is this something you would consider implementing in your package? If so, I can create a pull request.
Line 26 in 8e3445d
It should be: index := (percent/100)*float64(len(c)-1) + 1
according to this percentile calculator reference: https://www.calculatorsoup.com/calculators/statistics/percentile-calculator.php
r = (p/100) * (n - 1) + 1
To reproduce the bug, try this dataset:
0.67, 0.999, 1
percent = 25
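To illustrate, here is a hedged sketch of the proposed formula applied to that dataset (percentileLinear is a hypothetical name, not the library's code):

```go
package main

import (
	"fmt"
	"sort"
)

// percentileLinear illustrates the proposed r = (p/100)*(n-1) + 1 rank
// formula with linear interpolation between neighbouring values. It is a
// sketch of the suggested fix, not the library's actual implementation.
func percentileLinear(input []float64, percent float64) float64 {
	c := append([]float64(nil), input...) // work on a sorted copy
	sort.Float64s(c)
	r := percent/100*float64(len(c)-1) + 1 // 1-based rank
	i := int(r)                            // integer part of the rank
	if i >= len(c) {
		return c[len(c)-1]
	}
	frac := r - float64(i) // fractional part used for interpolation
	return c[i-1] + frac*(c[i]-c[i-1])
}

func main() {
	// For {0.67, 0.999, 1} at percent 25: r = 0.25*2 + 1 = 1.5, so the
	// result is halfway between 0.67 and 0.999.
	fmt.Println(percentileLinear([]float64{0.67, 0.999, 1}, 25))
}
```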
I believe there are some errors with the Percentile() edge cases.
Passing 0 as the percent will cause an error, as will a small set of data and a small percentage (such that c[i-1] is out of bounds because i = 0).
I'm not sure of the best approach to fix this, as picking which index is quite critical for the correct result, but I think this might work?
index := (percent / 100) * float64(len(c) - 1)
And then use c[i] and c[i+1] later on. Using c[i+1] would be dangerous with input.Len() == 1, and maybe in the case of 99.9 percent and few values?
Hi,
I am not able to use the MedianAbsoluteDeviationPopulation function.
If I use go doc, I do not see all the functions:
$ go doc stats
package stats // import "github.com/zizmos/ego/vendor/github.com/montanaflynn/stats"
func Correlation(data1, data2 Float64Data) (float64, error)
func Covariance(data1, data2 Float64Data) (float64, error)
func GeometricMean(input Float64Data) (float64, error)
func HarmonicMean(input Float64Data) (float64, error)
func InterQuartileRange(input Float64Data) (float64, error)
func Max(input Float64Data) (max float64, err error)
func Mean(input Float64Data) (float64, error)
func Median(input Float64Data) (median float64, err error)
func Midhinge(input Float64Data) (float64, error)
func Min(input Float64Data) (min float64, err error)
func Mode(input Float64Data) (mode []float64, err error)
func Percentile(input Float64Data, percent float64) (percentile float64, err error)
func PercentileNearestRank(input Float64Data, percent float64) (percentile float64, err error)
func PopulationVariance(input Float64Data) (pvar float64, err error)
func Round(input float64, places int) (rounded float64, err error)
func Sample(input Float64Data, takenum int, replacement bool) ([]float64, error)
func SampleVariance(input Float64Data) (svar float64, err error)
func StandardDeviation(input Float64Data) (sdev float64, err error)
func StandardDeviationPopulation(input Float64Data) (sdev float64, err error)
func StandardDeviationSample(input Float64Data) (sdev float64, err error)
func StdDevP(input Float64Data) (sdev float64, err error)
func StdDevS(input Float64Data) (sdev float64, err error)
func Sum(input Float64Data) (sum float64, err error)
func Trimean(input Float64Data) (float64, error)
func VarP(input Float64Data) (sdev float64, err error)
func VarS(input Float64Data) (sdev float64, err error)
func Variance(input Float64Data) (sdev float64, err error)
type Coordinate struct{ ... }
func ExpReg(s []Coordinate) (regressions []Coordinate, err error)
func LinReg(s []Coordinate) (regressions []Coordinate, err error)
func LogReg(s []Coordinate) (regressions []Coordinate, err error)
type Float64Data []float64
type Outliers struct{ ... }
func QuartileOutliers(input Float64Data) (Outliers, error)
type Quartiles struct{ ... }
func Quartile(input Float64Data) (Quartiles, error)
type Series []Coordinate
func ExponentialRegression(s Series) (regressions Series, err error)
func LinearRegression(s Series) (regressions Series, err error)
func LogarithmicRegression(s Series) (regressions Series, err error)
However, MedianAbsoluteDeviationPopulation function is a public function in the implementation.
$ go version
go version go1.10.2 darwin/amd64
$ dep status
....
....
github.com/montanaflynn/stats ^0.2.0 0.2.0 eeaced0 0.2.0 1
...
...
Hi, I was thinking of implementing a t-test; should I add it as a pull request?
All of the methods that call sort are not documented to specify that the input will be modified.
The source code from go get github.com/montanaflynn/stats on my project with Go modules is different from the source you have in your GitHub repo. It shows me stats@v0.5.0, but the changelog for v0.5.0 isn't in the code I have received. Here is the go.sum:
github.com/montanaflynn/stats v0.5.0 h1:2EkzeTSqBB4V4bJwWrt5gIIrZmpJBcoIRGS2kWLgzmk= github.com/montanaflynn/stats v0.5.0/go.mod h1:wL8QJuTMNUDYhXwkmfOly8iTdp5TEcJFWZD2D7SIkUc=
Hi,
I found something in a paper which shows we can calculate the SD of a dataset from the SDs of its subsets :)
e.g.: NEW_FUNCTION(SD({1,3}), SD(5), medians) = SD({1,3,5})
The use case is where we have calculated the SD of 1000 rows of data and want to calculate the SD of 1000 + 1 rows without reprocessing the former data.
Is this implemented in this library? I didn't see such a thing.
If it seems proper, let me know and I'll submit the pull request.
Thanks
[{0 2.5} {1 5} {4 25} {6 5} {8 5} {11 15} {12 2.5} {14 25} {15 0} {16 1.6666666666666667} {17 40} {20 5} {21 15} {22 20} {23 16.666666666666668} {24 13.333333333333334} {25 50} {26 18} {27 75} {28 21} {29 0} {30 5} {31 37.5} {32 5} {34 40} {36 5} {37 39}]
When running exponential regression with the series above, the result is:
[{0 NaN} {1 NaN} {4 NaN} {6 NaN} {8 NaN} {11 NaN} {12 NaN} {14 NaN} {15 NaN} {16 NaN} {17 NaN} {20 NaN} {21 NaN} {22 NaN} {23 NaN} {24 NaN} {25 NaN} {26 NaN} {27 NaN} {28 NaN} {29 NaN} {30 NaN} {31 NaN} {32 NaN} {34 NaN} {36 NaN} {37 NaN}]
Is this desired? If so, could you maybe elaborate on why this happens?
Instead of the error values being generated within the functions, globally-defined errors would allow error handling in the calling function to be done without parsing the error string.
This error is generated in many places, but there's no way to compare two error values without comparing the string itself. Additionally, there's no guarantee that two "Empty input errors" have the exact same error string.
if input.Len() == 0 {
	return 0, errors.New("Input must not be empty")
}
If all of those errors are collected into a set of exported values:
type StatErr struct {
	err string
}

func (s StatErr) Error() string {
	return s.err
}

// Go does not allow struct constants, so these are package-level vars.
var (
	EmptyArrayError = StatErr{"Empty input can't be processed"}
	...
)
That way an error value can be identified by these constants:
In the code:
if input.Len() == 0 {
	return 0, EmptyArrayError
}
In the call:
v, err := stat.Mean(input)
if err == stat.EmptyArrayError {
	// Handle the specific error
}
This would clean up the error returns in the library and make error handling easier for anyone using it.