montanaflynn / stats
A well tested and comprehensive Golang statistics library package with no dependencies.
License: MIT License
stats.Mode([]float64{5, 5, 3, 3, 4, 4, 2, 2, 1, 1}) should return []float64{}, just as stats.Mode([]float64{5, 3, 4, 2, 1}) does.
Hi, when I run Fisher's exact test, the printed p-value is different from the result's p-value.
For example, the printout shows p-value < 2.2e-16, but jk$p.value is 1.899826e-32.
Is there anything wrong in the function? Thank you.
> fm_dn
Down No
Yes 281 326
No 4074 13090
jk = janitor::fisher.test(fm_dn)
> jk
Fisher's Exact Test for Count Data
data: fm_dn
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.343308 3.271687
sample estimates:
odds ratio
2.769178
> jk$p.value
[1] 1.899826e-32
Hi,
First of all, congratulations and thanks for your work.
Using the function to calculate the mode, I've found that when the mode is a low value compared with the rest and the data array is relatively long, an incorrect result occurs:
Example:
var data = []float64{1, 2, 3, 4, 4, 4, 4, 4, 5, 3, 6, 7, 5, 0, 8, 8, 7, 6, 9, 9}
mode, _ := stats.Mode(data)
fmt.Printf("%v\n", mode)
// Result: [4 8]
As we can see, the result is incorrect: the function should return [4].
After analyzing the results and studying the function code, I've located the problem, and this is the fix:
File: https://github.com/montanaflynn/stats/blob/master/mode.go
package stats

// Mode gets the mode [most frequent value(s)] of a slice of float64s
func Mode(input Float64Data) (mode []float64, err error) {
	// Return the input if there's only one number
	l := input.Len()
	if l == 1 {
		return input, nil
	} else if l == 0 {
		return nil, EmptyInput
	}
	c := sortedCopyDif(input)
	// Traverse sorted array,
	// tracking the longest repeating sequence
	mode = make([]float64, 5)
	cnt, maxCnt := 1, 1
	for i := 1; i < l; i++ {
		switch {
		case c[i] == c[i-1]:
			cnt++
		case cnt == maxCnt && maxCnt != 1:
			mode = append(mode, c[i-1])
			cnt = 1
		case cnt > maxCnt:
			mode = append(mode[:0], c[i-1])
			maxCnt, cnt = cnt, 1
		// :: the fix - reset the counter ::
		default:
			cnt = 1
		// :: end fix ::
		}
	}
	switch {
	case cnt == maxCnt:
		mode = append(mode, c[l-1])
	case cnt > maxCnt:
		mode = append(mode[:0], c[l-1])
		maxCnt = cnt
	}
	// Since length must be greater than 1,
	// check for slices of distinct values
	if maxCnt == 1 {
		return Float64Data{}, nil
	}
	return mode, nil
}
I don't know if the solution convinces you, but it works. If it looks good to you, I can make the changes and submit a pull request to correct it soon.
It's in the README, but it's not in the package.
This package is used by Gitea, which is a package I'm currently working to get into Debian. This means I get to review all build dependencies. While working through your project, I saw that tags are used to mark releases, but they are not annotated.
Unannotated release tags end up causing headaches for packaging systems that monitor upstream activity (mostly for new releases), because the information is missing from git describe. To annotate a tag, it just needs the -a flag (git tag -a).
If you're willing to, it's possible to update the current tags (or just latest) with annotation. I've included some links [1] [2] that explain the process.
If you choose not to update tags, it would still be hugely appreciated if you could use annotated tags in the future.
[1] http://sartak.org/2011/01/replace-a-lightweight-git-tag-with-an-annotated-tag.html
[2] http://stackoverflow.com/questions/5002555/can-a-lightweight-tag-be-converted-to-an-annotated-tag
Nice package, I am using it right now, and I found an inconsistency while calculating the quartiles. Is there any reason why we must pass the data/input to calculate Quartile instead of using the instance? If there is no specific reason, I suggest adding a Quartiles
method on the Float64Data struct that takes no input and uses the current instance, like Mean(), Max(), etc.
Suggestion:
func (f Float64Data) Quartiles() (Quartiles, error) {
return Quartile(f)
}
If this is possible, I will make the MR.
When running the test suite on s390x and ppc64le architectures, I get the following output:
go test -compiler gc -ldflags '' github.com/montanaflynn/stats
--- FAIL: TestCorrelation (0.00s)
correlation_test.go:33: Correlation 0.9912407071619304 != 0.9912407071619302
correlation_test.go:47: Correlation 0.9912407071619304 != 0.9912407071619302
--- FAIL: TestOtherDataMethods (0.00s)
data_test.go:22: github.com/montanaflynn/stats.(Float64Data).Correlation-fm() => 0.2087547359760545 != 0.20875473597605448
data_test.go:22: github.com/montanaflynn/stats.(Float64Data).Pearson-fm() => 0.2087547359760545 != 0.20875473597605448
data_test.go:22: github.com/montanaflynn/stats.(Float64Data).Covariance-fm() => 7.381421553571428 != 7.3814215535714265
data_test.go:22: github.com/montanaflynn/stats.(Float64Data).CovariancePopulation-fm() => 6.458743859374999 != 6.458743859374998
--- FAIL: TestLinearRegression (0.00s)
regression_test.go:19: [{1 2.380000000000002} {2 3.080000000000001} {3 3.7800000000000002} {4 4.4799999999999995} {5 5.179999999999999}] != 2.3800000000000026
regression_test.go:23: [{1 2.380000000000002} {2 3.080000000000001} {3 3.7800000000000002} {4 4.4799999999999995} {5 5.179999999999999}] != 3.0800000000000014
regression_test.go:31: [{1 2.380000000000002} {2 3.080000000000001} {3 3.7800000000000002} {4 4.4799999999999995} {5 5.179999999999999}] != 4.479999999999999
regression_test.go:35: [{1 2.380000000000002} {2 3.080000000000001} {3 3.7800000000000002} {4 4.4799999999999995} {5 5.179999999999999}] != 5.179999999999998
--- FAIL: TestLogarithmicRegression (0.00s)
regression_test.go:94: [{1 2.152082236381168} {2 3.330555922249221} {3 4.019918836568675} {4 4.509029608117274} {5 4.8884133966836645}] != 2.1520822363811702
regression_test.go:98: [{1 2.152082236381168} {2 3.330555922249221} {3 4.019918836568675} {4 4.509029608117274} {5 4.8884133966836645}] != 3.3305559222492214
regression_test.go:102: [{1 2.152082236381168} {2 3.330555922249221} {3 4.019918836568675} {4 4.509029608117274} {5 4.8884133966836645}] != 4.019918836568674
regression_test.go:106: [{1 2.152082236381168} {2 3.330555922249221} {3 4.019918836568675} {4 4.509029608117274} {5 4.8884133966836645}] != 4.509029608117273
regression_test.go:110: [{1 2.152082236381168} {2 3.330555922249221} {3 4.019918836568675} {4 4.509029608117274} {5 4.8884133966836645}] != 4.888413396683663
FAIL
FAIL github.com/montanaflynn/stats 0.003s
I also opened a similar bug report for x/image with a bug that seems to be related to this one golang/go#21460
In addition, the tests also fail for i686 architectures, with a different output:
go test -compiler gc -ldflags '' github.com/montanaflynn/stats
--- FAIL: TestLogarithmicRegression (0.00s)
regression_test.go:94: [{1 2.1520822363811654} {2 3.3305559222492205} {3 4.019918836568676} {4 4.509029608117276} {5 4.888413396683665}] != 2.1520822363811702
regression_test.go:98: [{1 2.1520822363811654} {2 3.3305559222492205} {3 4.019918836568676} {4 4.509029608117276} {5 4.888413396683665}] != 3.3305559222492214
regression_test.go:102: [{1 2.1520822363811654} {2 3.3305559222492205} {3 4.019918836568676} {4 4.509029608117276} {5 4.888413396683665}] != 4.019918836568674
regression_test.go:106: [{1 2.1520822363811654} {2 3.3305559222492205} {3 4.019918836568676} {4 4.509029608117276} {5 4.888413396683665}] != 4.509029608117273
regression_test.go:110: [{1 2.1520822363811654} {2 3.3305559222492205} {3 4.019918836568676} {4 4.509029608117276} {5 4.888413396683665}] != 4.888413396683663
FAIL
FAIL github.com/montanaflynn/stats 0.005s
Note that this does not seem to be related to the issue mentioned above for x/image.
Example:
stats.Percentile([]float64{19.39}, 90)
returns the error: Input is outside of range.
I think it should return 19.39.
Since the public API isn't finalized I've been suggesting to simply clone and vendor into your projects but I'd like others to be able to take advantage of tools like godep, glide, gopkg.in to install stats into their projects.
How can we best release changes to stats? I'd like to automate it if possible, as of now I'm building the CHANGELOG.md and git tagging manually which is slow and error prone.
Does anyone have experience with releasing packages into the Golang ecosystem?
Hello, I wanted to practice some of the Go programming language while contributing to an open source package. I was able to write some functions and test them using the provided Makefile. However, I had trouble getting the packages for the changelog / documentation .md file updates:
go get github.com/davecheney/godoc2md
go get github.com/golangci/golangci-lint/cmd/golangci-lint
I ran the first command and got some errors, which I believe caused the errors when running the second command. I checked the first repository and it looks like it is no longer being developed.
I've never contributed in this way and don't have experience in making the edits to the markdown files, but am trying to learn more of all things git and software.
Even though I've spent a lot of time writing tests for stats, I think it could benefit from incorporating more mature test suites from other statistics tools as well. For instance, here's a NIST test suite used by GNU GSL which could be ported to Go and added as a test for stats.
I'm sure there are other test suites as well, let me know if you have suggestions or want to help with this!
I have a feeling it might be possible to use an interface to support both []float64 and []int data. However, I've not designed public interfaces or worked around the lack of generics myself, so I'll either have to do some research and hacking, or have the excellent community of Gophers help in this area or tell me my attempts will be futile. Either way, any feedback is appreciated!
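As a hedged sketch of one possibility: type parameters, added in Go 1.18 (long after this issue was opened), can support both slice types without an interface. This is illustrative only, not the library's API:

```go
package main

import "fmt"

// Number is a constraint covering the numeric element types the issue
// mentions. Sketch using Go 1.18+ type parameters, which did not exist
// when this issue was opened; the library itself only accepts float64.
type Number interface {
	~int | ~int64 | ~float64
}

// Mean works for any slice of Number without converting to []float64 first.
func Mean[T Number](input []T) (float64, error) {
	if len(input) == 0 {
		return 0, fmt.Errorf("input must not be empty")
	}
	var sum float64
	for _, v := range input {
		sum += float64(v)
	}
	return sum / float64(len(input)), nil
}

func main() {
	mi, _ := Mean([]int{1, 2, 3, 4})
	mf, _ := Mean([]float64{1.5, 2.5})
	fmt.Println(mi, mf) // 2.5 2
}
```

The same constraint could be reused for Median, Sum, and so on, keeping one implementation per function instead of one per element type.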
I think it could be great to have a Describe() function like pandas.describe().
Hello Flynn. I tried to use your autocorrelation function, but found a strange thing: it returns very similar, incorrect values for some lags.
Example:
We have the following sequence:
[22, 24, 25, 25, 28, 29, 34, 37, 40, 44, 51, 48, 47, 50, 51]
In this case we must obtain the following sequence of values of the autocorrelation function:
[1, 0.83174224, 0.65632458, 0.49105012, 0.27863962, 0.03102625, -0.16527446, -0.30369928, -0.40095465, -0.45823389, -0.45047733]
For lags 0,1,2,3... respectively.
But your function returns this:
[0, 0.8317422434367543, 0.8871917263325378, 0.8908883585255901, 0.8911348006717935,...]
for i := 0; i < lags; i++ {
	v := (data[0] - mean) * (data[0] - mean)
	for i := 1; i < len(data); i++ {
		delta0 := data[i-1] - mean
		delta1 := data[i] - mean
		q += (delta0*delta1 - q) / float64(i+1)
		v += (delta1*delta1 - v) / float64(i+1)
	}
	result = q / v
}
And in your function there is this strange loop over the lag range, in which the inner loop shadows the loop variable i and the value of result is overwritten on every iteration.
The percentile function implements neither the Nearest Rank method, nor the Weighted Percentile method, nor the NIST method, all defined here: http://en.wikipedia.org/wiki/Percentile
Which is the preferred method? I was thinking about fixing this myself; which would you prefer?
How would one sample from a normal distribution using this package?
Thanks in advance.
Using the library and getting error messages that Input is outside of range, which turned out to be because the data array had a single value (which I fully realise is not suitable for a percentile calc!). Would it be possible to change the error message to be more informative, or add a documentation item?
To reproduce:
package main

import (
	"fmt"

	"github.com/montanaflynn/stats"
)

func main() {
	points1 := []float64{1.0, 2.0, 3.0}
	points2 := []float64{123.0}
	sa1 := stats.LoadRawData(points1)
	sa2 := stats.LoadRawData(points2)
	if vPerc1, err := stats.Percentile(sa1, 90.0); err == nil {
		fmt.Printf("Calc'd 90th percentile for multivalue, it's %f\n", vPerc1)
	} else {
		fmt.Printf("Error calculating 90th percentile for multivalue, err: %s\n", err.Error())
	}
	if vPerc2, err := stats.Percentile(sa2, 90.0); err == nil {
		fmt.Printf("Calc'd 90th percentile for single value, it's %f\n", vPerc2)
	} else {
		fmt.Printf("Error calculating 90th percentile for single value, err: %s\n", err.Error())
	}
}
which gives output of:
Calc'd 90th percentile for multivalue, it's 2.500000
Error calculating 90th percentile for single value, err: Input is outside of range.
Would love to have some discussion on the Public API. Specifically I want to know if having types with methods in addition to the functions makes sense. I think it does but maybe having two ways to do something is confusing to some. Here's what I mean by two ways of doing things:
var data = []float64{1, 2, 3, 4, 4, 5}
median, _ := stats.Median(data)
fmt.Println(median) // 3.5
var d stats.Float64Data = data
median, _ = d.Median()
fmt.Println(median) // 3.5
Please share your thoughts with me!
Given a slice a := []float64{0, 300, 600},
stats.Percentile(a, 50)
should return 300. However, it returns 150.
Are there any plans to implement Spearman correlation? Thanks.
Hi, I wondered if I could contribute by adding single-pass descriptive stats for people working with large datasets. These would simply return mean, sdev, var, min, max, and correlation: all the things you have, but for situations where a Float64Data would be too big.
We use your library for our open-source photo app, in particular the DBSCAN implementation. Thanks for providing it!
While it works great for me, a developer reported issues with panics in dbscan.go, line 251. It seems w.Done() may be called too often, probably depending on the input data. I couldn't reproduce it with my local samples.
Our related code and GitHub issue:
Trace:
panic: sync: negative WaitGroup counter
goroutine 3326 [running]:
sync.(*WaitGroup).Add(0xc004761070, 0xffffffffffffffff)
/usr/local/go/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
/usr/local/go/src/sync/waitgroup.go:99
github.com/mpraski/clusters.(*dbscanClusterer).nearestWorker(0xc000cce3c0)
/go/pkg/mod/github.com/mpraski/[email protected]/dbscan.go:251 +0x230
created by github.com/mpraski/clusters.(*dbscanClusterer).startNearestWorkers
/go/pkg/mod/github.com/mpraski/[email protected]/dbscan.go:228
See: https://en.wikipedia.org/wiki/Percentile#The_weighted_percentile_method
My current code is as follows, but I don't know how to support the weighted percentile method.
func TestPercentile(t *testing.T) {
	values := []float64{4, 5, 3, 1, 2}
	percentiles := []float64{}
	for i := 1; i <= 100; i++ {
		percentile, err := stats.PercentileNearestRank(values, float64(i))
		if err != nil {
			panic(err)
		}
		percentiles = append(percentiles, percentile)
		fmt.Printf("%d%%: %f, ", i, percentile)
	}
	println()
	println()
	for f := 0.0; f <= 5; f += 0.1 {
		index := sort.SearchFloat64s(percentiles, f+0.00000001)
		fmt.Printf("%f: %d%%, ", f, index)
	}
	println()
	println()
}
Could you please tag v0.3.0 so that vgo can find it and use it?
See: https://research.swtch.com/vgo
Thank you.
Thank you for this great package.
I added support for a string and an io.Reader in LoadRawData() so it supports whitespace-separated strings, i.e.:
stats.LoadRawData("1.1 2 3.0 4 5")
// or
stats.LoadRawData(os.Stdin)
Is this something you would consider implementing in your package? If so, I can create a pull request.
Line 26 in 8e3445d
It should be: index := (percent/100)*float64(len(c)-1) + 1
according to this percentile calculator reference: https://www.calculatorsoup.com/calculators/statistics/percentile-calculator.php
r = (p/100) * (n - 1) + 1
To reproduce the bug, try this dataset:
0.67, 0.999, 1
percent = 25
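To illustrate, here is a hedged sketch of the proposed formula applied to that dataset (percentileLinear is a hypothetical name, not the library's code):

```go
package main

import (
	"fmt"
	"sort"
)

// percentileLinear illustrates the proposed r = (p/100)*(n-1) + 1 rank
// formula with linear interpolation between neighbouring values. It is a
// sketch of the suggested fix, not the library's actual implementation.
func percentileLinear(input []float64, percent float64) float64 {
	c := append([]float64(nil), input...) // work on a sorted copy
	sort.Float64s(c)
	r := percent/100*float64(len(c)-1) + 1 // 1-based rank
	i := int(r)                            // integer part of the rank
	if i >= len(c) {
		return c[len(c)-1]
	}
	frac := r - float64(i) // fractional part used for interpolation
	return c[i-1] + frac*(c[i]-c[i-1])
}

func main() {
	// For {0.67, 0.999, 1} at percent 25: r = 0.25*2 + 1 = 1.5, so the
	// result is halfway between 0.67 and 0.999.
	fmt.Println(percentileLinear([]float64{0.67, 0.999, 1}, 25))
}
```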
I believe there are some errors with the Percentile() edge cases.
Passing 0 as the percent will cause an error, as will a small set of data and a small percentage (such that c[i-1] is out of bounds because i = 0).
I'm not sure of the best approach to fix this, as picking which index is quite critical for the correct result, but I think this might work?
index := (percent / 100) * float64(len(c) - 1)
And then use c[i] and c[i+1] later on. Using c[i+1] would be dangerous with input.Len() == 1, and maybe in the case of 99.9 percent and few values?
Hi,
I am not able to use the MedianAbsoluteDeviationPopulation function.
If I use go doc, I do not see all the functions:
$ go doc stats
package stats // import "github.com/zizmos/ego/vendor/github.com/montanaflynn/stats"
func Correlation(data1, data2 Float64Data) (float64, error)
func Covariance(data1, data2 Float64Data) (float64, error)
func GeometricMean(input Float64Data) (float64, error)
func HarmonicMean(input Float64Data) (float64, error)
func InterQuartileRange(input Float64Data) (float64, error)
func Max(input Float64Data) (max float64, err error)
func Mean(input Float64Data) (float64, error)
func Median(input Float64Data) (median float64, err error)
func Midhinge(input Float64Data) (float64, error)
func Min(input Float64Data) (min float64, err error)
func Mode(input Float64Data) (mode []float64, err error)
func Percentile(input Float64Data, percent float64) (percentile float64, err error)
func PercentileNearestRank(input Float64Data, percent float64) (percentile float64, err error)
func PopulationVariance(input Float64Data) (pvar float64, err error)
func Round(input float64, places int) (rounded float64, err error)
func Sample(input Float64Data, takenum int, replacement bool) ([]float64, error)
func SampleVariance(input Float64Data) (svar float64, err error)
func StandardDeviation(input Float64Data) (sdev float64, err error)
func StandardDeviationPopulation(input Float64Data) (sdev float64, err error)
func StandardDeviationSample(input Float64Data) (sdev float64, err error)
func StdDevP(input Float64Data) (sdev float64, err error)
func StdDevS(input Float64Data) (sdev float64, err error)
func Sum(input Float64Data) (sum float64, err error)
func Trimean(input Float64Data) (float64, error)
func VarP(input Float64Data) (sdev float64, err error)
func VarS(input Float64Data) (sdev float64, err error)
func Variance(input Float64Data) (sdev float64, err error)
type Coordinate struct{ ... }
func ExpReg(s []Coordinate) (regressions []Coordinate, err error)
func LinReg(s []Coordinate) (regressions []Coordinate, err error)
func LogReg(s []Coordinate) (regressions []Coordinate, err error)
type Float64Data []float64
type Outliers struct{ ... }
func QuartileOutliers(input Float64Data) (Outliers, error)
type Quartiles struct{ ... }
func Quartile(input Float64Data) (Quartiles, error)
type Series []Coordinate
func ExponentialRegression(s Series) (regressions Series, err error)
func LinearRegression(s Series) (regressions Series, err error)
func LogarithmicRegression(s Series) (regressions Series, err error)
However, MedianAbsoluteDeviationPopulation function is a public function in the implementation.
$ go version
go version go1.10.2 darwin/amd64
$ dep status
....
....
github.com/montanaflynn/stats ^0.2.0 0.2.0 eeaced0 0.2.0 1
...
...
Hi, I was thinking of implementing a t-test; should I add it as a pull request?
All of the methods that call sort are not documented to specify that the input will be modified.
The source code from go get github.com/montanaflynn/stats on my project with Go modules is different from the source you have in your GitHub repo. It shows me stats@v0.5.0, but the changelog for v0.5.0 isn't in the code I have received. Here is the go.sum:
github.com/montanaflynn/stats v0.5.0 h1:2EkzeTSqBB4V4bJwWrt5gIIrZmpJBcoIRGS2kWLgzmk= github.com/montanaflynn/stats v0.5.0/go.mod h1:wL8QJuTMNUDYhXwkmfOly8iTdp5TEcJFWZD2D7SIkUc=
Hi,
I found something in a paper which shows we can calculate the SD of a dataset from the SDs of its subsets :)
e.g.: NEW_FUNCTION(SD({1,3}), SD(5), medians) = SD({1,3,5})
The use case is where we have calculated the SD of 1000 rows of data and want to calculate the SD of 1000 + 1 rows without reprocessing the former data.
Is this implemented in this library? I didn't see such a thing.
If it seems proper, let me know and I'll submit the pull request.
Thanks
[{0 2.5} {1 5} {4 25} {6 5} {8 5} {11 15} {12 2.5} {14 25} {15 0} {16 1.6666666666666667} {17 40} {20 5} {21 15} {22 20} {23 16.666666666666668} {24 13.333333333333334} {25 50} {26 18} {27 75} {28 21} {29 0} {30 5} {31 37.5} {32 5} {34 40} {36 5} {37 39}]
When running exponential regression with the series above, the result is:
[{0 NaN} {1 NaN} {4 NaN} {6 NaN} {8 NaN} {11 NaN} {12 NaN} {14 NaN} {15 NaN} {16 NaN} {17 NaN} {20 NaN} {21 NaN} {22 NaN} {23 NaN} {24 NaN} {25 NaN} {26 NaN} {27 NaN} {28 NaN} {29 NaN} {30 NaN} {31 NaN} {32 NaN} {34 NaN} {36 NaN} {37 NaN}]
Is this desired? If so, could you maybe elaborate on why this happens?
Instead of the error values being generated within the functions, globally-defined errors would allow error handling in the calling function to be done without parsing the error string.
This error is generated in many places, but there's no way to compare two error values without comparing the string itself. Additionally, there's no guarantee that two "Empty input errors" have the exact same error string.
if input.Len() == 0 {
	return 0, errors.New("Input must not be empty")
}
If all of those errors are collected into a set of exported values:
type StatErr struct {
	err string
}

func (s StatErr) Error() string {
	return s.err
}

// Go does not allow struct constants, so these are package-level vars.
var (
	EmptyArrayError = StatErr{"Empty input can't be processed"}
	...
)
That way an error value can be identified by these constants:
In the code:
if input.Len() == 0 {
	return 0, EmptyArrayError
}
In the call:
v, err := stat.Mean(input)
if err == stat.EmptyArrayError {
	// Handle the specific error
}
This would clean up the error returns in the library and make error handling easier for anyone using it.