rurlus / diptest
Python/C++ implementation of Hartigan & Hartigan's dip test, based on Martin Maechler's R package
License: GNU General Public License v3.0
I tried to run diptest.diptest() on a sample taken from a huge dataframe. The sample contains more than 2 million elements, and I got an IndexError about an index being out of bounds for an axis. However, if I take a filtered subset of the same column, it runs normally, without error.
IndexError
Traceback (most recent call last)
----> 6 dip, pval = diptest.diptest(s)
File /env/lib/python3.10/site-packages/diptest/diptest.py:198, in diptest(x, full_output, sort_x, allow_zero, boot_pval, n_boot, n_threads, seed, stream)
196 pval = func(**kwargs)
197 else:
--> 198 pval = Consts.compute_pval_interpolation(n, dip)
200 if full_output:
201 return dip, pval, r[1]
File /home/paulo/routers/env/lib/python3.10/site-packages/diptest/consts.py:83, in Consts.compute_pval_interpolation(cls, n, dip)
80 i1 = min(cls._CRIT_VALS.shape[0], i1)
82 # Interpolate on sqrt(n)
---> 83 n0, n1 = cls._SAMPLE_SIZE[[i0, i1]]
85 y0 = np.sqrt(n0) * cls._CRIT_VALS[i0]
86 sD = np.sqrt(n) * dip
IndexError: index 21 is out of bounds for axis 0 with size 21
I took the code from the example provided on the GitHub README.md page.
## Full version (this raises the IndexError)
import diptest
s = df.signal
# only the dip statistic
dip = diptest.dipstat(s)
# both the dip statistic and p-value
dip, pval = diptest.diptest(s)
print(dip, pval)
## Filtered version (runs fine)
s = df.query("col1 == 'A'").signal
# only the dip statistic
dip = diptest.dipstat(s)
# both the dip statistic and p-value
dip, pval = diptest.diptest(s)
print(dip, pval)
# 0.15625745645430683 0.0
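The traceback points at the clamp on line 80 of consts.py: min(cls._CRIT_VALS.shape[0], i1) still lets i1 equal the table length, which is one past the last valid index (21 for a table of 21 rows). This appears to be triggered when n is larger than the biggest tabulated sample size, which would explain why the smaller filtered sample works. A minimal sketch of that off-by-one, assuming the index comes from a sorted search over the tabulated sample sizes; the table values and the function bracket_indices are hypothetical and are not the library's code:

```python
import numpy as np

# Hypothetical stand-in for the table of sample sizes (NOT the library's values).
SAMPLE_SIZES = np.array([4, 10, 100, 1000, 72000])


def bracket_indices(n, sizes, clamp_to_last=True):
    """Return indices (i0, i1) of the tabulated sizes that bracket n.

    clamp_to_last=False reproduces the clamp seen in the traceback
    (upper bound = table length), which can return an invalid index.
    """
    # Position at which n would be inserted to keep `sizes` sorted.
    i1 = int(np.searchsorted(sizes, n))
    i0 = max(0, i1 - 1)
    # The last valid index is len - 1, not len.
    upper = sizes.shape[0] - 1 if clamp_to_last else sizes.shape[0]
    i1 = min(upper, i1)
    return i0, i1


n = 2_000_000  # larger than every tabulated size

# Clamp as in the traceback: i1 == len(sizes), so fancy indexing fails.
try:
    SAMPLE_SIZES[list(bracket_indices(n, SAMPLE_SIZES, clamp_to_last=False))]
except IndexError as exc:
    print("IndexError:", exc)

# Clamp at len - 1: both indices stay inside the table.
i0, i1 = bracket_indices(n, SAMPLE_SIZES)
n0, n1 = SAMPLE_SIZES[[i0, i1]]
print(i0, i1, n0, n1)
```

With the clamp at shape[0] - 1, a sample larger than the biggest tabulated size falls back to the last row instead of raising. Whether using the last row for such large n is statistically acceptable is a separate question for the maintainers.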
First, thanks for putting together this package!
I've got a distribution that certainly looks bimodal but I'm getting a p-value of 0.0. I'm sure the p-value is very low, but not 0. I'd like to know a more precise threshold for the p-value (e.g. p < 1e-5?). I'm using diptest-0.7.0 and below is a snippet of code.
import diptest
x = df['my_distribution_data'].values
dip, pval = diptest.diptest(x)
print(dip, pval)
Returns: 0.10526623882697403 0.0
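A p-value printed as 0.0 usually reflects the resolution of the method rather than a true zero. The diptest signature shown in the traceback of the previous issue lists boot_pval and n_boot parameters; assuming a bootstrap p-value is the share of simulated dip statistics at least as large as the observed one, its smallest nonzero value is about 1/n_boot. A minimal sketch of the usual add-one Monte Carlo estimate, which never returns exactly zero and so gives a reportable bound; mc_pvalue is a hypothetical helper, not part of diptest:

```python
def mc_pvalue(n_exceed: int, n_boot: int) -> float:
    """Add-one Monte Carlo p-value estimate.

    n_exceed: number of simulated statistics >= the observed statistic.
    n_boot:   total number of simulations.
    The observed statistic is counted as one more draw, so the result
    lies in (0, 1] and is never exactly zero.
    """
    if n_boot <= 0:
        raise ValueError("n_boot must be positive")
    if not 0 <= n_exceed <= n_boot:
        raise ValueError("n_exceed must be between 0 and n_boot")
    return (n_exceed + 1) / (n_boot + 1)


# No simulated statistic reached the observed one in 9999 draws:
# the estimate is a bound to report as "p <= ...", not as p = 0.
print(mc_pvalue(0, 9_999))
```

Under this convention, reporting p < 1e-5 would need on the order of 100,000 simulations with none exceeding the observed dip.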
Hello @RUrlus,
I was thinking that, since we have now moved entirely to C++, we can remove the legacy ifault variable and some obsolete consistency checks. Specifically, my proposals are the following:
1. Remove the ifault variable.
2. Add a function diptst_unsafe (the name is open to discussion, of course) that does not perform the checks for sorting and non-negativity. This can save some time when calling either diptest_pval or diptest_pval_mt.
3. diptst will call diptst_unsafe while also adding the following code:
    if (ifault == 1) {
        throw std::runtime_error("N must be >= 1.");
    } else if (ifault == 2) {
        throw std::runtime_error("x must be sorted in ascending error.");
    }
   (ifault will be removed and the actual checks will be performed in that place), so the wrapper method can also be refactored a little bit!
WDYT? If you agree, I will be happy to start working on it and create a PR linked to this issue. I am looking forward to your reply!
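The proposed split, an unchecked core plus a thin wrapper that validates its input and raises directly instead of setting an error code, can be sketched in Python for illustration. Everything here is hypothetical: checked and range_unsafe are made-up names, and range_unsafe is only a stand-in for the real routine; it does not compute the dip statistic.

```python
import numpy as np


def checked(core):
    """Wrap an unchecked routine with the validation the proposal describes."""
    def wrapper(x):
        x = np.asarray(x, dtype=float)
        # These checks replace the legacy error code with direct exceptions.
        if x.size < 1:
            raise ValueError("N must be >= 1.")
        if np.any(np.diff(x) < 0):
            raise ValueError("x must be sorted in ascending order.")
        return core(x)
    return wrapper


def range_unsafe(x):
    # Hypothetical stand-in for the unchecked core (NOT the dip statistic).
    # It trusts the caller: x is assumed non-empty and sorted ascending.
    return float(x[-1] - x[0])


# Public entry point: validates once, then delegates to the unchecked core.
range_checked = checked(range_unsafe)

print(range_checked([1.0, 2.0, 5.0]))
```

Internal callers that already guarantee sorted input, such as a bootstrap loop that sorts each resample itself, could call the unchecked version directly and skip the repeated validation.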