rust-ndarray / ndarray-stats
Statistical routines for ndarray
Home Page: https://docs.rs/ndarray-stats
License: Apache License 2.0
This fails in 0.4:

use ndarray_stats::CorrelationExt;
let or = ArrayView2::from_shape([prices.ncols(), prices.nrows() - 1], &pct_returns).unwrap();
let cov_matrix = or.cov(1.0).expect("Could not calculate covariance matrix");

with:

help: the following trait is implemented but not in scope; perhaps add a `use` for it:
  |
1 | use ndarray_stats::correlation::CorrelationExt;

but has no such issues in 0.3. A regression?
A common use case for me is taking the min or max over one axis of a multidimensional array, much like quantile_axis_mut, except I don't want to mutate my original data and I don't need the overhead of tracking quantiles. I couldn't find this idea mentioned in other issues. Would you consider a PR for such methods?
I was thinking along the lines of
pub trait QuantileExt<A, S, D>
where
    S: Data<Elem = A>,
    D: Dimension,
{
    fn min_axis_skipnan(&self, axis: Axis) -> Array<A, D::Smaller>
    where
        D: RemoveAxis,
        A: Ord + Clone + MaybeNan,
        A::NotNan: Ord;
}
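To illustrate the intended semantics with plain std types (the helper name and the use of Vec instead of ArrayBase are my own, for the sketch only): a column-wise minimum that drops NaN entries from each lane before reducing.

```rust
// Sketch of min-along-axis-skipping-NaN semantics on a plain Vec<Vec<f64>>.
// Hypothetical helper; the real method would live on ArrayBase and iterate lanes.
fn min_axis0_skipnan(rows: &[Vec<f64>]) -> Vec<f64> {
    let ncols = rows[0].len();
    (0..ncols)
        .map(|j| {
            rows.iter()
                .map(|r| r[j])
                .filter(|x| !x.is_nan())        // skip NaN entries in the lane
                .fold(f64::INFINITY, f64::min)  // minimum of the remaining values
        })
        .collect()
}

fn main() {
    let data = vec![
        vec![1.0, f64::NAN, 3.0],
        vec![4.0, 2.0, f64::NAN],
    ];
    println!("{:?}", min_axis0_skipnan(&data)); // [1.0, 2.0, 3.0]
}
```

An all-NaN lane would yield infinity here; the real method would presumably handle that case explicitly.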
Bumping a minor version given that ndarray is bumping a minor as well (0.12.1 -> 0.13.0).
To do:
- crates.io
Description
quantile_mut can fail with the error message:

thread 'main' has overflowed its stack
fatal runtime error: stack overflow

Version Information
ndarray: 0.15.4
ndarray-stats: 0.5.0

To Reproduce
use ndarray::Array1;
use ndarray_stats::{interpolate::Linear, Quantile1dExt};
use noisy_float::types::{n64, N64};

fn main() {
    {
        let mut array: Array1<N64> = Array1::ones(15300);
        println!("One {}", array.quantile_mut(n64(0.5), &Linear).unwrap());
    }
    {
        let mut array: Array1<N64> = Array1::ones(15600);
        println!("Two {}", array.quantile_mut(n64(0.5), &Linear).unwrap());
    }
    {
        let mut array: Array1<N64> = Array1::ones(100000);
        println!("Three {}", array.quantile_mut(n64(0.5), &Linear).unwrap());
    }
}
Observed behavior
$ cargo run --profile=dev
One 1
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
$ cargo run --profile=release
One 1
Two 1
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Expected behavior
One 1
Two 1
Three 1
Additional context
ulimit -s reports 8192.

Moving forward with #1, I am now working towards computing central order moments. As far as I can understand, it's impossible to compute M_n in a numerically stable fashion without computing M_1, M_2, ..., M_{n-1}, similarly to what happens with our variance method in ndarray (I am using this as reference).
Should we make this transparent and return the whole array of moments up to the order required?
This would probably save some computational workload if people actually need more than one of those (e.g. mean, std deviation, kurtosis and skewness).
What do you think @jturner314?
The only open question is what we should use as the return type. A Vec?
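A minimal pure-std sketch of the "return all moments up to the requested order as a Vec" idea (the function name is hypothetical, and this uses the naive per-moment pass rather than the numerically stable recurrence discussed above, just to show the return type):

```rust
// Sketch: compute central moments M_2..=M_order around the mean, returning
// the whole Vec so callers needing several moments (variance, skewness,
// kurtosis) pay for the data pass only once.
fn central_moments(data: &[f64], order: usize) -> Vec<f64> {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    (2..=order)
        .map(|k| data.iter().map(|x| (x - mean).powi(k as i32)).sum::<f64>() / n)
        .collect()
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0];
    // M_2 (population variance) is 1.25; M_3 is 0 for symmetric data.
    println!("{:?}", central_moments(&data, 3)); // [1.25, 0.0]
}
```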
It would be nice to make it impossible for other crates to implement our *Ext traits, because then we could freely add new methods without breaking changes. (Adding the indexed_fold_skipnan method to MaybeNanExt in #33 is an example. If ndarray-stats were the only crate that could implement MaybeNanExt, then we could add indexed_fold_skipnan without that being a breaking change.)
ndarray accomplishes this for some of its traits (e.g. the Dimension trait) using a private marker type.
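The private-marker ("sealed trait") pattern looks roughly like this; the module, trait, and method names here are illustrative, not the crate's actual API:

```rust
// Sealed-trait pattern: downstream crates can call Ext's methods but cannot
// implement Ext themselves, because its supertrait lives in a private module.
mod private {
    pub trait Sealed {}
    impl Sealed for Vec<f64> {}
}

pub trait Ext: private::Sealed {
    fn mean_or_zero(&self) -> f64;
}

impl Ext for Vec<f64> {
    fn mean_or_zero(&self) -> f64 {
        if self.is_empty() {
            0.0
        } else {
            self.iter().sum::<f64>() / self.len() as f64
        }
    }
}

fn main() {
    println!("{}", vec![1.0, 2.0, 3.0].mean_or_zero()); // 2
}
```

Because `private::Sealed` is unnameable outside the crate, new methods (with default bodies) can later be added to `Ext` without it being a breaking change.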
Given that the bulk of the work on this crate has been completed, would it make sense to move it inside the rust-ndarray organization @jturner314?
The Rust cookbook has a section on statistics but, while other sections point you to common crates, its examples compute the mean, median, and standard deviation from scratch.
It might be a good place to advertise ndarray-stats (and maybe a non-ndarray-based crate such as statistical).
Binned statistics like scipy.stats.binned_statistic_dd, similar to ndarray_stats::histogram::Histogram, would allow calculation of more statistical features per bin: weighted histograms, means, variances, min, max, etc. I would like to add something like that and would be grateful for opinions on what it should look like. @LukeMathWalker
Is it a good idea to calculate all statistics when a value is pushed to be binned, or should only one statistic be calculated, selected beforehand? That is, bs = BinnedStatistic(grid) vs. bs = BinnedStatistic(grid, variance).
Histograms solely count the number of observations in each bin, so the default value of zero is unambiguous. For other statistics, zero is a valid result even with values in that bin. Should the output be just the numerical value, with comparison against the histogram (through an additional function) revealing which bins are empty, or would something similar to Option<T> be a better output? That is, [..., 0.0, 0.0, 1.2, ...] vs. [..., Value(0.0), Empty, Value(1.2), ...].
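The `Option`-like output could be sketched as a small enum (type and variant names are hypothetical), which keeps "empty bin" structurally distinct from "statistic happens to be zero":

```rust
// Sketch: distinguish empty bins from bins whose statistic is legitimately 0.0.
#[derive(Debug, PartialEq)]
enum BinContent {
    Empty,
    Value(f64),
}

fn main() {
    // Hypothetical bin results: the middle bin saw no observations.
    let bins = vec![BinContent::Value(0.0), BinContent::Empty, BinContent::Value(1.2)];
    for b in &bins {
        match b {
            BinContent::Empty => println!("no observations"),
            BinContent::Value(v) => println!("statistic = {}", v),
        }
    }
}
```

The flat `f64` output stays simpler to consume, but forces every caller to cross-check against the histogram counts; the enum makes the empty case impossible to overlook.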
Description
So that crates using this crate don't have to add it explicitly to their Cargo.toml, it would be convenient to have this crate pub use things like noisy_float::types::N64, since they are required directly by the API, such as with QuantileExt::quantile_axis_mut:
ndarray-stats/src/quantile/mod.rs
Lines 208 to 213 in b6628c6
Version Information
ndarray: 0.15.4
ndarray-stats: 0.5.0

To Reproduce
N/A
Expected behavior
Not having to add noisy_float to my Cargo.toml when using ndarray_stats::quantile::QuantileExt.
See #13 for some discussion.
Rust 1.49 adds a select_nth_unstable method to slices which is very similar to get_from_sorted_mut, except that it returns a mutable reference to the element and additionally returns views for the portions before/after the element. It would probably be worthwhile to change the API of get_from_sorted_mut to match (and maybe rename it):

pub fn select_nth_unstable(&mut self, i: usize) -> (ArrayViewMut1<'_, A>, &mut A, ArrayViewMut1<'_, A>)
where
    A: Ord,
    S: DataMut,
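For reference, the std slice API this proposal mirrors behaves like so:

```rust
// std's select_nth_unstable: after the call, the returned middle reference
// points at the element that would sit at the given index in sorted order;
// `below` and `above` are unsorted partitions around it.
fn main() {
    let mut v = [5, 1, 4, 2, 3];
    let (below, nth, above) = v.select_nth_unstable(2);
    assert!(below.iter().all(|&x| x <= 3)); // everything left is <= the pivot
    assert_eq!(*nth, 3);                    // 3rd-smallest element
    assert!(above.iter().all(|&x| x >= 3)); // everything right is >= the pivot
    println!("median = {}", nth); // median = 3
}
```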
Midpoint interpolation is currently implemented like:

let denom = T::from_u8(2).unwrap();
(lower.unwrap() + higher.unwrap()).mapv_into(|x| x / denom.clone())

This causes overflows most times I use it, whereas implementing it as lower + (higher - lower) / 2 would prevent this.
Also, happy to do a PR for this if it would be helpful, but I figured it was such a small change it might be quicker if you sorted it yourselves.
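A small demonstration of the rearranged midpoint (safe whenever lower <= higher and the subtraction cannot itself overflow, e.g. for unsigned or same-sign values, which holds for neighbouring order statistics of the same sorted data):

```rust
// Overflow-safe midpoint: (lower + higher) / 2 overflows near the top of the
// type's range, while lower + (higher - lower) / 2 stays in range.
fn main() {
    let lower: u32 = u32::MAX - 4;
    let higher: u32 = u32::MAX;
    // (lower + higher) / 2 would overflow u32 here; the rearranged form does not.
    let mid = lower + (higher - lower) / 2;
    assert_eq!(mid, u32::MAX - 2);
    println!("{}", mid);
}
```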
Hi, I'm trying to use .quantile_mut() on a 1D array of f64 numbers. I'm getting this error:

the trait bound `f64: std::cmp::Ord` is not satisfied
the trait `std::cmp::Ord` is not implemented for `f64`

I have a few questions:
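For context on why this error appears: f64 only implements PartialOrd, because NaN is not comparable, so APIs requiring Ord reject it. The usual workarounds are a NaN-checked wrapper type (ndarray-stats's quantile APIs are written around noisy_float's N64 for this reason) or, in plain std, a total order via f64::total_cmp. A minimal std-only illustration:

```rust
// f64 is not Ord, but total_cmp provides a total order over all f64 values
// (including NaN), so sorting-based statistics are possible without wrappers.
fn main() {
    let mut v = vec![3.0_f64, 1.0, 2.0];
    v.sort_by(|a, b| a.total_cmp(b)); // plain v.sort() would not compile
    let median = v[v.len() / 2];
    assert_eq!(median, 2.0);
    println!("median = {}", median); // median = 2
}
```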
I suggest we consider adding an examples folder to demonstrate more real-world usage.
The benefits I think this would bring are:
Can we brain-storm a list of the kinds of examples we would want?
Does anybody have any toy examples we could use to seed the folder?
I just had a situation where I wanted to convert from a covariance matrix to a correlation matrix. It would be neat if we could do that with one function call.
This is as much a note to myself to implement this (when I get time) as it is a feature request.
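The conversion itself is just a rescaling by the standard deviations on the diagonal, r_ij = c_ij / sqrt(c_ii * c_jj). A pure-std sketch (function name hypothetical, Vec of rows standing in for a 2-D array):

```rust
// Sketch: convert a covariance matrix to a correlation matrix by dividing
// each entry by the product of the corresponding standard deviations.
fn cov_to_corr(cov: &[Vec<f64>]) -> Vec<Vec<f64>> {
    // Standard deviations are the square roots of the diagonal entries.
    let sd: Vec<f64> = (0..cov.len()).map(|i| cov[i][i].sqrt()).collect();
    cov.iter()
        .enumerate()
        .map(|(i, row)| {
            row.iter()
                .enumerate()
                .map(|(j, c)| c / (sd[i] * sd[j]))
                .collect()
        })
        .collect()
}

fn main() {
    let cov = vec![vec![4.0, 2.0], vec![2.0, 9.0]];
    // Diagonal becomes 1; off-diagonal is 2 / (2 * 3) = 1/3.
    println!("{:?}", cov_to_corr(&cov));
}
```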
The summary statistics methods use ArrayBase::sum (directly or indirectly) in anticipation of pairwise summation (rust-ndarray/ndarray#577), which provides improved accuracy over naive summation using fold. However, to do this, some of the methods have unnecessary allocations or other performance issues.
For example, harmonic_mean is implemented like this:

self.map(|x| x.recip()).mean().map(|x| x.recip())

It's implemented this way to take advantage of .mean() (which is implemented in terms of .sum()), but this approach requires a temporary allocation for the result of self.map.
summary_statistics::means::moments has a similar issue:

for k in 2..=order {
    moments.push(a.map(|x| x.powi(k)).sum() / n_elements)
}

It's implemented this way to take advantage of .sum(). However, this implementation requires a temporary allocation for the result of a.map. Additionally, it would probably be faster to make the loop over k the innermost loop to improve the locality of reference.
We should be able to resolve these issues with a lazy version of map combined with a pairwise summation method on that lazy map. Something like jturner314/nditer would work once it's stable.
[Edit: This issue also appears in the entropy methods.]
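For context, the pairwise summation referred to above is a simple divide-and-conquer reduction whose rounding error grows like O(log n) instead of the O(n) of a left-to-right fold. A minimal sketch (real implementations tune the base-case block size and unroll it):

```rust
// Minimal pairwise summation: split in half, sum each half recursively.
// Below a small block size, fall back to a plain sequential sum.
fn pairwise_sum(x: &[f64]) -> f64 {
    if x.len() <= 8 {
        x.iter().sum()
    } else {
        let (a, b) = x.split_at(x.len() / 2);
        pairwise_sum(a) + pairwise_sum(b)
    }
}

fn main() {
    let v: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    println!("{}", pairwise_sum(&v)); // 5050
}
```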
The docs for the correlation methods say:
Let (r, o) be the shape of M:
- r is the number of random variables;
- o is the number of observations we have collected for each random variable.
What this implicitly says is that "M should be a matrix with r rows, corresponding to random variables, and o columns, corresponding to observations". We know this because ndarray has an explicit definition for rows and columns, whereby the first axis refers to the rows and the second axis is the column axis; see for example the nrows and ncols functions.
However, I find this assumption counter-intuitive. The convention in my experience is to use the "tidy" layout, in which each row corresponds to an observation and each column corresponds to a variable. I refer here to Hadley Wickham's work on tidy data.
Also this is how R works:
> mat
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
> nrow(mat)
[1] 4
> ncol(mat)
[1] 2
> cov(mat)
[,1] [,2]
[1,] 1.666667 1.666667
[2,] 1.666667 1.666667
Thirdly, in terms of the Rust data science ecosystem, note that polars (as far as I know, the best-supported data frame library in Rust) outputs matrices with the same assumptions. If you create a DataFrame with 2 series (which correspond to variables) and 3 rows, and run .to_ndarray(), you will get a (3, 2) ndarray. Then when you call .cov() on it, you will get something that is not the covariance matrix you are after.
One argument in defence of the current method is numpy.cov, which makes the same assumption, as it takes:

A 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables.

My suggestion is therefore to consider reversing the assumed dimensions for these methods in the next major (breaking) release. I realise that using .t() is not a difficult thing to do, but unfortunately forgetting to do it will result in a valid matrix that may continue into downstream code without the user realising that it is not the correct covariance matrix. This happened to me and I'd like to spare other users from this issue.
Hello. Do you have plans to port https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde?
There exists a method for weighted standard deviation, weighted_std, but I can't find one for regular unweighted standard deviation.
Is there a reason for this?
Some functions like mean are implemented in both ndarray (here) and ndarray-stats (here). When using SummaryStatisticsExt, the function mean exists twice, which is not a problem but is confusing, as one implementation returns Option and the other Result, and at first it seems to mismatch the documentation.
Is it intended to have some implementations duplicated across the two crates in the long run? If so, ndarray-stats could be more specific by renaming mean to arithmetic_mean (and its axis/weighted variants), being consistent with geometric_mean and harmonic_mean. Feel free to close the issue if you prefer a simple mean.
Hello! I want to use this crate in an embedded environment. Is there any no-std support?
Description
The map_axis_skipnan_mut method returns invalid results for any non-maximal axis.
Version Information
ndarray: "0.15"
ndarray-stats: "0.5.0"

To Reproduce

let mut a = arr2(&[[1., 2., f64::NAN, 4.], [5., 6., 7., 8.], [9., 10., 11., f64::NAN]]);
println!("Initial array:\n{}", &a);
println!("\nLanes:");
let b = a.map_axis_skipnan_mut(Axis(0), |x| { println!("{}", x); x });
println!("\nResults:\n{}", b);
Expected behavior
As far as I understand, the documented behavior, consistent with the map_axis_mut method, would give the following result:
Initial array:
[[1, 2, NaN, 4],
[5, 6, 7, 8],
[9, 10, 11, NaN]]
Lanes:
[1, 5, 9]
[2, 6, 10]
[11, 7]
[4, 8]
Results:
[[1, 5, 9], [2, 6, 10], [11, 7], [4, 8]]
Actual behavior
Initial array:
[[1, 2, NaN, 4],
[5, 6, 7, 8],
[9, 10, 11, NaN]]
Lanes:
[1, 2, NaN]
[2, NaN, 4]
[11, 4]
[4, 5]
Results:
[[1, 2, 11], [2, 11, 4], [11, 4], [4, 5]]
This behavior appears to be clearly invalid. It does not skip NaN values, and the elements present in each lane are not consistent with the selected axis. Applying some methods to the lanes (such as x.sum()) can cause a panic: thread 'main' panicked at 'unexpected NaN', .../.cargo/registry/src/github.com-1ecc6299db9ec823/noisy_float-0.2.0/src/checkers.rs:30:9, as the lanes include NotNan values that are actually NaN.
Additional context
When no NaN values are present in the initial array, the result should be identical (besides the NotNan type) to that of the map_axis_mut function. However, it instead returns slices of what seems to be a flattened version of the initial array.
let mut a = arr2(&[[1., 2., 3., 4.], [5., 6., 7., 8.], [9., 10., 11., 12.]]);
Result:
Lanes:
[1, 2, 3]
[2, 3, 4]
[3, 4, 5]
[4, 5, 6]
The function works as expected when the axis argument is the largest valid axis (a.ndim() - 1).
It appears that while the content of the lanes is invalid, the number of elements in each lane is always correct. My best guess is that the error originates in the remove_nan_mut function; specifically, the slice created in line 63 probably ignores the axis of the ArrayView.
The Sort1dExt trait provides sorting/partitioning methods based on quickselect. What's the reason for choosing this over pdqsort, as used in sort_unstable in std? histogram/bins.rs has several calls to sort_unstable; and how about rayon for its parallel sort?

As soon as rust-ndarray/ndarray#491 is merged in ndarray.
I am not sure if this is the right place to ask this question. First, thanks for your contribution.
I have n time series of Vec<f32> and I want to get a histogram Count with m (usually 50) bins. In Python, it is possible to specify the number of bins needed; is that possible in Rust?
My next problem is, I don't know the observation values at compile time, so I can't build the grid beforehand. Is it possible to redefine the grid once values are added?
For example, I initiate an empty histogram and add observations in a loop (the loop will be a concat of multiple time series until the merge PR is ready :) ). Then I specify the bin count (say 50). It should distribute the bin range over 50 bins based on the min/max. Is this possible?
My use case can be solved using HdrHistogram, but it supports only u64 and not f32. Also, since my application already uses ndarray, I felt this crate would fit very well.
Note: I am new to Rust.
#[test]
fn test_zero_observations() {
    let a = Array2::<f32>::zeros((2, 0));
    let pearson = a.pearson_correlation();
    assert_eq!(pearson.shape(), &[2, 2]);
    let all_nan_flag = pearson.iter().map(|x| x.is_nan()).fold(true, |acc, flag| acc & flag);
    assert_eq!(all_nan_flag, true);
}
This test fails on Travis, for #5, while it succeeds on my local machine.
It's weird - any idea? @jturner314
I am trying to reproduce the error, but I am failing.
Title is self-explanatory. I guess this is also a breaking change, so it should come with a minor version bump. It would also be good if the 2 PRs that are currently open could be sorted out.
Hi, thanks for the lib! When looking at this code:
ndarray-stats/src/quantile/mod.rs
Lines 289 to 304 in b6628c6
I see some error handling. IMHO this may make the code much slower. Since I am only working with f32 or i32 or similar primitive numbers, can we remove such error handling? Thanks!
In terms of functionality, the mid-term end goal is to achieve feature parity with the statistics routines in numpy (here) and Julia's StatsBase (here).
For the next version:
- partialord version for quantiles methods;
- merge method;
For version 0.2.0:
For version 0.1.0:
- max / nanmax (@jturner314)
- min / nanmin (@jturner314)
- quantile / nanquantile (it includes percentile / nanpercentile as a special case) (@LukeMathWalker & @jturner314)
- correlation-methods:
  - cov (@LukeMathWalker)
  - corrcoef (@LukeMathWalker - #5)
- histogram-methods (@LukeMathWalker - #9)

Title is fairly self-explanatory: I've found the need for cosine distance at various times (and other distance metrics) that probably fit well in ndarray-stats. Maybe here we can decide on a few different ones and what is/isn't in scope for this crate.