rust-ndarray / ndarray-stats
Statistical routines for ndarray
Home Page: https://docs.rs/ndarray-stats
License: Apache License 2.0
This fails in 0.4:

use ndarray_stats::CorrelationExt;
let or = ArrayView2::from_shape([prices.ncols(), prices.nrows() - 1], &pct_returns).unwrap();
let cov_matrix = or.cov(1.0).expect("Could not calculate covariance matrix");

with:

help: the following trait is implemented but not in scope; perhaps add a `use` for it:
  |
1 | use ndarray_stats::correlation::CorrelationExt;

but has no such issues in 0.3. A regression?
A common use case for me is taking the min or max over one axis of a multidimensional array, much like quantile_axis_mut, except I don't want to mutate my original data and I don't need the overhead of tracking quantiles. I couldn't find this idea mentioned in other issues. Would you consider a PR for such methods?
I was thinking along the lines of
pub trait QuantileExt<A, S, D>
where
    S: Data<Elem = A>,
    D: Dimension,
{
    fn min_axis_skipnan(&self, axis: Axis) -> Array<A, D::Smaller>
    where
        D: RemoveAxis,
        A: Ord + Clone + MaybeNan,
        A::NotNan: Ord;
}
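To illustrate the intended semantics with plain std types (the helper name and the use of Vec instead of ArrayBase are my own, for the sketch only): a column-wise minimum that drops NaN entries from each lane before reducing.

```rust
// Sketch of min-along-axis-skipping-NaN semantics on a plain Vec<Vec<f64>>.
// Hypothetical helper; the real method would live on ArrayBase and iterate lanes.
fn min_axis0_skipnan(rows: &[Vec<f64>]) -> Vec<f64> {
    let ncols = rows[0].len();
    (0..ncols)
        .map(|j| {
            rows.iter()
                .map(|r| r[j])
                .filter(|x| !x.is_nan())        // skip NaN entries in the lane
                .fold(f64::INFINITY, f64::min)  // minimum of the remaining values
        })
        .collect()
}

fn main() {
    let data = vec![
        vec![1.0, f64::NAN, 3.0],
        vec![4.0, 2.0, f64::NAN],
    ];
    println!("{:?}", min_axis0_skipnan(&data)); // [1.0, 2.0, 3.0]
}
```

An all-NaN lane would yield infinity here; the real method would presumably handle that case explicitly.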
Bumping a minor version given that ndarray is bumping a minor as well (0.12.1 -> 0.13.0).
To do:
- crates.io
Description
quantile_mut can fail with the error message:

thread 'main' has overflowed its stack
fatal runtime error: stack overflow

Version Information
ndarray: 0.15.4
ndarray-stats: 0.5.0

To Reproduce
use ndarray::Array1;
use ndarray_stats::{interpolate::Linear, Quantile1dExt};
use noisy_float::types::{n64, N64};

fn main() {
    {
        let mut array: Array1<N64> = Array1::ones(15300);
        println!("One {}", array.quantile_mut(n64(0.5), &Linear).unwrap());
    }
    {
        let mut array: Array1<N64> = Array1::ones(15600);
        println!("Two {}", array.quantile_mut(n64(0.5), &Linear).unwrap());
    }
    {
        let mut array: Array1<N64> = Array1::ones(100000);
        println!("Three {}", array.quantile_mut(n64(0.5), &Linear).unwrap());
    }
}
Observed behavior
$ cargo run --profile=dev
One 1
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
$ cargo run --profile=release
One 1
Two 1
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Expected behavior
One 1
Two 1
Three 1
Additional context
ulimit -s reports 8192.

Moving forward with #1, I am now working towards computing central order moments. As far as I can understand, it's impossible to compute M_n in a numerically stable fashion without computing M_1, M_2, ..., M_{n-1}, similarly to what happens with our variance method in ndarray (I am using this as reference).
Should we make this transparent and return the whole array of moments up to the order required?
This would probably save some computational workload if people actually need more than one of those (e.g. mean, std deviation, kurtosis and skewness).
What do you think @jturner314?
The only open question is what we should use as the return type. A Vec?
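A minimal pure-std sketch of the "return all moments up to the requested order as a Vec" idea (the function name is hypothetical, and this uses the naive per-moment pass rather than the numerically stable recurrence discussed above, just to show the return type):

```rust
// Sketch: compute central moments M_2..=M_order around the mean, returning
// the whole Vec so callers needing several moments (variance, skewness,
// kurtosis) pay for the data pass only once.
fn central_moments(data: &[f64], order: usize) -> Vec<f64> {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    (2..=order)
        .map(|k| data.iter().map(|x| (x - mean).powi(k as i32)).sum::<f64>() / n)
        .collect()
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0];
    // M_2 (population variance) is 1.25; M_3 is 0 for symmetric data.
    println!("{:?}", central_moments(&data, 3)); // [1.25, 0.0]
}
```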
It would be nice to make it impossible for other crates to implement our *Ext traits, because then we could freely add new methods without breaking changes. (Adding the indexed_fold_skipnan method to MaybeNanExt in #33 is an example. If ndarray-stats were the only crate that could implement MaybeNanExt, then we could add indexed_fold_skipnan without that being a breaking change.)
ndarray accomplishes this for some of its traits (e.g. the Dimension trait) using a private marker type.
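The private-marker ("sealed trait") pattern looks roughly like this; the module, trait, and method names here are illustrative, not the crate's actual API:

```rust
// Sealed-trait pattern: downstream crates can call Ext's methods but cannot
// implement Ext themselves, because its supertrait lives in a private module.
mod private {
    pub trait Sealed {}
    impl Sealed for Vec<f64> {}
}

pub trait Ext: private::Sealed {
    fn mean_or_zero(&self) -> f64;
}

impl Ext for Vec<f64> {
    fn mean_or_zero(&self) -> f64 {
        if self.is_empty() {
            0.0
        } else {
            self.iter().sum::<f64>() / self.len() as f64
        }
    }
}

fn main() {
    println!("{}", vec![1.0, 2.0, 3.0].mean_or_zero()); // 2
}
```

Because `private::Sealed` is unnameable outside the crate, new methods (with default bodies) can later be added to `Ext` without it being a breaking change.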
Given that the bulk of the work on this crate has been completed, would it make sense to move it inside the rust-ndarray organization @jturner314?
The Rust cookbook has a section on statistics but, while other sections point you to common crates, its examples compute the mean, median, and standard deviation from scratch.
It might be a good place to advertise ndarray-stats (and maybe a non-ndarray-based crate such as statistical).
Binned statistics like scipy.stats.binned_statistic_dd, similar to ndarray_stats::histogram::Histogram, would allow calculation of more statistical features per bin: weighted histograms, means, variances, min, max, etc. I would like to add something like that and would be grateful for opinions on what it should look like. @LukeMathWalker
Is it a good idea to calculate all statistics when a value is pushed to be binned, or should only one statistic be calculated, selected beforehand? That is, bs = BinnedStatistic(grid) vs. bs = BinnedStatistic(grid, variance).
Histograms solely count the number of observations in each bin, so the default value of zero is unambiguous. For other statistics, zero is a valid result even with values in that bin. Should the output be just the numerical value, with comparison against the histogram (through an additional function) revealing which bins are empty, or would something similar to Option<T> be a better output? That is, [..., 0.0, 0.0, 1.2, ...] vs. [..., Value(0.0), Empty, Value(1.2), ...].
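The `Option`-like output could be sketched as a small enum (type and variant names are hypothetical), which keeps "empty bin" structurally distinct from "statistic happens to be zero":

```rust
// Sketch: distinguish empty bins from bins whose statistic is legitimately 0.0.
#[derive(Debug, PartialEq)]
enum BinContent {
    Empty,
    Value(f64),
}

fn main() {
    // Hypothetical bin results: the middle bin saw no observations.
    let bins = vec![BinContent::Value(0.0), BinContent::Empty, BinContent::Value(1.2)];
    for b in &bins {
        match b {
            BinContent::Empty => println!("no observations"),
            BinContent::Value(v) => println!("statistic = {}", v),
        }
    }
}
```

The flat `f64` output stays simpler to consume, but forces every caller to cross-check against the histogram counts; the enum makes the empty case impossible to overlook.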
Description
So that crates using this crate don't have to add it explicitly to their Cargo.toml, it would be convenient to have this crate pub use things like noisy_float::types::N64, since they are required directly by the API, such as with QuantileExt::quantile_axis_mut:
ndarray-stats/src/quantile/mod.rs
Lines 208 to 213 in b6628c6
Version Information
ndarray: 0.15.4
ndarray-stats: 0.5.0

To Reproduce
N/A
Expected behavior
Not having to add noisy_float to my Cargo.toml when using ndarray_stats::quantile::QuantileExt.
See #13 for some discussion.
Rust 1.49 adds a select_nth_unstable method to slices which is very similar to get_from_sorted_mut, except that it returns a mutable reference to the element and additionally returns views for the portions before/after the element. It would probably be worthwhile to change the API of get_from_sorted_mut to match (and maybe rename it):

pub fn select_nth_unstable(&mut self, i: usize) -> (ArrayViewMut1<'_, A>, &mut A, ArrayViewMut1<'_, A>)
where
    A: Ord,
    S: DataMut,
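For reference, the std slice API this proposal mirrors behaves like so:

```rust
// std's select_nth_unstable: after the call, the returned middle reference
// points at the element that would sit at the given index in sorted order;
// `below` and `above` are unsorted partitions around it.
fn main() {
    let mut v = [5, 1, 4, 2, 3];
    let (below, nth, above) = v.select_nth_unstable(2);
    assert!(below.iter().all(|&x| x <= 3)); // everything left is <= the pivot
    assert_eq!(*nth, 3);                    // 3rd-smallest element
    assert!(above.iter().all(|&x| x >= 3)); // everything right is >= the pivot
    println!("median = {}", nth); // median = 3
}
```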
Midpoint interpolation is currently implemented like:

let denom = T::from_u8(2).unwrap();
(lower.unwrap() + higher.unwrap()).mapv_into(|x| x / denom.clone())

This causes overflows most times I use it, whereas implementing it as lower + (higher - lower) / 2 would prevent this.
Also, happy to do a PR for this if it would be helpful, but I figured it was such a small change it might be quicker if you sorted it yourselves.
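A small demonstration of the rearranged midpoint (safe whenever lower <= higher and the subtraction cannot itself overflow, e.g. for unsigned or same-sign values, which holds for neighbouring order statistics of the same sorted data):

```rust
// Overflow-safe midpoint: (lower + higher) / 2 overflows near the top of the
// type's range, while lower + (higher - lower) / 2 stays in range.
fn main() {
    let lower: u32 = u32::MAX - 4;
    let higher: u32 = u32::MAX;
    // (lower + higher) / 2 would overflow u32 here; the rearranged form does not.
    let mid = lower + (higher - lower) / 2;
    assert_eq!(mid, u32::MAX - 2);
    println!("{}", mid);
}
```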
Hi, I'm trying to use .quantile_mut() on a 1D array of f64 numbers. I'm getting this error:

the trait bound `f64: std::cmp::Ord` is not satisfied
the trait `std::cmp::Ord` is not implemented for `f64`

I have a few questions:
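For context on why this error appears: f64 only implements PartialOrd, because NaN is not comparable, so APIs requiring Ord reject it. The usual workarounds are a NaN-checked wrapper type (ndarray-stats's quantile APIs are written around noisy_float's N64 for this reason) or, in plain std, a total order via f64::total_cmp. A minimal std-only illustration:

```rust
// f64 is not Ord, but total_cmp provides a total order over all f64 values
// (including NaN), so sorting-based statistics are possible without wrappers.
fn main() {
    let mut v = vec![3.0_f64, 1.0, 2.0];
    v.sort_by(|a, b| a.total_cmp(b)); // plain v.sort() would not compile
    let median = v[v.len() / 2];
    assert_eq!(median, 2.0);
    println!("median = {}", median); // median = 2
}
```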
I suggest we consider adding an examples folder to demonstrate more real-world usage.
The benefits I think this would bring are:
Can we brain-storm a list of the kinds of examples we would want?
Does anybody have any toy examples we could use to seed the folder?
I just had a situation where I wanted to convert from a covariance matrix to a correlation matrix. It would be neat if we could do that with one function call.
This is as much a note to myself to implement this (when I get time) as it is a feature request.
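The conversion itself is just a rescaling by the standard deviations on the diagonal, r_ij = c_ij / sqrt(c_ii * c_jj). A pure-std sketch (function name hypothetical, Vec of rows standing in for a 2-D array):

```rust
// Sketch: convert a covariance matrix to a correlation matrix by dividing
// each entry by the product of the corresponding standard deviations.
fn cov_to_corr(cov: &[Vec<f64>]) -> Vec<Vec<f64>> {
    // Standard deviations are the square roots of the diagonal entries.
    let sd: Vec<f64> = (0..cov.len()).map(|i| cov[i][i].sqrt()).collect();
    cov.iter()
        .enumerate()
        .map(|(i, row)| {
            row.iter()
                .enumerate()
                .map(|(j, c)| c / (sd[i] * sd[j]))
                .collect()
        })
        .collect()
}

fn main() {
    let cov = vec![vec![4.0, 2.0], vec![2.0, 9.0]];
    // Diagonal becomes 1; off-diagonal is 2 / (2 * 3) = 1/3.
    println!("{:?}", cov_to_corr(&cov));
}
```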
The summary statistics methods use ArrayBase::sum (directly or indirectly) in anticipation of pairwise summation (rust-ndarray/ndarray#577), which provides improved accuracy over naive summation using fold. However, to do this, some of the methods have unnecessary allocations or other performance issues.
For example, harmonic_mean is implemented like this:

self.map(|x| x.recip()).mean().map(|x| x.recip())

It's implemented this way to take advantage of .mean() (which is implemented in terms of .sum()), but this approach requires a temporary allocation for the result of self.map.
summary_statistics::means::moments has a similar issue:

for k in 2..=order {
    moments.push(a.map(|x| x.powi(k)).sum() / n_elements)
}

It's implemented this way to take advantage of .sum(). However, this implementation requires a temporary allocation for the result of a.map. Additionally, it would probably be faster to make the loop over k the innermost loop to improve the locality of reference.
We should be able to resolve these issues with a lazy version of map combined with a pairwise summation method on that lazy map. Something like jturner314/nditer would work once it's stable.
[Edit: This issue also appears in the entropy methods.]
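For context, the pairwise summation referred to above is a simple divide-and-conquer reduction whose rounding error grows like O(log n) instead of the O(n) of a left-to-right fold. A minimal sketch (real implementations tune the base-case block size and unroll it):

```rust
// Minimal pairwise summation: split in half, sum each half recursively.
// Below a small block size, fall back to a plain sequential sum.
fn pairwise_sum(x: &[f64]) -> f64 {
    if x.len() <= 8 {
        x.iter().sum()
    } else {
        let (a, b) = x.split_at(x.len() / 2);
        pairwise_sum(a) + pairwise_sum(b)
    }
}

fn main() {
    let v: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    println!("{}", pairwise_sum(&v)); // 5050
}
```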
The docs for the correlation methods say:
Let (r, o) be the shape of M:
- r is the number of random variables;
- o is the number of observations we have collected for each random variable.
What this implicitly says is that "M should be a matrix with r rows, corresponding to random variables, and o columns, corresponding to observations". We know this because ndarray has an explicit definition for rows and columns, whereby the first axis refers to the rows and the second axis is the column axis; see for example the nrows and ncols functions.
However, I find this assumption counter-intuitive. The convention in my experience is to use the "tidy" layout, in which each row corresponds to an observation and each column corresponds to a variable. I refer here to Hadley Wickham's work on tidy data.
Also this is how R works:
> mat
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
> nrow(mat)
[1] 4
> ncol(mat)
[1] 2
> cov(mat)
[,1] [,2]
[1,] 1.666667 1.666667
[2,] 1.666667 1.666667
Thirdly, in terms of the Rust data science ecosystem, note that polars (as far as I know, the best-supported data frame library in Rust) outputs matrices with the same assumptions. If you create a DataFrame with 2 series (which correspond to variables) and 3 rows, and run .to_ndarray(), you will get a (3, 2) ndarray. Then when you call .cov() on it, you will get something that is not the covariance matrix you are after.
One argument in defence of the current method is numpy.cov, which makes the same assumption, as it takes:

A 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables.

My suggestion is therefore to consider reversing the assumed dimensions for these methods in the next major (breaking) release. I realise that using .t() is not a difficult thing to do, but unfortunately forgetting to do it will result in a valid matrix that may continue into downstream code without the user realising that it is not the correct covariance matrix. This happened to me and I'd like to spare other users from this issue.
Hello. Do you have plans to port https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde?
There exists a method for weighted standard deviation, weighted_std, but I can't find one for regular unweighted standard deviation.
Is there a reason for this?
Some functions like mean are implemented in both ndarray (here) and ndarray-stats (here). When using SummaryStatisticsExt, the function mean exists twice, which is not a problem but is confusing, as one implementation returns Option and the other Result, and at first it seems to mismatch the documentation.
Is it intended to have some implementations duplicated across the two crates in the long run? If so, ndarray-stats could be more specific by renaming mean to arithmetic_mean (and its axis/weighted variants), being consistent with geometric_mean and harmonic_mean. Feel free to close the issue if you prefer a simple mean.
Hello! I want to use this crate in an embedded environment. Is there any no-std support?
Description
The map_axis_skipnan_mut method returns invalid results for any non-maximal axis.
Version Information
ndarray: "0.15"
ndarray-stats: "0.5.0"

To Reproduce

let mut a = arr2(&[[1., 2., f64::NAN, 4.], [5., 6., 7., 8.], [9., 10., 11., f64::NAN]]);
println!("Initial array:\n{}", &a);
println!("\nLanes:");
let b = a.map_axis_skipnan_mut(Axis(0), |x| { println!("{}", x); x });
println!("\nResults:\n{}", b);
Expected behavior
As far as I understand, the documented behavior, consistent with the map_axis_mut method, would give the following result:
Initial array:
[[1, 2, NaN, 4],
[5, 6, 7, 8],
[9, 10, 11, NaN]]
Lanes:
[1, 5, 9]
[2, 6, 10]
[11, 7]
[4, 8]
Results:
[[1, 5, 9], [2, 6, 10], [11, 7], [4, 8]]
Actual behavior
Initial array:
[[1, 2, NaN, 4],
[5, 6, 7, 8],
[9, 10, 11, NaN]]
Lanes:
[1, 2, NaN]
[2, NaN, 4]
[11, 4]
[4, 5]
Results:
[[1, 2, 11], [2, 11, 4], [11, 4], [4, 5]]
This behavior appears to be clearly invalid. It does not skip NaN values, and the elements present in each lane are not consistent with the selected axis. Applying some methods to the lanes (such as x.sum()) can cause a panic: thread 'main' panicked at 'unexpected NaN', .../.cargo/registry/src/github.com-1ecc6299db9ec823/noisy_float-0.2.0/src/checkers.rs:30:9, as the lanes include NotNan values that are actually NaN.
Additional context
When no NaN values are present in the initial array, the result should be identical (besides the NotNan type) to that of the map_axis_mut function. However, it instead returns slices of what seems to be a flattened version of the initial array.
let mut a = arr2(&[[1., 2., 3., 4.], [5., 6., 7., 8.], [9., 10., 11., 12.]]);
Result:
Lanes:
[1, 2, 3]
[2, 3, 4]
[3, 4, 5]
[4, 5, 6]
The function works as expected when the axis argument is the largest valid axis (a.ndim() - 1).
It appears that while the content of the lanes is invalid, the number of elements in each lane is always correct. My best guess is that the error originates in the remove_nan_mut function; specifically, the slice created in line 63 probably ignores the axis of the ArrayView.
The Sort1dExt trait provides sorting/partitioning methods based on quickselect. What's the reason for choosing this over pdqsort, as used in sort_unstable in std? histogram/bins.rs has several calls to sort_unstable; and how about rayon for its parallel sort?

As soon as rust-ndarray/ndarray#491 is merged in ndarray.
I am not sure if this is the right place to ask this question. First, thanks for your contribution.
I have n time series of Vec<f32> and I want to get a histogram Count with m (usually 50) bins. In Python, it is possible to specify the number of bins needed; is that possible in Rust?
My next problem is, I don't know the observation values at compile time, so I can't build the grid beforehand. Is it possible to redefine the grid once values are added?
For example, I initiate an empty histogram and add observations in a loop (the loop will be a concat of multiple time series until the merge PR is ready :) ). Then I specify the bin count (say 50). It should distribute the bin range over 50 bins based on the min/max. Is this possible?
My use case can be solved using HdrHistogram, but it supports only u64 and not f32. Also, since my application already uses ndarray, I felt this crate would fit very well.
Note: I am new to Rust.
#[test]
fn test_zero_observations() {
    let a = Array2::<f32>::zeros((2, 0));
    let pearson = a.pearson_correlation();
    assert_eq!(pearson.shape(), &[2, 2]);
    let all_nan_flag = pearson.iter().map(|x| x.is_nan()).fold(true, |acc, flag| acc & flag);
    assert_eq!(all_nan_flag, true);
}
This test fails on Travis, for #5, while it succeeds on my local machine.
It's weird - any idea? @jturner314
I am trying to reproduce the error, but I am failing.
Title is self-explanatory. I guess this is also a breaking change, so it should come with a minor version bump. It would also be good if the 2 PRs that are currently open could be sorted out.
Hi, thanks for the lib! When looking at this code:
ndarray-stats/src/quantile/mod.rs
Lines 289 to 304 in b6628c6
I see some error handling. IMHO this may make the code much slower. Since I am only working with f32 or i32 or similar primitive numbers, can we remove such error handling? Thanks!
In terms of functionality, the mid-term end goal is to achieve feature parity with the statistics routines in numpy (here) and Julia's StatsBase (here).
For the next version:
- partialord version for quantiles methods;
- merge method;
For version 0.2.0:
For version 0.1.0:
- max / nanmax (@jturner314)
- min / nanmin (@jturner314)
- quantile / nanquantile (it includes percentile / nanpercentile as a special case) (@LukeMathWalker & @jturner314)
- correlation-methods:
  - cov (@LukeMathWalker)
  - corrcoef (@LukeMathWalker - #5)
- histogram-methods (@LukeMathWalker - #9)

Title is fairly self-explanatory: I've found the need for cosine distance at various times (and other distance metrics) that probably fit well in ndarray-stats. Maybe here we can decide on a few different ones and what is/isn't in scope for this crate.