tdigest

Wicked Fast, Accurate Quantiles Using ‘t-Digests’

Description

The t-Digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This data structure can be used to estimate quantiles, compute other rank statistics, or estimate related measures such as trimmed means. The advantage of the t-Digest over previous digests for this purpose is that it handles data with full floating point resolution. Quantile estimates produced by t-Digests can be orders of magnitude more accurate than those produced by previous digest algorithms. Methods are provided to create and update t-Digests and to retrieve quantiles from the accumulated distributions.

See the original paper by Ted Dunning & Otmar Ertl for more details on t-Digests.

What’s Inside The Tin

The following functions are implemented:

as.list.tdigest: Serialize a tdigest object to an R list or unserialize a serialized tdigest list back into a tdigest object
td_add: Add a value to the t-Digest with the specified count
td_create: Allocate a new histogram
td_merge: Merge one t-Digest into another
td_quantile_of: Return the quantile of the value
td_total_count: Total items contained in the t-Digest
td_value_at: Return the value at the specified quantile
tquantile: Calculate sample quantiles from a t-Digest
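
Two of the functions listed above, td_merge and td_quantile_of, are not shown in the Usage section, so here is a minimal sketch of how they might be combined. It assumes the argument order is td_merge(from, into) and that the target digest is updated in place; please confirm with ?td_merge. The variable names and the normal test data are only illustrative.

```r
library(tdigest)

set.seed(1492)

# Two digests built independently, e.g. on separate chunks of a data stream
td_a <- tdigest(rnorm(10000), 100)
td_b <- tdigest(rnorm(10000, mean = 5), 100)

# Assumed argument order: td_merge(from, into); td_b receives td_a's contents
td_merge(td_a, td_b)

# Combined item count of both inputs
td_total_count(td_b)

# td_quantile_of() maps a value to its estimated quantile;
# td_value_at() maps a quantile back to an estimated value
q <- td_quantile_of(td_b, 2.5)
td_value_at(td_b, q)
```

Merging is what makes it practical to build digests per chunk (or per worker) and combine them afterwards instead of revisiting the raw data.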

Installation

install.packages("tdigest") # NOTE: CRAN version is 0.4.1
# or
remotes::install_gitlab("hrbrmstr/tdigest")

NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.

Usage

library(tdigest)

# current version
packageVersion("tdigest")
## [1] '0.4.2'

Basic (Low-level interface)

td <- td_create(10)

td
## <tdigest; size=0; compression=10; cap=70>

td_total_count(td)
## [1] 0

td_add(td, 0, 1) %>% 
  td_add(10, 1)
## <tdigest; size=2; compression=10; cap=70>

td_total_count(td)
## [1] 2

td_value_at(td, 0.1) == 0
## [1] TRUE
td_value_at(td, 0.5) == 5
## [1] TRUE

quantile(td)
## [1]  0  0  5 10 10

Bigger (and Vectorised)

td <- tdigest(c(0, 10), 10)

is_tdigest(td)
## [1] TRUE

td_value_at(td, 0.1) == 0
## [1] TRUE
td_value_at(td, 0.5) == 5
## [1] TRUE

set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)

td_total_count(td)
## [1] 1e+06

tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574
## [10]  80.3090454  90.2594642  99.4269454 100.0000000

quantile(td)
## [1]   0.00000  24.74751  49.99666  75.24783 100.00000
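
The Description mentions that a t-Digest can also estimate related measures such as trimmed means. The package does not appear to list a dedicated function for this, so the following is only a sketch of one possible approach: average the estimated quantile function over an evenly spaced grid of probabilities between trim and 1 - trim. The names trim, n_grid and probs are assumptions made for the illustration.

```r
library(tdigest)

set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)

trim <- 0.1     # fraction trimmed from each tail
n_grid <- 1000  # hypothetical grid size; larger means a finer approximation

# Midpoints of n_grid equal-width slices of [trim, 1 - trim]
probs <- trim + (seq_len(n_grid) - 0.5) * (1 - 2 * trim) / n_grid

# Average of the estimated quantiles approximates the trimmed mean
approx_trimmed <- mean(tquantile(td, probs))

# Exact trimmed mean from the raw data, for comparison
exact_trimmed <- mean(x, trim = trim)

c(approx = approx_trimmed, exact = exact_trimmed)
```

The approximation needs only the digest, so it still works after the raw data has been discarded.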

Serialization

These [de]serialization functions make it possible to create and populate a tdigest, serialize it out, read it back in at a later time, and continue populating it. This enables compact distribution accumulation and storage for large, “continuous” datasets.

set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)

tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574
## [10]  80.3090454  90.2594642  99.4269454 100.0000000

str(in_r <- as.list(td), 1)
## List of 7
##  $ compression   : num 1000
##  $ cap           : int 6010
##  $ merged_nodes  : int 226
##  $ unmerged_nodes: int 0
##  $ merged_count  : num 1e+06
##  $ unmerged_count: num 0
##  $ nodes         :List of 2
##  - attr(*, "class")= chr [1:2] "tdigest_list" "list"

td2 <- as_tdigest(in_r)
tquantile(td2, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574
## [10]  80.3090454  90.2594642  99.4269454 100.0000000

identical(in_r, as.list(td2))
## [1] TRUE
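
To carry a digest across R sessions, the list form can be written to disk with base R's saveRDS() and restored later. This is a minimal sketch: the temporary file path is a placeholder, and the smaller sample size is chosen only to keep the example quick.

```r
library(tdigest)

set.seed(1492)
td <- tdigest(sample(0:100, 100000, replace = TRUE), 100)

# Placeholder location; use a real path for persistent storage
path <- tempfile(fileext = ".rds")

# Save the plain-list representation rather than the tdigest object itself
saveRDS(as.list(td), path)

# Later (possibly in a new session): restore and keep accumulating
td_restored <- as_tdigest(readRDS(path))
td_add(td_restored, 50, 1)

td_total_count(td_restored)
```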

ALTREP-aware

N <- 1000000
x.altrep <- seq_len(N) # this is an ALTREP in R version >= 3.5.0

td <- tdigest(x.altrep)
td[0.1]
## [1] 93051
td[0.5]
## [1] 491472.5
length(td)
## [1] 1000000

Proof it’s faster

microbenchmark::microbenchmark(
  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
##        expr       min        lq        mean     median        uq     max neval
##     tdigest     3.198     3.731     7.79369     4.4895    12.792    16.4   100
##  r_quantile 39197.353 39445.444 40069.38938 39584.8030 40062.945 43613.3   100
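
The timings above cover only the quantile lookup, not the construction of the digest, and they say nothing about accuracy. As a rough companion check, the sketch below compares t-Digest estimates with exact sample quantiles on the same simulated data used earlier. The probs vector and the result column names are arbitrary choices.

```r
library(tdigest)

set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)

probs <- c(0, 0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99, 1)

est <- tquantile(td, probs)                 # digest-based estimates
exact <- quantile(x, probs, names = FALSE)  # exact sample quantiles

# Side-by-side comparison with absolute error per probability
data.frame(prob = probs, tdigest = est, exact = exact, abs_err = abs(est - exact))
```

Because x holds integers, the exact quantiles land on whole numbers while the digest interpolates between centroids, so small fractional differences are expected.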

tdigest Metrics

Lang	# Files	(%)	LoC	(%)	Blank lines	(%)	Comment lines	(%)
C	3	0.15	499	0.36	71	0.29	45	0.10
R	6	0.30	161	0.12	35	0.14	156	0.34
C/C++ Header	1	0.05	24	0.02	16	0.07	30	0.06
SUM	10	0.50	684	0.50	122	0.50	231	0.50

{cloc} 📦 metrics for tdigest

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Odd behaviour with td_add

Hi,

Hopefully the below makes the issue clear and easy to reproduce. From what I can tell there is weird behaviour when using the current implementation of td_add. Apologies if this is me misunderstanding what behaviour I should expect, but from the references I think there is something wrong.

x=c(rep(3,10),rep(5,10))
x
## [1] 3 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5
td <- tdigest(x)
td
## <tdigest; size=20; compression=100; cap=610>
tquantile(td, c(0, .01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
## [1] 3 3 3 3 3 3 4 5 5 5 5 5 5
quantile(x, c(0, .01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
## 0% 1% 10% 20% 30% 40% 50% 60% 70% 80% 90% 99% 100%
## 3 3 3 3 3 3 4 5 5 5 5 5 5

Everything makes sense up to here. But then:

td_add(td, 8, 10)
## <tdigest; size=30; compression=100; cap=610>
x=c(x,rep(8,10))
quantile(x, c(0, .01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
## 0% 1% 10% 20% 30% 40% 50% 60% 70% 80% 90% 99% 100%
## 3 3 3 3 3 5 5 5 8 8 8 8 8
tquantile(td, c(0, .01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
## [1] 3.000000 3.000000 3.000000 3.000000 3.000000 5.000000 5.000000 5.000000 8.272727 9.909091 8.000000 8.000000 8.000000

Something odd has happened: tquantile returns values larger than the largest value in the data (8.27 at the 0.7 quantile and 9.91 at the 0.8 quantile, where the maximum is clearly 8). This behaviour does not occur if we rebuild the tdigest object from the full vector:

td <- tdigest(x)
tquantile(td, c(0, .01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
## [1] 3 3 3 3 3 5 5 5 8 8 8 8 8

This agrees with quantile() again.
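
A compact way to flag the symptom, as a sketch only: every estimate should lie within range(x), so the check below lists the probabilities whose estimates fall outside it. The names probs and out_of_range are just for illustration, and whether anything is flagged will depend on the installed package version.

```r
library(tdigest)

probs <- c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)

# Same steps as in the report: build a digest, then add a weighted value
x <- c(rep(3, 10), rep(5, 10))
td <- tdigest(x)
td_add(td, 8, 10)
x <- c(x, rep(8, 10))

est <- tquantile(td, probs)

# Estimates outside the observed data range indicate the problem
out_of_range <- est < min(x) | est > max(x)
probs[out_of_range]
```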

Hopefully the above is not misleading or simply down to me misunderstanding.

Cheers,

Aaron
