Git Product home page Git Product logo

tdigest's Introduction

Project Status: Active – The project has reached a stable, usable state and is being actively developed. Signed by Signed commit % Linux build Status builds.sr.ht status Windows build status Coverage Status cran checks CRAN status Minimal R Version License DOI

tdigest

Wicked Fast, Accurate Quantiles Using ‘t-Digests’

Description

The t-Digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-Digest data structure can be used to estimate quantiles, compute other rank statistics or even to estimate related measures like trimmed means. The advantage of the t-Digest over previous digests for this purpose is that the t-Digest handles data with full floating point resolution. The accuracy of quantile estimates produced by t-Digests can be orders of magnitude more accurate than those produced by previous digest algorithms. Methods are provided to create and update t-Digests and retreive quantiles from the accumulated distributions.

See the original paper by Ted Dunning & Otmar Ertl for more details on t-Digests.

What’s Inside The Tin

The following functions are implemented:

  • as.list.tdigest: Serialize a tdigest object to an R list or unserialize a serialized tdigest list back into a tdigest object
  • td_add: Add a value to the t-Digest with the specified count
  • td_create: Allocate a new histogram
  • td_merge: Merge one t-Digest into another
  • td_quantile_of: Return the quantile of the value
  • td_total_count: Total items contained in the t-Digest
  • td_value_at: Return the value at the specified quantile
  • tquantile: Calculate sample quantiles from a t-Digest

Installation

install.packages("tdigest") # NOTE: CRAN version is 0.3.0
# or
remotes::install_github("hrbrmstr/tdigest")

NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.

Usage

library(tdigest)

# current version
packageVersion("tdigest")
## [1] '0.4.1'

Basic (Low-level interface)

td <- td_create(10)

td
## <tdigest; size=0; compression=10; cap=70>

td_total_count(td)
## [1] 0

td_add(td, 0, 1) %>% 
  td_add(10, 1)
## <tdigest; size=2; compression=10; cap=70>

td_total_count(td)
## [1] 2

td_value_at(td, 0.1) == 0
## [1] TRUE
td_value_at(td, 0.5) == 5
## [1] TRUE

quantile(td)
## [1]  0  0  5 10 10

Bigger (and Vectorised)

td <- tdigest(c(0, 10), 10)

is_tdigest(td)
## [1] TRUE

td_value_at(td, 0.1) == 0
## [1] TRUE
td_value_at(td, 0.5) == 5
## [1] TRUE

set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)

td_total_count(td)
## [1] 1e+06

tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574
## [10]  80.3090454  90.2594642  99.4269454 100.0000000

quantile(td)
## [1]   0.00000  24.74751  49.99666  75.24783 100.00000

Serialization

These [de]serialization functions make it possible to create & populate a tdigest, serialize it out, read it in at a later time and continue populating it enabling compact distribution accumulation & storage for large, “continuous” datasets.

set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)

tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574
## [10]  80.3090454  90.2594642  99.4269454 100.0000000

str(in_r <- as.list(td), 1)
## List of 7
##  $ compression   : num 1000
##  $ cap           : int 6010
##  $ merged_nodes  : int 226
##  $ unmerged_nodes: int 0
##  $ merged_count  : num 1e+06
##  $ unmerged_count: num 0
##  $ nodes         :List of 2
##  - attr(*, "class")= chr [1:2] "tdigest_list" "list"

td2 <- as_tdigest(in_r)
tquantile(td2, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
##  [1]   0.0000000   0.8099857   9.6725790  19.7533723  29.7448283  39.7544675  49.9966628  60.0235148  70.2067574
## [10]  80.3090454  90.2594642  99.4269454 100.0000000

identical(in_r, as.list(td2))
## [1] TRUE

ALTREP-aware

N <- 1000000
x.altrep <- seq_len(N) # this is an ALTREP in R version >= 3.5.0

td <- tdigest(x.altrep)
td[0.1]
## [1] 93051
td[0.5]
## [1] 491472.5
length(td)
## [1] 1000000

Proof it’s faster

microbenchmark::microbenchmark(
  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
##        expr       min         lq        mean     median         uq       max neval cld
##     tdigest     3.198     3.6695     7.97286     4.4485    12.5255    18.245   100  a 
##  r_quantile 39631.461 39965.9800 40628.32106 40127.4380 40648.0970 43751.469   100   b

tdigest Metrics

Lang # Files (%) LoC (%) Blank lines (%) # Lines (%)
C 3 0.14 486 0.34 77 0.22 46 0.08
R 6 0.27 161 0.11 35 0.10 156 0.27
Rmd 1 0.05 44 0.03 47 0.13 58 0.10
C/C++ Header 1 0.05 24 0.02 16 0.05 30 0.05
SUM 11 0.50 715 0.50 175 0.50 290 0.50

clock Package Metrics for tdigest

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

tdigest's People

Contributors

hrbrmstr avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

tdigest's Issues

Odd behaviour with td_add

Hi,

Hopefully the below makes the issue clear and easy to reproduce. From what I can tell there is weird behaviour when using the current implementation of td_add. Apologies if this is me misunderstanding what behaviour I should expect, but from the references I think there is something wrong.

x=c(rep(3,10),rep(5,10))
x
[1] 3 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5
td <- tdigest(x)
td
<tdigest; size=20; compression=100; cap=610>
tquantile(td, c(0, .01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
[1] 3 3 3 3 3 3 4 5 5 5 5 5 5
quantile(x, c(0, .01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
0% 1% 10% 20% 30% 40% 50% 60% 70% 80% 90% 99% 100%
3 3 3 3 3 3 4 5 5 5 5 5 5

Everything makes sense up to hear. But then:

td_add(td, 8, 10)
<tdigest; size=30; compression=100; cap=610>
x=c(x,rep(8,10))
quantile(x, c(0, .01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
0% 1% 10% 20% 30% 40% 50% 60% 70% 80% 90% 99% 100%
3 3 3 3 3 5 5 5 8 8 8 8 8
tquantile(td, c(0, .01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
[1] 3.000000 3.000000 3.000000 3.000000 3.000000 5.000000 5.000000 5.000000 8.272727 9.909091 8.000000 8.000000 8.000000

Something odd has happened, with tquantile solutions that are larger than the largest replicated value (8.27 and 9.91 where the largest value is clearly 8). This behaviour does not occur if we remake the tdigest object:

td <- tdigest(x)
tquantile(td, c(0, .01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
[1] 3 3 3 3 3 5 5 5 8 8 8 8 8

This agrees again.

Hopefully the above is not misleading and down to me misunderstanding.

Cheers,

Aaron

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.