mpascariu / ungroup
Estimating Smooth Distributions from Coarsely Binned Data - R Package
License: Other
Great package, I use the heck out of it, but it's tricky to install at the moment.
The following observations are from a user perspective:
MortalitySmooth is now in the CRAN archives, maybe in part because its dependency svcm has been archived. For this reason ungroup has now been archived too, and its dependencies have installation problems.
If one does:
install.packages("https://cran.r-project.org/src/contrib/Archive/svcm/svcm_0.1.2.tar.gz", repos = NULL, type = "source")
install.packages("https://cran.r-project.org/src/contrib/Archive/MortalitySmooth/MortalitySmooth_2.3.4.tar.gz", repos = NULL, type = "source")
followed by
remotes::install_github("mpascariu/ungroup")
Then one can install it (provided Rcpp, RcppEigen, and rgl are available).
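Before attempting the source installs above, it may help to check which of those CRAN dependencies are already present. A minimal sketch (the package names are the ones mentioned in this issue):

```r
# Check whether the CRAN dependencies mentioned above are already
# installed; any that come back FALSE would need install.packages()
# before building the archived packages from source.
deps <- c("Rcpp", "RcppEigen", "rgl")
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing <- deps[!installed]
# if (length(missing) > 0) install.packages(missing)
```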
From a package-maintainer perspective, I do not know what the proper course of action is. I've emailed the maintainers of svcm and MortalitySmooth, and you too, proposing a temporary solution.
My case is that I have population data for most of the period only every third year, but I have mortality data each year (though that might be missing as well for other regions). It should in theory be possible to interpolate over the years with missing data. The only thing I have come up with is this ad hoc solution:
Say I have population data at time points 1 and 3, and mortality data for 1, 2, and 3. Then the estimated population at time point 2 would be some kind of mean of pop_hat_t1 - mortality_hat_t1 (with the x axis then shifted) and pop_hat_t3 + mortality_hat_t3 (where "hat" means estimated), possibly iterating back and forward to minimize the difference. But that would be cumbersome when it should be possible to just smooth instead (albeit with some information borrowed from the future for new-borns/edge cases).
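A rough base-R sketch of that averaging idea, with made-up numbers and deliberately simplified cohort indexing (the real shift along the x axis needs more care than a one-position offset):

```r
# Hypothetical single-age estimates at times 1 and 3, and deaths in the
# two intervening year intervals (all numbers invented for illustration).
pop1 <- c(100, 95, 90, 80, 60)
pop3 <- c(102, 96, 88, 78, 55)
d1   <- c(1, 1, 2, 4, 8)    # deaths between t1 and t2
d2   <- c(1, 1, 2, 4, 7)    # deaths between t2 and t3

# Forward projection: survive pop1 one year and shift up one age.
fwd <- c(NA, head(pop1 - d1, -1))
# Backward projection: add deaths back to pop3 and shift down one age.
bwd <- c(tail(pop3 + d2, -1), NA)
# Estimate at time 2 as the element-wise mean of the two projections.
pop2_hat <- rowMeans(cbind(fwd, bwd), na.rm = TRUE)
```

At the edges only one projection is available (new-borns have no forward source, the open age no backward one), which is exactly the edge case mentioned above.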
Loving this package, but presently stumped with a simple case. Here are some single-age example data from the DemoTools package (pop1m_ind). I group them to almost-abridged ages (except that 1-4 is here split into 1-2 and 3-4). pclm() warns heavily for this case, but returns some fitted values. The sum of the fitted values isn't particularly close to the sum of the original vector, however, and a look at the single-age output shows unexpected patterns. Any guidance?
NB1: I get the same behavior if all age groups are of uniform width 5.
NB2: I tried giving control = list(lambda = 1e-5) and similar, but got the same behavior.
library(ungroup)
# start with some single age data
y_orig <- c(9544406, 7471790, 11590109, 11881844, 11872503, 12968350, 11993151,
10033918, 14312222, 8111523, 15311047, 6861510, 13305117, 7454575,
9015381, 10325432, 9055588, 5519173, 12546779, 4784102, 13365429,
4630254, 9595545, 4727963, 5195032, 15061479, 5467392, 4011539,
8033850, 1972327, 17396266, 1647397, 6539557, 2233521, 2101024,
16768198, 3211834, 1923169, 4472854, 1182245, 15874081, 1017752,
3673865, 1247304, 1029243, 12619050, 1499847, 1250321, 2862148,
723195, 12396632, 733501, 2186678, 777379, 810700, 7298270, 1116032,
650402, 1465209, 411834, 9478824, 429296, 1190060, 446290, 362767,
4998209, 388753, 334629, 593906, 178133, 4560342, 179460, 481230,
159087, 155831, 1606147, 166763, 93569, 182238, 53567, 1715697,
127486, 150782, 52332, 48664, 456387, 46978, 34448, 44015, 19172,
329149, 48004, 28574, 9200, 7003, 75195, 13140, 5889, 18915,
21221, 72373)
x <- c(0, 1, 3, seq(5, 100, 5))
agi <- c(1, 2, 2, rep(5, 19), 1)
y <- tapply(y_orig, rep(x, times = agi), sum)
nlast <- 1
# equal length
# length(x); length(y)
# lots of warnings
# In y_/K$muA : longer object length is not a multiple of shorter object length
y1mod <- pclm(x = x, y = y, nlast = nlast)
# sums far off
sum(y1mod$fitted)
# 393272639 for me
sum(y)
# 432502563
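As a stopgap (not a fix for the underlying fit), the fitted single-age values can be rescaled so that they preserve the grouped total. Illustrated here with stand-in numbers rather than the pclm output above:

```r
# Stand-in numbers; in the real case `fitted` would be y1mod$fitted
# and `total` would be sum(y).
fitted <- c(90, 80, 70, 60)
total  <- 330
# Proportional rescaling so the single-age values sum to the grouped total.
fitted_rescaled <- fitted * total / sum(fitted)
```

This hides, rather than explains, whatever causes the mismatch, but it keeps downstream totals consistent while the warnings are investigated.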
Testing the ungroup package on the Bangladesh data. Issue raised by Jonas Scholey on January 22, 2018.
ug <- pclm2D(x = bang_Dx_2d$age, y = bang_Dx_2d[,-1],
offset = bang_Nx_2d[,-1], nlast = 20)
## Error: 'y' contains NA values
NAs are not allowed. The problem with the Bangladesh data is that the last age group is not the same across the years: from 1975 to 1981 the last observed age is 65+, from 1982 to 1983 it's 80+, and after that it's 85+. The varying open age is the source of the NAs at higher ages.
Right now my only option would be to truncate all life tables to have a last age group of 65+ and then run the 2D pclm on these truncated life tables. Not good, as I throw away data this way.
I'm not sure about the proper solution. I know that PCLM needs a rectangular surface, so part of the solution would be to have a fixed omega (closing age of the life table) instead of a fixed nlast parameter. This ensures that even if the starting age of the last age group varies, the upper limit of the last age group will not. Therefore I propose you have an omega (closing age of the life table) parameter instead of an nlast parameter.
So one solution would be for PCLM to recode any trailing series of NAs over the last ages to 0s and redistribute the last observed Dx and Nx over [last observed x, omega]. You'd still have to think about how to deal with a cohort that dies out before reaching the last age group...
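A sketch of that recoding step on a single hypothetical year column (the redistribution over [last observed x, omega] is left as a comment, since its exact form is the open question):

```r
# Hypothetical death counts for one year; the trailing NAs come from a
# varying open age group, as in the Bangladesh data described above.
Dx <- c(120, 80, 60, 45, 30, NA, NA, NA)
last_obs <- max(which(!is.na(Dx)))   # index of the open age group
# Zero out ages above the last observed one so the surface is rectangular;
# Dx[last_obs] would then be redistributed over [last observed x, omega).
Dx_rect <- Dx
Dx_rect[seq_along(Dx_rect) > last_obs] <- 0
```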
More generally, I strongly prefer an na.rm option as present in many base R functions. As PCLM is a smoother, it may very well be used to smooth over missing values. Missings due to varying last-age groups are tricky, but missings in between are not; you can simply smooth over them.
OK. But for now let's stick to truncating my data. The data is truncated so that the last age group is 65 across all periods. Accordingly I choose a higher nlast.
age_trunc <- bang_Dx_2d$age[1:18]
anyNA(age_trunc)
## [1] FALSE
bang_Dx_2d_trunc <- bang_Dx_2d[1:18, -1]
anyNA(bang_Dx_2d_trunc)
## [1] FALSE
bang_Nx_2d_trunc <- bang_Nx_2d[1:18, -1]
anyNA(bang_Nx_2d_trunc)
## [1] FALSE
ug <- pclm2D(x = age_trunc, y = bang_Dx_2d_trunc,
offset = bang_Nx_2d_trunc, nlast = 40)
## Ungrouping offset Ungrouping data
## Error in if ((d < tol || dd < 0.1) && it >= 4) break: missing value where TRUE/FALSE needed
I don't know what went wrong here.
Hi MP! I think you could save the current rgl magic under \dontrun{}, and make plot.pclm2D() use something inspired by this under the hood:
library(ungroup)   # provides the ungroup.data example dataset
Dx <- ungroup.data$Dx
Ex <- ungroup.data$Ex
# Aggregate data to ungroup it in the examples below
x <- c(0, 1, seq(5, 85, by = 5))
nlast <- 26
n <- c(diff(x), nlast)
group <- rep(x, n)
y <- aggregate(Dx, by = list(group), FUN = "sum")[, -1]
offset <- aggregate(Ex, by = list(group), FUN = "sum")[, -1]
P1 <- pclm2D(x, y, nlast)
x <- P1$bin.definition$output$location["left", ]
y <- as.integer(colnames(P1$fitted))   # avoids the magrittr pipe dependency
z <- P1$fitted
# Mean of the four corners of each facet, used to pick a fill color
z.facet.center <- (z[-1, -1] + z[-1, -ncol(z)] + z[-nrow(z), -1] + z[-nrow(z), -ncol(z)]) / 4
# Map each facet center to one of 100 colors
z.facet.range <- cut(z.facet.center, 100)
colpal <- colorspace::sequential_hcl(100, palette = "ag_GrnYl")
persp(x = x,
      y = y,
      z = z,
      theta = 210,
      col = colpal[z.facet.range],
      border = "#AAAAAA50",
      shade = NA,
      zlab = "fitted")
The following worked for an earlier version of pclm:
x <- c(14:19, seq(20, 50, by = 5))
y <- c(0, 5.89614302375851, 27.5281691833457, 154.360824153406, 404.073482638157,
826.841462608498, 15596.6097885998, 31266.5457249206, 32973.6915087617,
28942.3542158762, 14290.9937703288, 1988.94222234551, 25.5661614888236
)
pclm(x = x, y = y, nlast = 5)
# Error in if ((d < tol || dd < 0.1) && it >= 4) break :
# missing value where TRUE/FALSE needed
Trial and error leads me to conclude that this is due to the 0 in the first age bin. In this case I'd like to just get that 0 back. Perhaps fitting could be shielded from its nefarious effects while still preserving the 0 in situ. Likewise for NAs (previous issue)? Or, if you consider the error to be justified, then it merits documentation and an informative warning. Another workaround would be a weight vector, so that I could assign a weight of 0 myself to bins whose values are 0. Or is that possible already?
Hi Marius!
Is there something wrong with this set of steps? This example data is coming from DDSQLtools, where you once left a note to dig deeper if ever a value > 1000 were produced. The example data don't seem pathological, and the extrapolation isn't that far. HT @cimentadaj
mx <- c(0.0296771105565131, 0.00665264110919088, 0.00342875960445963,
0.00377901902282611, 0.00707116273138672, 0.0134041945263743,
0.0184073459431529, 0.0222593253608793, 0.0257132898736745, 0.0360888832546771,
0.0481218078806996, 0.0734298380538821, 0.107445906251669, 0.153227233797312,
0.192783421516418, 0.254067819476128, 0.349650365024805, 0.604515542507172,
0.705407912909985, 0.828063756883144)
x <- c(0,1,seq(5,90,by=5))
M <- MortalityLaw(
x = x,
mx = mx,
fit.this.x = c(75, 80, 85, 90),
law = "makeham",
opt.method = "poissonL"
)
pv <- predict(object = M,
x = c(90, 95, 100, 105, 110))
range(pv)
summary(M)
Thank you for all your hard work on this very useful package.
I have noticed that when I try to ungroup mortality rates of uncommon diseases, I get negative rates in the output. For example:
library(ungroup) # CRAN version 1.1.0
m1 <- pclm(seq(0, 80, 5), c(0, 0, 0, 0, 0, 0, 0, 0, 5, 15, 10, 17, 20, 30, 40, 45, 49), 21, rep(1e5, 17))
summary(m1$fitted)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -1.812e-01 -5.685e-04 -1.947e-04 -4.592e-03 -8.000e-08 0.000e+00
I was wondering if there is already a way to constrain the rates to non-negative values. Otherwise it might be a good idea to add this constraint to the fitting algorithm.
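Until such a constraint exists inside the fitting algorithm, a post-hoc clamp is one workaround, sketched here on stand-in values rather than the m1$fitted output above (note this changes the total, so a rescale may be wanted afterwards):

```r
# Stand-in fitted rates; in the real case this would be m1$fitted.
fitted  <- c(-1.8e-1, -5.7e-4, 2.0e-3, 4.1e-3)
clamped <- pmax(fitted, 0)   # force negative rates to zero
```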
It seems one could allow for 0s in the count being split with some simple filters. Here's an example where the open age group (100+) has a count of 0, and the fit breaks. If we only supply data up to age 95, then it works. I can imagine some easy filters handling final age groups, or temporarily imputing a positive number for fitting and then putting the 0 back in afterwards. I do not know what the best way to handle 0s at intermediate ages would be, however. Any thoughts?
Age <- c(0,1,seq(5,100,by=5))
Value <- c(10000, 44170, 44775, 42142, 38464, 34406, 30386, 26933, 23481,
20602, 16489, 14248, 9928, 8490, 4801, 3599, 2048, 941, 326,
80, 17, 0)
n <- length(Age)
# error
pclm(x = Age, y = Value, nlast = 1)
# works
pclm(x = Age[-n], y = Value[-n], nlast = 5)
# putting a positive value in the 0 makes it not break:
# so maybe a before-after treatment would suffice?
Value2 <- c(10000, 44170, 44775, 42142, 38464, 34406, 30386, 26933, 23481,
20602, 16489, 14248, 9928, 8490, 4801, 3599, 2048, 941, 326,
80, 17, 5)
pclm(x = Age, y = Value2, nlast = 1)
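And a sketch of the restore half of that before-after treatment, with hypothetical stand-ins for the fitted output (in the real case the ages would come from the pclm result's bin definitions):

```r
# Hypothetical single-age output after fitting with Value2 (placeholder 5
# in the open group); zero out the ages covered by the originally-empty
# open age group.
fitted_ages <- 96:100
fitted_vals <- c(9, 5, 3, 2, 4)
open_age    <- 100                       # open group that really held a 0
fitted_vals[fitted_ages >= open_age] <- 0
```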