mpascariu / ungroup
Estimating Smooth Distributions from Coarsely Binned Data - R Package
License: Other
Great package, I use the heck out of it, but it's tricky to install at the moment.
The following observations are from a user perspective:
MortalitySmooth is now in the CRAN archives, maybe in part because its dependency svcm has been archived. For this reason ungroup has now been archived too, and its dependencies have installation problems.
If one does:
install.packages("https://cran.r-project.org/src/contrib/Archive/svcm/svcm_0.1.2.tar.gz", repos = NULL, type = "source")
install.packages("https://cran.r-project.org/src/contrib/Archive/MortalitySmooth/MortalitySmooth_2.3.4.tar.gz", repos = NULL, type = "source")
followed by
remotes::install_github("mpascariu/ungroup")
Then one can install it (provided Rcpp, RcppEigen, and rgl are available).
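Before attempting the source installs above, it may help to check which of those CRAN dependencies are already present. A minimal sketch (the package names are the ones mentioned in this issue):

```r
# Check whether the CRAN dependencies mentioned above are already
# installed; any that come back FALSE would need install.packages()
# before building the archived packages from source.
deps <- c("Rcpp", "RcppEigen", "rgl")
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing <- deps[!installed]
# if (length(missing) > 0) install.packages(missing)
```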
From a package-maintainer perspective, I do not know what the proper course of action is. I've emailed the maintainers of svcm and MortalitySmooth, and you too, proposing a temporary solution.
My case is that I have population data for most of the period only every third year, but I have mortality data each year (though that might be missing as well for other regions). It should in theory be possible to interpolate over the years with missing data. The only thing I have come up with is this ad hoc solution:
Say I have population data at time points 1 and 3, and mortality data for 1, 2, and 3. Then the estimated population at time point 2 would be some kind of mean of pop_hat_t1 - mortality_hat_t1 (with the x axis then shifted) and pop_hat_t3 + mortality_hat_t3 (where "hat" means estimated), possibly iterating back and forward to minimize the difference. But that would be cumbersome when it should be possible to just smooth instead (albeit with some information borrowed from the future for new-borns/edge cases).
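A rough base-R sketch of that averaging idea, with made-up numbers and deliberately simplified cohort indexing (the real shift along the x axis needs more care than a one-position offset):

```r
# Hypothetical single-age estimates at times 1 and 3, and deaths in the
# two intervening year intervals (all numbers invented for illustration).
pop1 <- c(100, 95, 90, 80, 60)
pop3 <- c(102, 96, 88, 78, 55)
d1   <- c(1, 1, 2, 4, 8)    # deaths between t1 and t2
d2   <- c(1, 1, 2, 4, 7)    # deaths between t2 and t3

# Forward projection: survive pop1 one year and shift up one age.
fwd <- c(NA, head(pop1 - d1, -1))
# Backward projection: add deaths back to pop3 and shift down one age.
bwd <- c(tail(pop3 + d2, -1), NA)
# Estimate at time 2 as the element-wise mean of the two projections.
pop2_hat <- rowMeans(cbind(fwd, bwd), na.rm = TRUE)
```

At the edges only one projection is available (new-borns have no forward source, the open age no backward one), which is exactly the edge case mentioned above.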
Loving this package, but presently stumped with a simple case. Here are some single-age example data from the DemoTools package (pop1m_ind). I group them to almost-abridged ages (except that 1-4 is here split into 1-2 and 3-4). pclm() warns heavily for this case, but returns some fitted values. The sum of the fitted values isn't particularly close to the sum of the original vector, however, and a look at the single-age output shows unexpected patterns. Any guidance?
NB1: I get the same behavior if all age groups are of uniform width 5.
NB2: I tried giving control = list(lambda = 1e-5) and similar, but got the same behavior.
library(ungroup)
# start with some single age data
y_orig <- c(9544406, 7471790, 11590109, 11881844, 11872503, 12968350, 11993151,
10033918, 14312222, 8111523, 15311047, 6861510, 13305117, 7454575,
9015381, 10325432, 9055588, 5519173, 12546779, 4784102, 13365429,
4630254, 9595545, 4727963, 5195032, 15061479, 5467392, 4011539,
8033850, 1972327, 17396266, 1647397, 6539557, 2233521, 2101024,
16768198, 3211834, 1923169, 4472854, 1182245, 15874081, 1017752,
3673865, 1247304, 1029243, 12619050, 1499847, 1250321, 2862148,
723195, 12396632, 733501, 2186678, 777379, 810700, 7298270, 1116032,
650402, 1465209, 411834, 9478824, 429296, 1190060, 446290, 362767,
4998209, 388753, 334629, 593906, 178133, 4560342, 179460, 481230,
159087, 155831, 1606147, 166763, 93569, 182238, 53567, 1715697,
127486, 150782, 52332, 48664, 456387, 46978, 34448, 44015, 19172,
329149, 48004, 28574, 9200, 7003, 75195, 13140, 5889, 18915,
21221, 72373)
x <- c(0, 1, 3, seq(5, 100, 5))
agi <- c(1, 2, 2, rep(5, 19), 1)
y <- tapply(y_orig, rep(x, times = agi), sum)
nlast <- 1
# equal length
# length(x); length(y)
# lots of warnings
# In y_/K$muA : longer object length is not a multiple of shorter object length
y1mod <- pclm(x = x, y = y, nlast = nlast)
# sums far off
sum(y1mod$fitted)
# 393272639 for me
sum(y)
# 432502563
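As a stopgap (not a fix for the underlying fit), the fitted single-age values can be rescaled so that they preserve the grouped total. Illustrated here with stand-in numbers rather than the pclm output above:

```r
# Stand-in numbers; in the real case `fitted` would be y1mod$fitted
# and `total` would be sum(y).
fitted <- c(90, 80, 70, 60)
total  <- 330
# Proportional rescaling so the single-age values sum to the grouped total.
fitted_rescaled <- fitted * total / sum(fitted)
```

This hides, rather than explains, whatever causes the mismatch, but it keeps downstream totals consistent while the warnings are investigated.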
Testing the ungroup package on the Bangladesh data. Issue raised by Jonas Scholey on January 22, 2018.
ug <- pclm2D(x = bang_Dx_2d$age, y = bang_Dx_2d[,-1],
offset = bang_Nx_2d[,-1], nlast = 20)
## Error: 'y' contains NA values
NAs are not allowed. The problem with the Bangladesh data is that the last age group is not the same across the years: from 1975 to 1981 the last observed age is 65+, from 1982 to 1983 it's 80+, and after that it's 85+. The varying open age is the source of the NAs at higher ages.
Right now my only option would be to truncate all life tables to have a last age group of 65+ and then run the 2D pclm on these truncated life tables. Not good, as I throw away data this way.
I'm not sure about the proper solution. I know that PCLM needs a rectangular surface, so part of the solution would be to have a fixed omega (closing age of the life table) instead of a fixed nlast parameter. This ensures that even if the starting age of the last age group varies, the upper limit of the last age group will not. Therefore I propose you have an omega (closing age of the life table) parameter instead of an nlast parameter.
So one solution would be for PCLM to recode any trailing series of NAs over the last ages to 0s and redistribute the last observed Dx and Nx over [last observed x, omega]. You'd still have to think about how to deal with a cohort that dies out before reaching the last age group...
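A sketch of that recoding step on a single hypothetical year column (the redistribution over [last observed x, omega] is left as a comment, since its exact form is the open question):

```r
# Hypothetical death counts for one year; the trailing NAs come from a
# varying open age group, as in the Bangladesh data described above.
Dx <- c(120, 80, 60, 45, 30, NA, NA, NA)
last_obs <- max(which(!is.na(Dx)))   # index of the open age group
# Zero out ages above the last observed one so the surface is rectangular;
# Dx[last_obs] would then be redistributed over [last observed x, omega).
Dx_rect <- Dx
Dx_rect[seq_along(Dx_rect) > last_obs] <- 0
```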
More generally, I strongly prefer an na.rm option as present in many base R functions. As PCLM is a smoother, it may very well be used to smooth over missing values. Missings due to varying last-age groups are tricky, but missings in between are not; you can simply smooth over them.
OK. But for now let's stick to truncating my data. The data is truncated so that the last age group is 65 across all periods. Accordingly I choose a higher nlast.
age_trunc <- bang_Dx_2d$age[1:18]
anyNA(age_trunc)
## [1] FALSE
bang_Dx_2d_trunc <- bang_Dx_2d[1:18, -1]
anyNA(bang_Dx_2d_trunc)
## [1] FALSE
bang_Nx_2d_trunc <- bang_Nx_2d[1:18, -1]
anyNA(bang_Nx_2d_trunc)
## [1] FALSE
ug <- pclm2D(x = age_trunc, y = bang_Dx_2d_trunc,
offset = bang_Nx_2d_trunc, nlast = 40)
## Ungrouping offset Ungrouping data
## Error in if ((d < tol || dd < 0.1) && it >= 4) break: missing value where TRUE/FALSE needed
I don't know what went wrong here.
Hi MP! I think you could save the current rgl magic under \dontrun{}, and make plot.pclm2D() use something inspired by this under the hood:
library(ungroup)   # provides the ungroup.data example dataset
Dx <- ungroup.data$Dx
Ex <- ungroup.data$Ex
# Aggregate data to ungroup it in the examples below
x <- c(0, 1, seq(5, 85, by = 5))
nlast <- 26
n <- c(diff(x), nlast)
group <- rep(x, n)
y <- aggregate(Dx, by = list(group), FUN = "sum")[, -1]
offset <- aggregate(Ex, by = list(group), FUN = "sum")[, -1]
P1 <- pclm2D(x, y, nlast)
x <- P1$bin.definition$output$location["left", ]
y <- as.integer(colnames(P1$fitted))   # avoids the magrittr pipe dependency
z <- P1$fitted
# Mean of the four corners of each facet, used to pick a fill color
z.facet.center <- (z[-1, -1] + z[-1, -ncol(z)] + z[-nrow(z), -1] + z[-nrow(z), -ncol(z)]) / 4
# Map each facet center to one of 100 colors
z.facet.range <- cut(z.facet.center, 100)
colpal <- colorspace::sequential_hcl(100, palette = "ag_GrnYl")
persp(x = x,
      y = y,
      z = z,
      theta = 210,
      col = colpal[z.facet.range],
      border = "#AAAAAA50",
      shade = NA,
      zlab = "fitted")
The following worked for an earlier version of pclm:
x <- c(14:19, seq(20, 50, by = 5))
y <- c(0, 5.89614302375851, 27.5281691833457, 154.360824153406, 404.073482638157,
826.841462608498, 15596.6097885998, 31266.5457249206, 32973.6915087617,
28942.3542158762, 14290.9937703288, 1988.94222234551, 25.5661614888236
)
pclm(x = x, y = y, nlast = 5)
# Error in if ((d < tol || dd < 0.1) && it >= 4) break :
# missing value where TRUE/FALSE needed
Trial and error leads me to conclude that this is due to the 0 in the first age bin. In this case I'd like to just get that 0 back. Perhaps fitting could be shielded from its nefarious effects while still preserving the 0 in situ. Likewise for NAs (previous issue)? Or, if you consider the error to be justified, then it merits documentation and an informative warning. Another workaround would be a weight vector, so that I could assign a weight of 0 myself to bins whose values are 0. Or is that possible already?
Hi Marius!
Is there something wrong with this set of steps? This example data is coming from DDSQLtools, where you once left a note to dig deeper if ever a value > 1000 were produced. The example data don't seem pathological, and the extrapolation isn't that far. HT @cimentadaj
mx <- c(0.0296771105565131, 0.00665264110919088, 0.00342875960445963,
0.00377901902282611, 0.00707116273138672, 0.0134041945263743,
0.0184073459431529, 0.0222593253608793, 0.0257132898736745, 0.0360888832546771,
0.0481218078806996, 0.0734298380538821, 0.107445906251669, 0.153227233797312,
0.192783421516418, 0.254067819476128, 0.349650365024805, 0.604515542507172,
0.705407912909985, 0.828063756883144)
x <- c(0,1,seq(5,90,by=5))
M <- MortalityLaw(
x = x,
mx = mx,
fit.this.x = c(75, 80, 85, 90),
law = "makeham",
opt.method = "poissonL"
)
pv <- predict(object = M,
x = c(90, 95, 100, 105, 110))
range(pv)
summary(M)
Thank you for all your hard work on this very useful package.
I have noticed that when I try to ungroup mortality rates of uncommon diseases, I get negative rates in the output. For example:
library(ungroup) # CRAN version 1.1.0
m1 <- pclm(seq(0, 80, 5), c(0, 0, 0, 0, 0, 0, 0, 0, 5, 15, 10, 17, 20, 30, 40, 45, 49), 21, rep(1e5, 17))
summary(m1$fitted)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -1.812e-01 -5.685e-04 -1.947e-04 -4.592e-03 -8.000e-08 0.000e+00
I was wondering if there is already a way to constrain the rates to non-negative values. Otherwise it might be a good idea to add this constraint to the fitting algorithm.
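Until such a constraint exists inside the fitting algorithm, a post-hoc clamp is one workaround, sketched here on stand-in values rather than the m1$fitted output above (note this changes the total, so a rescale may be wanted afterwards):

```r
# Stand-in fitted rates; in the real case this would be m1$fitted.
fitted  <- c(-1.8e-1, -5.7e-4, 2.0e-3, 4.1e-3)
clamped <- pmax(fitted, 0)   # force negative rates to zero
```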
It seems one could allow for 0s in the count being split with some simple filters. Here's an example where the open age group (100+) has a count of 0, and the fit breaks. If we only supply data up to age 95, then it works. I can imagine some easy filters handling final age groups, or temporarily imputing a positive number for fitting and then putting the 0 back in afterwards. I do not know what the best way to handle 0s at intermediate ages would be, however. Any thoughts?
Age <- c(0,1,seq(5,100,by=5))
Value <- c(10000, 44170, 44775, 42142, 38464, 34406, 30386, 26933, 23481,
20602, 16489, 14248, 9928, 8490, 4801, 3599, 2048, 941, 326,
80, 17, 0)
n <- length(Age)
# error
pclm(x = Age, y = Value, nlast = 1)
# works
pclm(x = Age[-n], y = Value[-n], nlast = 5)
# putting a positive value in the 0 makes it not break:
# so maybe a before-after treatment would suffice?
Value2 <- c(10000, 44170, 44775, 42142, 38464, 34406, 30386, 26933, 23481,
20602, 16489, 14248, 9928, 8490, 4801, 3599, 2048, 941, 326,
80, 17, 5)
pclm(x = Age, y = Value2, nlast = 1)
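And a sketch of the restore half of that before-after treatment, with hypothetical stand-ins for the fitted output (in the real case the ages would come from the pclm result's bin definitions):

```r
# Hypothetical single-age output after fitting with Value2 (placeholder 5
# in the open group); zero out the ages covered by the originally-empty
# open age group.
fitted_ages <- 96:100
fitted_vals <- c(9, 5, 3, 2, 4)
open_age    <- 100                       # open group that really held a 0
fitted_vals[fitted_ages >= open_age] <- 0
```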