ocbe-uio / truncexpfam

R package to generate data related to the Truncated Exponential Family

Home Page: https://ocbe-uio.github.io/TruncExpFam/

License: GNU General Public License v3.0

R 99.97% Makefile 0.03%

r-package truncated-distribution

truncexpfam's Introduction

What is this?

This is an R package to handle truncated members from the exponential family.

Installation

Stable version

TruncExpFam is available on CRAN and can be installed by running the following in an interactive R session:

install.packages("TruncExpFam")

Development version

The development version of the package contains features and bug fixes that are yet to be published. It is, however, much less stable than the CRAN version. You can install the development version of TruncExpFam by running the following command in R (requires the remotes package to be installed beforehand):

remotes::install_github("ocbe-uio/TruncExpFam")

If you want to browse the vignette, add build_vignettes = TRUE to your install_github() command.

Further details on installing TruncExpFam can be found on the Wiki.

Usage

Once installed, TruncExpFam can be loaded with library(TruncExpFam). A list of available functions can be printed with ls("package:TruncExpFam").

For more information about the package (e.g. supported distributions), run ?TruncExpFam after loading the package in your R session.

Are you familiar with the stats package and its r* and d* functions such as rnorm() and dpois()? If so, you will feel right at home with TruncExpFam, which uses the rtrunc() function to generate random numbers and the dtrunc() function to generate probability densities.

For a more detailed explanation on how to use this package’s features, check out its vignette:

browseVignettes("TruncExpFam")

Contributing

TruncExpFam is open-source software licensed under the GPL. All contributions are welcome! Please use the issues page to report any bugs you find or to see what other issues have been submitted.

To contribute with code, we recommend reading this Wiki page on the subject.

Citing

If you present work that uses this package, please remember to cite it. To cite TruncExpFam in publications, use the output of citation("TruncExpFam") in your R session.

truncexpfam's Issues

Eliminate ml.estimation.trunc.dist dependency on the "family" argument

Motivation

The ml.estimation.trunc.dist() function requires a family argument to manually trigger the appropriate estimation method. As of commit 8e17b33, this is not required anymore, because the y argument already contains this information in its class.

Steps

Remove family argument
If using S3: change name of ml.estimation.trunc.dist function to something without . (name can be kept for S4)
Split generic and methods for each signature
Adjust tests and examples
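
The steps above could be sketched with an S3 generic that dispatches on the class of y, so that no family argument is needed. This is only an illustration: the name mlEstimationTruncDist and the placeholder method bodies are assumptions, not the package's code, and each method merely reports which family was selected.

```r
# Hypothetical S3 generic: the family is read from class(y), not from an argument
mlEstimationTruncDist <- function(y, ...) {
  UseMethod("mlEstimationTruncDist")
}

# Placeholder methods; the real estimation routine would go in each body
mlEstimationTruncDist.trunc_normal <- function(y, ...) {
  "normal"
}

mlEstimationTruncDist.trunc_chisq <- function(y, ...) {
  "chisq"
}

# Fallback for objects that do not carry a truncated-distribution class
mlEstimationTruncDist.default <- function(y, ...) {
  stop("y has no recognised truncated-distribution class")
}

y <- structure(c(2, 3, 5), class = "trunc_chisq")
selected <- mlEstimationTruncDist(y)
```

The generic's name contains no dots, which keeps S3 dispatch unambiguous, in line with the renaming step listed above.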

Using rtrunc aliases skips domain validation

Using a wrapper skips rtrunc() and, therefore, domain validation. Possible solution: adding validateDomain methods to each rtrunc method.

Classes.R file

Go through the file:

Update parameter lists
Add new classes for the new distributions

Arguments for rtrunc.norm don't match those on stats::rnorm

The stats::rnorm function uses mean and sd, whereas rtrunc.norm uses mu and sigma. @rho62, should we change the latter to match the former?

Export natural2parameters and parameters2natural

Each distribution family contains its own set of these two functions, so maybe this is another case for creating generic natural2parameters() and parameters2natural() functions and only document these. Namely:

Develop rtrunc.binomial

According to the following lines, rtrunc.binomial is not yet working:

TruncExpFam/tests/testthat/test-examples.R

Lines 50 to 54 in ffcb4ca

# set.seed(117)
# # NOT WORKING YET
# sample.binom <- rtrunc.binomial(1000, 0.6, 4, , 10)
# hist(sample.binom)
# ml.estimation.trunc.dist(sample.binom, y.min = 4, max.it = 500, delta = 0.33, family = "Binomial", nsize = 10)

From René's e-mail:

For the binomial case, there is a small caveat: it has an extra parameter n (the number of trials). This parameter «n» is not to be estimated, but is a fixed value that has to be passed to the various functions. As it does not conform with the syntax for the other distributions, I have tried to use the dot-dot-dot facility.

Create package logo

Try to use hexSticker::sticker().

Make other functions called by ml.estimation.trunc.dist

If we convert the following functions into generics, we can probably close #20:

Create accessor function for exported class slots

If we're using S4, set functions should be written to handle assignment and retrieval of slot values.

Here's a good example on how to implement this:
https://adv-r.hadley.nz/s4.html#accessors
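
A minimal sketch of such accessors, assuming a stand-in class named TruncSketch with the same slots as the Trunc class quoted in these issues. The generic names lowerLimit and lowerLimit<- are hypothetical and were chosen only for illustration.

```r
library(methods)

# Stand-in class with the same slots as the Trunc class discussed in these issues
setClass(
  "TruncSketch",
  slots = list(n = "integer", a = "numeric", b = "numeric", sample = "numeric")
)

# Getter generic and method for the lower truncation limit
setGeneric("lowerLimit", function(x) standardGeneric("lowerLimit"))
setMethod("lowerLimit", "TruncSketch", function(x) x@a)

# Setter generic and method; validObject() re-checks the object after assignment
setGeneric("lowerLimit<-", function(x, value) standardGeneric("lowerLimit<-"))
setMethod("lowerLimit<-", "TruncSketch", function(x, value) {
  x@a <- value
  validObject(x)
  x
})

obj <- new("TruncSketch", n = 3L, a = 0, b = Inf, sample = c(0.5, 1.2, 2.0))
lowerLimit(obj) <- 0.25
current_limit <- lowerLimit(obj)
```

With accessors like these, user code never touches @a directly, so the slot layout can change later without breaking callers.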

Join truncation limits into one slot

Truncation limits are currently defined as separate slots in the Trunc class:

TruncExpFam/R/classes.R

Lines 11 to 19 in 32e40d3

setClass(
  Class = "Trunc",
  slots = list(
    n = "integer",
    a = "numeric",
    b = "numeric", # TODO: join with a as a length 1+ vector (trunc. points)
    sample = "numeric"
  )
)

They should be joined as a 1+ vector into one slot, as it may have length 1, 2, 3.
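
One way this could look is sketched below, assuming a hypothetical class TruncJoined with a single limits slot. The validity rule (at least one limit, in increasing order) is an assumption made for the example, not something the issue specifies.

```r
library(methods)

# Hypothetical class: truncation points live in one numeric slot of length 1+
setClass(
  "TruncJoined",
  slots = list(n = "integer", limits = "numeric", sample = "numeric"),
  validity = function(object) {
    lim <- object@limits
    if (length(lim) < 1L) {
      return("limits must contain at least one truncation point")
    }
    if (is.unsorted(lim)) {
      return("limits must be in increasing order")
    }
    TRUE
  }
)

two_sided <- new("TruncJoined", n = 2L, limits = c(0, 10), sample = c(1.5, 7))
one_sided <- new("TruncJoined", n = 2L, limits = 0, sample = c(1.5, 7))
```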

Switch file structure back to grouping by distribution

It's currently grouped by generic, which is easier for development but non-intuitive.

Improve readability of rtrunc signatures

The current help file of rtrunc reads:

Usage:

 rtrunc(n, size, prob, alpha, beta, mulog, sigmalog, mu, sigma, lambda, a, b)
 
 ## S4 method for signature 'numeric,missing,missing,missing,missing'
 rtrunc(n, size, prob, a, b)
 
 ## S4 method for signature 'missing,numeric,missing,missing,missing'
 rtrunc(n, alpha, beta, a = 0, b = Inf)
 
 ## S4 method for signature 'missing,missing,numeric,missing,missing'
 rtrunc(n, mulog, sigmalog, a, b)
 
 ## S4 method for signature 'missing,missing,missing,numeric,missing'
 rtrunc(n, mu, sigma, a, b)
 
 ## S4 method for signature 'missing,missing,missing,missing,numeric'
 rtrunc(n, lambda, a, b)

This is a bit confusing, as the names of the signatures are not understandable unless one looks at the source code. They should be replaced with something like the name of the distribution.

Moreover, the top usage (with all the parameters) makes no sense, so it should probably be removed so as not to confuse the user.

Replace aliases with @export <alias> in original function?

Perhaps alias functions (e.g. the ones in R/rtrunc.R) can be replaced by using @export <alias_name> in the original methods. Something to try that will greatly reduce code footprint.

Check consistency between parameters passed to rtrunc and family

Check consistency between parameters passed and family (e.g. rtrunc(n=10, df=4, family=gaussian) should yield an error and explain why)

Print message on rtrunc about which distribution is being used

Makes it clearer to the user which distribution is actually used. The function already outputs the class, but that might go unnoticed (it's unseen if the user assigns rtrunc() to an R object).

Standardize output of ml.estimation.trunc.dist for Poisson

The output for other families is a named vector, whereas for Poisson it's a 1x1 unnamed matrix. It would look better if the output was "lambda".

To observe the behavior, run example(ml.estimation.trunc.dist).
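
A small sketch of the conversion, using a made-up 1x1 matrix as a stand-in for the current Poisson output. The value 2.97 is purely illustrative and is not a result from the package.

```r
# Stand-in for the 1x1 unnamed matrix currently returned for Poisson
raw_estimate <- matrix(2.97, nrow = 1, ncol = 1)

# drop() removes the matrix dimensions; setNames() attaches the parameter name
named_estimate <- setNames(drop(raw_estimate), "lambda")
```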

Sampling n units

Because of truncation, the sampling functions rXXX usually return fewer than the requested n values: draws that fall outside the truncation limits are discarded and not replaced.
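
One possible remedy is to keep drawing until n values have been accepted. The sketch below does this for a truncated normal in base R; rtrunc_norm_sketch is a hypothetical name and the approach is an assumption, not the package's implementation.

```r
# Hypothetical sampler: repeat draw-and-filter until n values are inside [a, b]
rtrunc_norm_sketch <- function(n, mean = 0, sd = 1, a = -Inf, b = Inf) {
  stopifnot(n >= 0, a < b)
  kept <- numeric(0)
  while (length(kept) < n) {
    draw <- rnorm(n, mean = mean, sd = sd)
    kept <- c(kept, draw[draw >= a & draw <= b])
  }
  # The last batch may overshoot, so return exactly n values
  kept[seq_len(n)]
}

set.seed(10)
y <- rtrunc_norm_sketch(n = 10, mean = 0, sd = 4, a = 0)
```

The loop can take many iterations when the interval [a, b] carries very little probability mass, so a production version would likely need a different strategy for extreme truncation.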

inverse gaussian depends on the statmod library

Needs to be accommodated (the 'install dependent libraries' thing)
Similarly, invgamma is needed for the inverse gamma

Describe delta parameter of ml.estimation.trunc.dist

The delta argument is missing a proper description. This can be seen on the function documentation (?ml.estimation.trunc.dist). This can be fixed by editing the R/ml.estimation.r file, particularly the line below:

TruncExpFam/R/ml.estimation.r

Line 8 in ffcb4ca

#' @param delta #TODO: describe

Log-Normal not working in this edition

Title copied from René's original script.

Find suitable limits for non-symmetrical distributions

Find suitable limits for non-symmetrical distributions (eg. Gamma, Inv Gaussian, LogNormal etc)

Scale parameter in gamma and invgamma

Introduce the scale parameter (scale=1/rate)
Can mimic the implementation from base R
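
A sketch of how this might look, loosely following the argument convention of stats::rgamma(), where scale defaults to 1/rate. The function name rgamma_scale_sketch is hypothetical, and rejecting calls that supply both arguments is a simplification chosen for this example.

```r
# Hypothetical wrapper: accept either rate or scale, with scale = 1 / rate by default
rgamma_scale_sketch <- function(n, shape, rate = 1, scale = 1 / rate) {
  if (!missing(rate) && !missing(scale)) {
    stop("specify 'rate' or 'scale' but not both")
  }
  rgamma(n, shape = shape, scale = scale)
}

set.seed(1)
by_rate <- rgamma_scale_sketch(5, shape = 2, rate = 4)
set.seed(1)
by_scale <- rgamma_scale_sketch(5, shape = 2, scale = 0.25)
```

Both calls describe the same distribution, so with the same seed they produce the same draws.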

Export density* functions as S3 methods of dtrunc

This would require changing the class of the inputs of those files (y and eta) so they can be properly matched by the generic density() function. Otherwise, the functions won't work as methods for density().

Create generic rtrunc function

It will call rtrunc.norm(), rtrunc.gamma(), etc. as methods. Maybe the generic and its methods should be grouped into one function instead of spread across several files.

rcontbernoulli

The sampling functions implemented in 'rtrunc.R' rely on random sampling from a distribution followed by truncation. For the continuous Bernoulli, no sampling function was available in base R, so one has been implemented at the top of the rtrunc.R file. It needs proper adaptation.

Reorganize family name validation in rtrunc

Problem

Using the rtrunc() aliases skips rtrunc() and, therefore, domain validation.

Possible solutions

add validateDomain methods to each rtrunc method
transform family name validation into validateFamily() (like is done for domain)
Replace the valid_distros vector with the valid_fam_parm list on validateFamilyParms()

Dependencies

This is dependent on the closure of the following issues/PRs:

Function variables missing

Some functions are calling variables that don't exist in their scope. Namely (as for commit 6531470):

dtrunc.trunc_chisq: no visible binding for global variable ‘parm’

TruncExpFam/R/chisq.R

Lines 24 to 38 in 83c7189

dtrunc.trunc_chisq <- function(y, eta, a = 0, b) {
  df <- natural2parameters.trunc_chisq(eta)
  dens <- ifelse((y <= a) | (y > b), 0, dchisq(y, df=df))
  if (!missing(a)) {
    F.a <- pchisq(a, parm) # FIXME: parm is not defined
  } else {
    F.a <- 0
  }
  if (!missing(b)) {
    F.b <- pchisq(b, parm)
  } else {
    F.b <- 1
  }
  return(dens / (F.b - F.a))
}

dtrunc.trunc_exp: no visible binding for global variable ‘parm’

TruncExpFam/R/exponential.R

Lines 7 to 22 in 83c7189

dtrunc.trunc_exp <- function(y, eta, a = 0, b) {
  # TODO: develop rtrunc.exp?
  rate <- natural2parameters.trunc_exp(eta)
  dens <- ifelse((y <= a) | (y > b), 0, dexp(y, rate=rate))
  if (!missing(a)) {
    F.a <- pexp(a, rate)
  } else {
    F.a <- 0
  }
  if (!missing(b)) {
    F.b <- pexp(b, parm) # FIXME: parm is not defined
  } else {
    F.b <- 1
  }
  return(dens / (F.b - F.a))
}

dtrunc.trunc_nbinom: no visible global function definition for ‘my.pbinom’

TruncExpFam/R/negative-binomial.R

Lines 7 to 29 in a29d752

dtrunc.trunc_nbinom <- function(y, eta, a = 0, b, ...) {
  # TODO: develop rtrunc.nbinom
  my.dnbinom <- function(nsize) {
    dnbinom(y, size = nsize, prob = proba)
  }
  my.pnbinom <- function(z, nsize) {
    pnbinom(z, size = nsize, prob = proba)
  }
  proba <- exp(eta)
  dens <- ifelse((y < a) | (y > b), 0, my.dnbinom(...))

  if (!missing(a)) {
    F.a <- my.pbinom(a - 1, ...)
  } else {
    F.a <- 0
  }
  if (!missing(b)) {
    F.b <- my.pbinom(b, ...)
  } else {
    F.b <- 1
  }
  return(dens / (F.b - F.a))
}

get.grad.E.T.inv.trunc_nbinom: no visible binding for global variable ‘r’

TruncExpFam/R/negative-binomial.R

Lines 59 to 64 in a29d752

get.grad.E.T.inv.trunc_nbinom <- function(eta) {
  # eta: Natural parameter
  # return the inverse of E.T differentiated with respect to eta
  p=exp(eta)
  return(A = (1-p)^2/(r*p)) # FIXME: r not defined
}

get.y.seq.trunc_invgauss: no visible binding for global variable ‘sd’

TruncExpFam/R/inverse-gaussian.R

Lines 57 to 63 in 83c7189

get.y.seq.trunc_invgauss <- function(y, y.min, y.max, n = 100) {
  mean <- mean(y, na.rm = T)
  shape <- var(y, na.rm = T)^0.5
  lo <- max(max(0,y.min), mean - 3.5 * sd) # FIXME: sd not defined
  hi <- min(y.max, mean + 3.5 * sd)
  return(seq(lo, hi, length = n))
}

These variables must be either internally calculated or passed as arguments.
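
Both remedies can be sketched in base R. The function names below are hypothetical, and taking df directly (rather than deriving it from eta) is an assumption made to keep the example self-contained.

```r
# Option 1: pass the parameter explicitly (here df) instead of the undefined 'parm'
dtrunc_chisq_sketch <- function(y, df, a = 0, b = Inf) {
  dens <- ifelse((y <= a) | (y > b), 0, dchisq(y, df = df))
  F.a <- pchisq(a, df = df)
  F.b <- pchisq(b, df = df)
  dens / (F.b - F.a)
}

# Option 2: compute the quantity inside the function (here the standard deviation)
get_y_seq_sketch <- function(y, y.min, y.max, n = 100) {
  mean_y <- mean(y, na.rm = TRUE)
  sd_y <- sd(y, na.rm = TRUE)
  lo <- max(0, y.min, mean_y - 3.5 * sd_y)
  hi <- min(y.max, mean_y + 3.5 * sd_y)
  seq(lo, hi, length.out = n)
}

# A properly normalised truncated density integrates to about 1 over [a, b]
area <- integrate(dtrunc_chisq_sketch, lower = 1, upper = 6, df = 3, a = 1, b = 6)$value
grid <- get_y_seq_sketch(c(1.2, 2.5, 3.1, 4.8), y.min = 1, y.max = 6, n = 5)
```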

Add family=gaussian as an argument to rtrunc

Part of the user experience should be the explicit specification of a family when calling rtrunc(). For compatibility with glm(), the argument should default to gaussian, though.

Extra, related task:

Check consistency between parameters passed and family (e.g. rtrunc(n=10, df=4, family=gaussian) should yield an error and explain why)
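
Such a check might be sketched as a lookup of accepted parameter names per family. The contents of valid_fam_parm and the function validate_family_parms are assumptions for illustration; only the idea of a family-to-parameters list comes from the issues on this page.

```r
# Hypothetical lookup table: parameter names accepted by each family
valid_fam_parm <- list(
  gaussian = c("mean", "sd"),
  poisson  = "lambda",
  chisq    = "df"
)

validate_family_parms <- function(family, parms) {
  if (!family %in% names(valid_fam_parm)) {
    stop("Unknown family: ", family)
  }
  unexpected <- setdiff(parms, valid_fam_parm[[family]])
  if (length(unexpected) > 0) {
    # The message names the offending parameters and what was expected
    stop(
      "Parameter(s) ", paste(unexpected, collapse = ", "),
      " not valid for family '", family, "'. Expected: ",
      paste(valid_fam_parm[[family]], collapse = ", ")
    )
  }
  invisible(TRUE)
}

ok <- validate_family_parms("gaussian", c("mean", "sd"))
bad <- tryCatch(
  validate_family_parms("gaussian", "df"),
  error = function(e) conditionMessage(e)
)
```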

Add probability and quantile functions

After we've implemented the r* and d* functions, work on p* and q*. Both would probably be numerically determined (though analytical solutions are of course preferred).

Maybe one function can do the job for all distributions, since the procedure is similar (see sketch below)?

prtrunclnorm <- function(q, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE) {
  y <- rtrunc.lognormal(n = 1e3, ...)
  # bootstrap y
  # get value of y_0 of y corresponding to q
  # return probability from -Inf to y_0
}

and q* is basically the other way around.

Implement ptrunc*
Implement qtrunc*
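
When the untruncated CDF and quantile function are available, as they are for the log-normal in base R, a closed-form route is possible: rescale the CDF by the probability mass between the truncation limits. The sketch below assumes hypothetical function names and is not the package's implementation.

```r
# P(Y <= q) for a log-normal truncated to [a, b]
ptrunclnorm_sketch <- function(q, meanlog = 0, sdlog = 1, a = 0, b = Inf) {
  F.a <- plnorm(a, meanlog, sdlog)
  F.b <- plnorm(b, meanlog, sdlog)
  q <- pmin(pmax(q, a), b) # values outside [a, b] map to probability 0 or 1
  (plnorm(q, meanlog, sdlog) - F.a) / (F.b - F.a)
}

# Inverse of the function above: map p back through the untruncated quantile function
qtrunclnorm_sketch <- function(p, meanlog = 0, sdlog = 1, a = 0, b = Inf) {
  F.a <- plnorm(a, meanlog, sdlog)
  F.b <- plnorm(b, meanlog, sdlog)
  qlnorm(F.a + p * (F.b - F.a), meanlog, sdlog)
}

p <- ptrunclnorm_sketch(2, a = 0.5, b = 5)
q <- qtrunclnorm_sketch(p, a = 0.5, b = 5)
```

The same pattern would apply to any family whose p* and q* functions exist in stats, which is one way a single helper could serve several distributions.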

Add output suppression to ml.estimation.trunc.dist

This would be quite useful for people running the function inside scripts, where output suppression is often desirable to avoid visual pollution.

Documentation for rtrunc should group arguments by distro

The contents of ?rtrunc (source code here) lists all possible arguments together. This makes it hard for a user to know how to use the function, since the proper "pair" of arguments must be used (e.g. mu with sigma and not with lambda).

Ideally, the arguments should be grouped inside the documentation, but I don't know if Roxygen allows for this. In any case, there should be details about this in the @details section.

Check that all parameters from stats::r* appear in the respective rtrunc methods

~~mu -> mean~~ (tracked by issue #22)
~~mulog -> meanlog~~ (tracked by issue #22)
~~sigma -> sd~~ (tracked by issue #22)
~~sigmalog -> sdlog~~ (tracked by issue #22)

Also check that all parameters are included in all functions in rtrunc.R

rtrunc signatures are too long

The rtrunc function uses signatures with so many arguments that the generated documentation files have names that are too long for an R package. This is the error from devtools::check():

E checking for portable file names
Found the following non-portable file paths:
TruncExpFam/man/rtrunc-numeric-numeric-ANY-missing-missing-missing-missing-missing-missing-missing-missing-numeric-method.Rd
TruncExpFam/man/rtrunc-numeric-numeric-ANY-missing-missing-missing-missing-missing-missing-numeric-numeric-missing-method.Rd
TruncExpFam/man/rtrunc-numeric-numeric-ANY-missing-missing-missing-missing-numeric-numeric-missing-missing-missing-method.Rd
TruncExpFam/man/rtrunc-numeric-numeric-ANY-missing-missing-numeric-numeric-missing-missing-missing-missing-missing-method.Rd
TruncExpFam/man/rtrunc-numeric-numeric-numeric-numeric-numeric-missing-missing-missing-missing-missing-missing-missing-method.Rd

Tarballs are only required to store paths of up to 100 bytes and cannot
store those of more than 256 bytes, with restrictions including to 100
bytes for the final component.
See section ‘Package structure’ in the ‘Writing R Extensions’ manual.
OK

The methods of rtrunc must be adapted to recognize simpler signatures, not only so that the package cleanly checks, but also to facilitate debugging (it's not easy to understand what distribution "rtrunc-numeric-numeric-numeric-numeric-numeric-missing-missing-missing-missing-missing-missing-missing" refers to).

Naming of functions

Use "." rather than "_" in function names: xxx.trunc_normal -> xxx.trunc.normal
Also avoid using "trunc" twice, as in "dtrunc.trunc_normal"; it should be "dtrunc.normal".
The same applies to all distributions.
I am reluctant to do this, as I'm uncertain about the repercussions for other parts of the code.

Create validation function for distribution parameter domains

From René's e-mail:

One thing missing is a function that checks if data are inside the domain of the given distribution (e.g. Normal/Gaussian is real numbers, Gamma positive real numbers, Poisson is non-negative integers etc)
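
A minimal sketch of such a check, covering the three families named in the e-mail. The list domain_checks and the function validate_domain_sketch are hypothetical names, and the predicates are one reasonable reading of each domain.

```r
# Hypothetical domain checks, one predicate per family
domain_checks <- list(
  gaussian = function(y) all(is.finite(y)),
  gamma    = function(y) all(is.finite(y) & y > 0),
  poisson  = function(y) all(is.finite(y) & y >= 0 & y == round(y))
)

validate_domain_sketch <- function(y, family) {
  check <- domain_checks[[family]]
  if (is.null(check)) {
    stop("No domain check defined for family: ", family)
  }
  if (!check(y)) {
    stop("Data fall outside the domain of the ", family, " family")
  }
  invisible(TRUE)
}

counts_ok <- validate_domain_sketch(c(0, 1, 4), family = "poisson")
```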

Include more distributions

Add methods for these distributions:

This depends on several generic functions being in place. In other words, the following issues must be closed first to avoid rework:

Make aliases for rtrunc methods

The idea is for a user to be able to call either one of those:

rtrunc(100, family="poisson", lambda=3)
rtruncpois(100, lambda=3)

So all rtrunc methods would have the same name as their untruncated counterparts in stats, but with the word "trunc" between "r" and the distribution name.

This can either be achieved by creating wrapper functions or aliases (preferred).
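
The wrapper variant could be sketched as follows in base R. All function names carry a _sketch suffix to mark them as hypothetical, and only the Poisson family is wired in. Note that this sampler simply filters draws, so it may return fewer than n values when limits are set.

```r
# Hypothetical family-specific sampler: draw, then keep values inside [a, b]
rtrunc_poisson_sketch <- function(n, lambda, a = 0, b = Inf) {
  y <- rpois(n, lambda)
  y[y >= a & y <= b]
}

# Hypothetical front end that selects the sampler from the family name
rtrunc_sketch <- function(n, family = "gaussian", ...) {
  sampler <- switch(
    family,
    poisson = rtrunc_poisson_sketch,
    stop("Family not implemented in this sketch: ", family)
  )
  sampler(n, ...)
}

# Alias written as a thin wrapper that fixes the family
rtruncpois_sketch <- function(n, lambda, ...) {
  rtrunc_sketch(n, family = "poisson", lambda = lambda, ...)
}

set.seed(1)
via_generic <- rtrunc_sketch(100, family = "poisson", lambda = 3)
set.seed(1)
via_alias <- rtruncpois_sketch(100, lambda = 3)
```

Because the wrapper calls the front end rather than the family-specific sampler, any validation placed in the front end would also run for the alias.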

Add default values for a and b for all methods

Add limits for a and b
Add unit tests to check if passing custom arguments changes output given the same seed

Make output of rtrunc match that of stats:: counterparts

Add a full_output=FALSE argument to rtrunc
Default behavior: only output the sample results
Alternative behavior, for full_output=TRUE: print all slots

Truncation limits are not working on rtrunc

Summary

I guess I messed something up when writing the rtrunc() generic and now it's ignoring truncation limits. Should be easy to fix, though.

MRE

set.seed(10); rtrunc(n=10, mean=0, sd=4, a=0)

Observed result

[1]  0.07498468 -0.73701017 -5.48532220 -2.39667086  1.17818051  1.55917720 -4.83230470 -1.45470407 -6.50669073 -1.02591358
attr(,"class")
[1] "trunc_normal"

Expected result

No negative values.

Regroup unit tests

Unit tests are grouped by distribution, and it's a bit messy and hard to navigate (and it's not even 100 lines long!). I bet it would look much better with the following structure:

context("Sampling with rtrunc")
context("Matching output of stats::r*")
context("ML estimation")

Missing rinvnbinom function

Hi @rho62,

Line 7 below references an rinvnbinom() function which is not defined in the package, and I couldn't find it elsewhere in the R-verse.

TruncExpFam/R/negative-binomial.R

Lines 6 to 16 in ddc3961

rtrunc.nbinom <- function(n, size, prob, mu, a,b=Inf) {
  y <- rinvnbinom(n, size, prob, mu)
  if (!missing(a)) {
    y <- y[y >= a]
  }
  if (!missing(b)) {
    y <- y[y <= b]
  }
  class(y) <- "trunc_nbinom"
  return(y)
}

Should we code this function or import it from another package?

Implement "invisible" distributions in rtrunc

Add code for the t-distribution

Add short vignette

Add vignette showing package usage. Here's a quick draft:

x <- rtrunc(...)
dtrunc(x)
ml.estimation

Use an ml.estimation... example as a base
Add some text

Inconsistent class names

On some parts of the code, the rtrunc classes begin with trunc_ whereas in others they don't. This should be standardized ASAP to avoid confusion and code fragility.

Consistent naming of extensions .r or .R

As I understand it, some files will be merged at the end, so this is perhaps immaterial for these. But we could just stick to one or the other. Which do you prefer?

ML estimation not working for Binomial

MRE

sample.binom <- rtrunc(n=1000, prob=0.6, size=20, a=4, b=10, family="binomial") 
ml_binom <- ml.estimation.trunc.dist(
	sample.binom, y.min = 4, max.it = 500, delta = 0.33, family = "Binomial", nsize = 10
)

Observed output

Error in y/... : invalid unary operator

Expected output

An estimation of p (plus, intermediate output for each iteration)

Remove redundant code

In rtrunc.normal:
.....
if (!missing(b)) {
  y <- y[y <= b]
} else {
  b <- Inf
}
class(y) <- "trunc_normal"
return(y)

I guess the "else { b <- Inf }" branch can be deleted?

The same issue applies to other distributions (gamma and others).

Add unit tests comparing rtrunc with their stats:: counterparts

The following behavior should remain applicable as we further develop the package:

r$> set.seed(2); rbinom(n=100, size=10, prob=.4)                                             
  [1] 3 5 4 3 6 6 2 5 4 4 4 3 5 3 4 6 7 3 4 2 5 4 6 2 3 4 2 3 7 2 1 2 5 6 4 4 6 3 5 2 7 3 2 2
 [45] 6 5 7 3 4 5 1 1 5 6 3 5 5 8 4 5 5 6 4 3 6 4 4 4 3 2 3 3 1 3 3 5 3 6 4 4 3 5 1 4 3 6 7 3
 [89] 5 3 7 4 3 4 4 3 4 2 2 4

r$> set.seed(2); rtrunc(n=100, size=10, prob=.4, a=0, b=Inf)@sample                          
  [1] 3 5 4 3 6 6 2 5 4 4 4 3 5 3 4 6 7 3 4 2 5 4 6 2 3 4 2 3 7 2 1 2 5 6 4 4 6 3 5 2 7 3 2 2
 [45] 6 5 7 3 4 5 1 1 5 6 3 5 5 8 4 5 5 6 4 3 6 4 4 4 3 2 3 3 1 3 3 5 3 6 4 4 3 5 1 4 3 6 7 3
 [89] 5 3 7 4 3 4 4 3 4 2 2 4

r$> identical({set.seed(2); rbinom(n=100, size=10, prob=.4)}, {set.seed(2); rtrunc(n=100, size=10, prob=.4, a=0, b=Inf)@sample})
[1] TRUE

For the generated values, at least. The vector generated by rtrunc will probably not literally be the same as it will have a different class.
