Git Product home page Git Product logo

Comments (36)

larmarange avatar larmarange commented on July 29, 2024 3

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024 2

Thanks Joseph for the very useful example.

This probably gets us quite far. There are summary_factorlist() features not yet included, most notably, labels. But that would be an easy addition.

Would be tidied up and split into component functions prior to being deployed.

Interested in thoughts.

library(questionr)
data(hdv2003)

library(dplyr)
hdv2003$etud <- recode_factor(
  hdv2003$nivetud,
  "N'a jamais fait d'etudes" = "Primary",
  "A arrete ses etudes, avant la derniere annee d'etudes primaires" = "Primary",
  "Derniere annee d'etudes primaires" = "Primary",
  "1er cycle" = "Secondary",
  "2eme cycle" = "Secondary",
  "Enseignement technique ou professionnel court" = "Professionnal",
  "Enseignement technique ou professionnel long" = "Professionnal",
  "Enseignement superieur y compris technique superieur" = "High"
)

library(survey)
dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003)

summary_survey <- function(.data, dependent, explanatory, cont = "mean", cont_range = FALSE, 
                           p = FALSE, column = FALSE, digits = c(1, 1, 3, 1), fit_id = FALSE){
  # Define variable type
  explanatory_type = .data$variables %>% 
    select(explanatory) %>% 
    purrr::map(is.numeric)
  
  # Hypothesis test
  if(p){
    p_tests = explanatory %>% 
      purrr::map2(explanatory_type,
                  ~if(!.y){
                    survey::svychisq(as.formula(paste0("~", ., "+", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  } else if (.y & cont == "mean"){
                    survey::svyttest(as.formula(paste0(.x, "~", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  } else if (.y & cont == "median"){
                    survey::svyranktest(as.formula(paste0(.x, "~", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  }
      )
  }
  
  # Output table
  explanatory %>% 
    purrr::map2(explanatory_type,
                ~ if(!.y){
                  survey::svytable(as.formula(paste0("~", .x, "+", dependent)), .data) %>% 
                    as.data.frame(stringsAsFactors = FALSE) %>% 
                    { if(column) {
                      dplyr::group_by(., !! sym(dependent)) %>% 
                        dplyr::mutate(
                          total = sum(Freq),
                          prop = 100 * Freq / total
                        )
                    } else { 
                      dplyr::group_by_at(., 1) %>% 
                        dplyr::mutate(
                          total = sum(Freq),
                          prop = 100 * Freq / total
                        )
                    }
                    } %>% 
                    dplyr::mutate(
                      value = paste0(Freq %>% round_tidy(digits[4]), " (", 
                                     prop %>% round_tidy(digits[4]), ")")
                    ) %>%
                    dplyr::select(-total, -prop, -Freq) %>% 
                    tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                    dplyr::mutate(
                      label = names(.)[1]
                    ) %>% 
                    dplyr::rename(levels = 1)
                } else {
                  { if(cont == "mean") {
                    survey::svyby(as.formula(paste0("~", .x)), as.formula(paste0("~", dependent)), .data, svymean, na.rm = TRUE) %>%
                      dplyr::mutate(
                        value = paste0(!! sym(.x) %>% round_tidy(digits[1]), " (", 
                                       se %>%  round_tidy(digits[2]), ")") 
                      ) %>% 
                      dplyr::select(-c(.x, se)) %>% 
                      tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                      dplyr::mutate(
                        label = .x,
                        levels = "Mean (SE)"
                      )
                  } else if(cont == "median"){
                    survey::svyby(as.formula(paste0("~", .x)), as.formula(paste0("~", dependent)), .data, 
                                  svyquantile, quantiles = 1:3/4, ci = TRUE, na.rm = TRUE) %>% 
                      dplyr::rename(Q1 = 2,
                                    Q2 = 3,
                                    Q3 = 4) %>% 
                      dplyr::mutate(
                        IQR = Q3 - Q1) %>% 
                      { if(cont_range){
                        dplyr::mutate(., 
                                      value = paste0(Q2 %>% round_tidy(digits[1]), " (", 
                                                     Q1 %>%  round_tidy(digits[2]), " to ",
                                                     Q3 %>% round_tidy(digits[2]), ")")
                        )                        
                      } else {
                        dplyr::mutate(.,
                                      value = paste0(Q2 %>% round_tidy(digits[1]), " (", 
                                                     IQR %>%  round_tidy(digits[2]), ")")
                        )
                      }} %>% 
                      dplyr::select(-c(2:8)) %>%                      
                      tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                      dplyr::mutate(
                        label = .x,
                        levels = "Median (IQR)"
                      )
                  }
                  }
                }) %>% 
    
    # Add hypothesis test
    { if(p){
      purrr::map2_df(., p_tests,
                     ~ mutate(.x,
                              p = .y)
      )} else {
        dplyr::bind_rows(.)
      }} %>%
    dplyr::select(label, levels, dplyr::everything()) %>% 
    as.data.frame() %>%
    { if(fit_id){
      levels_id = .$levels
      drop = levels_id %in% c("Mean (SE)", "Median (IQR)")
      levels_id[drop] = ""
      mutate(., 
             fit_id = paste0(label, levels_id),
             index = 1:dim(.)[1])
    } else {
      .
    }} %>% 
    rm_duplicate_labels()
}


library(finalfit)
dependent = "sport"

# Choose different options to test
explanatory = c("sexe")
explanatory = c("heures.tv")
explanatory = c("sexe", "heures.tv", "etud")

dw %>% 
  summary_survey(dependent, explanatory, cont = "mean", cont_range = FALSE, p = FALSE, column = FALSE, 
                 digits = c(1, 1, 3, 0))


ff_merge(
  summary_survey(dw, dependent, explanatory, fit_id = TRUE),
  svyglmmulti(dw, dependent, explanatory, family = "quasibinomial") %>%
    fit2df(exp = TRUE, estimate_name = "OR (multivariable)"),
  last_merge = TRUE
)

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

There are lots of different glm() options that could be added. Poisson/weights/propensity score as you rightly say. Rather than make things more complicated, ff_merge() allows any table to be created. Here are some examples of going from a simple logistic regression model, to a weighted model.

Standard approach:

explanatory = c("age.factor", "age", "sex.factor", "nodes", "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
colon_s %>%
	finalfit(dependent, explanatory)

Build each column separately approach:

colon_s %>% 
	summary_factorlist(dependent, explanatory, fit_id=TRUE) %>% 
	ff_merge(
		glmuni(colon_s, dependent, explanatory) %>% 
			fit2df(estimate_suffix = " (univariable)")
	) %>% 
	ff_merge(
		glmmulti(colon_s, dependent, explanatory) %>%
			fit2df(estimate_suffix = " (multivariable)")
	) %>% 
	dplyr::select(-fit_id, -index) %>% 
	dependent_label(colon_s, dependent)

Incorporate more complex model:

colon_s %>% 
	summary_factorlist(dependent, explanatory, fit_id=TRUE) %>% 
	ff_merge(
		glmuni(colon_s, dependent, explanatory) %>% 
			fit2df(estimate_suffix = " (univariable)")
	) %>% 
	ff_merge(
		glm(ff_formula(dependent, explanatory), data=colon_s, 
				family = "binomial", weights = NULL) %>% 
			fit2df(estimate_suffix = " (multivariable)")
	) %>% 
	dplyr::select(-fit_id, -index) %>% 
	dependent_label(colon_s, dependent)

Let me know what you think.

from finalfit.

tylcole avatar tylcole commented on July 29, 2024

That's great for multivariate models, thanks!

What do you think about weighted univariate stats? Would that require using different functions within the glmuni() function?

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

lmuni(), lmmulti(), glmuni() and glmmulti() now supported weights.

explanatory = c("age.factor", "age", "sex.factor", "nodes", "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
wgts = runif(dim(colon_s)[1], 0, 1)

library(finalfit)
colon_s %>% 
	summary_factorlist(dependent, explanatory, fit_id=TRUE) %>% 
	ff_merge(
		glmuni(colon_s, dependent, explanatory, weights = wgts) %>% 
			fit2df(estimate_suffix = " (univariable)")
	) %>% 
	ff_merge(
		glm(ff_formula(dependent, explanatory), data=colon_s, 
				family = "binomial", weights = wgts) %>% 
			fit2df(estimate_suffix = " (multivariable)")
	) %>% 
	dplyr::select(-fit_id, -index) %>% 
	dependent_label(colon_s, dependent)

from finalfit.

tylcole avatar tylcole commented on July 29, 2024

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

Thanks.
Could you provide an example of what you suggest.

from finalfit.

tylcole avatar tylcole commented on July 29, 2024

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

Hi,

I don't know too much about this area. I've added these functions and would appreciate if you could let me know if this is what you were thinking of. Also, are the examples correct.

Many thanks for your help with this.

Ewen

library(dplyr)
library(survey)

# Examples from survey::svyglm() help page

data(api)

# Label data frame
apistrat = apistrat %>% 
  mutate(
    api00 = ff_label(api00, "API in 2000 (api00)"),
    ell = ff_label(ell, "English language learners (percent)(ell)"),
    meals = ff_label(meals, "Meals eligible (percent)(meals)"),
    mobility = ff_label(mobility, "First year at the school (percent)(mobility)"),
    sch.wide = ff_label(sch.wide, "School-wide target met (sch.wide)")
    )
		
# Linear example
dependent = "api00"
explanatory = c("ell", "meals", "mobility")

# Stratified design
dstrat = svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

# Univariable fit
fit_uni = dstrat %>%
    svyglmuni(dependent, explanatory) %>%
    fit2df(estimate_suffix = " (univariable)")

# Multivariable fit
fit_multi = dstrat %>%
    svyglmmulti(dependent, explanatory) %>%
    fit2df(estimate_suffix = " (multivariable)")
	
# Pipe together
apistrat %>%
    summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
    ff_merge(fit_uni) %>% 
    ff_merge(fit_multi) %>% 
    select(-fit_id, -index) %>%
    dependent_label(apistrat, dependent)

#               Dependent: API in 2000 (api00)             Mean (sd)    Coefficient (univariable)  Coefficient (multivariable)
#     English language learners (percent)(ell)  [0,84] 652.8 (121.0) -3.73 (-4.35--3.11, p<0.001)  -0.48 (-1.25-0.29, p=0.222)
#              Meals eligible (percent)(meals) [0,100] 652.8 (121.0) -3.38 (-3.71--3.05, p<0.001) -3.14 (-3.70--2.59, p<0.001)
# First year at the school (percent)(mobility)  [1,99] 652.8 (121.0)  -1.43 (-3.30-0.44, p=0.137)   0.23 (-0.54-1.00, p=0.567)

# Binomial example
## Note model family needs specified and exponentiation if desired
dependent = "sch.wide"
explanatory = c("ell", "meals", "mobility")

# Univariable fit
fit_uni = dstrat %>%
    svyglmuni(dependent, explanatory, family = "quasibinomial") %>%
    fit2df(exp = TRUE, estimate_name = "OR", estimate_suffix = " (univariable)")

# Multivariable fit
fit_multi = dstrat %>%
    svyglmmulti(dependent, explanatory, family = "quasibinomial") %>%
    fit2df(exp = TRUE, estimate_name = "OR", estimate_suffix = " (multivariable)")

# Pipe together
apistrat %>%
    summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
    ff_merge(fit_uni) %>% 
    ff_merge(fit_multi) %>% 
    select(-fit_id, -index) %>%
    dependent_label(apistrat, dependent)

# Dependent: School-wide target met (sch.wide)                    No         Yes          OR (univariable)        OR (multivariable)
#     English language learners (percent)(ell) Mean (SD) 22.5 (19.3) 20.5 (20.0) 1.00 (0.98-1.01, p=0.715) 1.00 (0.97-1.02, p=0.851)
#              Meals eligible (percent)(meals) Mean (SD) 46.0 (29.1) 44.7 (29.0) 1.00 (0.99-1.01, p=0.968) 1.00 (0.98-1.01, p=0.732)
# First year at the school (percent)(mobility) Mean (SD)  13.9 (8.6) 17.2 (13.0) 1.06 (1.00-1.12, p=0.049) 1.06 (1.00-1.13, p=0.058)

from finalfit.

tylcole avatar tylcole commented on July 29, 2024

That's fantastic, I should have time in the next week to go through this and confirm. I will also test on my own data. I'll respond as soon as I do. Thank you for this great update!

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

Thanks. One issue is that the summary statistics from summary_factorlist() are not weighted. They cannot be weighted using this function. So to be useful it might be necessary to incorporate svymean() or similar.

What would be useful is to know how you would actually use this. What your final table would actually look like. What variations would be common in practice etc. Best wishes.

from finalfit.

tylcole avatar tylcole commented on July 29, 2024

Ewen I just took a look and that looks correct, though like you mention a key feature that would be really nice is the weighted means and SD in the table using svymean(). The final table would be just like you are demonstrating, except weighted mean/SD.

This is a basic example of what those tables would look like in a paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5624560/pdf/cureus-0009-00000001536.pdf

from finalfit.

larmarange avatar larmarange commented on July 29, 2024

Ideally, summary_factorlist() should test if .data is a svydesign object. In such case, all stats could be computed using survey functions. i.e. svytable, svymean, svychisq, etc.

The desired table is the same as with unweighted data. It's just that the computation of them should used survey functions.

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

Thanks @larmarange
Is it worth a new function? Would people use it?

from finalfit.

larmarange avatar larmarange commented on July 29, 2024

Dear @ewenharrison thank you very much for your feedback. You have done a fantastic job on finalfit.

I just spent 3 days teaching R to master students and teachers in population studies. They were very impressed by summary_fatorlist(), finalfit() and or_plot() as they are very easy to use.

In population studies, most of them are working with national surveys. Such surveys have complex sampling and/or post-calibration. Almost all of them required to use the survey package.

When I presented multivariate analysis using survey, they were very frustrated not to be able to use finalfit.

I would say it's really worth to extend summary_facorlist() to be able to deal with a survey design object instead of a data.frame and to use therefore survey functions (such as svytable(), svymean(), svychisq()...) to generate the table.

Similarly, if summary_factorlist() is adaped to manage survey.design class (it's simple to test if an object inherits that class to load a specific function), then it should be quite easy to adapt finalfit() and or_fit() as well.

By the way, just for you to know, survey also provide a svycoxph() function.

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

Ok, well if you could point me to a suitable dataset and provide 2 or 3 example analyses using survey, together with the table output you would like to to see then I can have a look.
Many of the examples in library(survey) help seem very complex and difficult to generalise.

from finalfit.

larmarange avatar larmarange commented on July 29, 2024

Please find below an exemple using a dataset provided by questionr package.

library(questionr)
data(hdv2003)

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
hdv2003$etud <- recode_factor(
  hdv2003$nivetud,
  "N'a jamais fait d'etudes" = "Primary",
  "A arrete ses etudes, avant la derniere annee d'etudes primaires" = "Primary",
  "Derniere annee d'etudes primaires" = "Primary",
  "1er cycle" = "Secondary",
  "2eme cycle" = "Secondary",
  "Enseignement technique ou professionnel court" = "Professionnal",
  "Enseignement technique ou professionnel long" = "Professionnal",
  "Enseignement superieur y compris technique superieur" = "High"
)

library(survey)
#> Le chargement a nécessité le package : grid
#> Le chargement a nécessité le package : Matrix
#> Le chargement a nécessité le package : survival
#> 
#> Attachement du package : 'survey'
#> The following object is masked from 'package:graphics':
#> 
#>     dotchart
dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003)

# Note: summary.formula not compatible with svydesign objects

# dependent variable
freq(svytable(~ sport, dw))
#>           n    % val%
#> Non 6714760 60.7 60.7
#> Oui 4356466 39.3 39.3

# binary explanatory
tab <- svytable(~ sexe + sport, dw)
tab
#>        sport
#> sexe        Non     Oui
#>   Homme 2850210 2299173
#>   Femme 3864550 2057293
rprop(tab)
#>           sport
#> sexe       Non   Oui   Total
#>   Homme     55.4  44.6 100.0
#>   Femme     65.3  34.7 100.0
#>   Ensemble  60.7  39.3 100.0
svychisq(~ sexe + sport, dw)
#> 
#>  Pearson's X^2: Rao & Scott adjustment
#> 
#> data:  svychisq(~sexe + sport, dw)
#> F = 12.444, ndf = 1, ddf = 1999, p-value = 0.0004289

# explanotory with more modalities
tab <- svytable(~ etud + sport, dw)
tab
#>                sport
#> etud                  Non       Oui
#>   Primary       1978619.6  234477.9
#>   Secondary     1428124.8  671021.2
#>   Professionnal 2050834.3 1315064.1
#>   High          1071009.1 1504802.3
rprop(tab)
#>                sport
#> etud            Non   Oui   Total
#>   Primary        89.4  10.6 100.0
#>   Secondary      68.0  32.0 100.0
#>   Professionnal  60.9  39.1 100.0
#>   High           41.6  58.4 100.0
#>   Ensemble       63.7  36.3 100.0
svychisq(~ etud + sport, dw)
#> 
#>  Pearson's X^2: Rao & Scott adjustment
#> 
#> data:  svychisq(~etud + sport, dw)
#> F = 45.593, ndf = 2.9831, ddf = 5963.2106, p-value < 2.2e-16

# quantitative explanatory
svyby(~ heures.tv, ~ sport, dw, svymean, na.rm = TRUE) # mean
#>     sport heures.tv         se
#> Non   Non  2.417582 0.06689688
#> Oui   Oui  1.785640 0.06573058
svyttest(heures.tv ~ sport, dw)
#> 
#>  Design-based t-test
#> 
#> data:  heures.tv ~ sport
#> t = -6.7382, df = 1993, p-value = 2.092e-11
#> alternative hypothesis: true difference in mean is not equal to 0
#> 95 percent confidence interval:
#>  -0.8157581 -0.4481261
#> sample estimates:
#> difference in mean 
#>         -0.6319421

svyby(~ heures.tv, ~ sport, dw, svyquantile, quantiles = 1:3/4, ci = TRUE, na.rm = TRUE) # quartiles
#> Warning in vcov.svyquantile(X[[i]], ...): Only diagonal of vcov() available

#> Warning in vcov.svyquantile(X[[i]], ...): Only diagonal of vcov() available
#>     sport 0.25 0.5 0.75   se.0.25    se.0.5   se.0.75
#> Non   Non    1   2    3 0.0000000 0.0000000 0.2551067
#> Oui   Oui    1   2    3 0.1020427 0.1371243 0.2551067
svyranktest(heures.tv ~ sport, dw)
#> 
#>  Design-based KruskalWallis test
#> 
#> data:  heures.tv ~ sport
#> t = -6.1587, df = 1993, p-value = 8.846e-10
#> alternative hypothesis: true difference in mean rank score is not equal to 0
#> sample estimates:
#> difference in mean rank score 
#>                   -0.09905157

# binary regression
reg <- svyglm(sport ~ sexe + etud + heures.tv, family = quasibinomial(), design = dw)
questionr::odds.ratio(reg)
#>                         OR    2.5 %  97.5 %         p    
#> (Intercept)        0.18564  0.11762  0.2930 6.902e-13 ***
#> sexeFemme          0.79913  0.61522  1.0380  0.093059 .  
#> etudSecondary      3.73818  2.32314  6.0151 6.261e-08 ***
#> etudProfessionnal  4.94649  3.18799  7.6750 1.398e-12 ***
#> etudHigh          10.16897  6.43759 16.0632 < 2.2e-16 ***
#> heures.tv          0.88979  0.81820  0.9677  0.006427 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::tidy(reg, exponentiate = TRUE, conf.int = TRUE)
#> # A tibble: 6 x 7
#>   term             estimate std.error statistic  p.value conf.low conf.high
#>   <chr>               <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Intercept)         0.186    0.233      -7.23 6.90e-13    0.118     0.293
#> 2 sexeFemme           0.799    0.133      -1.68 9.31e- 2    0.615     1.04 
#> 3 etudSecondary       3.74     0.243       5.43 6.26e- 8    2.32      6.02 
#> 4 etudProfessionn~    4.95     0.224       7.13 1.40e-12    3.19      7.67 
#> 5 etudHigh           10.2      0.233       9.94 9.76e-23    6.44     16.1  
#> 6 heures.tv           0.890    0.0428     -2.73 6.43e- 3    0.818     0.968

from finalfit.

larmarange avatar larmarange commented on July 29, 2024

In terms of expected output, the idea would be to be able to do the following :

dep <- "sport"
expl <- c("sexe", "etud", "heures.tv")

library(finalfit)
dw %>% summary_factorlist(dep, expl, p = TRUE) 
dw %>% finalfit(dep, expl)

The table will be exactly the same as with a classic data.frame, except that all stats would have been computed using survey functions, taking into account sample weights and sample design.

Best regards

from finalfit.

larmarange avatar larmarange commented on July 29, 2024

Just for information, if you prefer working with dplyr syntax, there is a package called srvyr extening dplyr verbs to survey objects.

https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html

With this package you can do things like:

dw <- as_survey_design(dw) # to add tbl_svy class to dw
dw %>% group_by(sport) %>% summarise(mean = survey_mean(heures.tv, na.rm = TRUE))

Here, the result will be a tibble.

from finalfit.

larmarange avatar larmarange commented on July 29, 2024

Thanks a lot. That's great!!!!

Just a small comment: for digits argument, it would be nice to consider a 5th value correspond to the number of digits for the number of observations (for example, to be able to use 0 for the number of observations while keeping 1 for percentage).

Otherwise, everything seems to behave as expected.

Thanks again

from finalfit.

tylcole avatar tylcole commented on July 29, 2024

from finalfit.

nobu-nsk avatar nobu-nsk commented on July 29, 2024

Thank you very much for this excellent development.
I also needed exactly this functionality.

I noticed that the currently Mean (SE) is used instead of (SD).
Definitely, more often we need SD for reporting together with means.
Unfortunately, svymean package does not provide SD but it can be derived easily from svyvar:

svyvar(~x,design)[1] %>% sqrt()
or
svyby(~x, ~dependent, design, svyvar, na.rm = TRUE)[2] %>% sqrt()

I would be great to have Mean (SD), which is also ensuring consistency with summary_factorlist().

Nobu Nishikiori

from finalfit.

tylcole avatar tylcole commented on July 29, 2024

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

Thanks both.
Just to confirm, there is not a library(survey) function that will give both the mean and the sd in the same function?
I have re-written the original summary_factorlist() to remove some old dependencies and incorporate requested features.
We'll get that and this out soon. Thanks again. Ewen

from finalfit.

tylcole avatar tylcole commented on July 29, 2024

from finalfit.

nobu-nsk avatar nobu-nsk commented on July 29, 2024

I believe so.

Just to confirm, there is not a library(survey) function that will give
both the mean and the sd in the same function?
Thank you very much for incorporating the feature.
Nobu

from finalfit.

e-ibrahim avatar e-ibrahim commented on July 29, 2024

Hi @ewenharrison

Please, I am a newbie to R and I am trying to adapt your "summary_survey(dw, dependent, explanatory, fit_id = TRUE)" solution to my work with below parameters:

surveydes=svydesign(id=~psu, strata=~strata, weights=~pweight, data=data)
dependent="knowledge"
explanatory=c("agegroup", "married", "children")

#knowledge (factor var: knowledgeable vs. not knowledgeable)
#agegroup (factor var: 15-24 vs. 25-34, 35-49)
#married (factor var: never married vs. ever married)
#children (integer: 0, 1, 2, ...,19

surveydes %>%
summary_survey(dependent, explanatory, cont = "mean", cont_range = FALSE, p = FALSE, column = FALSE,
digits = c(1, 1, 3, 0))

Each time I do run my code, it returns:
"Error in select(., explanatory) : unused argument (explanatory)"

I copied and pasted your "summary_survey" source codes in my R environment. But, I am certain there is something I have either misapplied or am not doing right.
Could you please advise?

Thank you,
Elhakim

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

Hi, not sure why it is not working for you. Do you have a reproducible example?

from finalfit.

ABSANDIE avatar ABSANDIE commented on July 29, 2024

Thanks Joseph for the very useful example.

This probably gets us quite far. There are summary_factorlist() features not yet included, most notably, labels. But that would be an easy addition.

Would be tidied up and split into component functions prior to being deployed.

Interested in thoughts.

library(questionr)
data(hdv2003)

library(dplyr)
hdv2003$etud <- recode_factor(
  hdv2003$nivetud,
  "N'a jamais fait d'etudes" = "Primary",
  "A arrete ses etudes, avant la derniere annee d'etudes primaires" = "Primary",
  "Derniere annee d'etudes primaires" = "Primary",
  "1er cycle" = "Secondary",
  "2eme cycle" = "Secondary",
  "Enseignement technique ou professionnel court" = "Professionnal",
  "Enseignement technique ou professionnel long" = "Professionnal",
  "Enseignement superieur y compris technique superieur" = "High"
)

library(survey)
dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003)

summary_survey <- function(.data, dependent, explanatory, cont = "mean", cont_range = FALSE, 
                           p = FALSE, column = FALSE, digits = c(1, 1, 3, 1), fit_id = FALSE){
  # Define variable type
  explanatory_type = .data$variables %>% 
    select(explanatory) %>% 
    purrr::map(is.numeric)
  
  # Hypothesis test
  if(p){
    p_tests = explanatory %>% 
      purrr::map2(explanatory_type,
                  ~if(!.y){
                    survey::svychisq(as.formula(paste0("~", ., "+", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  } else if (.y & cont == "mean"){
                    survey::svyttest(as.formula(paste0(.x, "~", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  } else if (.y & cont == "median"){
                    survey::svyranktest(as.formula(paste0(.x, "~", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  }
      )
  }
  
  # Output table
  explanatory %>% 
    purrr::map2(explanatory_type,
                ~ if(!.y){
                  survey::svytable(as.formula(paste0("~", .x, "+", dependent)), .data) %>% 
                    as.data.frame(stringsAsFactors = FALSE) %>% 
                    { if(column) {
                      dplyr::group_by(., !! sym(dependent)) %>% 
                        dplyr::mutate(
                          total = sum(Freq),
                          prop = 100 * Freq / total
                        )
                    } else { 
                      dplyr::group_by_at(., 1) %>% 
                        dplyr::mutate(
                          total = sum(Freq),
                          prop = 100 * Freq / total
                        )
                    }
                    } %>% 
                    dplyr::mutate(
                      value = paste0(Freq %>% round_tidy(digits[4]), " (", 
                                     prop %>% round_tidy(digits[4]), ")")
                    ) %>%
                    dplyr::select(-total, -prop, -Freq) %>% 
                    tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                    dplyr::mutate(
                      label = names(.)[1]
                    ) %>% 
                    dplyr::rename(levels = 1)
                } else {
                  { if(cont == "mean") {
                    survey::svyby(as.formula(paste0("~", .x)), as.formula(paste0("~", dependent)), .data, svymean, na.rm = TRUE) %>%
                      dplyr::mutate(
                        value = paste0(!! sym(.x) %>% round_tidy(digits[1]), " (", 
                                       se %>%  round_tidy(digits[2]), ")") 
                      ) %>% 
                      dplyr::select(-c(.x, se)) %>% 
                      tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                      dplyr::mutate(
                        label = .x,
                        levels = "Mean (SE)"
                      )
                  } else if(cont == "median"){
                    survey::svyby(as.formula(paste0("~", .x)), as.formula(paste0("~", dependent)), .data, 
                                  svyquantile, quantiles = 1:3/4, ci = TRUE, na.rm = TRUE) %>% 
                      dplyr::rename(Q1 = 2,
                                    Q2 = 3,
                                    Q3 = 4) %>% 
                      dplyr::mutate(
                        IQR = Q3 - Q1) %>% 
                      { if(cont_range){
                        dplyr::mutate(., 
                                      value = paste0(Q2 %>% round_tidy(digits[1]), " (", 
                                                     Q1 %>%  round_tidy(digits[2]), " to ",
                                                     Q3 %>% round_tidy(digits[2]), ")")
                        )                        
                      } else {
                        dplyr::mutate(.,
                                      value = paste0(Q2 %>% round_tidy(digits[1]), " (", 
                                                     IQR %>%  round_tidy(digits[2]), ")")
                        )
                      }} %>% 
                      dplyr::select(-c(2:8)) %>%                      
                      tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                      dplyr::mutate(
                        label = .x,
                        levels = "Median (IQR)"
                      )
                  }
                  }
                }) %>% 
    
    # Add hypothesis test
    { if(p){
      purrr::map2_df(., p_tests,
                     ~ mutate(.x,
                              p = .y)
      )} else {
        dplyr::bind_rows(.)
      }} %>%
    dplyr::select(label, levels, dplyr::everything()) %>% 
    as.data.frame() %>%
    { if(fit_id){
      levels_id = .$levels
      drop = levels_id %in% c("Mean (SE)", "Median (IQR)")
      levels_id[drop] = ""
      mutate(., 
             fit_id = paste0(label, levels_id),
             index = 1:dim(.)[1])
    } else {
      .
    }} %>% 
    rm_duplicate_labels()
}


library(finalfit)
dependent = "sport"

# Choose different options to test
explanatory = c("sexe")
explanatory = c("heures.tv")
explanatory = c("sexe", "heures.tv", "etud")

dw %>% 
  summary_survey(dependent, explanatory, cont = "mean", cont_range = FALSE, p = FALSE, column = FALSE, 
                 digits = c(1, 1, 3, 0))


ff_merge(
  summary_survey(dw, dependent, explanatory, fit_id = TRUE),
  svyglmmulti(dw, dependent, explanatory, family = "quasibinomial") %>%
    fit2df(exp = TRUE, estimate_name = "OR (multivariable)"),
  last_merge = TRUE
)

Many thanks for "summary_survey", it is a very useful function. Please I would like to have the row/col total in the output table.

from finalfit.

ABSANDIE avatar ABSANDIE commented on July 29, 2024

Many thanks for "summary_survey", it is a very useful function. Please I would like to have the row/col total in the output table.

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

Many thanks for writing.
Yes, this is not yet implemented.
I guess you would need an observed total as well as a weighted total?
Best wishes,
Ewen

from finalfit.

ABSANDIE avatar ABSANDIE commented on July 29, 2024

@ewenharrison thanks for the reply
Yes, it would be great to have both totals if possible.

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

Do you have published examples of how these tables are normally presented?

from finalfit.

ABSANDIE avatar ABSANDIE commented on July 29, 2024
<style> </style>
label levels X1 X2 Total
Sexe Féminin 761.6 (77.1) 23242.2 (85.7) 24003.8(81.4)
  Masculin 226.6 (22.9) 3881.9 (14.3) 4108.5 (18.6)
  Total 988.2(100.0) 27124.1(100.0) 28112.3(100.0)

from finalfit.

ABSANDIE avatar ABSANDIE commented on July 29, 2024

@ewenharrison please I think the weighted total will be enough.
Since the observed total could be obtained from the summary_factorlist function

from finalfit.

ewenharrison avatar ewenharrison commented on July 29, 2024

Hi, thanks for emailing about this again.
Sorry if I misled, there is no plan for me to develop this function at the moment.
I guess there are a lot of people out there doing this sort of things - can't you find anything?

If you simply want the totals from the weighted analysis, could you not do this?

library(questionr)
library(tidyverse)
library(finalfit)
library(survey)
data(hdv2003)
dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003)
dependent = "sport"
explanatory = c("sexe")

dw %>% 
	summary_survey(dependent, explanatory) %>% 
	bind_cols(Total = svytotal(as.formula(paste0("~", explanatory)), dw))
label levels              Non              Oui   Total
1  sexe  Homme 2850209.9 (55.4) 2299172.5 (44.6) 5149382
2        Femme 3864550.4 (65.3) 2057293.5 (34.7) 5921844

from finalfit.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.