larmarange / labelled
Manipulating labelled vectors in R
Home Page: https://larmarange.github.io/labelled/
License: GNU General Public License v3.0
Develop a cheatsheet for labelled
cf. https://www.rstudio.com/resources/cheatsheets/how-to-contribute-a-cheatsheet/
Prepare for release:
devtools::check()
devtools::check_win_devel()
rhub::check_for_cran()
rhub::check(platform = 'ubuntu-rchk')
rhub::check_with_sanitizers()
revdepcheck::revdep_reset()
revdepcheck::revdep_check(num_workers = 4)
revdep\email.yml
revdepcheck::revdep_email()
Submit to CRAN:
usethis::use_version()
cran-comments.md
devtools::submit_cran()
pkgdown::build_site()
Wait for CRAN...
usethis::use_github_release()
CRAN-RELEASE
usethis::use_dev_version()
Currently levels, value_labels, na_values, and na_range are converted to a string, e.g.: https://github.com/larmarange/labelled/blob/master/R/lookfor.R#L96
The current functionality is useful with View() but less useful when the labels are needed for further processing (e.g. to display labels in a chart or graphic).
Could we add the option to use a machine-readable format like JSON, or to preserve the original vectors by storing them in a column of type <list>?
jsonlite::toJSON can be imported lazily using Suggests in the DESCRIPTION file, or no additional dependencies are needed if a flag is added to preserve the original vectors.
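As a rough illustration of the list-column option, here is a minimal base R sketch. It is not the lookfor() implementation: label_summary() is a hypothetical helper, and it only assumes that value labels live in the "labels" attribute of each column.

```r
# Toy data: value labels stored in the "labels" attribute of a column.
d <- data.frame(sex = c(1, 2, 1), age = c(30, 41, 25))
attr(d$sex, "labels") <- c(male = 1, female = 2)

# Hypothetical helper: one row per variable, value labels kept in a list column
# instead of being collapsed into a single string.
label_summary <- function(data) {
  out <- data.frame(variable = names(data), stringsAsFactors = FALSE)
  out$value_labels <- lapply(data, attr, "labels", exact = TRUE)
  out
}

res <- label_summary(d)
res$value_labels[[1]]  # still a named numeric vector, usable for further processing
```

Variables without value labels simply get a NULL entry in the list column, so no extra dependency is needed.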
I need to bind_rows() two tibbles that contain labelled data and list columns. dplyr is dropping the labels with a warning about "Vectorizing labelled data". To circumvent this I am trying to extract the lists of variable labels and value labels and re-apply them to the bound tibble. Unfortunately, this does not work for variable labels and list columns (see the reprex below).
What do you think about:
var_label() does not actually need to check whether x is atomic, does it? The test could be dropped IMHO.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(haven)
library(labelled)
d <- data_frame(
x = labelled(1:5, c(a=1, b=5)),
lc = as.list(1:5)
)
var_label(d$x) <- "This is x"
# Can't apply variable label to a list column
var_label(d$lc) <- "This is lc" # Why not actually?
#> Error: `x` should be atomic
# Extract value labels
vl <- val_labels(d)
# Bind rows and re-apply value labels
dd <- bind_rows(d, d, .id="copy")
#> Warning in bind_rows_(x, .id): Vectorizing 'labelled' elements may not
#> preserve their attributes
#> Warning in bind_rows_(x, .id): Vectorizing 'labelled' elements may not
#> preserve their attributes
val_labels(dd) <- vl
dd$x # OK!
#> <Labelled integer>
#> [1] 1 2 3 4 5 1 2 3 4 5
#>
#> Labels:
#> value label
#> 1 a
#> 5 b
# Can't extract variable labels
# because d$lc is not atomic
var_label(d)
#> Error: `x` should be atomic
# This can be done "manually" along the following lines
varlabs <- lapply(d, attr, "label")
var_label(dd) <- varlabs[1] # skip the list column
lapply(dd, attr, "label")
#> $copy
#> NULL
#>
#> $x
#> [1] "This is x"
#>
#> $lc
#> NULL
r <- data_frame( ch = structure(letters[1:2], some_attribute=TRUE) ) %>%
labelled::remove_attributes("some_attribute")
is.factor(r$ch)
I believe this relates to stringsAsFactors=FALSE at labelled/R/remove_attributes.R, line 31 in 650e920.
Hi,
I recently started to use labelled more frequently and find it a bit difficult to switch between the representations as labelled vectors and as factors. Internally, many functions require factors, but factors are not as flexible as labelled vectors. So it would be nice to define a new format class for converting between the two. I.e., have something like
x <- labelled(c(1, 2, 2), labels = c(x = 1, y = 2))
fmt <- format(x)
x_fct <- as_factor(x)
xx <- as_labelled(x_fct, fmt)
where xx == x holds. This would allow keeping the data as labelled vectors with values as specified in the database, and switching back and forth between factors and labelled vectors as needed. Are there any plans in this direction or would you accept a pull request?
Best,
Kevin
In that case, if the labelled vector is not converted to a factor, it will be converted to a character or a numeric vector, not kept as a labelled vector
See examples in val_label. Have you been planning a feature, which was changed afterwards?
Prepare for release:
devtools::check_win_devel()
rhub::check_for_cran()
Perform release:
devtools::check_win_devel() (again!)
devtools::submit_cran()
pkgdown::build_site()
Wait for CRAN...
Template from r-lib/usethis#338
The labelled and sjlabelled packages are especially useful when automating the production of tables, graphs, and other results from real data. However, I have a few ideas for in-house (possibly worthy of sharing with others) functions that are analogous to the labels, but instead specify whether the columns in a data set are dependent, mediator, or independent variables; and whether they are ordinal or nominal (if categorical/labelled). With some easy-to-use functions that keep these attributes when performing e.g. tidyverse operations, it would be very easy to produce large amounts of graphs where some functions down the pipeline "understand" what should go on the x-axis, y-axis, caption, etc.
Sure, it is easy to add regular attributes with attr(df$my_var, "type_of_variable") <- "independent", but the ecosystem of functions in these packages seems convenient for the same purpose. Though, I am not sure whether there should be a fixed attribute that is called "vartype", or just some generic functions for the user to define one's own attributes.
If I set a variable label with set_variable_labels and later apply dplyr::filter, the variable label is removed. Here's a small example:
library(dplyr)
library(labelled)
df <- tibble(id = 1:2, can = factor(c('yes', 'no'))) %>%
set_variable_labels(can = 'Cannabis use')
#variable label is there
df$can
#variable label is not there
filter(df, id == 1)$can
I'm not sure if this is a bug of dplyr or of the labelled package. It seems to have been introduced with dplyr version 0.8
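Until the cause is settled, one possible workaround is to save the variable labels before subsetting and re-apply them afterwards. The sketch below uses base R only; get_labels() and restore_labels() are hypothetical helpers, not functions from labelled or dplyr, and they assume the label is stored in the "label" attribute.

```r
# Hypothetical helpers: save and restore the "label" attribute of each column.
get_labels <- function(data) lapply(data, attr, "label", exact = TRUE)

restore_labels <- function(data, labels) {
  for (v in intersect(names(data), names(labels))) {
    attr(data[[v]], "label") <- labels[[v]]
  }
  data
}

df <- data.frame(id = 1:2, can = factor(c("yes", "no")))
attr(df$can, "label") <- "Cannabis use"

saved <- get_labels(df)
sub <- df[df$id == 1, , drop = FALSE]  # subsetting may drop the attribute
sub <- restore_labels(sub, saved)
attr(sub$can, "label")
```

The same pattern should apply around a dplyr::filter() call, since it only relies on column names being unchanged.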
Following tidyverse/haven#185, some functions for variable and value labels that could be used with the %>% operator.
Sorry to bother, may I point you to an unanswered question at Stack Overflow. I am not sure whether I am making a mistake or there is an unwanted behavior in the package, but haven_labelled_spss data seems to lose the label attribute after using any form of na_values. Many thanks!
I'd like to include tagged missings (NAs) in a Date variable. But when I do the following
x <- rep(c(1,2),5)
x[[4]] <- tagged_na('a')
y <- as.Date(x, origin = '1992-01-01')
class(y)
#[1] "Date"
labelled(y, c("NA"=tagged_na('a')))
#Error: 'y' must be a numeric or a character vector
Is this behavior by design, or is this a valid feature request? :-)
Prepare for release:
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::revdep_check(num_workers = 4)
Perform release:
devtools::check_win_devel() (again!)
cran-comments.md
devtools::submit_cran()
pkgdown::build_site()
Wait for CRAN...
CRAN-RELEASE
Template from r-lib/usethis#338
Hey @juba and @larmarange,
I have been working with labelled survey data lately. Every time I do so, I find Stata far superior to R when it comes to doing some of the most basic things that we need to do when exploring that kind of data…
Take variable labels, which are essential to get a grip of any new survey dataset. How is the user supposed to list them all? Variable labels being in the attributes, users might want to do this:
for each column:
list variable name and label
Unless I am mistaken, this is not easily doable. The user might try this, which won't work:
attr(labelled_data, "label")
apply(labelled_data, 2, attr, "label")
What the user actually needs is:
vapply(labelled_data, attr, character(1), "label", exact = TRUE) # or...
sapply(labelled_data, attr, "label") # ... but non-strict and risky: might return partial matches
So, to get all variable labels in an easily searchable format like a data frame, the user needs, at the very least (and these examples do not even preclude partial matching):
data.frame(vapply(labelled_data, attr, character(1), "label"))
tibble::enframe(vapply(labelled_data, attr, character(1), "label"))
In all cases above, the user needs to be fairly familiar with R to get the labels. Furthermore, a single missing variable label will kill the function with a cryptic message:
Error in vapply(labelled_data, attr, character(1), "label") :
values must be length 1,
but FUN(X[[4]]) result is length 0
Here, [[4]] is the column (variable) that has no variable label (NULL).
I wrote a short function to list and search variable labels.
It is named var_labels in the spirit of the labelled package by Joseph, from which I took some code to write the show_values argument, and it is similar to the lookfor function that I wrote for questionr many years ago (thanks for improving it, Julien!):
#' @param data a labelled data frame
#' @param show_values add a column showing labelled values
#' @param ... character string(s) to match in the variable names or labels
#' @param ignore.case whether to ignore case when matching variable names or labels
#' @return a tibble
var_labels <- function(data, show_values = FALSE, ..., ignore.case = TRUE) {
  require(magrittr) # can easily be removed if need be
  require(tibble)   # preferable in my view to returning a data.frame
  # variable labels -> tibble
  vars <- names(data)
  lbls <- tibble::tibble(variable = vars) %>%
    tibble::add_column(
      label = vapply(vars, function(x) {
        # similar to labelled:::var_label.default
        x <- attr(data[[ x ]], "label", exact = TRUE)
        # handle missing variable labels
        ifelse(is.null(x), NA_character_, x)
      }, character(1))
    )
  # add labelled values
  if (show_values) {
    # similar to labelled:::val_labels.haven_labelled
    lbls <- tibble::add_column(
      lbls,
      values = vapply(vars, function(x) {
        x <- attr(data[[ x ]], "labels", exact = TRUE)
        # handle missing value labels
        if (is.null(x)) {
          NA_character_
        } else {
          x <- paste0("[", x, "] ", names(x))
          paste(x, collapse = " ")
        }
      }, character(1))
    )
  }
  # subset to matching rows (a more complex option would be to use `tidyselect`)
  find <- c(...)
  if (length(find)) {
    find <- paste(find, collapse = "|")
    find <- grepl(find, lbls$variable, ignore.case = ignore.case) |
      grepl(find, lbls$label, ignore.case = ignore.case)
    lbls[ find, ]
  } else {
    lbls
  }
}
(The vapply part cannot be written more efficiently due to the possibility of missing values. Using purrr::attr_getter does not solve the issue, as attr_getter simply wraps around attr.)
Example, using some labelled data included in questionr:
library(questionr)
data(fertility)
women$unlabelled_test_variable <- 1L
var_labels(women)
var_labels(women, show_values = TRUE)
var_labels(women, "weight", "child") # Stata equivalent: lookfor weight child
var_labels(women, "hiv", show_values = TRUE)
Now, I do not know where to submit that function: are any of you interested in including it in questionr or labelled?
I also submitted a simpler function to haven, and opened another issue to discuss its search support.
I was using remove_val_label to remove labels from some data saved a few months ago under the labelled class, but since the val_label.labelled method was deleted, it does not work. I do not know whether it would be a good option to include this method again, or whether there is another way to remove labels.
When converting a data.frame, a strict argument (checking if all values in the vector have a label) could be relevant to convert only those factors.
Prepare for release:
haven 2.0.0 released on CRAN (required for the different tests)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::revdep_check(num_workers = 4)
email.yml, then revdepcheck::revdep_email_maintainers()
Perform release:
devtools::check_win_devel() (again!)
devtools::submit_cran()
pkgdown::build_site()
Wait for CRAN...
Template from r-lib/usethis#338
When applied to a data.frame, non-labelled variables are dropped!
@elinw suggested here ropensci/skimr#296 that it might be beneficial for different labelled classes to exist, for the different underlying types. This seems to make a lot of sense to me, because this would make it easier (possible?) to write appropriate summary, skim, print methods etc.
Do you agree or is it actually possible now too?
Prepare for release:
devtools::check()
devtools::check_win_devel()
rhub::check_for_cran()
rhub::check(platform = 'ubuntu-rchk')
rhub::check_with_sanitizers()
revdepcheck::revdep_check(num_workers = 4)
revdep\email.yml
revdepcheck::revdep_email()
Submit to CRAN:
usethis::use_version()
cran-comments.md
devtools::submit_cran()
pkgdown::build_site()
Wait for CRAN...
usethis::use_github_release()
CRAN-RELEASE
usethis::use_dev_version()
Below, d is a tibble read from an SPSS file with haven::read_spss(). I am getting:
print(d)
Error in `levels<-`(`*tmp*`, value = as.character(levels)) :
factor level [9] is duplicated
> traceback()
19: factor(x, labs, ordered = ordered)
18: as_factor.haven_labelled(x, "labels")
17: as_factor(x, "labels")
16: lbl_pillar_info(x)
15: pillar_shaft.haven_labelled(X[[i]], ...)
14: FUN(X[[i]], ...)
13: lapply(.x, .f, ...)
12: map(x[pillar_shown], pillar_shaft)
11: colonnade_get_width(x, width, rowid_width)
10: pillar::squeeze(x$mcf, width = width)
9: format.trunc_mat(mat)
8: format(mat)
7: format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
6: format(x, ..., n = n, width = width, n_extra = n_extra)
5: paste0(..., "\n")
4: cat(paste0(..., "\n"), sep = "")
3: cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
2: print.tbl(x)
1: (function (x, ...)
UseMethod("print"))(x)
I suppose the print method makes a factor out of a labelled variable for printing, assuming that value labels are unique. There is no such restriction in, say, SPSS. Sometimes people take advantage of it.
It would be very useful to have a function that automatically sets all data.frame labels as transformed versions of the column names. Similar to the janitor package's clean_names() function that creates and sets snake case column names, I would like to be able to set all column labels to a readable version of the column names from within a pipe. (Usually I am transforming from snake case to title case and replacing "_" with spaces.)
I could see two approaches to this:
The more straightforward but less flexible approach would be to allow the user a limited set of pre-defined transformation options (e.g. title case, all caps, replace "_" with " ").
Allow a user to use any function to transform. I'm not sure of the best way to do this, but perhaps it could employ some of the tools underlying rename_all() in dplyr: (https://github.com/tidyverse/dplyr/blob/master/R/funs.R)
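The second approach could look roughly like the base R sketch below. label_from_names() and its .fn argument are hypothetical names chosen for illustration, not part of labelled; the sketch only assumes that a variable label is the "label" attribute of a column.

```r
# Hypothetical helper: derive each column's "label" attribute from its name.
# `.fn` is any function mapping a character vector of names to labels.
label_from_names <- function(data, .fn = function(x) gsub("_", " ", x)) {
  new_labels <- .fn(names(data))
  for (i in seq_along(data)) {
    attr(data[[i]], "label") <- new_labels[[i]]
  }
  data
}

df <- data.frame(birth_year = c(1980, 1992), home_town = c("Lyon", "Nice"))
# Replace underscores with spaces, then switch to all caps.
df <- label_from_names(df, function(x) toupper(gsub("_", " ", x)))
attr(df$birth_year, "label")
```

Because the data frame is the first argument and is returned, the helper can be used inside a pipe, and any transformation (title case, custom dictionaries) can be passed as .fn.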
Seems to be caused by this line checking for the old class name.
Line 144 in 358b2d9
Maybe it should be a call to is.labelled() instead.
Here's a minimal example illustrating the unexpected behaviour:
> library(labelled)
[...]
> df <- data.frame(x=labelled(1:3, labels=c(a=1, b=2, c=3)))
> str(df)
'data.frame': 3 obs. of 1 variable:
$ x: 'haven_labelled' int 1 2 3
..- attr(*, "labels")= Named num 1 2 3
.. ..- attr(*, "names")= chr "a" "b" "c"
> to_factor(df) # Unexpected: makes no change to labelled column `x`
x
1 1
2 2
3 3
> to_factor(df$x) # Expected: changes levels to factor
[1] a b c
Levels: a b c
> as.data.frame(lapply(df, to_factor)) # Expected behaviour of calling to_factor() on a data.frame
x
1 a
2 b
3 c
The following snippet shows that changing the line to call is.labelled() instead fixes the behaviour.
> # Patch suspect function with call to `is.labelled()` instead
> utils::assignInNamespace(
+ '.to_factor_col_data_frame',
+ function (x, levels = c("labels", "values", "prefixed"), ordered = FALSE,
+ nolabel_to_na = FALSE, sort_levels = c("auto", "none", "labels",
+ "values"), decreasing = FALSE, labelled_only = TRUE,
+ drop_unused_labels = FALSE, strict = FALSE, ...)
+ {
+ if (is.labelled(x)) # <-- Change is here
+ x <- to_factor(x, levels = levels, ordered = ordered,
+ nolabel_to_na = nolabel_to_na, sort_levels = sort_levels,
+ decreasing = decreasing, drop_unused_labels = drop_unused_labels,
+ strict = strict, ...)
+ else if (!labelled_only)
+ x <- to_factor(x)
+ x
+ },
+ 'labelled'
+ )
> to_factor(df) # Now follows expected behaviour
x
1 a
2 b
3 c
Tested with package labelled version 2.0.1
The if check only looks for "labelled" class and misses "haven_labelled".
Corrected in this fork: NoahMarconi@4001267
I also needed a JSON format. If you're open to a commit like that I can edit to make the JSON optional (e.g. prefixed or JSON) and submit a pull request.
I want to create variable foo but x is created (with value labels):
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data.frame(
foo = labelled::labelled(1:5, c(a=1, b=2))
) %>%
str()
#> 'data.frame': 5 obs. of 1 variable:
#> $ x:Class 'labelled' atomic [1:5] 1 2 3 4 5
#> .. ..- attr(*, "labels")= Named num [1:2] 1 2
#> .. .. ..- attr(*, "names")= chr [1:2] "a" "b"
But for tibbles it is OK:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
tibble(
foo = labelled::labelled(1:5, c(a=1, b=2))
) %>%
str()
#> Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 1 variable:
#> $ foo:Class 'labelled' atomic [1:5] 1 2 3 4 5
#> .. ..- attr(*, "labels")= Named num [1:2] 1 2
#> .. .. ..- attr(*, "names")= chr [1:2] "a" "b"
I'm not yet sure why it happens.
Check tidyverse/haven#108
Hey Joseph,
I think the first and family names are inverted here:
Lines 10 to 11 in 5ae5354
Don't ask me why I'm noticing that now and here!
Hope you're good :)
e.g.
as_factor(1:4, "prefixed")
[1]
Levels: prefixed
This is a little annoying, especially because most haven functions do not complain. Maybe you could check the type in val_labels and cast it correctly if possible or throw an error if not?
x = 1L:5L
labelled::val_labels(x) <- c("low" = 1)
haven::na_tag(x)
Error: `x` must be a double vector
Hi!
If I provide a named list of values to set_variable_labels(), it does not work because the list gets wrapped in another list. This makes it impossible to follow the efficient workflow of...
The problem is the first line in set_variable_labels: values <- list(...)
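One way to accept both styles is sketched below in base R. This is not the package's code: set_labels_sketch() is a hypothetical stand-in, and the .labels argument name is an assumption of this sketch. Individual name = label pairs go through ..., while a pre-built named list or vector goes through .labels.

```r
# Hypothetical stand-in for a setter that accepts `...` or a named list.
set_labels_sketch <- function(.data, ..., .labels = NULL) {
  # Merge both sources into one flat named list (no list-in-a-list).
  values <- c(list(...), as.list(.labels))
  for (v in names(values)) {
    if (!v %in% names(.data)) stop("unknown variable: ", v)
    attr(.data[[v]], "label") <- values[[v]]
  }
  .data
}

df <- data.frame(id = 1:2, can = c("yes", "no"))
labs <- list(id = "Identifier", can = "Cannabis use")

df1 <- set_labels_sketch(df, .labels = labs)        # named list
df2 <- set_labels_sketch(df, can = "Cannabis use")  # individual arguments
attr(df1$id, "label")
```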
To convert labelled data to character
Function sort_val_labels to sort labels according to value or according to label.
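A minimal base R sketch of what such a function might do is shown here. sort_labels_sketch() is a hypothetical name and signature, and the sketch assumes value labels are a named vector in the "labels" attribute.

```r
# Hypothetical sketch: reorder the "labels" attribute by value or by label text.
sort_labels_sketch <- function(x, by = c("values", "labels"), decreasing = FALSE) {
  by <- match.arg(by)
  labs <- attr(x, "labels", exact = TRUE)
  if (is.null(labs)) return(x)  # nothing to sort
  if (by == "values") {
    ord <- order(labs, decreasing = decreasing)
  } else {
    ord <- order(names(labs), decreasing = decreasing)
  }
  attr(x, "labels") <- labs[ord]
  x
}

x <- c(1, 3, 2, 1)
attr(x, "labels") <- c(high = 3, low = 1, mid = 2)

by_value <- sort_labels_sketch(x, by = "values")
by_label <- sort_labels_sketch(x, by = "labels")
names(attr(by_value, "labels"))
```

Only the order of the labels changes; the data values themselves are left untouched.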
Would it be interesting to have a function (or an option) to trim out the "format.*" (e.g. format.stata, etc...) attributes of the variables?
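For illustration, such a trimming step could be sketched in base R as follows. strip_format_attrs() is a hypothetical helper, and the attribute names used in the example ("format.stata", "label") are only assumptions about what an imported column might carry.

```r
# Hypothetical helper: drop every column attribute whose name starts with "format.".
strip_format_attrs <- function(data) {
  for (i in seq_along(data)) {
    col <- data[[i]]
    drop <- grep("^format\\.", names(attributes(col)), value = TRUE)
    for (a in drop) attr(col, a) <- NULL
    data[[i]] <- col
  }
  data
}

d <- data.frame(x = c(1.5, 2.5))
attr(d$x, "format.stata") <- "%8.0g"
attr(d$x, "label") <- "Some measure"

d2 <- strip_format_attrs(d)
names(attributes(d2$x))
```

Other attributes, such as the variable label, are kept because only names matching the "format." prefix are removed.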