
ipumsr's Introduction

ipumsr

Project status: Active | CRAN status | R build status | Codecov test coverage

ipumsr provides an R interface for handling IPUMS data, allowing users to:

  • Easily read files downloaded from the IPUMS extract system

  • Request data, download files, and get metadata from certain IPUMS collections

  • Interpret and process data using the contextual information that is included with many IPUMS files

Installation

To install the package from CRAN, use

install.packages("ipumsr")

To install the development version of the package, use

remotes::install_github("ipums/ipumsr")

What is IPUMS?

IPUMS is the world’s largest publicly available population database, providing census and survey data from around the world integrated across time and space. IPUMS integration and documentation make it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community context. Data and services are available free of charge.

IPUMS consists of multiple projects, or collections, that provide different data products.

  • Microdata projects distribute data for individual survey units, like people or households.
  • Aggregate data projects distribute summary tables of aggregate statistics for particular geographic units along with corresponding GIS mapping files.

ipumsr supports different levels of functionality (reading data extracts, requesting and downloading data, and browsing metadata) for each IPUMS project, as summarized below.

Project               Data Type       Description
IPUMS USA             Microdata       U.S. Census and American Community Survey microdata (1850-present)
IPUMS CPS             Microdata       Current Population Survey microdata, including basic monthly surveys and supplements (1962-present)
IPUMS International   Microdata       Census microdata covering over 100 countries, contemporary and historical
IPUMS NHGIS           Aggregate Data  Tabular U.S. Census data and GIS mapping files (1790-present)
IPUMS IHGIS           Aggregate Data  Tabular and GIS data from population, housing, and agricultural censuses around the world
IPUMS Time Use        Microdata       Time use microdata from the U.S. (1930-present) and thirteen other countries (1965-present)
IPUMS Health Surveys  Microdata       Microdata from the U.S. National Health Interview Survey (NHIS) (1963-present) and the Medical Expenditure Panel Survey (MEPS) (1996-present)
IPUMS Global Health   Microdata       Health survey microdata for low- and middle-income countries, including harmonized data collections for Demographic and Health Surveys (DHS) and Performance Monitoring for Action (PMA) surveys
IPUMS Higher Ed       Microdata       Survey microdata on the science and engineering workforce in the U.S. from 1993 to 2013

ipumsr uses the IPUMS API to submit data requests, download data extracts, and get metadata, so the scope of functionality generally corresponds to that available via the API. As the IPUMS team extends the API to support more functionality for more projects, we aim to extend ipumsr capabilities accordingly.
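
As an illustration, here is a minimal sketch of the API-based microdata workflow using functions exported by ipumsr; the sample and variable names are placeholders, and an IPUMS API key must already be set.

library(ipumsr)

# Define a small IPUMS USA extract (sample and variable names are illustrative)
extract <- define_extract_usa(
  description = "Example extract",
  samples = "us2017b",
  variables = c("AGE", "SEX")
)

# Submit the request, wait for it to complete, and download the files
submitted <- submit_extract(extract)
completed <- wait_for_extract(submitted)
ddi_path <- download_extract(completed)

# Read the data using the downloaded DDI codebook
data <- read_ipums_micro(ddi_path)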

Getting started

If you’re new to IPUMS data, learn more about what’s available through the IPUMS Projects Overview. Then, see vignette("ipums") for an overview of how to obtain IPUMS data.

The package vignettes are the best place to explore what ipumsr has to offer:

  • To read IPUMS data extracts into R, see vignette("ipums-read"); a minimal example appears after this list.

  • To interact with the IPUMS extract and metadata system via the IPUMS API, see vignette("ipums-api").

  • For additional details about microdata and NHGIS extract requests, see vignette("ipums-api-micro") and vignette("ipums-api-nhgis").

  • To work with labelled values in IPUMS data, see vignette("value-labels").

  • For techniques for working with large data extracts, see vignette("ipums-bigdata").
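
As a quick orientation, here is a minimal sketch of the reading workflow covered in those vignettes; the file and variable names are hypothetical.

library(ipumsr)

# Read the DDI codebook that accompanies the extract, then the data it describes
ddi <- read_ipums_ddi("usa_00001.xml")
data <- read_ipums_micro(ddi)

# Labelled values can then be converted as needed, e.g. to factors
data$SEX <- as_factor(data$SEX)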

The IPUMS support website also houses many project-specific R-based training exercises. However, note that some of these exercises may not be up to date with ipumsr's current functionality.

Related work

  • The survey and srvyr packages can help you incorporate IPUMS survey weights into your analysis for various survey designs.

  • See haven for more information about value labels and labelled vectors.

  • hipread underlies the hierarchical file reading functions in ipumsr.

Getting help + contributing

We greatly appreciate feedback and development contributions. Please submit any bug reports, pull requests, or other suggestions on GitHub. Before contributing, please be sure to read the Contributing Guidelines and the Code of Conduct.

If you have general questions or concerns about IPUMS data, check out our user forum or send an email to [email protected].


ipumsr's Issues

Improve lower_vars workflow (created May 1, 2020 by @dtburk on mnpopcenter/ipumsr)

May 1, 2020 @dtburk:

In response to mnpopcenter/ipumsr#56, we added a warning message when the lower_vars argument to any of the read_ipums_* functions is ignored. As described in the discussion of that issue, the reason the argument is sometimes ignored is to make sure the case of the variable names stays in sync between the data and the ipums_ddi object associated with the data. Keeping these in sync is helpful if the user wants to use a function like set_ipums_var_attributes() that attaches metadata from the ipums_ddi to variables in a data.frame. However, by making these metadata-attaching functions a little smarter, we can probably allow the case of variable names to get out of sync between ipums_ddi and data.frame, while still allowing users to attach metadata if they want to. Once we make those fixes, we can allow users to convert variable names to lowercase when they read in the data, even if they have already read in the DDI.
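
For reference, a minimal sketch of the behavior described above, using a hypothetical file name:

# lower_vars is honored when reading directly from the DDI path
data <- read_ipums_micro("usa_00001.xml", lower_vars = TRUE)

# ...but is ignored (with the new warning) when an ipums_ddi object is passed,
# so that variable-name case stays in sync between the data and the DDI
ddi <- read_ipums_ddi("usa_00001.xml")
data <- read_ipums_micro(ddi, lower_vars = TRUE)

# Current workaround: lowercase the names at the DDI stage
ddi <- read_ipums_ddi("usa_00001.xml", lower_vars = TRUE)
data <- read_ipums_micro(ddi)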

Release ipumsr 0.6.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Check if any deprecation processes should be advanced, as described in Gradual deprecation
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Rebuild pkgdown site
  • git push

error message typo in ipums_view

If htmltools, shiny, or DT are not present when trying to call ipums_view(ddi), the user is prompted with the following error:

Error in ipums_view(ddi) :
  Please install htmltools, shiny, and DT using
  `install.packages(c('htmltools', 'shiny', 'DT')

The closing ) and backtick are missing from the end of this message, which could confuse some users.

Please remove dependencies on **rgdal**, **rgeos**, and/or **maptools**

This package depends on (depends, imports or suggests) raster and one or more of the retiring packages rgdal, rgeos or maptools (https://r-spatial.org/r/2022/04/12/evolution.html). Since raster 3.6.3, all use of external FOSS library functionality has been transferred to terra, making the retiring packages very likely redundant. It would help greatly if you could remove dependencies on the retiring packages as soon as possible.

Table not available through ipumsr?

Hi there, I am trying to submit an NHGIS data extract through ipumsr and I'm unable to locate a table that I know is available through the website. The table is B19001H and I need it for both 2005-2009 ACS and 2014-2018 ACS. If helpful, the titles are:

  • For 2005-2009 ACS: Household Income in the Past 12 Months (in 2009 Inflation-Adjusted Dollars) (White Alone, Not Hispanic or Latino Householder).
  • For 2014-2018 ACS: Household Income in the Past 12 Months (in 2018 Inflation-Adjusted Dollars) (White Alone, Not Hispanic or Latino Householder).

Is it possible to make this available through ipumsr, or should I manually download this extract? Thank you!
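
For readers with the same question, here is a rough sketch of how such a table could be requested through ipumsr's NHGIS API functions; the dataset name and geography level below are assumptions that should be confirmed against the NHGIS metadata.

library(ipumsr)

# Browse NHGIS dataset metadata to find the dataset containing the table
datasets <- get_metadata_nhgis("datasets")

# Dataset name and geography level are assumptions; confirm them in the metadata
extract <- define_extract_nhgis(
  description = "B19001H, 2014-2018 ACS",
  datasets = ds_spec(
    "2014_2018_ACS5a",
    data_tables = "B19001H",
    geog_levels = "county"
  )
)

submitted <- submit_extract(extract)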

Get rid of message related to reading in a subset of variables

From ipumsr created by dtburk: mnpopcenter/ipumsr#72

We shouldn't see this message when we specify a subset of variables with the vars argument to read_ipums_micro():

Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(vars_of_interest)` instead of `vars_of_interest` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.

I got this message with ipumsr version 0.4.5 and tidyselect version 1.1.0.
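
For illustration, the user-facing version of tidyselect's recommendation looks like this (variable names hypothetical, and ddi is assumed to be a DDI path or object); the actual fix is to apply the same wrapping inside ipumsr:

vars_of_interest <- c("YEAR", "AGE", "SEX")

# Triggers the note: a bare external vector is ambiguous to tidyselect
data <- read_ipums_micro(ddi, vars = vars_of_interest)

# Silent: wrap the vector in all_of()
data <- read_ipums_micro(ddi, vars = tidyselect::all_of(vars_of_interest))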

`download_extract()` fails if provided an `ipums_extract` that has finished on server but does not have links in R

If attempting to download an extract by providing an ipums_extract object that was not yet completed at the time it was generated, download_extract() gets, but does not successfully use, the updated status of this extract provided by get_extract_info(). An expired extract error is thrown.

Should be able to be addressed by updating the is_ready variable after getting updated info:

if (!is_ready) {
  extract <- get_extract_info(extract, api_key = api_key)
}

should be changed to

if (!is_ready) {
  extract <- get_extract_info(extract, api_key = api_key)
  is_ready <- extract_is_completed_and_has_links(extract)
}

optional parameter in read_ipums_micro to choose how haven-labelled variables are handled (created Apr 19, 2021 by @schmert on mnpopcenter/ipumsr)

Apr 19, 2021 @schmert:

Haven-labelled variables are unfamiliar to many R users. The {ipumsr} documentation even includes instructions that suggest that R users will almost always want to alter the haven-labelled variables output by read_ipums_micro before doing any real work -- with zap_values, to_character, etc.

Would it be possible to add a parameter to read_ipums_micro that allows the user to choose how labelled variables are output in the first place? For example:

output_labelled_as = c("haven", "value", "label", "factor")

with the default being the current "haven".

This could save R users a ton of headaches. Thanks.
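
Until such a parameter exists, a common post-processing sketch uses haven directly (assuming the data have already been read with read_ipums_micro()):

library(dplyr)
library(haven)

# Convert every labelled column to a factor...
data_fct <- data %>%
  mutate(across(where(is.labelled), as_factor))

# ...or strip the labels and keep the underlying values
data_val <- data %>%
  mutate(across(where(is.labelled), zap_labels))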

understanding parsing of DDI file using regular expression

Hello. I was working on parsing a DDI file and was looking at the ipumsr source code. One thing I found a bit confusing was a portion of the ddi_read.R file, which seems to parse the <CodInstr> section of the variable node.

Most of the time, the categorical information is contained within the <catgry> tag; however, I noticed this section of the code that uses a regular expression to parse that portion of the <CodInstr> tag. My questions are: why is it necessary to parse the <CodInstr> section of the DDI file, and is this a common thing? The regular expression is very specific, so I am not sure that it would generalize very well. Is this function used only for the "total personal income" INCTOT variable, or are there other variables that also have categorical information in the <CodInstr> tag?

The code, from ddi_read.R starting at line 907, is shown below.

parse_code_regex <- function(x, vtype) {
  if (vtype %in% c("numeric", "integer")) {
    labels <- fostr_named_capture(
      x,
      "^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$",
      only_matches = TRUE
    )

    labels$val <- as.numeric(fostr_replace_all(labels$val, ",", ""))
  } else {
    labels <- fostr_named_capture(
      x,
      "^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$",
      only_matches = TRUE
    )
  }

  labels
}

Consider swapping order of package URLs in the DESCRIPTION file

Currently ipums.org is the first-listed URL in the DESCRIPTION file, which means that links to the package generated by tools such as downlit will go to that URL. It might make more sense to list the GitHub URL, or tech.popdata.org/ipumsr, first in the DESCRIPTION file for this reason.

pkgdown site links `as_factor()` to `forcats::as_factor()` instead of `haven::as_factor()`

The value-labels vignette refers to as_factor(), and pkgdown attempts to automatically link this to the appropriate function documentation, but in this case, it links to forcats::as_factor() instead of haven::as_factor(). If the vignette was just referring to a function from an external package, we could just use haven::as_factor() explicitly. However, ipumsr re-exports haven::as_factor() so that ipumsr users don't have to load haven to use it, so it wouldn't be ideal if we had to use haven::as_factor() in the vignette just to get the pkgdown link to work properly.

One partial solution would be to replace references to as_factor() with

[`as_factor()`](https://haven.tidyverse.org/reference/as_factor.html)

in the text of the vignette, but then those links would look different from the links auto-generated by pkgdown, and we would have to manually update the url if haven ever moved its documentation site. Moreover, that approach wouldn't work for code references to as_factor().

It's possible that we should create an issue on pkgdown or downlit requesting a new feature that allows pkgdown users to manually specify which package a function is from for function names that appear in multiple packages, or alternatively, an update that checks for function name matches in re-exported functions before looking more widely.

Update project info and UI for `ipums_website()`

ipums_website() has several issues that should be addressed. Currently, the list of supported projects is out of date and the UI is somewhat inconsistent. While this function likely does not get substantial use, it may remain useful given the current absence of a metadata API for microdata projects. We need to:

  • Update project names that are out of date (including hyphens)
  • Add recent IPUMS projects and remove retired ones
  • Allow use of API codes to specify projects for consistency with other functions in package
  • Allow the function to work on operating systems other than Windows
  • Don't require the var argument, since some supported projects do not have variable-specific websites
  • Streamline S3 dispatch, as a different argument is required if specifying project name manually (as opposed to with an ipums_ddi object)
  • Deprecate superfluous arguments and update defaults where confusing

ipums_view not correctly displaying table in RStudio viewer

ipums_view does not show value labels in the RStudio viewer. Instead, it will write something like "Showing 1 to 6 of 6 entries" without showing any of the entries on the page. Opening the page in a new browser window solves the issue.


Return all when case_selection_type is "detailed" and no selections specified

For example, detailed RACE codes could be of value for looking at multiple groups and having a dataset that can be filtered, versus several subset pulls or a hodgepodge set that may not address questions without multiple iterations. Is there any way to improve usability so that "detailed" case selection can be used with all values returned? This can be built manually, but that is considerably tedious.

Current default behavior:

var_spec("RACE",
         case_selection_type = "detailed",
         case_selections = c("must include exactly"))

Revised default behavior:

var_spec("RACE",
         case_selection_type = "detailed",
         case_selections = "all, unless you list specific codes")

`add_to_extract()` silently swallows unused arguments

add_to_extract() allows arbitrary argument names for cross-product compatibility, but no check is done to warn users if they include arguments that are not relevant for the particular extract type they are working with. This produces confusing behavior. For instance:

extract <- define_extract_usa(
  samples = "us2017b",
  variables = "YEAR",
  description = "Test extract"
)

# Returns extract with no modifications or warnings since there is no "vars" field in a usa_extract
add_to_extract(
  extract,
  vars = "New Variable"
)

We do warn users for remove_from_extract(), so this just requires an extrapolation of that check to add_to_extract().
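
A minimal sketch of the kind of check described above; the helper name and internals are illustrative, not ipumsr's actual implementation:

# Hypothetical helper: warn about ... arguments that do not match any field
# of the extract being modified
warn_unused_fields <- function(extract, dots) {
  unused <- setdiff(names(dots), names(extract))
  if (length(unused) > 0) {
    warning(
      "The following fields do not apply to this extract type and were ignored: ",
      paste(unused, collapse = ", ")
    )
  }
  invisible(extract)
}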

Cannot use `api_key` argument in `submit_extract`

The API request in submit_extract is missing the api_key argument, so users can only submit an extract if their API key is in their .Renviron. Attempting to submit an extract with the API key specified explicitly in the api_key argument fails.
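
Until this is fixed, the key has to come from the environment; a sketch of the workaround (the key value is a placeholder):

# Fails under the bug described above: the key is not forwarded to the request
submitted <- submit_extract(extract, api_key = "YOUR_API_KEY")

# Workaround: store the key so it is read from the IPUMS_API_KEY environment variable
set_ipums_api_key("YOUR_API_KEY", save = TRUE)
submitted <- submit_extract(extract)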

ipumsr and tidyselect 1.2.1

The next version of tidyselect that I'm about to release will cause CRAN failures for ipumsr because its tests are checking for exact matches of error messages generated in tidyselect and these have now changed. Since error message contents aren't part of the tidyselect API, could you please use testthat snapshots instead?

survey weights (created May 14, 2019 by @gergness on mnpopcenter/ipumsr)

May 14, 2019 @gergness:

When I was first writing ipumsr I did some work translating the Stata code on the static pages of ipums.org to explain how to use survey weight variables. It's always been on my todo list to help the projects update those pages, but I never got around to it.

Yesterday, two IPUMS users on twitter were talking about this:
https://twitter.com/surlyurbanist/status/1127968834902605825

To make sure it doesn't get lost, here's the translation of CPS, USA & NHIS user notes on weights for R.


CPS - Replicate Weights

Adapted from https://cps.ipums.org/cps/repwt.shtml

IS THERE ANY WAY TO DO THIS AUTOMATICALLY IN MAJOR STATISTICAL PACKAGES?

In R, the survey package (and the srvyr package, which is based on the survey package) set up an object with the survey weighting information for you.

  • The sample should be treated as a single stratum (the weights contain the relevant information from the sample design), so no PSU should be specified.
  • The full-sample weight must be specified.
  • You then specify the replicate weights in the repweights argument. Note that IPUMS-CPS data contain a variable called REPWTP, which merely indicates the presence of replicate weights and is coded 1 for every case. Therefore, make sure to use a regular expression like "REPWTP[0-9]+" to make sure you don't include REPWTP.
  • The fpc argument should not be specified.
  • The type argument should be set to "Jkn" and rho to 0.5
  • The mse argument should be set to TRUE

R (survey package)

# If not installed already: install.packages("survey")
library(survey)
svy <- svrepdesign(
  data = data,
  weight = ~WTSUPP,
  repweights = "REPWTP[0-9]+",
  type = "JK1",
  scale = 4/60,
  rscales = rep(1, 160),
  mse = TRUE
)

R (srvyr package)

# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(
  data,
  weight = WTSUPP,
  repweights = matches("REPWTP[0-9]+"),
  type = "JK1",
  scale = 4/60,
  rscales = rep(1, 160),
  mse = TRUE
)

After setting up the svy object, we can now use it to perform weighted calculations. For example, to calculate the mean of a variable named VAR1:

R (survey package)

svymean(~VAR1, svy)

R (srvyr package)

svy %>% 
  summarize(mn = survey_mean(VAR1))

When subsetting, we need to subset the survey design object itself so that the replicate weights are subset along with the data. For example, to subset to persons aged 25-64, we would run this command:

R (survey package)

svy_subset <- subset(svy, AGE >= 25 & AGE < 65)
svymean(~VAR1, svy_subset)

R (srvyr package)

svy %>% 
  filter(AGE >= 25 & AGE < 65) %>%
  summarize(mn = survey_mean(VAR1))

USA - Replicate weights

Adapted from: https://usa.ipums.org/usa/repwt.shtml

IS THERE ANY WAY TO DO THIS AUTOMATICALLY IN MAJOR STATISTICAL PACKAGES?

In R, the survey package (and the srvyr package, which is based on the survey package) set up an object with the survey weighting information for you.

  • The sample should be treated as a single stratum (the weights contain the relevant information from the sample design), so no PSU should be specified.
  • The full-sample weight must be specified.
  • You then specify the replicate weights in the repweights argument. Note that IPUMS-USA data contain a variable called REPWTP, which merely indicates the presence of replicate weights and is coded 1 for every case. Therefore, make sure to use a regular expression like "REPWTP[0-9]+" to make sure you don't include REPWTP.
  • The fpc argument should not be specified.
  • The type argument should be set to "Fay" and rho to 0.5
  • The mse argument should be set to TRUE

R (survey package)

# If not installed already: install.packages("survey")
library(survey)
svy <- svrepdesign(
  data = data,
  weight = ~PERWT,
  repweights = "REPWTP[0-9]+",
  type = "Fay",
  rho = 0.5,
  mse = TRUE
)

R (srvyr package)

# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(
  data,
  weight = PERWT,
  repweights = matches("REPWTP[0-9]+"),
  type = "Fay",
  rho = 0.5,
  mse = TRUE
)

After setting up the svy object, we can now use it to perform weighted calculations. For example, to calculate the mean of a variable named VAR1:

R (survey package)

svymean(~VAR1, svy)

R (srvyr package)

svy %>% 
  summarize(mn = survey_mean(VAR1))

When subsetting, we need to subset the survey design object itself so that the replicate weights are subset along with the data. For example, to subset to persons aged 25-64, we would run this command:

R (survey package)

svy_subset <- subset(svy, AGE >= 25 & AGE < 65)
svymean(~VAR1, svy_subset)

R (srvyr package)

svy %>% 
  filter(AGE >= 25 & AGE < 65) %>%
  summarize(mn = survey_mean(VAR1))

IPUMS NHIS

Adapted from https://nhis.ipums.org/nhis/userNotes_variance.shtml

General Syntax to Account for Sample Design

The following general syntax will allow users to account for sampling weights and design variables when using STATA, SAS, SAS-callable SUDAAN, or R (through the survey or srvyr package) to estimate, for example, means using IPUMS NHIS data.

...

R (survey)

# If not installed already: install.packages("survey")
library(survey)
svy <- svydesign(data = data, ids = ~PSU, strata = ~STRATA, weights = ~PERWEIGHT, nest = TRUE)

svymean(~VAR1, svy)

R (srvyr)

# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(data, ids = PSU, strata = STRATA, weights = PERWEIGHT, nest = TRUE)

svy %>% 
  summarize(mn = survey_mean(VAR1))

Subsetting IPUMS NHIS Data

...

R (survey)

library(survey)
svy <- svydesign(data = data, ids = ~PSU, strata = ~STRATA, weights = ~PERWEIGHT, nest = TRUE)

svy_subset <- subset(svy, AGE >= 65)
svymean(~VAR1, svy_subset)

R (srvyr)

library(srvyr)
svy <- as_survey(data, ids = PSU, strata = STRATA, weights = PERWEIGHT, nest = TRUE)

svy %>% 
  filter(AGE >= 65) %>%
  summarize(mn = survey_mean(VAR1))

Error in read_ipums_ddi

I am using the function read_ipums_ddi to import the ATUS.
It used to work fine in the past.

I get the following error

Error in read_xml.character(ddi_file_load, data_layer = NULL) :    Opening and ending tag mismatch: meta line 12 and head [76]
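
The error comes from the XML parser rather than from ipumsr itself, which suggests the DDI file is malformed or was truncated during download. One way to check (file name hypothetical):

# If this fails with the same tag-mismatch error, the .xml file itself is damaged;
# re-download the DDI from the extract page and try again
xml2::read_xml("atus_00001.xml")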
