autumn's Introduction

autumn: Fast, Modern, and Tidy Raking

“And as to me, I know nothing else but miracles” - Walt Whitman, probably talking about this package.

Iterative proportional fitting (raking) is a straightforward and fast way to generate weights which ensure a dataset reflects known target marginal distributions: put simply, survey professionals use raking to ensure that samples represent the population they are drawn from.

Existing R implementations of raking are frustrating to use, have antiquated syntax, require external dependencies or compilation, have inadequate documentation, generate difficult to understand errors, run slowly, and don’t support “tidy” workflows. autumn is a modern package built from the ground up to fix these problems.

Installation

autumn will be submitted to CRAN in June 2020. In the meantime, you can install it from GitHub using the following command:

# Install GitHub version:
devtools::install_github("aaronrudkin/autumn")

Usage

The workhorse function of autumn is harvest(), which takes at minimum two arguments: 1) a data.frame (or tibble) containing data; 2) target proportions. At its simplest, a call to harvest() works as follows:

# Standard R function call
harvest(respondent_data, ns_target)

# Using `magrittr`'s pipe operator
respondent_data %>% harvest(ns_target)

It just works! This function call iteratively weights observations to match the target proportions and adds a column named weights to the data frame (it is also possible to rename the column or return the weights as a vector). Default parameters are helpful and sane: weights are guaranteed to have a mean of 1 and a maximum of 5.
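
To make the idea of iterative weighting concrete, here is a minimal base R sketch of raking. It is an illustration only, not autumn's implementation: the function `rake_sketch()`, the `toy` data, and the `region` proportions are hypothetical, it assumes every target level appears in the data, and it does not cap weights at a maximum.

```r
# Minimal raking (iterative proportional fitting) sketch in base R.
# Hypothetical helper for illustration; this is not autumn's code.
rake_sketch <- function(data, target, max_iter = 100, tol = 1e-8) {
  w <- rep(1, nrow(data))
  for (iter in seq_len(max_iter)) {
    max_gap <- 0
    for (v in names(target)) {
      lv <- as.character(data[[v]])
      goal <- target[[v]]
      # Current weighted share of each level of variable v
      current <- vapply(split(w, lv), sum, numeric(1)) / sum(w)
      max_gap <- max(max_gap, abs(current[names(goal)] - goal))
      # Rescale each row so the shares of v match the target
      w <- unname(w * goal[lv] / current[lv])
    }
    # Stop once every margin was already within tolerance
    if (max_gap < tol) break
  }
  w / mean(w)  # normalise so the weights have mean 1
}

# Made-up data and targets (the region proportions are invented)
toy <- data.frame(
  gender = c("Male", "Male", "Female", "Female", "Female", "Male"),
  region = c("South", "West", "South", "West", "West", "South")
)
toy_target <- list(
  gender = c(Male = 0.4829, Female = 0.5171),
  region = c(South = 0.6, West = 0.4)
)

w <- rake_sketch(toy, toy_target)
gender_share <- vapply(split(w, toy$gender), sum, numeric(1)) / sum(w)
region_share <- vapply(split(w, toy$region), sum, numeric(1)) / sum(w)
```

After the call, `gender_share` and `region_share` should agree with `toy_target` up to the chosen tolerance.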

Specifying a Target

The main challenge when running harvest() is to correctly specify target proportions. Two formats are supported: 1) a list of named vectors; 2) a data.frame or tibble.

When supplying targets as a list of named vectors, it looks like this:

list(
  gender = c(Male = 0.4829, Female = 0.5171), 
  region = c(Midwest = 0.2086, 
             Northeast = 0.1764, 
             South = 0.3775, 
             West = 0.2374)
)

Each list element should match the name of a single variable in the data, and each vector name should match a value the variable can take. The numeric values should be positive and sum to 1 within each variable.
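
The rules above can be checked before raking. The following sketch is a hypothetical helper, not a function exported by autumn, and the `respondents` data frame is made up. The tolerance of `1e-3` is also an assumption: it is chosen loosely because the rounded region proportions in the example sum to 0.9999 rather than exactly 1.

```r
# Hypothetical helper (not part of autumn): check a list-of-named-vectors
# target against the rules described above and return any problems found.
check_target_list <- function(data, target, tol = 1e-3) {
  problems <- character(0)
  for (v in names(target)) {
    p <- target[[v]]
    if (!v %in% names(data)) {
      problems <- c(problems, sprintf("`%s` is not a column in the data", v))
      next
    }
    unknown <- setdiff(names(p), unique(as.character(data[[v]])))
    if (length(unknown) > 0) {
      problems <- c(problems, sprintf("`%s` has levels not in the data: %s",
                                      v, paste(unknown, collapse = ", ")))
    }
    if (any(p <= 0)) {
      problems <- c(problems, sprintf("`%s` has non-positive proportions", v))
    }
    if (abs(sum(p) - 1) > tol) {
      problems <- c(problems, sprintf("`%s` proportions do not sum to 1", v))
    }
  }
  problems
}

# Made-up respondents covering every level named in the target
respondents <- data.frame(
  gender = c("Male", "Female", "Female", "Male"),
  region = c("Midwest", "Northeast", "South", "West")
)
target <- list(
  gender = c(Male = 0.4829, Female = 0.5171),
  region = c(Midwest = 0.2086, Northeast = 0.1764,
             South = 0.3775, West = 0.2374)
)
ok <- check_target_list(respondents, target)

# A deliberately broken target: bad sum, and a variable not in the data
bad <- check_target_list(
  respondents,
  list(gender = c(Male = 0.6, Female = 0.5), age = c(Young = 1))
)
```

Here `ok` is empty, while `bad` holds one message per problem.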

When supplying targets as a data.frame or tibble, the data.frame should have three columns (by default harvest() looks for columns named “variable”, “level”, and “proportion” – although these names can be overridden):

target_tbl
#> # A tibble: 6 x 3
#>   variable level     proportion
#>   <chr>    <chr>          <dbl>
#> 1 gender   Male           0.483
#> 2 gender   Female         0.517
#> 3 region   Midwest        0.209
#> 4 region   Northeast      0.176
#> 5 region   South          0.378
#> 6 region   West           0.237
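
The two target formats carry the same information, so one can be reshaped into the other. The sketch below uses a hypothetical helper, `target_list_to_df()`, which is not part of autumn, to build the three-column layout from a list of named vectors using base R only.

```r
# Hypothetical helper (not part of autumn): reshape a list of named vectors
# into the three-column layout (variable, level, proportion).
target_list_to_df <- function(target) {
  data.frame(
    variable = rep(names(target), lengths(target)),
    level = unlist(lapply(target, names), use.names = FALSE),
    proportion = unlist(target, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

target_list <- list(
  gender = c(Male = 0.4829, Female = 0.5171),
  region = c(Midwest = 0.2086, Northeast = 0.1764,
             South = 0.3775, West = 0.2374)
)
target_df <- target_list_to_df(target_list)
```

The result has one row per variable level, in the order the levels were listed.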

Advanced Usage

autumn supports a variety of advanced features including:

  • Supplying starting weights
  • Adjusting maximum weights
  • Adjusting convergence and iteration criteria
  • Adjusting variable selection and error calculation criteria
  • Handling missing data appropriately
  • Calculating design effects for produced weights
  • Summarizing raking results
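
As an illustration of what a design effect measures, the sketch below computes Kish's approximate design effect, n * sum(w^2) / sum(w)^2, in base R. This README does not show autumn's own design effect functions, so the function name `kish_deff()` and the example weights are hypothetical.

```r
# Kish's approximate design effect for a vector of weights.
# Hypothetical helper for illustration; not autumn's API.
kish_deff <- function(w) {
  length(w) * sum(w^2) / sum(w)^2
}

w <- c(0.5, 0.8, 1.0, 1.2, 1.5)  # made-up weights with mean 1
deff <- kish_deff(w)             # 5 * 5.58 / 25 = 1.116
n_eff <- length(w) / deff        # effective sample size
```

Equal weights give a design effect of exactly 1. The more the weights vary, the larger the design effect and the smaller the effective sample size.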

Interested in doing something fancy? Check out our R vignettes for more details: TODO VIGNETTES GO HERE

Speed 🚀

How fast is autumn? Fast.

Below, we present results of three different benchmark scenarios, each using real data (the first two benchmarks use the respondent_data and ns_target datasets included with autumn). All of these benchmarks use identical data and default parameterizations, and were run on a low-power 2016-vintage personal computer. The larger the dataset and the more complicated the rake, the more you benefit from using autumn. Customizing convergence criteria to allow for earlier termination can result in further speed improvements over existing software.

Small scale

This benchmark generates weights for a dataset of 6,691 observations, raking on 10 variables. Compared with the implementation in anesrake, autumn is about 67% faster and allocates one third less memory. Compared with the implementation in survey, autumn is about 4X as fast and allocates 20% more memory.

#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <chr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 autumn        1.35s    1.73s     0.572  738.35MB    3.01 
#> 2 anesrake      2.46s    2.89s     0.315    1.11GB    2.42 
#> 3 survey        4.57s    6.76s     0.148  614.84MB    0.866
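
The comparisons quoted for this benchmark can be reproduced from the median and mem_alloc columns of the table. The short calculation below assumes that 1 GB is reported as 1024 MB.

```r
# Median run time (seconds) and memory (MB) copied from the table above
median_sec <- c(autumn = 1.73, anesrake = 2.89, survey = 6.76)
mem_mb <- c(autumn = 738.35, anesrake = 1.11 * 1024, survey = 614.84)

# How many times longer the other packages take than autumn
speedup <- median_sec[c("anesrake", "survey")] / median_sec[["autumn"]]
round(speedup, 2)    # anesrake 1.67, survey 3.91

# autumn's memory use relative to each other package
mem_ratio <- mem_mb[["autumn"]] / mem_mb[c("anesrake", "survey")]
round(mem_ratio, 2)  # anesrake 0.65, survey 1.20
```

A ratio of 1.67 corresponds to "about 67% faster", and a memory ratio of 0.65 to roughly one third less memory.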

Medium scale

Consider a raking task that is more difficult to converge: the same dataset (6,691 observations) raked on 17 variables. The extra variables involve interactions which greatly complicate convergence. autumn is three times as fast as anesrake and uses almost two thirds less memory (survey will not complete the rake):

#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <chr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 autumn         2.3s    3.02s    0.327     1.22GB     6.58
#> 2 anesrake      8.41s    9.99s    0.0945    3.11GB     4.77

Large scale

Finally, consider an extremely resource intensive problem: raking a much larger dataset of 108,660 observations on 17 variables. In this scenario, autumn is 11 times faster and uses 92% less memory. (This benchmark is limited to 10 iterations):

#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <chr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 autumn        47.1s    48.8s   0.0200     20.8GB     2.44
#> 2 anesrake      8.15m     8.8m   0.00189   238.6GB     2.52

Why is the package called “autumn”?

Authorship and Funding

autumn is written and maintained by Aaron Rudkin. Target proportions in the included ns_target data were developed by Alex Rossell-Hayes.

If you have any comments, issues, or concerns, please open a GitHub issue. Contributions are welcome. Please see our Contributor Code of Conduct for details.

autumn was developed in conjunction with Democracy Fund + UCLA Nationscape, one of the largest public opinion surveys ever conducted. UCLA’s Nationscape team are: Tyler Reny, Alex Rossell-Hayes, Aaron Rudkin, Chris Tausanovitch, and Lynn Vavreck. Funding for this project was provided by Democracy Fund, part of the Omidyar Group.

Package hex logo adapted from art by Freepik from flaticon.com

autumn's Issues

v0.10 checklist

Things to do before v0.10:

  • Code: Complete Spencer design effects
  • Vignette: Specifying targets
  • Vignette: Convergence criteria
  • Vignette: Error functions
  • Vignette: Variable selection functions
  • Vignette: Speed, memory, and convergence benchmarks
  • Vignette: Design effects
  • Tests: Design effects
  • Tests: Code coverage for various convergence methods
  • Examples and documentation: Design effects
  • Check included data for desirable performance
  • Fix vignette note in README
  • Fix CODE OF CONDUCT
  • News
  • cran-comments.md
  • Test binary package
  • GitHub release

Error for missing levels in the data

I have some variables in a data set that I want to weight. However, not all levels are present in the data set.

library(dplyr)
library(autumn)

harvest(d, weights)

So I get this error:

Error in check_any_data_issues(data, target, weights) : Errors detected in data. Some variables have values in the weight targets which are not present in the data:

Here is a dput of the quotas:

dput(weights)
list(Rec_Age = c(`1` = 0, `2` = 0.181, `3` = 0.2877, `4` = 0.3311, 
`5` = 0.2001), Rec_Income = c(`1` = 0.1105, `2` = 0.2852, `3` = 0.2343, 
`4` = 0.3699), Q6 = c(`1` = 0.067, `2` = 0.3409, `3` = 0.592), 
    RECQ5_1 = c(`1` = 0.4099, `2` = 0.5239, `3` = 0.0662), RECQ5_2 = c(`1` = 0.1621, 
    `2` = 0.3803, `3` = 0.4576), RECQ5_3 = c(`1` = 0.0508, `2` = 0.294, 
    `3` = 0.6551), RECQ5_4 = c(`1` = 0.103, `2` = 0.4864, `3` = 0.4106
    ))

and the data:


dput(d)
structure(list(RESPID = structure(c(459, 311, 223, 60, 613, 495, 
300, 273, 78, 170, 217, 61, 175, 619, 270, 218, 453, 492, 23, 
65, 33, 113, 532, 26, 119, 49, 208, 102, 200, 165, 435, 298, 
593, 220, 111, 53, 494, 271, 305, 420, 323, 607, 105, 19, 426, 
171, 330, 201, 332, 277), label = "RESPID - Respondent ID", format.spss = "F10.0", display_width = 0L), 
    Rec_Age = structure(c(4, 2, 4, 3, 4, 4, 4, 3, 2, 2, 3, 2, 
    3, 4, 4, 2, 4, 4, 2, 3, 2, 2, 2, 3, 3, 2, 2, 2, 2, 3, 2, 
    3, 2, 3, 4, 3, 4, 3, 2, 3, 3, 3, 4, 4, 4, 2, 2, 3, 4, 3), label = "Rec_Age - Recode Age", format.spss = "F1.0", display_width = 0L), 
    Rec_Income = structure(c(3, 1, 2, 1, 1, 2, 2, 3, 2, 1, 2, 
    2, 2, 1, 1, 2, 2, 3, 3, 2, 2, 2, 2, 3, 2, 3, 2, 2, 1, 2, 
    2, 2, 1, 3, 1, 1, 1, 1, 1, 3, 3, 2, 3, 3, 2, 2, 2, 2, 2, 
    2), label = "Rec_Income - Recode Income", format.spss = "F1.0", display_width = 0L), 
    Q6 = structure(c(2, 1, 2, 3, 2, 3, 2, 1, 3, 2, 2, 3, 3, 3, 
    2, 2, 3, 3, 2, 1, 2, 3, 3, 2, 2, 2, 1, 2, 1, 2, 2, 3, 3, 
    2, 3, 2, 3, 2, 2, 1, 3, 2, 2, 2, 3, 2, 2, 1, 3, 2), label = "Q6 - Wie stark interessieren Sie sich für Bekleidung und Mode?", format.spss = "F1.0", display_width = 0L), 
    RECQ5_1 = c(1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 
    1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 
    3, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1), RECQ5_2 = c(2, 
    2, 3, 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 1, 3, 
    2, 3, 2, 2, 2, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 1, 
    1, 3, 3, 2, 1, 3, 1, 2, 1, 3, 2), RECQ5_3 = c(3, 1, 3, 3, 
    3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 1, 2, 2, 3, 3, 
    2, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 3, 2, 3, 3, 2, 1, 3, 3, 
    3, 2, 3, 1, 3, 3, 3, 2), RECQ5_4 = c(1, 2, 2, 2, 2, 2, 1, 
    1, 3, 2, 2, 3, 3, 3, 1, 1, 2, 3, 1, 1, 1, 3, 2, 1, 2, 1, 
    1, 1, 1, 2, 1, 3, 3, 3, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 
    1, 2, 3, 2, 2)), row.names = c(NA, -50L), class = "data.frame")
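
In the report above, the targets name levels that never occur in the data (for example, Rec_Age levels 1 and 5, and Rec_Income level 4). One possible workaround, sketched below, is to drop those levels and rescale what remains. `prune_target()` is a hypothetical helper, not part of autumn, and the data frame is a shortened stand-in for `d`. Note that dropping levels changes the population the weights represent, so this is a judgment call rather than a fix.

```r
# Hypothetical helper (not part of autumn): report target levels absent from
# the data, drop them along with zero-proportion levels, and rescale.
prune_target <- function(data, target) {
  lapply(stats::setNames(names(target), names(target)), function(v) {
    p <- target[[v]]
    present <- names(p) %in% as.character(unique(data[[v]]))
    if (any(!present)) {
      message(sprintf("`%s`: target levels not in data: %s",
                      v, paste(names(p)[!present], collapse = ", ")))
    }
    kept <- p[present & p > 0]
    kept / sum(kept)  # rescale so the remaining proportions sum to 1
  })
}

# Shortened stand-in for the reported data, with two of its variables
d_small <- data.frame(
  Rec_Age = c(4, 2, 4, 3, 4),
  Rec_Income = c(3, 1, 2, 1, 1)
)
weights_small <- list(
  Rec_Age = c(`1` = 0, `2` = 0.181, `3` = 0.2877, `4` = 0.3311, `5` = 0.2001),
  Rec_Income = c(`1` = 0.1105, `2` = 0.2852, `3` = 0.2343, `4` = 0.3699)
)
pruned <- prune_target(d_small, weights_small)
```

The pruned targets keep only levels 2 to 4 for Rec_Age and 1 to 3 for Rec_Income, each rescaled to sum to 1.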

Flag to collapse small raking targets

It has been brought up that a useful flag for harvest would be the ability to automatically collapse target levels (either to remove smaller levels and improve convergence, or to solve the problem of missing interaction levels). I'm not fully sure how I'd implement this, but I think it's a good idea.

Floating point comparison issues with "targets that do not sum to 1" error

Great package - thanks!

I've been playing around with it, and I've encountered an issue (maybe a bug?).

It looks like the function check_any_data_issues is looking to see if all of the weighting targets add up to 1. I have a situation where they do all add up to 1, but I'm getting the error message "Target variable ... has targets that do not sum to 1." I suspect this is a floating point comparison issue.

Here's a very small example that should be reproducible (let me know if it doesn't work for you):

atlanta <- structure(list(w_race = c("White", "White", "White", "White", 
"Black", "White", "Black", "White", "Other race", "Black", "White", 
"White", "White", "White", "White", "White", "White", "White", 
"White", "White")), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

(atlanta)
#>        w_race
#> 1       White
#> 2       White
#> 3       White
#> 4       White
#> 5       Black
#> 6       White
#> 7       Black
#> 8       White
#> 9  Other race
#> 10      Black
#> 11      White
#> 12      White
#> 13      White
#> 14      White
#> 15      White
#> 16      White
#> 17      White
#> 18      White
#> 19      White
#> 20      White

targets <- structure(list(variable = c("w_race", "w_race", "w_race"), level = c("Black", 
"Other race", "White"), proportion = c(0.299944881294484, 0.0993062927185731, 
0.600748825986942)), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))

(targets)
#>   variable      level proportion
#> 1   w_race      Black 0.29994488
#> 2   w_race Other race 0.09930629
#> 3   w_race      White 0.60074883

sum(targets$proportion)
#> [1] 1

autumn::harvest(atlanta, target = targets)
#> Error in check_any_data_issues(data, target, weights): Errors detected in weight targets:
#> Target variable `w_race` has targets that do not sum to 1.

Created on 2020-07-14 by the reprex package (v0.3.0)

If I'm missing something, please let me know. Thanks again for all of your work on the package.
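
The suspicion can be checked in isolation with the proportions from the example. The sketch below assumes, without confirmation from the package source, that the failing check compares the sum to 1 exactly. `sums_to_one()` and its tolerance of `1e-8` are hypothetical.

```r
# Proportions copied from the reprex above
p <- c(0.299944881294484, 0.0993062927185731, 0.600748825986942)

# Exact equality is fragile: rounding can leave the sum a hair away from 1
exact <- sum(p) == 1
gap <- sum(p) - 1  # tiny, far below any meaningful precision

# A tolerance-based comparison treats such rounding error as equal
sums_to_one <- function(p, tol = 1e-8) abs(sum(p) - 1) < tol
tolerant <- sums_to_one(p)
base_r <- isTRUE(all.equal(sum(p), 1))
```

Both `tolerant` and `base_r` are `TRUE` here, even though `print()` rounds the sum to `1` and hides the gap.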
