kuriwaki / cces_cumulative

Building a cumulative file (2006 - 2023) for the Cooperative Congressional Election Study

Home Page: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/II2DB6

CCES Cumulative File

Shiro Kuriwaki

This repository contains the R code to build the Cooperative Congressional Election Study (CCES) cumulative file (2006 - 2022).

Please feel free to file any questions or requests about the cumulative file as GitHub issues.

Getting Started

Start by downloading either the .dta, .Rds, or .feather file on the dataverse page to your computer. This repository does not track the data due to size constraints, but feel free to contact me if you need the newest version not on Dataverse. The .Rds format can be read into R.

dat <- readRDS("cumulative_2006-2022.Rds")

Make sure to load the tidyverse package first. The Rds file can be handled as a base-R data.frame, but it was built entirely in the tidyverse, so treating it as a tibble gives access to all of its features.

library(tidyverse)
dat

## # A tibble: 617,455 × 103
##     year case_id weight weight_cumulative state            st      cong  cong_up
##  * <int> <chr>    <dbl>             <dbl> <int+lbl>        <int+l> <fct> <fct>  
##  1  2006 439219   1.85              1.67  37 [North Carol… 37 [NC] 109   110    
##  2  2006 439224   0.968             0.872 39 [Ohio]        39 [OH] 109   110    
##  3  2006 439228   1.59              1.44  34 [New Jersey]  34 [NJ] 109   110    
##  4  2006 439237   1.40              1.26  17 [Illinois]    17 [IL] 109   110    
##  5  2006 439238   0.903             0.813 36 [New York]    36 [NY] 109   110    
##  6  2006 439242   0.839             0.756 48 [Texas]       48 [TX] 109   110    
##  7  2006 439251   0.777             0.700 27 [Minnesota]   27 [MN] 109   110    
##  8  2006 439254   0.839             0.756 32 [Nevada]      32 [NV] 109   110    
##  9  2006 439255   0.331             0.299 48 [Texas]       48 [TX] 109   110    
## 10  2006 439263   1.10              0.993 24 [Maryland]    24 [MD] 109   110    
## # ℹ 617,445 more rows
## # ℹ 95 more variables: state_post <int+lbl>, st_post <int+lbl>, dist <int>,
## #   dist_up <int>, cd <chr>, cd_up <chr>, dist_post <int>, dist_up_post <int>,
## #   cd_post <chr>, cd_up_post <chr>, zipcode <chr>, county_fips <chr>,
## #   tookpost <int+lbl>, weight_post <dbl>, rvweight <dbl>, rvweight_post <dbl>,
## #   starttime <dttm>, pid3 <int+lbl>, pid3_leaner <int+lbl>, pid7 <int+lbl>,
## #   ideo5 <fct>, gender <int+lbl>, sex <int+lbl>, gender4 <int+lbl>, …

The Stata .dta file can be read by Stata, or in R through haven::read_dta(). You will need the haven package installed.

The arrow (.feather) files can be loaded with arrow::read_feather(). They are currently built to give the same output as reading the .dta file.

Each row is a respondent, and each variable is information associated with that respondent. Note that this cumulative dataset extracts only a subset of key variables from each year’s CCES, which has hundreds of columns.

What’s New

Unified Variable Names

Most variables in this dataset come straight from each year’s CCES. However, the cumulative file renames and standardizes them, making them accessible in one place. Please see the guide or the Crunch dataset for a full list and description of these variables.

Candidate Names and Identifiers

The cumulative file adds the name and identifiers of the candidate that a respondent chose. In the original year-specific datasets, the response value for a vote choice question is usually a generic label, e.g. Candidate1 and Candidate2 (with separate look-up variables appended). The cumulative dataset shows both the generic label and the chosen candidate’s name, party, and identifier, which will vary across individuals.

select(dat, year, case_id, matches("voted_sen"))

## # A tibble: 617,455 × 5
##     year case_id voted_sen                   voted_sen_party voted_sen_chosen   
##    <int> <chr>   <fct>                       <fct>           <chr>              
##  1  2006 439219  <NA>                        <NA>            <NA>               
##  2  2006 439224  [Democrat / Candidate 1]    Democratic      Sherrod C. Brown (…
##  3  2006 439228  [Democrat / Candidate 1]    Democratic      Robert Menendez (D)
##  4  2006 439237  <NA>                        <NA>            <NA>               
##  5  2006 439238  [Democrat / Candidate 1]    Democratic      Hillary Rodham Cli…
##  6  2006 439242  I Did Not Vote In This Race <NA>            <NA>               
##  7  2006 439251  [Republican / Candidate 2]  Republican      Mark Kennedy (R)   
##  8  2006 439254  [Democrat / Candidate 1]    Democratic      Jack Carter (D)    
##  9  2006 439255  [Democrat / Candidate 1]    Democratic      Barbara Ann Radnof…
## 10  2006 439263  I Did Not Vote In This Race <NA>            <NA>               
## # ℹ 617,445 more rows
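As a rough sketch of how the added party column can be used, the snippet below tabulates Senate vote choice by party. The toy rows are copied from the printout above; in the real file voted_sen_party is a factor rather than a character vector, and the object names toy and party_counts are only illustrative.

```r
library(dplyr)

# Toy rows copied from the printed output above (2006 respondents)
toy <- tibble(
  year = 2006L,
  case_id = c("439219", "439228", "439251", "439254"),
  voted_sen_party = c(NA, "Democratic", "Republican", "Democratic"),
  voted_sen_chosen = c(NA, "Robert Menendez (D)", "Mark Kennedy (R)", "Jack Carter (D)")
)

# Count respondents by year and party of the chosen Senate candidate,
# dropping respondents with no recorded choice
party_counts <- toy %>%
  filter(!is.na(voted_sen_party)) %>%
  count(year, voted_sen_party)

party_counts
```

The same pipeline should work on dat itself, since the column names match those shown above.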

Crunch

A version of the dataset is also included in Crunch, a database platform that makes it easy to view and analyze survey data with or without any programming experience. For free view access to the dataset, please sign up here: https://harvard.az1.qualtrics.com/jfe/form/SV_066hQi4Eeco3Kap. Some features include:

A web GUI for quickly browsing variables

Quickly check cross-tabs and bar graphs, with customizable formatting

Sharable widgets.

For questions and more access, please contact the CCES Team.

Crunch datasets can also be manipulated from an R package, crunch: https://github.com/Crunch-io/rcrunch.

install.packages("crunch")

For a bit more on using the R crunch package for your own purposes, see the crunch package vignettes, pkgdown website, or a short vignette in this repo.

Organization of Scripts

R scripts 01 - 07 reproduce the cumulative dataset starting from each year’s CCES on dataverse.

01_define-names-labels.R constructs two variable name tables – one that names and describes each variable to be in the final dataset, and another that indicates which variables correspond to the candidate columns in each year’s CCES.
02_download-cces-dataverse.R indicates a (partial) way to download the component CCES data from dataverse so that the rest of the code can be run.
03_read-common.R pulls out the common contents with minimal formatting (e.g. state, case identifier variable names).
04_prepare-fixes.R makes some fixes to variables in each year’s datasets.
05_stack-cumulative.R pulls out the variables of interest from annual CCES files and stacks them into a long dataset where each row is a respondent from CCES.
06_extract_politicians.R pulls out the “contextual variables” at the respondent level, i.e. information on candidates and representatives. It uses some long format voting tables from 05.
07_merge-contextual_upload.R combines all the variables together, essentially combining the output of 04 on 05. It saves a .Rds and a .sav version.
08_format-crunch.R logs into Crunch, and adds variable names, descriptions, groupings, and other Crunch attributes to the Crunch dataset. It also adds variables and exports a .dta version.

More scripts are in 00_prepare; they format other datasets such as NOMINATE, CQ, and DIME.

cces_cumulative's Issues

cumulative data

get vote and approval vars

spreadsheet with summary vars of each geography

A place you might get started is: http://nrs.harvard.edu/urn-3:hul.eresource:socialex

`st` should be text, not factor, in Stata

otherwise people need to look up the labels by label list st, which is always a nuisance

Add 2020 vote validation when it comes out

Probably summer 2021

intent_pres_party

simple party R/D would be useful, related to #35

Merge cq and voteview for ICPSR numbers

account for straight party option in 2018 post

The party lever question we added this year (CC18_409) interferes with the vote choice question, because respondents who answered D or R to this question were not asked the main vote choice questions. Instead, their vote choice should be inferred from the party they chose.

library(haven)  # read_dta()
library(dplyr)  # filter(), transmute(), semi_join(), count()

cc18 <- read_dta("data/source/cces/2018_cc.dta")
df <- read_dta("data/release/cumulative_2006_2018.dta")

# those who used party lever
strt <- filter(cc18, CC18_409 %in% 1:2) %>% 
  transmute(case_id = as.character(caseid),
            year = 2018)

# voted_sen_party should be filled in for these respondents
df %>% 
  select(year, case_id, voted_sen_party) %>% 
  semi_join(strt, by = c("year", "case_id")) %>% 
  count(voted_sen_party)

# A tibble: 3 x 2
#  voted_sen_party     n
#        <dbl+lbl> <int>
#1  1 [Democratic]   104
#2  2 [Republican]   130
#3 NA               3768

add countyFIPS to 2017 when made available

Add union membership in the next version

`case_id` should be int, not char, in Stata

cd code have leading 0s

what about at-large?

generate formatted and labelled dta

add zip and county

to minimal crunch cumulative database

Add "hispanic" variable to dataset

Bernard Fraga points out that "Are you of Spanish, Latino, or Hispanic origin or descent?" is asked IF RACE != "Hispanic", so this column provides important information that should be provided alongside race.

Add religion in next version

helpers for guide

explanation of haven_labelled
showing haven labels in new dplyr
how to change all labels to factors
merge with different questions
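For the "how to change all labels to factors" item, one possible sketch uses haven::as_factor(), which, when given a whole data frame, converts its labelled columns and leaves the others alone. The toy data and its label values below are made up for illustration and are not the actual coding in the cumulative file.

```r
library(haven)

# Toy data with one labelled column; the labels here are illustrative only
toy <- tibble::tibble(
  year = c(2006L, 2008L, 2010L),
  pid3 = labelled(
    c(1L, 2L, 3L),
    labels = c(Democrat = 1L, Republican = 2L, Independent = 3L)
  )
)

# Convert every haven_labelled column to a factor; other columns are untouched
toy_fct <- as_factor(toy)

levels(toy_fct$pid3)
```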

Add hispanic origin and any-part hispanic

Cuban, Puerto Rican

also add a note to coalesce hispanic = 1 with race != "Hispanic" --> "any-part Hispanic" (race_h)
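A minimal sketch of the coalescing rule described in the note above, on toy data. It assumes hispanic is coded 1 = yes and 2 = no and is missing when the question was not asked, and that race is stored as text; the actual coding in the cumulative file may differ.

```r
library(dplyr)

# Toy data; the coding of `hispanic` (1 = yes, 2 = no, NA = not asked) is an assumption
toy <- tibble(
  race = c("White", "Hispanic", "Black", "White"),
  hispanic = c(1L, NA, 2L, 2L)
)

# race_h: "any-part Hispanic" -- Hispanic by race, or by the follow-up question
toy <- toy %>%
  mutate(
    race_h = if_else(race != "Hispanic" & hispanic %in% 1L, "Hispanic", race)
  )

toy
```

Using %in% rather than == keeps the missing hispanic values from propagating NA into race_h.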

more demographic vars (employment, children, home ownership)

I'm thinking employment status (employ), health insurance (healthins), children under 18 (child18), military status (milstat), and homeownership (ownhome).

Fix and document correction in 2016 validated vote data

see e.g. https://twitter.com/b_schaffner/status/962481705545322496

duplicate labels no longer allowed in default print method

Due to tidyverse/haven#364, this prohibits the first cut display of the dataset in R (also see tidyverse/haven#424). This affects c("pid7", "race", "marstat", "newsint", "approval_pres", "approval_gov").

> attr(df$approval_pres, "labels")
              Strongly Approve 
                             1 
                       Approve 
                             2 
              Somewhat Approve 
                             2 
                    Disapprove 
                             3 
           Somewhat Disapprove 
                             3 
           Strongly Disapprove 
                             4 
                Never Heard Of 
                             5 
                      Not Sure 
                             5 
Neither Approve Nor Disapprove 
                             6 
                            NA 
                   -2147483648 
> select(df, approval_pres)
Error: `labels` must be unique
Backtrace:
     █
  1. ├─[ (function (x, ...) ... ]
  2. └─tibble:::print.tbl(x)
  3.   ├─[ tibble:::cat_line(...) ] with 2 more calls
  6.   ├─[ base::format(...) ]
  7.   └─tibble:::format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
  8.     └─tibble::trunc_mat(x, n = n, width = width, n_extra = n_extra)
  9.       ├─[ base::as.data.frame(...) ]
 10.       ├─[ utils::head(...) ]
 11.       └─utils:::head.data.frame(x, n)
 12.         ├─[ ...[] ]
 13.         └─tibble:::`[.tbl_df`(x, seq_len(n), , drop = FALSE)
 14.           └─tibble:::map(result, subset_rows, i)
 15.             └─base::lapply(.x, .f, ...)
 16.               └─tibble:::FUN(X[[i]], ...)
 17.                 ├─[ ...[] ]
 18.                 └─haven:::`[.haven_labelled`(x, i)
 19.                   └─haven::labelled(...)
>
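To find which values carry more than one label before printing, a small base-R check along these lines may help. The label vector is copied from the printout above (the NA entry is omitted); the object names are only illustrative.

```r
# Labels copied from attr(df$approval_pres, "labels") above, NA entry omitted
labs <- c(
  "Strongly Approve" = 1, "Approve" = 2, "Somewhat Approve" = 2,
  "Disapprove" = 3, "Somewhat Disapprove" = 3, "Strongly Disapprove" = 4,
  "Never Heard Of" = 5, "Not Sure" = 5, "Neither Approve Nor Disapprove" = 6
)

# Values that are attached to more than one label
dup_values <- unique(labs[duplicated(labs)])

# All labels that share one of those values, grouped by value
dup_labels <- labs[labs %in% dup_values]
split(names(dup_labels), dup_labels)
```

Running the same check over each of the affected columns would show which labels need to be merged or renumbered.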

Add citizen (binary) as a common variable

Add follow the news (newsint)

Seems to be asked in all years except 2006.

# 2007 newsint
# 2008 V244
# 2009 v244
# 2010 V244
# 2011 V244
# 2012- newsint
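The year-to-variable list above could be encoded as a small lookup helper when stacking years. This is only a sketch: newsint_source() is a hypothetical name, not a function in this repository.

```r
# Hypothetical helper: source variable for news interest in a given year,
# following the list above (not asked in 2006)
newsint_source <- function(year) {
  if (year == 2006) return(NA_character_)
  if (year == 2007 || year >= 2012) return("newsint")
  if (year == 2009) return("v244")
  "V244"  # 2008, 2010, 2011
}

# Look up several years at once
vapply(c(2006, 2009, 2010, 2014), newsint_source, character(1))
```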

2008 vote (old cumulative) is coalescing non voters or asked non-vote as a valid option

which makes the comparison inconsistent.

. keep if mod(year, 4) == 0 & tookpost == 1
(355,449 observations deleted)

. bysort year: tab voted_pres_party educ

-----------------------------------------------------------------------------------------------------------------------------------------------------------
-> year = 2008

 President vote |
        in last |                             Education
       election |     No HS  High Scho  Some Coll     2-Year     4-Year  Post-Grad |     Total
----------------+------------------------------------------------------------------+----------
     Democratic |       308      3,206      2,823        802      2,791      1,589 |    11,519 
     Republican |       287      4,203      2,907        939      2,639      1,091 |    12,066 
Other Candidate |        12        114        111         23        101         36 |       397 
   Did not Vote |       251      1,691        581        178        252         86 |     3,039 
----------------+------------------------------------------------------------------+----------
          Total |       858      9,214      6,422      1,942      5,783      2,802 |    27,021 

-----------------------------------------------------------------------------------------------------------------------------------------------------------
-> year = 2012

 President vote |
        in last |                             Education
       election |     No HS  High Scho  Some Coll     2-Year     4-Year  Post-Grad |     Total
----------------+------------------------------------------------------------------+----------
     Democratic |       330      4,000      4,663      1,773      4,784      3,498 |    19,048 
     Republican |       302      4,616      5,036      1,959      4,419      2,409 |    18,741 
Other Candidate |        25        246        392        155        354        190 |     1,362 
   Did not Vote |         1         24         20          6         18          8 |        77 
----------------+------------------------------------------------------------------+----------
          Total |       658      8,886     10,111      3,893      9,575      6,105 |    39,228 

-----------------------------------------------------------------------------------------------------------------------------------------------------------
-> year = 2016

 President vote |
        in last |                             Education
       election |     No HS  High Scho  Some Coll     2-Year     4-Year  Post-Grad |     Total
----------------+------------------------------------------------------------------+----------
     Democratic |       270      3,740      5,197      2,443      6,213      4,273 |    22,136 
     Republican |       312      5,258      4,800      2,319      3,992      2,074 |    18,755 
Other Candidate |        47        672        978        494      1,171        679 |     4,041 
   Did not Vote |         1         14         17          6         26         17 |        81 
----------------+------------------------------------------------------------------+----------
          Total |       630      9,684     10,992      5,262     11,402      7,043 |    45,013 

-----------------------------------------------------------------------------------------------------------------------------------------------------------
-> year = 2020

 President vote |
        in last |                             Education
       election |     No HS  High Scho  Some Coll     2-Year     4-Year  Post-Grad |     Total
----------------+------------------------------------------------------------------+----------
     Democratic |       361      4,613      5,499      2,783      7,630      5,301 |    26,187 
     Republican |       346      5,328      3,862      2,220      3,945      2,002 |    17,703 
Other Candidate |        23        187        311        158        469        310 |     1,458 
   Did not Vote |         2         21         23         12         34         22 |       114 
----------------+------------------------------------------------------------------+----------
          Total |       732     10,149      9,695      5,173     12,078      7,635 |    45,462 


. restore

See the growth in the 4th category without the others changing

> ccp %>% xtabs(~ year + vote_pres_08, .)
      vote_pres_08
year       1     2     3     4     5
  2006     0     0     0 36421     0
  2007     0     0     0  9999     0
  2008 11519 12066   397  8818     0
  2009  5570  5614   246  2313    57
  2010 23410 23342  1417  6939   292
  2011 10062  8062   937    92   997
  2012 23425 19314  1207 10148   441
> cc08 %>% xtabs(~ CC410)
Error in as.data.frame.default(data, optional = TRUE) : 
  cannot coerce class ‘"formula"’ to a data.frame
> cc08 %>% xtabs(~ CC410, .)
CC410
    1     2     3     4     5     6     7     8     9 
12066 11519   125    20    78    49   125    23     4 
> cc08 %>% xtabs(~ as_factor(CC410), .)

A table of keys that link FECID to ICPSR ID

This should be fairly straightforward to do since we have the FEC recipients file.

Strip away extraneous columns. Something like dplyr::distinct(FECID, ICPSRID)
Keep only candidates
Classify challengers and group them with their incumbent opponent's races (or names that are in the CCES metadata)
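The first two steps could look roughly like the sketch below. The recipients table is a made-up stand-in for the FEC recipients file: all values are invented, and recipient_type and amount are assumed column names. Only FECID and ICPSRID come from the issue text. The third step (classifying challengers) is not attempted here.

```r
library(dplyr)

# Hypothetical stand-in for the FEC recipients file (all values invented)
recipients <- tibble(
  FECID = c("H0XX00001", "H0XX00001", "S0XX00002", "C00000003"),
  ICPSRID = c(10001L, 10001L, 10002L, NA),
  recipient_type = c("cand", "cand", "cand", "comm"),  # assumed column
  amount = c(500, 250, 1000, 75)                       # assumed column
)

# Keep only candidates, then strip extraneous columns down to unique key pairs
keys <- recipients %>%
  filter(recipient_type == "cand") %>%
  distinct(FECID, ICPSRID)

keys
```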

add voting and approval vars to dta

extract candidate vs. candidate metadata

from dataverse data, format to merge with other Voteview and FEC data

`intent_pres_party` nulls

intent_pres_party is null for Libertarian and Green party preference

intent_pres could be coalesced to one column

voted_pres needs multiple years because each pres cycle we ask both c and c - 1, but for intent we could collapse.

calibrate crunch and dataverse counts

Errors in voted_rep_chosen

From Bernard Fraga:

something odd about voted_rep_chosen values for at least some districts in 2012, and it made me think that there must be a coding error with either the cd variable or the voted_rep_chosen variable. Specifically, for many districts in California, voted_rep_chosen takes values of candidates from other districts, usually in small amounts, but from bordering districts. For example, some 2012 CA-8 respondents chose Jackie Speier (D) and other candidates from the 14th district, not the 8th district (Pelosi).

It doesn’t look like this is an issue in other years, so I assume it has to do with inputzip-cd matching or something odd with the voted_rep_chosen table due to redistricting. However, it would be good to know who these individuals actually saw on the survey, especially if they were given candidates who weren’t running in their actual district.

standardize and add income measure

incorporate post weights and distinguish them from pre weights

Continue downloading MC profile datasets

get MC list from Rvoteview

Best if we can get it from package Rvoteview.

Senators and House members from Congresses 109- 115

visualization for CD change 2006-2016

visualization for change values calculated of congressional districts for Congresses 109,110,113,115

Add some knowledge questions

These tend to be asked multiple years. Pick a few of the main ones (perhaps CurrentHouse party)

	[1] Republicans	[2] Democrats	[3] Neither	[4] Not sure
[CC20_310a] U.S. House of Representatives	◯	◯	◯	◯
[CC20_310b] U.S. Senate	◯	◯	◯	◯
[CC20_310c] $inputstate State Senate (shown if inputstate != 11)	◯	◯	◯	◯
[CC20_310d] $LowerChamberName (shown if LowerChamberName)	◯	◯	◯	◯

	[1] Never Heard of Person	[2] Republican	[3] Democrat	[4] Other Party / Independent	[5] Not sure
[CC20_311a] $CurrentGovName (shown if CurrentGovName)	◯	◯	◯	◯	◯
[CC20_311b] $CurrentSen1Name (shown if CurrentSen1Name)	◯	◯	◯	◯	◯
[CC20_311c] $CurrentSen2Name (shown if CurrentSen2Name)	◯	◯	◯	◯	◯
[CC20_311d] $CurrentHouseName (shown if CurrentHouseName)	◯	◯	◯	◯	◯

(thanks to Alexander Agadjanian)

                 newsint     n
               <dbl+lbl> <int>
1  1 [Hardly At All]      1967
2  2 [Only Now And Then]  4141
3  3 [Don't Know]          711
4  4 [Some Of The Time]  11182
5  5 [Most Of The Time]  32678
6 NA                      1834

Year	Pre	Post
2006	vote06turn	v4004
2008	CC326	cc403
2010	CC354	CC401
2012	CC354	CC401
2014	CC354	CC401
2016	CC16_364	CC16_401
2018	CC18_350	CC18_401
2020	CC20_363	CC20_401

unify categorizations of validated vote

  vv_turnout_gvm_char                                                          n
   <chr>                                                                    <int>
 1 NA                                                                       74600
 2 Validated Record Of Voting In General Election                           71664
 3 ""                                                                       70649
 4 Polling                                                                  31387
 5 No Evidence On Whether The R Voted Or Not                                19972
 6 Unknown                                                                  16404
 7 Absentee                                                                 13895
 8 Unknownmethod                                                            10821
 9 Mail                                                                      9839
10 Did Not Vote-Verified Record Of Being Unregistered                        9205
11 Matchednovote                                                             8819
12 Earlyvote                                                                 8315
13 Did Not Vote-R Said They Didn't Vote                                      7881
14 Did Not Vote - Verified Record Of Not Voting                              5448
15 Early                                                                     5206
16 Did Not Vote-R Has Verified Registration File, But Not Vote History File  4930
17 Did Not Vote-R Said They're Note Registered To Vote                       2923
18 Virginia Doesn't Maintain Vote History Files                              1733
19 Did Not Vote-Non-Citizen                                                   865

pid3 2010 missing vs. not sure

5 (usually not sure) is not in 2010

> dcast(ccc, year ~ pid3) 
Using 'vv_turnout_pvm' as value column. Use 'value.var' to override
Aggregation function missing: defaulting to length
   year     1     2     3    4    5   NA
1  2006 11776 11237 11338 2012    1   57
2  2007  3051  2868  3131  677  219   53
3  2008 11510 10764  7836 1304 1386    0
4  2009  5969  4422  4613  981  815    0
5  2010 16876 14503 13630 1675    0 8716
6  2011  7295  5303  6111  269 1172    0
7  2012 27124 20588 20570 2914 2839    0
8  2013  6040  3712  5220  438  989    1
9  2014 20428 13228 15952 2397 4146   49
10 2015  5439  3622  4063  318  804    4
11 2016 24881 15300 18238 2379 3782   20
12 2017  6769  4235  5298  840 1058    0