cces_cumulative's Introduction

CCES Cumulative File

Shiro Kuriwaki

This repository contains the R code that builds the Cooperative Congressional Election Study (CCES) cumulative file (2006-2022).

Please feel free to file any questions or requests about the cumulative file as GitHub issues.

Getting Started

Start by downloading the .dta, .Rds, or .feather file from the Dataverse page to your computer. This repository does not track the data due to size constraints, but feel free to contact me if you need a newer version than the one on Dataverse. The .Rds format can be read directly into R.

dat <- readRDS("cumulative_2006-2022.Rds")

Make sure to load the tidyverse package first. The Rds file can be handled as a base-R data.frame, but it was built entirely in the tidyverse, so working with it as a tibble gives the full set of features.

library(tidyverse)
dat
## # A tibble: 617,455 × 103
##     year case_id weight weight_cumulative state            st      cong  cong_up
##  * <int> <chr>    <dbl>             <dbl> <int+lbl>        <int+l> <fct> <fct>  
##  1  2006 439219   1.85              1.67  37 [North Carol… 37 [NC] 109   110    
##  2  2006 439224   0.968             0.872 39 [Ohio]        39 [OH] 109   110    
##  3  2006 439228   1.59              1.44  34 [New Jersey]  34 [NJ] 109   110    
##  4  2006 439237   1.40              1.26  17 [Illinois]    17 [IL] 109   110    
##  5  2006 439238   0.903             0.813 36 [New York]    36 [NY] 109   110    
##  6  2006 439242   0.839             0.756 48 [Texas]       48 [TX] 109   110    
##  7  2006 439251   0.777             0.700 27 [Minnesota]   27 [MN] 109   110    
##  8  2006 439254   0.839             0.756 32 [Nevada]      32 [NV] 109   110    
##  9  2006 439255   0.331             0.299 48 [Texas]       48 [TX] 109   110    
## 10  2006 439263   1.10              0.993 24 [Maryland]    24 [MD] 109   110    
## # ℹ 617,445 more rows
## # ℹ 95 more variables: state_post <int+lbl>, st_post <int+lbl>, dist <int>,
## #   dist_up <int>, cd <chr>, cd_up <chr>, dist_post <int>, dist_up_post <int>,
## #   cd_post <chr>, cd_up_post <chr>, zipcode <chr>, county_fips <chr>,
## #   tookpost <int+lbl>, weight_post <dbl>, rvweight <dbl>, rvweight_post <dbl>,
## #   starttime <dttm>, pid3 <int+lbl>, pid3_leaner <int+lbl>, pid7 <int+lbl>,
## #   ideo5 <fct>, gender <int+lbl>, sex <int+lbl>, gender4 <int+lbl>, …

The Stata .dta file can be read in Stata, or in R through haven::read_dta(). You will need the haven package installed.

The feather files can be loaded with arrow::read_feather(). They are currently set up to give the same output as reading the .dta file.

Each row is a respondent, and each variable is information associated with that respondent. Note that this cumulative dataset extracts only a subset of key variables from each year’s CCES, each of which has hundreds of columns.
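As a rough sketch of the .dta route, the snippet below writes a tiny labelled data frame to a temporary .dta file and reads it back. The toy data and the temporary path are illustrative stand-ins (not real CCES values); with the real file you would pass the path of the downloaded .dta to read_dta() instead.

```r
library(haven)

# Toy stand-in for the cumulative file (hypothetical values, not real CCES data)
toy <- data.frame(
  year = c(2006L, 2008L),
  case_id = c("1", "2"),
  stringsAsFactors = FALSE
)
toy$pid3 <- labelled(c(1, 2), labels = c(Democrat = 1, Republican = 2))

# Round-trip through a temporary Stata file
path <- tempfile(fileext = ".dta")
write_dta(toy, path)
dat_dta <- read_dta(path)

# Labelled columns come back as haven_labelled vectors
class(dat_dta$pid3)

# as_factor() swaps the numeric codes for their value labels
as_factor(dat_dta$pid3)
```

Columns with value labels keep their numeric codes until you convert them, which is why as_factor() is often the first step before tabulating.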

What’s New

Unified Variable Names

Most variables in this dataset come straight from each year’s CCES. However, the cumulative file renames and standardizes the variable names, making them accessible in one place. Please see the guide or the Crunch dataset for a full list and description of these variables.

Candidate Names and Identifiers

The cumulative file adds the name and identifiers of the candidate that a respondent chose. In the original year-specific datasets, the response values for a vote choice question are usually generic labels, e.g. Candidate1 and Candidate2 (with separate look-up variables appended). The cumulative dataset shows both the generic label and the chosen candidate’s name, party, and identifier, which will vary across individuals.

select(dat, year, case_id, matches("voted_sen"))
## # A tibble: 617,455 × 5
##     year case_id voted_sen                   voted_sen_party voted_sen_chosen   
##    <int> <chr>   <fct>                       <fct>           <chr>              
##  1  2006 439219  <NA>                        <NA>            <NA>               
##  2  2006 439224  [Democrat / Candidate 1]    Democratic      Sherrod C. Brown (…
##  3  2006 439228  [Democrat / Candidate 1]    Democratic      Robert Menendez (D)
##  4  2006 439237  <NA>                        <NA>            <NA>               
##  5  2006 439238  [Democrat / Candidate 1]    Democratic      Hillary Rodham Cli…
##  6  2006 439242  I Did Not Vote In This Race <NA>            <NA>               
##  7  2006 439251  [Republican / Candidate 2]  Republican      Mark Kennedy (R)   
##  8  2006 439254  [Democrat / Candidate 1]    Democratic      Jack Carter (D)    
##  9  2006 439255  [Democrat / Candidate 1]    Democratic      Barbara Ann Radnof…
## 10  2006 439263  I Did Not Vote In This Race <NA>            <NA>               
## # ℹ 617,445 more rows

Crunch

A version of the dataset is also included in Crunch, a database platform that makes it easy to view and analyze survey data with or without any programming experience. For access to view the dataset (free), please sign up here: https://harvard.az1.qualtrics.com/jfe/form/SV_066hQi4Eeco3Kap. Some features include:

  • A web GUI for quickly browsing variables
  • Quickly check cross-tabs and bar graphs, with customizable formatting
  • Sharable widgets

For questions and more access, please contact the CCES Team.

Crunch datasets can also be manipulated from an R package, crunch: https://github.com/Crunch-io/rcrunch.

install.packages("crunch")

For a bit more on using the R crunch package for your own purposes, see the crunch package vignettes, pkgdown website, or a short vignette in this repo.

Organization of Scripts

R scripts 01 - 07 reproduce the cumulative dataset starting from each year’s CCES on dataverse.

  • 01_define-names-labels.R constructs two variable name tables: one that names and describes each variable to be in the final dataset, and another that indicates which variables correspond to the candidate columns in each year’s CCES.
  • 02_download-cces-dataverse.R indicates a (partial) way to download the component CCES data from Dataverse so that the rest of the code can be run.
  • 03_read-common.R pulls out the common contents with minimal formatting (e.g. state, case identifier variable names).
  • 04_prepare-fixes.R makes some fixes to variables in each year’s datasets.
  • 05_stack-cumulative.R pulls out the variables of interest from the annual CCES files and stacks them into a long dataset where each row is a CCES respondent.
  • 06_extract_politicians.R pulls out the respondent-level “contextual variables”: information on candidates and representatives. It uses some long format voting tables from 05.
  • 07_merge-contextual_upload.R combines all the variables together, essentially combining the output of 04 on 05. Saves .Rds and .sav versions.
  • 08_format-crunch.R logs into Crunch, and adds variable names, descriptions, groupings, and other Crunch attributes to the Crunch dataset. It also adds variables and exports a .dta version.

More scripts are in 00_prepare; they format other datasets like NOMINATE, CQ, and DIME.

cces_cumulative's People

Contributors

enye92, jcha1997, kuriwaki

cces_cumulative's Issues

account for straight party option in 2018 post

The party lever question we added this year (CC18_409) interferes with the vote choice question, because respondents who answered D or R to this question were not asked the main vote choice questions. Instead, their vote choice should be inferred from the party chosen.

library(haven)
library(dplyr)

cc18 <- read_dta("data/source/cces/2018_cc.dta")
df <- read_dta("data/release/cumulative_2006_2018.dta")

# those who used party lever
strt <- filter(cc18, CC18_409 %in% 1:2) %>%
  transmute(case_id = as.character(caseid),
            year = 2018)

# should be filled in
df %>%
  select(year, case_id, voted_sen_party) %>%
  semi_join(strt, by = c("year", "case_id")) %>%
  count(voted_sen_party)

# A tibble: 3 x 2
#  voted_sen_party     n
#        <dbl+lbl> <int>
#1  1 [Democratic]   104
#2  2 [Republican]   130
#3 NA               3768
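One possible way to fill in those missing values is sketched below on toy data. It assumes that code 1 of the party lever question means Democratic and code 2 means Republican, which the issue does not state explicitly; the party_lever column name and the toy rows are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical). Assumption: lever code 1 = Democratic, 2 = Republican
toy <- tibble(
  case_id = c("a", "b", "c", "d"),
  party_lever = c(1, 2, NA, 3),
  voted_sen_party = c(NA, NA, "Democratic", "Republican")
)

filled <- toy %>%
  mutate(
    # Party implied by the straight-ticket answer, NA if no D/R lever was used
    lever_party = case_when(
      party_lever == 1 ~ "Democratic",
      party_lever == 2 ~ "Republican",
      TRUE ~ NA_character_
    ),
    # Keep the directly reported vote when present, else fall back to the lever
    voted_sen_party = coalesce(voted_sen_party, lever_party)
  ) %>%
  select(-lever_party)

filled
```

Using coalesce() means a directly reported vote choice is never overwritten; only respondents who skipped the main question get the lever-implied party.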

Add "hispanic" variable to dataset

Bernard Fraga points out that "Are you of Spanish, Latino, or Hispanic origin or descent?" is asked IF RACE != "Hispanic", so this column provides important information that should be provided alongside race.

helpers for guide

  • explanation of haven_labelled
  • showing haven labels in new dplyr
  • how to change all labels to factors
  • merge with different questions
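For the "change all labels to factors" item, a minimal sketch on toy data might look like the following. The columns and values are made up for illustration; only the haven and dplyr functions are real.

```r
library(dplyr)
library(haven)

# Toy tibble with two labelled columns (hypothetical values)
toy <- tibble(
  year = c(2006L, 2008L),
  pid3 = labelled(c(1, 2), labels = c(Democrat = 1, Republican = 2)),
  gender = labelled(c(2, 1), labels = c(Male = 1, Female = 2))
)

# Convert every labelled column to a factor, leaving other columns untouched
toy_fct <- toy %>%
  mutate(across(where(is.labelled), as_factor))

toy_fct
```

Calling haven::as_factor() on the whole data frame is a shorter alternative, since by default it only converts labelled columns.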

duplicate labels no longer allowed in default print method

Because of tidyverse/haven#364, labelled columns with duplicate label values can no longer be printed, which breaks the first-cut display of the dataset in R (also see tidyverse/haven#424). This affects c("pid7", "race", "marstat", "newsint", "approval_pres", "approval_gov").

> attr(df$approval_pres, "labels")
              Strongly Approve 
                             1 
                       Approve 
                             2 
              Somewhat Approve 
                             2 
                    Disapprove 
                             3 
           Somewhat Disapprove 
                             3 
           Strongly Disapprove 
                             4 
                Never Heard Of 
                             5 
                      Not Sure 
                             5 
Neither Approve Nor Disapprove 
                             6 
                            NA 
                   -2147483648 
> select(df, approval_pres)
Error: `labels` must be unique
Backtrace:
     █
  1. ├─[ (function (x, ...) ... ]
  2. └─tibble:::print.tbl(x)
  3.   ├─[ tibble:::cat_line(...) ] with 2 more calls
  6.   ├─[ base::format(...) ]
  7.   └─tibble:::format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
  8.     └─tibble::trunc_mat(x, n = n, width = width, n_extra = n_extra)
  9.       ├─[ base::as.data.frame(...) ]
 10.       ├─[ utils::head(...) ]
 11.       └─utils:::head.data.frame(x, n)
 12.         ├─[ ...[] ]
 13.         └─tibble:::`[.tbl_df`(x, seq_len(n), , drop = FALSE)
 14.           └─tibble:::map(result, subset_rows, i)
 15.             └─base::lapply(.x, .f, ...)
 16.               └─tibble:::FUN(X[[i]], ...)
 17.                 ├─[ ...[] ]
 18.                 └─haven:::`[.haven_labelled`(x, i)
 19.                   └─haven::labelled(...)
> 
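A small base-R sketch for spotting the offending labels: it rebuilds part of the approval_pres label vector printed above by hand (so it runs without the data) and lists the codes that are attached to more than one label.

```r
# Label vector mirroring the printout above: names are labels, values are codes
labs <- c("Strongly Approve" = 1, "Approve" = 2, "Somewhat Approve" = 2,
          "Disapprove" = 3, "Somewhat Disapprove" = 3, "Strongly Disapprove" = 4)

# Codes that are attached to more than one label
dup_vals <- unique(labs[duplicated(labs)])

# All labels sharing one of those codes
labs[labs %in% dup_vals]
```

With the real data, the same check could be run on attr(df$approval_pres, "labels") for each affected variable.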

Add follow the news (newsint)

Seems to be asked in all years except 2006.

# 2007 newsint
# 2008 V244
# 2009 v244
# 2010 V244
# 2011 V244
# 2012- newsint
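The year-to-variable mapping above could be encoded as a small lookup helper. This is only a sketch: newsint_var() is a hypothetical name, and it simply transcribes the list above, including the lowercase v244 in 2009.

```r
# Hypothetical helper: source variable name for the news interest item in a given year
newsint_var <- function(year) {
  stopifnot(year >= 2006)
  if (year == 2006) return(NA_character_)            # not asked in 2006
  if (year == 2009) return("v244")                   # lowercase in 2009
  if (year %in% c(2008, 2010, 2011)) return("V244")
  "newsint"                                          # 2007 and 2012 onward
}

# Look up the variable name for a few years
vapply(c(2006, 2007, 2009, 2010, 2016), newsint_var, character(1))
```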

2008 vote (old cumulative) is coalescing non voters or asked non-vote as a valid option

which makes the comparison inconsistent.

. keep if mod(year, 4) == 0 & tookpost == 1
(355,449 observations deleted)

. bysort year: tab voted_pres_party educ

-----------------------------------------------------------------------------------------------------------------------------------------------------------
-> year = 2008

 President vote |
        in last |                             Education
       election |     No HS  High Scho  Some Coll     2-Year     4-Year  Post-Grad |     Total
----------------+------------------------------------------------------------------+----------
     Democratic |       308      3,206      2,823        802      2,791      1,589 |    11,519 
     Republican |       287      4,203      2,907        939      2,639      1,091 |    12,066 
Other Candidate |        12        114        111         23        101         36 |       397 
   Did not Vote |       251      1,691        581        178        252         86 |     3,039 
----------------+------------------------------------------------------------------+----------
          Total |       858      9,214      6,422      1,942      5,783      2,802 |    27,021 

-----------------------------------------------------------------------------------------------------------------------------------------------------------
-> year = 2012

 President vote |
        in last |                             Education
       election |     No HS  High Scho  Some Coll     2-Year     4-Year  Post-Grad |     Total
----------------+------------------------------------------------------------------+----------
     Democratic |       330      4,000      4,663      1,773      4,784      3,498 |    19,048 
     Republican |       302      4,616      5,036      1,959      4,419      2,409 |    18,741 
Other Candidate |        25        246        392        155        354        190 |     1,362 
   Did not Vote |         1         24         20          6         18          8 |        77 
----------------+------------------------------------------------------------------+----------
          Total |       658      8,886     10,111      3,893      9,575      6,105 |    39,228 

-----------------------------------------------------------------------------------------------------------------------------------------------------------
-> year = 2016

 President vote |
        in last |                             Education
       election |     No HS  High Scho  Some Coll     2-Year     4-Year  Post-Grad |     Total
----------------+------------------------------------------------------------------+----------
     Democratic |       270      3,740      5,197      2,443      6,213      4,273 |    22,136 
     Republican |       312      5,258      4,800      2,319      3,992      2,074 |    18,755 
Other Candidate |        47        672        978        494      1,171        679 |     4,041 
   Did not Vote |         1         14         17          6         26         17 |        81 
----------------+------------------------------------------------------------------+----------
          Total |       630      9,684     10,992      5,262     11,402      7,043 |    45,013 

-----------------------------------------------------------------------------------------------------------------------------------------------------------
-> year = 2020

 President vote |
        in last |                             Education
       election |     No HS  High Scho  Some Coll     2-Year     4-Year  Post-Grad |     Total
----------------+------------------------------------------------------------------+----------
     Democratic |       361      4,613      5,499      2,783      7,630      5,301 |    26,187 
     Republican |       346      5,328      3,862      2,220      3,945      2,002 |    17,703 
Other Candidate |        23        187        311        158        469        310 |     1,458 
   Did not Vote |         2         21         23         12         34         22 |       114 
----------------+------------------------------------------------------------------+----------
          Total |       732     10,149      9,695      5,173     12,078      7,635 |    45,462 


. restore

See the growth in the 4th category without the others changing

> ccp %>% xtabs(~ year + vote_pres_08, .)
      vote_pres_08
year       1     2     3     4     5
  2006     0     0     0 36421     0
  2007     0     0     0  9999     0
  2008 11519 12066   397  8818     0
  2009  5570  5614   246  2313    57
  2010 23410 23342  1417  6939   292
  2011 10062  8062   937    92   997
  2012 23425 19314  1207 10148   441
> cc08 %>% xtabs(~ CC410)
Error in as.data.frame.default(data, optional = TRUE) : 
  cannot coerce class ‘"formula"’ to a data.frame
> cc08 %>% xtabs(~ CC410, .)
CC410
    1     2     3     4     5     6     7     8     9 
12066 11519   125    20    78    49   125    23     4 
> cc08 %>% xtabs(~ as_factor(CC410), .)

A table of keys that link FECID to ICPSR ID

This should be fairly straightforward to do since we have the FEC recipients file.

  • Strip away extraneous columns. Something like dplyr::distinct(FECID, ICPSRID)
  • Keep only candidates
  • Classify challengers and group them with their incumbent opponent's races (or names that are in the CCES metadata)
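The first two steps might be sketched as below. The recipients table here is toy data: recipient_type, amount, and all ID values are hypothetical, and only FECID and ICPSRID come from the issue text. The third step (grouping challengers with their incumbent opponents) is not covered.

```r
library(dplyr)

# Toy stand-in for the FEC recipients file (hypothetical columns and values)
recipients <- tibble(
  FECID = c("H0XX00001", "H0XX00001", "S0XX00002", "C00000003"),
  ICPSRID = c("10001", "10001", "10002", "10003"),
  recipient_type = c("cand", "cand", "cand", "comm"),
  amount = c(100, 250, 75, 500)
)

keys <- recipients %>%
  filter(recipient_type == "cand") %>%  # keep only candidates
  distinct(FECID, ICPSRID)              # one row per FECID-ICPSRID pair

keys
```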

Errors in voted_rep_chosen

From Bernard Fraga:

something odd about voted_rep_chosen values for at least some districts in 2012, and it made me think that there must be a coding error with either the cd variable or the voted_rep_chosen variable. Specifically, for many districts in California, voted_rep_chosen takes values of candidates from other districts, usually in small amounts, but from bordering districts. For example, some 2012 CA-8 respondents chose Jackie Speier (D) and other candidates from the 14th district, not the 8th district (Pelosi).

It doesn’t look like this is an issue in other years, so I assume it has to do with inputzip-cd matching or something odd with the voted_rep_chosen table due to redistricting. However, it would be good to know who these individuals actually saw on the survey, especially if they were given candidates who weren’t running in their actual district.

Add some knowledge questions

These tend to be asked in multiple years. Pick a few of the main ones (perhaps CurrentHouse party)

  [1] Republicans [2] Democrats [3] Neither [4] Not sure
[CC20_310a] U.S. House of Representatives
[CC20_310b] U.S. Senate
[CC20_310c] $inputstate State Senate (shown if inputstate != 11)
[CC20_310d] $LowerChamberName (shown if LowerChamberName)

  [1] Never Heard of Person [2] Republican [3] Democrat [4] Other Party / Independent [5] Not sure
[CC20_311a] $CurrentGovName (shown if CurrentGovName)
[CC20_311b] $CurrentSen1Name (shown if CurrentSen1Name)
[CC20_311c] $CurrentSen2Name (shown if CurrentSen2Name)
[CC20_311d] $CurrentHouseName (shown if CurrentHouseName)

(thanks to Alexander Agadjanian)

weights clarification

For 2012 and 2016, use commonweight_vv as the main weights, because that is the equivalent of commonweight in all other years, i.e. weights for the pre-wave that have been computed after vote validation.
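A sketch of that rule on toy data is below. The weight values are made up, and naming the result weight is an assumption rather than a description of how the build scripts do it.

```r
library(dplyr)

# Toy data (hypothetical weights)
toy <- tibble(
  year = c(2010, 2012, 2016, 2018),
  commonweight = c(1.1, 0.9, 1.3, 0.7),
  commonweight_vv = c(NA, 1.0, 1.2, NA)
)

# Use the post-vote-validation weight in 2012 and 2016, commonweight otherwise
toy <- toy %>%
  mutate(weight = if_else(year %in% c(2012, 2016), commonweight_vv, commonweight))

toy
```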

make labels for prepositions lower case

                 newsint     n
               <dbl+lbl> <int>
1  1 [Hardly At All]      1967
2  2 [Only Now And Then]  4141
3  3 [Don't Know]          711
4  4 [Some Of The Time]  11182
5  5 [Most Of The Time]  32678
6 NA                      1834
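One way this could be done with base R is sketched below. lower_small_words() is a hypothetical helper, and its default word list is an assumption; extend the small argument if conjunctions or articles should also be lowercased.

```r
# Hypothetical helper: lowercase selected small words unless they start the label
lower_small_words <- function(x, small = c("At", "Of", "In", "On", "To", "For", "By")) {
  for (w in small) {
    # (?<=\\s) requires preceding whitespace, so a label-initial word stays capitalized;
    # \\b stops "On" from matching inside longer words such as "Only"
    x <- gsub(paste0("(?<=\\s)", w, "\\b"), tolower(w), x, perl = TRUE)
  }
  x
}

lower_small_words(c("Hardly At All", "Only Now And Then", "Some Of The Time"))
```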

CD change 2006-2016

data set to calculate the change of congressional districts for Congresses 109,110,113,115

Add self-reported turnout

Year  Pre         Post
2006  vote06turn  v4004
2008  CC326       cc403
2010  CC354       CC401
2012  CC354       CC401
2014  CC354       CC401
2016  CC16_364    CC16_401
2018  CC18_350    CC18_401
2020  CC20_363    CC20_401

unify categorizations of validated vote

  vv_turnout_gvm_char                                                          n
   <chr>                                                                    <int>
 1 NA                                                                       74600
 2 Validated Record Of Voting In General Election                           71664
 3 ""                                                                       70649
 4 Polling                                                                  31387
 5 No Evidence On Whether The R Voted Or Not                                19972
 6 Unknown                                                                  16404
 7 Absentee                                                                 13895
 8 Unknownmethod                                                            10821
 9 Mail                                                                      9839
10 Did Not Vote-Verified Record Of Being Unregistered                        9205
11 Matchednovote                                                             8819
12 Earlyvote                                                                 8315
13 Did Not Vote-R Said They Didn't Vote                                      7881
14 Did Not Vote - Verified Record Of Not Voting                              5448
15 Early                                                                     5206
16 Did Not Vote-R Has Verified Registration File, But Not Vote History File  4930
17 Did Not Vote-R Said They're Note Registered To Vote                       2923
18 Virginia Doesn't Maintain Vote History Files                              1733
19 Did Not Vote-Non-Citizen                                                   865

pid3 2010 missing vs. not sure

5 (usually not sure) is not in 2010

> dcast(ccc, year ~ pid3) 
Using 'vv_turnout_pvm' as value column. Use 'value.var' to override
Aggregation function missing: defaulting to length
   year     1     2     3    4    5   NA
1  2006 11776 11237 11338 2012    1   57
2  2007  3051  2868  3131  677  219   53
3  2008 11510 10764  7836 1304 1386    0
4  2009  5969  4422  4613  981  815    0
5  2010 16876 14503 13630 1675    0 8716
6  2011  7295  5303  6111  269 1172    0
7  2012 27124 20588 20570 2914 2839    0
8  2013  6040  3712  5220  438  989    1
9  2014 20428 13228 15952 2397 4146   49
10 2015  5439  3622  4063  318  804    4
11 2016 24881 15300 18238 2379 3782   20
12 2017  6769  4235  5298  840 1058    0
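To check this kind of pattern without reshaping, a per-year summary of the "not sure" code and of missing values can be computed directly. The sketch below uses toy data (hypothetical rows) and assumes, as the issue does, that 5 is the "not sure" code.

```r
library(dplyr)

# Toy data (hypothetical): 2010 has no code 5 but has missing values
toy <- tibble(
  year = c(2008, 2008, 2010, 2010),
  pid3 = c(1, 5, 1, NA)
)

by_year <- toy %>%
  group_by(year) %>%
  summarize(
    n_not_sure = sum(pid3 == 5, na.rm = TRUE),  # respondents coded 5
    n_missing = sum(is.na(pid3)),               # respondents with NA
    .groups = "drop"
  )

by_year
```

A year with zero in n_not_sure and a large n_missing, like 2010 in the table above, suggests the "not sure" responses were stored as missing.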
