scunning1975 / mixtape

Data and Program files for Causal Inference: The Mixtape

License: Other

R 10.42% Stata 12.64% TeX 70.28% Python 6.67%

mixtape's Introduction

mixtape

Data files for Causal Inference: The Mixtape

Contributions are very welcome. See the contributing guide for details.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

mixtape's Issues

ATT equation (4.3) on page 128

Second line introduces the symbol $E_i^0$. Shouldn't that be $Y_i^0$?

(I'm using LaTeX notation. I don't know how to do this in Markdown.)

Chapter 9, code castle_5 not included on the website, 1 minor typo

Dear Scott, thank you for this extremely valuable and easy-to-read book. I was recently assigned chapter 9 as reading and found two issues that I thought I would report here.

1. Missing castle_5 on the website

Link: 9. Difference-in-Differences 9.6.8 Bacon decomposition.

On pp.499-500 of the book there are castle_5.do and castle_5.r, and they are also in this repository, plus the Python version.

2. Typo in both the book and the website

Link: 9 Difference-in-Differences 9.5.7 Final thoughts, and also p.461 of the paper book (sorry I don't know if I have the latest edition, but since it's also on the website, I assume it was not spotted yet.)

These, it turns, out are

Should be: These, it turns out, are

Typo

...attributes, equation ~~2.27~~ 2.23? is likely violated—at least in our example.

Where: Paragraph 3 on https://mixtape.scunning.com/probability-and-regression.html#mean-independence

synth6.do matstate

Hi,

Just running into some issues with this line of Stata code:

matstate=state1/state2/state4/state5/state6/state8/state9/state10/state11/state12/state13/state15/state16/state17/state18/state20/state21/state22/state23/state24/state25/state26/state27/state28/state29/state30/state31/state32/state33/state34/state35/state36/state37/state38/state39/state40/state41/state42/state45/state46/state47/state48/state49/state51/state53/state55;

Stata reports that matstate is an unrecognized command.

is ssl tweak needed for some extra protection?

@kylebutts, great job on the Python version! I'm curious about the ssl import:

mixtape/python/card.py

Lines 9 to 10 in a8d007b

 import ssl 

 ssl._create_default_https_context = ssl._create_unverified_context

Would the Python code work without it?
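For comparison, here is a minimal sketch of fetching a file with certificate verification left on, using only the standard library. The helper names `make_verified_context` and `fetch` are hypothetical and not part of the repository. Whether the scripts run without the override depends on the certificate store of the machine running them.

```python
import ssl
import urllib.request


def make_verified_context():
    """Return an SSL context that checks certificates and hostnames."""
    # create_default_context() loads the system's trusted CA certificates.
    return ssl.create_default_context()


def fetch(url, timeout=30):
    """Download `url` as bytes with verification enabled (hypothetical helper)."""
    context = make_verified_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
        return resp.read()
```

If a download fails with a certificate error under this setup, that points to a problem with the local certificate store rather than a reason to disable verification globally.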

clean up unused imports in python version

import numpy as np 
import pandas as pd 
import statsmodels.api as sm 
import statsmodels.formula.api as smf 
from itertools import combinations 
import plotnine as p

Not every script uses every import, so the unused ones can be removed to keep the scripts tidy. I can make a PR for this.
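One way to find candidates for removal is to parse each script and compare the names bound by import statements with the names actually referenced. This is a rough sketch using the standard `ast` module. The function name `unused_imports` is hypothetical, and the check ignores names used only inside strings or re-exported on purpose.

```python
import ast


def unused_imports(source):
    """Return imported names that are never referenced in `source`."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds "a"; "import a.b as c" binds "c".
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    # Every bare identifier in the module, including the base of "np.ones".
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)


example = (
    "import numpy as np\n"
    "import pandas as pd\n"
    "from itertools import combinations\n"
    "x = np.ones(3)\n"
)
print(unused_imports(example))
```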

Typo in book - Probability and Regression Review, page 71, equation 2.44

Now, under the first, fourth, and fifth, assumptions, we can write:
...
V(y|x) = σ^2

Should be V(u|x) = σ^2?

Add CONTRIBUTING.md file

CONTRIBUTING.md files usually explain to new contributors how to contribute via GitHub. They can also specify other requirements (e.g. a code style). Here is an example from one of the World Bank's repos.

For this project, I think you could specify:

How to contribute via Github
Explain the difference between an issue and a PR.

If you want, I can submit a PR (a suggested change) and we can continue the conversation there.

Conditional probability example on page 20

It looks to me like Pr(Fail | Pass) = 0.6, not 0.45. Am I missing something?

Table reference @ref(tab:gruber_ddd)

A table reference in 9.5.2 of the online version did not compile

https://mixtape.scunning.com/ch8.html#state-mandated-maternity-benefits

collecting mixtape refactoring proposals

minor error in abortion_dd.R

Love these materials. I was just working through them and noticed that in abortion_dd.R I needed to change lines 28 and 29 to

sd = reg$std.error[-1:-25],
mean = reg$coefficients[-1:-25],

Currently, they run [-1:-75]

Typo in Figure 10.8 title

In Chapter 10, there is a typo in figure 10.8.

The plot title says "Back male incarceration" instead of "Black male incarceration." Doesn't look like the code for that figure is included, so I'm not sure where to find the typo and make a pull request.

randomization inference p-values

I'm getting a different p-value than is calculated in ri.py or commented in ri.do. I only know python (excited by the new python code added!) so I'll reference the py file.

I believe the issue is with the line p_value = p_value[p_value['permutation'] == 1], which computes the p-value from neither a weak nor a strict inequality. Prior to that line, the signed t-statistics are ordered and ranked. Several observations share the same t-stat of 1. So, if we want the p-value calculation to use a weak inequality (the share of observations with a weakly higher ATE), the following minimal edits would do the job.

p_value = p_value[p_value['ate'] == 1] 
p_value['rank'].max() / n

This gives 0.4285.

The simplest code I can think to do the same thing is the following, though it's not the smartest because it relies on permutations instead of combinations.

from itertools import permutations
import pandas as pd
import numpy as np

url = 'https://github.com/scunning1975/mixtape/raw/master/ri.dta'
df = pd.read_stata(url, index_col = 'name')
observed_t_stat = 1
y_vec = df.y.values 

# create vector of treatment assignments
# use -1 instead of 0 for dot product assist
d = np.concatenate( [np.ones(4), (-1)*np.ones(4)] ) 

t_stats = np.array([])
for d_vec in permutations(d):    
    t = np.dot(y_vec, d_vec) / 4 # signed t-stat
    t_stats = np.append(t_stats, t)

p_value = (t_stats >= observed_t_stat).mean()

I'm raising this as an issue because I want to check my own understanding (I'm self-studying), and I think there's also the question of whether the code should use absolute values of the t-statistics to match the book.
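A combinations-based version of the same calculation might look like the sketch below. It enumerates each treatment assignment once instead of every permutation, and takes the observed statistic from the data rather than hard-coding it. The function name `ri_p_value` and the `two_sided` flag are hypothetical, and the small example data are made up for illustration (they are not the values in ri.dta).

```python
from itertools import combinations

import numpy as np


def ri_p_value(y, d, two_sided=False):
    """Exact randomization-inference p-value for a difference in means."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=bool)
    n, n_treated = len(y), int(d.sum())

    def diff_in_means(treated_idx):
        mask = np.zeros(n, dtype=bool)
        mask[list(treated_idx)] = True
        return y[mask].mean() - y[~mask].mean()

    observed = y[d].mean() - y[~d].mean()
    # One statistic per way of choosing which units are treated.
    stats = np.array(
        [diff_in_means(c) for c in combinations(range(n), n_treated)]
    )
    tol = 1e-12  # guard against floating-point noise when statistics tie
    if two_sided:
        return float((np.abs(stats) >= abs(observed) - tol).mean())
    # Weak inequality: share of assignments with a statistic at least as large.
    return float((stats >= observed - tol).mean())


# Made-up data: two treated units, two controls.
print(ri_p_value([3, 4, 1, 2], [1, 1, 0, 0]))
print(ri_p_value([3, 4, 1, 2], [1, 1, 0, 0], two_sided=True))
```

With eight units and four treated, this loops over 70 assignments rather than 40,320 permutations. The `two_sided` flag corresponds to the absolute-value question raised above.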

Flip sort to calculate p-value

mixtape/R/ri.R

Line 44 in 3c84896

arrange(ate) %>%

I think this needs to be arrange(desc(ate)) to get the one-tailed p-value correct. It doesn't matter much (the result is still insignificant). Just FYI, here's an alternative (generalized) version of the code (similar to that in #20):

library(tidyverse)
library(haven)

read_data <- function(df)
{
  full_path <- paste("https://raw.github.com/scunning1975/mixtape/master/", 
                     df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

ri <- 
  read_data("ri.dta") %>%
  mutate(id = row_number())

actual_treated <- 
  ri %>% 
  filter(d==1) %>% 
  pull(id)

combo <-
  tibble(treated = combn(nrow(ri), sum(ri$d), simplify = FALSE),
         permutation = 1:length(treated)) %>%
  crossing(ri %>% select(id, y)) %>%
  rowwise() %>%
  mutate(d = is.element(id, treated))

ates <- 
  combo %>% 
  group_by(permutation) %>%
  summarize(te1 = sum(d * y, na.rm = TRUE), 
            te0 = sum((1 - d) * y, na.rm = TRUE)) %>%
  mutate(ate = te1 - te0) %>% 
  arrange(desc(ate)) %>% 
  mutate(rank = row_number(),
         p_value = rank/nrow(.))

actual_permutation <-
  combo %>% 
  rowwise() %>% 
  filter(setequal(actual_treated, treated)) %>% 
  select(permutation) %>% 
  distinct() %>% 
  pull()

ates %>%  
  filter(permutation == actual_permutation)
#> # A tibble: 1 x 6
#>   permutation   te1   te0   ate  rank p_value
#>         <int> <dbl> <dbl> <dbl> <int>   <dbl>
#> 1           1    34    30     4    25   0.357

Created on 2021-02-27 by the reprex package (v1.0.0)

Set notation in Probability and Regression Review

I am not sure if this is the right place to mention issues with the book's content itself. If not, feel free to close!

In the Probability Tree section, the footnote is not displaying the LaTeX equation correctly (and I believe "union" should be "intersection"):

The set notation $\cap$ means “union” and refers to two events occurring together.

This seems to occur inside other footnotes too.

Similarly, in the Venn Diagram and Sets section (errors in bold):

Whenever we want to describe a set of events in which either A or B could occur, it is: A∩B. And this is pronounced “A union B,” which means it is the new set that contains every element from A and every element from B. Any element that is in either set A or set B, then, is also in the new union set. And whenever we want to describe a set of events that occurred together—the joint set—it’s A∪B, which is pronounced “A intersect B.”

In the Contingency Tables section, in table 2.4 the table headers describing events B and ~B are mixed up:

Event labels	*Coach is not rehired (B)*	*Coach is rehired (∼B)*	Total
(A) Bowl game	Pr(A,∼B)=0.1	Pr(A,B)=0.5	Pr(A)=0.6
(∼A) no Bowl game	Pr(∼A,∼B)=0.1	Pr(∼A,B)=0.3	Pr(B)=0.4
Total	Pr(∼B)=0.2	Pr(B)=0.8	1.0

Below equation 2.7 in the same section, the term Pr(A,B) is missing an ending bracket, and equation 2.8 is not tagged correctly.
Hope this is somewhat helpful!
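The two points above can be checked with a short sketch. It first shows union versus intersection on two small made-up sets, then sums the joint probabilities printed in the table cells to recover the marginals. The dictionary layout and the helper name `marginal` are hypothetical.

```python
# Union vs. intersection on two made-up sets.
A = {1, 2, 3}
B = {3, 4}
print(A | B)  # union: elements in either set
print(A & B)  # intersection: elements in both sets

# Joint probabilities as printed in the cells of table 2.4.
joint = {
    ("A", "B"): 0.5,
    ("A", "~B"): 0.1,
    ("~A", "B"): 0.3,
    ("~A", "~B"): 0.1,
}


def marginal(event):
    """Sum the joint probabilities of every cell that involves `event`."""
    return sum(p for cell, p in joint.items() if event in cell)


for event in ("A", "~A", "B", "~B"):
    print(event, round(marginal(event), 10))
```

The cells involving ∼B sum to 0.2 and those involving B sum to 0.8, matching the column totals. That agrees with the report that the column holding the ∼B cells carries the B header.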

moviestar.py missing import of Stargazer

Stargazer is used in moviestar.py, but it is not imported in the script. I think you just need to add from stargazer.stargazer import Stargazer.

Happy to submit a pull request if you'd prefer.

abortion_ddd.do

The title says the DDD estimate is for 15-19 year olds vs. 20-24 year olds, but the code compares 15-19 vs. 25-29 year olds: if bf==1 & (age==15 | age==25)

Some of the Python code might not run

I noticed that some Python code might not run as is. For example, in https://github.com/scunning1975/mixtape/blob/master/python/synth_1.py, it looks like we need:

To import robjects, IntVector, plt
To import pandas2ri and call pandas2ri.activate() so that the dataframe texas would get converted

I have only looked at the code from the synthetic control chapter. I can make a PR for this.

Mistake in graph in Section 3.1.1

The third graph in Section 3.1.1 shows that B causes Y, but in the text, the following is stated (tenth paragraph in the same subsection, which is the third paragraph after the third graph):
"It is telling what is happening, and it is telling what is not happening. For instance, notice that
B has no direct effect on the child’s earnings except through its effect on schooling."

Perhaps the mistake is that you used the same graph as the last one in Section 3.1.3.

Variant on tea.R

This is just FYI.

I wanted a variant of tea.R where a more causal interpretation could be offered (there is a kind of causation going on here, as discussed below). The code here is a bit more general (and a bit shorter), as I put the guesses as a vector in one column.

library(tidyverse)
library(utils) # For combn function

n_cups <- 8L

# Can choose any n_cups/2 cups to have "milk first"
# The code in the book assumes (wlog) that these are 1, 2, 3, 4.
milk_first <- sample(n_cups, n_cups/2L)
milk_first
#> [1] 3 6 2 8
n_guesses <- length(milk_first)

guesses <-
  tibble(guesses = combn(n_cups, n_guesses, simplify = FALSE)) %>%
  rowwise() %>%
  mutate(correct = setequal(guesses, milk_first)) 

# Look at typical guesses. These are vectors.
guesses$guesses[1]
#> [[1]]
#> [1] 1 2 3 4
guesses$guesses[67]
#> [[1]]
#> [1] 4 5 6 8

sum(guesses$correct)
#> [1] 1
p_value <- mean(guesses$correct)
p_value
#> [1] 0.01428571

guesses %>% filter(correct) %>% select(guesses) %>% pull()
#> [[1]]
#> [1] 2 3 6 8

# Now suppose that protocol is changed. For each cup, Fisher tosses a coin:
# if heads, he puts in milk first; if tails, he puts in tea first.
# Muriel Bristol then sips each of the cups and declares either "milk first" or
# "tea first".
#
# With this protocol, it's easier to see a causal interpretation. Label the
# milk-first cases as treatment; tea-first cases as controls. The question is
# whether treatment leads to the outcome of "milk first". One has a sample of 
# n_cups independent observations and can use the regular framework. But here,
# we just assume that Bristol guessed all correctly and calculate exact p-value
# for that.
#
# (Note that one can apply a similar causal interpretation to the original 
#  protocol, but one couldn't apply the standard framework to it; 
#  no SUTVA for one, as Bristol will declare four and only four cases as 
#  "tea first". 
#
#  Alternatively, if Fisher did not tell Bristol that four
#  cups are milk-first and four are tea-first, then he could use the same 
#  standard framework, as the guesses would become independent again under
#  the sharp null that putting tea or milk first does not affect Bristol's 
#  guess.)
milk_first <- which(sample(c(TRUE, FALSE), n_cups, replace=TRUE))
milk_first
#> [1] 2 7

guess_n <- function(x) combn(n_cups, x, simplify = FALSE)

guesses <-
  tibble(guesses = unlist(lapply(0:n_cups, guess_n), recursive = FALSE)) %>%
  rowwise() %>%
  mutate(correct = setequal(guesses, milk_first))

sum(guesses$correct)
#> [1] 1
p_value <- mean(guesses$correct)
p_value
#> [1] 0.00390625

guesses %>% filter(correct) %>% select(guesses) %>% pull()
#> [[1]]
#> [1] 2 7

Created on 2021-02-27 by the reprex package (v1.0.0)
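For readers following the Python scripts, the two exact p-values above can be reproduced by counting the possible guesses under each protocol. This is a sketch that assumes, as the R code does, that every cup is guessed correctly. The variable names are hypothetical.

```python
from itertools import combinations, product

n_cups = 8

# Original protocol: exactly n_cups / 2 cups are "milk first", so a guess
# is a choice of which four cups to label that way.
fixed_guesses = list(combinations(range(1, n_cups + 1), n_cups // 2))
p_fixed = 1 / len(fixed_guesses)

# Coin-toss protocol: each cup is labeled independently, so a guess is any
# sequence of eight "milk first" / "tea first" calls.
coin_guesses = list(product([True, False], repeat=n_cups))
p_coin = 1 / len(coin_guesses)

print(len(fixed_guesses), p_fixed)
print(len(coin_guesses), p_coin)
```

Only one guess in each list matches the true assignment, so the p-value is one over the number of possible guesses: 70 under the original protocol and 256 under the coin-toss protocol.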

Typo in reference to figure

On this page, there is a reference to "Figure 49", whereas all other figures are referenced as Figure 7.x. Here's the paragraph:

"Figure 49 shows the first stage, and it is really interesting. Look at all those 3s and 4s at the top of the picture. There’s a clear pattern—those with birthdays in the third and fourth quarter have more schooling on average than do those with birthdays in the first and second quarters. That relationship gets weaker as we move into later cohorts, but that is probably because for later cohorts, the price on higher levels of schooling was rising so much that fewer and fewer people were dropping out before finishing their high school degree."

Collinearity raises a warning because of a singular matrix in both regressions in castle_1.R
Inclusion of original Cheng and Hoekstra specification with CDL as the causal regressor doesn't line up with castle_1.do

SCtools now on CRAN

I think you can go ahead and drop the need for devtools and install the CRAN version of SCtools.

mixtape/R/synth_1.R

Line 5 in d05e7e1

if(!require(SCtools)) devtools::install_github("bcastanho/SCtools")

stargazer not imported in this file prior to use

Just to let you know that stargazer is not imported in this file so the code will error at this line:

mixtape/python/collider_discrimination.py

Line 30 in 77e3de4

st = Stargazer((lm_1,lm_2,lm_3))

Seems fine in the other file in this chapter (chapter 3) which is movie_star.py.

(Really enjoying the book so far!! Thanks so much)
