
mixtape's Introduction

mixtape

Data files for Causal Inference: The Mixtape

Contributions are very welcome. You can read this guide for more guidance.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

mixtape's People

Contributors

alexanderthclark, be-green, edjeeongithub, ian-mcbride, kylebutts, lrdegeest, mpaulacaldas, mserramo, scunning1975


mixtape's Issues

ATT equation (4.3) on page 128

Second line introduces the symbol $E_i^0$. Shouldn't that be $Y_i^0$?

(I'm using LaTeX notation. I don't know how to do this in Markdown.)
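
For reference, the usual decomposition of the ATT in potential-outcomes notation is sketched below. This is the standard textbook form, not a transcription of equation (4.3), so the layout in the book may differ, and $\delta_i$ is assumed here to denote the unit-level treatment effect. Every term is an expectation of an outcome $Y_i$, which is why a symbol like $E_i^0$ looks out of place.

```latex
\begin{align}
ATT &= E[\delta_i \mid D_i = 1] \\
    &= E[Y_i^1 - Y_i^0 \mid D_i = 1] \\
    &= E[Y_i^1 \mid D_i = 1] - E[Y_i^0 \mid D_i = 1]
\end{align}
```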

Chapter 9, code castle_5 not included on the website, 1 minor typo

Dear Scott, thank you for this extremely valuable and easy-to-read book. Recently I was given chapter 9 of it as reading, and found two issues that I thought I could tell you here.

1. Missing castle_5 on the website

Link: 9. Difference-in-Differences 9.6.8 Bacon decomposition.

On pp.499-500 of the book there are castle_5.do and castle_5.r, and they are also in this repository, plus the Python version.

2. Typo in both the book and the website

Link: 9  Difference-in-Differences 9.5.7 Final thoughts, and also p.461 of the paper book (sorry I don't know if I have the latest edition, but since it's also on the website, I assume it was not spotted yet.)

These, it turns, out are

Should be: These, it turns out, are

synth6.do matstate

Hi,

Just running into some issues with this line of Stata code:

matstate=state1/state2/state4/state5/state6/state8/state9/state10/state11/state12/state13/state15/state16/state17/state18/state20/state21/state22/state23/state24/state25/state26/state27/state28/state29/state30/state31/state32/state33/state34/state35/state36/state37/state38/state39/state40/state41/state42/state45/state46/state47/state48/state49/state51/state53/state55; 

Stata reports that matstate is an unrecognized command.

clean up unused imports in python version

import numpy as np 
import pandas as pd 
import statsmodels.api as sm 
import statsmodels.formula.api as smf 
from itertools import combinations 
import plotnine as p

Not every script uses every import, so the unused ones could be removed to tidy up the files. I can make a PR for this.
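
One way to find the unused imports without running each script is to parse it and compare the imported names with the names actually referenced. The sketch below is only an illustration: `unused_imports` is a hypothetical helper, the example source is made up, and the check is deliberately simple (it ignores names used only inside strings or `__all__`).

```python
import ast

def unused_imports(source: str) -> list[str]:
    """Return imported names that are never referenced in `source`."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    # Every bare name that appears anywhere in the module body
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted(imported - used)

# Made-up script body: only numpy is actually used.
example = """
import numpy as np
import pandas as pd
import statsmodels.api as sm
from itertools import combinations

x = np.array([1, 2, 3])
print(x.mean())
"""

print(unused_imports(example))
```

Because the source is only parsed, never executed, none of the listed packages need to be installed to run the check.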

Add CONTRIBUTING.md file

CONTRIBUTING.md files usually explain to new contributors how to contribute via GitHub. They can also specify other requirements (e.g. a code style). Here is an example from one of the World Bank's repos.

For this project, I think you could specify:

  1. How to contribute via GitHub
  2. The difference between an issue and a PR

If you want, I can submit a PR (a suggested change) and we can continue the conversation there.

minor error in abortion_dd.R

Love these materials. I was just working through them and noticed that in abortion_dd.R I needed to change lines 28 and 29 to

sd = reg$std.error[-1:-25],
mean = reg$coefficients[-1:-25],

Currently, they run [-1:-75]

Typo in Figure 10.8 title

In Chapter 10, there is a typo in figure 10.8.

The plot title says "Back male incarceration" instead of "Black male incarceration." Doesn't look like the code for that figure is included, so I'm not sure where to find the typo and make a pull request.

randomization inference p-values

I'm getting a different p-value than is calculated in ri.py or commented in ri.do. I only know python (excited by the new python code added!) so I'll reference the py file.

I believe the issue is with the line p_value = p_value[p_value['permutation'] == 1], which doesn't calculate a p-value based on either a weak or a strict inequality. Prior to that line, signed t-statistics are ordered and ranked. There are several observations with the same t-stat of 1. So, if we wanted the p-value calculation to use a weak inequality (find the share of observations with a weakly higher ATE), the following minimal edits would do the job.

p_value = p_value[p_value['ate'] == 1] 
p_value['rank'].max() / n

This gives 0.4285.

The simplest code I can think to do the same thing is the following, though it's not the smartest because it relies on permutations instead of combinations.

from itertools import permutations
import pandas as pd
import numpy as np

url = 'https://github.com/scunning1975/mixtape/raw/master/ri.dta'
df = pd.read_stata(url, index_col = 'name')
observed_t_stat = 1
y_vec = df.y.values 

# create vector of treatment assignments
# use -1 instead of 0 for dot product assist
d = np.concatenate( [np.ones(4), (-1)*np.ones(4)] ) 

t_stats = np.array([])
for d_vec in permutations(d):    
    t = np.dot(y_vec, d_vec) / 4 # signed t-stat
    t_stats = np.append(t_stats, t)

p_value = (t_stats >= observed_t_stat).mean()

I'm making this an issue, because I want to check my own understanding (I'm self-studying) and I think there's also the issue of whether or not the code should be using absolute values for the t-statistics to match the book.
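
To make the weak/strict and signed/absolute distinctions concrete, here is a minimal sketch that enumerates assignments with combinations rather than permutations. The outcome values and the treated set are made up for illustration and are not the contents of ri.dta; `sum_diff` is a hypothetical helper.

```python
from itertools import combinations

# Hypothetical outcomes for 8 units (made up; these are NOT the values in ri.dta).
y = [12, 4, 9, 6, 3, 8, 5, 7]
observed_treated = (0, 1, 2, 3)  # assumption: the first four units were treated
n_treated = len(observed_treated)

def sum_diff(treated):
    """Sum of treated outcomes minus sum of control outcomes.

    Dividing by n_treated would give the difference in means; that constant
    scaling does not change any comparison, and integer sums avoid
    floating-point trouble when statistics tie.
    """
    treated_total = sum(y[i] for i in treated)
    return treated_total - (sum(y) - treated_total)

observed = sum_diff(observed_treated)

# Each way of choosing 4 treated units out of 8 is visited exactly once.
stats = [sum_diff(c) for c in combinations(range(len(y)), n_treated)]

p_weak = sum(s >= observed for s in stats) / len(stats)       # share weakly larger
p_strict = sum(s > observed for s in stats) / len(stats)      # share strictly larger
p_two_sided = sum(abs(s) >= abs(observed) for s in stats) / len(stats)

print(len(stats), p_weak, p_strict, p_two_sided)
```

With equal-sized groups every assignment has a mirror image with the opposite sign, so the absolute-value version is twice the one-sided weak version whenever the observed statistic is positive.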

Flip sort to calculate p-value

mixtape/R/ri.R

Line 44 in 3c84896

arrange(ate) %>%

I think this needs to be arrange(desc(ate)) to get the one-tailed p-value correct. Doesn't matter much (still insignificant). Just FYI, here's an alternative (generalized) version of the code (similar to that in #20):

library(tidyverse)
library(haven)

read_data <- function(df)
{
  full_path <- paste("https://raw.github.com/scunning1975/mixtape/master/", 
                     df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

ri <- 
  read_data("ri.dta") %>%
  mutate(id = row_number())

actual_treated <- 
  ri %>% 
  filter(d==1) %>% 
  pull(id)

combo <-
  tibble(treated = combn(nrow(ri), sum(ri$d), simplify = FALSE),
         permutation = 1:length(treated)) %>%
  crossing(ri %>% select(id, y)) %>%
  rowwise() %>%
  mutate(d = is.element(id, treated))

ates <- 
  combo %>% 
  group_by(permutation) %>%
  summarize(te1 = sum(d * y, na.rm = TRUE), 
            te0 = sum((1 - d) * y, na.rm = TRUE)) %>%
  mutate(ate = te1 - te0) %>% 
  arrange(desc(ate)) %>% 
  mutate(rank = row_number(),
         p_value = rank/nrow(.))

actual_permutation <-
  combo %>% 
  rowwise() %>% 
  filter(setequal(actual_treated, treated)) %>% 
  select(permutation) %>% 
  distinct() %>% 
  pull()

ates %>%  
  filter(permutation == actual_permutation)
#> # A tibble: 1 x 6
#>   permutation   te1   te0   ate  rank p_value
#>         <int> <dbl> <dbl> <dbl> <int>   <dbl>
#> 1           1    34    30     4    25   0.357

Created on 2021-02-27 by the reprex package (v1.0.0)
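
The effect of the sort direction can be seen on a toy example. The statistics below are made up, and `rank_of` is a hypothetical helper that takes the last tied position so that the rank corresponds to a weak inequality; this is a sketch of the idea, not the logic in ri.R.

```python
# Hypothetical test statistics, one per possible assignment (made up).
ates = [4, -2, 0, 6, -4, 2, 4, -6]
observed = 4

def rank_of(value, ordered):
    """1-based position of the LAST occurrence of `value` in `ordered`."""
    return len(ordered) - ordered[::-1].index(value)

ascending = sorted(ates)
descending = sorted(ates, reverse=True)

# Rank in ascending order counts values <= observed (the lower tail).
p_from_ascending = rank_of(observed, ascending) / len(ates)
# Rank in descending order counts values >= observed (the upper tail).
p_from_descending = rank_of(observed, descending) / len(ates)
# Direct upper-tail share, with no sorting at all.
p_direct = sum(a >= observed for a in ates) / len(ates)

print(p_from_ascending, p_from_descending, p_direct)
```

Only the descending rank agrees with the direct upper-tail share, which is the one-tailed p-value for a positive effect.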

Set notation in Probability and Regression Review

I am not sure if this is the right place to mention issues with the book's content itself. If not, feel free to close!

In the Probability Tree section, the footnote is not displaying the LaTeX equation correctly (and I believe "union" should be "intersection"):

The set notation \(\cap\) means “union” and refers to two events occurring together.

This seems to occur inside other footnotes too.

Similarly, in the Venn Diagram and Sets section, the symbols ∩ and ∪ are swapped:

Whenever we want to describe a set of events in which either A or B could occur, it is: A∩B. And this is pronounced “A union B,” which means it is the new set that contains every element from A and every element from B. Any element that is in either set A or set B, then, is also in the new union set. And whenever we want to describe a set of events that occurred together—the joint set—it’s A∪B, which is pronounced “A intersect B.”
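
As a quick check on the terminology, Python's built-in sets follow the standard convention: union (∪) collects elements in either set, and intersection (∩) keeps only elements in both. The sets below are arbitrary examples.

```python
A = {1, 2, 3}
B = {3, 4}

union = A | B          # A ∪ B: every element in A or in B
intersection = A & B   # A ∩ B: only elements in both A and B

print(union)
print(intersection)
```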

In the Contingency Tables section, in table 2.4 the table headers describing events B and ~B are mixed up:

| Event labels | Coach is not rehired (B) | Coach is rehired (∼B) | Total |
| --- | --- | --- | --- |
| (A) Bowl game | Pr(A,∼B)=0.1 | Pr(A,B)=0.5 | Pr(A)=0.6 |
| (∼A) no Bowl game | Pr(∼A,∼B)=0.1 | Pr(∼A,B)=0.3 | Pr(B)=0.4 |
| Total | Pr(∼B)=0.2 | Pr(B)=0.8 | 1.0 |

Below equation 2.7 in the same section, the term Pr(A,B) is missing an ending bracket, and equation 2.8 is not tagged correctly.
Hope this is somewhat helpful!
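
A contingency table like this can be sanity-checked by summing the joint probabilities into marginals. The sketch below uses the four joint probabilities quoted in the table and works purely with the symbols A, ∼A, B, ∼B; which of B or ∼B means "rehired" is left open, and `marginal` is a hypothetical helper.

```python
from fractions import Fraction as F

# Joint probabilities as quoted in the table cells.
joint = {
    ("A", "B"): F(5, 10),
    ("A", "~B"): F(1, 10),
    ("~A", "B"): F(3, 10),
    ("~A", "~B"): F(1, 10),
}

def marginal(event):
    """Sum the joint probabilities of every cell that involves `event`."""
    return sum(p for cell, p in joint.items() if event in cell)

for event in ("A", "~A", "B", "~B"):
    print(event, float(marginal(event)))

# All four cells together must account for the whole sample space.
total = sum(joint.values())
print("total", float(total))
```

The marginals come out as 0.6 and 0.4 for the rows and 0.8 and 0.2 for the columns, so the numbers in the table are internally consistent; only the labels are in question.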

abortion_ddd.do

Title says DDD estimate for 15-19 year olds vs. 20-24 year olds, but the code compares 15-19 vs. 25-29 year olds: if bf==1 & (age==15 | age==25)

Mistake in graph in Section 3.1.1

The third graph in Section 3.1.1 shows that B causes Y, but in the text, the following is stated (tenth paragraph in the same subsection, which is the third paragraph after the third graph):
"It is telling what is happening, and it is telling what is not happening. For instance, notice that
B has no direct effect on the child’s earnings except through its effect on schooling."

Perhaps the mistake is that you used the same graph as the last one in Section 3.1.3.

Variant on tea.R

This is just FYI.

I wanted a variant of tea.R where a more causal interpretation could be offered (there is a kind of causation going on here, as discussed below). The code here is a bit more general (and a bit shorter), as I put the guesses as a vector in one column.

library(tidyverse)
library(utils) # For combn function

n_cups <- 8L

# Can choose any n_cups/2 cups to have "milk first"
# The code in the book assumes (wlog) that these are 1, 2, 3, 4.
milk_first <- sample(n_cups, n_cups/2L)
milk_first
#> [1] 3 6 2 8
n_guesses <- length(milk_first)

guesses <-
  tibble(guesses = combn(n_cups, n_guesses, simplify = FALSE)) %>%
  rowwise() %>%
  mutate(correct = setequal(guesses, milk_first)) 

# Look at typical guesses. These are vectors.
guesses$guesses[1]
#> [[1]]
#> [1] 1 2 3 4
guesses$guesses[67]
#> [[1]]
#> [1] 4 5 6 8

sum(guesses$correct)
#> [1] 1
p_value <- mean(guesses$correct)
p_value
#> [1] 0.01428571

guesses %>% filter(correct) %>% select(guesses) %>% pull()
#> [[1]]
#> [1] 2 3 6 8

# Now suppose that protocol is changed. For each cup, Fisher tosses a coin:
# if heads, he puts in milk first; if tails, he puts in tea first.
# Muriel Bristol then sips each of the cups and declares either "milk first" or
# "tea first".
#
# With this protocol, it's easier to see a causal interpretation. Label the
# milk-first cases as treatment; tea-first cases as controls. The question is
# whether treatment leads to the outcome of "milk first". One has a sample of 
# n_cups independent observations and can use the regular framework. But here,
# we just assume that Bristol guessed all correctly and calculate exact p-value
# for that.
#
# (Note that one can apply a similar causal interpretation to the original 
#  protocol, but one couldn't apply the standard framework to it; 
#  no SUTVA for one, as Bristol will declare four and only four cases as 
#  "tea first". 
#
#  Alternatively, if Fisher did not tell Bristol that four
#  cups are milk-first and four are tea-first, then he could use the same 
#  standard framework, as the guesses would become independent again under
#  the sharp null that putting tea or milk first does not affect Bristol's 
#  guess.)
milk_first <- which(sample(c(TRUE, FALSE), n_cups, replace=TRUE))
milk_first
#> [1] 2 7

guess_n <- function(x) combn(n_cups, x, simplify = FALSE)

guesses <-
  tibble(guesses = unlist(lapply(0:n_cups, guess_n), recursive = FALSE)) %>%
  rowwise() %>%
  mutate(correct = setequal(guesses, milk_first))

sum(guesses$correct)
#> [1] 1
p_value <- mean(guesses$correct)
p_value
#> [1] 0.00390625

guesses %>% filter(correct) %>% select(guesses) %>% pull()
#> [[1]]
#> [1] 2 7

Created on 2021-02-27 by the reprex package (v1.0.0)
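
Both exact p-values above can be checked by counting alone, without enumerating the guesses. The sketch assumes, as the code does, that exactly one guess is fully correct under each protocol.

```python
from math import comb

n_cups = 8

# Fixed protocol: exactly half the cups are milk-first, so a guess is a
# choice of 4 cups out of 8.
n_fixed = comb(n_cups, n_cups // 2)
p_fixed = 1 / n_fixed

# Coin-toss protocol: any subset of the cups may be milk-first, so a guess
# is any subset of size 0 through 8.
n_coin = sum(comb(n_cups, k) for k in range(n_cups + 1))
p_coin = 1 / n_coin

print(n_fixed, p_fixed)
print(n_coin, p_coin)
```

The counts are 70 and 256 (that is, 2^8), which matches the p-values of 0.01428571 and 0.00390625 printed by the R code.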

Typo in reference to figure

On this page, there is a reference to "Figure 49", whereas all other figures are referenced as Figure 7.x. Here's the paragraph:

"Figure 49 shows the first stage, and it is really interesting. Look at all those 3s and 4s at the top of the picture. There’s a clear pattern—those with birthdays in the third and fourth quarter have more schooling on average than do those with birthdays in the first and second quarters. That relationship gets weaker as we move into later cohorts, but that is probably because for later cohorts, the price on higher levels of schooling was rising so much that fewer and fewer people were dropping out before finishing their high school degree."

Slight Typo when cross-referencing in book

In section 7.5 of the book there is a bookdown cross-reference (@ref{fig:qob2}) that doesn't get rendered. I am fairly sure it is because you used curly brackets instead of parentheses, but I can't see the source code, so I'm not sure.

(also, thank you so much this book has been a lifesaver)
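
bookdown cross-references are written with parentheses, as in \@ref(fig:qob2). A small scan like the one below could flag any that use curly brackets instead. It is only a sketch: `find_brace_refs` is a hypothetical helper, the sample text is made up (including the label fig:qob1), and the real source files are not visible from here.

```python
import re

# Matches @ref{...} or \@ref{...}, i.e. a cross-reference with curly brackets.
BRACE_REF = re.compile(r"\\?@ref\{([^}]*)\}")

def find_brace_refs(text):
    """Return (line number, matched text, suggested fix) for each bad reference."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in BRACE_REF.finditer(line):
            hits.append((lineno, m.group(0), rf"\@ref({m.group(1)})"))
    return hits

# Made-up sample: the first reference is well formed, the second is not.
sample = (
    r"See Figure \@ref(fig:qob1) for the setup." + "\n"
    + r"Figure \@ref{fig:qob2} shows the first stage."
)

for lineno, found, fix in find_brace_refs(sample):
    print(f"line {lineno}: {found} -> {fix}")
```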

Issues with lmb_5.R and lmb_6.R

lmb_data %>% mutate(demvoteshare_sq = demvoteshare_c^2)

needs to be assigned back to lmb_data, otherwise the new column is discarded:
lmb_data <- lmb_data %>% mutate(demvoteshare_sq = demvoteshare_c^2)

Castle.R issues

@ttp63 was kind enough to notice some issues with the two-way fixed effects model in castle, but it seems that there are some issues with the new version:

  • Collinearity raises a warning because of a singular matrix in both regressions in castle_1.R
  • Inclusion of original Cheng and Hoekstra specification with CDL as the causal regressor doesn't line up with castle_1.do
