This report summarizes the statistical modeling and analysis results obtained from a Bayesian analysis of a dataset collecting the main parameters considered in applications to Master's programs.
The goal of our analysis was to fit a model capable of predicting the probability of admission for students, following a Bayesian approach, and to understand which parameters most affect the probability of being accepted.
Our dataset contains information about 500 Indian students. For each of them we have seven parameters usually considered in applications to Master's programs, and a response variable taking values in (0, 1) that describes the chance of admission.
To get a better idea, let's take a look at the first seven rows of the dataset:
Let’s explore our data:
The GRE is a test widely used in the U.S. (but also in other countries) for admission to Master's programs. It measures verbal reasoning, quantitative reasoning, analytical writing, and critical thinking, by testing algebra, geometry, arithmetic, and vocabulary knowledge.
In general, the minimum and maximum scores a student can obtain are 260 and 340, respectively.
The TOEFL, instead, measures the English language ability of non-native speakers. Its range goes from 0 to 120.
This is a categorical variable which indicates the rank of the university the student comes from. It will be split in dummy variables when implementing the model.
To evaluate the SOP and LOR of each student, they are given a rating from 1 to 5 in steps of 0.5, which indicates their "strength".
The CGPA is the student's Cumulative Grade Point Average at university at the moment of application to the Master's program.
A simple dummy variable indicating whether the student has already done research.
We decided to implement two different Bayesian models:
- Logistic regression model
- Beta regression model
We treat our response variable $Y_i$ as binary, $Y_i \in \{0,1\}$. So we can now assume $Y_i \mid \pi_i \stackrel{ind.}{\sim} \text{Bernoulli}(\pi_i)$, with $\pi_i = P(Y_i = 1 \mid \boldsymbol{x}_i)$.

Now the aim is to model the effect of the predictors on $\pi_i$. In our case we decided to use a logit link function, $g(\pi_i) = \log\frac{\pi_i}{1-\pi_i} = \boldsymbol{\beta}^{\top}\boldsymbol{x}_i$, and so

$$\pi_{i} = g^{-1}\left(\boldsymbol{\beta}^{\top} \boldsymbol{x}_{i}\right)=\frac{e^{\boldsymbol{\beta}^{\top} \boldsymbol{x}_{i}}}{1+e^{\boldsymbol{\beta}^{\top} \boldsymbol{x}_{i}}}$$
Under conditional independence of the observations, the likelihood is

$$\begin{aligned} p(\boldsymbol{y} \mid \boldsymbol{\beta}) &=\prod_{i=1}^{n} p\left(y_{i} \mid \pi_{i}\right) \\ &=\prod_{i=1}^{n} \pi_{i}^{y_{i}}\left(1-\pi_{i}\right)^{1-y_{i}} \\ &=\prod_{i=1}^{n} h\left(\boldsymbol{\beta}^{\top} \boldsymbol{x}_{i}\right)^{y_{i}}\left(1-h\left(\boldsymbol{\beta}^{\top} \boldsymbol{x}_{i}\right)\right)^{1-y_{i}} \end{aligned}$$

where $h = g^{-1}$ is the inverse link. The priors on the components of $\boldsymbol{\beta}$ are independent, weakly informative normal distributions.
Since the prior is neither conjugate nor semi-conjugate, we need MCMC simulation to approximate the posterior distribution of $\boldsymbol{\beta}$.
To implement the Metropolis-Hastings algorithm we use JAGS, a program for the analysis of Bayesian models via Markov chain Monte Carlo (MCMC) simulation, which makes it easy to implement Gibbs samplers and Metropolis-Hastings algorithms.
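JAGS builds and tunes the sampler for us, but the underlying mechanics can be illustrated directly. Below is a minimal random-walk Metropolis sketch for logistic regression in Python; the synthetic data, step size, and prior scale are illustrative assumptions, not the settings of our JAGS run:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for the admissions predictors (illustrative only)
n, p = 400, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def log_posterior(beta):
    """Bernoulli log-likelihood plus independent N(0, 10^2) log-priors."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    logprior = -0.5 * np.sum(beta ** 2) / 100.0
    return loglik + logprior

S, step = 5000, 0.15
beta = np.zeros(p)
draws = np.empty((S, p))
lp = log_posterior(beta)
for s in range(S):
    proposal = beta + step * rng.normal(size=p)   # random-walk proposal
    lp_prop = log_posterior(proposal)
    if np.log(rng.uniform()) < lp_prop - lp:      # Metropolis-Hastings acceptance
        beta, lp = proposal, lp_prop
    draws[s] = beta                               # store the current state
```

With a symmetric proposal the acceptance ratio reduces to the posterior ratio, which is all the sampler needs; JAGS automates the choice and tuning of the proposal.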
MCMC is a numerical technique and hence subject to approximation error. For this reason, before using the output for posterior inference on $\boldsymbol{\beta}$, we checked the convergence of the chain.
The trace plots provide a graphical representation of the Markov chain for each $\beta_j$. The chain should be concentrated within a region of high posterior probability, centered around the mode of the posterior distribution. Neither the trace plots nor the autocorrelation plots show problems: the chains converge around the mode, and the autocorrelation within each chain is low, as desired.
The idea behind the Geweke test is that if the chain has reached convergence, then statistics computed from different portions of the chain should be similar.
Considering two portions of the chain:

- $\boldsymbol\theta_I$: the initial 10% of the chain (of size $n_I$)
- $\boldsymbol\theta_L$: the last 50% of the chain (of size $n_L$)
The Geweke statistic is

$$Z_G = \frac{\bar{\boldsymbol\theta}_I - \bar{\boldsymbol\theta}_L}{\sqrt{\widehat{\mathrm{Var}}(\bar{\boldsymbol\theta}_I) + \widehat{\mathrm{Var}}(\bar{\boldsymbol\theta}_L)}}$$

which is approximately standard normal when the chain is stationary. Large absolute values of the Geweke statistic lead us to reject the hypothesis of stationarity.
-0.23 | -0.32 | 0.72 | 1.7 | 1.4 | 1.07 | 1.26 | -0.5 | 0.6 | -0.37 | -0.56 |
As we can see from the output, the test does not suggest any problems in our chain.
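As a sketch of what the diagnostic computes: R's coda package estimates the variances of the two sample means via a spectral density, so its values differ slightly from this naive Python version, which uses the iid variance of each portion:

```python
import numpy as np

def geweke(chain, first=0.1, last=0.5):
    """Z-score comparing the means of the initial and final chain portions."""
    chain = np.asarray(chain, dtype=float)
    a = chain[: int(first * len(chain))]          # initial 10%
    b = chain[-int(last * len(chain)):]           # last 50%
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(se2)
```

For a stationary, well-mixed chain the statistic should stay within roughly $\pm 2$.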
The effective sample size is not a test but rather a value that quantifies the reduction in the effective number of draws due to the correlation within the chain.
5000 | 5000 | 50000 | 5000 | 5000 | 5000 | 5000 | 5310.3 | 5000 | 5000 | 5238.9 |
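The effective sample size can be estimated as $S / (1 + 2\sum_k \rho_k)$, where $\rho_k$ is the lag-$k$ autocorrelation of the chain. R's coda package uses a spectral-density estimate, so its numbers differ slightly from this naive sketch:

```python
import numpy as np

def effective_sample_size(chain, max_lag=1000):
    """Estimate ESS = S / (1 + 2 * sum of lag autocorrelations)."""
    chain = np.asarray(chain, dtype=float)
    S = len(chain)
    c = chain - chain.mean()
    var = np.mean(c * c)
    denom = 1.0
    for k in range(1, min(max_lag, S - 1)):
        rho = np.mean(c[:-k] * c[k:]) / var
        if rho < 0.05:          # truncate once autocorrelation dies out
            break
        denom += 2.0 * rho
    return S / denom
```

For an (almost) uncorrelated chain the ESS is close to the number of draws, as in our output; strong autocorrelation would shrink it well below $S$.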
Both graphical and formal convergence checks lead us to conclude that the Markov chain we generated provides a good approximation of the true posterior distribution of the regression coefficients. Each column of the output matrix therefore approximates the posterior distribution of the corresponding $\beta_j$.
Now we want to use our model on "new" observations and see whether it can predict if a student will be admitted. As new observations we consider the test data frame, created at the beginning by splitting the original dataset. It contains 125 observations $\boldsymbol{x}^*_j = (x^*_{j,1},\ldots,x^*_{j,11})$ for $j = 1,\ldots,125$. With the posterior draws of $\boldsymbol{\beta}$ we can approximate the posterior predictive distribution of each $y^*_j$ with the following algorithm.
For $s = 1,\ldots,S$:

- For $j = 1,\ldots,125$:
  - Compute $\eta_j^{(s)} = \boldsymbol{\beta}^{(s)\top}\boldsymbol{x}^*_j$
  - Compute $\pi_j^{(s)} = g^{-1}(\eta_j^{(s)})$
  - Draw $y_j^{*(s)}$ from $p(y \mid \pi_j^{(s)})$

After obtaining 5000 values for each $y^*_j$, we take the most frequent outcome (0 or 1) as the point prediction for student $j$.
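This loop can be vectorized; a sketch in Python, assuming the posterior draws are stored in an $(S \times p)$ matrix (names are illustrative):

```python
import numpy as np

def predict_logistic(beta_draws, X_test, rng):
    """Posterior predictive simulation for the logistic model.

    beta_draws : (S, p) matrix of posterior draws of beta
    X_test     : (J, p) matrix of test covariates
    Returns the (S, J) matrix of simulated y*_j^(s) and the majority-vote
    0/1 point prediction for each test subject.
    """
    eta = beta_draws @ X_test.T          # eta_j^(s) = beta^(s)' x_j*
    pi = 1.0 / (1.0 + np.exp(-eta))      # inverse logit
    y_star = rng.binomial(1, pi)         # y_j*(s) ~ Bernoulli(pi_j^(s))
    y_hat = (y_star.mean(axis=0) > 0.5).astype(int)
    return y_star, y_hat
```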
We can compare what we’ve obtained with the observed value from test data:
Looking at the confusion matrix we obtain:

- Overall accuracy: $\approx 81\%$
- Sensitivity: $\approx 84\%$
- Specificity: $\approx 76\%$
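The three metrics follow directly from the confusion-matrix counts; a small helper (illustrative, not the code used for the report):

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true-positive rate), specificity (true-negative rate)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```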
To perform variable selection we introduce a $p$-dimensional binary vector $\boldsymbol{\delta}$, with

$$\delta_{j}=\begin{cases} 1 & \text{if } X_{j} \text{ is included in the model} \\ 0 & \text{if } X_{j} \text{ is not included in the model} \end{cases}$$
So we have that the linear predictor becomes $\eta_i = \sum_{j=1}^{p}\delta_j\beta_j x_{i,j}$, and we assign a prior both on $\boldsymbol{\delta}$ and on $\boldsymbol{\beta}$. The resulting joint prior is the so-called spike & slab prior. We also assigned a prior to the prior inclusion probability $w$, with $\delta_j \mid w \sim \text{Bernoulli}(w)$.
Also in this case JAGS will help us find the posterior distributions of $\boldsymbol{\beta}$ and $\boldsymbol{\delta}$.
Notice that:

- if $\delta_{j}^{(s)} = 0$, then implicitly $\beta_{j}^{(s)} = 0$, so the predictor $X_j$ is not included in the model at iteration $s$;
- if $\delta_{j}^{(s)} = 1$, then the predictor $X_j$ is included in the model at iteration $s$ with coefficient $\beta_{j}^{(s)}$.
Therefore, we can use the frequency of $\delta_j^{(s)} = 1$ along the chain to estimate the posterior probability of inclusion of each predictor. Formally:

$$\hat{P}\left(\delta_j = 1 \mid \boldsymbol{y}\right) = \frac{1}{S}\sum_{s=1}^{S}\delta_j^{(s)}$$
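In code this estimate is just a column-wise mean of the sampled indicators; a toy example with hypothetical draws:

```python
import numpy as np

# Hypothetical (S, p) matrix of sampled inclusion indicators delta^(s)
delta_draws = np.array([[1, 0, 1],
                        [1, 0, 0],
                        [1, 1, 1],
                        [1, 0, 1]])

# Posterior probability of inclusion: frequency of delta_j = 1 per column
inclusion_prob = delta_draws.mean(axis=0)   # 1.0, 0.25, 0.75
```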
We can see the posterior probability of inclusion for each predictor in a plot:
How can we use these results for our predictions? BMA!

The idea: instead of selecting one single model, we use all the models generated by the spike & slab, and compute a prediction of $y$ averaged over all of them. We can use the posterior model probabilities to weight the predictions obtained under each of the visited models.
First we compute the predictive distribution under each of our $S$ models:

$$\begin{aligned} p\left(y^{*} \mid y_{1}, \ldots, y_{n}, \mathcal{M}_s\right) &=\int p\left(y^{*} \mid \boldsymbol{\beta}, \mathcal{M}_s\right) \cdot p\left(\boldsymbol{\beta} \mid \mathcal{M}_s, y_{1}, \ldots, y_{n}\right) d \boldsymbol{\beta} \\ &=\int \text{sampling distribution at } y^{*} \cdot \text{posterior of } \boldsymbol{\beta} \; d \boldsymbol{\beta} \end{aligned}$$

and then we use all these predictive distributions to obtain one final, averaged predictive distribution for our individuals.
This approach is called Bayesian model averaging (BMA):

$$p\left(y^{*} \mid y_{1}, \ldots, y_{n}\right)=\sum_{k=1}^{K} p\left(y^{*} \mid y_{1}, \ldots, y_{n}, \mathcal{M}_{k}\right) p\left(\mathcal{M}_{k} \mid y_{1}, \ldots, y_{n}\right) \quad (1)$$
In general, this requires the computation of posterior model probabilities; but since we are dealing with GLMs, the posterior model probabilities are not available in closed form, so we need a way to approximate (1).
At each iteration we have:

$$\boldsymbol{\beta}^{(s)} = (\beta_1^{(s)},\ldots,\beta_p^{(s)}), \qquad \boldsymbol{\delta}^{(s)} = (\delta_1^{(s)},\ldots,\delta_p^{(s)})$$

and we can think of each draw as coming from one of the visited models. For $s = 1,\ldots,S$:

- Compute $\eta^{(s)} = \delta_1^{(s)}\beta_1^{(s)}x_1^* + \cdots + \delta_p^{(s)}\beta_p^{(s)}x_p^*$ (at each iteration we compute the linear predictor under a possibly different model)
- Compute $\pi^{(s)} = g^{-1}(\eta^{(s)})$
- Draw $y^{*(s)}$ from $p(y \mid \pi^{(s)})$

The output $\{y^{*(1)},\ldots,y^{*(S)}\}$ is then a BMA sample from the predictive distribution of $Y$ for subject $i$.
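A sketch of this BMA sampler for a single subject, assuming the draws of $\boldsymbol\beta$ and $\boldsymbol\delta$ are stored as matrices (names illustrative):

```python
import numpy as np

def bma_predict(beta_draws, delta_draws, x_star, rng):
    """BMA sample from the predictive distribution of Y for one subject.

    At iteration s the linear predictor uses only the variables with
    delta_j^(s) = 1, i.e. the model visited at that iteration.
    """
    eta = (delta_draws * beta_draws) @ x_star   # sum_j delta_j beta_j x_j*
    pi = 1.0 / (1.0 + np.exp(-eta))             # inverse logit
    return rng.binomial(1, pi)                  # y*(s) ~ Bernoulli(pi^(s))
```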
We can now evaluate how good our model is at prediction, by comparing our predictions with the true values from the test data and building a confusion matrix:
We obtained:

- Overall accuracy: $\approx 85.5\%$
- Sensitivity: $\approx 85.6\%$
- Specificity: $\approx 86.6\%$
Comparing these results with the ones obtained from the model with all variables included, we can state that, thanks to spike & slab variable selection and BMA, our predictions improved, especially in terms of specificity, which rose from about 76% to about 86.6%.
We also decided to model the original chance of being admitted, using a Beta regression.
Beta regression models are used for response variables taking values in the interval (0, 1). They are based on the assumption that the response is beta-distributed and that its mean is related to a set of regressors through a linear predictor and a link function. The model also includes a precision parameter, on which we put a prior^[Beta Regression in R - Francisco Cribari-Neto, Achim Zeileis].
Since we assume that $Y_i \sim \text{Beta}(a, b)$, whose mean is $E[Y_i] = \frac{a}{a+b}$, we can reparameterize the distribution in terms of a mean parameter $\mu = \frac{a}{a+b}$ and a precision parameter $\phi = a + b$. This parameterization holds, and we can demonstrate it: by replacing, $\mu\phi = \frac{a}{a+b}(a+b) = a$ and $(1-\mu)\phi = \frac{b}{a+b}(a+b) = b$, so we recover the original shape parameters. We end up with:

- Shape 1 = $a$ = $\mu\phi$
- Shape 2 = $b$ = $(1-\mu)\phi$
We will create a linear model for the mean parameter $\mu$.

To visualize that this new parameterization holds (i.e. the distribution does not change), we can plot some beta densities under the two parameterizations together:
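The same identity can be checked numerically as well; a quick sketch using only the Python standard library (the specific shape values are arbitrary):

```python
import math

def beta_pdf(y, a, b):
    """Density of a Beta(a, b) distribution at y."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return y ** (a - 1) * (1 - y) ** (b - 1) / B

a, b = 2.0, 5.0                     # shape parameterization
mu, phi = a / (a + b), a + b        # mean/precision parameterization
for y in (0.1, 0.3, 0.6):
    # identical densities under both parameterizations
    assert math.isclose(beta_pdf(y, a, b), beta_pdf(y, mu * phi, (1 - mu) * phi))
```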
The linear model for $\mu_i$ uses the logit link: $\log\frac{\mu_i}{1-\mu_i} = \boldsymbol{\beta}^{\top}\boldsymbol{x}_i$. To obtain the parameters of the Beta distribution we also need the precision parameter $\phi$.
The likelihood of the data is

$$p(\boldsymbol{y} \mid \boldsymbol{\beta}, \phi) = \prod_{i=1}^{n} \mathrm{Beta}\left(y_i \mid \mu_i\phi,\,(1-\mu_i)\phi\right),$$

which is a function of both $\boldsymbol{\beta}$ (through $\mu_i$) and $\phi$. As usual, we put a prior on $\boldsymbol{\beta}$ and a prior on $\phi$.
With these elements, we implemented a Metropolis-Hastings algorithm using JAGS.
The JAGS model has:

- as likelihood: $Y_i \stackrel{ind.}{\sim} \text{Beta}(\tilde{a}_i,\tilde{b}_i)$, with $\tilde{a}_i=\mu_i \phi$ and $\tilde{b}_i=(1-\mu_i) \phi$
- as linear predictor: $\log \frac{\mu_i}{1-\mu_i}=\boldsymbol{\beta}^{\top}\boldsymbol{x}_i$

while the priors are:

- for $\boldsymbol{\beta}$: $dN(0, 0.001)$ (in JAGS the second argument is a precision, so this is a normal with variance 1000)
- for $\phi$: $dGamma(0.01, 0.01)$
We decided to be weakly informative and let the model "learn" from the data.
We can observe that the chains are all centered around the mode, as desired. The graphical checks show no unusual patterns, and convergence near the mode is reached early, so we don't need thinning or a burn-in period.
Now let’s see if the diagnostic tests show problems:
-0.89 | 1.4 | 0.96 | -0.47 | 0.08 | 0.16 | 0.14 | -0.9 | -0.58 | -0.64 | 0.19 |
5000 | 5000 | 50000 | 5000 | 5000 | 5000 | 5000 | 4555.9 | 5000 | 5000 | 5000 |
Then, to visualize how posterior values of the parameters are distributed we can use boxplots again:
After creating the model, we used the posterior distribution of the beta parameters to approximate the posterior predictive distribution of $y^*$ based on the test observations $x^*$ from the split made earlier.
Since in the test vector we have the original chances of admission, we can compare them with our prediction in order to evaluate the predictive performance of the model.
To approximate the posterior predictive distribution of each $y^*_j$ we proceed as follows.
For $s = 1,\ldots,S$:

- For $j = 1,\ldots,125$:
  - Compute $\eta_j^{(s)} = \boldsymbol{\beta}^{(s)\top}\boldsymbol{x}^*_j$
  - Compute $\mu_j^{(s)} = g^{-1}(\eta_j^{(s)})$
  - Compute the parameters $a_j^{(s)}=\mu_j^{(s)}\phi^{(s)}$ and $b_j^{(s)}=(1-\mu_j^{(s)})\phi^{(s)}$
  - Draw $y^{*(s)}_j\sim \text{Beta}(a_j^{(s)},b_j^{(s)})$

Now we can compute the mode of each of the 125 posterior predictive distributions $(y^*_1,\ldots,y^*_{125})$ and compare it with the observed values:
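A vectorized sketch of this sampler in Python, assuming the posterior draws are stored as arrays (names illustrative):

```python
import numpy as np

def predict_beta_reg(beta_draws, phi_draws, X_test, rng):
    """Posterior predictive simulation for the beta regression model.

    beta_draws : (S, p) posterior draws of beta
    phi_draws  : (S,)  posterior draws of the precision phi
    Returns the (S, J) matrix of y*_j^(s) ~ Beta(mu*phi, (1-mu)*phi).
    """
    eta = beta_draws @ X_test.T                # eta_j^(s)
    mu = 1.0 / (1.0 + np.exp(-eta))            # inverse logit
    a = mu * phi_draws[:, None]                # shape 1
    b = (1.0 - mu) * phi_draws[:, None]        # shape 2
    return rng.beta(a, b)
```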
We can also calculate the MSE between the predicted modes and the observed values:
To perform variable selection in our beta regression we used a spike & slab prior, as in the logistic case. Also here we assigned priors to $\boldsymbol{\beta}$ and $\boldsymbol{\delta}$ such that their joint prior is the spike & slab prior. We also assigned a prior on the precision parameter $\phi$; to be non-informative, we used a $dGamma(0.01, 0.01)$ prior.
We can look at a barplot of the estimated posterior probability of inclusion for each predictor:
As we did in the logistic case, we can implement a BMA strategy, starting from the result of the spike & slab variable selection, to predict new values $Y^*$. We used an algorithm equal to the one for the predictions above; the only difference is that the linear predictor is

$$\eta^{(s)} = \sum_{j=1}^{p}\delta_j^{(s)}\beta_j^{(s)}x_j^{*}$$
The BMA strategy did not improve the MSE, but the MSE is a sort of "frequentist" ground for comparison. The strength of Bayesian prediction is that we obtain a distribution for each future observation instead of a single fitted value. A better idea is to compare the posterior predictive distributions before and after the spike & slab variable selection with the true values from the test data, for four different students:
From the results of our analysis we can state that the logistic model performs very well, but it can only tell a student whether he or she has a high or low chance of being admitted. With the beta regression model, instead, we can provide each individual with a distribution for their probability of being admitted. Comparing the modes of these distributions with the true values from the test dataset, the model does not show high precision, but this comparison is quite limiting for a Bayesian approach, in which we get a whole distribution for each individual's chance instead of a single fitted value.