
fastBigStat

A fast, practical introduction to asymptotic normality of statistical estimators.

Includes the necessary support material on random vectors and the multivariate (MV) normal (Gaussian) distribution family, as well as some insight into statistical inference based on normal distributions.

See also my fast introduction to statistics.

Author: Norm Matloff, UC Davis; bio

Prerequisites

  • Calculus-based probability theory -- random variables, cdfs, density functions, expected value etc.

  • Linear algebra.

  • Familiarity with limits.

Notation

  • Vectors will be in column form.

  • Prime (') will indicate matrix transpose.

  • The identity matrix, of whatever given size, will be denoted by I.

Some Preliminaries

  • Consider a random vector X = (X1,...,Xp)'.

  • E(X) means the vector of componentwise means E(Xi).

  • Covariance

    • For random variables U and V, with means α and β, define

      Cov(U,V) = E[(U-α)(V-β)]

      Intuitively, if U tends to be above its mean α at the same times that V is above its mean β, the covariance will be positive, and so on. It's like correlation; in fact the latter is a scaled version of covariance,

      ρ(U,V) = Cov(U,V) / [sqrt(Var(U)) sqrt(Var(V))]

    • The covariance matrix Cov(X) of a random vector X = (X1,...,Xp)' is a p x p matrix with (i,j) element Cov(Xi,Xj), which reduces to Var(Xi) when i = j.

    • In matrix terms, Cov(X) = E[(X - μ) (X - μ)'], where μ = E(X).

    • Let Σ be a covariance matrix. As a real, symmetric matrix, it can be diagonalized, i.e. Σ = P D P' for an orthogonal matrix P and a diagonal matrix D. Let W = P D^{1/2} P', where the square root matrix D^{1/2} simply takes the square roots of the diagonal elements of D. Then W^2 = Σ, and we can thus write W = Σ^{1/2}.

  • Law of Iterated Expectation

    • Let U and V be random quantities (scalars here, for convenience) for which the expectations below exist. Then

      E[ E(V|U) ] = E(V)

      Note that E(V|U) is treated as a random variable, which it in fact is, since it is a function of U.

    • Intuition: Suppose we wish to find the mean height of all UC Davis students. We could ask each academic department to find the mean height of their students, then average those mean heights. (Since some departments are larger than others, that latter average is weighted.)

  • Role of the Data

    • Say we have data W1,...,Wn.

    • The statistical view is to treat the data as a random sample from a population.

    • In computer science, the data are often viewed as fixed, but some analyses describe the data as coming from a "data generating mechanism" and the like.

    • We take the statistical view here. Typically the data are assumed independent and identically distributed (iid). That first trait arises from sampling with replacement (for large n, not much of an issue), and the second means that each entity in the population has the same probability of being sampled.

    • Note that this means that each Wi has the distribution of the population. E.g. if 29% of the units (e.g. people) in the population have value below 221.4, then P(Wi < 221.4) = 0.29.

      Similarly, if the population quantity is normally distributed, that means each Wi has a normal distribution (with the same mean and variance as the population).

    • The sample mean

      Wbar = (1/n) (W1+...+Wn)

      being the sample analog of the population mean, is a natural estimate (i.e. one that we might think of first if we were inventing statistics) of the population mean E(W), where W has the distribution of the population.

    • Similarly, since

      Var(W) = E[(W - E(W))^2],

      a natural estimator of Var(W) is

      s^2 = (1/n) Σ_i (Wi - Wbar)^2

      Suppose now W is a vector. The similar estimator of Cov(W) is

      (1/n) Σ_i (Wi - Wbar) (Wi - Wbar)'

      Many books use n-1 rather than n as the divisor, to make s^2 unbiased, i.e.

      E(s^2) = Var(W)

      This says that the average of s^2, over all possible samples, is equal to the population variance. This usually won't matter for us, as we are primarily interested in large-n settings. (An R sketch of these estimators, and of the matrix square root Σ^{1/2} above, follows.)
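
As a quick illustration of the estimators above and of the matrix square root Σ^{1/2}, here is a minimal R sketch (the simulated data and variable names are my own choices, not part of the original text):

    # simulate n observations of a 2-component random vector W
    set.seed(1)
    n <- 1000
    w <- cbind(rnorm(n), rexp(n))

    wbar <- colMeans(w)                       # sample mean vector
    sigmahat <- cov(w) * (n-1)/n              # sample covariance matrix, divisor n

    # square root of a covariance matrix via diagonalization
    eig <- eigen(sigmahat)                    # sigmahat = P D P'
    sqrtSig <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)
    max(abs(sqrtSig %*% sqrtSig - sigmahat))  # essentially 0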

"Exact" vs. Approximate Statistical Inference: Historical Perspective

  • Say we have scalars Wi as above, with E(Wi) = μ and Var(Wi) = σ^2. The latter two are population values, and we wish to estimate μ in particular.

  • When statistical inference methodology -- confidence intervals (CIs) and hypothesis tests -- was being developed, there were no computers, so the pioneers simply assumed that the sampled quantity, W above, has a normal distribution, to simplify things.

  • The pioneers of statistics didn't have asymptotic theory available either, so the methodology they developed is called "exact"; the probabilities etc. are exactly correct (if the assumption of normal Wi holds).

  • The notion of the normal distribution goes back to the 18th and 19th centuries, especially due to the Central Limit Theorem (CLT).

  • One might say that the normal family was there in Nature, waiting to be discovered.

  • But related distribution families, notably the Student-t, chi-squared and F distributions, are "man-made," in the following sense.

  • With some clever math, they found that certain quantities of interest had the same distribution as that of the sum of k squares of N(0,1) variables. They thus decided to give this distribution family a name, chi-squared, thus inventing a new distribution family. (And they laboriously tabulated this family's cdf by hand, just as they had for the normal.)

  • With some more clever math, they found that a certain quantity had the distribution of a N(0,1) random variable divided by the square root of an independent chi-squared random variable (divided by its degrees of freedom). So they invented a new distribution family accordingly, the Student-t distribution. For similar reasons, they invented the F-distribution family.

  • This gives us the test taught in all elementary statistics courses: To test the hypothesis H0: μ = c, form the expression T = (Wbar - c) / (s/n^{0.5}) (where s is as above, but with an n-1 divisor). This has a Student-t distribution with n-1 degrees of freedom, whose probabilities can then be used to form cutoff levels for the test.

  • For n = 50, for instance, the lower and upper 2.5% cutoffs are -2.01 and 2.01. So, we reject H0 if (Wbar - c) / (s/n^{0.5}) is outside these bounds. (A short R comparison of the exact and approximate cutoffs appears at the end of this list.)

  • This is called "exact" inference, since the probabilities are exact, if the Wi are normal.

  • Later other distribution families were developed, such as the exponential and gamma, and thus exact methods were developed for them as well. This led to development of a general method known as maximum likelihood estimation (MLE).

  • Yet, people soon realized that in the expression T above, Wbar is approximately normal by the Central Limit Theorem. Moreover, the quantity s is approximately σ. So T is approximately N(0,1) distributed (this intuitive statement can be made rigorous; see below), even if W is not normal.

  • In other words, one can compare T to the N(0,1) distribution, and thus achieve approximately correct inference, even if the sample distribution was not normal. Hence the practice of approximate inference.

  • This led to a branch of mathematical statistics known as large sample theory. A core result is that if a sequence of random variables Zn is asymptotically normal, then so is any smooth function g(Zn) of it. Since the CLT says, roughly, that sums are approximately normal, this gives us an asymptotically normal sequence Zn, which when combined with a suitable function g yields asymptotic normality in many useful situations.

  • E.g. for MLEs. The likelihood is a product, so the log likelihood is a sum, just what we need for the Central Limit Theorem! It doesn't quite fit the CLT, as the terms are not independent, but that's enough; some additional theory takes care of the non-independence.

    We set the derivative of the log likelihood to 0, and solve for our parameter estimate. Here g becomes the inverse function that is needed to solve for the estimate, which is then approximately normal.

    The context here is a little different -- we are still assuming an exact distribution, e.g. exponential, for the sampled population, but the principle is the same. Without large sample theory, in most cases we would not be able to derive the exact distribution of the MLE.

  • The situation with linear regression models is similar, as will be seen below.

  • The famous statistician George Box famously said, "All models are wrong, but some are useful." No parametric model is truly correct in practice. No one is 10 feet tall, and no one's weight is negative, so heights and weights cannot be exactly normal. No actual random variable is continuous, as our measuring instruments are of only finite precision, so again models such as normal, exponential, gamma and so on are necessarily inexact. Even the iid assumption is often problematic. So the term exact inference is illusory.

  • Approximate methods are very popular, including with me. I always use N(0,1) instead of the Student-t distribution, for instance. But many analysts today still favor "exact" statistical methods. They note, for example, that there is always the question of "How large is 'large'?" if one uses asymptotics.
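
To make the contrast concrete, here is a small R sketch (the simulated data are my own choice) comparing the "exact" Student-t cutoff with the N(0,1) approximation:

    set.seed(1)
    w <- rexp(50, rate = 0.5)    # decidedly non-normal data, n = 50
    n <- length(w)
    c0 <- 2.0                    # hypothesized mean, H0: mu = c0
    tstat <- (mean(w) - c0) / (sd(w)/sqrt(n))

    qt(0.975, df = n-1)          # "exact" cutoff, about 2.01
    qnorm(0.975)                 # large-sample cutoff, about 1.96
    abs(tstat) > qnorm(0.975)    # approximate test decision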

The Multivariate Normal Distribution Family

  • Consider a random vector X = (X1,...,Xp)'. The mean is now a vector μ, and the variance σ^2 now becomes the p x p covariance matrix Σ. Notation: N(μ,Σ).

  • Density: c exp[-0.5 (t-μ)' Σ^{-1} (t-μ)], where the constant c makes the p-fold integral equal to 1. This is for the case in which Σ is invertible.

  • We will always assume Σ is invertible unless stated otherwise. This simply means that no Xi is an exact linear combination of the others.

  • For p = 1 we have the familiar "bell-shaped curve", while for p = 2, we get a 3-dimensional bell.

(Figure: bivariate normal density surface)

  • Every marginal distribution of X is also multivariate normal.

  • For a constant m x p matrix A, AX is also multivariate normal. In particular, if m = 1, then AX is univariate normal; the converse is also true, i.e. if a'X is normal for all a, then X is multivariate normal.

  • Conditional distributions: For simplicity, consider the conditional distribution of Xj given the other components, (X1,...,X_{j-1}, X_{j+1},...,Xp) = w. Then (inspiration for the familiar linear model):

    • That distribution is normal.

    • The mean of that distribution is linear in w, i.e. of the form c'w for some c. (A fairly simple formula for c is available but not shown here.)

    • The variance of that distribution is constant in w.

  • For a symmetric matrix A and vector u, the scalar quantity u'Au is called a quadratic form. For p-variate normal X, the quadratic form M = (X-μ)' Σ^{-1} (X-μ) turns out to have a chi-squared distribution with p degrees of freedom.

  • So if η is the q-th quantile of the χ^2 distribution with p degrees of freedom, then

    P(M < η) = q

    (A quick simulation check of this appears after this list.)

  • In the above picture, horizontal slices, i.e. level sets, are ellipses. For p > 2, the regions are p-dimensional ellipsoids. This can be used to derive confidence regions in statistical applications, as will be seen later.
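
Here is that simulation check, using MASS::mvrnorm (discussed further later in this document); the particular μ and Σ are arbitrary choices of mine:

    library(MASS)
    set.seed(1)
    mu <- c(1, 2)
    sigma <- rbind(c(2, 1),
                   c(1, 3))
    x <- mvrnorm(10000, mu, sigma)          # 10000 draws of a bivariate normal
    d <- x - matrix(mu, nrow(x), 2, byrow = TRUE)
    m <- rowSums((d %*% solve(sigma)) * d)  # (X-mu)' Sigma^{-1} (X-mu), row by row
    mean(m < qchisq(0.90, df = 2))          # should be near 0.90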

Example: Linear Models, Exact and Approximate Inference

Let's see how all this plays out with the linear model.

Let Y be a random variable and X = (X1,...,Xp)' a p-dimensional random vector. E.g. we might wish to predict Y = college GPA from X1 = high school GPA, X2 = SAT score and so on. Let's call this the "GPA example."

Classical linear model

The classic assumptions of the model are as follows. Let t = (t1,...,tp)'.

  • Linearity: E(Y | X = t) = β0 + β1 t1 + ... + βp tp, for some unknown population vector β = (β0, β1,..., βp)'.

  • Normality: The distribution of Y | X = t is normal for all t.

  • Homoscedasticity: Var(Y | X = t) is the same for all t, which we will denote by σ^2.

  • Independence: The random vectors (X1,Y1),..., (Xn,Yn) are independent.

The function r(t) = E(Y|X=t) is called the regression function of Y on X. The term does NOT necessarily mean a linear model. So the first bulleted item above says, "The regression function of Y on X is linear."

Fixed-X vs. random-X models.

  • In many if not most applications, the Xi are random, so that the vectors (X1,Y1),..., (Xn,Yn) are iid. In our GPA example above, we have sampled n students at random, and their X values -- HS GPA, SAT -- are thus random as well.

  • But historically, the X values have usually been taken to be fixed. A typical example would be, say, some industrial process in which a designed experiment is conducted, with pre-specified X values.

  • Today, the fixed-X view is still the typical one. In the GPA example, it would mean that we had sampled students with pre-specified HS GPA and SAT values, which would probably not be the case. However, we can still take the X values to be fixed, by conditioning on them.

  • In the fixed-X setting, we no longer have the second 'i' in 'iid', but it doesn't matter much. For instance, there are versions of the CLT that cover this situation.

Estimation

  • One estimates β from the data, using the formula for the estimate b = (b0, b1,...,bp)':

    b = (A'A)^{-1} A'W

    where A is the n x (p+1) matrix with all 1s in column 1 and, in column j > 1, all n values of X_{j-1}. Here W is the vector of all n values of Y. (An R sketch of these formulas, checked against R's lm function, appears at the end of this list.)

  • Write column j+1 of A as (X_{j1},...,X_{jn})'. If, say, Xj is human age, then this vector consists of the ages of all the people in our dataset.

  • The vector b, as a linear combination of W, has an exact MV normal distribution if Y|X is exactly normally distributed.

  • By the way, since

    E(Y | X) = X_extend' β

    where X_extend = (1,X')'

    we have that

    E(W | A) = A β

    and thus

    E(b | A) = β

    (unbiasedness).

  • What about Cov(b|A)? Start with the fact that, by assumption,

    Cov(W|A) = σ^2 I

    Write B = (A'A)^{-1} A', so that b = BW. Then after some algebra, we have

    Cov(b|A) = B (σ^2 I) B' = σ^2 (A'A)^{-1}

    Note that A is a function of the X data, but here we take it to be a constant, by considering the conditional distribution of b given A.

    Thus the confidence intervals (CIs) and regions we obtain below are also conditional. However, note that if, e.g., we have a conditional CI at the 95% level, then the unconditional coverage probability is also 95%, by the Law of Iterated Expectation above.
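
Here is a small R sketch of these formulas, using the built-in mtcars data (my choice of example); it checks the matrix expressions for b and for the estimated Cov(b|A) against R's lm and vcov:

    data(mtcars)
    w <- mtcars$mpg                          # the Y values
    a <- cbind(1, mtcars$qsec, mtcars$wt)    # the A matrix: a 1s column, then the X columns
    b <- solve(t(a) %*% a) %*% t(a) %*% w    # b = (A'A)^{-1} A'W

    fit <- lm(mpg ~ qsec + wt, data = mtcars)
    cbind(b, coef(fit))                      # the two agree

    n <- nrow(a); p <- 2
    s2 <- sum(fit$residuals^2) / (n - p - 1) # estimate of sigma^2 (see the next subsection)
    s2 * solve(t(a) %*% a)                   # matches vcov(fit)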

Exact inference

  • The unknown parameter σ^2 must be estimated from the data, as

    s^2 = [1/(n-p-1)] Σ_i (Yi - Ri b)^2

    where Ri is row i of A.

  • So the estimated conditional covariance matrix of b is

    s^2 (A'A)^{-1}

  • The stringent assumptions above enable exact statistical inference. This enables exact confidence intervals and tests:

    • The quantity bi - βi, divided by the estimated standard deviation (square root of element i+1 in the estimated Cov(b|A)), has a Student t-distribution with n-p-1 df, thus setting up a CI for βi.

    • In some cases, we may be interested in something like, say,

      β2 - β1 = c'β

      where c = (0,-1,1,0,0,...,0)'. Since c'b will then have a univariate normal distribution (recall the above material on MV normal distributions), we again can easily obtain a CI for the desired quantity. Note that we need the estimated variance of c'b, which is

      s^2 c'(A'A)^{-1} c

Approximate inference

The quantity A'W consists of sums, so b is asymptotically MV normally distributed. Thus for large n, we can perform inference without assuming a normal Y|X:

CIs for c'β: An approximate 95% confidence interval for, e.g. β2 - β1 is

c'b ± 1.96 [s^2 c'(A'A)^{-1} c]^{0.5}
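
A minimal R sketch of such an interval, continuing the mtcars example above (here β1 and β2 are the coefficients of qsec and wt in that illustrative model):

    fit <- lm(mpg ~ qsec + wt, data = mtcars)
    cvec <- c(0, -1, 1)                               # picks off beta2 - beta1
    est <- sum(cvec * coef(fit))                      # c'b
    se <- sqrt(drop(t(cvec) %*% vcov(fit) %*% cvec))  # sqrt of s^2 c'(A'A)^{-1} c
    est + c(-1.96, 1.96) * se                         # approximate 95% CI for c'beta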

Relaxing the assumptions

Just as Box pointed out that no model is perfect, no assumption (really, part of a model) is perfect. Let's look at the normality assumption (Y|X is normally distributed), and the homoscedasticity assumption, that says Var(Y|X=t) is constant in t.

What can we say, assuming that linearity still holds?

  • Typically, the larger E(Y|X=t) is, the larger Var(Y|X=t) is. Say we are predicting weight from height. Since taller people tend to have more (interperson) variability in weight, homoscedasticity would be seriously violated.

  • If so, the estimator b will not be optimal, i.e. not Minimum Variance Unbiased. For optimality, weighted least squares must then be used, with weights equal to the reciprocal of the conditional variance. That is, we should minimize

    Σ_i [Var(Y|X=Xi)]^{-1} [Yi - b'Xi]^2

    The word "should" applies here, as we usually don't know that conditional variance function. (A brief R sketch of weighted least squares appears after this list.)

  • But in the sense of consistency, it doesn't matter what weights we use. If the linearity assumption holds, then b will still be a consistent estimator of β, i.e. b → β as n → ∞. (See informal proof at the end of this section.)

  • As noted earlier, lack of normality of Y|X alone does not invalidate statistical inference, due to the CLT. But assuming homoscedasticity when it is not approximately correct does invalidate inference, as the conditional covariance matrix of b is no longer σ^2 (A'A)^{-1}.
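
If the conditional variance function were known, or modeled, weighted least squares is available directly through lm's weights argument. A sketch, under the purely illustrative assumption that Var(Y|X=t) is proportional to the qsec value:

    # hypothetical variance model, for illustration only; in practice the
    # conditional variance function would have to be estimated
    fitw <- lm(mpg ~ qsec + wt, data = mtcars, weights = 1/qsec)
    coef(fitw)    # compare with the unweighted (OLS) coefficients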

The sandwich estimator

Concerning that latter point, an important application of large-sample theory is the sandwich estimator, which makes inference asymptotically valid even when Var(Y|X=t) is not constant in t. It makes a simple adjustment to the estimated Cov(b) matrix computed under the assumption of homogeneous variance.

One of the CRAN packages that implements this adjustment is named, of course, "sandwich." Here is an example:

library(sandwich)
z <- lm(mpg ~ ., data = mtcars)  # built-in R dataset
covz <- vcov(z)  # estimated covariance matrix of b, assuming homoscedasticity
covzadj <- sandwich(z)  # adjusted version of covz
covz['carb','carb']  # prints 0.6868307
covzadj['carb','carb']  # prints 0.4209969
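
Continuing that example, one might use the adjusted variance to form an approximate 95% CI for the 'carb' coefficient (a sketch, assuming the sandwich package is installed):

    seadj <- sqrt(covzadj['carb','carb'])       # heteroscedasticity-robust standard error
    coef(z)['carb'] + c(-1.96, 1.96) * seadj    # approximate 95% CI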

Informal proof of consistency

For convenience, assume Random-X regression. Let w(t) be our weight function.

Note that for any nonnegative weight function w(t), the quantity

E[w(X)(Y - m(X))^2]

is minimized across all functions m by m(t) = E(Y|X=t), as can be seen by conditioning on X and applying ordinary calculus. So if linearity holds,

E[w(X)(Y - v'X)^2]

achieves its minimum at v = β for any weight function w, in particular the w having constant value 1. In other words,

E[(Y - v'X)^2]

achieves its minimum at v = β.

Say we blindly do OLS (ordinary least squares, i.e. unweighted), i.e. we choose v = b to minimize

(1/n) Σ_i [Yi - v'Xi]^2

which is the sample analog of

E[(Y - v'X)^2]

For each fixed v, that sample quantity will converge to the corresponding population quantity. Then intuitively, the v that minimizes

(1/n) Σ_i [Yi - v'Xi]^2

will converge to the v that minimizes

E[(Y - v'X)^2]

i.e. b will converge to β. However, a formal proof of this requires some real analysis.

No-intercept models

In some settings, it is known from the nature of the problem that β0 = 0. Thus our model is

E(Y | X = t) = β1 t1 + ... + βp tp

R's lm function allows us to specify this via -1 ("delete the constant term") in the model specification.

In simulation settings, it's often inconvenient to have an intercept term, but one can always transform to the no-intercept setting by subtracting from each X feature, and from Y, its mean. This is easily done using the scale function, with center set to TRUE. (A sketch of both approaches follows.)
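
A minimal R sketch of both approaches (mtcars again, my choice of example):

    # specify a no-intercept model directly
    fit0 <- lm(mpg ~ qsec + wt - 1, data = mtcars)

    # or center everything first; the fitted intercept is then essentially 0
    ctr <- as.data.frame(scale(mtcars[, c('mpg','qsec','wt')],
                               center = TRUE, scale = FALSE))
    fitc <- lm(mpg ~ qsec + wt, data = ctr)
    coef(fitc)['(Intercept)']    # essentially 0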

Multiple Inference Procedures

Suppose we form two 95% confidence intervals from our dataset. Though they are individually valid at the 95% level, this is not the case jointly. In other words, in repeated sampling, the proportion of samples in which both CIs contain the associated population parameter in the same sample is less than 95%. What if we want to form many CIs, and still have their joint confidence level at or above 95%?

This branch of statistics is called multiple inference or simultaneous inference.

The Bonferroni Inequality

One solution is to use the Bonferroni Inequality, which says that for any two events A and B,

P(A and B) ≥ 1 - P(not A) - P(not B)

This implies that if we want the joint confidence level of two CIs to be at least 95%, we can form each one at the 97.5% level. (A numeric sketch follows.)
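
For instance, with the mtcars model used earlier (an arbitrary illustration), two Bonferroni-corrected intervals with joint level at least 95%:

    z <- lm(mpg ~ qsec + wt, data = mtcars)
    mult <- qnorm(1 - 0.025/2)        # two-sided multiplier for a 97.5% CI, about 2.24
    se <- sqrt(diag(vcov(z)))
    coef(z)['qsec'] + c(-1, 1) * mult * se['qsec']
    coef(z)['wt']   + c(-1, 1) * mult * se['wt']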

The Scheffe' method

But this conservative approach yields very wide intervals if we wish to form more than two or three CIs. Another approach is the Scheffe' method, based on confidence ellipsoids.

How to form confidence ellipsoids

  • Consider the linear model first. If all the assumptions hold, the quantity

    (b-β)' [s^{-2} (A'A)] (b-β) / (p+1)

    has an F-distribution with (p+1, n-p-1) df. This is the finite-sample version of the quadratic form discussed earlier. This sets up a confidence ellipsoid for β:

    To form a (1-α) 100% confidence ellipsoid for β, let q be the upper-α quantile of the F-distribution with (p+1, n-p-1) df. Then the confidence ellipsoid is the set of all t such that

    (b-t)' [s^{-2} (A'A)] (b-t) ≤ (p+1) q

  • For large n, and now not assuming sampling from a normal population, we form an approximate (1-α) 100% confidence ellipsoid for β as follows: Let q be the upper-α quantile of the χ^2 distribution with p+1 df. Then the confidence ellipsoid is the set of all t such that

    (b-t)' [s^{-2} (A'A)] (b-t) ≤ q

  • More generally, consider any asymptotically normal estimator θest of a p-dimensional population parameter θ. Let Q denote its estimated (asymptotic) covariance matrix, and let q be the upper-α quantile of the χ^2 distribution with p df. The confidence ellipsoid for the true θ is the set of all t such that

    (θest - t)' Q^{-1} (θest - t) ≤ q

    (An R sketch of the large-n ellipsoid check appears after this list.)
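
Here is that sketch: it checks whether a candidate value t of β lies in the large-n 95% ellipsoid (mtcars again; the candidate is arbitrary):

    fit <- lm(mpg ~ qsec + wt, data = mtcars)
    b <- coef(fit)
    qinv <- solve(vcov(fit))     # estimate of [s^2 (A'A)^{-1}]^{-1} = s^{-2} (A'A)
    tcand <- c(20, 1, -5)        # an arbitrary candidate value for beta
    lhs <- drop(t(b - tcand) %*% qinv %*% (b - tcand))
    lhs <= qchisq(0.95, df = 3)  # TRUE if tcand lies inside the 95% ellipsoid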

Forming simultaneous confidence intervals

Now let's see how we can go from confidence ellipsoid to confidence intervals for scalar quantities, and in such a way that the intervals have a given simultaneous confidence level.

  • Again, consider any asymptotically normal estimator θest of a p-dimensional population parameter θ. Let Q denote its estimated (asymptotic) covariance matrix, and form the confidence ellipsoid as above.

  • Consider a linear combination w'θ. Set b_w and c_w such that the lines (p=2), planes (p=3) or hyperplanes (p > 3)

    w'θ = b_w

    and

    w'θ = c_w

    are tangent to the ellipsoid from above and below. Say c_w is the smaller of the two.

  • Let's visualize this, using the CRAN package ellipse. Its function ellipse.lm draws a confidence ellipse for the output of a linear model.

    library(ellipse)
    fit <- lm(mpg ~ qsec + wt, data = mtcars)
    # 90% confidence ellipse for the qsec and wt coefficients
    plot(ellipse(fit, which = c('qsec', 'wt'), level = 0.90), type = 'l')
    # draw a pair of parallel tangent lines, w = (2.5,-1)'
    abline(-9.05, 2.5)
    abline(-5.70, 2.5)

    This is an example from the package, using a 90% confidence level.

(Figure: confidence ellipse with a pair of parallel tangent lines)

  • With 90% probability, θ is somewhere inside the ellipsoid. (Note that it is the ellipsoid that is random, not θ.)

  • And, if θ is in the ellipsoid, then w'θ will be in (c_w, b_w), though not only if; it is clear that there are some values of θ that are outside the ellipse but still between the two lines. In other words, (c_w, b_w) is a confidence interval for w'θ of level at least 90%.

  • For each possible w, we get the above pair of tangents. Geometrically it is clear that θ will be inside the ellipsoid if and only if w'θ is in (c_w, b_w) for all w.

  • Thus the probability that all the intervals (c_w, b_w) hold simultaneously is 90%. (A numeric sketch follows this list.)
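
For the tangent values themselves, maximizing and minimizing w't over the ellipsoid {t: (θest - t)' Q^{-1} (θest - t) ≤ q} gives b_w, c_w = w'θest ± sqrt(q w'Qw). A sketch for the large-n chi-squared ellipsoid, with mtcars once more and an arbitrary choice of w:

    fit <- lm(mpg ~ qsec + wt, data = mtcars)
    b <- coef(fit)
    Q <- vcov(fit)                        # estimated covariance matrix of b
    qval <- qchisq(0.90, df = length(b))  # 90% level, chi-squared with p+1 df
    w <- c(0, 1, -1)                      # e.g. the difference of the qsec and wt coefficients
    half <- sqrt(qval * drop(t(w) %*% Q %*% w))
    sum(w * b) + c(-1, 1) * half          # (c_w, b_w), the Scheffe-style interval for w'theta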

Simulation of MV Normal Random Vectors

Suppose we wish to generate random vectors V having a p-variate normal distribution with mean ν and covariance matrix Γ. How can we do this?

We could find a square root matrix G of Γ, e.g. via the diagonalization described earlier. Then we generate a vector X consisting of p independent N(0,1) random variables, and set

V = G X + ν

Since X has covariance matrix equal to the identity matrix I, V will have the specified mean and covariance matrix, and it will be MV normal by linearity.

R is famous for its CRAN package collection, so it is not surprising that this operation has already been coded, first in the early package MASS and later in others. The function MASS::mvrnorm does this.

What about regression contexts, say linear models with the standard assumptions? The model

E(Y | X = t) = β0 + β1 t1 + ... + βp tp,

Var(Y|X=t) = σ^2

can be (and in most books, usually is) written as

Y = β0 + β1 X1 + ... + βp Xp + ε

where ε has a N(0,σ^2) distribution and is independent of X. (In the fixed-X context, the latter assumption is automatic.)

So, one can use mvrnorm to generate X, then rnorm to generate ε, thus obtaining Y.
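
A sketch of both routes, the "by hand" square root construction and MASS::mvrnorm (the particular ν and Γ are arbitrary choices of mine):

    library(MASS)
    set.seed(1)
    nu <- c(1, 2)
    gam <- rbind(c(2, 1),
                 c(1, 3))

    # by hand: V = G X + nu, with G a square root of gam
    eig <- eigen(gam)
    g <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)
    x <- matrix(rnorm(2 * 10000), nrow = 2)   # each column is an N(0,I) vector
    v <- g %*% x + nu                         # each column is one simulated V
    cov(t(v))                                 # close to gam

    # prepackaged
    v2 <- mvrnorm(10000, nu, gam)
    cov(v2)                                   # also close to gam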

Types of Convergence

  • A sequence of random variables Vn is said to converge in probability to a random variable V if for every ε > 0,

    lim_{n → ∞} P(|Vn - V| > ε) = 0

  • For random vectors Vi, replace | | by, e.g. Euclidean distance.

  • Say we have random variables Qn. If for some constant c we have

    P(lim_{n → ∞} Qn = c) = 1

    then we say Qn converges almost surely to c. (The term "almost surely" is just a fancy way of saying, "With probability 1.")

  • Example: the Strong Law of Large Numbers (SLLN). If the Qn are iid with mean μ, then the sample average converges to the distributional average: Set

    An = (1/n) (Q1+...+Qn)

    Then An converges almost surely to μ.

  • If cdfs of random variables converge, i.e. for each t,

    lim_{n → ∞} P(Qn ≤ t) = P(Q ≤ t)

    for some random variable Q, we say Qn converges in distribution to Q. Sometimes people omit mention of Q, simply referring to its distribution.

  • Some types of convergence imply others:

    almost sure => in probability => in distribution

  • In p dimensions, we say that Xn converges in distribution to X if the p-dimensional cdfs converge.

    That definition is hard to deal with, but the Cramer-Wold device often makes things easy: Consider a sequence of random vectors Xn and another random vector X. Then Xn converges in distribution to X if and only if for all p-vectors c, c'Xn converges in distribution to c'X.

  • Slutsky Theorem (due to Russian mathematician Evgeny Slutsky). Say Xn converges in distribution to X, and Yn converges in probability to a constant c. Then:

    • (i) Xn + Yn converges in distribution to X+c.

    • (ii) Xn Yn converges in distribution to Xc.

    • (iii) Xn / Yn converges in distribution to X/c.

    The quantities can be matrix valued, in which case u/v means u v^{-1}, i.e. matrix inverse. Of course, the inverse must exist for all this to be valid.

Central Limit Theorems (CLTs)

  • Univariate Central Limit Theorem: Let Xi, i = 1,2,... be iid random variables with mean μ and variance σ^2. Define Tn = X1+...+Xn, which has mean n μ and variance n σ^2. Then

    (1/(n^{1/2} σ)) (Tn - n μ)

    converges in distribution to N(0,1).

    Many generalizations of the CLT exist, in which the iid assumption is relaxed, especially the second 'i'.

  • Scaling by the factor n^{1/2} is "just right" to make things work here. If, say, we were to divide instead by n, the quantity would go to 0, by the SLLN. That certainly would not be helpful in terms of finding probabilities, confidence intervals and so on.

    Other powers may apply for other quantities. For example, if we are interested in the maximum of X1,...,Xn rather than their sum, then we divide by (2 log n)^{1/2} (and the limiting distribution will be something other than normal).

  • We will often encounter a sequence of random variables Qn that is asymptotically normal with mean ν and variance γ. Note carefully that this does NOT mean that E(Qn) ≈ ν, with a similar statement for variance. On the contrary, E(Qn) could be infinite or undefined. Instead, it merely means that the cdf of

    n^{0.5} (Qn - ν) / γ^{0.5}

    converges (pointwise) to the cdf of N(0,1).

  • Multivariate Central Limit Theorem: Let Xi, i = 1,2,... be iid random vectors with mean vector μ and covariance matrix Σ. Define Tn = X1+...+Xn, which has mean vector n μ and covariance matrix n Σ. Then

    n^{-1/2} (Tn - n μ)

    converges in distribution to N(0,Σ). This can easily be proved using the Cramer-Wold device. (A small simulation illustration of the univariate CLT appears below.)
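
Here is that illustration, a quick simulation check of the univariate CLT with exponential (hence non-normal) data, my own choice of example:

    set.seed(1)
    n <- 100
    z <- replicate(5000, {
      x <- rexp(n, rate = 1)    # mean 1, variance 1
      sum(x - 1) / sqrt(n)      # (Tn - n mu) / (n^{1/2} sigma)
    })
    mean(z <= 1.5)              # compare with pnorm(1.5), about 0.933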

Example: Method of Moments (MM) Estimators

This is a serious application of the above methodology, and will take time to prepare and apply the concepts. The reader's patience is requested.

Background on MM estimators

  • Suppose we are modeling a random variable with some parametric distribution family, such as normal, exponential or gamma. We want to estimate the parameter(s) from our data X1,...,Xn. How can we do this?

The most famous general method is Maximum Likelihood Estimation (MLE), but its somewhat less famous cousin, the Method of Moments (MM), is sometimes the handier one. (It is also the easier one to explain.)

  • This is best introduced by example. Say we are modeling our Xi as having an exponential distribution, i.e. they have density

    τ exp(-τt).

    for some unknown parameter τ. How do we construct the MM estimate T of τ?

    The mean of this distribution is 1/τ, which could be estimated by 1/T once we obtain T. On the other hand, the classical estimate of a population mean is the sample mean,

    Xbar = (1/n) ( X1+...+Xn)

    So set our estimated mean under the exponential assumption, 1/T, equal to our general estimated mean, formed without that assumption:

    1/T = Xbar

    That gives us our MM estimate,

    T = 1/Xbar

    Done!

  • How does MM work in general, when we have more than one parameter? Let's use k to denote the number of parameters in the given parametric family. For instance, k = 2 for the univariate normal family, corresponding to the two parameters μ and σ^2. We then must bring in E(X^2) and so on, as follows.

  • The r-th moment of a random variable X is defined to be m_r = E(X^r), r = 1,2,.... We will need to use the first k moments.

    Note that this is a population value; its intuitive sample estimate is

    M_r = (1/n) (X1^r + ... + Xn^r)

  • Alternatively, for k > 1 one may use the r-th central moment,

    E[(X - E(X))^r]

    Since for r = 2 this is Var(X), and for r = 3 it is related to the skewness of the distribution, things may be more convenient this way.

    It will be clear from context whether we mean the central or noncentral moment.

  • Use τ to denote the population value of the parameter, and T to denote its sample estimate under MM. Both are vectors if k > 1.

  • Before continuing, let's review what happened when we set 1/T = Xbar above. The left-hand side is our estimate of m1 under the exponential model, while the right-hand side is a general estimator of m1, not assuming that model. This is how the Method of Moments works, by matching the two. We will have k such equations, which we then solve for T.

  • For an example with k = 2, consider the gamma distribution family:

    c τ1^{τ2} t^{τ2 - 1} exp(-τ1 t)

    where c is a constant to make the density integrate to 1.0. This is a common model used in network communications and medical survival analysis for example.

    Here is what the gamma family of densities looks like. (From Probability and Statistics for Data Science: Math + R + Data, N. Matloff, 2019. Here r and λ refer to our τ2 and τ1.)

    (Figure: some densities from the gamma family)

    The model is handy when one has a distribution on (0,∞), assumed unimodal.

  • The population moments are

    E(X) = τ2 / τ1

    and

    Var(X) = τ2 / τ1^2

  • So, we equate:

    M1 = T2 / T1

    and (using the central moment for M2)

    M2 = T2 / T1^2

  • These are nonlinear equations, but we are lucky here; they can be solved with simple algebra (otherwise we would need to use numerical approximation methods):

    T1 = M1 / M2

    and

    T2 = M1 T1 = M1^2 / M2

    (An R sketch of both MM examples appears after this list.)
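
Here is that sketch, with simulated data and arbitrarily chosen true parameter values:

    set.seed(1)

    # exponential: true tau = 2; MM estimate is T = 1/Xbar
    x <- rexp(10000, rate = 2)
    1 / mean(x)               # close to 2

    # gamma: true tau1 (rate) = 2, tau2 (shape) = 3
    y <- rgamma(10000, shape = 3, rate = 2)
    m1 <- mean(y)
    m2 <- mean((y - m1)^2)    # second central moment
    t1 <- m1 / m2             # close to 2
    t2 <- m1 * t1             # close to 3
    c(t1, t2)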

Consistency

  • Say we have an estimator θn for a parameter θ based on a sample of size n. As n goes to ∞, we would like our estimator to have the property that θn goes to θ. If it does so almost surely, we say it is strongly consistent; if the convergence is just in probability, it is known as weak consistency.

    Since the SLLN implies that each Mi is strongly consistent for mi, the Ti are strongly consistent too, as long as they are continuous functions of the Mi. In the example above, for instance, M1 and M2 converge almost surely, hence the Ti do too.

Asymptotic normality

  • OK, we now have estimators for τ1 and τ2, and they are consistent. But consistency is a very weak property. For good practical value, it would be desirable to obtain confidence intervals for the τi from the Ti. For this we need asymptotic normality, as follows.

  • For simplicity, let's consider the case k = 1, so the density family has a single scalar parameter, τ, as we saw in the exponential example above. Then E(X) is some function g(τ). In the exponential example, g(τ) = 1/τ.

    Assume g is differentiable with a continuous derivative g1. Then Taylor's Theorem from calculus says

    g(T) = g(τ) + g1(Tbetw) (T - τ)

    for some Tbetw between τ and T.

  • Now to set up using the CLT, recall that we will set

    g(T) = M1

    Then write

    n^{0.5} (M1 - m1) = n^{0.5} [g(T) - g(τ)] = n^{0.5} [g1(Tbetw) (T - τ)]

    and then

    n^{0.5} (T - τ) = n^{0.5} (M1 - m1) / g1(Tbetw)

  • We know that T is a strongly consistent estimator of τ, so by the continuity of g1, the denominator in the RHS converges almost surely to g1(τ). Applying the Slutsky Theorem and the CLT (M1 is a sum of iid terms), we see that the RHS is asymptotically normal (though not with standard deviation 1).

  • In other words

    n^{0.5} (T - τ) =asymp. n^{0.5} (M1 - m1) / g1(τ)

    Here =asymp. should not be viewed as "asymptotically equal to," but rather as "having the same asymptotic distribution as."

    The approximate variance of T itself is then

    Var(M1) / [g1(τ)]^2

    where Var(M1) = Var(X1)/n.

  • To compute a CI, we replace these quantities by their sample analogs. E.g. our estimate of Var(X1) is

    s^2 = (1/n) Σ_{i=1}^n (Xi - M1)^2

    (Divide by n-1 instead of n if you prefer, though for large n there really is no reason to do so.) Our estimate of Var(M1) is then s^2/n.

    Our estimate of g1(τ) is

    g1(T) = -1/T^2

    Our CI, say for a 95% confidence level, is then

    T ± 1.96 (s/n^{0.5}) T^2

    (An R sketch of this CI appears after this list.)

  • For k = 2, we now have two functions g and h, both having arguments τ1 and τ2. We again have derivatives of g and h, but they are now gradients, written here as the row vectors g' and h':

    n^{0.5} (M1 - m1) =asymp. n^{0.5} g' (T - τ)

    n^{0.5} (M2 - m2) =asymp. n^{0.5} h' (T - τ)

    Write M = (M1,M2)' and similarly for m. Then

    n^{0.5} (M - m) =asymp. n^{0.5} A (T - τ)

    where the matrix A consists of g' in row 1 and h' in row 2.

  • So, finally, we have that T is approximately MV normally distributed with mean τ and covariance matrix

    A^{-1} Cov(M) (A^{-1})'

    This enables confidence intervals and ellipsoids as before. Of course, the matrix A must be estimated, by substituting T for τ. Again, this can be justified using the Slutsky properties.
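
Here is that sketch, an approximate 95% CI for τ in the exponential example, using the standard error (s/n^{0.5}) T^2 derived above (simulated data, arbitrary true τ):

    set.seed(1)
    n <- 10000
    x <- rexp(n, rate = 2)            # true tau = 2
    tt <- 1 / mean(x)                 # the MM estimate T
    s <- sqrt(mean((x - mean(x))^2))
    tt + c(-1.96, 1.96) * (s/sqrt(n)) * tt^2    # approximate 95% CI for tau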

The Delta Method

  • Earlier we said, "if a sequence of random variables Zn (including the case of random vectors) is asymptotically normal, then so is any smooth function g(Zn) of them." This fact is quite useful.

  • Before we go on, note again that "asymptotically normal" means in the sense of the CLT. So if we say "g(Zn) is asymptotically normal," we mean that its cdf is approximately equal to a normal cdf, and we must scale by n^{0.5}.

  • The proof is similar to our derivation for MM estimators above. E.g. for dimension 2, let μ = (μ1, μ2)' denote the mean of the asymptotic normal distribution of Zn, with Σ representing the corresponding covariance matrix.

    Now write

    g(Zn) - g(μ) ≈ g1(μ) (Z1n - μ1) + g2(μ) (Z2n - μ2) = A (Zn - μ)

    where the 1 x 2 matrix A = (g1(μ), g2(μ)), with gi denoting the partial derivative of g with respect to its i-th argument.

    The right-hand side is now a linear form in an asymptotically MV normally distributed vector, thus asymptotically MV (actually univariate) normal, with mean 0 and variance A Σ A'. Again, A must be estimated, as we saw in the MM example above. (A small simulation illustration follows.)
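
As a concrete univariate instance of the delta method (my own example, not from the text), consider estimating g(μ) = log(μ) by log(Xbar); the approximate variance is [g'(μ)]^2 σ^2/n = σ^2/(n μ^2). A simulation sketch:

    set.seed(1)
    n <- 200
    sims <- replicate(5000, {
      x <- rexp(n, rate = 1)    # mu = 1, sigma^2 = 1
      log(mean(x))              # the delta-method target g(Xbar); true value log(1) = 0
    })
    var(sims)                   # compare with sigma^2/(n mu^2) = 1/200 = 0.005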
