Slide explaining the sample variance formula with intuition, derivation, and an example using heights data

How do we Estimate the Variance?

Suppose we would like to know the variance of a certain population; for instance, the variance of women’s heights in the UK.

We think of the height of a woman we will randomly select from the UK population as a random variable $Y$, and the population variance we want to find as $ \sigma^2 = \mathbb{V}\text{ar}(Y).$

Given some data $y_1, y_2, \dots, y_n$, how could we estimate $\sigma^2$?

If we wanted instead to estimate the mean $\mu$, the expected value of $Y$, the estimator we use is simply the average of this across the whole random sample:

$$ \bar{Y} = \frac{Y_1+Y_2+\dots +Y_n}{n} $$

Meanwhile, the variance is also an expected value, but now of $(Y-\mu)^2$. So, we might start by looking at the averaged value of this across our random sample instead:

$$ \frac{(Y_1-\mu)^2+(Y_2-\mu)^2+\dots +(Y_n-\mu)^2}{n} $$

Unfortunately, in our setup, this is not an estimator. An estimator must be a statistic; that is, a function of the underlying random sample $Y_1, Y_2, \dots , Y_n $. However, this expression depends on $\mu$, which we generally will not know.

In particular, even when we collect our data, turning the random sample $Y_1,Y_2, \dots , Y_n$ into a series of observed numbers $y_1,y_2, \dots y_n $, we still won’t be able to work out its value (as we still won’t know $\mu$). So, it is no use for producing estimates in practice.

Still, this was a good start, and we needn’t be discouraged. We don’t know $\mu$; but we do have our estimator of it, $\bar{Y}$. So, let’s try the same idea, but replacing $\mu$ with $\bar{Y}$:

$$ \frac{(Y_1-\bar{Y})^2+(Y_2-\bar{Y})^2+\dots +(Y_n-\bar{Y})^2}{n} $$

This is now a genuine estimator. However, replacing $\mu$ with $\bar{Y}$ tends to make the expression smaller than the true $\sigma^2$ on average.

Suppose, for instance, that in a particular sample, the group of women we ask just happen to be rather tall on the whole. Then $\bar{y}$ will be a bit bigger than $\mu$; but likewise, the numbers $y_i-\bar{y}$ will also tend to be a bit smaller on average than $y_i-\mu$. That is, the heights $y_i$ of our generally tall women, which are also greater than $\mu$ on average, will on average also be closer to their own (larger) sample mean, than to the (smaller) population mean $\mu$.

We correct for this by dividing by $n-1$ rather than $n$ on the bottom. This finally gives us our favoured estimator of the variance; the “sample variance”, which we will denote by $S_n^2$.

That is, we have the sample variance formula:

$$ S_n^2 = \frac{(Y_1-\bar{Y})^2+(Y_2-\bar{Y})^2+\dots +(Y_n-\bar{Y})^2}{n-1} $$

This last correction means that $S_n^2$ is neither too big nor too small on average. Formally, we have that:

$$ \mathbb{E}(S_n^2) = \sigma^2 $$

Naturally, if we want to estimate the standard deviation $ \sigma = \sqrt{ \mathbb{V}\text{ar}(Y)}$, we just take the square root of $S_n^2$. If we are careful, we can use the tempting notation $S_n$ here:

$$ S_n = \sqrt{ \frac{(Y_1-\bar{Y})^2+(Y_2-\bar{Y})^2+\dots +(Y_n-\bar{Y})^2}{n-1} }$$

However, we should remember that $S^2_n$ is defined first, and $S_n$ is just its square root.

Finally, an equivalent formula, often easier to work with in practice, is:

$$ S_n^2 = \frac{(Y_1^2+Y_2^2+\dots+Y_n^2)-n \bar{Y}^2}{n-1} $$


Background:


Extensions: