How Do We Estimate Variance? Sample Variance Formula Explained

Slide explaining the sample variance formula with intuition, derivation, and an example using heights data

How do we Estimate the Variance?

Suppose we would like to know the variance of a certain population; for instance, the variance of women’s heights in the UK.

We think of the height of a woman we will randomly select from the UK population as a random variable $Y$, and the population variance we want to find as $ \sigma^2 = \mathbb{V}\text{ar}(Y).$

Given some data $y_1, y_2, \dots, y_n$, how could we estimate $\sigma^2$?

If we wanted instead to estimate the mean $\mu$, the expected value of $Y$, the estimator we use is simply the average of this across the whole random sample:

$$ \bar{Y} = \frac{Y_1+Y_2+\dots +Y_n}{n} $$

Meanwhile, the variance is also an expected value, but now of $(Y-\mu)^2$. So, we might start by looking at the averaged value of this across our random sample instead:

$$ \frac{(Y_1-\mu)^2+(Y_2-\mu)^2+\dots +(Y_n-\mu)^2}{n} $$

Unfortunately, in our setup, this is not an estimator. An estimator must be a statistic; that is, a function of the underlying random sample $Y_1, Y_2, \dots , Y_n $. However, this expression depends on $\mu$, which we generally will not know.

In particular, even when we collect our data, turning the random sample $Y_1,Y_2, \dots , Y_n$ into a series of observed numbers $y_1,y_2, \dots y_n $, we still won’t be able to work out its value (as we still won’t know $\mu$). So, it is no use for producing estimates in practice.

Still, this was a good start, and we needn’t be discouraged. We don’t know $\mu$; but we do have our estimator of it, $\bar{Y}$. So, let’s try the same idea, but replacing $\mu$ with $\bar{Y}$:

$$ \frac{(Y_1-\bar{Y})^2+(Y_2-\bar{Y})^2+\dots +(Y_n-\bar{Y})^2}{n} $$

This is now a genuine estimator. However, replacing $\mu$ with $\bar{Y}$ tends to make the expression smaller than the true $\sigma^2$ on average.

Suppose, for instance, that in a particular sample, the group of women we ask just happen to be rather tall on the whole. Then $\bar{y}$ will be a bit bigger than $\mu$; but likewise, the numbers $y_i-\bar{y}$ will also tend to be a bit smaller on average than $y_i-\mu$. That is, the heights $y_i$ of our generally tall women, which are also greater than $\mu$ on average, will on average also be closer to their own (larger) sample mean, than to the (smaller) population mean $\mu$.

We correct for this by dividing by $n-1$ rather than $n$ on the bottom. This finally gives us our favoured estimator of the variance; the “sample variance”, which we will denote by $S_n^2$.

That is, we have the sample variance formula:

$$ S_n^2 = \frac{(Y_1-\bar{Y})^2+(Y_2-\bar{Y})^2+\dots +(Y_n-\bar{Y})^2}{n-1} $$

This last correction means that $S_n^2$ is neither too big nor too small on average. Formally, we have that:

$$ \mathbb{E}(S_n^2) = \sigma^2 $$

Naturally, if we want to estimate the standard deviation $ \sigma = \sqrt{ \mathbb{V}\text{ar}(Y)}$, we just take the square root of $S_n^2$. If we are careful, we can use the tempting notation $S_n$ here:

$$ S_n = \sqrt{ \frac{(Y_1-\bar{Y})^2+(Y_2-\bar{Y})^2+\dots +(Y_n-\bar{Y})^2}{n-1} }$$

However, we should remember that $S^2_n$ is defined first, and $S_n$ is just its square root.

Finally, an equivalent formula, often easier to work with in practice, is:

$$ S_n^2 = \frac{(Y_1^2+Y_2^2+\dots+Y_n^2)-n \bar{Y}^2}{n-1} $$

Background:

Variance and Standard Deviation

Extensions:

Unbiasedness