Statistical Learning — Bias-Variance Tradeoff

Stanley G
4 min read · Jul 18, 2021

I am starting a series of supplementary materials that I found helpful while reading through the Statistical Learning course offered by Stanford. I hope this series will help you understand the technical details of the different topics. Please also let me know if there are any topics in the course you would like me to cover!

Today, I am going to talk about the bias-variance tradeoff in supervised learning introduced in the course. At a high level, the prediction error made by your model can be decomposed into bias error, variance, and irreducible error.

  • Bias error: Error originating from the assumptions built into the model. A model that is simpler than the actual relationship between the output and the features has high bias (e.g., fitting data generated by a quadratic relationship with a linear model). Conversely, a model that is more complex than the actual relationship has low bias (e.g., fitting data generated by a linear relationship with a high-degree polynomial).
  • Variance: Error originating from the model fitting the random noise in the training data (overfitting). A small perturbation of the training data can push the model's predictions far away from the actual ground truth.
  • Irreducible error: Error that cannot be reduced by any model, because it comes from inherent randomness in the data.

We can’t reduce the irreducible error, but we can try our best to minimize the reducible error (bias error + variance). It would be great if we could minimize both at the same time; in reality, however, this doesn’t happen, hence the bias-variance tradeoff. The tradeoff basically means that if we use a very complex model to fit the training data, we make a low-bias assumption about the model but get high variance in the predictions; if we make a high-bias assumption, the variance of the predictions will be low.
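
To make the tradeoff concrete, here is a small simulation sketch (a toy example of my own rather than anything from the course): I assume the true relationship is a quadratic, repeatedly draw noisy training sets, and estimate the squared bias and the variance of polynomial fits of a few different degrees at a single test point. The specific function, noise level, and degrees below are arbitrary choices for illustration.

```python
import numpy as np

# Toy simulation of the bias-variance tradeoff (illustrative only):
# the true relationship is assumed to be quadratic, and polynomial fits
# of different degrees are compared over many resampled training sets.

rng = np.random.default_rng(0)

def f(x):
    """True (unknown) relationship between the feature and the output."""
    return 1.0 + 2.0 * x - 1.5 * x ** 2

sigma = 0.3                           # std of the irreducible noise epsilon
x_train = np.linspace(-1.0, 1.0, 30)  # fixed design points for every training set
x0 = 0.5                              # test point at which bias/variance are measured

def bias_variance_at_x0(degree, n_runs=2000):
    """Estimate (bias^2, variance) of a degree-`degree` polynomial fit at x0."""
    preds = np.empty(n_runs)
    for i in range(n_runs):
        # Draw a fresh training set: same x's, new noise.
        y = f(x_train) + rng.normal(0.0, sigma, size=x_train.size)
        coefs = np.polyfit(x_train, y, degree)  # fit f_hat on this training set
        preds[i] = np.polyval(coefs, x0)        # f_hat(x0)
    bias_sq = (f(x0) - preds.mean()) ** 2       # (f - E[f_hat])^2
    variance = preds.var()                      # E[(f_hat - E[f_hat])^2]
    return bias_sq, variance

for degree in (1, 2, 12):
    b2, v = bias_variance_at_x0(degree)
    print(f"degree {degree:2d}: bias^2 = {b2:.4f}, variance = {v:.4f}")
```

With these (arbitrary) settings, the degree-1 fit should show a comparatively large squared bias, the degree-12 fit a comparatively large variance, and the degree-2 fit should keep both small.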

Let’s decompose the error into these three components in a mean squared error regression setting.

Let’s define some notation before moving forward.
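
Throughout, I will use the standard regression setup from the course: the observed output is the true function plus noise, the hat denotes the model fitted on the training data X, and expectations E[·] are taken over the training data (and the noise). In symbols,

$$y = f(x) + \varepsilon, \qquad E[\varepsilon] = 0, \qquad \mathrm{Var}(\varepsilon) = \sigma^2,$$

where \hat{f}(x) denotes the prediction of the fitted model at a point x.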

The bias of the fitted model is defined as the difference between the actual relationship between the output and the features, f, and the fitted model, i.e. the expectation of f_hat taken over the training data X. Hence,
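
$$\mathrm{Bias}\big[\hat{f}(x)\big] = f(x) - E\big[\hat{f}(x)\big]$$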

The variance of the predictions is simply the variability of the predictions around their mean (recall the definition of variance). Taking the expectation over the training data X, we have
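
$$\mathrm{Var}\big[\hat{f}(x)\big] = E\Big[\big(\hat{f}(x) - E[\hat{f}(x)]\big)^2\Big]$$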

Now, let’s start deriving.
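
Starting from the expected squared error at a fixed point x (the argument x is dropped from here on to keep the expressions short), we substitute y = f + ε, add and subtract E[\hat{f}] inside the square, and expand:

$$
\begin{aligned}
E\big[(y - \hat{f})^2\big]
&= E\big[(f + \varepsilon - \hat{f})^2\big] \\
&= E\Big[\big((f - E[\hat{f}]) + \varepsilon + (E[\hat{f}] - \hat{f})\big)^2\Big] \\
&= E\big[(f - E[\hat{f}])^2\big] + E[\varepsilon^2] + 2\,E\big[(f - E[\hat{f}])\,\varepsilon\big] \\
&\qquad + 2\,E\big[(E[\hat{f}] - \hat{f})\big((f - E[\hat{f}]) + \varepsilon\big)\big] + E\big[(E[\hat{f}] - \hat{f})^2\big]
\end{aligned}
$$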

Now, to simplify this expression, note that f gives a deterministic value and E[f_hat] is the average output of the fitted model, so (f - E[f_hat]) is a constant: it can be taken out of the expectation, and the expectation of a constant is just the constant itself. Also note that E[epsilon] = 0. Hence, we have the following identities, which will help us simplify the expansion above:
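
$$E\big[(f - E[\hat{f}])^2\big] = (f - E[\hat{f}])^2, \qquad E\big[(f - E[\hat{f}])\,\varepsilon\big] = (f - E[\hat{f}])\,E[\varepsilon] = 0$$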

So, the terms can be simplified as
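
$$E\big[(y - \hat{f})^2\big] = (f - E[\hat{f}])^2 + E[\varepsilon^2] + 2\,E\big[(E[\hat{f}] - \hat{f})\big((f - E[\hat{f}]) + \varepsilon\big)\big] + E\big[(E[\hat{f}] - \hat{f})^2\big]$$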

Now, the first term above is the square of the bias, the second term is the irreducible error (the variance of epsilon, since E[epsilon] = 0 implies E[epsilon^2] = Var(epsilon)), and the last term is the variance of the model's predictions. To simplify the third term, we can distribute (E[f_hat] - f_hat) into (f - E[f_hat]) and into epsilon; since (f - E[f_hat]) is a constant, E[E[f_hat] - f_hat] = 0, and epsilon (the noise on a new observation) is independent of f_hat, both resulting pieces vanish:
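
$$
\begin{aligned}
2\,E\big[(E[\hat{f}] - \hat{f})(f - E[\hat{f}])\big] + 2\,E\big[(E[\hat{f}] - \hat{f})\,\varepsilon\big]
&= 2\,(f - E[\hat{f}])\,E\big[E[\hat{f}] - \hat{f}\big] + 2\,E\big[E[\hat{f}] - \hat{f}\big]\,E[\varepsilon] \\
&= 0 + 0 = 0
\end{aligned}
$$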

Hence, the expected squared error (the expectation of the MSE) can be decomposed into the three error terms discussed previously:
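
$$E\big[(y - \hat{f})^2\big] = \underbrace{\big(f - E[\hat{f}]\big)^2}_{\mathrm{Bias}[\hat{f}]^2} + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible error}} + \underbrace{E\big[(\hat{f} - E[\hat{f}])^2\big]}_{\mathrm{Var}[\hat{f}]}$$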

I derived this with reference to the sources listed below. I hope it helps.

References:

  1. https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
  2. https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning


Stanley G

I'm currently an applied scientist at Amazon, previously a senior machine learning scientist at 1QBit.