Bias-variance dilemma, overfitting, and underfitting

Naomy Duarte Gomes
6 min read · Jul 26, 2022

Bias and variance play a significant role in model performance, potentially leading to overfitting or underfitting. When building a supervised machine learning model, we want to avoid both situations and find the best balance between bias and variance. It is therefore essential to understand what these two errors are and how we can adjust them.

Introduction

When writing a supervised machine learning (ML) model, whether for classification or regression, our goal is for the model to predict the target as accurately as possible, especially on new data. For example, consider a model to predict churn. We want it to capture the relationships between the features well and make accurate predictions on clients it has never seen before, so the bank can take measures in advance to avoid churn.

Sometimes our model performs very well on training data and we feel happy and confident about it, but when the test data comes in, the results are disappointing. This is known as overfitting. On the other hand, sometimes the model performs poorly even on the training data. This is known as underfitting. In both situations the error associated with the model's predictions is high: the predictions are unreliable and the model cannot generalize beyond the training set, so it won't be useful on new data. To understand what is happening, so we can build better models and avoid overfitting and underfitting, we need to grasp two important concepts: bias and variance.
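
To make this concrete, here is a minimal sketch of both failure modes. It uses scikit-learn on a synthetic dataset (the sine function, noise level, and polynomial degrees are illustrative choices, not something from a real churn problem): the very flexible model nails the training set but does much worse on the test set, while the too-simple model does poorly on both.

```python
# Sketch: the same noisy data fit with a too-simple and a too-complex
# model, comparing training vs. test error. All settings are illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=100)  # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 15):  # degree 1 tends to underfit, degree 15 to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```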

Bias and variance

  • Bias

Imagine you have a dataset with one feature (x) and one target (y), represented by the blue dots in fig. 1, and you write a model to fit it. Because of randomness in the data, you can train the model many times on different samples of the dataset and average the resulting predictions (the model will predict a range of outcomes, and we take the average of these outcomes). The (squared) difference between this average prediction and the true values we are trying to predict is the bias. It can also be seen as the error introduced when the space of functions available to fit is too small and does not contain the best model, biasing the choice.

So, if your model fits your data closely, like in fig. 1a, the difference between the average predictions and the true values will be very small. In terms of the function space, it is large enough to let the algorithm choose higher-order polynomials such as the curve in fig. 1a. You then have a low-bias model: it captures the relationship between feature and target very well. On the other hand, if your model is a straight line like in fig. 1b, it is not a good fit for your data, it doesn't capture the relationship between feature and target, and you get large differences between the average prediction and the true values. The function space is limited, so your model can't choose a better function. You have a high-bias model.

Fig. 1 — Different model complexities: a) a complicated function with low bias and b) a simple function with high bias.
  • Variance

So far we have considered the average predictions and the true values. If we now look at each individual prediction and take the squared difference between it and the model's mean prediction, we get the variance. Variance measures how much your model's predictions move around their mean. A high-variance model may give good predictions sometimes, but it can also predict absurd values because its predictions vary too much, just like the complex curve in fig. 1a. A low-variance model keeps its predictions consistent, within a small range, like the straight line in fig. 1b, but that line may not be the right model.

If we look at a graph of bias and variance as functions of the complexity of the function space, we see that bias decreases as complexity increases: the model can choose a highly complex function to fit the data as accurately as possible and is not biased toward simple solutions. Variance, on the other hand, increases with model complexity, because the more complex a model is, the larger the fluctuations in its predictions. The total error in fig. 2 is the sum of the bias and variance errors.

Fig. 2 — Bias and variance as a function of the complexity of the function space.
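
Bias and variance are easy to estimate empirically. The rough Monte Carlo sketch below (the true function, noise level, number of repetitions, and degrees are all illustrative assumptions) draws many fresh training sets, fits a polynomial of each degree, and measures how far the average prediction is from the truth (bias²) and how much individual predictions scatter around that average (variance). As in fig. 2, bias² should fall and variance should rise as the degree grows.

```python
# Rough Monte Carlo estimate of bias^2 and variance vs. model complexity.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # "true" target function (assumed)
x_eval = np.linspace(0, 1, 50)        # points where we evaluate the errors

for degree in (1, 3, 9):
    preds = []
    for _ in range(200):              # 200 independent training sets
        x_tr = rng.uniform(0, 1, 30)
        y_tr = f(x_tr) + rng.normal(0, 0.3, 30)
        coefs = np.polyfit(x_tr, y_tr, degree)
        preds.append(np.polyval(coefs, x_eval))
    preds = np.array(preds)           # shape: (200 runs, 50 eval points)
    avg_pred = preds.mean(axis=0)     # average prediction at each point
    bias2 = np.mean((avg_pred - f(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree={degree}  bias^2={bias2:.4f}  variance={variance:.4f}")
```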

Mathematically speaking, bias and variance enter the model's error as follows.
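
For a data-generating process y = f(x) + ε with noise variance σ², the expected squared error of a model f̂ at a point x decomposes in the standard way (see ref. [1]):

```latex
\underbrace{\mathbb{E}\big[(y - \hat{f}(x))^2\big]}_{\text{total error}}
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
```

The expectations are taken over random training sets. The first term is the squared bias discussed above, the second is the variance, and σ² is the irreducible error that no model can remove.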

Model complexity

As we can see in fig. 2, bias and variance compete in model performance. Where bias is low and variance is high, the model is complex and has many parameters. It fits the training set very well but does not generalize. It is overfitting. Where bias is high and variance is low, the model is simple, but then it neither fits nor generalizes well. It is underfitting. Bias and variance are thus in a tug of war over the model's performance.

Fig. 3 — Bias and variance are in a tug of war. High bias or variance can break your model performance. (Courtesy of my dogs)

But thankfully there is a point where the total error reaches its minimum. So what can we do to avoid overfitting or underfitting and keep our model's error at this minimum? There is a great deal of discussion in statistical learning about error estimation and techniques to improve a model's performance (see refs. [1,2]). But if you are looking for more "hands-on" solutions, without getting too deep into math and statistics, there are a few steps we can take towards finding the bias-variance tradeoff:

  1. Some a priori knowledge about the dataset can be very useful when choosing which features are the most relevant. Sometimes having fewer but better features is preferable to using all the features available: with more features, the model can become very complex and overfit. Feature selection is therefore an important step.
  2. Choose an appropriate measure of the model’s performance. Depending on the goal of your model, the metric you choose can lead to an erroneous evaluation of its performance. For example, in situations where the costs of false positives and false negatives matter, such as in disease prediction, we need a measure that encodes this concern, and plain accuracy won’t work (see my article on metrics to evaluate classification models).
  3. Test as many algorithms as possible, use cross-validation to avoid looking only at training errors, and tune model parameters: this is an opportunity to explore different model complexities and choose the one that strikes a good balance between bias and variance (see the sketch after this list).
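
As an illustration of point 3, the sketch below (the dataset, degree grid, and metric are illustrative assumptions) uses scikit-learn's GridSearchCV with 5-fold cross-validation to choose the polynomial degree, i.e. the model complexity, by the cross-validated error instead of the training error:

```python
# Sketch: pick model complexity (polynomial degree) by cross-validated error.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=100)

pipe = Pipeline([("poly", PolynomialFeatures()), ("reg", LinearRegression())])
search = GridSearchCV(
    pipe,
    param_grid={"poly__degree": list(range(1, 11))},
    scoring="neg_mean_squared_error",  # choose the metric that matches your goal
    cv=5,                              # 5-fold cross-validation
)
search.fit(X, y)
print("best degree:", search.best_params_["poly__degree"])
print("cross-validated MSE:", -search.best_score_)
```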

Conclusion

Bias and variance are two errors that can jeopardize model performance. High-complexity models may seem like a good idea since they perform so well on training data, but they are in fact overfitting due to low bias and high variance. On the other hand, low-complexity models may not capture the relationship between features and target and perform very badly, underfitting due to high bias and low variance. Consequently, understanding the bias-variance tradeoff and finding the model that best minimizes both errors is crucial.

References

[1] Hastie, Trevor, et al. The elements of statistical learning: data mining, inference, and prediction. Vol. 2. New York: Springer, 2009.

[2] Von Luxburg, Ulrike, and Bernhard Schölkopf. “Statistical learning theory: Models, concepts, and results.” Handbook of the History of Logic. Vol. 10. North-Holland, 2011. 651–706.
