The bias-variance trade-off is a common problem that every data scientist comes across quite often. A proper understanding of this trade-off helps us build accurate models and avoid the mistakes of overfitting and underfitting the data.
ESTIMATION OF f:
Everybody must have heard phrases like, “we expect the sales to be 123 in a particular month” or “we expect the demand to rise by ‘p’ percentage points in the following week/month.” Ever wondered how we go about making these predictions?
Taking the example of predicting sales, we must have some data from the past about sales and the other factors that have influenced them. We can use this data to establish a relationship (assuming that a relationship exists) between sales and those factors. The sales column is known as the Response Variable (Y), while the rest of the columns in the data are known as the Predictor Variables (X). The relationship between Y and X can be written as:

Y = f(X) + ε
We often don't know the actual relationship, so we estimate the function f(X), or simply f. In the above equation, f is some unknown and fixed function of the predictor variables and represents the systematic information that X provides about Y. The error term ε is random: it captures everything about Y that the predictors cannot explain, and no model, however good, can remove it. The error term is independent of X and has mean 0.
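A minimal sketch of this setup in NumPy. Since the true f is unknown in practice, the linear truth, noise level, and sample size below are all illustrative assumptions chosen so we can simulate Y = f(X) + ε and then estimate f from the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# In practice f is unknown; here we pick one so we can simulate Y = f(X) + error.
def f(x):
    return 3.0 * x + 2.0

x = rng.uniform(0, 10, size=200)        # predictor X
eps = rng.normal(0.0, 1.0, size=200)    # error term: mean 0, independent of X
y = f(x) + eps                          # response Y

# One possible way to estimate f: fit a straight line by least squares.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # should land near the true values 3.0 and 2.0
```

With enough data and a well-chosen model form, the estimated f recovers the true relationship up to the noise in the observations.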
WHY ESTIMATE f:
There are two reasons why we estimate the function f.
a) For making predictions.
When we want to make predictions about the future (for example, how many units of a product will be sold in a month, based on past data), we are making a statistical guess about the future.
Also, if making predictions is our ultimate goal or the problem statement, we can treat the function f as a black box (whose internal mechanism is hidden): we are not interested in its exact form as long as it provides us with accurate values of Y for a given value of X.
Flexible methods are known to give accurate results for this purpose because, as the name suggests, they adapt more freely to the data and capture the trends present in it more closely. Robust methods lack this property, so there is a high chance of missing useful information that flexible methods would capture. Flexible methods include Decision Trees, Random Forests, etc.
b) For making inferences.
Inferences include answers to questions like, “how much will the sales of a product be affected if the price of that product is increased?” or “which variables affect sales the most?”
So here, we need the exact form of the estimated f; we cannot treat it as a black box. We require the simplest form of the estimated f that is as close to the actual f as possible.
Robust methods fulfill these requirements. They choose a simplified, approximate form for the estimated f and follow only the trends that are observed across all the data points. Linear Regression is a robust method.
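The contrast between the two families can be sketched with polynomial degree standing in for flexibility (the sine-plus-noise data below is an assumption for illustration, not anything from a real problem):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a curved trend plus noise (assumed for illustration).
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=40)

def fit_mse(degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coefs = np.polyfit(x, y, deg=degree)
    pred = np.polyval(coefs, x)
    return np.mean((y - pred) ** 2)

robust_mse = fit_mse(1)      # rigid straight line: simple, interpretable
flexible_mse = fit_mse(12)   # wiggly high-degree polynomial: tracks the data closely
print(robust_mse, flexible_mse)
```

The flexible fit always tracks the training data at least as closely as the rigid one, which is exactly why it can capture trends the robust method misses; whether that is a good thing depends on the goal, as the rest of the article discusses.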
Note: The problem statement will help you in deciding whether you need flexible methods or robust methods to estimate f.
ASSESSING THE MODEL ACCURACY:
Now after we have used our data to establish a relationship between X and Y, we need to assess the accuracy of the model. In other words, we need to quantify the extent to which the predicted response value for a given observation is close to the actual value for that observation.
We use Mean Square Error for this purpose, abbreviated as MSE. MSE can be calculated in two ways: using the training data (which we have used to estimate f) or using the test data (which is unseen data for our statistical method). The actual accuracy of the model is represented by the test MSE. Using the training MSE as a measure of accuracy can lead to misplaced confidence in our predictions/inferences, because we want to assess the performance of the model on “new” data (not known to the model) rather than how accurately it predicts the training observations, which the model has already seen.
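The gap between the two MSEs can be seen in a small simulation (the cubic truth, sample sizes, and the deliberately over-flexible degree below are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Draw n observations from an assumed true relationship plus noise."""
    x = rng.uniform(-1, 1, n)
    return x, x**3 - x + rng.normal(0, 0.2, n)

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)   # "new" data the model has never seen

# Deliberately over-flexible fit on the training data.
coefs = np.polyfit(x_train, y_train, deg=15)

train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
print(train_mse, test_mse)
```

The training MSE is flattering because the model has memorized part of the noise in the training set; the test MSE, computed on data the model never saw, is the honest measure.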
Important Point: A model is considered the best fit for the given data, when it produces a low test MSE.
HOW TO ACHIEVE LOW TEST MSE:
Now we know that if we want our model to fit the data accurately, it should produce a low test MSE. The expected test MSE for a given value x₀ can be decomposed into the following components:

E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)

where f̂ is the estimated f, Bias(f̂(x₀)) = E[f̂(x₀)] − f(x₀), and Var(ε) is the irreducible error contributed by the error term itself.
So to achieve a low test MSE, we need a model with low variance and low bias. Now let’s try to understand these two terms and what they mean.
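This decomposition can be checked by simulation: repeatedly draw fresh training sets from an assumed true f, fit a model each time, and compare the directly estimated test MSE at a point x₀ with the sum of variance, squared bias, and irreducible error (the sine truth, noise level, and helper names below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.5
x0 = 0.3
f = lambda x: np.sin(2 * np.pi * x)   # "true" f, known only because we simulate

def fit_predict(degree):
    """Draw a fresh training set, fit a polynomial, predict at x0."""
    x = rng.uniform(0, 1, 50)
    y = f(x) + rng.normal(0, sigma, 50)
    return np.polyval(np.polyfit(x, y, deg=degree), x0)

preds = np.array([fit_predict(3) for _ in range(2000)])

variance = preds.var()                      # spread of f-hat(x0) across training sets
bias_sq = (preds.mean() - f(x0)) ** 2       # squared systematic offset from true f(x0)
irreducible = sigma ** 2                    # Var(error term)

# Direct estimate of the expected test MSE at x0 for comparison.
y0 = f(x0) + rng.normal(0, sigma, 2000)
mse = np.mean((y0 - preds) ** 2)

print(mse, variance + bias_sq + irreducible)
```

The two printed numbers agree up to Monte Carlo error, and no model choice can push the test MSE below the irreducible Var(ε) term.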
VARIANCE
Variance refers to how much the estimated f would change if a different training dataset were used to estimate it. Ideally, the estimated f should not vary too much between training datasets, as all the observations are assumed to follow the same distribution. A method therefore has high variance if small changes in the data result in large changes in the estimated f.
Flexible methods have high variance because they follow trends that are particular to a given training dataset. When that training dataset is changed, the estimated f changes significantly.
BIAS
Bias refers to the error introduced by approximating a complicated problem with a much simpler model. Suppose we fit non-linear data (say, the response and predictor variables are related quadratically) with a linear regression model; the model then contains bias and will not be able to predict the values of Y as accurately as a suitable non-linear model. A method has high bias if the estimated f is very different from the true f.
Flexible methods have low bias because they fit the data closely and are hence capable of estimating an f that is as close to the true f as possible.
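The linear-fit-on-quadratic-data example can be checked numerically. Even with a very large sample, the straight line's error never approaches the noise floor, because the bias comes from the wrong model form, not from a shortage of data (the quadratic truth and noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Quadratic truth fitted with a straight line: the model-form error (bias)
# does not vanish even with a huge amount of data.
x = rng.uniform(-1, 1, 100_000)
y = x ** 2 + rng.normal(0, 0.1, 100_000)

mse_lin = np.mean((y - np.polyval(np.polyfit(x, y, deg=1), x)) ** 2)   # biased model
mse_quad = np.mean((y - np.polyval(np.polyfit(x, y, deg=2), x)) ** 2)  # correct form

print(mse_lin, mse_quad)
```

The quadratic fit's MSE sits near the noise variance, while the linear fit's MSE stays well above it: that persistent gap is the squared bias.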
THE CATCH:
Did you notice something in the definitions of bias and variance?
Flexible methods have low bias but high variance. But to achieve a low test MSE, we need low bias AND low variance. (If you think using robust methods will solve the problem for you, it won't, because robust methods have high bias and low variance.)
This dilemma right here is known as the bias-variance trade-off. It is a trade-off because we cannot reduce bias and variance simultaneously. It is always easy to obtain a method with low bias and high variance (or vice versa); the real challenge lies in obtaining a method with both low bias and low variance.
Now let’s look at the behavior of test MSE curve as a function of bias and variance.
From the decomposition of the expected test MSE, we clearly see that the test MSE depends on the variance and the square of the bias. The relative changes in these quantities determine whether the test MSE increases or decreases. As we increase the flexibility of the method (the model complexity), the bias initially tends to decrease faster than the variance increases, so the test MSE declines. This happens because robust methods (with few degrees of freedom) capture only the trends and patterns common to all the data points; as the degrees of freedom increase, the method becomes able to capture subtler patterns as well, so the bias decreases significantly without much increase in the variance. Beyond some point, however, increasing the flexibility barely reduces the bias but starts to significantly increase the variance, as the method begins following the noise in the data. Hence the test MSE increases.
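The U-shape of the test MSE curve can be sketched by sweeping polynomial degree as the flexibility knob (the sine truth, noise level, and the three degrees chosen are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true relationship

def make(n):
    x = rng.uniform(0, 1, n)
    return x, f(x) + rng.normal(0, 0.3, n)

x_tr, y_tr = make(30)       # small training set
x_te, y_te = make(2000)     # large held-out test set

test_mse = {}
for degree in (1, 3, 15):   # too rigid, moderate, too wiggly
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    test_mse[degree] = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)

print(test_mse)
```

The moderate degree wins: the straight line underfits (high bias), the degree-15 polynomial overfits (high variance), and the intermediate model sits near the bottom of the U.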
Hence, it is up to the problem statement and the decision-maker to decide whether to trade low bias for low variance (to draw inferences) or to trade low variance for low bias (to make accurate predictions about the future).