Before we delve into forecasting algorithms, it is fundamental to understand how we may evaluate their performance. This issue is sometimes overlooked in practice: once a forecast has been calculated and used to make a decision, it is often thrown away for good. This is a mistake, as the performance of the forecasting process should be carefully monitored.
In the following, we will measure the quality of a forecast by a forecasting error, defined as
$$e_t \equiv Y_t - F_{t,h} \tag{11.1}$$
This is the forecast error for the variable at time bucket $t$; the notation $F_{t,h}$ refers to the forecast for time bucket $t$, made with horizon $h$. What matters is not when the forecast was made, but which time bucket it refers to: if the horizon is $h = 1$, then $F_{t,1}$ was generated at time bucket $t-1$, whereas if $h = 2$, $F_{t,2}$ was generated at time bucket $t-2$. Also notice that, by this definition, the error is positive when demand is larger than the forecast (i.e., we under-forecasted), whereas it is negative when demand is smaller than the forecast (i.e., we over-forecasted). It is important to draw the line between two settings in which we may evaluate performance:
- In historical simulation, we apply an algorithm to past data and check what the performance would have been, had we adopted that method. Of course, to make sense, historical simulation must be non-anticipative: the information set used to compute $F_{t,h}$ must not include any data from time bucket $t$ onward (more precisely, nothing beyond time bucket $t-h$, when the forecast was generated). This may sound like a trivial remark but, as we will see later, there might be an indirect way to violate this non-anticipativity principle. A minimal sketch of such a backtest follows this list.
- In online monitoring, we gather errors at the end of each time bucket to possibly adjust some coefficients and fine-tune the forecasting algorithm. In fact, many forecasting methods depend on coefficients that can be set once and for all or dynamically adapted.
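To make the non-anticipativity requirement concrete, the following Python sketch runs a rolling historical simulation. The moving-average rule, the window parameter, and the demand figures are illustrative placeholders, not something prescribed here; the only point is that each forecast $F_{t,h}$ is computed from data available up to time bucket $t-h$.

```python
# Minimal sketch of a non-anticipative historical simulation (rolling backtest).
# The moving-average rule and all figures below are illustrative placeholders.

def moving_average_forecast(history, window=3):
    """Forecast the next value as the average of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def historical_simulation(demand, horizon=1, window=3):
    """Collect forecast errors e_t = Y_t - F_{t,h}, where each F_{t,h} is
    computed using only the demand observed up to time bucket t - horizon."""
    errors = []
    for t in range(window + horizon - 1, len(demand)):
        info_set = demand[: t - horizon + 1]   # data available at time t - horizon
        forecast = moving_average_forecast(info_set, window)
        errors.append(demand[t] - forecast)    # positive error = under-forecast
    return errors

demand = [102, 98, 110, 95, 105, 120, 99, 108, 101, 115]
print(historical_simulation(demand, horizon=1))
```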
Whatever the case, performance is evaluated over a sample consisting of T time buckets, and of course we must aggregate errors by taking some form of average. The following list includes some of the most commonly used indicators:
- Mean error:
$$\mathrm{ME} = \frac{1}{T}\sum_{t=1}^{T} e_t \tag{11.2}$$
- Mean absolute deviation:
$$\mathrm{MAD} = \frac{1}{T}\sum_{t=1}^{T} \lvert e_t \rvert \tag{11.3}$$
- Root-mean-square error:
$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T} e_t^2} \tag{11.4}$$
- Mean percentage error:
$$\mathrm{MPE} = \frac{1}{T}\sum_{t=1}^{T} \frac{e_t}{Y_t} \times 100\% \tag{11.5}$$
- Mean absolute percentage error:
$$\mathrm{MAPE} = \frac{1}{T}\sum_{t=1}^{T} \frac{\lvert e_t \rvert}{Y_t} \times 100\% \tag{11.6}$$
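As a quick illustration of the five indicators, the following sketch computes them on a small, made-up sample of demands and forecasts.

```python
import math

# The demand and forecast figures below are made up for illustration only.
demand    = [100, 90, 110, 105, 95]
forecasts = [ 95, 96, 100, 110, 90]

errors = [y - f for y, f in zip(demand, forecasts)]   # e_t = Y_t - F_{t,h}
T = len(errors)

ME   = sum(errors) / T
MAD  = sum(abs(e) for e in errors) / T
RMSE = math.sqrt(sum(e * e for e in errors) / T)
MPE  = 100 * sum(e / y for e, y in zip(errors, demand)) / T
MAPE = 100 * sum(abs(e) / y for e, y in zip(errors, demand)) / T

print(f"ME={ME:.2f}  MAD={MAD:.2f}  RMSE={RMSE:.2f}  MPE={MPE:.1f}%  MAPE={MAPE:.1f}%")
```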
As clearly suggested by Eq. (11.2), ME is just the plain average of the errors over the sample, so that positive and negative errors cancel each other. As a consequence, ME does not discriminate between quite different error patterns like those illustrated in Table 11.1: the average error is zero in both cases (a) and (b), but we cannot say that the accuracy is the same. Indeed, ME measures systematic deviation, or bias, not the accuracy of forecasts. A significantly positive ME tells us that we are systematically under-forecasting demand; a significantly negative ME means that we are systematically over-forecasting it. This is an important piece of information, because it shows that we should definitely revise the forecasting process, whereas a very small ME does not necessarily imply that we are doing a good job.

To measure accuracy, we should get rid of the sign of the errors. This may be accomplished by taking the absolute value of the errors or by squaring them. The first idea leads to MAD; the second one leads to RMSE, where the square root is taken in order to express error and demand in the same units. In case (a) of Table 11.1, MAD is 0; on the contrary, the reader is invited to verify that MAD is strictly positive in case (b), showing that MAD can tell the difference between the two sequences of errors. Compared with MAD, RMSE tends to penalize large errors more heavily, as we may appreciate from Table 11.2: cases (a) and (b) share the same ME and MAD, but RMSE is larger in case (b), which indeed features larger errors.
Table 11.1 Mean error measures bias, not accuracy.
Table 11.2 RMSE penalizes large errors more than MAD does.
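The following sketch makes this point concrete on made-up error sequences (not the actual entries of Tables 11.1 and 11.2): ME cannot separate the three patterns, MAD separates the first from the other two, and RMSE also separates the last two.

```python
import math

# Illustrative error sequences (not the actual entries of Tables 11.1 and 11.2).
zero_errors  = [0, 0, 0, 0]          # perfect forecasts
cancelling   = [10, -10, 10, -10]    # errors cancel out on average
large_swings = [20, 0, -20, 0]       # same ME and MAD as `cancelling`, larger errors

def me(e):   return sum(e) / len(e)
def mad(e):  return sum(abs(x) for x in e) / len(e)
def rmse(e): return math.sqrt(sum(x * x for x in e) / len(e))

for name, e in [("zero", zero_errors), ("cancelling", cancelling), ("large swings", large_swings)]:
    print(f"{name:>12}: ME={me(e):5.1f}  MAD={mad(e):5.1f}  RMSE={rmse(e):6.2f}")
# ME is 0 in all three cases; MAD separates the first pattern from the other
# two; RMSE further separates the last two, penalizing the larger errors.
```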
The reader will certainly notice some similarity between RMSE and standard deviation. Comparing the definition in Eq. (11.4) and the familiar definition of sample variance and sample standard deviation in inferential statistics, one could even be tempted to divide by T − 1, rather than by T. This temptation should be resisted, however, since it misses a couple of important points.
- In inferential statistics we have a random sample consisting of independent observations of identically distributed variables. In forecasting there is no reason to think that the expected value of the variable we are observing is stationary. We might be chasing a moving target, and this has a profound impact on forecasting algorithms.
- Furthermore, in inferential statistics we use the whole sample to calculate the sample mean, which is then used to evaluate the sample standard deviation; a bias issue results, which must be corrected by dividing by T − 1. Here, on the contrary, we use past demand information to generate the forecast $F_{t,h}$; then we evaluate the forecasting error with respect to a new demand observation $Y_t$, which was not used to generate $F_{t,h}$. A small numerical illustration follows this list.
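The sketch below illustrates the difference on a made-up, systematically positive error sample: RMSE measures deviations from zero and therefore picks up the bias, whereas the sample standard deviation of the errors, computed around their own mean with the T − 1 divisor, does not.

```python
import math

# Contrast RMSE (deviations from zero, divided by T) with the sample standard
# deviation of the errors (deviations from their mean, divided by T - 1).
# The error sample is made up for illustration.
errors = [4, 6, 5, 7, 3]          # systematically positive: a biased forecast
T = len(errors)

rmse = math.sqrt(sum(e * e for e in errors) / T)

mean_e = sum(errors) / T
sample_std = math.sqrt(sum((e - mean_e) ** 2 for e in errors) / (T - 1))

print(f"RMSE       = {rmse:.3f}")        # ~5.20: the bias inflates RMSE
print(f"sample std = {sample_std:.3f}")  # ~1.58: centering hides the bias
```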
By the same token, MAD as defined in Eq. (11.3) should not be confused with MAD as defined in Eq. (4.4), where deviations are taken with respect to a sample mean. In fact, in the forecasting literature the name mean absolute error (MAE) is sometimes used instead of MAD, in order to avoid this ambiguity.
A common feature of ME, MAD, and RMSE is that they measure the forecast error in the same units of measurement as demand. Now imagine that MAD is 10; is that good or bad? It is very hard to say: if MAD is 10 when demand per time bucket is something like 1000, this is pretty good; if MAD is 10 and demand is something like 10, this is pretty awful. This is why MPE and MAPE have been proposed: by dividing forecast errors by demand, we obtain a relative error. At first sight, MPE and MAPE look like quite sensible measures. However, they do have some weak spots:
- They cannot be adopted when demand during a time bucket can be zero. This may well be the case when the time bucket is small and/or demand is sporadic. In modern retail chains, replenishment occurs so often, and assortment variety is so high, that this situation is far from being a rare occurrence.
- Even when demand is nonzero, these indices can yield weird results if demand shows wide variations. A somewhat pathological example is shown in Table 11.3. The issue here is that forecast 1 is almost always perfect, but in one unlucky case it makes a big mistake, right when demand is low; the error is 9 when demand is 1, so the percentage error is an astonishing 900%, which yields MAPE = 90% when averaged over the 10 time buckets. Forecast 2, on the contrary, is systematically wrong, but it happens to be right at the right moment; its MAPE is only 18%, suggesting that the second forecaster is better than the first one, which is highly debatable. A small numerical sketch of this effect follows this list.
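The sketch below reproduces this kind of pathology on a constructed dataset (illustrative figures, not the actual entries of Table 11.3): nine buckets with steady demand and one bucket with very low demand.

```python
# A dataset constructed to mimic the pattern discussed in the text
# (illustrative figures, not the actual entries of Table 11.3).
demand     = [100] * 9 + [1]
forecast_1 = [100] * 9 + [10]   # perfect, except one big miss when demand is low
forecast_2 = [ 80] * 9 + [1]    # systematically off by 20%, right when demand is low

def mape(y, f):
    return 100 * sum(abs(yt - ft) / yt for yt, ft in zip(y, f)) / len(y)

def mad(y, f):
    return sum(abs(yt - ft) for yt, ft in zip(y, f)) / len(y)

print(f"MAPE 1 = {mape(demand, forecast_1):.0f}%   MAD 1 = {mad(demand, forecast_1):.1f}")
print(f"MAPE 2 = {mape(demand, forecast_2):.0f}%   MAD 2 = {mad(demand, forecast_2):.1f}")
# MAPE ranks forecast 2 (18%) far ahead of forecast 1 (90%), even though
# MAD tells the opposite story (0.9 versus 18.0).
```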
A possible remedy to the two difficulties above is to consider a ratio of averages, rather than an average of ratios. We may introduce the following performance measures:
$$\mathrm{ME\%} = \frac{\mathrm{ME}}{\bar{Y}} \times 100\%, \qquad \mathrm{MAD\%} = \frac{\mathrm{MAD}}{\bar{Y}} \times 100\%, \qquad \mathrm{RMSE\%} = \frac{\mathrm{RMSE}}{\bar{Y}} \times 100\%.$$
The idea is to take performance measures that are expressed in absolute terms and make them relative by dividing them by the average demand
$$\bar{Y} = \frac{1}{T}\sum_{t=1}^{T} Y_t.$$
Table 11.3 Illustrating potential dangers in using MPE and MAPE.
With these measures, it is very unlikely that the average demand over the T time buckets is zero; if this occurs, there is a big problem, but not with forecasting.
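As a sketch, the following code applies the relative measures MAD% and RMSE% to the same constructed data used above for the MAPE example; the ranking of the two forecasts is now the sensible one.

```python
import math

# Ratio-of-averages measures applied to the illustrative data used above.
demand     = [100] * 9 + [1]
forecast_1 = [100] * 9 + [10]
forecast_2 = [ 80] * 9 + [1]

def relative_measures(y, f):
    T = len(y)
    errors = [yt - ft for yt, ft in zip(y, f)]
    mean_demand = sum(y) / T
    mad  = sum(abs(e) for e in errors) / T
    rmse = math.sqrt(sum(e * e for e in errors) / T)
    return 100 * mad / mean_demand, 100 * rmse / mean_demand

for name, f in [("forecast 1", forecast_1), ("forecast 2", forecast_2)]:
    mad_pct, rmse_pct = relative_measures(demand, f)
    print(f"{name}: MAD% = {mad_pct:.1f}%   RMSE% = {rmse_pct:.1f}%")
# Dividing by average demand (90.1 here) restores a sensible ranking:
# forecast 1 scores about 1% on MAD%, forecast 2 about 20%.
```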
As we can see, there is a certain variety of measures, and we should mention that others have been proposed. As is common in management, we should keep several indicators under control in order to get a complete picture. We close the section with a few additional remarks:
- Checking forecast accuracy is a way to gauge the inherent uncertainty in demand; this is helpful when we want to come up with something more than just a point forecast.
- We should never forget that a forecast is an input to a decision process. Hence, alternative measures might consider the effect of a wrong forecast, most notably from the economic perspective. However, how this is accomplished depends on the strategies used to hedge against forecast errors.