Prioritize...
After you have read this section, you should be able to assess the skill of a categorical forecast using the appropriate techniques.
Read...
You will most likely encounter other forecast types besides deterministic forecasts. Categorical forecasts, such as the occurrence of rain, are another common type of forecast in meteorology, and they rely on a specific set of metrics to assess skill. Read on to learn about the techniques commonly used to assess categorical forecasts.
Joint Histogram
For categorical forecasts, scatterplots and error histograms do not make sense. So, the exploratory error analysis is accomplished in a different way. Typically, we compute the joint histogram of the forecast categories and observed categories. The forecast categories are shown as columns, and the observed categories are rows. Each cell formed by the intersection of a row and column contains the count of how many times that column’s forecast was followed by that row’s outcome.
Although the joint histogram is formally called the confusion matrix, it need not be confusing at all. The cells on the diagonal (upper left to lower right) count cases where the forecast was correct. The cell containing event forecasts that failed to materialize captures ‘false alarms’ (bottom left), and the cell containing the count of events that were not forecasted captures ‘missed events’ (top right). Below is an example of the matrix where a, b, c, and d represent the count of each cell.
Total Count=N=a+b+c+d | Forecast: (Event Occurred) | Forecast: (Event Did Not Occur) |
---|---|---|
Observed: (Event Occurred) | TP=a | FN=b |
Observed: (Event Did Not Occur) | FP=c | TN=d |
(The diagonal cells, TP and TN, are correctly forecasted; the off-diagonal cells, FN and FP, are wrongly forecasted.)
Let’s work through each cell starting in the top left. TP stands for true positive - the event was forecasted to occur and it did. FN stands for false negative - the event was not forecasted to occur, but it did (missed events). FP stands for false positive - the event was forecasted to occur, but it did not occur (false alarms). Finally, TN stands for true negative - the event was not forecasted to occur, and it did not occur. We will use these acronyms later on, so I encourage you to make use of this table and go back to it when necessary.
All of the popular forecast metrics for categorical forecasts are computed from this matrix, so it’s important that you can read the matrix (it is not necessary to memorize, although you will probably end up memorizing it unintentionally). The confusion matrix can be generalized to forecasts with more than two categories, but we won’t go into that here.
Let’s work through an example. We are going to look at daily rainfall in Ames, IA from the MOS GFS and compare the predictions to observations (same data source as the previous example). You can download the rainfall forecast here and the observations here. A 1 means rain occurred (was predicted) and a 0 means rain did not occur (was not predicted).
Begin by loading in the data using the code below:
# load in forecast data
load("GFSMOS_KAMW_rain.RData")
# load in observed data (https://mesonet.agron.iastate.edu/agclimate/hist/dailyRequest.php)
load("ISUAgClimate_KAMW_rain.RData")
Now compute the values of a, b, c, and d for the confusion matrix by running the code below.
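The exact object names stored in the two .RData files aren't shown here, so the sketch below assumes the forecasts and observations load as 0/1 vectors named fcst_rain and obs_rain, aligned day by day (check with ls() after loading to see what the files actually contain):

# NOTE: fcst_rain and obs_rain are assumed object names; check with ls() after loading
a <- sum(fcst_rain == 1 & obs_rain == 1)   # true positives: rain predicted and observed
b <- sum(fcst_rain == 0 & obs_rain == 1)   # false negatives: rain observed but not predicted
c <- sum(fcst_rain == 1 & obs_rain == 0)   # false positives: rain predicted but not observed
d <- sum(fcst_rain == 0 & obs_rain == 0)   # true negatives: no rain predicted and none observed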
If we fill in the matrix, we would get the following table:
Total Count=N=a+b+c+d=3653 | Rain Predicted | No Rain Predicted |
---|---|---|
Rain Observed | TP=489 | FN=422 |
No Rain Observed | FP=326 | TN=2396 |
You should also make sure a+b+c+d adds up to the length of the data set. Run the code below to check.
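For example, assuming the same obs_rain object as above, the following comparison should return TRUE:

# every forecast-observation pair should fall in exactly one cell of the matrix
a + b + c + d == length(obs_rain)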
And there you have it! Your first confusion matrix.
Confusion Matrix Metrics
Now that we have the confusion matrix, we want to create some quantifiable metrics that allow us to assess the categorical forecast. The first metric we will talk about is the Percent Correct. Percent Correct is just the ratio of correct forecasts to all forecasts. It answers the question: overall, what fraction of the forecasts were correct? The range is 0 to 1 with a perfect score equal to 1. The closer to 1 the better. Mathematically (referring back to the confusion matrix):
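$$\text{Percent Correct} = \frac{TP + TN}{N} = \frac{a + d}{a + b + c + d}$$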
It is a great metric if events and non-events are both fairly likely and equally important to the user. But for rare events, you can get a great percent correct value simply by always forecasting non-events, which is not helpful at all. Likewise, if missed events and false alarms have different importance to the user (think tornadoes), then a metric that weights them equally isn't a good idea.
Therefore, numerous other metrics have been derived that can be computed from the confusion matrix. One example is the Hit Rate. The Hit Rate is the ratio of true positives to the sum of the true positives and false negatives. It answers the question: what fraction of the observed ‘yes’ events were correctly forecasted? The range is 0 to 1 and a perfect score is 1. The closer to 1 the better. Mathematically:
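$$\text{Hit Rate} = \frac{TP}{TP + FN} = \frac{a}{a + b}$$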
Another metric is the False Alarm Rate, which measures how often the forecast predicted an event that was not observed, relative to all observed non-events. Think of the False Alarm Rate as the forecast system crying wolf. It answers the question: what fraction of the observed 'no' events were incorrectly predicted as 'yes'? The range is 0 to 1 with a perfect score of 0. The closer to 0 the better. Mathematically:
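$$\text{False Alarm Rate} = \frac{FP}{FP + TN} = \frac{c}{c + d}$$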
Another metric is the Threat Score. It is the number of hits (event forecasted and observed) divided by the sum of hits, false alarms (event forecasted but not observed), and misses (event observed but not forecasted). It answers the question: how well did the forecasted 'yes' events correspond to the observed 'yes' events? The range is 0 to 1 with a perfect score of 1. The closer to 1 the better. Mathematically:
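$$\text{Threat Score} = \frac{TP}{TP + FN + FP} = \frac{a}{a + b + c}$$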
The threat score is more appropriate than percent correct for rare events.
Now let’s try computing these ourselves. Continuing on with the rainfall example, run the code below to estimate the four different metrics discussed above.
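A minimal sketch of those calculations, reusing the a, b, c, and d counts computed for the confusion matrix:

N <- a + b + c + d                    # total number of forecast-observation pairs
percent_correct <- (a + d) / N        # fraction of all forecasts that were correct
hit_rate <- a / (a + b)               # fraction of observed rain days that were forecast
false_alarm_rate <- c / (c + d)       # fraction of observed dry days forecast as rain
threat_score <- a / (a + b + c)       # hits relative to hits + misses + false alarms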
The percent correct is 0.79 (perfect score = 1), the hit rate is 0.53 (perfect score = 1), the false alarm rate is 0.12 (perfect score = 0), and the threat score is 0.39 (perfect score = 1). You can see that the percent correct is quite high, but upon further investigation, we see it might be inflated by the large number of correctly predicted 'no' events (TN).