VERIFICATION SYSTEMS FOR LONG-RANGE FORECASTS
Experimental scores to be exchanged in the first place
CBS-Ext.(98) adopted procedures that defined the Core Standardized Verification System (SVS) for long-range forecasts, as proposed jointly by CAS, CCl and CBS experts. The Core SVS was designed to provide a straightforward assessment system for all predictions in the medium-range and longer timescales; nevertheless, it can also be used at the short range. The objectives of the SVS, including its two prime objectives, are covered in detail in Annex 1.
Verification histories may be produced through a combination of hindcasts and real-time forecasts. However, the forecast method should remain consistent throughout the entire history period, with hindcasts using no information that would not have been available for a real-time forecast produced at that time. If real-time forecasts are used within the verification history, they should not also be included in the verification record of real-time forecasts.
Climatologies should be calculated consistently within the verification history. Data set statistics, such as means and standard deviations, should be calculated across the period of the verification history and should be applied to verification of subsequent real-time forecasts.
Where bias correction, statistical post-processing or other forms of intervention are applied that result in differences in forecast production methodology between the verification history and real-time forecast periods, an attempt may be made to verify the unmodified forecast system in addition to the real-time system, with results presented for both.
The SVS is formulated in four parts: the verification diagnostics, the key forecast parameters to be verified, the verification and climatological data sets, and the details of the forecast systems.
Two diagnostics are incorporated in the Core SVS: Relative Operating Characteristics and Root Mean Square Skill Scores. Both provide standardized values permitting direct intercomparison of results across different predicted variables, geographical regions, forecast ranges, etc. Both may be applied in verification of most forecasts, and it is proposed that, except where inappropriate, both diagnostics be used on all occasions.
A number of contingency table-based diagnostics are listed within Annex 4 in addition to Hit and False Alarm Rates, including the Kuipers Score and Percent Correct (both used in assessing deterministic forecasts), and these provide valuable, readily-assimilable information for developers, producers and users of long-range forecasts. They may be considered for inclusion within information supplied to users.
Root Mean Square Skill Scores provide useful data to the developer and producer but are thought to carry less information for the user, particularly users served by NMHSs. Hence provision of Root Mean Square Skill Scores to users is optional.
The key list of parameters in the Core SVS is provided below. Any verification for these key parameters, for either the verification history or for real-time forecasts, should be assessed using both Core SVS techniques wherever possible (given exceptions noted above). Many long-range forecasts are produced which do not include parameters in the key list (for example, there are numerous empirical systems that predict seasonal rainfall over part of, or over an entire, country). The Core SVS diagnostics should be used to assess these forecasts also, but full details of the predictions will need to be provided.
Mean sea-level pressure
Southern Oscillation Index
In using Relative Operating Characteristics, a definition of the binary 'event' being predicted is required. While flexibility in defining the event is proposed, the recommendation is that each event be defined either as above or below normal, or as a tercile of the climatological distribution.
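As an illustration of the tercile-based event definition, the following sketch (assuming NumPy; the function and variable names are illustrative, not part of the Standard) tests whether a value falls in the upper tercile of a climatological sample:

```python
import numpy as np

def tercile_event(climatology, value):
    """Binary event 'value falls in the upper tercile' of the
    climatological distribution (one illustrative event choice;
    the SVS permits other definitions)."""
    lower, upper = np.percentile(climatology, [100 / 3, 200 / 3])
    return value > upper

# Illustrative 30-value climatology (e.g. 30 years of a seasonal mean)
clim = np.arange(30.0)
print(tercile_event(clim, 25.0))  # True: 25 exceeds the upper tercile
```

Events for the other terciles, or for above/below normal, follow the same pattern with different threshold choices.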
Additional diagnostics that might aid centres in verification of long-range forecasts are listed in Annex 4.
Verification Data Sets
The key list of data sets to be used in the Core SVS for both climatological and verification information is provided below. The same data should be used for both climatology and verification, although each centre's own analyses (where available) and the ECMWF and NCEP/NCAR Reanalyses and subsequent analyses may be used when other data are not available. Many seasonal forecasts are produced that may not use the data in either the key climatology or verification data sets (for example, there are numerous systems which predict seasonal rainfall over part of, or over an entire, country). Appropriate data sets should then be used, with full details provided.
6. Sea-level pressure
When gridded data sets are used, a 2.5° by 2.5° grid is recommended.
Information will be requested for exchange of scores concerning the following details of the forecast system; information labelled * should also be attached to user information:
OBJECTIVES OF THE STANDARDIZED VERIFICATION SYSTEM
The Standardized Verification System has two major objectives:
1. To provide a standardized method whereby forecast producers can exchange information on the quality of longer-range predictions on a regular basis and can also report results to WMO annually as part of a consolidated annual summary;
2. To provide a standardized method whereby forecast producers can add information on the inherent qualities of their forecasts for the information and advice of recipients.
In order to achieve the first major objective, the SVS incorporates two diagnostics, a series of recommended forecast parameters, and verification and climatological statistics against which to assess the forecasts. These can be applied to real-time forecasts, either individually or, preferably, accumulated over a sequence of predictions.
The second major objective is achieved using the same diagnostics, forecast parameters and verification and climatological statistics, but applied to historical tests of the system. It must be made clear whether or not the historical tests are based on methods that can be considered to represent a true forecast had the test been run in real time. Producers will be requested to add this information to issued predictions; recommendations for methods by which this might be done may be formulated later.
Other objectives of the Standardized Verification System are:
3. To encourage both regular verification of forecasts and verification according to international standards;
4. To encourage information on inherent forecast quality to be added to all predictions as a matter of course and to encourage forecast recipients to expect receipt of the information;
5. To encourage producers to use consistent data sets and to encourage production of these data sets;
6. To provide verifications that permit direct intercomparison of forecast quality regardless of predicted variable, method, forecast range, geographical region, or any other consideration;
7. To encourage producers to work towards a common method for presenting forecasts.
RELATIVE OPERATING CHARACTERISTICS
The derivation of Relative Operating Characteristics is given below. For purposes of reporting forecast quality for exchange between centres and for annual submission to WMO the following will be required:
1. For deterministic forecasts Hit Rates and False Alarm Rates together with essential details of the forecast parameter and verification data sets;
2. For probabilistic forecasts, Hit Rates and False Alarm Rates for each probability interval used. Frequent practice, as illustrated below, is for probability intervals of 10 per cent to be used. However, other intervals may be used as appropriate (for example, for nine-member ensembles, whose probabilities fall in multiples of one ninth, intervals of 11.1 per cent could be more realistic). Additionally, the area under the curve should be calculated.
Relative Operating Characteristics (ROC), derived from signal detection theory, are intended to provide information on the characteristics of systems upon which management decisions can be taken. In the case of weather forecasts, the decision might relate to the most appropriate manner in which to use a forecast system for a given purpose. ROCs are useful in contrasting characteristics of deterministic and probabilistic systems.
Take the following 2x2 contingency table for any yes/no forecast for a specific binary event:
The binary 'event' can be defined quite flexibly, e.g. as positive/negative anomalies, anomalies greater/less than a specific amount, values between two limits, etc. If terciles are used then the binary event can be defined in terms of predictions of one tercile against the remaining two.
Using stratification by observation (rather than by forecast), the following can be defined:
Hit Rate = H/(H + M)
False Alarm Rate = FA/(FA + CR)
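Under the stratification above, the two rates follow directly from the table counts. A minimal sketch (the function name and arguments are illustrative):

```python
def roc_rates(H, M, FA, CR):
    """Hit Rate and False Alarm Rate from a 2x2 contingency table,
    stratified by observations: H = hits, M = misses,
    FA = false alarms, CR = correct rejections."""
    hit_rate = H / (H + M)
    false_alarm_rate = FA / (FA + CR)
    return hit_rate, false_alarm_rate

# Example: 40 hits, 10 misses, 20 false alarms, 30 correct rejections
hr, far = roc_rates(40, 10, 20, 30)
print(hr, far)  # 0.8 0.4
```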
For deterministic forecasts only the Hit Rate and False Alarm Rate need be calculated; for probabilistic forecasts the procedure outlined below should be followed.
A probabilistic forecast can be converted into a 2x2 table as follows. Tabulate probabilities in, say, 10% ranges stratified against observations, i.e.:
For any threshold, such as 50% (indicated by the dotted line in the table), the Hit Rate (False Alarm Rate) can be calculated as the sum of the O's (NO's) at and above the threshold value divided by Σ Oi (Σ NOi) - in other words, for a value of 50% the calculation is as if the event is predicted given any forecast probability of 50% or more. So for the above case:
Hit Rate = (O10 + O9 + O8 + O7 + O6) / Σ Oi
False Alarm Rate = (NO10 + NO9 + NO8 + NO7 + NO6) / Σ NOi
This calculation can be repeated at each threshold and the points plotted to produce the ROC curve, which, by definition, must pass through the points (0,0) and (100,100) (for events being predicted only for 100% probabilities and for all probabilities exceeding 0% respectively). The further the curve lies towards the upper left-hand corner the better; no-skill forecasts are indicated by a diagonal line.
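The threshold-stepping procedure above can be sketched as follows (a minimal NumPy sketch with illustrative names; 10 per cent probability intervals are assumed, as in the example):

```python
import numpy as np

def roc_points(probs, outcomes, thresholds=np.arange(0.0, 1.01, 0.1)):
    """(False Alarm Rate, Hit Rate) pairs for a probabilistic forecast.
    probs: forecast probabilities of the event; outcomes: 1 if the
    event occurred, 0 otherwise.  At each threshold the event is
    treated as 'predicted' whenever the forecast probability meets
    or exceeds the threshold."""
    probs = np.asarray(probs)
    outcomes = np.asarray(outcomes)
    points = []
    for t in thresholds:
        predicted = probs >= t
        hr = np.sum(predicted & (outcomes == 1)) / max(np.sum(outcomes == 1), 1)
        far = np.sum(predicted & (outcomes == 0)) / max(np.sum(outcomes == 0), 1)
        points.append((far, hr))
    return points

points = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# At the 50% threshold the event is 'predicted' for the first two
# forecasts only, giving Hit Rate 1.0 and False Alarm Rate 0.0
```

Plotting the resulting points (False Alarm Rate on the horizontal axis, Hit Rate on the vertical) traces the ROC curve described above.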
Areas under ROC curves can be calculated using the trapezium rule. Areas should be standardized against the total area of the figure such that a perfect forecast system (i.e. one that has a curve through the top-left-hand corner of the figure) has an area of one and a curve lying along the diagonal (no information) has an area of 0.5. Alternatively, but not recommended in the Standard, the 0.5 to 1.0 range can be rescaled to 0 to 1 (thus allowing negative values to be allocated to cases with the curve lying below the diagonal - such curves can be generated). Not only can the areas be used to contrast different curves but they are also a basis for Monte Carlo significance tests. Monte Carlo testing should be done within the forecast data set itself.
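The standardized area calculation by the trapezium rule can be sketched as follows (function and variable names are illustrative; the curve is assumed to be on a 0-1 scale):

```python
def roc_area(points):
    """Standardized area under a ROC curve by the trapezium rule.
    points: (False Alarm Rate, Hit Rate) pairs on a 0-1 scale;
    the end points (0,0) and (1,1) are appended if absent.
    1.0 indicates a perfect system, 0.5 no information."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# A no-skill curve along the diagonal has area 0.5
print(roc_area([(0.25, 0.25), (0.5, 0.5), (0.75, 0.75)]))  # 0.5
```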
In order to handle spatial forecasts, predictions for each point within the grid should be treated as individual forecasts but with all results combined into the final outcome. Categorical predictions can be treated for each category separately.
ROOT MEAN SQUARE SKILL SCORES
Root Mean Square Skill Scores are calculated from:
[1 - RMS(forecast) / RMS(standard)] * 100
RMS(forecast) refers to the RMS error of the forecast. RMS(standard) refers to the RMS error of the standard when verified against the same observations as the forecast - the standard can be either climatology or persistence. When persistence is used, it should be defined in a manner appropriate to the time-scale of the prediction, although it is left to the producer to determine whether persistence over, perhaps, one month or an entire season is used in assessing a seasonal prediction. No portion of the persistence period should overlap into the forecast period, and the forecast range should be calculated from no sooner than the time at which any observed information (i.e. information which could not be known at the time of a real forecast) is no longer included. Both of these requirements are placed to ensure that all forecasts and test predictions use only data that were available at the time of the prediction, or would have been available at that time had a prediction been made (in the case of historical tests).
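A minimal sketch of the skill-score calculation, assuming NumPy and illustrative names, with climatology taken as the standard:

```python
import numpy as np

def rmsss(forecast, standard, observed):
    """Root Mean Square Skill Score (%) of a forecast against a
    standard (climatology or persistence), both verified against
    the same observations."""
    forecast = np.asarray(forecast)
    standard = np.asarray(standard)
    observed = np.asarray(observed)
    rms_fcst = np.sqrt(np.mean((forecast - observed) ** 2))
    rms_std = np.sqrt(np.mean((standard - observed) ** 2))
    return (1.0 - rms_fcst / rms_std) * 100.0

# A forecast with half the RMS error of climatology scores 50%
obs = np.array([1.0, -1.0, 2.0, 0.0])
clim = np.zeros(4)                 # climatological (zero-anomaly) standard
fcst = obs + 0.5 * (clim - obs)    # halfway between obs and climatology
print(rmsss(fcst, clim, obs))      # 50.0
```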
1. Categorical forecasts
Linear Error in Probability Space applied to Categorical Forecasts (LEPSCAT)
2. Probability Forecasts of Binary Predictands
Brier Skill Score with respect to Climatology
Sharpness (measure to be decided)
Continuous Ranked Probability Score
3. Probability of Multiple-Category Predictands
Ranked Probability Score
Ranked Probability Skill Score with respect to Climatology
4. Continuous Forecasts in Space
Murphy-Epstein Decomposition (phase error, amplitude error, bias error)
5. Continuous Forecasts in Time
Mean Square Error
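As one worked illustration of the additional diagnostics listed above, the Brier Skill Score with respect to climatology for a binary predictand can be sketched as follows (a minimal sketch with illustrative names; it is not part of the Core SVS requirements):

```python
import numpy as np

def brier_skill_score(probs, outcomes, clim_prob):
    """Brier Skill Score with respect to climatology for a binary
    predictand.  probs: forecast probabilities of the event;
    outcomes: 1/0 event occurrence; clim_prob: climatological
    probability of the event.  Positive values indicate skill
    over the climatological forecast."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bs_forecast = np.mean((probs - outcomes) ** 2)
    bs_climatology = np.mean((clim_prob - outcomes) ** 2)
    return 1.0 - bs_forecast / bs_climatology

# Confident, correct forecasts beat a 0.5 climatological probability
print(brier_skill_score([0.9, 0.8, 0.1], [1, 1, 0], 0.5))
```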