
VERIFICATION SYSTEMS FOR LONG-RANGE FORECASTS

Experimental scores to be exchanged in the first place

Preamble

CBS-Ext.(98) adopted procedures that defined the Core Standardized Verification System (SVS) for long-range forecasts, as proposed jointly by CAS, CCl and CBS experts. The Core SVS was designed to provide a straightforward assessment system for all predictions in the medium-range and longer timescales; nevertheless it can also be used at the short range. Objectives of the SVS are covered in detail in Annex 1. The two prime objectives are:

1. To provide a standardized method whereby forecast producers can exchange information on the quality of longer-range predictions on a regular basis and can also report results to WMO annually as part of a consolidated annual summary;

2. To provide a standardized method whereby forecast producers can add information on the inherent qualities of their forecasts for the information and advice of recipients.
Proposed Principles

Verification histories may be produced through a combination of hindcasts and real-time forecasts. However, the forecast method should remain consistent throughout the entire history period, with hindcasts using no information that would not have been available for a real-time forecast produced at that time. If real-time forecasts are used within the verification history then they should not be included in the verification record of real-time forecasts. Climatologies should be calculated consistently within the verification history. Data set statistics, such as means and standard deviations, should be calculated across the period of the verification history and should be applied to the verification of subsequent real-time forecasts. Where bias correction, statistical post-processing or other forms of intervention are applied which result in differences in forecast production methodology between the verification history and real-time forecast periods, an attempt may be made to verify the unmodified forecast system in addition to the real-time system, with results presented for both.

Formulation

The SVS is formulated in four parts:
Diagnostics

Two diagnostics are incorporated in the Core SVS: Relative Operating Characteristics and Root Mean Square Skill Scores. Both provide standardized values permitting direct intercomparison of results across different predicted variables, geographical regions, forecast ranges, etc. Both may be applied in the verification of most forecasts, and it is proposed that, except where inappropriate, both diagnostics are used on all occasions.
Parameters

The key list of parameters in the Core SVS is provided below. Any verification for these key parameters, for either the verification history or for real-time forecasts, should be assessed using both Core SVS techniques wherever possible (given the exceptions noted above). Many long-range forecasts are produced which do not include parameters in the key list (for example, there are numerous empirical systems that predict seasonal rainfall over part of, or over an entire, country). The Core SVS diagnostics should be used to assess these forecasts also, but full details of the predictions will need to be provided.
NINO1+2
Mean sea-level pressure
Southern Oscillation Index
In using Relative Operating Characteristics a definition of the binary 'event' being predicted is required. While flexibility in defining the event is proposed, the recommendation is that each event be defined either as above or below normal or as a tercile of the climatological distribution. Additional diagnostics that might aid centres in the verification of long-range forecasts are listed in Annex 4.

Verification Data Sets

The key list of data sets to be used in the Core SVS for both climatological and verification information is provided below. The same data should be used for both climatology and verification, although the centre's analysis (where available) and the ECMWF and NCEP/NCAR Reanalyses and subsequent analyses may be used when other data are not available. Many seasonal forecasts are produced that may not use the data in either the key climatology or verification data sets (for example, there are numerous systems which predict seasonal rainfall over part of, or over an entire, country). Appropriate data sets should then be used, with full details provided.
6. Sea-surface Pressure
When gridded data sets are used, a 2.5° by 2.5° grid is recommended.

System Details

Information will be requested, for the exchange of scores, concerning the following details of the forecast system; information labelled * should also be attached to user information:
_______________
Annexes: 4
ANNEX 1
OBJECTIVES OF THE STANDARDIZED VERIFICATION SYSTEM

The Standardized Verification System has two major objectives:

1. To provide a standardized method whereby forecast producers can exchange information on the quality of longer-range predictions on a regular basis and can also report results to WMO annually as part of a consolidated annual summary;

2. To provide a standardized method whereby forecast producers can add information on the inherent qualities of their forecasts for the information and advice of recipients.

In order to achieve the first major objective, the SVS incorporates two diagnostics and a series of recommended forecast parameters and verification and climatological statistics against which to assess the forecasts, which can be applied to real-time forecasts, either on an individual basis or, preferably, accumulated over a sequence of predictions. The second major objective is achieved using the same diagnostics, forecast parameters and verification and climatological statistics, but applied to historical tests of the system. It is made clear whether the historical tests are based on methods that can be considered to represent a true forecast, had the test been run in real time, or otherwise. Producers will be requested to add this information to issued predictions; recommendations for methods by which this might be done may be formulated later.

Other objectives of the Standardized Verification System are:

3. To encourage both regular verification of forecasts and verification according to international standards;

4. To encourage information on inherent forecast quality to be added to all predictions as a matter of course, and to encourage forecast recipients to expect receipt of the information;

5. To encourage producers to use consistent data sets and to encourage production of these data sets;

6. To provide verifications that permit direct intercomparison of forecast quality regardless of predicted variable, method, forecast range, geographical region, or any other consideration;

7. To encourage producers to work towards a common method for presenting forecasts.

________________
ANNEX 2
RELATIVE OPERATING CHARACTERISTICS

The derivation of Relative Operating Characteristics is given below. For the purposes of reporting forecast quality, for exchange between centres and for annual submission to WMO, the following will be required:

1. For deterministic forecasts: Hit Rates and False Alarm Rates, together with essential details of the forecast parameter and verification data sets;

2. For probabilistic forecasts: Hit Rates and False Alarm Rates for each probability interval used. Frequent practice, as illustrated below, is for probability intervals of 10 per cent to be used. However, other intervals may be used as appropriate (for example, for nine-member ensembles an interval of 33.3% could be more realistic). Additionally, the area under the curve should be calculated.

Relative Operating Characteristics (ROC), derived from signal detection theory, are intended to provide information on the characteristics of systems upon which management decisions can be taken. In the case of weather forecasts, the decision might relate to the most appropriate manner in which to use a forecast system for a given purpose. ROCs are useful in contrasting the characteristics of deterministic and probabilistic systems. Take the following 2x2 contingency table for any yes/no forecast of a specific binary event:

                       Event observed     Event not observed
Event forecast         Hits (H)           False Alarms (FA)
Event not forecast     Misses (M)         Correct Rejections (CR)
The binary 'event' can be defined quite flexibly, e.g. as positive/negative anomalies, anomalies greater/less than a specific amount, values between two limits, etc. If terciles are used then the binary event can be defined in terms of predictions of one tercile against the remaining two. Using stratification by observation (rather than by forecast), the following can be defined:

Hit Rate = H/(H + M)
False Alarm Rate = FA/(FA + CR)

For deterministic forecasts only the Hit Rate and False Alarm Rate need be calculated; for probabilistic forecasts the procedure outlined below should be followed. A probabilistic forecast can be converted into a 2x2 table as follows. Tabulate probabilities in, say, 10% ranges stratified against observations, i.e.:

Forecast probability (%)   Event observed   Event not observed
90-100                     O_10             NO_10
80-90                      O_9              NO_9
70-80                      O_8              NO_8
60-70                      O_7              NO_7
50-60                      O_6              NO_6
...........................................................
40-50                      O_5              NO_5
30-40                      O_4              NO_4
20-30                      O_3              NO_3
10-20                      O_2              NO_2
0-10                       O_1              NO_1
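As an illustration, the stratification-by-observation scores defined above can be computed directly from the four contingency-table counts. The following is a minimal Python sketch; the counts used are invented example values, not figures from this document.

```python
# Hit Rate and False Alarm Rate from a 2x2 contingency table,
# stratified by observation. All counts below are hypothetical.

def hit_rate(hits: int, misses: int) -> float:
    """H / (H + M): fraction of observed events that were forecast."""
    return hits / (hits + misses)

def false_alarm_rate(false_alarms: int, correct_rejections: int) -> float:
    """FA / (FA + CR): fraction of non-events that were forecast."""
    return false_alarms / (false_alarms + correct_rejections)

H, M, FA, CR = 42, 18, 12, 78       # example counts, not real data
print(hit_rate(H, M))               # 42 / 60
print(false_alarm_rate(FA, CR))     # 12 / 90
```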
For any threshold, such as 50% (indicated by the dotted line in the table), the Hit Rate (False Alarm Rate) can be calculated as the sum of the O's (NO's) at and above the threshold value divided by Σ O_i (Σ NO_i); in other words, for a value of 50% the calculation is as if the event is predicted given any forecast probability of 50% or more. So for the above case:

Hit Rate = (O_10 + O_9 + O_8 + O_7 + O_6) / Σ O_i
False Alarm Rate = (NO_10 + NO_9 + NO_8 + NO_7 + NO_6) / Σ NO_i

This calculation can be repeated at each threshold and the points plotted to produce the ROC curve, which, by definition, must pass through the points (0,0) and (100,100) (for events being predicted only for 100% probabilities and for all probabilities exceeding 0%, respectively). The further the curve lies towards the upper left-hand corner the better; no-skill forecasts are indicated by a diagonal line. Areas under ROC curves can be calculated using the trapezium rule. Areas should be standardized against the total area of the figure, such that a perfect forecast system (i.e. one that has a curve passing through the top left-hand corner of the figure) has an area of one and a curve lying along the diagonal (no information) has an area of 0.5. Alternatively, but not recommended in the Standard, the 0.5 to 1.0 range can be rescaled to 0 to 1 (thus allowing negative values to be allocated to cases with the curve lying below the diagonal; such curves can be generated). Not only can the areas be used to contrast different curves, but they are also a basis for Monte Carlo significance tests. Monte Carlo testing should be done within the forecast data set itself. In order to handle spatial forecasts, predictions for each point within the grid should be treated as individual forecasts, but with all results combined into the final outcome. Categorical predictions can be treated for each category separately.

___________________
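The threshold sweep and trapezium-rule area described above can be sketched as follows. This is illustrative Python assuming ten 10% probability bins; the helper names (`roc_points`, `roc_area`) and the bin counts are invented for the example, and fractions are used in place of percentages so that the curve runs from (0,0) to (1,1).

```python
# ROC construction from binned probability forecasts: each threshold
# yields one (False Alarm Rate, Hit Rate) point, computed cumulatively
# from the highest probability bin downwards.

def roc_points(obs_counts, not_obs_counts):
    """(FAR, HR) points, lowest bin first in the inputs.

    obs_counts[i] / not_obs_counts[i]: number of forecasts in the i-th
    probability bin for which the event was / was not observed.
    """
    total_o = sum(obs_counts)
    total_no = sum(not_obs_counts)
    points = [(0.0, 0.0)]          # threshold above the top bin
    cum_o = cum_no = 0
    for o, no in zip(reversed(obs_counts), reversed(not_obs_counts)):
        cum_o += o                 # hits at or above this threshold
        cum_no += no               # false alarms at or above it
        points.append((cum_no / total_no, cum_o / total_o))
    return points                  # ends at (1.0, 1.0)

def roc_area(points):
    """Standardized area under the curve via the trapezium rule:
    0.5 = no skill (diagonal), 1.0 = perfect."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Ten 10% bins, lowest first; counts are hypothetical.
O  = [2, 3, 4, 5, 6, 8, 10, 12, 14, 16]
NO = [16, 14, 12, 10, 8, 6, 5, 4, 3, 2]
print(roc_area(roc_points(O, NO)))
```

A perfect system (all observed events in the top bin, all non-events in the bottom bin) gives an area of 1.0, and identical O/NO distributions give the no-skill value 0.5, matching the standardization described in the text.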
ANNEX 3
ROOT MEAN SQUARE SKILL SCORES

Root Mean Square Skill Scores are calculated from:

RMSSS = 1 - [RMS (forecast) / RMS (standard)]
RMS (forecast) refers to the RMS error of the forecast. RMS (standard) refers to the RMS error of the standard when verified against the same observations as the forecast; the standard can be either climatology or persistence. When persistence is used, it should be defined in a manner appropriate to the timescale of the prediction, although it is left to the producer to determine whether persistence over, perhaps, one month or an entire season is used in assessing a seasonal prediction. No portion of the persistence period should overlap the forecast period, and the forecast range should be calculated from no sooner than the time at which any observed information (i.e. information which could not be known at the time of a real forecast) is no longer included. Both of these requirements are placed to ensure that all forecasts and test predictions use only data that were available at the time of the prediction, or would have been available at that time had a prediction been made (in the case of historical tests).
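A minimal sketch of the calculation, assuming the usual skill-score form RMSSS = 1 - RMS(forecast)/RMS(standard) with a climatological standard verified against the same observations; all series below are invented example anomalies.

```python
import math

def rms_error(forecast, observed):
    """Root mean square error of a forecast series against observations."""
    return math.sqrt(
        sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast)
    )

def rmsss(forecast, standard, observed):
    """1 - RMS(forecast)/RMS(standard): positive when the forecast
    beats the standard, zero when it only matches it."""
    return 1.0 - rms_error(forecast, observed) / rms_error(standard, observed)

obs  = [0.2, -0.1, 0.4, 0.0]    # hypothetical observed anomalies
fcst = [0.1,  0.0, 0.3, 0.1]    # hypothetical forecast anomalies
clim = [0.0,  0.0, 0.0, 0.0]    # climatological-anomaly standard
print(rmsss(fcst, clim, obs))
```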
_________________

ANNEX 4
ADDITIONAL DIAGNOSTICS

1. Categorical Forecasts
Linear Error in Categorical Space for Categorical Forecasts (LEPSCAT)
Bias
Post Agreement
Percent Correct
Kuiper Score

2. Probability Forecasts of Binary Predictands
Brier Score
Brier Skill Score with respect to Climatology
Reliability
Sharpness (measure to be decided)
Continuous Rank Probability Score

3. Probability of Multiple-Category Predictands
Ranked Probability Score
Ranked Probability Skill Score with respect to Climatology

4. Continuous Forecasts in Space
Murphy-Epstein Decomposition (phase error, amplitude error, bias error)
Anomaly Correlation

5. Continuous Forecasts in Time
Mean Square Error
Correlation
Bias
Anomaly Correlation
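As one illustration of the probability diagnostics listed above, the following is a sketch of the Brier Score and the Brier Skill Score with respect to climatology, using their standard definitions; the forecast probabilities and outcomes are invented examples.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and the
    binary outcome (1 = event occurred, 0 = it did not)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    """Skill relative to always forecasting the climatological
    frequency of the event: 1 = perfect, 0 = no better than climatology."""
    clim = sum(outcomes) / len(outcomes)
    bs_clim = brier_score([clim] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / bs_clim

probs    = [0.9, 0.7, 0.2, 0.1, 0.6]   # hypothetical forecast probabilities
outcomes = [1,   1,   0,   0,   1]     # hypothetical observed outcomes
print(brier_score(probs, outcomes))
print(brier_skill_score(probs, outcomes))
```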
_________________ 
