User-oriented Evaluation
Research Strategies

Improvements in community resilience are ultimately achieved through improvements in all elements of an information chain that links accurate weather information, forecast provision in an appropriate form for a potential user, timely access to that information, understanding of the information, and ability to make use of it and respond appropriately to achieve socioeconomic benefits. Measuring the effectiveness and improvements in all links of the value chain will be important in the HIWeather project.  

Murphy (1993) defined three aspects of a “good” forecast: consistency (represents the forecaster’s best judgement), quality (accuracy and other measures of meteorological performance), and value (enables better decisions). Evaluation research in the HIWeather project will address new challenges in measuring quality and value, including the accuracy of weather and hazard predictions, the decision-making value of warnings, and the final benefit of forecast applications as measured in social, economic, and environmental terms. The closely related cross-cutting verification theme (Section 4.6) focuses on the practical aspects of applying available verification methodologies. 

Evaluation of High Impact Weather forecast quality must answer several types of questions, including: 

·   What is the nature (magnitude, bias, distribution) of the errors in the forecasts and how do the errors depend on the forecast itself? 

·   What improvements should be made to the forecasting system to improve its performance? 

·   To what degree are the forecasts more skilful than a naïve forecast like persistence, climatology or random chance? 

·   Which forecast performs better when more than one is considered? 

·   How do forecast errors propagate and compound through the seamless hazard prediction chain to affect the final intended user? 

While the first two questions may be of primary interest to modellers and other researchers who are developing forecasting systems, the other questions must also be addressed in order to demonstrate impact to stakeholders and justify investment in forecasting research and development. 
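
The comparison against a naïve forecast in the questions above is commonly quantified with a skill score: the fractional improvement of the forecast's error over the reference's error. A minimal sketch using a mean-squared-error skill score with persistence as the reference (sample data invented for illustration):

```python
# Sketch: MSE-based skill score against a naive reference forecast.
# Data values are illustrative, not from the HIWeather plan.

def mse(forecasts, observations):
    """Mean squared error between paired forecast and observed values."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(observations)

def skill_score(forecasts, reference, observations):
    """SS = 1 - MSE_forecast / MSE_reference.
    1 = perfect, 0 = no better than the reference, < 0 = worse."""
    return 1.0 - mse(forecasts, observations) / mse(reference, observations)

# Temperatures (degC): observations, a model forecast, and persistence
# (the previous day's value carried forward) as the naive reference.
obs         = [12.0, 14.5, 16.0, 15.2, 13.1]
model       = [11.5, 14.0, 16.4, 15.8, 12.6]
persistence = [10.0, 12.0, 15.0, 16.5, 14.0]

print(round(skill_score(model, persistence, obs), 3))
```

The same construction works with climatology or random chance as the reference; only the `reference` series changes.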

Standard verification approaches for medium range NWP have limited usefulness for very high resolution (mesoscale and convective scale) forecasts. Several new verification methods have been proposed for evaluating the spatial structures simulated by high resolution models and this remains an active area of research. While most of these spatial methods measure forecast quality, some of them (e.g., variograms) address the realism of the forecast, which may be of particular interest to modellers.  
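
One widely used neighbourhood method of this kind is the Fractions Skill Score (FSS; Roberts and Lean 2008), which compares event fractions over spatial windows rather than point matches. A small sketch on invented rain grids:

```python
# Sketch of the Fractions Skill Score (FSS), a neighbourhood spatial
# verification method. Grids and the rain threshold are illustrative.

def fractions(field, threshold, half_width):
    """Fraction of grid points exceeding `threshold` within a square
    neighbourhood of half-width `half_width` around each point."""
    ny, nx = len(field), len(field[0])
    out = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            count, total = 0, 0
            for jj in range(max(0, j - half_width), min(ny, j + half_width + 1)):
                for ii in range(max(0, i - half_width), min(nx, i + half_width + 1)):
                    total += 1
                    count += field[jj][ii] >= threshold
            out[j][i] = count / total
    return out

def fss(forecast, observed, threshold, half_width):
    """FSS = 1 - sum((Pf - Po)^2) / (sum(Pf^2) + sum(Po^2)); 1 is perfect."""
    pf = fractions(forecast, threshold, half_width)
    po = fractions(observed, threshold, half_width)
    num = sum((f - o) ** 2 for rf, ro in zip(pf, po) for f, o in zip(rf, ro))
    den = sum(f * f for row in pf for f in row) + sum(o * o for row in po for o in row)
    return 1.0 - num / den if den else 1.0

# A 4x4 rain field (mm/h): the forecast places the same feature one column
# to the right of the observations -- a total miss at grid scale, but
# partially credited at neighbourhood scale.
obs = [[0, 5, 0, 0], [0, 6, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
fc  = [[0, 0, 5, 0], [0, 0, 6, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

print(fss(fc, obs, 1.0, 0))  # grid scale: no overlap
print(fss(fc, obs, 1.0, 1))  # 3x3 neighbourhood: partial credit
```

Varying `half_width` reveals the smallest scale at which a high-resolution forecast becomes skilful, which a point-matching score cannot show.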

Spatial verification approaches are now starting to be applied to high resolution ensemble forecasts, but much remains to be done to understand what can be learned from these approaches, both in terms of quantifying ensemble performance, and in calibrating and postprocessing ensembles to improve forecast quality and utility. The utility of spatial verification for evaluating hazard impact forecasts (e.g., flood inundation, fire spread, blizzard extent and intensity, pollution cloud) needs to be explored, especially since graphical advice and warnings are becoming more common. 

Characterisation of timing errors is also very important, not only for model output but also for warnings where there are two additional free parameters, lead time and duration. Little work has been done to quantify timing errors, especially for graphical warnings, in spite of those products becoming increasingly common. Knowledge of the useful lead time for communicating the hazard and taking protective action is important for developing user confidence in the warnings.  

High impact weather often involves extreme values of wind, precipitation, or severe weather which are rare and/or difficult to observe. Some new “extremal dependency” metrics have been proposed for quantifying the accuracy of categorical forecasts for rare events, which are better able to discriminate between the performance of competing forecasts at the far end of the distribution. The utility of these scores for evaluating forecasts of high impact weather and its impacts requires further investigation.  
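
One such metric is the Symmetric Extremal Dependence Index (SEDI; Ferro and Stephenson 2011), computed from the hit rate and false alarm rate of a 2×2 contingency table; unlike many traditional scores, it does not degenerate to trivial values as the event base rate shrinks. A sketch with invented counts:

```python
# Sketch of the Symmetric Extremal Dependence Index (SEDI), one of the
# "extremal dependency" metrics for rare events. Counts are illustrative.
from math import log

def sedi(hits, misses, false_alarms, correct_negatives):
    """SEDI from a 2x2 contingency table. 1 = perfect discrimination,
    0 = no skill (hit rate equal to false alarm rate)."""
    h = hits / (hits + misses)                             # hit rate
    f = false_alarms / (false_alarms + correct_negatives)  # false alarm rate
    num = log(f) - log(h) - log(1 - f) + log(1 - h)
    den = log(f) + log(h) + log(1 - f) + log(1 - h)
    return num / den

# Rare event: 10 occurrences in 1000 cases.
print(round(sedi(hits=7, misses=3, false_alarms=20, correct_negatives=970), 3))
```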

Evaluation methodologies appropriate for particular high impact weather hazards need further development. For example, in the case of tropical cyclones, track and intensity verification has been done for many years but additional evaluations are needed for storm structure, precipitation, storm surge, landfall time/position/intensity, consistency, uncertainty, and additional information to assist forecasters (e.g., steering flow) and emergency managers. The predicted occurrence and evolution of cyclones at long lead times (genesis, false alarms and missed events) also requires further research.  
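
The basic ingredient of the track verification mentioned above is the great-circle separation between forecast and best-track cyclone centres. A minimal sketch using the haversine formula (positions invented for the example):

```python
# Sketch: great-circle track error (km) between a forecast tropical cyclone
# centre and the analysed (best-track) centre. Positions are hypothetical.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def track_error_km(lat_f, lon_f, lat_o, lon_o):
    """Haversine distance between forecast and observed storm centres."""
    phi1, phi2 = radians(lat_f), radians(lat_o)
    dphi = radians(lat_o - lat_f)
    dlam = radians(lon_o - lon_f)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# 24 h forecast centre vs analysed centre for a hypothetical storm.
print(round(track_error_km(21.5, 128.0, 21.9, 127.4), 1))
```

Aggregating such errors by lead time gives the familiar track-error growth curves; the structure, surge, and landfall attributes listed above need analogous, purpose-built measures.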

Methods for evaluating hazard impacts are even less mature. Users are interested in knowing the quality of downstream predictions for physically linked quantities like flood height, fire spread rate, visibility, road conditions, comfort indices, etc. (see Section 2.3). Metrics for evaluating these quantities need to be developed in close collaboration with key users, linked directly with their decision making needs. 

Observations of impacts are not routine and the meteorological community does not necessarily have access to the sources of observational data needed for evaluation. It will be necessary to partner closely with the relevant agencies and media to share data. It may be possible for the high impact weather community to influence governments and other stakeholders to introduce routine hazard monitoring. Enticing new sources of information, for example from webcams, social media, crowd-based observing networks like WoW and sensor networks external to national meteorological services, should be explored for their utility in evaluating hazard impact forecasts. Observational errors affect the ability to quantify the performance of high impact weather forecasts, especially in extreme environments where observations may be less reliable (e.g., wind-related under-catch of precipitation in gauges, attenuation of radar reflectivity in extreme rainfall). Strategies for accounting for observational error in verification are urgently needed. More robust approaches such as quantile verification or evaluating forecasts in “observations space” should be encouraged. This is also true for evaluating impact predictions, especially when the reliability of the data is less well understood and quality control methodologies are still being developed. 
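
Quantile verification, suggested above as a more observation-error-tolerant approach, scores a predicted quantile with the asymmetric pinball loss. A minimal sketch (quantile level and sample values invented):

```python
# Sketch of the quantile (pinball) score used in quantile verification.
# The tau level and sample gust values are illustrative.

def pinball_loss(observed, predicted_quantile, tau):
    """Quantile score for level tau (0 < tau < 1); lower is better.
    Penalises under-prediction by tau and over-prediction by (1 - tau)."""
    diff = observed - predicted_quantile
    return tau * diff if diff >= 0 else (tau - 1) * diff

def mean_pinball(obs, quantile_forecasts, tau):
    """Average pinball loss over a set of cases."""
    return sum(pinball_loss(o, q, tau) for o, q in zip(obs, quantile_forecasts)) / len(obs)

# 90th-percentile wind gust forecasts (m/s) vs observed gusts.
obs = [18.0, 25.0, 31.0, 22.0]
q90 = [24.0, 28.0, 29.0, 26.0]
print(round(mean_pinball(obs, q90, 0.9), 3))
```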

Forecast value and benefit are related to accuracy, but go further to measure the societal and economic advantage to users of using the forecasts and warnings in their decisions. These attributes are much harder to measure than the quality of the weather predictions, as users’ decisions are affected by the methods of communication used to convey the forecast (addressed in Section 3.5), their trust in the forecasts, their vulnerability and exposure to risk, and psychological and environmental factors influencing their interpretation and use of forecast and warning information. In addition, competing factors (e.g., economic considerations) may influence the users’ responses. A complication when verifying value is that observations cannot report what the impact of a hazard would have been in the absence of the forecast. Some of the most straightforward forecasts to evaluate are those where, for safety reasons, the hazard is largely mitigated and forecasts are used to reduce the mitigation cost; both aviation and winter road maintenance are close to this situation. Methods are needed to measure improvements in the protection of life and property in response to forecasts and warnings, where a “do nothing” baseline may not be available or desirable. 

An important goal is measuring the economic benefit that can potentially be gained in various sectors (industry, government, public, etc.) through improvements in the quality, timeliness and communication of forecasts of high impact weather and associated hazards. This goal is of interest to stakeholders in terms of reducing their costs and losses, and NMHSs to justify investment in improved services and supporting infrastructure like supercomputing and observation networks. Partnerships with stakeholders will be needed to link improvements in prediction with (sometimes confidential) knowledge of hazard-related costs and losses to assess economic benefit.    

A related question concerns individual sensitivity to forecast error, i.e. what errors are tolerable and acceptable (as defined by traditional verification) and how does this vary among sectors? For example, the interplay between hits, misses and false alarms is such that, in order to achieve a certain probability of detection, one may need to decide to warn at a fairly low probability, which introduces a high over-forecasting bias. Depending on the costs and losses associated with false alarms and misses, different warning thresholds will be optimal for different users. The simple cost/loss model used in verification metrics like Relative Economic Value is frequently criticized as overly simplistic, and more work needs to be done to apply more appropriate economic models when linking forecast accuracy with value. 
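
The static cost/loss model behind Relative Economic Value can be sketched in a few lines, and even this simple form shows how the optimal warning strategy depends on the user's cost/loss ratio. Rates below are invented for illustration:

```python
# Sketch of Relative Economic Value under the simple static cost/loss model
# (e.g., Richardson 2000). Hit/false alarm/base rates are illustrative.

def relative_economic_value(hit_rate, false_alarm_rate, base_rate, cost_loss_ratio):
    """REV = (E_climate - E_forecast) / (E_climate - E_perfect), with mean
    expenses expressed per unit loss L and alpha = C/L.
    1 = perfect, 0 = no better than the best fixed strategy, < 0 = worse."""
    alpha, s, h, f = cost_loss_ratio, base_rate, hit_rate, false_alarm_rate
    e_climate = min(alpha, s)      # always protect (cost alpha) or never (loss s)
    e_perfect = s * alpha          # protect exactly when the event occurs
    e_forecast = f * (1 - s) * alpha + h * s * alpha + (1 - h) * s
    return (e_climate - e_forecast) / (e_climate - e_perfect)

# One warning system (hit rate 0.8, false alarm rate 0.1, base rate 0.05)
# seen by two users with different cost/loss ratios.
for alpha in (0.02, 0.3):
    print(alpha, round(relative_economic_value(0.8, 0.1, 0.05, alpha), 3))
```

The same forecast has positive value for the low cost/loss user and negative value for the high cost/loss user, which is exactly why a single warning threshold cannot serve all sectors; richer economic models would replace the single `alpha` with user-specific cost structures.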

In the same way that evaluation underpins all of the research pillars in HIWeather, it is also a key unifying expertise across weather and climate. Close collaboration will be maintained with WCRP activity in this area through the Joint Working Group on Forecast Verification Research (JWGFVR).

