Abstract:The recently released suite of AI weather models can produce multi-day, medium-range forecasts within seconds, with a skill on par with state-of-the-art operational forecasts. Traditional AI model evaluation predominantly targets global scores on single levels. Specific prediction tasks, such as severe convective environments, require much more precision on a local scale and with the correct vertical gradients between levels. With a focus on the convective season of global hotspots in 2020, we assess the skill of three top-performing AI models (Pangu-Weather, GraphCast, FourCastNet) for Convective Available Potential Energy (CAPE) and Deep Layer Shear (DLS) at lead-times of up to 10 days against the ERA-5 reanalysis and the IFS operational numerical weather prediction model. Looking at the example of a US tornado outbreak on April 12 and 13, 2020, all models predict elevated CAPE and DLS values multiple days in advance. The spatial structures in the AI models are smoothed in comparison to IFS and ERA-5. The models show differing biases in the prediction of CAPE values, with GraphCast capturing the value distribution the most accurately and FourCastNet showing a consistent underestimation. In seasonal analyses around the globe, we generally see the highest performance by GraphCast and Pangu-Weather, which match or even exceed the performance of IFS. CAPE derived from vertically coarse pressure levels of neural weather models lacks the precision of the vertically fine resolution of numerical models. The promising results here indicate that a direct prediction of CAPE in AI models is likely to be skillful. This would open unprecedented opportunities for fast and inexpensive predictions of severe weather phenomena. By advancing the assessment of AI models towards process-based evaluations we lay the foundation for hazard-driven applications of AI-based weather forecasts.