Michael Threet

OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities

Feb 18, 2025

Michael Kouremetis, Marissa Dotter, Alex Byrne, Dan Martin, Ethan Michalak, Gianpaolo Russo, Michael Threet, Guido Zarrella

Abstract: The prospect of artificial intelligence (AI) competing in the adversarial landscape of cyber security has long been considered one of the most impactful, challenging, and potentially dangerous applications of AI. Here, we demonstrate a new approach to assessing AI's progress towards enabling and scaling the real-world offensive cyber operations (OCO) tactics used by modern threat actors. We detail OCCULT, a lightweight operational evaluation framework that allows cyber security experts to contribute to rigorous and repeatable measurement of the plausible cyber security risks associated with any given large language model (LLM) or AI employed for OCO. We also prototype and evaluate three very different OCO benchmarks for LLMs that demonstrate our approach and serve as examples for building benchmarks under the OCCULT framework. Finally, we provide preliminary evaluation results to demonstrate how this framework allows us to move beyond traditional all-or-nothing tests, such as those crafted from educational exercises like capture-the-flag environments, to contextualize our indicators and warnings in true cyber threat scenarios that present risks to modern infrastructure. We find a significant recent increase in the risk of AI being used to scale realistic cyber threats. For the first time, we find a model (DeepSeek-R1) that is capable of correctly answering over 90% of challenging offensive cyber knowledge tests in our Threat Actor Competency Test for LLMs (TACTL) multiple-choice benchmarks. We also show that Meta's Llama and Mistral's Mixtral model families achieve marked performance improvements over earlier models on our benchmarks in which LLMs act as offensive agents in MITRE's high-fidelity offensive and defensive cyber operations simulation environment, CyberLayer.

* 31 pages, 17 figures, 11 tables

Reinforcement Learning for Wildfire Mitigation in Simulated Disaster Environments

Nov 27, 2023

Alexander Tapley, Marissa Dotter, Michael Doyle, Aidan Fennelly, Dhanuj Gandikota, Savanna Smith, Michael Threet, Tim Welsh

Abstract: Climate change has resulted in a year-over-year increase in adverse weather conditions, which contribute to increasingly severe fire seasons. Without effective mitigation, these fires pose a threat to life, property, ecology, cultural heritage, and critical infrastructure. To better prepare for and react to the increasing threat of wildfires, more accurate fire modelers and mitigation responses are necessary. In this paper, we introduce SimFire, a versatile wildland fire projection simulator designed to generate realistic wildfire scenarios, and SimHarness, a modular agent-based machine learning wrapper capable of automatically generating land management strategies within SimFire to reduce the overall damage to the area. Together, this publicly available system enables researchers and practitioners to emulate and assess the effectiveness of firefighter interventions and to formulate strategic plans that prioritize value preservation and resource allocation optimization. The repositories are available for download at https://github.com/mitrefireline.

* 12 pages, 4 figures including Appendices (A, B). Accepted as a paper in the Proposals track at the "Tackling Climate Change with Machine Learning" workshop at NeurIPS 2023. MITRE Public Release Case Number 23-3920
