Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Keegan Hines

OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

May 29, 2025

Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh

Abstract:The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57\% over the strongest baseline in a multilingual setting, by 20.44\% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient ($\approx 120 \times$ faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.

Via

Access Paper or Ask Questions

Lessons From Red Teaming 100 Generative AI Products

Jan 13, 2025

Blake Bullwinkel, Amanda Minnich, Shiven Chawla, Gary Lopez, Martin Pouliot, Whitney Maxwell, Joris de Gruyter, Katherine Pratt, Saphir Qi, Nina Chikanov(+16 more)

Abstract:In recent years, AI red teaming has emerged as a practice for probing the safety and security of generative AI systems. Due to the nascency of the field, there are many open questions about how red teaming operations should be conducted. Based on our experience red teaming over 100 generative AI products at Microsoft, we present our internal threat model ontology and eight main lessons we have learned: 1. Understand what the system can do and where it is applied 2. You don't have to compute gradients to break an AI system 3. AI red teaming is not safety benchmarking 4. Automation can help cover more of the risk landscape 5. The human element of AI red teaming is crucial 6. Responsible AI harms are pervasive but difficult to measure 7. LLMs amplify existing security risks and introduce new ones 8. The work of securing AI systems will never be complete By sharing these insights alongside case studies from our operations, we offer practical recommendations aimed at aligning red teaming efforts with real world risks. We also highlight aspects of AI red teaming that we believe are often misunderstood and discuss open questions for the field to consider.

Via

Access Paper or Ask Questions

Defending Against Indirect Prompt Injection Attacks With Spotlighting

Mar 20, 2024

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, Emre Kiciman

Abstract:Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.

Via

Access Paper or Ask Questions

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Dec 21, 2023

Jingwei Yi, Yueqi Xie, Bin Zhu, Keegan Hines, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu

Abstract:Recent remarkable advancements in large language models (LLMs) have led to their widespread adoption in various applications. A key feature of these applications is the combination of LLMs with external content, where user instructions and third-party content are combined to create prompts for LLM processing. These applications, however, are vulnerable to indirect prompt injection attacks, where malicious instructions embedded within external content compromise LLM's output, causing their responses to deviate from user expectations. Despite the discovery of this security issue, no comprehensive analysis of indirect prompt injection attacks on different LLMs is available due to the lack of a benchmark. Furthermore, no effective defense has been proposed. In this work, we introduce the first benchmark, BIPIA, to measure the robustness of various LLMs and defenses against indirect prompt injection attacks. Our experiments reveal that LLMs with greater capabilities exhibit more vulnerable to indirect prompt injection attacks for text tasks, resulting in a higher ASR. We hypothesize that indirect prompt injection attacks are mainly due to the LLMs' inability to distinguish between instructions and external content. Based on this conjecture, we propose four black-box methods based on prompt learning and a white-box defense methods based on fine-tuning with adversarial training to enable LLMs to distinguish between instructions and external content and ignore instructions in the external content. Our experimental results show that our black-box defense methods can effectively reduce ASR but cannot completely thwart indirect prompt injection attacks, while our white-box defense method can reduce ASR to nearly zero with little adverse impact on the LLM's performance on general tasks. We hope that our benchmark and defenses can inspire future work in this important area.

Via

Access Paper or Ask Questions

Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective

Mar 23, 2023

Avi Schwarzschild, Max Cembalest, Karthik Rao, Keegan Hines, John Dickerson

Figure 1 for Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective

Figure 2 for Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective

Figure 3 for Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective

Figure 4 for Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective

Abstract:As neural networks increasingly make critical decisions in high-stakes settings, monitoring and explaining their behavior in an understandable and trustworthy manner is a necessity. One commonly used type of explainer is post hoc feature attribution, a family of methods for giving each feature in an input a score corresponding to its influence on a model's output. A major limitation of this family of explainers in practice is that they can disagree on which features are more important than others. Our contribution in this paper is a method of training models with this disagreement problem in mind. We do this by introducing a Post hoc Explainer Agreement Regularization (PEAR) loss term alongside the standard term corresponding to accuracy, an additional term that measures the difference in feature attribution between a pair of explainers. We observe on three datasets that we can train a model with this loss term to improve explanation consensus on unseen data, and see improved consensus between explainers other than those used in the loss term. We examine the trade-off between improved consensus and model performance. And finally, we study the influence our method has on feature attribution explanations.

Via

Access Paper or Ask Questions

Achieving Downstream Fairness with Geometric Repair

Mar 14, 2022

Kweku Kwegyir-Aggrey, Jessica Dai, John Dickerson, Keegan Hines

Figure 1 for Achieving Downstream Fairness with Geometric Repair

Figure 2 for Achieving Downstream Fairness with Geometric Repair

Figure 3 for Achieving Downstream Fairness with Geometric Repair

Figure 4 for Achieving Downstream Fairness with Geometric Repair

Abstract:Consider a scenario where some upstream model developer must train a fair model, but is unaware of the fairness requirements of a downstream model user or stakeholder. In the context of fair classification, we present a technique that specifically addresses this setting, by post-processing a regressor's scores such they yield fair classifications for any downstream choice in decision threshold. To begin, we leverage ideas from optimal transport to show how this can be achieved for binary protected groups across a broad class of fairness metrics. Then, we extend our approach to address the setting where a protected attribute takes on multiple values, by re-recasting our technique as a convex optimization problem that leverages lexicographic fairness.

Via

Access Paper or Ask Questions

Counterfactual Explanations for Machine Learning: Challenges Revisited

Jun 14, 2021

Sahil Verma, John Dickerson, Keegan Hines

Figure 1 for Counterfactual Explanations for Machine Learning: Challenges Revisited

Abstract:Counterfactual explanations (CFEs) are an emerging technique under the umbrella of interpretability of machine learning (ML) models. They provide ``what if'' feedback of the form ``if an input datapoint were $x'$ instead of $x$, then an ML model's output would be $y'$ instead of $y$.'' Counterfactual explainability for ML models has yet to see widespread adoption in industry. In this short paper, we posit reasons for this slow uptake. Leveraging recent work outlining desirable properties of CFEs and our experience running the ML wing of a model monitoring startup, we identify outstanding obstacles hindering CFE deployment in industry.

* Presented at CHI HCXAI 2021 workshop

Via

Access Paper or Ask Questions

Amortized Generation of Sequential Counterfactual Explanations for Black-box Models

Jun 07, 2021

Sahil Verma, Keegan Hines, John P. Dickerson

Figure 1 for Amortized Generation of Sequential Counterfactual Explanations for Black-box Models

Figure 2 for Amortized Generation of Sequential Counterfactual Explanations for Black-box Models

Figure 3 for Amortized Generation of Sequential Counterfactual Explanations for Black-box Models

Figure 4 for Amortized Generation of Sequential Counterfactual Explanations for Black-box Models

Abstract:Explainable machine learning (ML) has gained traction in recent years due to the increasing adoption of ML-based systems in many sectors. Counterfactual explanations (CFEs) provide ``what if'' feedback of the form ``if an input datapoint were $x'$ instead of $x$, then an ML-based system's output would be $y'$ instead of $y$.'' CFEs are attractive due to their actionable feedback, amenability to existing legal frameworks, and fidelity to the underlying ML model. Yet, current CFE approaches are single shot -- that is, they assume $x$ can change to $x'$ in a single time period. We propose a novel stochastic-control-based approach that generates sequential CFEs, that is, CFEs that allow $x$ to move stochastically and sequentially across intermediate states to a final state $x'$. Our approach is model agnostic and black box. Furthermore, calculation of CFEs is amortized such that once trained, it applies to multiple datapoints without the need for re-optimization. In addition to these primary characteristics, our approach admits optional desiderata such as adherence to the data manifold, respect for causal relations, and sparsity -- identified by past research as desirable properties of CFEs. We evaluate our approach using three real-world datasets and show successful generation of sequential CFEs that respect other counterfactual desiderata.

* 19 pages, 3 figures, 4 tables

Via

Access Paper or Ask Questions

Counterfactual Explanations for Machine Learning: A Review

Oct 20, 2020

Sahil Verma, John Dickerson, Keegan Hines

Figure 1 for Counterfactual Explanations for Machine Learning: A Review

Figure 2 for Counterfactual Explanations for Machine Learning: A Review

Abstract:Machine learning plays a role in many deployed decision systems, often in ways that are difficult or impossible to understand by human stakeholders. Explaining, in a human-understandable way, the relationship between the input and output of machine learning models is essential to the development of trustworthy machine-learning-based systems. A burgeoning body of research seeks to define the goals and methods of explainability in machine learning. In this paper, we seek to review and categorize research on counterfactual explanations, a specific class of explanation that provides a link between what could have happened had input to a model been changed in a particular way. Modern approaches to counterfactual explainability in machine learning draw connections to the established legal doctrine in many countries, making them appealing to fielded systems in high-impact areas such as finance and healthcare. Thus, we design a rubric with desirable properties of counterfactual explanation algorithms and comprehensively evaluate all currently-proposed algorithms against that rubric. Our rubric provides easy comparison and comprehension of the advantages and disadvantages of different approaches and serves as an introduction to major research themes in this field. We also identify gaps and discuss promising research directions in the space of counterfactual explainability.

* 10 pages

Via

Access Paper or Ask Questions

Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools

Sep 03, 2019

Anh Truong, Austin Walters, Jeremy Goodsitt, Keegan Hines, C. Bayan Bruss, Reza Farivar

Figure 1 for Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools

Figure 2 for Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools

Figure 3 for Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools

Figure 4 for Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools

Abstract:There has been considerable growth and interest in industrial applications of machine learning (ML) in recent years. ML engineers, as a consequence, are in high demand across the industry, yet improving the efficiency of ML engineers remains a fundamental challenge. Automated machine learning (AutoML) has emerged as a way to save time and effort on repetitive tasks in ML pipelines, such as data pre-processing, feature engineering, model selection, hyperparameter optimization, and prediction result analysis. In this paper, we investigate the current state of AutoML tools aiming to automate these tasks. We conduct various evaluations of the tools on many datasets, in different data segments, to examine their performance, and compare their advantages and disadvantages on different test cases.

Via

Access Paper or Ask Questions