Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hanxun Huang

X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP

May 08, 2025

Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey

Abstract:As Contrastive Language-Image Pre-training (CLIP) models are increasingly adopted for diverse downstream tasks and integrated into large vision-language models (VLMs), their susceptibility to adversarial perturbations has emerged as a critical concern. In this work, we introduce \textbf{X-Transfer}, a novel attack method that exposes a universal adversarial vulnerability in CLIP. X-Transfer generates a Universal Adversarial Perturbation (UAP) capable of deceiving various CLIP encoders and downstream VLMs across different samples, tasks, and domains. We refer to this property as \textbf{super transferability}--a single perturbation achieving cross-data, cross-domain, cross-model, and cross-task adversarial transferability simultaneously. This is achieved through \textbf{surrogate scaling}, a key innovation of our approach. Unlike existing methods that rely on fixed surrogate models, which are computationally intensive to scale, X-Transfer employs an efficient surrogate scaling strategy that dynamically selects a small subset of suitable surrogates from a large search space. Extensive evaluations demonstrate that X-Transfer significantly outperforms previous state-of-the-art UAP methods, establishing a new benchmark for adversarial transferability across CLIP models. The code is publicly available in our \href{https://github.com/HanxunH/XTransferBench}{GitHub repository}.

* ICML 2025

Via

Access Paper or Ask Questions

CURVALID: Geometrically-guided Adversarial Prompt Detection

Mar 05, 2025

Canaan Yung, Hanxun Huang, Sarah Monazam Erfani, Christopher Leckie

Abstract:Adversarial prompts capable of jailbreaking large language models (LLMs) and inducing undesirable behaviours pose a significant obstacle to their safe deployment. Current mitigation strategies rely on activating built-in defence mechanisms or fine-tuning the LLMs, but the fundamental distinctions between adversarial and benign prompts are yet to be understood. In this work, we introduce CurvaLID, a novel defense framework that efficiently detects adversarial prompts by leveraging their geometric properties. It is agnostic to the type of LLM, offering a unified detection framework across diverse adversarial prompts and LLM architectures. CurvaLID builds on the geometric analysis of text prompts to uncover their underlying differences. We theoretically extend the concept of curvature via the Whewell equation into an $n$-dimensional word embedding space, enabling us to quantify local geometric properties, including semantic shifts and curvature in the underlying manifolds. Additionally, we employ Local Intrinsic Dimensionality (LID) to capture geometric features of text prompts within adversarial subspaces. Our findings reveal that adversarial prompts differ fundamentally from benign prompts in terms of their geometric characteristics. Our results demonstrate that CurvaLID delivers superior detection and rejection of adversarial queries, paving the way for safer LLM deployment. The source code can be found at https://github.com/Cancanxxx/CurvaLID

* 29 Pages, 5 figues

Via

Access Paper or Ask Questions

Detecting Backdoor Samples in Contrastive Language Image Pretraining

Feb 03, 2025

Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey

Abstract:Contrastive language-image pretraining (CLIP) has been found to be vulnerable to poisoning backdoor attacks where the adversary can achieve an almost perfect attack success rate on CLIP models by poisoning only 0.01\% of the training dataset. This raises security concerns on the current practice of pretraining large-scale models on unscrutinized web data using CLIP. In this work, we analyze the representations of backdoor-poisoned samples learned by CLIP models and find that they exhibit unique characteristics in their local subspace, i.e., their local neighborhoods are far more sparse than that of clean samples. Based on this finding, we conduct a systematic study on detecting CLIP backdoor attacks and show that these attacks can be easily and efficiently detected by traditional density ratio-based local outlier detectors, whereas existing backdoor sample detection methods fail. Our experiments also reveal that an unintentional backdoor already exists in the original CC3M dataset and has been trained into a popular open-source model released by OpenCLIP. Based on our detector, one can clean up a million-scale web dataset (e.g., CC3M) efficiently within 15 minutes using 4 Nvidia A100 GPUs. The code is publicly available in our \href{https://github.com/HanxunH/Detect-CLIP-Backdoor-Samples}{GitHub repository}.

* ICLR2025

Via

Access Paper or Ask Questions

Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks

Nov 20, 2024

Yong Xie, Weijie Zheng, Hanxun Huang, Guangnan Ye, Xingjun Ma

Figure 1 for Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks

Figure 2 for Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks

Figure 3 for Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks

Figure 4 for Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks

Abstract:As deep learning models are increasingly deployed in safety-critical applications, evaluating their vulnerabilities to adversarial perturbations is essential for ensuring their reliability and trustworthiness. Over the past decade, a large number of white-box adversarial robustness evaluation methods (i.e., attacks) have been proposed, ranging from single-step to multi-step methods and from individual to ensemble methods. Despite these advances, challenges remain in conducting meaningful and comprehensive robustness evaluations, particularly when it comes to large-scale testing and ensuring evaluations reflect real-world adversarial risks. In this work, we focus on image classification models and propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space. We analyze the relationship between PMA and existing cross-entropy or logits-margin-based attacks, and show that PMA can outperform the current state-of-the-art individual methods. Building on PMA, we propose two types of ensemble attacks that balance effectiveness and efficiency. Furthermore, we create a million-scale dataset, CC1M, derived from the existing CC3M dataset, and use it to conduct the first million-scale white-box adversarial robustness evaluation of adversarially-trained ImageNet models. Our findings provide valuable insights into the robustness gaps between individual versus ensemble attacks and small-scale versus million-scale evaluations.

Via

Access Paper or Ask Questions

Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models

Oct 25, 2024

Yige Li, Hanxun Huang, Jiaming Zhang, Xingjun Ma, Yu-Gang Jiang

Figure 1 for Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models

Figure 2 for Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models

Figure 3 for Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models

Figure 4 for Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models

Abstract:Backdoor attacks covertly implant triggers into deep neural networks (DNNs) by poisoning a small portion of the training data with pre-designed backdoor triggers. This vulnerability is exacerbated in the era of large models, where extensive (pre-)training on web-crawled datasets is susceptible to compromise. In this paper, we introduce a novel two-step defense framework named Expose Before You Defend (EBYD). EBYD unifies existing backdoor defense methods into a comprehensive defense system with enhanced performance. Specifically, EBYD first exposes the backdoor functionality in the backdoored model through a model preprocessing step called backdoor exposure, and then applies detection and removal methods to the exposed model to identify and eliminate the backdoor features. In the first step of backdoor exposure, we propose a novel technique called Clean Unlearning (CUL), which proactively unlearns clean features from the backdoored model to reveal the hidden backdoor features. We also explore various model editing/modification techniques for backdoor exposure, including fine-tuning, model sparsification, and weight perturbation. Using EBYD, we conduct extensive experiments on 10 image attacks and 6 text attacks across 2 vision datasets (CIFAR-10 and an ImageNet subset) and 4 language datasets (SST-2, IMDB, Twitter, and AG's News). The results demonstrate the importance of backdoor exposure for backdoor defense, showing that the exposed models can significantly benefit a range of downstream defense tasks, including backdoor label detection, backdoor trigger recovery, backdoor model detection, and backdoor removal. We hope our work could inspire more research in developing advanced defense frameworks with exposed models. Our code is available at: https://github.com/bboylyg/Expose-Before-You-Defend.

* 19 pages

Via

Access Paper or Ask Questions

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Aug 23, 2024

Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Jun Sun

Figure 1 for BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Figure 2 for BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Figure 3 for BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Figure 4 for BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Abstract:Generative Large Language Models (LLMs) have made significant strides across various tasks, but they remain vulnerable to backdoor attacks, where specific triggers in the prompt cause the LLM to generate adversary-desired responses. While most backdoor research has focused on vision or text classification tasks, backdoor attacks in text generation have been largely overlooked. In this work, we introduce \textit{BackdoorLLM}, the first comprehensive benchmark for studying backdoor attacks on LLMs. \textit{BackdoorLLM} features: 1) a repository of backdoor benchmarks with a standardized training pipeline, 2) diverse attack strategies, including data poisoning, weight poisoning, hidden state attacks, and chain-of-thought attacks, 3) extensive evaluations with over 200 experiments on 8 attacks across 7 scenarios and 6 model architectures, and 4) key insights into the effectiveness and limitations of backdoors in LLMs. We hope \textit{BackdoorLLM} will raise awareness of backdoor threats and contribute to advancing AI safety. The code is available at \url{https://github.com/bboylyg/BackdoorLLM}.

Via

Access Paper or Ask Questions

Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers

Aug 03, 2024

Weijie Zheng, Xingjun Ma, Hanxun Huang, Zuxuan Wu, Yu-Gang Jiang

Figure 1 for Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers

Figure 2 for Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers

Figure 3 for Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers

Figure 4 for Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers

Abstract:With the advancement of vision transformers (ViTs) and self-supervised learning (SSL) techniques, pre-trained large ViTs have become the new foundation models for computer vision applications. However, studies have shown that, like convolutional neural networks (CNNs), ViTs are also susceptible to adversarial attacks, where subtle perturbations in the input can fool the model into making false predictions. This paper studies the transferability of such an adversarial vulnerability from a pre-trained ViT model to downstream tasks. We focus on \emph{sample-wise} transfer attacks and propose a novel attack method termed \emph{Downstream Transfer Attack (DTA)}. For a given test image, DTA leverages a pre-trained ViT model to craft the adversarial example and then applies the adversarial example to attack a fine-tuned version of the model on a downstream dataset. During the attack, DTA identifies and exploits the most vulnerable layers of the pre-trained model guided by a cosine similarity loss to craft highly transferable attacks. Through extensive experiments with pre-trained ViTs by 3 distinct pre-training methods, 3 fine-tuning schemes, and across 10 diverse downstream datasets, we show that DTA achieves an average attack success rate (ASR) exceeding 90\%, surpassing existing methods by a huge margin. When used with adversarial training, the adversarial examples generated by our DTA can significantly improve the model's robustness to different downstream transfer attacks.

Via

Access Paper or Ask Questions

Multi-Trigger Backdoor Attacks: More Triggers, More Threats

Jan 27, 2024

Yige Li, Xingjun Ma, Jiabo He, Hanxun Huang, Yu-Gang Jiang

Figure 1 for Multi-Trigger Backdoor Attacks: More Triggers, More Threats

Figure 2 for Multi-Trigger Backdoor Attacks: More Triggers, More Threats

Figure 3 for Multi-Trigger Backdoor Attacks: More Triggers, More Threats

Figure 4 for Multi-Trigger Backdoor Attacks: More Triggers, More Threats

Abstract:Backdoor attacks have emerged as a primary threat to (pre-)training and deployment of deep neural networks (DNNs). While backdoor attacks have been extensively studied in a body of works, most of them were focused on single-trigger attacks that poison a dataset using a single type of trigger. Arguably, real-world backdoor attacks can be much more complex, e.g., the existence of multiple adversaries for the same dataset if it is of high value. In this work, we investigate the practical threat of backdoor attacks under the setting of \textbf{multi-trigger attacks} where multiple adversaries leverage different types of triggers to poison the same dataset. By proposing and investigating three types of multi-trigger attacks, including parallel, sequential, and hybrid attacks, we provide a set of important understandings of the coexisting, overwriting, and cross-activating effects between different triggers on the same dataset. Moreover, we show that single-trigger attacks tend to cause overly optimistic views of the security of current defense techniques, as all examined defense methods struggle to defend against multi-trigger attacks. Finally, we create a multi-trigger backdoor poisoning dataset to help future evaluation of backdoor attacks and defenses. Although our work is purely empirical, we hope it can help steer backdoor research toward more realistic settings.

Via

Access Paper or Ask Questions

LDReg: Local Dimensionality Regularized Self-Supervised Learning

Jan 19, 2024

Hanxun Huang, Ricardo J. G. B. Campello, Sarah Monazam Erfani, Xingjun Ma, Michael E. Houle, James Bailey

Figure 1 for LDReg: Local Dimensionality Regularized Self-Supervised Learning

Figure 2 for LDReg: Local Dimensionality Regularized Self-Supervised Learning

Figure 3 for LDReg: Local Dimensionality Regularized Self-Supervised Learning

Figure 4 for LDReg: Local Dimensionality Regularized Self-Supervised Learning

Abstract:Representations learned via self-supervised learning (SSL) can be susceptible to dimensional collapse, where the learned representation subspace is of extremely low dimensionality and thus fails to represent the full data distribution and modalities. Dimensional collapse also known as the "underfilling" phenomenon is one of the major causes of degraded performance on downstream tasks. Previous work has investigated the dimensional collapse problem of SSL at a global level. In this paper, we demonstrate that representations can span over high dimensional space globally, but collapse locally. To address this, we propose a method called $\textit{local dimensionality regularization (LDReg)}$. Our formulation is based on the derivation of the Fisher-Rao metric to compare and optimize local distance distributions at an asymptotically small radius for each data point. By increasing the local intrinsic dimensionality, we demonstrate through a range of experiments that LDReg improves the representation quality of SSL. The results also show that LDReg can regularize dimensionality at both local and global levels.

* ICLR 2024

Via

Access Paper or Ask Questions

Distilling Cognitive Backdoor Patterns within an Image

Jan 26, 2023

Hanxun Huang, Xingjun Ma, Sarah Erfani, James Bailey

Abstract:This paper proposes a simple method to distill and detect backdoor patterns within an image: \emph{Cognitive Distillation} (CD). The idea is to extract the "minimal essence" from an input image responsible for the model's prediction. CD optimizes an input mask to extract a small pattern from the input image that can lead to the same model output (i.e., logits or deep features). The extracted pattern can help understand the cognitive mechanism of a model on clean vs. backdoor images and is thus called a \emph{Cognitive Pattern} (CP). Using CD and the distilled CPs, we uncover an interesting phenomenon of backdoor attacks: despite the various forms and sizes of trigger patterns used by different attacks, the CPs of backdoor samples are all surprisingly and suspiciously small. One thus can leverage the learned mask to detect and remove backdoor examples from poisoned training datasets. We conduct extensive experiments to show that CD can robustly detect a wide range of advanced backdoor attacks. We also show that CD can potentially be applied to help detect potential biases from face datasets. Code is available at \url{https://github.com/HanxunH/CognitiveDistillation}.

* ICLR2023

Via

Access Paper or Ask Questions