Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuanpu Cao

Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Sep 15, 2025

Yifan Lan, Yuanpu Cao, Weitong Zhang, Lu Lin, Jinghui Chen

Figure 1 for Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Figure 2 for Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Figure 3 for Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Figure 4 for Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Abstract:Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating the MLLM response preferences using a preference hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.

Via

Access Paper or Ask Questions

GuardDoor: Safeguarding Against Malicious Diffusion Editing via Protective Backdoors

Mar 05, 2025

Yaopei Zeng, Yuanpu Cao, Lu Lin

Figure 1 for GuardDoor: Safeguarding Against Malicious Diffusion Editing via Protective Backdoors

Figure 2 for GuardDoor: Safeguarding Against Malicious Diffusion Editing via Protective Backdoors

Figure 3 for GuardDoor: Safeguarding Against Malicious Diffusion Editing via Protective Backdoors

Figure 4 for GuardDoor: Safeguarding Against Malicious Diffusion Editing via Protective Backdoors

Abstract:The growing accessibility of diffusion models has revolutionized image editing but also raised significant concerns about unauthorized modifications, such as misinformation and plagiarism. Existing countermeasures largely rely on adversarial perturbations designed to disrupt diffusion model outputs. However, these approaches are found to be easily neutralized by simple image preprocessing techniques, such as compression and noise addition. To address this limitation, we propose GuardDoor, a novel and robust protection mechanism that fosters collaboration between image owners and model providers. Specifically, the model provider participating in the mechanism fine-tunes the image encoder to embed a protective backdoor, allowing image owners to request the attachment of imperceptible triggers to their images. When unauthorized users attempt to edit these protected images with this diffusion model, the model produces meaningless outputs, reducing the risk of malicious image editing. Our method demonstrates enhanced robustness against image preprocessing operations and is scalable for large-scale deployment. This work underscores the potential of cooperative frameworks between model providers and image owners to safeguard digital content in the era of generative AI.

Via

Access Paper or Ask Questions

TruthFlow: Truthful LLM Generation via Representation Flow Correction

Feb 06, 2025

Hanyu Wang, Bochuan Cao, Yuanpu Cao, Jinghui Chen

Figure 1 for TruthFlow: Truthful LLM Generation via Representation Flow Correction

Figure 2 for TruthFlow: Truthful LLM Generation via Representation Flow Correction

Figure 3 for TruthFlow: Truthful LLM Generation via Representation Flow Correction

Figure 4 for TruthFlow: Truthful LLM Generation via Representation Flow Correction

Abstract:Large language models (LLMs) are known to struggle with consistently generating truthful responses. While various representation intervention techniques have been proposed, these methods typically apply a universal representation correction vector to all input queries, limiting their effectiveness against diverse queries in practice. In this study, we introduce TruthFlow, a novel method that leverages the Flow Matching technique for query-specific truthful representation correction. Specifically, TruthFlow first uses a flow model to learn query-specific correction vectors that transition representations from hallucinated to truthful states. Then, during inference, the trained flow model generates these correction vectors to enhance the truthfulness of LLM outputs. Experimental results demonstrate that TruthFlow significantly improves performance on open-ended generation tasks across various advanced LLMs evaluated on TruthfulQA. Moreover, the trained TruthFlow model exhibits strong transferability, performing effectively on other unseen hallucination benchmarks.

Via

Access Paper or Ask Questions

AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models

Oct 28, 2024

Yaopei Zeng, Yuanpu Cao, Bochuan Cao, Yurui Chang, Jinghui Chen, Lu Lin

Figure 1 for AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models

Figure 2 for AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models

Figure 3 for AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models

Figure 4 for AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models

Abstract:Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models.

Via

Access Paper or Ask Questions

Adversarially Robust Industrial Anomaly Detection Through Diffusion Model

Aug 09, 2024

Yuanpu Cao, Lu Lin, Jinghui Chen

Figure 1 for Adversarially Robust Industrial Anomaly Detection Through Diffusion Model

Figure 2 for Adversarially Robust Industrial Anomaly Detection Through Diffusion Model

Figure 3 for Adversarially Robust Industrial Anomaly Detection Through Diffusion Model

Figure 4 for Adversarially Robust Industrial Anomaly Detection Through Diffusion Model

Abstract:Deep learning-based industrial anomaly detection models have achieved remarkably high accuracy on commonly used benchmark datasets. However, the robustness of those models may not be satisfactory due to the existence of adversarial examples, which pose significant threats to the practical deployment of deep anomaly detectors. Recently, it has been shown that diffusion models can be used to purify the adversarial noises and thus build a robust classifier against adversarial attacks. Unfortunately, we found that naively applying this strategy in anomaly detection (i.e., placing a purifier before an anomaly detector) will suffer from a high anomaly miss rate since the purifying process can easily remove both the anomaly signal and the adversarial perturbations, causing the later anomaly detector failed to detect anomalies. To tackle this issue, we explore the possibility of performing anomaly detection and adversarial purification simultaneously. We propose a simple yet effective adversarially robust anomaly detection method, \textit{AdvRAD}, that allows the diffusion model to act both as an anomaly detector and adversarial purifier. We also extend our proposed method for certified robustness to $l_2$ norm bounded perturbations. Through extensive experiments, we show that our proposed method exhibits outstanding (certified) adversarial robustness while also maintaining equally strong anomaly detection performance on par with the state-of-the-art methods on industrial anomaly detection benchmark datasets.

Via

Access Paper or Ask Questions

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

May 28, 2024

Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, Jinghui Chen

Figure 1 for Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

Figure 2 for Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

Figure 3 for Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

Figure 4 for Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

Abstract:Researchers have been studying approaches to steer the behavior of Large Language Models (LLMs) and build personalized LLMs tailored for various applications. While fine-tuning seems to be a direct solution, it requires substantial computational resources and may significantly affect the utility of the original LLM. Recent endeavors have introduced more lightweight strategies, focusing on extracting "steering vectors" to guide the model's output toward desired behaviors by adjusting activations within specific layers of the LLM's transformer architecture. However, such steering vectors are directly extracted from the activations of human preference data and thus often lead to suboptimal results and occasional failures, especially in alignment-related scenarios. This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization. Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs, thereby offering a more precise representation of the target behavior. By carefully adjusting the direction and magnitude of the steering vector, we enabled personalized control over the desired behavior across a spectrum of intensities. Extensive experimentation across various open-ended generation tasks, particularly focusing on steering AI personas, has validated the efficacy of our approach. Moreover, we comprehensively investigate critical alignment-concerning scenarios, such as managing truthfulness, mitigating hallucination, and addressing jailbreaking attacks. Remarkably, our method can still demonstrate outstanding steering effectiveness across these scenarios. Furthermore, we showcase the transferability of our steering vectors across different models/LoRAs and highlight the synergistic benefits of applying multiple vectors simultaneously.

Via

Access Paper or Ask Questions

WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response

May 22, 2024

Tianrong Zhang, Bochuan Cao, Yuanpu Cao, Lu Lin, Prasenjit Mitra, Jinghui Chen

Figure 1 for WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response

Figure 2 for WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response

Figure 3 for WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response

Figure 4 for WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response

Abstract:The recent breakthrough in large language models (LLMs) such as ChatGPT has revolutionized production processes at an unprecedented pace. Alongside this progress also comes mounting concerns about LLMs' susceptibility to jailbreaking attacks, which leads to the generation of harmful or unsafe content. While safety alignment measures have been implemented in LLMs to mitigate existing jailbreak attempts and force them to become increasingly complicated, it is still far from perfect. In this paper, we analyze the common pattern of the current safety alignment and show that it is possible to exploit such patterns for jailbreaking attacks by simultaneous obfuscation in queries and responses. Specifically, we propose WordGame attack, which replaces malicious words with word games to break down the adversarial intent of a query and encourage benign content regarding the games to precede the anticipated harmful content in the response, creating a context that is hardly covered by any corpus used for safety alignment. Extensive experiments demonstrate that WordGame attack can break the guardrails of the current leading proprietary and open-source LLMs, including the latest Claude-3, GPT-4, and Llama-3 models. Further ablation studies on such simultaneous obfuscation in query and response provide evidence of the merits of the attack strategy beyond an individual attack.

Via

Access Paper or Ask Questions

Federated Learning with Projected Trajectory Regularization

Dec 22, 2023

Tiejin Chen, Yuanpu Cao, Yujia Wang, Cho-Jui Hsieh, Jinghui Chen

Abstract:Federated learning enables joint training of machine learning models from distributed clients without sharing their local data. One key challenge in federated learning is to handle non-identically distributed data across the clients, which leads to deteriorated model training performances. Prior works in this line of research mainly focus on utilizing last-step global model parameters/gradients or the linear combinations of the past model parameters/gradients, which do not fully exploit the potential of global information from the model training trajectory. In this paper, we propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling the data heterogeneity issue, which proposes a unique way to better extract the essential global information from the model training trajectory. Specifically, FedPTR allows local clients or the server to optimize an auxiliary (synthetic) dataset that mimics the learning dynamics of the recent model update and utilizes it to project the next-step model trajectory for local training regularization. We conduct rigorous theoretical analysis for our proposed framework under nonconvex stochastic settings to verify its fast convergence under heterogeneous data distributions. Experiments on various benchmark datasets and non-i.i.d. settings validate the effectiveness of our proposed framework.

* 9 pages

Via

Access Paper or Ask Questions

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Nov 15, 2023

Yuanpu Cao, Bochuan Cao, Jinghui Chen

Figure 1 for Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Figure 2 for Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Figure 3 for Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Figure 4 for Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Abstract:Recent developments in Large Language Models (LLMs) have manifested significant advancements. To facilitate safeguards against malicious exploitation, a body of research has concentrated on aligning LLMs with human preferences and inhibiting their generation of inappropriate content. Unfortunately, such alignments are often vulnerable: fine-tuning with a minimal amount of harmful data can easily unalign the target LLM. While being effective, such fine-tuning-based unalignment approaches also have their own limitations: (1) non-stealthiness, after fine-tuning, safety audits or red-teaming can easily expose the potential weaknesses of the unaligned models, thereby precluding their release/use. (2) non-persistence, the unaligned LLMs can be easily repaired through re-alignment, i.e., fine-tuning again with aligned data points. In this work, we show that it is possible to conduct stealthy and persistent unalignment on large language models via backdoor injections. We also provide a novel understanding on the relationship between the backdoor persistence and the activation pattern and further provide guidelines for potential trigger design. Through extensive experiments, we demonstrate that our proposed stealthy and persistent unalignment can successfully pass the safety evaluation while maintaining strong persistence against re-alignment defense.

Via

Access Paper or Ask Questions

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

Sep 18, 2023

Bochuan Cao, Yuanpu Cao, Lu Lin, Jinghui Chen

Abstract:Recently, Large Language Models (LLMs) have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content. Though a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be directly constructed upon an existing aligned LLM with a robust alignment checking function, without requiring any expensive retraining or fine-tuning process of the original LLM. Furthermore, we also provide a theoretical analysis for RA-LLM to verify its effectiveness in defending against alignment-breaking attacks. Through real-world experiments on open-source large language models, we demonstrate that RA-LLM can successfully defend against both state-of-the-art adversarial prompts and popular handcrafted jailbreaking prompts by reducing their attack success rates from nearly 100\% to around 10\% or less.

* 16 Pages, 5 Figures, 3 Tables

Via

Access Paper or Ask Questions