Handing Wang

Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models

May 22, 2025

Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin

Abstract:Multimodal large language models (MLLMs) enable powerful cross-modal reasoning capabilities. However, the expanded input space introduces new attack surfaces. Previous jailbreak attacks often inject malicious instructions from text into less aligned modalities, such as vision. As MLLMs increasingly incorporate cross-modal consistency and alignment mechanisms, such explicit attacks become easier to detect and block. In this work, we propose a novel implicit jailbreak framework termed IJA that stealthily embeds malicious instructions into images via least significant bit steganography and couples them with seemingly benign, image-related textual prompts. To further enhance attack effectiveness across diverse MLLMs, we incorporate adversarial suffixes generated by a surrogate model and introduce a template optimization module that iteratively refines both the prompt and embedding based on model feedback. On commercial models like GPT-4o and Gemini-1.5 Pro, our method achieves attack success rates of over 90% using an average of only 3 queries.

One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models

May 12, 2025

Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin

Abstract:Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. However, they remain vulnerable to jailbreak attacks, which manipulate the models into generating harmful responses despite safety alignment. Recent studies have shown that current safety-aligned LLMs often exhibit shallow safety alignment, where the first few tokens largely determine whether the response will be harmful. Through comprehensive observations, we find that safety-aligned LLMs and various defense strategies generate highly similar initial tokens in their refusal responses, which we define as safety trigger tokens. Building on this insight, we propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes the safety trigger tokens of a given safety-aligned LLM to trigger the model's learned safety patterns. In this process, the safety trigger is constrained to a single token, which preserves model usability by introducing minimal intervention in the decoding process. Extensive experiments across diverse jailbreak attacks and benign prompts demonstrate that D-STT significantly reduces output harmfulness while preserving model usability and incurring negligible response time overhead, outperforming ten baseline methods.
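
The two ideas in this abstract (find a shared first refusal token, then force only that one token before normal decoding) can be illustrated with a minimal sketch. It assumes tokenized refusal responses are already available and that a generic `next_token` callable stands in for the aligned LLM's greedy decoding step; the names `find_trigger_token` and `decode_with_trigger` and the "most frequent first token" rule are illustrative assumptions, not the D-STT implementation.

```python
from collections import Counter
from typing import Callable, List


def find_trigger_token(refusals: List[List[str]]) -> str:
    """Return the most common first token across tokenized refusal responses."""
    counts = Counter(tokens[0] for tokens in refusals if tokens)
    token, _ = counts.most_common(1)[0]
    return token


def decode_with_trigger(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    trigger: str,
    max_new_tokens: int = 32,
    eos: str = "<eos>",
) -> List[str]:
    """Force a single trigger token, then let the model decode as usual."""
    output = [trigger]  # the only intervention: one explicitly decoded token
    while len(output) < max_new_tokens:
        tok = next_token(prompt + output)  # ordinary decoding from here on
        if tok == eos:
            break
        output.append(tok)
    return output


# Toy usage: a lookup table stands in for the model's greedy next-token choice.
refusals = [["I", "cannot", "help"], ["I", "'m", "sorry"], ["Sorry", ",", "no"]]
table = {"I": "cannot", "cannot": "assist", "assist": "<eos>"}
trigger = find_trigger_token(refusals)
print(decode_with_trigger(lambda ctx: table.get(ctx[-1], "<eos>"), ["prompt"], trigger))
# ['I', 'cannot', 'assist']
```

Constraining the intervention to one token is what keeps benign prompts usable in this sketch: after the first position, the model's own choices determine the rest of the response.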

ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data

Apr 23, 2025

Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin

Abstract:Aligning large language models with multiple human expectations and values is crucial for ensuring that they adequately serve a variety of user needs. To this end, offline multiobjective alignment algorithms such as the Rewards-in-Context algorithm have shown strong performance and efficiency. However, inappropriate preference representations and training with imbalanced reward scores limit the performance of such algorithms. In this work, we introduce ParetoHqD, which addresses the above issues by representing human preferences as preference directions in the objective space and regarding data near the Pareto front as "high-quality" data. For each preference, ParetoHqD follows a two-stage supervised fine-tuning process, where each stage uses an individual Pareto high-quality training set that best matches its preference direction. The experimental results demonstrate the superiority of ParetoHqD over five baselines on two multiobjective alignment tasks.

* 19 pages, 6 figures, Multiobjective Alignment of LLMs
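
A minimal sketch of the data-selection idea: keep samples whose reward vectors are non-dominated, then pick those whose direction in objective space is closest to a given preference direction. It assumes every objective is maximized, rewards are already normalized to non-negative values, only the first Pareto front counts as "high-quality", and cosine similarity measures how well a sample matches a preference direction. These are simplifying assumptions; the function names are hypothetical, and the two-stage fine-tuning itself is not shown.

```python
import numpy as np


def pareto_front_mask(rewards: np.ndarray) -> np.ndarray:
    """True for rows not dominated by any other row (all objectives maximized)."""
    n = rewards.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Row j dominates row i if it is >= everywhere and > somewhere.
        dominated_by = np.all(rewards >= rewards[i], axis=1) & np.any(rewards > rewards[i], axis=1)
        mask[i] = not dominated_by.any()
    return mask


def select_for_preference(rewards: np.ndarray, direction: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k front samples whose reward vectors point closest to `direction`."""
    front = np.flatnonzero(pareto_front_mask(rewards))
    pts = rewards[front]
    unit = direction / np.linalg.norm(direction)
    norms = np.maximum(np.linalg.norm(pts, axis=1), 1e-12)  # avoid division by zero
    cosine = (pts @ unit) / norms
    order = np.argsort(-cosine, kind="stable")
    return front[order[:k]]


rewards = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.3, 0.3], [0.9, 0.2]])
print(select_for_preference(rewards, np.array([1.0, 0.0]), k=2))  # [0 4]
```

Sample 3 is excluded because sample 2 dominates it; among the remaining front points, samples 0 and 4 lie closest to a preference that weights only the first objective.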

Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models

Apr 15, 2025

Jiangtao Liu, Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin

Abstract:Recent advancements in Text-to-Image (T2I) generation have significantly enhanced the realism and creativity of generated images. However, such powerful generative capabilities pose risks related to the production of inappropriate or harmful content. Existing defense mechanisms, including prompt checkers and post-hoc image checkers, are vulnerable to sophisticated adversarial attacks. In this work, we propose TCBS-Attack, a novel query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. By iteratively optimizing tokens near these boundaries, TCBS-Attack generates semantically coherent adversarial prompts capable of bypassing multiple defensive layers in T2I models. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models, including securely trained open-source models and commercial online services like DALL-E 3. TCBS-Attack achieves an ASR-4 of 45% and an ASR-1 of 21% on jailbreaking full-chain T2I models, significantly surpassing baseline methods.

Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective

Jul 17, 2024

Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin

Abstract:Adversarial training (AT) has become an effective defense against adversarial examples (AEs), and it is typically framed as a bi-level optimization problem. Among various AT methods, fast AT (FAT), which employs a single-step attack strategy to guide the training process, can achieve good robustness against adversarial attacks at a low cost. However, FAT methods suffer from catastrophic overfitting, especially on complex tasks or with large-parameter models. In this work, we propose a FAT method termed FGSM-PCO, which mitigates catastrophic overfitting by averting the collapse of the inner optimization problem in the bi-level optimization process. FGSM-PCO generates current-stage AEs from historical AEs and incorporates them into the training process using an adaptive mechanism. This mechanism determines an appropriate fusion ratio according to the performance of the AEs on the training model. Coupled with a loss function tailored to the training framework, FGSM-PCO can alleviate catastrophic overfitting and help an overfitted model recover to effective training. We evaluate our algorithm across three models and three datasets to validate its effectiveness. Comparative empirical studies against other FAT algorithms demonstrate that our proposed method effectively addresses overfitting issues left unresolved by existing algorithms.
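
To make "fusing historical and current-stage AEs" concrete, here is a toy sketch on a single logistic-regression example, where the FGSM input gradient has a closed form. The fusion ratio `alpha` below (weight each perturbation by the loss it induces) is a hypothetical stand-in for the paper's adaptive mechanism, and the helper names are illustrative; the tailored loss function is not reproduced.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model on one example."""
    p = sigmoid(x @ w + b)
    return float(-(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))


def fgsm_delta(w, b, x, y, eps):
    """Single-step FGSM perturbation: eps * sign of the input gradient."""
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w  # dL/dx for BCE on a logistic-regression model
    return eps * np.sign(grad_x)


def fused_delta(w, b, x, y, delta_hist, eps):
    """Blend a historical perturbation with a fresh FGSM one (hypothetical rule)."""
    delta_cur = fgsm_delta(w, b, x, y, eps)
    loss_hist = bce_loss(w, b, x + delta_hist, y)
    loss_cur = bce_loss(w, b, x + delta_cur, y)
    # Hypothetical adaptive ratio: weight each perturbation by the loss it induces.
    alpha = loss_hist / (loss_hist + loss_cur)
    delta = alpha * delta_hist + (1 - alpha) * delta_cur
    return np.clip(delta, -eps, eps), alpha  # stay inside the L-inf ball


w, b = np.array([1.0, -2.0]), 0.0
x, y, eps = np.array([0.5, 0.5]), 1.0, 0.1
# First stage: no history yet, so the historical perturbation is zero.
delta, alpha = fused_delta(w, b, x, y, np.zeros(2), eps)
print(alpha, delta)
```

In a real FAT loop the fused perturbation would be stored per example and reused as `delta_hist` in the next epoch; keeping the historical component is what prevents the inner maximization from collapsing to a weak single-step attack in this simplified picture.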

Exploring Knowledge Transfer in Evolutionary Many-task Optimization: A Complex Network Perspective

Jul 12, 2024

Yudong Yang, Kai Wu, Xiangyi Teng, Handing Wang, He Yu, Jing Liu

Abstract:The field of evolutionary many-task optimization (EMaTO) is increasingly recognized for its ability to streamline the resolution of optimization challenges with repetitive characteristics, thereby conserving computational resources. This paper tackles the challenge of crafting efficient knowledge transfer mechanisms within EMaTO, a task complicated by the computational demands of individual task evaluations. We introduce a novel framework that employs a complex network to comprehensively analyze the dynamics of knowledge transfer between tasks within EMaTO. By extracting and scrutinizing the knowledge transfer network from existing EMaTO algorithms, we evaluate the influence of network modifications on overall algorithmic efficacy. Our findings indicate that these networks are diverse, displaying community-structured directed graph characteristics, with their network density adapting to different task sets. This research underscores the viability of integrating complex network concepts into EMaTO to refine knowledge transfer processes, paving the way for future advancements in the domain.

* 9 pages, accepted by GECCO 2024 poster
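
The knowledge transfer network described above can be sketched as a weighted directed graph built from transfer events, whose density is the number of observed edges divided by the n(n-1) possible ordered task pairs. The event-log format of (source task, target task) pairs and the helper names are assumptions for illustration; community detection is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Edge = Tuple[int, int]


def transfer_network(transfers: Iterable[Edge]) -> Dict[Edge, int]:
    """Weighted directed edges: (source task, target task) -> number of transfers."""
    return dict(Counter((s, t) for s, t in transfers if s != t))  # ignore self-loops


def density(edges: Dict[Edge, int], n_tasks: int) -> float:
    """Directed-graph density: observed edges / possible ordered task pairs."""
    return len(edges) / (n_tasks * (n_tasks - 1))


# Hypothetical log of transfer events recorded by an EMaTO run with 4 tasks.
log = [(0, 1), (0, 1), (1, 0), (2, 3), (0, 2), (3, 3)]
net = transfer_network(log)
print(net)                      # {(0, 1): 2, (1, 0): 1, (2, 3): 1, (0, 2): 1}
print(density(net, n_tasks=4))  # 0.3333333333333333
```

Because edges are directed, transfer from task 0 to task 1 and from task 1 to task 0 count separately, which matches the directed-graph view in the abstract.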

Interpreting Multi-objective Evolutionary Algorithms via Sokoban Level Generation

Jun 15, 2024

Qingquan Zhang, Yuchen Li, Yuhang Lin, Handing Wang, Jialin Liu

Abstract:This paper presents an interactive platform for interpreting multi-objective evolutionary algorithms. Sokoban level generation is selected as a showcase because of its widespread use in procedural content generation. By balancing the emptiness and spatial diversity of Sokoban levels, we illustrate the improved two-archive algorithm, Two_Arch2, a well-known multi-objective evolutionary algorithm. Our web-based platform integrates Two_Arch2 into an interface that visually and interactively demonstrates the evolutionary process in real time. Designed to bridge theoretical optimisation strategies with practical game generation applications, the interface is available on a website and accessible both to researchers and to beginners in multi-objective evolutionary algorithms or procedural content generation. Through dynamic visualisations and interactive gameplay demonstrations, the platform also has potential as an educational tool.

GLHF: General Learned Evolutionary Algorithm Via Hyper Functions

May 06, 2024

Xiaobin Li, Kai Wu, Yujian Betterest Li, Xiaoyu Zhang, Handing Wang, Jing Liu

Abstract:Pretrained Optimization Models (POMs) leverage knowledge gained from optimizing various tasks, providing efficient solutions for new optimization challenges through direct usage or fine-tuning. Current POMs, however, suffer from inefficiency and limited generalization; our proposed general pre-trained optimization model (GPOM) addresses these shortcomings. GPOM constructs a population-based pretrained Black-Box Optimization (BBO) model tailored for continuous optimization. Evaluation on the BBOB benchmark and two robot control tasks demonstrates that GPOM significantly outperforms other pretrained BBO models, especially on high-dimensional tasks. Its direct optimization performance exceeds that of state-of-the-art evolutionary algorithms and POMs. Furthermore, GPOM exhibits robust generalization across diverse task distributions, dimensions, population sizes, and optimization horizons.

Pre-trained transformer for adversarial purification

May 27, 2023

Kai Wu, Yujian Betterest Li, Xiaoyu Zhang, Handing Wang, Jing Liu

Abstract:With more and more deep neural networks being deployed as daily services, their reliability is essential. However, deep neural networks are vulnerable and sensitive to adversarial attacks, the most common of which against such services are evasion-based. Recent works usually strengthen robustness through adversarial training or by leveraging knowledge from a large amount of clean data. In practice, however, retraining and redeploying the model requires a large computational budget, leading to heavy losses for the online service. In addition, when adversarial examples of a certain attack are detected, only a limited number of adversarial examples are available to the service provider, while much clean data may not be accessible. Given these problems, we propose a new scenario, RaPiD (Rapid Plug-in Defender), which aims to rapidly defend a frozen original service model against a certain attack using only a few clean and adversarial examples. Motivated by the generalization and universal computation ability of pre-trained transformer models, we propose a new defender method, CeTaD, which stands for Considering Pre-trained Transformers as Defenders. In particular, we evaluate the effectiveness and transferability of CeTaD in the case of one-shot adversarial examples and explore the impact of different parts of CeTaD as well as of training data conditions. CeTaD is flexible, can be embedded into an arbitrary differentiable model, and is suitable for various types of attacks.

B2Opt: Learning to Optimize Black-box Optimization with Little Budget

Apr 24, 2023

Xiaobin Li, Kai Wu, Xiaoyu Zhang, Handing Wang, Jing Liu

Abstract:Learning to optimize (L2O) has emerged as a powerful framework for black-box optimization (BBO). L2O learns optimization strategies from the target task automatically, without human intervention. This paper focuses on obtaining better performance on high-dimensional and expensive BBO with a small function evaluation budget, which is the core challenge of black-box optimization. However, current L2O-based methods are ill-suited to this setting because they require a large number of evaluations of expensive black-box functions during training and represent optimization strategies poorly. To address this, 1) we use cheap surrogate functions of the target task to guide the design of the optimization strategies; 2) drawing on the mechanism of evolutionary algorithms (EAs), we propose a novel framework called B2Opt, which has a stronger representation of optimization strategies. Compared to the BBO baselines, B2Opt achieves 3 to 10^6 times performance improvement with lower function evaluation cost. We test our proposal on high-dimensional synthetic functions and two real-world applications. We also find that deeper B2Opt models perform better than shallower ones.
