Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zeguan Xiao

Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Jun 11, 2025

Zeguan Xiao, Yun Chen, Guanhua Chen

Abstract:Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap" -- a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find a contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one's length. Training with POET, where both responses in each sample are truncated to equal length, resulting in diverse truncated lengths across samples, the optimization of DAAs objective is implicitly constrained to converge across all positions, thus paying more attention to prefix tokens than the standard DAAs. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving up to 15.6 points in AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.

Via

Access Paper or Ask Questions

Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers

Jun 05, 2025

Yutao Hou, Zeguan Xiao, Fei Yu, Yihan Jiang, Xuetao Wei, Hailiang Huang, Yun Chen, Guanhua Chen

Abstract:Large language models (LLMs) have achieved distinguished performance on various reasoning-intensive tasks. However, LLMs might still face the challenges of robustness issues and fail unexpectedly in some simple reasoning tasks. Previous works evaluate the LLM robustness with hand-crafted templates or a limited set of perturbation rules, indicating potential data contamination in pre-training or fine-tuning datasets. In this work, inspired by stress testing in software engineering, we propose a novel framework, Automatic Robustness Checker (AR-Checker), to generate mathematical problem variants that maintain the semantic meanings of the original one but might fail the LLMs. The AR-Checker framework generates mathematical problem variants through multi-round parallel streams of LLM-based rewriting and verification. Our framework can generate benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the strong performance of AR-Checker on mathematical tasks. We also evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro, and CommonsenseQA, where it also achieves strong performance, further proving the effectiveness of AR-Checker.

Via

Access Paper or Ask Questions

SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

Jul 02, 2024

Yan Yang, Zeguan Xiao, Xin Lu, Hongru Wang, Hailiang Huang, Guanhua Chen, Yun Chen

Figure 1 for SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

Figure 2 for SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

Figure 3 for SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

Figure 4 for SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

Abstract:The widespread applications of large language models (LLMs) have brought about concerns regarding their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SoP, a simple yet effective framework to design jailbreak prompts automatically. Inspired by the social facilitation concept, SoP generates and optimizes multiple jailbreak characters to bypass the guardrails of the target LLM. Different from previous work which relies on proprietary LLMs or seed jailbreak templates crafted by human expertise, SoP can generate and optimize the jailbreak prompt in a cold-start scenario using open-sourced LLMs without any seed jailbreak templates. Experimental results show that SoP achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, while also exploring defense strategies against the jailbreak attack designed by SoP. Code is available at https://github.com/Yang-Yan-Yang-Yan/SoP.

Via

Access Paper or Ask Questions

MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning

Jun 13, 2024

Hanqing Wang, Zeguan Xiao, Yixia Li, Shuo Wang, Guanhua Chen, Yun Chen

Figure 1 for MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning

Figure 2 for MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning

Figure 3 for MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning

Figure 4 for MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning

Abstract:Efficient finetuning of large language models (LLMs) aims to adapt the LLMs with reduced computation and memory cost. Previous LoRA-based approaches initialize the low-rank matrices with gaussian distribution and zero values, while keeping the original weight matrices frozen. However, the trainable model parameters optimized in an unguided subspace might have interference with the well-learned subspace of the pretrained weight matrix. In this paper, we propose MiLoRA, a simple yet effective LLM finetuning approach that only updates the minor singular components of the weight matrix while keeping the principle singular components frozen. It is observed that the minor matrix corresponds to the noisy or long-tail information, while the principle matrix contains important knowledge. The MiLoRA initializes the low-rank matrices within a subspace that is orthogonal to the principle matrix, thus the pretrained knowledge is expected to be well preserved. During finetuning, MiLoRA makes the most use of the less-optimized subspace for learning the finetuning dataset. Extensive experiments on commonsense reasoning, math reasoning and instruction following benchmarks present the superior performance of our method.

Via

Access Paper or Ask Questions

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Mar 13, 2024

Zeguan Xiao, Yan Yang, Guanhua Chen, Yun Chen

Figure 1 for Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Figure 2 for Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Figure 3 for Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Figure 4 for Tastle: Distract Large Language Models for Automatic Jailbreak Attack

Abstract:Large language models (LLMs) have achieved significant advances in recent days. Extensive efforts have been made before the public release of LLMs to align their behaviors with human values. The primary goal of alignment is to ensure their helpfulness, honesty and harmlessness. However, even meticulously aligned LLMs remain vulnerable to malicious manipulations such as jailbreaking, leading to unintended behaviors. The jailbreak is to intentionally develop a malicious prompt that escapes from the LLM security restrictions to produce uncensored detrimental contents. Previous works explore different jailbreak methods for red teaming LLMs, yet they encounter challenges regarding to effectiveness and scalability. In this work, we propose Tastle, a novel black-box jailbreak framework for automated red teaming of LLMs. We designed malicious content concealing and memory reframing with an iterative optimization algorithm to jailbreak LLMs, motivated by the research about the distractibility and over-confidence phenomenon of LLMs. Extensive experiments of jailbreaking both open-source and proprietary LLMs demonstrate the superiority of our framework in terms of effectiveness, scalability and transferability. We also evaluate the effectiveness of existing jailbreak defense methods against our attack and highlight the crucial need to develop more effective and practical defense strategies.

Via

Access Paper or Ask Questions

BERT4GCN: Using BERT Intermediate Layers to Augment GCN for Aspect-based Sentiment Classification

Oct 01, 2021

Zeguan Xiao, Jiarun Wu, Qingliang Chen, Congjian Deng

Figure 1 for BERT4GCN: Using BERT Intermediate Layers to Augment GCN for Aspect-based Sentiment Classification

Figure 2 for BERT4GCN: Using BERT Intermediate Layers to Augment GCN for Aspect-based Sentiment Classification

Figure 3 for BERT4GCN: Using BERT Intermediate Layers to Augment GCN for Aspect-based Sentiment Classification

Figure 4 for BERT4GCN: Using BERT Intermediate Layers to Augment GCN for Aspect-based Sentiment Classification

Abstract:Graph-based Aspect-based Sentiment Classification (ABSC) approaches have yielded state-of-the-art results, expecially when equipped with contextual word embedding from pre-training language models (PLMs). However, they ignore sequential features of the context and have not yet made the best of PLMs. In this paper, we propose a novel model, BERT4GCN, which integrates the grammatical sequential features from the PLM of BERT, and the syntactic knowledge from dependency graphs. BERT4GCN utilizes outputs from intermediate layers of BERT and positional information between words to augment GCN (Graph Convolutional Network) to better encode the dependency graphs for the downstream classification. Experimental results demonstrate that the proposed BERT4GCN outperforms all state-of-the-art baselines, justifying that augmenting GCN with the grammatical features from intermediate layers of BERT can significantly empower ABSC models.

* To appear in EMNLP 2021, 8 pages, 2 figures

Via

Access Paper or Ask Questions