Xiaohang Tang

Robust Multi-Objective Controlled Decoding of Large Language Models

Mar 11, 2025

Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

Abstract: Test-time alignment of Large Language Models (LLMs) to human preferences offers a flexible way to generate responses aligned to diverse objectives without extensive retraining of LLMs. Existing methods achieve alignment to multiple objectives simultaneously (e.g., instruction-following, helpfulness, conciseness) by optimizing their corresponding reward functions. However, they often rely on predefined weights or optimize for averages, sacrificing one objective for another and leading to unbalanced outcomes. To address this, we introduce Robust Multi-Objective Decoding (RMOD), a novel inference-time algorithm that optimizes for improving worst-case rewards. RMOD formalizes the robust decoding problem as a maximin two-player game between reward weights and the sampling policy, solving for the Nash equilibrium. We show that the game reduces to a convex optimization problem to find the worst-case weights, while the best response policy can be computed analytically. We also introduce a practical RMOD variant designed for efficient decoding with contemporary LLMs, incurring minimal computational overhead compared to non-robust Multi-Objective Decoding (MOD) methods. Our experimental results showcase the effectiveness of RMOD in generating responses equitably aligned with diverse objectives, outperforming baselines by up to 20%.

* 24 pages, 9 figures
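
As a rough illustration of the maximin idea in the abstract above (not the RMOD algorithm itself), the sketch below picks, from a set of candidate responses with per-objective reward scores, the one whose worst objective is highest. A weighted sum over the probability simplex is minimized by placing all weight on the smallest reward, so the inner minimization over weights reduces to a plain `min`. The function name, the candidates, and the scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def maximin_select(candidates: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the candidate whose minimum per-objective reward is largest."""
    best_name, best_worst = None, float("-inf")
    for name, rewards in candidates.items():
        # Worst case over simplex weights equals the smallest single reward.
        worst = min(rewards)
        if worst > best_worst:
            best_name, best_worst = name, worst
    return best_name, best_worst


# Hypothetical scores for three objectives (illustrative values only).
scores = {
    "response_a": [0.9, 0.2, 0.8],
    "response_b": [0.6, 0.5, 0.7],
    "response_c": [0.4, 0.4, 0.9],
}

choice, worst = maximin_select(scores)

# For contrast: selecting by the average reward instead of the worst case.
mean_choice = max(scores, key=lambda k: sum(scores[k]) / len(scores[k]))
```

With these made-up scores the average-based choice favours a response that does poorly on one objective, while the maximin choice trades some average reward for a better worst case.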

Adversarial Robust Decision Transformer: Enhancing Robustness of RvS via Minimax Returns-to-go

Jul 25, 2024

Xiaohang Tang, Afonso Marques, Parameswaran Kamalaruban, Ilija Bogunovic

Abstract: Decision Transformer (DT), as one of the representative Reinforcement Learning via Supervised Learning (RvS) methods, has achieved strong performance in offline learning tasks by leveraging the powerful Transformer architecture for sequential decision-making. However, in adversarial environments, these methods can be non-robust, since the return is dependent on the strategies of both the decision-maker and adversary. Training a probabilistic model conditioned on observed return to predict action can fail to generalize, as the trajectories that achieve a return in the dataset might have done so due to a weak and suboptimal behavior adversary. To address this, we propose a worst-case-aware RvS algorithm, the Adversarial Robust Decision Transformer (ARDT), which learns and conditions the policy on in-sample minimax returns-to-go. ARDT aligns the target return with the worst-case return learned through minimax expectile regression, thereby enhancing robustness against powerful test-time adversaries. In experiments conducted on sequential games with full data coverage, ARDT can generate a maximin (Nash Equilibrium) strategy, the solution with the largest adversarial robustness. In large-scale sequential games and continuous adversarial RL environments with partial data coverage, ARDT demonstrates significantly superior robustness to powerful test-time adversaries and attains higher worst-case returns compared to contemporary DT methods.

* Preprint
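
The abstract above relies on expectile regression to approximate in-sample minima and maxima. The following is a minimal sketch of that building block only, not the ARDT training procedure: it fits a scalar expectile to a list of returns by iteratively reweighted averaging. The function names and the sample returns are assumptions made for illustration.

```python
from typing import List


def expectile_loss(prediction: float, target: float, tau: float) -> float:
    """Asymmetric squared error: weight tau above the prediction, 1 - tau below."""
    diff = target - prediction
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(samples: List[float], tau: float, iters: int = 100) -> float:
    """Find the scalar minimizing the summed expectile loss over the samples."""
    estimate = sum(samples) / len(samples)
    for _ in range(iters):
        # First-order condition: the estimate is a weighted mean of the samples.
        weights = [tau if x > estimate else 1.0 - tau for x in samples]
        estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return estimate


# Hypothetical returns observed for one state-action pair.
returns = [1.0, 4.0, 10.0]

low = fit_expectile(returns, tau=0.01)   # approaches the smallest return
mid = fit_expectile(returns, tau=0.5)    # equals the ordinary mean
high = fit_expectile(returns, tau=0.99)  # approaches the largest return
```

Pushing `tau` towards 0 or 1 moves the estimate towards the smallest or largest observed return without leaving the range of the data, which is what makes expectiles a convenient in-sample stand-in for a min or max.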

Can Word Sense Distribution Detect Semantic Changes of Words?

Oct 16, 2023

Xiaohang Tang, Yi Zhou, Taichi Aida, Procheta Sen, Danushka Bollegala

Abstract: Semantic Change Detection (SCD) of words is an important task for various NLP applications that must make time-sensitive predictions. Some words are used over time in novel ways to express new meanings, and these new meanings establish themselves as novel senses of existing words. On the other hand, Word Sense Disambiguation (WSD) methods associate ambiguous words with sense ids, depending on the context in which they occur. Given this relationship between WSD and SCD, we explore the possibility of predicting whether a target word has its meaning changed between two corpora collected at different time steps, by comparing the distributions of senses of that word in each corpus. For this purpose, we use pretrained static sense embeddings to automatically annotate each occurrence of the target word in a corpus with a sense id. Next, we compute the distribution of sense ids of a target word in a given corpus. Finally, we use different divergence or distance measures to quantify the semantic change of the target word across the two given corpora. Our experimental results on the SemEval 2020 Task 1 dataset show that word sense distributions can be accurately used to predict semantic changes of words in English, German, Swedish and Latin.

* EMNLP 2023
* Accepted to Findings of EMNLP 2023
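
The last two steps described in the abstract above (build a sense-id distribution per corpus, then compare the two distributions) can be sketched as follows. Jensen-Shannon divergence is used here only as one example of a divergence measure; the abstract does not say which measures were used. The sense ids and counts are invented for illustration.

```python
import math
from collections import Counter
from typing import List


def sense_distribution(sense_ids: List[str], inventory: List[str]) -> List[float]:
    """Relative frequency of each sense id among a word's occurrences."""
    counts = Counter(sense_ids)
    total = len(sense_ids)
    return [counts[s] / total for s in inventory]


def jensen_shannon_divergence(p: List[float], q: List[float]) -> float:
    """Symmetric divergence in [0, 1] (base-2 logs) between two distributions."""
    def kl(a: List[float], b: List[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical sense annotations of one target word in two corpora.
inventory = ["plane%geometry", "plane%aircraft"]
corpus_1 = ["plane%geometry"] * 8 + ["plane%aircraft"] * 2
corpus_2 = ["plane%geometry"] * 3 + ["plane%aircraft"] * 7

p = sense_distribution(corpus_1, inventory)
q = sense_distribution(corpus_2, inventory)
change_score = jensen_shannon_divergence(p, q)
```

A larger `change_score` would then be read as stronger evidence that the word's dominant sense has shifted between the two corpora.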

Learning Dynamic Contextualised Word Embeddings via Template-based Temporal Adaptation

Aug 23, 2022

Xiaohang Tang, Yi Zhou, Danushka Bollegala

Abstract: Dynamic contextualised word embeddings represent the temporal semantic variations of words. We propose a method for learning dynamic contextualised word embeddings by time-adapting a pretrained Masked Language Model (MLM) using time-sensitive templates. Given two snapshots $C_1$ and $C_2$ of a corpus taken respectively at two distinct timestamps $T_1$ and $T_2$, we first propose an unsupervised method to select (a) pivot terms related to both $C_1$ and $C_2$, and (b) anchor terms that are associated with a specific pivot term in each individual snapshot. We then generate prompts by filling manually compiled templates using the extracted pivot and anchor terms. Moreover, we propose an automatic method to learn time-sensitive templates from $C_1$ and $C_2$, without requiring any human supervision. Next, we use the generated prompts to adapt a pretrained MLM to $T_2$ by fine-tuning it on the prompts. Experimental results show that our proposed method significantly reduces the perplexity of test sentences selected from $T_2$, thereby outperforming the current state-of-the-art dynamic contextualised word embedding methods.
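
The prompt-generation step in the abstract above (filling templates with extracted pivot and anchor terms) might look roughly like the sketch below. The template wording, the placeholder names, and the example terms are all hypothetical; the abstract does not give the actual templates or the term-selection method.

```python
from typing import Dict, List


def generate_prompts(
    templates: List[str],
    pivot_anchors: Dict[str, List[str]],
    timestamp: str,
) -> List[str]:
    """Fill every template with every (pivot, anchor) pair for one snapshot."""
    prompts = []
    for template in templates:
        for pivot, anchors in pivot_anchors.items():
            for anchor in anchors:
                prompts.append(
                    template.format(pivot=pivot, anchor=anchor, time=timestamp)
                )
    return prompts


# Hypothetical templates and terms, for illustration only.
templates = [
    "{pivot} is associated with {anchor} in {time}.",
    "In {time}, people talk about {pivot} together with {anchor}.",
]
pivot_anchors = {"mask": ["hospital", "pandemic"]}

prompts = generate_prompts(templates, pivot_anchors, timestamp="2020")
```

The resulting sentences would then serve as fine-tuning text for the masked language model, as the abstract describes.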

Average-Reward Reinforcement Learning with Trust Region Methods

Jun 07, 2021

Xiaoteng Ma, Xiaohang Tang, Li Xia, Jun Yang, Qianchuan Zhao

Abstract: Most reinforcement learning algorithms optimize the discounted criterion, which is beneficial to accelerate the convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as finance-related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. Firstly, we develop a unified trust region theory with discounted and average criteria. With the average criterion, a novel performance bound within the trust region is derived with the Perturbation Analysis (PA) theory. Secondly, we propose a practical algorithm named Average Policy Optimization (APO), which improves the value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first one to study the trust region approach with the average criterion and it complements the framework of reinforcement learning beyond the discounted criterion. Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach.

* Accepted by IJCAI 2021
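
To make the contrast between the two criteria in the abstract above concrete, the sketch below scores two reward sequences under a discounted sum and under a plain average. It is a finite-horizon illustration of the criteria only, not the APO algorithm, and the reward sequences and discount factor are made up.

```python
from typing import List


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of rewards weighted by gamma**t, so later rewards count less."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def average_reward(rewards: List[float]) -> float:
    """Finite-horizon stand-in for the long-run average: every step counts equally."""
    return sum(rewards) / len(rewards)


# Two hypothetical trajectories with the same total reward at different times.
early = [10.0] + [0.0] * 9
late = [0.0] * 9 + [10.0]

disc_early = discounted_return(early, gamma=0.9)
disc_late = discounted_return(late, gamma=0.9)
avg_early = average_reward(early)
avg_late = average_reward(late)
```

Under discounting the early-reward trajectory scores higher, while the average criterion rates both trajectories the same, which is the behaviour the abstract attributes to problems that "treat future rewards equally".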
