Yiming Liu

Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization

Oct 02, 2025

Kezhao Liu, Jason Klein Liu, Mingtao Chen, Yiming Liu

Abstract:Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation, a practice that overlooks the term's functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term $k_n$ as a detached coefficient for the policy's score function ('$k_n$ in reward') or as a direct loss function through which gradients are propagated ('$k_n$ as loss'). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional '$k_1$ in reward' (as in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditions, the '$k_2$ as loss' formulation is, in fact, gradient-equivalent to '$k_1$ in reward'. This equivalence, first proven in our work, identifies both as the theoretically sound implementations of the RKL objective. In contrast, we show that the recently adopted '$k_3$ as loss' (as in GRPO) is merely a first-order, biased approximation of the principled loss. Furthermore, we argue that common off-policy implementations of '$k_n$ as loss' methods are biased due to neglected importance sampling, and we propose a principled correction. Our findings provide a comprehensive, gradient-based rationale for choosing and correctly implementing KL regularization, paving the way for more robust and effective RLHF systems.
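
The abstract does not define the $k_n$ terms. As a minimal sketch, assume the commonly used estimator definitions $k_1 = -\log r$, $k_2 = \tfrac{1}{2}(\log r)^2$ and $k_3 = r - 1 - \log r$, with $r = \pi_{\mathrm{ref}}(x)/\pi_\theta(x)$ and $x$ drawn from $\pi_\theta$. Under that assumption, differentiating $k_2$ through $\log \pi_\theta$ yields the score function weighted by $k_1$, while differentiating $k_3$ yields it weighted by $1 - r$. The toy softmax policy below (all names are hypothetical, and expectations are computed exactly rather than sampled) compares both coefficients with a finite-difference gradient of the reverse KL.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reverse_kl(theta, log_ref):
    """KL(pi_theta || pi_ref) for a categorical softmax policy."""
    pi = softmax(theta)
    return float(np.sum(pi * (np.log(pi) - log_ref)))

def expected_score_gradient(theta, coeff):
    """Exact E_{x~pi}[coeff(x) * grad_theta log pi(x)], with coeff treated as detached.

    For softmax logits, grad_theta log pi(x) = e_x - pi, so the expectation
    simplifies to pi * (coeff - <pi, coeff>).
    """
    pi = softmax(theta)
    return pi * (coeff - np.dot(pi, coeff))

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite differences, used only as a reference."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
theta = rng.normal(size=5)                      # policy logits (toy values)
log_ref = np.log(softmax(rng.normal(size=5)))   # reference policy (toy values)

pi = softmax(theta)
log_r = log_ref - np.log(pi)        # log r with r = pi_ref / pi (assumed convention)
k1_coeff = -log_r                   # 'k1 in reward'; also d k2 / d log pi
k3_coeff = 1.0 - np.exp(log_r)      # d k3 / d log pi for k3 = r - 1 - log r

grad_reference = numerical_gradient(lambda t: reverse_kl(t, log_ref), theta)
grad_k1 = expected_score_gradient(theta, k1_coeff)
grad_k3 = expected_score_gradient(theta, k3_coeff)

print("max |k1 - reference|:", np.max(np.abs(grad_k1 - grad_reference)))
print("max |k3 - reference|:", np.max(np.abs(grad_k3 - grad_reference)))
```

On this toy problem the $k_1$ coefficient reproduces the reverse-KL gradient up to finite-difference error, while the $k_3$ coefficient does not. Since $1 - r \approx -\log r$ only near $r = 1$, this is in line with the abstract's description of '$k_3$ as loss' as a first-order approximation.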

SemSteDiff: Generative Diffusion Model-based Coverless Semantic Steganography Communication

Sep 05, 2025

Song Gao, Rui Meng, Xiaodong Xu, Haixiao Gao, Yiming Liu, Chenyuan Feng, Ping Zhang, Tony Q. S. Quek, Dusit Niyato

Abstract:Semantic communication (SemCom), as a novel paradigm for future communication systems, has recently attracted much attention due to its superiority in communication efficiency. However, similar to traditional communication, it also suffers from eavesdropping threats. Intelligent eavesdroppers could launch advanced semantic analysis techniques to infer secret semantic information. Therefore, some researchers have designed Semantic Steganography Communication (SemSteCom) schemes to confuse semantic eavesdroppers. However, the state-of-the-art SemSteCom schemes for image transmission rely on a pre-selected cover image, which limits their universality. To address this issue, we propose a Generative Diffusion Model-based Coverless Semantic Steganography Communication (SemSteDiff) scheme to hide secret images in generated stego images. The semantically related private and public keys enable the legitimate receiver to decode secret images correctly, while an eavesdropper without the completely correct key pair fails to obtain them. Simulation results demonstrate the effectiveness of the plug-and-play design in different Joint Source-Channel Coding (JSCC) frameworks. The comparison results under different eavesdroppers' threats show that, when the Signal-to-Noise Ratio (SNR) is 0 dB, the peak signal-to-noise ratio (PSNR) of the legitimate receiver is 4.14 dB higher than that of the eavesdropper.

* 13 pages, 11 figures
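
For reference, the PSNR metric quoted in the abstract is conventionally computed from the mean squared error between an original image and its reconstruction. The following is a minimal sketch, assuming 8-bit images held as NumPy arrays; the function name and the synthetic data are illustrative and are not taken from the paper.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Synthetic stand-ins for a decoded image at two noise levels.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32)).astype(np.float64)
mild = np.clip(image + rng.normal(scale=5.0, size=image.shape), 0, 255)
heavy = np.clip(image + rng.normal(scale=20.0, size=image.shape), 0, 255)

print("PSNR, mild noise (dB):", psnr(image, mild))
print("PSNR, heavy noise (dB):", psnr(image, heavy))
```

A gap such as the 4.14 dB reported above would then be the difference between two such values, one for each receiver.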

Generative Diffusion Models for Wireless Networks: Fundamental, Architecture, and State-of-the-Art

Jul 22, 2025

Dayu Fan, Rui Meng, Xiaodong Xu, Yiming Liu, Guoshun Nan, Chenyuan Feng, Shujun Han, Song Gao, Bingxuan Xu, Dusit Niyato(+2 more)

Abstract:With the rapid development of Generative Artificial Intelligence (GAI) technology, Generative Diffusion Models (GDMs) have shown significant potential for empowering wireless networks, owing to advantages such as noise resistance, training stability, controllability, and multimodal generation. Although multiple studies have focused on GDMs for wireless networks, there is still a lack of comprehensive reviews of their technological evolution. Motivated by this, we systematically explore the application of GDMs in wireless networks. First, starting from mathematical principles, we analyze the technical advantages of GDMs and present six representative models. Furthermore, we propose a multi-layer wireless network architecture comprising a sensing layer, a transmission layer, an application layer, and a security plane. We also introduce the core mechanisms of GDMs at each of the layers. Subsequently, we conduct a rigorous review of existing GDM-based schemes, with a focus on analyzing their innovative points, the role of GDMs, strengths, and weaknesses. Finally, we extract key challenges and provide potential solutions, with the aim of providing directional guidance for future research in this field.

* 30 pages, 11 figures

NegVQA: Can Vision Language Models Understand Negation?

May 28, 2025

Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy

Abstract:Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.

* Published at ACL 2025 Findings

Learning a General Model: Folding Clothing with Topological Dynamics

Apr 29, 2025

Yiming Liu, Lijun Han, Enlin Gu, Hesheng Wang

Abstract:The high degrees of freedom and complex structure of garments present significant challenges for clothing manipulation. In this paper, we propose a general topological dynamics model to fold complex clothing. By utilizing the visible folding structure as the topological skeleton, we design a novel topological graph to represent the clothing state. This topological graph is low-dimensional and applicable to complex clothing in various folding states. It indicates the constraints of clothing and enables predictions regarding clothing movement. To extract graphs under self-occlusion, we apply semantic segmentation to analyze the occlusion relationships and decompose the clothing structure. The decomposed structure is then combined with keypoint detection to generate the topological graph. To analyze the behavior of the topological graph, we employ an improved Graph Neural Network (GNN) to learn the general dynamics. The GNN model can predict the deformation of clothing and is employed to calculate the deformation Jacobian matrix for control. Experiments using jackets validate the algorithm's effectiveness in recognizing and folding complex clothing with self-occlusion.

RIS-Assisted Joint Sensing and Communications via Fractionally Constrained Fractional Programming

Mar 13, 2025

Yiming Liu, Kareem M. Attiah, Wei Yu

Abstract:This paper studies an uplink dual-functional sensing and communication system aided by a reconfigurable intelligent surface (RIS), whose reflection pattern is optimally configured to trade off sensing and communication functionalities. Specifically, the Bayesian Cramér-Rao lower bound (BCRLB) for estimating the azimuth angle of a sensing user is minimized while ensuring the signal-to-interference-plus-noise ratio constraints for communication users. We show that this problem can be formulated as a novel fractionally constrained fractional programming (FCFP) problem. To deal with this highly nontrivial problem, we extend a quadratic transform technique, originally proposed to handle optimization problems containing ratio structures only in objectives, to the scenario where the constraints also contain ratio structures. First, we consider the case where the fading coefficient is known. Using the quadratic transform, the FCFP problem is turned into a sequence of subproblems that are convex except for the constant-modulus constraints, which can be tackled using a penalty-based method. To further reduce the computational complexity, we leverage the constant-modulus conditions and propose a novel linear transform. This new transform enables the FCFP problem to be turned into a sequence of linear programming (LP) subproblems, which can be solved with linear complexity in the dimension of reflecting elements. Then, we consider the case where the fading coefficient is unknown. A modified BCRLB is used to make the problem more tractable, and the proposed quadratic transform-based algorithm is used to solve the problem. Finally, numerical results unveil nontrivial and effective reflection patterns that the RIS can be configured to generate to facilitate both functionalities.

* The paper has been submitted to IEEE Transactions on Wireless Communications for review and possible publication
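
As background for the abstract above, the classical quadratic transform for a single ratio in the objective rests on the identity $A/B = \max_y \,(2y\sqrt{A} - y^2 B)$ for $A \ge 0$, $B > 0$, attained at $y = \sqrt{A}/B$. The sketch below illustrates only this objective-only case on a hypothetical scalar problem, maximizing $x/(x^2+1)$; it is not the paper's FCFP extension, and the function names are made up for illustration.

```python
import math

def numerator(x):
    return x              # A(x), nonnegative and concave for x >= 0

def denominator(x):
    return x * x + 1.0    # B(x), positive and convex

def quadratic_transform_maximize(x0, iterations=60):
    """Alternate between the auxiliary variable y and the variable x.

    Surrogate objective: 2*y*sqrt(A(x)) - y**2 * B(x).
    """
    x = x0
    for _ in range(iterations):
        # Closed-form optimal auxiliary variable for the current x.
        y = math.sqrt(numerator(x)) / denominator(x)
        # For fixed y > 0 the surrogate is concave in x; setting its
        # derivative y/sqrt(x) - 2*y**2*x to zero gives x**1.5 = 1/(2*y).
        x = (1.0 / (2.0 * y)) ** (2.0 / 3.0)
    return x, numerator(x) / denominator(x)

x_star, ratio = quadratic_transform_maximize(x0=0.2)
print("x:", x_star, "ratio:", ratio)
```

For this toy ratio the maximizer can be found by hand ($x = 1$, value $1/2$), which gives a simple check on the iteration.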

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Jan 06, 2025

Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei(+2 more)

Abstract:The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.

* Project page: https://yuhui-zh15.github.io/AutoConverter-Website/

Can Watermarked LLMs be Identified by Users via Crafted Prompts?

Oct 04, 2024

Aiwei Liu, Sheng Guan, Yiming Liu, Leyi Pan, Yifei Zhang, Liancheng Fang, Lijie Wen, Philip S. Yu, Xuming Hu

Abstract:Text watermarking for Large Language Models (LLMs) has made significant progress in detecting LLM outputs and preventing misuse. Current watermarking techniques offer high detectability, minimal impact on text quality, and robustness to text editing. However, current research has not investigated the imperceptibility of watermarking techniques in LLM services. This is crucial because LLM providers may not want to disclose the presence of watermarks in real-world scenarios, as doing so could reduce user willingness to use the service and make watermarks more vulnerable to attacks. This work is the first to investigate the imperceptibility of watermarked LLMs. We design an identification algorithm called Water-Probe that detects watermarks through well-designed prompts to the LLM. Our key motivation is that current watermarked LLMs expose consistent biases under the same watermark key, resulting in similar differences across prompts under different watermark keys. Experiments show that almost all mainstream watermarking algorithms are easily identified with our well-designed prompts, while Water-Probe demonstrates a minimal false positive rate for non-watermarked LLMs. Finally, we propose that the key to enhancing the imperceptibility of watermarked LLMs is to increase the randomness of watermark key selection. Based on this, we introduce the Water-Bag strategy, which significantly improves watermark imperceptibility by merging multiple watermark keys.

* 25 pages, 5 figures, 8 tables

Disentangling Age and Identity with a Mutual Information Minimization Approach for Cross-Age Speaker Verification

Sep 24, 2024

Fengrun Zhang, Wangjin Zhou, Yiming Liu, Wang Geng, Yahui Shan, Chen Zhang

Abstract:There has been an increasing research interest in cross-age speaker verification (CASV). However, existing speaker verification systems perform poorly in CASV due to the great individual differences in voice caused by aging. In this paper, we propose a disentangled representation learning framework for CASV based on mutual information (MI) minimization. In our method, a backbone model is trained to disentangle the identity- and age-related embeddings from speaker information, and an MI estimator is trained to minimize the correlation between age- and identity-related embeddings via MI minimization, resulting in age-invariant speaker embeddings. Furthermore, by using the age gaps between positive and negative samples, we propose an aging-aware MI minimization loss function that allows the backbone model to focus more on the vocal changes with large age gaps. Experimental results show that the proposed method outperforms other methods on multiple Cross-Age test sets of Vox-CA.

* Interspeech 2024

Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations

Sep 12, 2024

Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, He Qu

Abstract:This study presents an innovative Zero-Shot any-to-any Singing Voice Conversion (SVC) method, leveraging a novel clustering-based phoneme representation to effectively separate content, timbre, and singing style. This approach enables precise voice characteristic manipulation. We discovered that datasets with fewer recordings per artist are more susceptible to timbre leakage. Extensive testing on over 10,000 hours of singing and user feedback revealed our model significantly improves sound quality and timbre accuracy, aligning with our objectives and advancing voice conversion technology. Furthermore, this research advances zero-shot SVC and sets the stage for future work on discrete speech representation, emphasizing the preservation of rhyme.
