Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hexuan Deng

REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models

May 26, 2025

Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang

Abstract:Large Reasoning Models (LRMs) demonstrate strong performance in complex tasks but often face the challenge of overthinking, leading to substantially high inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn, but are inefficient for online usage due to the time-consuming data generation and filtering processes. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but tends to lose the reflection ability and harm the performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. Besides, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while significantly improving inference efficiency. Their combination achieves a good balance between performance and efficiency, reducing inference costs by 35% without compromising performance. Further analysis demonstrates that our methods are effective by maintaining reflection frequency for hard problems while appropriately reducing it for simpler ones without losing reflection ability. Codes are available at https://github.com/hexuandeng/REA-RL.

* Work in Progress

Via

Access Paper or Ask Questions

Dynamic Sampling that Adapts: Iterative DPO for Self-Aware Mathematical Reasoning

May 22, 2025

Jun Rao, Xuebo Liu, Hexuan Deng, Zepeng Lin, Zixiong Yu, Jiansheng Wei, Xiaojun Meng, Min Zhang

Abstract:In the realm of data selection for reasoning tasks, existing approaches predominantly rely on externally predefined static metrics such as difficulty and diversity, which are often designed for supervised fine-tuning (SFT) and lack adaptability to continuous training processes. A critical limitation of these methods is their inability to dynamically align with the evolving capabilities of models during online training, a gap that becomes increasingly pronounced with the rise of dynamic training paradigms and online reinforcement learning (RL) frameworks (e.g., R1 models). To address this, we introduce SAI-DPO, an algorithm that dynamically selects training data by continuously assessing a model's stage-specific reasoning abilities across different training phases. By integrating real-time model performance feedback, SAI-DPO adaptively adapts data selection to the evolving strengths and weaknesses of the model, thus enhancing both data utilization efficiency and final task performance. Extensive experiments on three state-of-the-art models and eight mathematical reasoning benchmarks, including challenging competition-level datasets (e.g., AIME24 and AMC23), demonstrate that SAI-DPO achieves an average performance boost of up to 21.3 percentage points, with particularly notable improvements of 10 and 15 points on AIME24 and AMC23, respectively. These results highlight the superiority of dynamic, model-adaptive data selection over static, externally defined strategies in advancing reasoning.

Via

Access Paper or Ask Questions

DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization

Nov 21, 2024

Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Min Zhang, Zhaopeng Tu

Figure 1 for DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization

Figure 2 for DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization

Figure 3 for DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization

Figure 4 for DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization

Abstract:Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs. Structured pruning reduces model size and speeds up inference but often causes uneven degradation across domains, leading to biased performance. To address this, we propose DRPruning, which incorporates distributionally robust optimization to restore balanced performance across domains, along with further improvements to enhance robustness. Experiments in monolingual and multilingual settings show that our method surpasses similarly sized models in pruning and continued pretraining over perplexity, downstream tasks, and instruction tuning. We further provide analysis demonstrating the robustness of our method towards various domains and distribution shifts. Furthermore, our method automatically determines optimal reference losses and data ratios, suggesting potential for broader applications. Our code is available at https://github.com/hexuandeng/DRPruning.

* Work in Progress

Via

Access Paper or Ask Questions

NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates

Oct 28, 2024

Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Min Zhang, Zhaopeng Tu

Figure 1 for NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates

Figure 2 for NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates

Figure 3 for NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates

Figure 4 for NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates

Abstract:Despite their remarkable abilities in various tasks, large language models (LLMs) still struggle with real-time information (e.g., new facts and terms) due to the knowledge cutoff in their development process. However, existing benchmarks focus on outdated content and limited fields, facing difficulties in real-time updating and leaving new terms unexplored. To address this problem, we propose an adaptive benchmark, NewTerm, for real-time evaluation of new terms. We design a highly automated construction method to ensure high-quality benchmark construction with minimal human effort, allowing flexible updates for real-time information. Empirical results on various LLMs demonstrate over 20% performance reduction caused by new terms. Additionally, while updates to the knowledge cutoff of LLMs can cover some of the new terms, they are unable to generalize to more distant new terms. We also analyze which types of terms are more challenging and why LLMs struggle with new terms, paving the way for future research. Finally, we construct NewTerm 2022 and 2023 to evaluate the new terms updated each year and will continue updating annually. The benchmark and codes can be found at https://github.com/hexuandeng/NewTerm.

* Accepted to NeurIPS 2024 Datasets and Benchmarks Track

Via

Access Paper or Ask Questions

Holistic Exploration on Universal Decompositional Semantic Parsing: Architecture, Data Augmentation, and LLM Paradigm

Jul 25, 2023

Hexuan Deng, Xin Zhang, Meishan Zhang, Xuebo Liu, Min Zhang

Abstract:In this paper, we conduct a holistic exploration of the Universal Decompositional Semantic (UDS) Parsing. We first introduce a cascade model for UDS parsing that decomposes the complex parsing task into semantically appropriate subtasks. Our approach outperforms the prior models, while significantly reducing inference time. We also incorporate syntactic information and further optimized the architecture. Besides, different ways for data augmentation are explored, which further improve the UDS Parsing. Lastly, we conduct experiments to investigate the efficacy of ChatGPT in handling the UDS task, revealing that it excels in attribute parsing but struggles in relation parsing, and using ChatGPT for data augmentation yields suboptimal results. Our code is available at https://github.com/hexuandeng/HExp4UDS.

* 12 pages, 7 figures, 3 tables

Via

Access Paper or Ask Questions

Improving Simultaneous Machine Translation with Monolingual Data

Dec 02, 2022

Hexuan Deng, Liang Ding, Xuebo Liu, Meishan Zhang, Dacheng Tao, Min Zhang

Figure 1 for Improving Simultaneous Machine Translation with Monolingual Data

Figure 2 for Improving Simultaneous Machine Translation with Monolingual Data

Figure 3 for Improving Simultaneous Machine Translation with Monolingual Data

Figure 4 for Improving Simultaneous Machine Translation with Monolingual Data

Abstract:Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model. However, there is still a significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh and En-Ja news domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired by the behavior of human simultaneous interpreters, we propose a novel monolingual sampling strategy for SiMT, considering both chunk length and monotonicity. Experimental results show that our sampling strategy consistently outperforms the random sampling strategy (and other conventional typical NMT monolingual sampling strategies) by avoiding the key problem of SiMT -- hallucination, and has better scalability. We achieve +0.72 BLEU improvements on average against random sampling on En-Zh and En-Ja. Data and codes can be found at https://github.com/hexuandeng/Mono4SiMT.

* Accepted by AAAI 2023. Extended version includes supplementary material. 10 pages, 4 figures, 8 tables

Via

Access Paper or Ask Questions