Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Roy Xie

Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning

Jan 06, 2026

Xuan Yang, Furong Jia, Roy Xie, Xiong Xi, Hengwei Bian, Jian Li, Monica Agrawal

Abstract:Current Large Language Model reasoning systems process queries independently, discarding valuable cross-instance signals such as shared reasoning patterns and consistency constraints. We introduce Batch-of-Thought (BoT), a training-free method that processes related queries jointly to enable cross-instance learning. By performing comparative analysis across batches, BoT identifies high-quality reasoning templates, detects errors through consistency checks, and amortizes computational costs. We instantiate BoT within a multi-agent reflection architecture (BoT-R), where a Reflector performs joint evaluation to unlock mutual information gain unavailable in isolated processing. Experiments across three model families and six benchmarks demonstrate that BoT-R consistently improves accuracy and confidence calibration while reducing inference costs by up to 61%. Our theoretical and experimental analysis reveals when and why batch-aware reasoning benefits LLM systems.

Via

Access Paper or Ask Questions

Interleaved Reasoning for Large Language Models via Reinforcement Learning

May 26, 2025

Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, Bhuwan Dhingra

Abstract:Long chain-of-thought (CoT) significantly enhances large language models' (LLM) reasoning capabilities. However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward to incentivize correct intermediate steps, which guides the policy model toward correct reasoning paths by leveraging intermediate signals generated during interleaved reasoning. Extensive experiments conducted across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer reasoning, without requiring external tools. Specifically, our approach reduces TTFT by over 80% on average and improves up to 19.3% in Pass@1 accuracy. Furthermore, our method, trained solely on question answering and logical reasoning datasets, exhibits strong generalization ability to complex reasoning datasets such as MATH, GPQA, and MMLU. Additionally, we conduct in-depth analysis to reveal several valuable insights into conditional reward modeling.

Via

Access Paper or Ask Questions

Improving Model Alignment Through Collective Intelligence of Open-Source LLMS

May 05, 2025

Junlin Wang, Roy Xie, Shang Zhu, Jue Wang, Ben Athiwaratkun, Bhuwan Dhingra, Shuaiwen Leon Song, Ce Zhang, James Zou

Figure 1 for Improving Model Alignment Through Collective Intelligence of Open-Source LLMS

Figure 2 for Improving Model Alignment Through Collective Intelligence of Open-Source LLMS

Figure 3 for Improving Model Alignment Through Collective Intelligence of Open-Source LLMS

Figure 4 for Improving Model Alignment Through Collective Intelligence of Open-Source LLMS

Abstract:Building helpful and harmless large language models (LLMs) requires effective model alignment approach based on human instructions and feedback, which necessitates high-quality human-labeled data. Constructing such datasets is often expensive and hard to scale, and may face potential limitations on diversity and generalization. To address these challenges, we introduce Mixture of Agents Alignment (MoAA), that leverages the collective strengths of various language models to provide high-quality data for model alignment. By employing MoAA, we enhance both supervised fine-tuning and preference optimization, leading to improved performance compared to using a single model alone to generate alignment data (e.g. using GPT-4o alone). Evaluation results show that our approach can improve win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval2, highlighting a promising direction for model alignment through this new scalable and diverse synthetic data recipe. Furthermore, we demonstrate that MoAA enables a self-improvement pipeline, where models finetuned on MoA-generated data surpass their own initial capabilities, providing evidence that our approach can push the frontier of open-source LLMs without reliance on stronger external supervision. Data and code will be released.

* ICML 2025

Via

Access Paper or Ask Questions

Knowing When to Stop: Dynamic Context Cutoff for Large Language Models

Feb 03, 2025

Roy Xie, Junlin Wang, Paul Rosu, Chunyuan Deng, Bolun Sun, Zihao Lin, Bhuwan Dhingra

Figure 1 for Knowing When to Stop: Dynamic Context Cutoff for Large Language Models

Figure 2 for Knowing When to Stop: Dynamic Context Cutoff for Large Language Models

Figure 3 for Knowing When to Stop: Dynamic Context Cutoff for Large Language Models

Figure 4 for Knowing When to Stop: Dynamic Context Cutoff for Large Language Models

Abstract:Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient in cases where the information required to answer a query is localized within the context. We present dynamic context cutoff, a human-inspired method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals" - detectable through lightweight classifiers - that predict when critical information has been processed. This reveals a new efficiency paradigm: models' internal understanding naturally dictates processing needs rather than external compression heuristics. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B0-70B) demonstrate 1.33x average token reduction while improving accuracy by 1.3%. Furthermore, our method demonstrates better performance with the same rate of token reduction compared to other context efficiency methods. Additionally, we observe an emergent scaling phenomenon: while smaller models require require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.

* Project Website: https://royxie.com/when-to-stop-project/

Via

Access Paper or Ask Questions

Language Models are Symbolic Learners in Arithmetic

Oct 21, 2024

Chunyuan Deng, Zhiqi Li, Roy Xie, Ruidi Chang, Hanjie Chen

Abstract:Large Language Models (LLMs) are thought to struggle with arithmetic learning due to the inherent differences between language modeling and numerical computation, but concrete evidence has been lacking. This work responds to this claim through a two-side experiment. We first investigate whether LLMs leverage partial products during arithmetic learning. We find that although LLMs can identify some partial products after learning, they fail to leverage them for arithmetic tasks, conversely. We then explore how LLMs approach arithmetic symbolically by breaking tasks into subgroups, hypothesizing that difficulties arise from subgroup complexity and selection. Our results show that when subgroup complexity is fixed, LLMs treat a collection of different arithmetic operations similarly. By analyzing position-level accuracy across different training sizes, we further observe that it follows a U-shaped pattern: LLMs quickly learn the easiest patterns at the first and last positions, while progressively learning the more difficult patterns in the middle positions. This suggests that LLMs select subgroup following an easy-to-hard paradigm during learning. Our work confirms that LLMs are pure symbolic learners in arithmetic tasks and underscores the importance of understanding them deeply through subgroup-level quantification.

Via

Access Paper or Ask Questions

ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods

Jun 23, 2024

Roy Xie, Junlin Wang, Ruomin Huang, Minxing Zhang, Rong Ge, Jian Pei, Neil Zhenqiang Gong, Bhuwan Dhingra

Figure 1 for ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods

Figure 2 for ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods

Figure 3 for ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods

Figure 4 for ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods

Abstract:The rapid scaling of large language models (LLMs) has raised concerns about the transparency and fair use of the pretraining data used for training them. Detecting such content is challenging due to the scale of the data and limited exposure of each instance during training. We propose ReCaLL (Relative Conditional Log-Likelihood), a novel membership inference attack (MIA) to detect LLMs' pretraining data by leveraging their conditional language modeling capabilities. ReCaLL examines the relative change in conditional log-likelihoods when prefixing target data points with non-member context. Our empirical findings show that conditioning member data on non-member prefixes induces a larger decrease in log-likelihood compared to non-member data. We conduct comprehensive experiments and show that ReCaLL achieves state-of-the-art performance on the WikiMIA dataset, even with random and synthetic prefixes, and can be further improved using an ensemble approach. Moreover, we conduct an in-depth analysis of LLMs' behavior with different membership contexts, providing insights into how LLMs leverage membership information for effective inference at both the sequence and token level.

Via

Access Paper or Ask Questions

Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Jun 10, 2024

Junlin Wang, Tianyi Yang, Roy Xie, Bhuwan Dhingra

Figure 1 for Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Figure 2 for Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Figure 3 for Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Figure 4 for Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Abstract:With the proliferation of LLM-integrated applications such as GPT-s, millions are deployed, offering valuable services through proprietary instruction prompts. These systems, however, are prone to prompt extraction attacks through meticulously designed queries. To help mitigate this problem, we introduce the Raccoon benchmark which comprehensively evaluates a model's susceptibility to prompt extraction attacks. Our novel evaluation method assesses models under both defenseless and defended scenarios, employing a dual approach to evaluate the effectiveness of existing defenses and the resilience of the models. The benchmark encompasses 14 categories of prompt extraction attacks, with additional compounded attacks that closely mimic the strategies of potential attackers, alongside a diverse collection of defense templates. This array is, to our knowledge, the most extensive compilation of prompt theft attacks and defense mechanisms to date. Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected. This paper aims to establish a more systematic benchmark for assessing LLM robustness against prompt extraction attacks, offering insights into their causes and potential countermeasures. Resources of Raccoon are publicly available at https://github.com/M0gician/RaccoonBench.

Via

Access Paper or Ask Questions

LLM-Resistant Math Word Problem Generation via Adversarial Attacks

Feb 27, 2024

Roy Xie, Chengxuan Huang, Junlin Wang, Bhuwan Dhingra

Figure 1 for LLM-Resistant Math Word Problem Generation via Adversarial Attacks

Figure 2 for LLM-Resistant Math Word Problem Generation via Adversarial Attacks

Figure 3 for LLM-Resistant Math Word Problem Generation via Adversarial Attacks

Figure 4 for LLM-Resistant Math Word Problem Generation via Adversarial Attacks

Abstract:Large language models (LLMs) have significantly transformed the educational landscape. As current plagiarism detection tools struggle to keep pace with LLMs' rapid advancements, the educational community faces the challenge of assessing students' true problem-solving abilities in the presence of LLMs. In this work, we explore a new paradigm for ensuring fair evaluation -- generating adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by LLMs. Focusing on the domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause LLMs to produce incorrect answers by simply editing the numeric values in the problems. We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. We identify shared vulnerabilities among LLMs and propose a cost-effective approach to attack high-cost models. Additionally, we conduct automatic analysis on math problems and investigate the cause of failure to guide future research on LLM's mathematical capability.

* Code is available at https://github.com/ruoyuxie/adversarial_mwps_generation

Via

Access Paper or Ask Questions

Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers

Feb 27, 2024

Roy Xie, Orevaoghene Ahia, Yulia Tsvetkov, Antonios Anastasopoulos

Figure 1 for Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers

Figure 2 for Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers

Figure 3 for Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers

Figure 4 for Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers

Abstract:Identifying linguistic differences between dialects of a language often requires expert knowledge and meticulous human analysis. This is largely due to the complexity and nuance involved in studying various dialects. We present a novel approach to extract distinguishing lexical features of dialects by utilizing interpretable dialect classifiers, even in the absence of human experts. We explore both post-hoc and intrinsic approaches to interpretability, conduct experiments on Mandarin, Italian, and Low Saxon, and experimentally demonstrate that our method successfully identifies key language-specific lexical features that contribute to dialectal variations.

* Code is available at https://github.com/ruoyuxie/interpretable_dialect_classifier

Via

Access Paper or Ask Questions