Yufan Zhuang

Text Generation Beyond Discrete Token Sampling

May 20, 2025

Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, Jianfeng Gao

Abstract: In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
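The blending step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact estimator, so the Dirichlet prior, the concentration parameter `beta`, and both function names below are assumptions made purely for illustration and should not be read as the authors' implementation. Under a Dirichlet prior with parameters `beta * probs`, observing one sampled token yields a posterior mean that mixes the distribution with the one-hot vector.

```python
def posterior_mix_weights(probs, sampled_id, beta=1.0):
    """Posterior mean of a Dirichlet(beta * probs) prior after observing
    a single draw of token `sampled_id` (hypothetical helper, assumed prior)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    total = beta + 1.0  # prior mass (beta) plus one observed token
    weights = [beta * p / total for p in probs]
    weights[sampled_id] += 1.0 / total
    return weights


def mixed_input_embedding(weights, embedding_table):
    """Weighted average of token embeddings, used in place of the
    embedding of the single sampled token (hypothetical helper)."""
    dim = len(embedding_table[0])
    return [
        sum(w * row[d] for w, row in zip(weights, embedding_table))
        for d in range(dim)
    ]


# Toy vocabulary of three tokens with two-dimensional embeddings.
probs = [0.5, 0.25, 0.25]
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = posterior_mix_weights(probs, sampled_id=1, beta=1.0)
mixed = mixed_input_embedding(weights, embedding_table)
```

In this sketch a larger `beta` keeps the input closer to the full distribution, while a `beta` near zero approaches the conventional one-hot input.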


Self-Taught Agentic Long Context Understanding

Feb 21, 2025

Yufan Zhuang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Jingbo Shang, Zicheng Liu, Emad Barsoum

Abstract: Answering complex, long-context questions remains a major challenge for large language models (LLMs) as it requires effective question clarifications and context retrieval. We propose Agentic Long-Context Understanding (AgenticLU), a framework designed to enhance an LLM's understanding of such queries by integrating targeted self-clarification with contextual grounding within an agentic workflow. At the core of AgenticLU is Chain-of-Clarifications (CoC), where models refine their understanding through self-generated clarification questions and corresponding contextual groundings. By scaling inference as a tree search where each node represents a CoC step, we achieve 97.8% answer recall on NarrativeQA with a search depth of up to three and a branching factor of eight. To amortize the high cost of this search process to training, we leverage the preference pairs for each step obtained by the CoC workflow and perform two-stage model finetuning: (1) supervised finetuning to learn effective decomposition strategies, and (2) direct preference optimization to enhance reasoning quality. This enables AgenticLU models to generate clarifications and retrieve relevant context effectively and efficiently in a single inference pass. Extensive experiments across seven long-context tasks demonstrate that AgenticLU significantly outperforms state-of-the-art prompting methods and specialized long-context LLMs, achieving robust multi-hop reasoning while sustaining consistent performance as context length grows.


Vector-ICL: In-context Learning with Continuous Vector Representations

Oct 08, 2024

Yufan Zhuang, Chandan Singh, Liyuan Liu, Jingbo Shang, Jianfeng Gao

Abstract: Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities on textual data. We explore whether these capabilities can be extended to continuous vectors from diverse domains, obtained from black-box pretrained encoders. By aligning input data with an LLM's embedding space through lightweight projectors, we observe that LLMs can effectively process and learn from these projected vectors, which we term Vector-ICL. In particular, we find that pretraining projectors with general language modeling objectives enables Vector-ICL, while task-specific finetuning further enhances performance. In our experiments across various tasks and modalities, including text reconstruction, numerical function regression, text classification, summarization, molecule captioning, time-series classification, graph classification, and fMRI decoding, Vector-ICL often surpasses both few-shot ICL and domain-specific models or tuning. We further conduct analyses and case studies, indicating the potential of LLMs to process vector representations beyond traditional token-based paradigms.


Data Contamination Can Cross Language Barriers

Jun 19, 2024

Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang

Abstract: The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods, deliberately injected by overfitting LLMs on the translated versions of benchmark test sets. Then, we propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine the LLM's performance change after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices can be "not even wrong", as all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential utilization of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use can be obtained from https://github.com/ShangDataLab/Deep-Contam.
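The benchmark modification described in this abstract (replacing false answer choices with correct answers from other questions) can be sketched as follows. The record layout (`question`, `choices`, `answer` index), the function name, and the toy questions are assumptions for illustration only; the released code at the URL above may be organized quite differently.

```python
import random


def replace_false_choices(questions, seed=0):
    """Return a copy of the benchmark in which every false choice is replaced
    by the correct answer of some other question (hypothetical helper)."""
    rng = random.Random(seed)
    modified = []
    for idx, q in enumerate(questions):
        own_answer = q["choices"][q["answer"]]
        # Correct answers borrowed from all other questions.
        pool = [
            other["choices"][other["answer"]]
            for j, other in enumerate(questions)
            if j != idx
        ]
        # Avoid borrowing a string identical to this question's own answer.
        pool = [text for text in pool if text != own_answer]
        n_false = len(q["choices"]) - 1
        borrowed = iter(rng.sample(pool, n_false))
        new_choices = [
            choice if pos == q["answer"] else next(borrowed)
            for pos, choice in enumerate(q["choices"])
        ]
        modified.append(
            {"question": q["question"], "choices": new_choices, "answer": q["answer"]}
        )
    return modified


# Toy benchmark: three questions with three choices each.
benchmark = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": 1},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Oslo"], "answer": 0},
    {"question": "Color of the sky?", "choices": ["red", "green", "blue"], "answer": 2},
]
modified_benchmark = replace_false_choices(benchmark)
```

A model's accuracy on `benchmark` and on `modified_benchmark` could then be compared; per the abstract, a contaminated model is expected to struggle on the modified version even though the distractors are now trivially unrelated to the question.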

* 12 pages, 5 figures


Learning a Decision Tree Algorithm with Transformers

Feb 06, 2024

Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, Jianfeng Gao

Abstract: Decision trees are renowned for their interpretability and their capability to achieve high predictive performance, especially on tabular data. Traditionally, they are constructed through recursive algorithms that partition the data at every node in a tree. However, identifying the best partition is challenging, as decision trees optimized for local segments may not bring global generalization. To address this, we introduce MetaTree, which trains a transformer-based model on filtered outputs from classical algorithms to produce strong decision trees for classification. Specifically, we fit both greedy decision trees and optimized decision trees on a large number of datasets. We then train MetaTree to produce the trees that achieve strong generalization performance. This training enables MetaTree to not only emulate these algorithms, but also to intelligently adapt its strategy according to the context, thereby achieving superior generalization performance.


Waveformer: Linear-Time Attention with Forward and Backward Wavelet Transform

Oct 05, 2022

Yufan Zhuang, Zihan Wang, Fangbo Tao, Jingbo Shang

Abstract: We propose Waveformer, which learns the attention mechanism in the wavelet coefficient space, requires only linear time complexity, and enjoys universal approximating power. Specifically, we first apply a forward wavelet transform to project the input sequences onto multi-resolution orthogonal wavelet bases, then conduct nonlinear transformations (in this case, a random feature kernel) in the wavelet coefficient space, and finally reconstruct the representation in input space via a backward wavelet transform. We note that other nonlinear transformations may be used, hence we name the learning paradigm Wavelet transformatIon for Sequence lEarning (WISE). We emphasize the importance of backward reconstruction in the WISE paradigm: without it, one would be mixing information from both the input space and the coefficient space through skip connections, which should not be considered mathematically sound. Compared with the Fourier transform used in recent works, the wavelet transform is more efficient in time complexity and better captures local and positional information; we further support this through our ablation studies. Extensive experiments on seven long-range understanding datasets from the Long Range Arena benchmark and on code understanding tasks demonstrate that (1) Waveformer achieves accuracy competitive with, and in some cases better than, a number of state-of-the-art Transformer variants and (2) WISE can boost the accuracy of various attention approximation methods without increasing the time complexity. Together, these results showcase the superiority of learning attention in a wavelet coefficient space over the input space.


Reconfigurable Intelligent Surface Assisted OFDM Relaying: Subcarrier Matching with Balanced SNR

Mar 03, 2022

Tong Zhang, Shuai Wang, Yufan Zhuang, Changsheng You, Miaowen Wen, Yik-Chung Wu

Abstract: Reconfigurable intelligent surface (RIS) is a promising solution to enhance the performance of wireless communications via reconfiguring the wireless propagation environment. In this paper, we investigate the joint design of RIS passive beamforming and subcarrier matching in RIS-assisted orthogonal frequency division multiplexing (OFDM) dual-hop relaying systems under two cases, depending on the presence of the RIS reflected link from the source to the destination in the first hop. Accordingly, we formulate a mixed-integer nonlinear programming (MINLP) problem to maximize the sum achievable rate over all subcarriers by jointly optimizing the RIS passive beamforming and subcarrier matching. To solve this challenging problem, we first develop a branch-and-bound (BnB)-based alternating optimization algorithm to obtain a near-optimal solution by alternately optimizing the subcarrier matching with the BnB method and the RIS passive beamforming with semidefinite relaxation techniques. Then, a low-complexity difference-of-convex penalty-based algorithm is proposed to reduce the computational complexity of the BnB method. To further reduce the computational complexity, we utilize the learning-to-optimize approach to learn the joint design obtained from optimization techniques, which is more amenable to practical implementations. Lastly, computer simulations are presented to evaluate the performance of the proposed algorithms in the two cases. Simulation results demonstrate that the RIS-assisted OFDM relaying system achieves sustainable achievable rate gain as compared to that without RIS, and that with random passive beamforming, since RIS passive beamforming can be leveraged to recast the subcarrier matching among different subcarriers and balance the signal-to-noise ratio within each subcarrier pair.

* Submitted to IEEE


Data-Driven and SE-assisted AI Model Signal-Awareness Enhancement and Introspection

Nov 10, 2021

Sahil Suneja, Yufan Zhuang, Yunhui Zheng, Jim Laredo, Alessandro Morari

Abstract: AI modeling for source code understanding tasks has been making significant progress, and is being adopted in production development pipelines. However, reliability concerns are being raised, especially about whether the models are actually learning task-related aspects of source code. While recent model-probing approaches have observed a lack of signal awareness in many AI-for-code models, i.e., models not capturing task-relevant signals, they do not offer solutions to rectify this problem. In this paper, we explore data-driven approaches to enhance models' signal awareness: 1) we combine the SE concept of code complexity with the AI technique of curriculum learning; 2) we incorporate SE assistance into AI models by customizing Delta Debugging to generate simplified signal-preserving programs and adding them to the training dataset. With our techniques, we achieve up to 4.8x improvement in model signal awareness. Using the notion of code complexity, we further present a novel model learning introspection approach from the perspective of the dataset.

* Submitted September 2021


Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation

Sep 07, 2021

Yufan Zhuang, Sahil Suneja, Veronika Thost, Giacomo Domeniconi, Alessandro Morari, Jim Laredo

Abstract: Identifying vulnerable code is a precautionary measure to counter software security breaches. Tedious expert effort has been spent to build static analyzers, yet insecure patterns can hardly be fully enumerated. This work explores a deep learning approach to automatically learn the insecure patterns from code corpora. Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program, in order to improve prediction performance. Compared with a generic GNN, our enhancements include a synthesis of multiple representations learned from the several parsed graphs of a program, and a new training loss metric that leverages the fine granularity of labeling. Our model outperforms multiple text, image and graph-based approaches, across two real-world datasets.

* Submitted June 2020


Probing Model Signal-Awareness via Prediction-Preserving Input Minimization

Nov 25, 2020

Yunhui Zheng, Sahil Suneja, Yufan Zhuang, Alessandro Morari, Jim Laredo

Abstract: This work explores the signal awareness of AI models for source code understanding. Using a software vulnerability detection use case, we evaluate the models' ability to capture the correct vulnerability signals to produce their predictions. Our prediction-preserving input minimization (P2IM) approach systematically reduces the original source code to a minimal snippet which a model needs to maintain its prediction. The model's reliance on incorrect signals is then uncovered when a vulnerability in the original code is missing from the minimal snippet, yet the model still predicts both as vulnerable. We apply P2IM on three state-of-the-art neural network models across multiple datasets, and measure their signal awareness using a new metric we propose, Signal-aware Recall (SAR). The results show a sharp drop in the model's Recall from the high 90s to sub-60s with the new metric, highlighting that the models are presumably picking up a lot of noise or dataset nuances while learning their vulnerability detection logic.
