Yuhang Wu

Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective

May 28, 2025

Qingchuan Ma, Yuhang Wu, Xiawu Zheng, Rongrong Ji

Abstract: In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: $\scoreGamma$ measures basic reasoning accuracy, while $\scoreDelta$ quantifies a model's reliance on specific symbols rather than underlying patterns, a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark: systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B, multi-agent) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) $\scoreDelta$'s effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. These findings underscore that current LLMs, despite domain-specific strengths, still lack robust abstract reasoning, highlighting key areas for future improvement.
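
The symbol-remapping idea in this abstract can be illustrated with a small sketch. The listing does not give the paper's formal definitions of $\scoreGamma$ and $\scoreDelta$, so everything below is an assumption made for illustration: the helper names (`make_remap`, `apply_remap`, `accuracy`, `remap_gap`) are hypothetical, and the plain accuracy difference is only a stand-in for "performance degradation under symbol remapping", not the authors' metric.

```python
import random


def make_remap(symbols, seed=0):
    """Build a random one-to-one mapping over `symbols` (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(symbols)
    rng.shuffle(shuffled)
    return dict(zip(symbols, shuffled))


def apply_remap(text, remap):
    """Rewrite a task string character by character; unmapped characters pass through."""
    return "".join(remap.get(ch, ch) for ch in text)


def accuracy(predictions, targets):
    """Fraction of exact matches between model outputs and expected answers."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def remap_gap(acc_original, acc_remapped):
    """Illustrative degradation score: accuracy lost when symbols are remapped.

    This is an assumed stand-in, not the paper's definition of its metric.
    """
    return acc_original - acc_remapped


if __name__ == "__main__":
    digits = "0123456789"
    remap = make_remap(digits, seed=1)
    print(apply_remap("12+34", remap))  # same rule, different surface symbols
    print(remap_gap(0.9, 0.6))          # larger gap suggests symbol-specific memorization
```

A model that has learned the underlying rule should score about the same on the original and remapped prompts, so a gap near zero is the desirable outcome under this reading.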

Improving LLM Interpretability and Performance via Guided Embedding Refinement for Sequential Recommendation

Apr 15, 2025

Nanshan Jia, Chenfei Yuan, Yuhang Wu, Zeyu Zheng

Abstract: The fast development of Large Language Models (LLMs) offers growing opportunities to further improve sequential recommendation systems. Yet for some practitioners, integrating LLMs into their existing base recommendation systems raises questions about model interpretability, transparency and related safety. To partly alleviate challenges from these questions, we propose guided embedding refinement, a method that carries out a guided and interpretable use of LLMs to enhance the embeddings associated with the base recommendation system. Instead of directly using LLMs as the backbone of sequential recommendation systems, we utilize them as auxiliary tools to emulate the sales logic of recommendation and generate guided embeddings that capture domain-relevant semantic information on interpretable attributes. Benefiting from the strong generalization capabilities of the guided embedding, we construct a refined embedding from the guided embedding and a reduced-dimension version of the base embedding. We then integrate the refined embedding into the recommendation module for training and inference. A range of numerical experiments demonstrate that guided embedding is adaptable to various existing base embedding models, and generalizes well across different recommendation tasks. The numerical results show that the refined embedding not only improves recommendation performance, achieving approximately $10\%$ to $50\%$ gains in Mean Reciprocal Rank (MRR), Recall rate, and Normalized Discounted Cumulative Gain (NDCG), but also enhances interpretability, as evidenced by case studies.
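
For readers unfamiliar with the three metrics named in this abstract, here is a minimal sketch of their standard textbook forms. It assumes the common sequential-recommendation setup of one held-out target item per sequence; the abstract does not state the evaluation protocol or cutoff, so that setup, the cutoff `k`, and the function names are assumptions.

```python
import math


def rank_of(target, ranked_items):
    """1-based position of `target` in a ranked list, or None if it is absent."""
    try:
        return ranked_items.index(target) + 1
    except ValueError:
        return None


def mrr(ranks):
    """Mean Reciprocal Rank; a missing target contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def recall_at_k(ranks, k):
    """Share of sequences whose target appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def ndcg_at_k(ranks, k):
    """NDCG@k with a single relevant item, so the ideal DCG equals 1."""
    gains = (1.0 / math.log2(r + 1) for r in ranks if r is not None and r <= k)
    return sum(gains) / len(ranks)


if __name__ == "__main__":
    # Toy ranks for four sequences; None means the target was not retrieved.
    ranks = [1, 2, None, 4]
    print(mrr(ranks), recall_at_k(ranks, 2), ndcg_at_k(ranks, 2))
```

All three reward placing the target near the top of the list; MRR and NDCG also discount by position, while Recall only checks membership in the top k.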

Uncertainty Quantification for LLM-Based Survey Simulations

Feb 25, 2025

Chengpiao Huang, Yuhang Wu, Kaizheng Wang

Abstract: We investigate the reliable use of simulated survey responses from large language models (LLMs) through the lens of uncertainty quantification. Our approach converts synthetic data into confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. A key innovation lies in determining the optimal number of simulated responses: too many produce overly narrow confidence sets with poor coverage, while too few yield excessively loose estimates. To resolve this, our method adaptively selects the simulation sample size, ensuring valid average-case coverage guarantees. It is broadly applicable to any LLM, irrespective of its fidelity, and any procedure for constructing confidence sets. Additionally, the selected sample size quantifies the degree of misalignment between the LLM and the target human population. We illustrate our method on real datasets and LLMs.

* 30 pages, 6 figures, 10 tables

Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping

Jan 31, 2025

Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, Zhuoyuan Li

Abstract: Modern foundation models often undergo iterative "bootstrapping" in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model's performance improves, raising a crucial question: how should the total budget on generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework to analyze budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies, particularly exponential growth policies, exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant policies, with exponential policies often providing more stable performance.
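
The three policy families compared in this abstract can be pictured as different ways of splitting one fixed budget across iterations. The sketch below is a hedged illustration only: the function `allocate`, its `rate` parameter, and the specific weight formulas are assumptions, since the abstract does not give the paper's parameterization.

```python
def allocate(total_budget, num_iters, policy="constant", rate=2.0):
    """Split `total_budget` across `num_iters` iterations (hypothetical helper).

    constant:    equal share per iteration
    polynomial:  share proportional to (t + 1) ** rate
    exponential: share proportional to rate ** t
    """
    if policy == "constant":
        weights = [1.0] * num_iters
    elif policy == "polynomial":
        weights = [float(t + 1) ** rate for t in range(num_iters)]
    elif policy == "exponential":
        weights = [rate ** t for t in range(num_iters)]
    else:
        raise ValueError(f"unknown policy: {policy}")
    scale = total_budget / sum(weights)
    return [w * scale for w in weights]


if __name__ == "__main__":
    for name in ("constant", "polynomial", "exponential"):
        print(name, allocate(1000, 5, policy=name))
```

Under the increasing policies most of the budget lands in later iterations, when the model generating the synthetic data is already stronger.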

Black-box Optimization with Simultaneous Statistical Inference for Optimal Performance

Jan 14, 2025

Teng Lian, Jian-Qiang Hu, Yuhang Wu, Zeyu Zheng

Abstract: Black-box optimization is often encountered for decision-making in complex systems management, where knowledge of the system is limited. Under these circumstances, it is essential to balance the utilization of new information with computational efficiency. In practice, decision-makers often face the dual tasks of optimization and statistical inference for the optimal performance, in order to achieve it with high reliability. Our goal is to address the dual tasks in an online fashion. Wu et al. (2022) [arXiv preprint: 2210.06737] point out that the sample average of performance estimates generated by the optimization algorithm need not admit a central limit theorem. We propose an algorithm that not only tackles this issue, but also provides an online consistent estimator for the variance of the performance. Furthermore, we characterize the convergence rate of the coverage probabilities of the asymptotic confidence intervals.

Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice

Dec 14, 2024

Junliang Li, Kai Ye, Haolan Kang, Mingxuan Liang, Yuhang Wu, Zhenhua Liu, Huiping Zhuang, Rui Huang, Yongquan Chen

Abstract: In recent years, as robotics has advanced, human-robot collaboration has gained increasing importance. However, current robots struggle to fully and accurately interpret human intentions from voice commands alone. Traditional gripper and suction systems often fail to interact naturally with humans, lack advanced manipulation capabilities, and are not adaptable to diverse tasks, especially in unstructured environments. This paper introduces the Embodied Dexterous Grasping System (EDGS), designed to tackle object grasping in cluttered environments for human-robot interaction. We propose a novel approach to semantic-object alignment using a Vision-Language Model (VLM) that fuses voice commands and visual information, significantly enhancing the alignment of multi-dimensional attributes of target objects in complex scenarios. Inspired by human hand-object interactions, we develop a robust, precise, and efficient grasping strategy, incorporating principles like the thumb-object axis, multi-finger wrapping, and fingertip interaction with an object's contact mechanics. We also design experiments to assess Referring Expression Representation Enrichment (RERE) in referring expression segmentation, demonstrating that our system accurately detects and matches referring expressions. Extensive experiments confirm that EDGS can effectively handle complex grasping tasks, achieving stability and high success rates, highlighting its potential for further development in the field of Embodied AI.

UTF: Undertrained Tokens as Fingerprints A Novel Approach to LLM Identification

Oct 16, 2024

Jiacheng Cai, Jiahao Yu, Yangguang Shao, Yuhang Wu, Xinyu Xing

Abstract: Fingerprinting large language models (LLMs) is essential for verifying model ownership, ensuring authenticity, and preventing misuse. Traditional fingerprinting methods often require significant computational overhead or white-box verification access. In this paper, we introduce UTF, a novel and efficient approach to fingerprinting LLMs by leveraging under-trained tokens. Under-trained tokens are tokens that the model has not fully learned during its training phase. By utilizing these tokens, we perform supervised fine-tuning to embed specific input-output pairs into the model. This process allows the LLM to produce predetermined outputs when presented with certain inputs, effectively embedding a unique fingerprint. Our method has minimal overhead and impact on the model's performance, and does not require white-box access to the target model for ownership identification. Compared to existing fingerprinting methods, UTF is also more effective and more robust to fine-tuning and random guessing.

Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Jul 31, 2024

Ziya Zhou, Yuhang Wu, Zhiyue Wu, Xinyue Zhang, Ruibin Yuan, Yinghao Ma, Lu Wang, Emmanouil Benetos, Wei Xue, Yike Guo

Abstract: Symbolic music, akin to language, can be encoded in discrete symbols. Recent research has extended the application of large language models (LLMs) such as GPT-4 and Llama2 to the symbolic music domain, including understanding and generation. Yet scant research explores the details of how these LLMs perform on advanced music understanding and conditioned generation, especially from the multi-step reasoning perspective, which is a critical aspect in the conditioned, editable, and interactive human-computer co-creation process. This study conducts a thorough investigation of LLMs' capabilities and limitations in symbolic music processing. We identify that current LLMs exhibit poor performance in song-level multi-step music reasoning, and typically fail to leverage learned music knowledge when addressing complex musical tasks. An analysis of LLMs' responses distinctly highlights their pros and cons. Our findings suggest that advanced musical capability is not intrinsically obtained by LLMs, and future research should focus more on bridging the gap between music knowledge and reasoning, to improve the co-creation experience for musicians.

* Accepted by ISMIR2024

Evaluating and Analyzing Relationship Hallucinations in LVLMs

Jun 24, 2024

Mingrui Wu, Jiayi Ji, Oucheng Huang, Jiale Li, Yuhang Wu, Xiaoshuai Sun, Rongrong Ji

Abstract: The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which are essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The visual instruction tuning dataset's long-tail distribution significantly impacts LVLMs' understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and overly rely on the common sense knowledge of Large Language Models. They also struggle with reasoning about spatial relationships based on contextual information.

* ICML2024

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

Jun 14, 2024

Yuhang Wu, Wenmeng Yu, Yean Cheng, Yan Wang, Xiaohan Zhang, Jiazheng Xu, Ming Ding, Yuxiao Dong

Abstract: Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, a comprehensive alignment benchmark specifically designed for emerging Chinese VLMs. This benchmark is meticulously curated from real-world scenarios and Chinese Internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite strategy, AlignMMBench encompasses 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4's evaluation ability. Finally, we report the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. All evaluation code and data are available at https://alignmmbench.github.io.
