Yu Lu

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

Apr 29, 2025

Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, Yi Yang

Abstract:Instruction-based image editing enables robust image modification via natural language prompts, yet current methods face a precision-efficiency tradeoff. Fine-tuning methods demand significant computational resources and large datasets, while training-free techniques struggle with instruction comprehension and edit quality. We resolve this dilemma by leveraging the enhanced generation capacity and native contextual awareness of large-scale Diffusion Transformers (DiT). Our solution introduces three contributions: (1) an in-context editing framework for zero-shot instruction compliance using in-context prompting, avoiding structural changes; (2) a LoRA-MoE hybrid tuning strategy that enhances flexibility with efficient adaptation and dynamic expert routing, without extensive retraining; and (3) an early filter inference-time scaling method using vision-language models (VLMs) to select better initial noise early, improving edit quality. Extensive evaluations demonstrate our method's superiority: it outperforms state-of-the-art approaches while requiring only 0.5% training data and 1% trainable parameters compared to conventional baselines. This work establishes a new paradigm that enables high-precision yet efficient instruction-guided editing. Code and demos are available at https://river-zhang.github.io/ICEdit-gh-pages/.

* Project Page: https://river-zhang.github.io/ICEdit-gh-pages/

Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models

Apr 07, 2025

Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan

Abstract:Despite high benchmark scores, Large Language Models (LLMs) often fail simple problems, raising a critical question: Do LLMs learn mathematical principles or merely memorize patterns? Rather than designing increasingly complex benchmarks like recent works, we investigate this using elementary two-integer addition ($0$ to $2^{64}$), probing two core properties: commutativity ($A+B=B+A$) and compositional generalization (via isomorphic symbolic mappings, e.g., $7 \rightarrow y$). While state-of-the-art LLMs achieve 73.8-99.8\% accuracy on numerical addition, performance collapses to $\leq$7.5\% under symbolic mapping, indicating failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of $A+B \neq B+A$) further support this. Explicitly providing addition rules degrades performance by 81.2\% on average, while self-explanation maintains baseline accuracy, suggesting LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate current LLMs rely on pattern memorization rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
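The two probes named in the abstract, commutativity and isomorphic symbolic mapping, can be illustrated with a minimal sketch. It is not the authors' evaluation harness: the function names, the letter alphabet, and the `solver` callables (stand-ins for a model under test) are all assumptions made for illustration.

```python
import random

DIGITS = "0123456789"

def make_symbol_map(seed=0):
    """Build a bijection from digits to letters (hypothetical alphabet a-j)."""
    rng = random.Random(seed)
    symbols = list("abcdefghij")
    rng.shuffle(symbols)
    return dict(zip(DIGITS, symbols))

def encode(n, mapping):
    """Rewrite a non-negative integer digit by digit, e.g. 7 -> 'y' style."""
    return "".join(mapping[d] for d in str(n))

def decode(s, mapping):
    """Invert encode(); raises KeyError/ValueError on malformed strings."""
    inverse = {v: k for k, v in mapping.items()}
    return int("".join(inverse[c] for c in s))

def commutativity_violations(solver, pairs):
    """Return the pairs (a, b) for which solver(a, b) != solver(b, a)."""
    return [(a, b) for a, b in pairs if solver(a, b) != solver(b, a)]

def symbolic_accuracy(symbolic_solver, pairs, mapping):
    """Fraction of pairs answered correctly when operands are symbol strings."""
    correct = 0
    for a, b in pairs:
        answer = symbolic_solver(encode(a, mapping), encode(b, mapping))
        try:
            correct += decode(answer, mapping) == a + b
        except (KeyError, ValueError):
            pass  # an unparseable answer counts as wrong
    return correct / len(pairs)
```

In practice `solver` and `symbolic_solver` would wrap calls to the model being probed; here any Python callable with the same shape can be substituted to exercise the scoring logic.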

HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization

Mar 04, 2025

Zitang Zhou, Ke Mei, Yu Lu, Tianyi Wang, Fengyun Rao

Abstract:This paper introduces HarmonySet, a comprehensive dataset designed to advance video-music understanding. HarmonySet consists of 48,328 diverse video-music pairs, annotated with detailed information on rhythmic synchronization, emotional alignment, thematic coherence, and cultural relevance. We propose a multi-step human-machine collaborative framework for efficient annotation, combining human insights with machine-generated descriptions to identify key transitions and assess alignment across multiple dimensions. Additionally, we introduce a novel evaluation framework with tasks and metrics to assess the multi-dimensional alignment of video and music, including rhythm, emotion, theme, and cultural context. Our extensive experiments demonstrate that HarmonySet, along with the proposed evaluation framework, significantly improves the ability of multimodal models to capture and analyze the intricate relationships between video and music.

* Accepted at CVPR 2025. Project page: https://harmonyset.github.io/

UBER: Uncertainty-Based Evolution with Large Language Models for Automatic Heuristic Design

Dec 30, 2024

Zijie Chen, Zhanchao Zhou, Yu Lu, Renjun Xu, Lili Pan, Zhenzhong Lan

Abstract:NP-hard problem-solving traditionally relies on heuristics, but manually crafting effective heuristics for complex problems remains challenging. While recent work like FunSearch has demonstrated that large language models (LLMs) can be leveraged for heuristic design in evolutionary algorithm (EA) frameworks, their potential is not fully realized due to their deficiency in exploitation and exploration. We present UBER (Uncertainty-Based Evolution for Refinement), a method that enhances LLM+EA methods for automatic heuristic design by integrating uncertainty on top of the FunSearch framework. UBER introduces two key innovations: an Uncertainty-Inclusive Evolution Process (UIEP) for adaptive exploration-exploitation balance, and a principled Uncertainty-Inclusive Island Reset (UIIS) strategy for maintaining population diversity. Through extensive experiments on challenging NP-complete problems, UBER demonstrates significant improvements over FunSearch. Our work provides a new direction for the synergy of LLMs and EA, advancing the field of automatic heuristic design.

Energy-Efficient RIS-Aided Cell-Free Massive MIMO Systems: Application, Opportunities, and Challenges

Dec 23, 2024

Yu Lu, Jiayi Zhang, Enyu Shi, Peng Zhang, Derrick Wing Kwan Ng, Dusit Niyato, Bo Ai

Abstract:Reconfigurable intelligent surfaces (RIS)-assisted cell-free massive multiple-input multiple-output (CF mMIMO) systems have emerged as a promising technology for sixth-generation communication systems. These systems capitalize on RIS to minimize power consumption, thereby achieving consistent performance and enhancing communication quality through the establishment and shaping of auxiliary signal propagation pathways between access points (APs) and users. However, integrating RIS into existing CF mMIMO infrastructures presents several technical challenges. This study delves into the signal transmission scheme and deployment architecture of RIS-aided CF mMIMO systems, addressing inherent challenges such as interference induced by RIS and the increased complexity in beam alignment. Furthermore, we address the complexities arising from the joint optimization of the reflection phase of RIS and beamforming technology at the APs, intending to fully exploit the reflection capabilities of RISs and beamforming technology to maximize the energy efficiency (EE) of the system. To overcome these challenges, we propose cooperative communication to suppress RIS-induced interference, beam tracking, and joint optimization to improve system EE. We also present specific examples of cooperative communication under the constraint of electromagnetic interference and the beam tracking of a mobile system. Additionally, we emphasize important research directions for RIS-aided CF mMIMO systems, aiming to inspire future investigations.

Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting

Nov 26, 2024

Liyun Zhang, Dian Ding, Yu Lu, Yi-Chao Chen, Guangtao Xue

Abstract:Understanding the emotions in a dialogue usually requires external knowledge to accurately understand the contents. As LLMs become more and more powerful, we do not want to settle for the limited ability of the pre-trained language model. However, LLMs either can only process the text modality or are too expensive to process multimedia information. We aim to utilize both the power of LLMs and the supplementary features from the multimedia modalities. In this paper, we present a framework, Lantern, that can improve the performance of a certain vanilla model by prompting large language models with receptive-field-aware attention weighting. This framework trains a multi-task vanilla model to produce probabilities of emotion classes and dimension scores. These predictions are fed into the LLMs as references to adjust the predicted probabilities of each emotion class with their external knowledge and contextual understanding. We slice the dialogue into different receptive fields, and each sample is included in exactly t receptive fields. Finally, the predictions of the LLMs are merged with a receptive-field-aware attention-driven weighting module. In the experiments, vanilla models CORECT and SDT are deployed in Lantern with GPT-4 or Llama-3.1-405B. The experiments on IEMOCAP with 4-way and 6-way settings demonstrate that Lantern can significantly improve the performance of current vanilla models by up to 1.23% and 1.80%.
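One way to obtain the stated property, that each sample falls in exactly t receptive fields, is a stride-1 sliding window of length t that is allowed to overhang both ends of the dialogue. The sketch below assumes that scheme; the abstract does not specify the slicing rule, and `receptive_fields`, `merge_predictions`, and the plain normalized weights are hypothetical stand-ins for the paper's attention-driven module.

```python
def receptive_fields(n, t):
    """Slice utterance indices 0..n-1 into windows of length t (stride 1).

    Windows may overhang the start and end of the dialogue and are clipped,
    so every utterance appears in exactly t windows (assumed scheme).
    """
    fields = []
    for start in range(-(t - 1), n):
        window = [i for i in range(start, start + t) if 0 <= i < n]
        fields.append(window)
    return fields

def merge_predictions(per_field_probs, weights):
    """Weighted average of the class-probability vectors one utterance
    received from its receptive fields. Weights are normalized to sum to 1;
    in the paper they would come from an attention module (not shown)."""
    total = sum(weights)
    num_classes = len(per_field_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, per_field_probs)) / total
        for c in range(num_classes)
    ]
```

For example, `receptive_fields(5, 3)` yields seven clipped windows, and each of the five utterances appears in three of them.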

MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning

Nov 05, 2024

Ziliang Gan, Yu Lu, Dong Zhang, Haohan Li, Che Liu, Jian Liu, Ji Liu, Haipang Wu, Chaoyou Fu, Zenglin Xu (+2 more)

Abstract:In recent years, multimodal benchmarks for general domains have guided the rapid development of multimodal models on general tasks. However, the financial field has its peculiarities. It features unique graphical images (e.g., candlestick charts, technical indicator charts) and possesses a wealth of specialized financial knowledge (e.g., futures, turnover rate). Therefore, benchmarks from general fields often fail to measure the performance of multimodal models in the financial domain, and thus cannot effectively guide the rapid development of large financial models. To promote the development of large financial multimodal models, we propose MME-Finance, a bilingual open-ended and practical usage-oriented Visual Question Answering (VQA) benchmark. The characteristics of our benchmark are finance and expertise, which include constructing charts that reflect the actual usage needs of users (e.g., computer screenshots and mobile photography), creating questions according to the preferences in financial domain inquiries, and annotating questions by experts with 10+ years of experience in the financial industry. Additionally, we have developed a custom-designed financial evaluation system in which visual information is first introduced in the multi-modal evaluation process. Extensive experimental evaluations of 19 mainstream MLLMs are conducted to test their perception, reasoning, and cognition capabilities. The results indicate that models performing well on general benchmarks cannot do well on MME-Finance; for instance, the top-performing open-source and closed-source models obtain 65.69 (Qwen2VL-72B) and 63.18 (GPT-4o), respectively. Their performance is particularly poor in categories most relevant to finance, such as candlestick charts and technical indicator charts. In addition, we propose a Chinese version, which helps compare the performance of MLLMs in a Chinese context.

* Project Page: https://hithink-research.github.io/MME-Finance/

Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas

Oct 18, 2024

Xiang Hu, Hongyu Fu, Jinge Wang, Yifeng Wang, Zhikun Li, Renjun Xu, Yu Lu, Yaochu Jin, Lili Pan, Zhenzhong Lan

Abstract:Scientific innovation is pivotal for humanity, and harnessing large language models (LLMs) to generate research ideas could transform discovery. However, existing LLMs often produce simplistic and repetitive suggestions due to their limited ability in acquiring external knowledge for innovation. To address this problem, we introduce an enhanced planning and search methodology designed to boost the creative potential of LLM-based systems. Our approach involves an iterative process to purposely plan the retrieval of external knowledge, progressively enriching the idea generation with broader and deeper insights. Validation through automated and human assessments indicates that our framework substantially elevates the quality of generated ideas, particularly in novelty and diversity. The number of unique novel ideas produced by our framework is 3.4 times higher than without it. Moreover, our method outperforms the current state-of-the-art, generating at least 2.5 times more top-rated ideas based on 170 seed papers in a Swiss Tournament evaluation.

LLMs + Persona-Plug = Personalized LLMs

Sep 18, 2024

Jiongnan Liu, Yutao Zhu, Shuting Wang, Xiaochi Wei, Erxue Min, Yu Lu, Shuaiqiang Wang, Dawei Yin, Zhicheng Dou

Abstract:Personalization plays a critical role in numerous language tasks and applications, since users with the same requirements may prefer diverse outputs based on their individual interests. This has led to the development of various personalized approaches aimed at adapting large language models (LLMs) to generate customized outputs aligned with user preferences. Some of them involve fine-tuning a unique personalized LLM for each user, which is too expensive for widespread application. Alternative approaches introduce personalization information in a plug-and-play manner by retrieving the user's relevant historical texts as demonstrations. However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user's overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, Persona-Plug. It constructs a user-specific embedding for each individual by modeling all her historical contexts through a lightweight plug-in user embedder module. By attaching this embedding to the task input, LLMs can better understand and capture user habits and preferences, thereby producing more personalized outputs without tuning their own parameters. Extensive experiments on various tasks in the language model personalization (LaMP) benchmark demonstrate that the proposed model significantly outperforms existing personalized LLM approaches.

Sparse Signal Reconstruction for Overdispersed Low-photon Count Biomedical Imaging Using $\ell_p$ Total Variation

Aug 29, 2024

Yu Lu, Roummel F. Marcia

Abstract:The negative binomial model, which generalizes the Poisson distribution model, can be found in applications involving low-photon signal recovery, including medical imaging. Recent studies have explored several regularization terms for the negative binomial model, such as the $\ell_p$ quasi-norm with $0 < p < 1$, $\ell_1$ norm, and the total variation (TV) quasi-seminorm for promoting sparsity in signal recovery. These penalty terms have been shown to improve image reconstruction outcomes. In this paper, we investigate the $\ell_p$ quasi-seminorm, both isotropic and anisotropic $\ell_p$ TV quasi-seminorms, within the framework of the negative binomial statistical model. This problem can be formulated as an optimization problem, which we solve using a gradient-based approach. We present comparisons between the negative binomial and Poisson statistical models using the $\ell_p$ TV quasi-seminorm as well as common penalty terms. Our experimental results highlight the efficacy of the proposed method.
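A penalized objective of the kind the abstract describes can be sketched as follows. This is an illustrative reading, not the authors' formulation: it assumes an identity forward operator, the standard negative binomial negative log-likelihood with constants dropped, forward differences with a zero gradient at the image border, and hypothetical parameter names `r` (dispersion), `p` (with 0 < p < 1), and `tau` (regularization weight).

```python
import math

def nb_nll(y, mu, r):
    """Negative binomial negative log-likelihood, constants dropped:
    sum_i (r + y_i) * log(r + mu_i) - y_i * log(mu_i), requiring mu_i > 0."""
    return sum((r + yi) * math.log(r + mi) - yi * math.log(mi)
               for yi, mi in zip(y, mu))

def _forward_diffs(img, i, j):
    """Horizontal and vertical forward differences; zero at the border."""
    rows, cols = len(img), len(img[0])
    dx = img[i][j + 1] - img[i][j] if j + 1 < cols else 0.0
    dy = img[i + 1][j] - img[i][j] if i + 1 < rows else 0.0
    return dx, dy

def tv_lp_anisotropic(img, p):
    """Sum of |dx|^p + |dy|^p over all pixels."""
    total = 0.0
    for i in range(len(img)):
        for j in range(len(img[0])):
            dx, dy = _forward_diffs(img, i, j)
            total += abs(dx) ** p + abs(dy) ** p
    return total

def tv_lp_isotropic(img, p):
    """Sum of (dx^2 + dy^2)^(p/2) over all pixels."""
    total = 0.0
    for i in range(len(img)):
        for j in range(len(img[0])):
            dx, dy = _forward_diffs(img, i, j)
            total += (dx * dx + dy * dy) ** (p / 2)
    return total

def objective(f, y, r, p, tau, isotropic=False):
    """Data-fit term plus tau times the chosen l_p TV penalty.
    f is the positive image estimate, y the observed counts (same shape)."""
    flat_f = [v for row in f for v in row]
    flat_y = [v for row in y for v in row]
    tv = tv_lp_isotropic(f, p) if isotropic else tv_lp_anisotropic(f, p)
    return nb_nll(flat_y, flat_f, r) + tau * tv
```

The data-fit term is minimized pixelwise at `mu = y`, and as `r` grows its gradient approaches that of the Poisson model. A gradient-based solver, which the abstract mentions but which is not shown here, would minimize `objective` over positive `f`.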

* 5 pages, Accepted by the IEEE International Symposium on Biomedical Imaging (ISBI)
