Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yumeng Zhang

on behalf of EUCanImage working group

See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

Jun 08, 2026

Shuning Wang, Zhiheng Wu, YiNuo Lu, Naiming Liu, Chen Jia, Bowen Liu, Shuo Nie, Weijie Zhu, Yumeng Zhang

Abstract:Recent advances in Video Large Language Models (Video-LLMs) have enabled performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose \textbf{CoVER}, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to \textbf{See More} by dynamically gathering query-expanded visual evidence, and \textbf{Think Deeper} by verifying draft answers with effective answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric and visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models with the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.

Via

Access Paper or Ask Questions

Sparse Channel Estimation for Pixel Antennas: Addressing the Pilot Rank Deficiency

May 18, 2026

Yiting Chen, Yumeng Zhang, Hongyu Li

Abstract:Composed of multiple interconnected pixels controlled by on/off RF switches, the pixel antenna can generate reconfigurable radiation patterns that can be further exploited to construct diverse pilot sequences for effective channel estimation. However, such pilot sequences inherently have rank deficiency, making it difficult to effectively and efficiently acquire the full channel state information (CSI) across all available radiation patterns. To tackle this difficulty, we consider a sparse environment with a limited number of propagation paths for a pixel antenna system, where a user equipped with a pixel antenna transmits only a limited number of pilots to recover the CSI under all radiation patterns. The proposed algorithm exploits the limited number of propagation paths that are invariant with the pixel antenna patterns, and then formulates the full channel estimation as a sparse recovery problem in the angular domain solved by Generalized Approximate Message Passing (GAMP). Moreover, to mitigate the rank deficiency of pilot sequences, we additionally incorporate a Multipath Matching Pursuit (MMP) algorithm for robust initialization. The overall proposed scheme, termed MMP-GAMP, achieves higher estimation accuracy than other algorithm baselines, while requiring lower pilot overhead.

Via

Access Paper or Ask Questions

Distribution Corrected Offline Data Distillation for Large Language Models

May 13, 2026

Yumeng Zhang, Zhengbang Yang, Yevin Nikhel Goonatilake, Zhuangdi Zhu

Abstract:Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off: offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference the student autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution, but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on mathematical reasoning benchmarks of GSM8K, MATH, MATH500, and harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.

Via

Access Paper or Ask Questions

See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

Apr 27, 2026

Zhiheng Wu, Tong Wang, Shuning Wang, Naiming Liu, Yumeng Zhang

Abstract:Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework \textbf{ForeSight}, which enables VLMs to \textbf{See Further} with low-level visual cues and \textbf{Think Deeper} with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.

* CVPR2026

Via

Access Paper or Ask Questions

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Mar 19, 2026

Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai

Abstract:While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.

* 31 pages, 12 figures

Via

Access Paper or Ask Questions

Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation

Feb 22, 2026

Lunjie Zhu, Yushi Huang, Xingtong Ge, Yufei Xue, Zhening Liu, Yumeng Zhang, Zehong Lin, Jun Zhang

Abstract:Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6$\times$ speedup while maintaining the reconstruction performance up to 96.9%. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.

* Code will be released at https://github.com/Aoko955/Flash-VAED

Via

Access Paper or Ask Questions

ERNIE 5.0 Technical Report

Feb 04, 2026

Haifeng Wang, Hua Wu, Tian Wu, Yu Sun, Jing Liu, Dianhai Yu, Yanjun Ma, Jingzhou He, Zhongjun He, Dou Hong(+425 more)

Abstract:In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model desinged for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.

Via

Access Paper or Ask Questions

Robust Multicentre Detection and Classification of Colorectal Liver Metastases on CT: Application of Foundation Models

Jan 12, 2026

Shruti Atul Mali, Zohaib Salahuddin, Yumeng Zhang, Andre Aichert, Xian Zhong, Henry C. Woodruff, Maciej Bobowicz, Katrine Riklund, Juozas Kupčinskas, Lorenzo Faggioni(+3 more)

Abstract:Colorectal liver metastases (CRLM) are a major cause of cancer-related mortality, and reliable detection on CT remains challenging in multi-centre settings. We developed a foundation model-based AI pipeline for patient-level classification and lesion-level detection of CRLM on contrast-enhanced CT, integrating uncertainty quantification and explainability. CT data from the EuCanImage consortium (n=2437) and an external TCIA cohort (n=197) were used. Among several pretrained models, UMedPT achieved the best performance and was fine-tuned with an MLP head for classification and an FCOS-based head for lesion detection. The classification model achieved an AUC of 0.90 and a sensitivity of 0.82 on the combined test set, with a sensitivity of 0.85 on the external cohort. Excluding the most uncertain 20 percent of cases improved AUC to 0.91 and balanced accuracy to 0.86. Decision curve analysis showed clinical benefit for threshold probabilities between 0.30 and 0.40. The detection model identified 69.1 percent of lesions overall, increasing from 30 percent to 98 percent across lesion size quartiles. Grad-CAM highlighted lesion-corresponding regions in high-confidence cases. These results demonstrate that foundation model-based pipelines can support robust and interpretable CRLM detection and classification across heterogeneous CT data.

Via

Access Paper or Ask Questions

A Uniform Pilot and Data Payload Optimization Framework for OTFS-Based ISAC

Dec 31, 2025

Borui Du, Yumeng Zhang, Christos Masouros, Bruno Clerckx

Abstract:The orthogonal time frequency space (OTFS) signal is considered a promising solution for high-mobility wireless environments. It manages Doppler effects by utilizing delay-Doppler (DD) domain processing. However, the relatively long OTFS frame duration could introduce considerable sensing or communication latency when radar and communication are performed separately. By operating in a dual-functional radar and communication (DFRC) mode, the OTFS system performs sensing and data transmission simultaneously, thereby reducing the resulting latency. Nevertheless, the optimal OTFS DFRC signal strategy remains insufficiently explored. This paper investigates the optimal signal design for OTFS DFRC systems, focusing on pilot symbol design and data symbol power allocation. Specifically, we derive a channel capacity lower bound metric for communication that considers channel estimation errors in OTFS. For sensing, we derive an integrated sidelobe level (ISL), accounting for the randomness of the data symbols alongside the deterministic pilot symbols. Leveraging the above metrics, we formulate an optimization problem that balances radar and communication performance, and then solve it using an alternating optimization framework. We validate the proposed signal through numerical analysis and Monte Carlo simulations. Our analysis shows that OTFS DFRC enforces a deterministic pilot signal that is characterized by a concentrated peak in the DD domain, which furnishes a common structure in the DD domain facilitating sensing and channel estimation, with data multiplexed in other DD grids, thereby unifying sensing and communication within a single OTFS signal. Compared with conventional OTFS signals, the proposed OTFS DFRC signal expands the achievable sensing-communication performance region, delivering at least a 9.45 dB ISL suppression for sensing and a 4.82 dB SINR ratio gain for communication.

Via

Access Paper or Ask Questions

Robust Design for Multi-Antenna LEO Satellite Communications with Fractional Delay and Doppler Shifts: An RSMA-OTFS Approach

Dec 16, 2025

Yunnuo Xu, Yumeng Zhang, Yijie Mao, Bruno Clerck, Yun Hee Kim, Yujun Li

Abstract:Low-Earth-orbit (LEO) satellite communication systems face challenges due to high satellite mobility, which hinders the reliable acquisition of instantaneous channel state information at the transmitter (CSIT) and subsequently degrades multi-user transmission performance. This paper investigates a downlink multi-user multi-antenna system, and tackles the above challenges by introducing orthogonal time frequency space (OTFS) modulation and rate-splitting multiple access (RSMA) transmission. Specifically, OTFS enables stable characterization of time-varying channels by representing them in the delay-Doppler domain. However, realistic propagation introduces various inter-symbol and inter-user interference due to non-orthogonal yet practical rectangular pulse shaping, fractional delays, Doppler shifts, and imperfect (statistical) CSIT. In this context, RSMA offers promising robustness for interference mitigation and CSIT imperfections, and hence is integrated with OTFS to provide a comprehensive solution. A compact cross-domain input-output relationship for RSMA-OTFS is established, and an ergodic sum-rate maximization problem is formulated and solved using a weighted minimum mean-square-error based alternating optimization algorithm that does not depend on channel sparsity. Simulation results reveal that the considered practical propagation effects significantly degrade performance if unaddressed. Furthermore, the RSMA-OTFS scheme demonstrates improved ergodic sum-rate and robustness against CSIT uncertainty across various user deployments and CSIT qualities.

Via

Access Paper or Ask Questions