Abstract:Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce content that deviates from the visual information, leading to object hallucination. To tackle this, recent works mostly depend on costly manual annotation and additional training, or on decoding strategies that significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly enhanced when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-guided Visual Attention Steering (CAST), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern corresponding to caption queries to enhance LVLMs' visual perception capability. Specifically, we use probing techniques to identify attention heads that are highly sensitive to caption queries and estimate optimized steering directions for their outputs. This steering strengthens LVLMs' fine-grained visual perception, thereby effectively mitigating object hallucination. CAST reduces object hallucination by an average of 6.03% across five widely used LVLMs and five benchmarks covering both discriminative and generative tasks, achieving state-of-the-art performance while adding little inference cost and preserving other foundational capabilities.
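A minimal Python sketch of the general idea described above (not the authors' released code): probe how much attention mass each head places on visual tokens under caption versus non-caption queries, pick the most caption-sensitive heads, and nudge their outputs along a steering direction. All tensor shapes, the top_k count, and the alpha scale are illustrative assumptions.

import torch

def head_visual_attention(attn, visual_idx):
    # attn: [layers, heads, query_len, key_len], averaged over a probe set
    # returns the mean attention mass placed on visual tokens per (layer, head)
    return attn[..., visual_idx].sum(-1).mean(-1)   # [layers, heads]

def select_caption_sensitive_heads(attn_caption, attn_plain, visual_idx, top_k=16):
    # heads whose attention to visual tokens rises most under caption queries
    gap = head_visual_attention(attn_caption, visual_idx) - \
          head_visual_attention(attn_plain, visual_idx)
    flat = gap.flatten().topk(top_k).indices
    layers = flat // gap.shape[1]
    heads = flat % gap.shape[1]
    return list(zip(layers.tolist(), heads.tolist()))

def steer_head_output(head_out, direction, alpha=0.1):
    # head_out: [batch, seq, head_dim]; direction: [head_dim]
    # shift the selected head's output along the caption-induced direction
    direction = direction / direction.norm()
    return head_out + alpha * direction

In the paper, the steering direction is estimated from caption-query activations; here it is simply passed in for illustration.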
Abstract:Remote sensing images are frequently degraded by adverse weather conditions, particularly clouds and haze, which severely impair downstream applications. Existing restoration methods typically rely on computationally heavy architectures or sequential pipelines (e.g., detail enhancement followed by color rendition) that suffer from mutual interference and artifact accumulation. Furthermore, recent unified grid-based approaches utilize fixed, isotropic interpolation kernels, neglecting the intrinsic low-dimensional manifold of natural images and inevitably causing edge blur. To address these limitations, we propose 6th Grid-Net, a highly efficient and unified remote sensing image restoration framework tailored for resource-constrained edge devices. Specifically, we construct a novel six-dimensional fusion tensor that seamlessly integrates the color rendition capabilities of 3D LUTs with the spatial-luminance detail preservation of bilateral grids. To overcome the drawbacks of standard trilinear interpolation, we introduce a manifold-adaptive high-dimensional sampling mechanism. This mechanism dynamically adjusts the interpolation kernel based on local edge orientation, texture strength, and color similarity, enabling simultaneous global color stylization and local edge refinement in a single forward pass. Additionally, an edge-aware grid smoothing constraint and dynamic quantization are incorporated to suppress ghosting artifacts and significantly compress the model size. Extensive experiments on multiple benchmark datasets demonstrate that 6th Grid-Net achieves state-of-the-art restoration quality across various degradation scenarios.
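The manifold-adaptive sampling can be pictured as interpolation weights that tighten near strong edges and down-weight grid vertices whose color differs from the query pixel. The toy Python sketch below is not the paper's six-dimensional kernel: it models only the texture-strength and color-similarity terms, sigma_c and sigma_t are assumed bandwidths, and the edge-orientation term is omitted.

import numpy as np

def adaptive_sample_weights(center, neighbors, grad, sigma_c=0.1, sigma_t=0.2):
    """Toy anisotropic interpolation weights.

    center:    (3,) RGB of the query pixel
    neighbors: (k, 3) RGB values at the k enclosing grid vertices
    grad:      (2,) local image gradient, used as an edge/texture proxy
    """
    color_dist = np.linalg.norm(neighbors - center, axis=1)
    texture = np.linalg.norm(grad)                  # texture / edge strength
    sharpness = 1.0 + texture / sigma_t             # sharper kernel near edges
    w = np.exp(-sharpness * (color_dist / sigma_c) ** 2)
    return w / (w.sum() + 1e-8)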
Abstract:Over the past decades, Markov chain Monte Carlo (MCMC) methods have been widely studied, with a typical application being the quantification of posterior uncertainties in Bayesian system identification of structural dynamic models. To address the excessively low sampling efficiency of generic MCMC methods on specific problems, researchers have developed several MCMC algorithms that integrate trainable neural networks to replace and enhance their critical components. Later, meta-learning MCMC methods emerged to reduce training time. However, they require considerable similarity between test and training tasks, and their sampling efficiency is constrained by network designs simplified to satisfy such trade-offs. This paper proposes the Adaptive Principal-Component (PC) Meta-learning Stochastic Gradient Hamiltonian Monte Carlo (APM-SGHMC) algorithm. It adaptively rotates the coordinate axes of the parameter space to align with the PC directions of the current posterior samples, ensuring rotation-invariance of sampling performance with respect to the posterior distribution. By incorporating translation-, scale-, and rotation-invariance in a unified framework, APM-SGHMC enables universal samplers to acquire generalizable knowledge across diverse Bayesian system identification tasks from a minimal set of training tasks, while eliminating the constraints that network design trade-offs impose on sampling efficiency. Practical feasibility issues are also addressed. Two Bayesian system identification case studies demonstrate its effectiveness and universality: our method overcomes the case-by-case limitations of traditional data-driven approaches, achieving zero-shot generalization across structurally distinct models without retraining and maintaining consistently superior performance across all scenarios.
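A rough Python sketch of the principal-component transform such a sampler could apply before each update, assuming access to the current batch of posterior samples; the SGHMC dynamics themselves are not shown, and the 1e-12 eigenvalue floor is an arbitrary choice.

import numpy as np

def pc_transform(samples):
    # samples: (n, d) current posterior samples
    mean = samples.mean(axis=0)                        # translation invariance
    cov = np.cov(samples - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    scale = np.sqrt(np.maximum(eigval, 1e-12))         # scale invariance
    to_pc   = lambda theta: (theta - mean) @ eigvec / scale
    from_pc = lambda z: (z * scale) @ eigvec.T + mean  # rotation invariance
    return to_pc, from_pc

The sampler would run its leapfrog-style updates on the transformed coordinates z and map accepted states back to the original parameter space with from_pc.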
Abstract:This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset~\cite{jin2024raindrop}. For this edition, we adjust the dataset to 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for raindrop removal under various illumination and focus conditions. In total, 168 teams registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.
Abstract:In this paper, we review the NTIRE 2026 challenge on single-image reflection removal (SIRR) in the Wild. SIRR is a fundamental task in image restoration. Despite progress in academic research, most methods are tested on synthetic images or limited real-world images, leaving a gap between academic results and real-world applications. In this challenge, we provide participants with the OpenRR-5k dataset, which requires them to process real-world images covering a range of reflection scenarios and intensities, with the goal of generating clean, reflection-free images. The challenge attracted more than 100 registrations, with 11 of them participating in the final testing phase. The top-ranked methods advanced the state of the art in reflection removal and earned unanimous recognition from the five experts in the field. The proposed OpenRR-5k dataset is available at https://huggingface.co/datasets/qiuzhangTiTi/OpenRR-5k, and the homepage of this challenge is at https://github.com/caijie0620/OpenRR-5k. Due to page limitations, this article only presents partial content; the full report and detailed analyses are available in the extended arXiv version.
Abstract:This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify reconstruction pipelines that remain robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.
Abstract:With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g., 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-intensive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and sparse image correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilitate supervision for multi-view geometric consistency. Training from scratch with less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervision and scalable FFRMs.
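As a stand-in for the ambiguity-aware relative depth loss, the Python sketch below uses a standard scale-and-shift-invariant depth alignment in the spirit of common monocular-depth training losses; the paper's actual formulation may differ, and mask is assumed to mark pixels with valid monocular predictions.

import torch

def relative_depth_loss(pred, mono_depth, mask):
    """Scale-and-shift-invariant alignment between predicted depth and a
    monocular relative-depth prior (a proxy for the paper's ambiguity-aware
    loss). pred, mono_depth: [B, H, W]; mask: [B, H, W] boolean."""
    def normalize(d):
        d = d[mask]
        shift = d.median()
        scale = (d - shift).abs().mean() + 1e-6
        return (d - shift) / scale
    return (normalize(pred) - normalize(mono_depth)).abs().mean()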
Abstract:Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLM to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. Furthermore, we design a multi-level progressive fusion module with a token-level gating mechanism, enabling adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments demonstrate that GAP-MLLM significantly enhances geometric feature fusion and consistently improves performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.
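A minimal Python sketch of what a token-level gating fusion could look like, assuming visual and geometric tokens are already aligned one-to-one; the paper's multi-level progressive design is not reproduced here.

import torch
import torch.nn as nn

class TokenGateFusion(nn.Module):
    """Toy token-level gate that mixes geometric features into visual tokens."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, vis_tokens, geo_tokens):
        # vis_tokens, geo_tokens: [B, N, dim], aligned token-wise
        g = self.gate(torch.cat([vis_tokens, geo_tokens], dim=-1))  # per-token gate
        return vis_tokens + g * self.proj(geo_tokens)

The residual form leaves the original visual tokens intact when the gate closes, which is one simple way to inject geometry without suppressing semantic reasoning.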
Abstract:Many classic opera videos exhibit poor visual quality due to the limitations of early filming equipment and long-term degradation during storage. Although real-world video super-resolution (RWVSR) has achieved significant advances in recent years, directly applying existing methods to degraded opera videos remains challenging. The difficulties are twofold. First, accurately modeling real-world degradations is complex: simplistic combinations of classical degradation kernels fail to capture the authentic noise distribution, while methods that extract real noise patches from external datasets are prone to style mismatches that introduce visual artifacts. Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. Specifically, degradation-descriptive text, derived from the degradation process, is incorporated into the negative branch to constrain the solution space. Simultaneously, content-descriptive text is incorporated into a positive branch and our proposed Text-Enhanced Discriminator (TED) to provide semantic guidance for enhanced texture reconstruction. Furthermore, we design a Degradation-Robust Feature Fusion (DRF) module to facilitate cross-modal feature fusion while suppressing degradation interference. Experiments on our OperaLQ benchmark show that TextOVSR outperforms state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/ChangHua0/TextOVSR.
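The dual-branch text guidance can be loosely illustrated as two cross-attention branches whose contributions are added (content-descriptive text) and subtracted (degradation-descriptive text). This is a speculative Python sketch, not the TextOVSR architecture; the beta weighting and the use of nn.MultiheadAttention are assumptions for illustration.

import torch
import torch.nn as nn

class DualTextGuidance(nn.Module):
    """Toy positive/negative text guidance for image features."""
    def __init__(self, dim, n_heads=4, beta=0.5):
        super().__init__()
        self.pos_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.neg_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.beta = beta

    def forward(self, img_feats, pos_text, neg_text):
        # img_feats: [B, N, dim]; pos_text / neg_text: [B, L, dim] text embeddings
        pos, _ = self.pos_attn(img_feats, pos_text, pos_text)   # content guidance
        neg, _ = self.neg_attn(img_feats, neg_text, neg_text)   # degradation constraint
        return img_feats + pos - self.beta * neg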
Abstract:Federated Domain Generalization for Person Re-Identification (FedDG-ReID) learns domain-invariant representations from decentralized data. While Vision Transformer (ViT) is widely adopted, its global attention often fails to distinguish pedestrians from highly similar backgrounds or across diverse viewpoints -- a challenge amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), introducing learnable visual prompts to guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) comprising Holistic Full Body Prompts to suppress cross-client background noise, and Body Part Alignment Prompts to capture fine-grained details robust to pose and viewpoint variations. To mitigate high communication costs, we design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code is available at https://github.com/leavlong/FedBPrompt.
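A minimal Python sketch of a prompt-only federated fine-tuning loop of this kind: the ViT backbone is frozen, each client trains and transmits only prompt tensors, and the server averages them FedAvg-style. The model.loss interface, the "prompt" name filter, and the hyperparameters are assumptions for illustration, not the released FedBPrompt code.

import copy
import torch

def client_update(model, loader, local_epochs=1, lr=1e-3):
    # freeze the ViT backbone; train only the visual prompt parameters
    for name, p in model.named_parameters():
        p.requires_grad = "prompt" in name
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(local_epochs):
        for imgs, labels in loader:
            loss = model.loss(imgs, labels)   # hypothetical ReID loss interface
            opt.zero_grad(); loss.backward(); opt.step()
    # only the lightweight prompt tensors are sent back to the server
    return {k: v.detach().cpu() for k, v in model.state_dict().items() if "prompt" in k}

def server_aggregate(prompt_states):
    # FedAvg over prompt parameters only; backbone weights never leave the clients
    avg = copy.deepcopy(prompt_states[0])
    for k in avg:
        avg[k] = torch.stack([s[k] for s in prompt_states]).mean(0)
    return avg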