Fudan university
Abstract:Automatic view positioning is crucial for cardiac computed tomography (CT) examinations, including disease diagnosis and surgical planning. However, it is highly challenging due to individual variability and large 3D search space. Existing work needs labor-intensive and time-consuming manual annotations to train view-specific models, which are limited to predicting only a fixed set of planes. However, in real clinical scenarios, the challenge of positioning semantic 2D slices with any orientation into varying coordinate space in arbitrary 3D volume remains unsolved. We thus introduce a novel framework, AVP-AP, the first to use Atlas Prompting for self-supervised Automatic View Positioning in the 3D CT volume. Specifically, this paper first proposes an atlas prompting method, which generates a 3D canonical atlas and trains a network to map slices into their corresponding positions in the atlas space via a self-supervised manner. Then, guided by atlas prompts corresponding to the given query images in a reference CT, we identify the coarse positions of slices in the target CT volume using rigid transformation between the 3D atlas and target CT volume, effectively reducing the search space. Finally, we refine the coarse positions by maximizing the similarity between the predicted slices and the query images in the feature space of a given foundation model. Our framework is flexible and efficient compared to other methods, outperforming other methods by 19.8% average structural similarity (SSIM) in arbitrary view positioning and achieving 9% SSIM in two-chamber view compared to four radiologists. Meanwhile, experiments on a public dataset validate our framework's generalizability.
Abstract:Accurate and stable feature matching is critical for computer vision tasks, particularly in applications such as Simultaneous Localization and Mapping (SLAM). While recent learning-based feature matching methods have demonstrated promising performance in challenging spatiotemporal scenarios, they still face inherent trade-offs between accuracy and computational efficiency in specific settings. In this paper, we propose a lightweight feature matching network designed to establish sparse, stable, and consistent correspondence between multiple frames. The proposed method eliminates the dependency on manual annotations during training and mitigates feature drift through a hybrid self-supervised paradigm. Extensive experiments validate three key advantages: (1) Our method operates without dependency on external prior knowledge and seamlessly incorporates its hybrid training mechanism into original datasets. (2) Benchmarked against state-of-the-art deep learning-based methods, our approach maintains equivalent computational efficiency at low-resolution scales while achieving a 2-10x improvement in computational efficiency for high-resolution inputs. (3) Comparative evaluations demonstrate that the proposed hybrid self-supervised scheme effectively mitigates feature drift in long-term tracking while maintaining consistent representation across image sequences.
Abstract:Existing 3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL models, creating a "mist" that obscures rigorous insights into model capabilities and 3D-VL tasks. This mist persists due to three key limitations. First, flawed test data, like ambiguous referential text in the grounding task, can yield incorrect and unreliable test results. Second, oversimplified metrics such as simply averaging accuracy per question answering (QA) pair, cannot reveal true model capability due to their vulnerability to language variations. Third, existing benchmarks isolate the grounding and QA tasks, disregarding the underlying coherence that QA should be based on solid grounding capabilities. To unveil the "mist", we propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks, delivering a perspective shift in the evaluation of 3D-VL understanding. Beacon3D features (i) high-quality test data with precise and natural language, (ii) object-centric evaluation with multiple tests per object to ensure robustness, and (iii) a novel chain-of-analysis paradigm to address language robustness and model performance coherence across grounding and QA. Our evaluation of state-of-the-art 3D-VL models on Beacon3D reveals that (i) object-centric evaluation elicits true model performance and particularly weak generalization in QA; (ii) grounding-QA coherence remains fragile in current 3D-VL models, and (iii) incorporating large language models (LLMs) to 3D-VL models, though as a prevalent practice, hinders grounding capabilities and has yet to elevate QA capabilities. We hope Beacon3D and our comprehensive analysis could benefit the 3D-VL community towards faithful developments.
Abstract:World foundation models, which simulate the physical world by predicting future states from current observations and inputs, have become central to many applications in physical intelligence, including autonomous driving and robotics. However, these models require substantial computational resources for pretraining and are further constrained by available data during post-training. As such, scaling computation at test time emerges as both a critical and practical alternative to traditional model enlargement or re-training. In this work, we introduce SWIFT, a test-time scaling framework tailored for WFMs. SWIFT integrates our extensible WFM evaluation toolkit with process-level inference strategies, including fast tokenization, probability-based Top-K pruning, and efficient beam search. Empirical results on the COSMOS model demonstrate that test-time scaling exists even in a compute-optimal way. Our findings reveal that test-time scaling laws hold for WFMs and that SWIFT provides a scalable and effective pathway for improving WFM inference without retraining or increasing model size. The code is available at https://github.com/Mia-Cong/SWIFT.git.
Abstract:We consider high angular resolution detection using distributed mobile platforms implemented with so-called partly calibrated arrays, where position errors between subarrays exist and the counterparts within each subarray are ideally calibrated. Since position errors between antenna arrays affect the coherent processing of measurements from these arrays, it is commonly believed that its angular resolution is influenced. A key question is whether and how much the angular resolution of partly calibrated arrays is affected by the position errors, in comparison with ideally calibrated arrays. To address this fundamental problem, we theoretically illustrate that partly calibrated arrays approximately achieve high angular resolution. Our analysis uses a special characteristic of Cramer-Rao lower bound (CRB) w.r.t. the source separation: When the source separation increases, the CRB first declines rapidly, then plateaus out, and the turning point is close to the angular resolution limit. This means that the turning point of CRB can be used to indicate angular resolution. We then theoretically analyze the declining and plateau phases of CRB, and explain that the turning point of CRB in partly calibrated arrays is close to the angular resolution limit of distributed arrays without errors, demonstrating high resolution ability. This work thus provides a theoretical guarantee for the high-resolution performance of distributed antenna arrays in mobile platforms.
Abstract:We present a novel preference learning framework to capture participant preferences efficiently within limited interaction rounds. It involves three main contributions. First, we develop a variational Bayesian approach to infer the participant's preference model by estimating posterior distributions and managing uncertainty from limited information. Second, we propose an adaptive questioning policy that maximizes cumulative uncertainty reduction, formulating questioning as a finite Markov decision process and using Monte Carlo Tree Search to prioritize promising question trajectories. By considering long-term effects and leveraging the efficiency of the Bayesian approach, the policy avoids shortsightedness. Third, we apply the framework to Multiple Criteria Decision Aiding, with pairwise comparison as the preference information and an additive value function as the preference model. We integrate the reparameterization trick to address high-variance issues, enhancing robustness and efficiency. Computational studies on real-world and synthetic datasets demonstrate the framework's practical usability, outperforming baselines in capturing preferences and achieving superior uncertainty reduction within limited interactions.
Abstract:A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Despite rapid progress in image compression, computational inefficiency and poor redundancy modeling still pose significant bottlenecks, limiting practical applications. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods and improve image compression from multiple perspectives. In this paper, we integrate the advantages of SSMs for better efficiency-performance trade-off and propose an enhanced image compression approach through refined context modeling, which we term MambaIC. Specifically, we explore context modeling to adaptively refine the representation of hidden states. Additionally, we introduce window-based local attention into channel-spatial entropy modeling to reduce potential spatial redundancy during compression, thereby increasing efficiency. Comprehensive qualitative and quantitative results validate the effectiveness and efficiency of our approach, particularly for high-resolution image compression. Code is released at https://github.com/AuroraZengfh/MambaIC.
Abstract:Diffusion models have been widely adopted in image and video generation. However, their complex network architecture leads to high inference overhead for its generation process. Existing diffusion quantization methods primarily focus on the quantization of the model structure while ignoring the impact of time-steps variation during sampling. At the same time, most current approaches fail to account for significant activations that cannot be eliminated, resulting in substantial performance degradation after quantization. To address these issues, we propose Time-Rotation Diffusion Quantization (TR-DQ), a novel quantization method incorporating time-step and rotation-based optimization. TR-DQ first divides the sampling process based on time-steps and applies a rotation matrix to smooth activations and weights dynamically. For different time-steps, a dedicated hyperparameter is introduced for adaptive timing modeling, which enables dynamic quantization across different time steps. Additionally, we also explore the compression potential of Classifier-Free Guidance (CFG-wise) to establish a foundation for subsequent work. TR-DQ achieves state-of-the-art (SOTA) performance on image generation and video generation tasks and a 1.38-1.89x speedup and 1.97-2.58x memory reduction in inference compared to existing quantization methods.
Abstract:Recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable success. However, fundamental misalignments between T2I and SR targets result in a dilemma between inference speed and detail fidelity. Specifically, T2I tasks prioritize multi-step inversion to synthesize coherent outputs aligned with textual prompts and shrink the latent space to reduce generating complexity. Contrariwise, SR tasks preserve most information from low-resolution input while solely restoring high-frequency details, thus necessitating sufficient latent space and fewer inference steps. To bridge the gap, we present a one-step diffusion model for generative detail restoration, GenDR, distilled from a tailored diffusion model with larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand latent space without enlarging the model size. Regarding step-distillation, we propose consistent score identity distillation (CiD) that incorporates SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also polish the pipeline to achieve a more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.
Abstract:Despite Greece's pivotal role in the global economy, large language models (LLMs) remain underexplored for Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. Previous efforts in multilingual financial natural language processing (NLP) have exposed considerable performance disparities, yet no dedicated Greek financial benchmarks or Greek-specific financial LLMs have been developed until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks in Greek: numeric and textual named entity recognition, question answering, abstractive summarization, and topic classification, thereby facilitating systematic and reproducible LLM assessments. To underpin these tasks, we present three novel, high-quality Greek financial datasets, thoroughly annotated by expert native Greek speakers, augmented by two existing resources. Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek financial NLP remains challenging due to linguistic complexity, domain-specific terminology, and financial reasoning gaps. These findings underscore the limitations of cross-lingual transfer, the necessity for financial expertise in Greek-trained models, and the challenges of adapting financial LLMs to Greek text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to promote reproducible research and advance Greek financial NLP, fostering broader multilingual inclusivity in finance.