Northeast Normal University
Abstract:Traditional Simultaneous Localization and Mapping (SLAM) systems often face limitations including coarse rendering quality, insufficient recovery of scene details, and poor robustness in dynamic environments. 3D Gaussian Splatting (3DGS), with its efficient explicit representation and high-quality rendering capabilities, offers a new reconstruction paradigm for SLAM. This survey comprehensively reviews key technical approaches for integrating 3DGS with SLAM. We analyze performance optimization of representative methods across four critical dimensions: rendering quality, tracking accuracy, reconstruction speed, and memory consumption, delving into their design principles and breakthroughs. Furthermore, we examine methods for enhancing the robustness of 3DGS-SLAM in complex environments such as motion blur and dynamic environments. Finally, we discuss future challenges and development trends in this area. This survey aims to provide a technical reference for researchers and foster the development of next-generation SLAM systems characterized by high fidelity, efficiency, and robustness.
Abstract:Ultrasound (US) imaging exhibits substantial heterogeneity across anatomical structures and acquisition protocols, posing significant challenges to the development of generalizable analysis models. Most existing methods are task-specific, limiting their suitability as clinically deployable foundation models. To address this limitation, the Foundation Model Challenge for Ultrasound Image Analysis (FM\_UIA~2026) introduces a large-scale multi-task benchmark comprising 27 subtasks across segmentation, classification, detection, and regression. In this paper, we present the official baseline for FM\_UIA~2026 based on a unified Multi-Head Multi-Task Learning (MH-MTL) framework that supports all tasks within a single shared network. The model employs an ImageNet-pretrained EfficientNet--B4 backbone for robust feature extraction, combined with a Feature Pyramid Network (FPN) to capture multi-scale contextual information. A task-specific routing strategy enables global tasks to leverage high-level semantic features, while dense prediction tasks exploit spatially detailed FPN representations. Training incorporates a composite loss with task-adaptive learning rate scaling and a cosine annealing schedule. Validation results demonstrate the feasibility and robustness of this unified design, establishing a strong and extensible baseline for ultrasound foundation model research. The code and dataset are publicly available at \href{https://github.com/lijiake2408/Foundation-Model-Challenge-for-Ultrasound-Image-Analysis}{GitHub}.
Abstract:Reliably ensuring Large Language Models (LLMs) follow complex instructions is a critical challenge, as existing benchmarks often fail to reflect real-world use or isolate compliance from task success. We introduce MOSAIC (MOdular Synthetic Assessment of Instruction Compliance), a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints to enable a granular and independent analysis of this capability. Our evaluation of five LLMs from different families based on this new benchmark demonstrates that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position. The analysis reveals model-specific weaknesses, uncovers synergistic and conflicting interactions between instructions, and identifies distinct positional biases such as primacy and recency effects. These granular insights are critical for diagnosing model failures and developing more reliable LLMs for systems that demand strict adherence to complex instructions.
Abstract:Missed and delayed diagnosis remains a major challenge in rare disease care. At the initial clinical encounters, physicians assess rare disease risk using only limited information under high uncertainty. When high-risk patients are not recognised at this stage, targeted diagnostic testing is often not initiated, resulting in missed diagnosis. Existing primary care triage processes are structurally insufficient to reliably identify patients with rare diseases at initial clinical presentation and universal screening is needed to reduce diagnostic delay. Here we present RareAlert, an early screening system which predict patient-level rare disease risk from routinely available primary-visit information. RareAlert integrates reasoning generated by ten LLMs, calibrates and weights these signals using machine learning, and distils the aligned reasoning into a single locally deployable model. To develop and evaluate RareAlert, we curated RareBench, a real-world dataset of 158,666 cases covering 33 Orphanet disease categories and more than 7,000 rare conditions, including both rare and non-rare presentations. The results showed that rare disease identification can be reconceptualised as a universal uncertainty resolution process applied to the general patient population. On an independent test set, RareAlert, a Qwen3-4B based model trained with calibrated reasoning signals, achieved an AUC of 0.917, outperforming the best machine learning ensemble and all evaluated LLMs, including GPT-5, DeepSeek-R1, Claude-3.7-Sonnet, o3-mini, Gemini-2.5-Pro, and Qwen3-235B. These findings demonstrate the diversity in LLM medical reasoning and the effectiveness of aligning such reasoning in highly uncertain clinical tasks. By incorporating calibrated reasoning into a single model, RareAlert enables accurate, privacy-preserving, and scalable rare disease risk screening suitable for large-scale local deployment.
Abstract:The proliferation of powerful Text-to-Video (T2V) models, trained on massive web-scale datasets, raises urgent concerns about copyright and privacy violations. Membership inference attacks (MIAs) provide a principled tool for auditing such risks, yet existing techniques - designed for static data like images or text - fail to capture the spatio-temporal complexities of video generation. In particular, they overlook the sparsity of memorization signals in keyframes and the instability introduced by stochastic temporal dynamics. In this paper, we conduct the first systematic study of MIAs against T2V models and introduce a novel framework VidLeaks, which probes sparse-temporal memorization through two complementary signals: 1) Spatial Reconstruction Fidelity (SRF), using a Top-K similarity to amplify spatial memorization signals from sparsely memorized keyframes, and 2) Temporal Generative Stability (TGS), which measures semantic consistency across multiple queries to capture temporal leakage. We evaluate VidLeaks under three progressively restrictive black-box settings - supervised, reference-based, and query-only. Experiments on three representative T2V models reveal severe vulnerabilities: VidLeaks achieves AUC of 82.92% on AnimateDiff and 97.01% on InstructVideo even in the strict query-only setting, posing a realistic and exploitable privacy risk. Our work provides the first concrete evidence that T2V models leak substantial membership information through both sparse and temporal memorization, establishing a foundation for auditing video generation systems and motivating the development of new defenses. Code is available at: https://zenodo.org/records/17972831.
Abstract:This paper presents a method for estimating significant wave height (Hs) from sparse S_pectral P_oint using a T_ransformer-based approach (SPT). Based on empirical observations that only a minority of spectral points with strong power contribute to wave energy, the proposed SPT effectively integrates geometric and spectral characteristics of ocean surface waves to estimate Hs through multi-dimensional feature representation. The experiment reveals an intriguing phenomenon: the learned features of SPT align well with physical dispersion relations, where the contribution-score map of selected points is concentrated along dispersion curves. Compared to conventional vision networks that process image sequences and full spectra, SPT demonstrates superior performance in Hs regression while consuming significantly fewer computational resources. On a consumer-grade GPU, SPT completes the training of regression model for 1080 sea clutter image sequences within 4 minutes, showcasing its potential to reduce deployment costs for radar wave-measuring systems. The open-source implementation of SPT will be available at https://github.com/joeyee/spt
Abstract:Large Language Models (LLMs) often generate substantively relevant content but fail to adhere to formal constraints, leading to outputs that are conceptually correct but procedurally flawed. Traditional prompt refinement approaches focus on rephrasing the description of the primary task an LLM has to perform, neglecting the granular constraints that function as acceptance criteria for its response. We propose a novel multi-agentic workflow that decouples optimization of the primary task description from its constraints, using quantitative scores as feedback to iteratively rewrite and improve them. Our evaluation demonstrates this method produces revised prompts that yield significantly higher compliance scores from models like Llama 3.1 8B and Mixtral-8x 7B.
Abstract:Vehicle-Infrastructure Collaborative Perception (VICP) is pivotal for resolving occlusion in autonomous driving, yet the trade-off between communication bandwidth and feature redundancy remains a critical bottleneck. While intermediate fusion mitigates data volume compared to raw sharing, existing frameworks typically rely on spatial compression or static confidence maps, which inefficiently transmit spatially redundant features from non-critical background regions. To address this, we propose Risk-intent Selective detection (RiSe), an interaction-aware framework that shifts the paradigm from identifying visible regions to prioritizing risk-critical ones. Specifically, we introduce a Potential Field-Trajectory Correlation Model (PTCM) grounded in potential field theory to quantitatively assess kinematic risks. Complementing this, an Intention-Driven Area Prediction Module (IDAPM) leverages ego-motion priors to proactively predict and filter key Bird's-Eye-View (BEV) areas essential for decision-making. By integrating these components, RiSe implements a semantic-selective fusion scheme that transmits high-fidelity features only from high-interaction regions, effectively acting as a feature denoiser. Extensive experiments on the DeepAccident dataset demonstrate that our method reduces communication volume to 0.71\% of full feature sharing while maintaining state-of-the-art detection accuracy, establishing a competitive Pareto frontier between bandwidth efficiency and perception performance.
Abstract:Multimodal large language models (MLLMs) show promising performance on medical visual question answering (VQA) and report generation, but these generation and explanation abilities do not reliably transfer to disease-specific classification. We evaluated MLLM architectures on knee osteoarthritis (OA) radiograph classification, which remains underrepresented in existing medical MLLM benchmarks, even though knee OA affects an estimated 300 to 400 million people worldwide. Through systematic ablation studies manipulating the vision encoder, the connector, and the large language model (LLM) across diverse training strategies, we measured each component's contribution to diagnostic accuracy. In our classification task, a trained vision encoder alone could outperform full MLLM pipelines in classification accuracy and fine-tuning the LLM provided no meaningful improvement over prompt-based guidance. And LoRA fine-tuning on a small, class-balanced dataset (500 images) gave better results than training on a much larger but class-imbalanced set (5,778 images), indicating that data balance and quality can matter more than raw scale for this task. These findings suggest that for domain-specific medical classification, LLMs are more effective as interpreters and report generators rather than as primary classifiers. Therefore, the MLLM architecture appears less suitable for medical image diagnostic classification tasks that demand high certainty. We recommend prioritizing vision encoder optimization and careful dataset curation when developing clinically applicable systems.
Abstract:Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in viewpoint, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation(VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly proposes a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations. First, we propose the slot-based historical image compression module, which dynamically distills multi-view historical observations into fixed-length contextual representations. Then, the spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate existing spatiotemporal context with current observations, we design the prompt-guided multimodal integration module to support time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89\% in success rate and 6.33\% in success weighted by path length, consistently across both seen and unseen environments.