Abstract:Vision-and-Language Navigation (VLN) systems often focus on either discrete (panoramic) or continuous (free-motion) paradigms alone, overlooking the complexities of human-populated, dynamic environments. We introduce a unified Human-Aware VLN (HA-VLN) benchmark that merges these paradigms under explicit social-awareness constraints. Our contributions include: 1. A standardized task definition that balances discrete-continuous navigation with personal-space requirements; 2. An enhanced human motion dataset (HAPS 2.0) and upgraded simulators capturing realistic multi-human interactions, outdoor contexts, and refined motion-language alignment; 3. Extensive benchmarking on 16,844 human-centric instructions, revealing how multi-human dynamics and partial observability pose substantial challenges for leading VLN agents; 4. Real-world robot tests validating sim-to-real transfer in crowded indoor spaces; and 5. A public leaderboard supporting transparent comparisons across discrete and continuous tasks. Empirical results show improved navigation success and fewer collisions when social context is integrated, underscoring the need for human-centric design. By releasing all datasets, simulators, agent code, and evaluation tools, we aim to advance safer, more capable, and socially responsible VLN research.
Abstract:Label Smoothing (LS) is widely adopted to curb overconfidence in neural network predictions and enhance generalization. However, previous research shows that LS can force feature representations into excessively tight clusters, eroding intra-class distinctions. More recent findings suggest that LS also induces overconfidence in misclassifications, yet the precise mechanism remained unclear. In this work, we decompose the loss term introduced by LS, revealing two key components: (i) a regularization term that functions only when the prediction is correct, and (ii) an error-enhancement term that emerges under misclassifications. This latter term compels the model to reinforce incorrect predictions with exaggerated certainty, further collapsing the feature space. To address these issues, we propose Max Suppression (MaxSup), which uniformly applies the intended regularization to both correct and incorrect predictions by penalizing the top-1 logit instead of the ground-truth logit. Through feature analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Extensive experiments on image classification and downstream tasks confirm that MaxSup is a more robust alternative to LS. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization.
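The Label Smoothing abstract above turns on a decomposition that can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes the common LS variant that spreads the smoothing mass `alpha` uniformly over all K classes; under that assumption, the smoothed cross-entropy equals the plain cross-entropy plus `alpha * (z_gt - mean(z))`. The MaxSup term is then written by swapping the ground-truth logit for the top-1 logit, as the abstract describes. All function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, target):
    """Standard cross-entropy with a one-hot target."""
    return -log_softmax(logits)[target]


def label_smoothing_loss(logits, target, alpha):
    """Cross-entropy against soft labels (1 - alpha) * one_hot + alpha / K.

    Assumption: smoothing mass is spread over all K classes.
    """
    k = len(logits)
    logp = log_softmax(logits)
    soft = [alpha / k + (1.0 - alpha if i == target else 0.0) for i in range(k)]
    return -sum(s * lp for s, lp in zip(soft, logp))


def ls_extra_term(logits, target, alpha):
    """Term LS adds on top of cross-entropy: alpha * (z_gt - mean(z))."""
    return alpha * (logits[target] - sum(logits) / len(logits))


def maxsup_extra_term(logits, alpha):
    """MaxSup-style term: penalize the top-1 logit instead of z_gt."""
    return alpha * (max(logits) - sum(logits) / len(logits))


def maxsup_loss(logits, target, alpha):
    """Cross-entropy plus the MaxSup-style regularizer (sketch)."""
    return cross_entropy(logits, target) + maxsup_extra_term(logits, alpha)
```

When the prediction is correct, the ground-truth logit is the top-1 logit and the two extra terms coincide. When the prediction is wrong, the LS term no longer penalizes the largest logit, whereas the MaxSup-style term still does.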
Abstract:For high-level geo-spatial applications and intelligent robotics, accurate global pose information is of crucial importance. Map-aided localization is an important and universal approach to overcoming the limitations of global navigation satellite systems (GNSS) in challenging environments. However, current solutions face challenges in terms of mapping flexibility, storage burden, and re-localization performance. In this work, we present SF-Loc, a lightweight visual mapping and map-aided localization system, whose core idea is a map representation based on sparse frames with dense (though downsampled) depth, termed visual structure frames. In the mapping phase, multi-sensor dense bundle adjustment (MS-DBA) is applied to construct geo-referenced visual structure frames. Local co-visibility is checked to maintain map sparsity and enable incremental mapping. In the localization phase, coarse-to-fine vision-based localization is performed, in which multi-frame information and the map distribution are fully integrated. Specifically, the concept of spatially smoothed similarity (SSS) is proposed to overcome place ambiguity, and pairwise frame matching is applied for efficient and robust pose estimation. Experimental results on both public and self-made datasets verify the effectiveness of the system. In complex urban road scenarios, the map size is reduced to 3 MB per kilometer and stable decimeter-level re-localization can be achieved. The code will be made open-source soon (https://github.com/GREAT-WHU/SF-Loc).
Abstract:In this study, we aim to investigate current LLMs' mastery of medical factual knowledge with a dynamic evaluation schema, which can automatically generate multiple test samples for each medical factual knowledge point. Test samples produced directly by LLMs always introduce factual errors and lack diversity in the manner of knowledge expression. To overcome these drawbacks, we propose a novel evaluation method, Predicate-text Dual Transformation (PretextTrans), which introduces predicate transformations into the dynamic evaluation schema. Specifically, each medical knowledge point is first transformed into a predicate expression; then, a series of variants is derived from the predicate expression through predicate transformations; lastly, the resulting predicate variants are transformed back into textual expressions, yielding a series of test samples with both factual reliability and expression diversity. Using the proposed PretextTrans method, we systematically investigate 12 well-known LLMs' mastery of medical factual knowledge based on two medical datasets. The comparison results show that current LLMs still have significant deficiencies in fully mastering medical knowledge, which may explain why current LLMs still perform unsatisfactorily in real-world medical scenarios despite having achieved considerable performance on public benchmarks. Our proposed method serves as an effective solution for evaluating LLMs in the medical domain and offers valuable insights for developing medical-specific LLMs.
Abstract:Visual-inertial systems have been widely studied and applied in the last two decades, mainly due to their low cost and power consumption, small footprint, and high availability. This trend has led to a large number of visual-inertial calibration methods being proposed, as accurate spatiotemporal parameters between sensors are a prerequisite for visual-inertial fusion. In our previous work, iKalibr, a continuous-time-based visual-inertial calibration method was proposed as part of one-shot multi-sensor resilient spatiotemporal calibration. While requiring no artificial target brings considerable convenience, computationally expensive pose estimation is required in initialization and batch optimization, limiting its usability. Fortunately, this can be vastly improved for RGBD cameras, which provide additional depth information, by employing mapping-free ego-velocity estimation instead of mapping-based pose estimation. In this paper, we present a continuous-time ego-velocity estimation-based RGBD-inertial spatiotemporal calibration method, termed iKalibr-RGBD, which is also targetless but computationally efficient. The general pipeline of iKalibr-RGBD is inherited from iKalibr and is composed of a rigorous initialization procedure and several continuous-time batch optimizations. The implementation of iKalibr-RGBD is open-sourced at (https://github.com/Unsigned-Long/iKalibr) to benefit the research community.
Abstract:Sampling-based decoding strategies have been widely adopted for Large Language Models (LLMs) in numerous applications. They target a balance between diversity and quality via temperature tuning and tail truncation (e.g., top-k and top-p sampling). Considering the high dynamic range of candidate next tokens given different prefixes, recent studies propose to adaptively truncate the tail of the LLM's predicted distribution. Although improved results have been reported with these methods on open-ended text generation tasks, the results are highly dependent on the curated truncation parameters and exemplar text. In this paper, we propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step, based on our collected prefix tree, which preserves the context of a full sentence. Our work provides a comprehensive comparison between existing truncation sampling methods, as well as their recommended parameters as a guideline for users.
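For readers unfamiliar with the baselines named in the sampling abstract above, the following is a minimal sketch of temperature scaling followed by top-k and top-p (nucleus) tail truncation over a single next-token distribution. It illustrates only these standard fixed-parameter baselines, not the adaptive methods or the prefix-tree evaluation the paper proposes. The function names and the plain-list representation of the distribution are assumptions made for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest high-probability prefix whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample(probs, rng=random):
    """Draw one token index from the (truncated) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

A fixed `k` or `p` keeps the same cutoff rule regardless of how peaked or flat the distribution is at a given step, which is the limitation that motivates the adaptive truncation methods the abstract compares.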
Abstract:The aided inertial navigation system (INS), typically consisting of an inertial measurement unit (IMU) and an exteroceptive sensor, has been widely accepted as a feasible solution for navigation. Compared with vision-aided and LiDAR-aided INS, radar-aided INS can achieve better performance in adverse weather conditions, since radar uses low-frequency measuring signals that are attenuated less by atmospheric gases and rain. For such a radar-aided INS, accurate spatiotemporal transformation is a fundamental prerequisite to achieving optimal information fusion. In this work, we present RIs-Calib: a spatiotemporal calibrator for multiple 3D radars and IMUs based on continuous-time estimation, which enables accurate spatiotemporal calibration and does not require any additional artificial infrastructure or prior knowledge. Our approach starts with a rigorous and robust procedure for state initialization, followed by batch optimizations, in which all parameters are steadily refined to globally optimal states. We validate and evaluate RIs-Calib in both simulated and real-world experiments, and the results demonstrate that RIs-Calib is capable of accurate and consistent calibration. We open-source our implementations at (https://github.com/Unsigned-Long/RIs-Calib) to benefit the research community.
Abstract:The integrated inertial system, typically combining an IMU with an exteroceptive sensor such as a radar, LiDAR, or camera, has been widely accepted and applied in modern robotic applications for ego-motion estimation, motion control, or autonomous exploration. To improve accuracy, robustness, and usability, multiple sensors of different types are generally integrated resiliently, which benefits failure tolerance, perception capability, and environment compatibility. For such systems, accurate and consistent spatiotemporal calibration is required to maintain a unique spatiotemporal framework for multi-sensor fusion. Considering that most existing calibration methods (i) are generally oriented to specific integrated inertial systems, (ii) often focus only on spatial determination, and (iii) usually require artificial targets, lacking convenience and usability, we propose iKalibr: a unified targetless spatiotemporal calibration framework for resilient integrated inertial systems, which overcomes the above issues and enables both accurate and consistent calibration. iKalibr currently supports four commonly employed sensors, namely IMU, radar, LiDAR, and camera. The proposed method starts with a rigorous and efficient dynamic initialization, in which all parameters in the estimator are accurately recovered. Following that, several continuous-time-based batch optimizations are carried out to refine the initialized parameters to globally optimal ones. Extensive real-world experiments were conducted to verify the feasibility and evaluate the calibration performance of iKalibr. The results demonstrate that iKalibr can achieve accurate resilient spatiotemporal calibration. We open-source our implementations at (https://github.com/Unsigned-Long/iKalibr) to benefit the research community.
Abstract:Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance HA-VLN agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.
Abstract:Large language models (LLMs) have excelled across domains and have delivered notable performance on medical evaluation benchmarks such as MedQA. However, there still exists a significant gap between the reported performance and the practical effectiveness in real-world medical scenarios. In this paper, we aim to explore the causes of this gap by employing a multifaceted examination schema to systematically probe the actual mastery of medical knowledge by current LLMs. Specifically, we develop a novel evaluation framework, MultifacetEval, to examine the degree and coverage of LLMs in encoding and mastering medical knowledge at multiple facets (comparison, rectification, discrimination, and verification) concurrently. Based on the MultifacetEval framework, we construct two multifaceted evaluation datasets: MultiDiseK (by producing questions from a clinical disease knowledge base) and MultiMedQA (by rephrasing each question from the medical benchmark MedQA into multifaceted questions). The experimental results on these multifaceted datasets demonstrate that the extent to which current LLMs master medical knowledge is far below their performance on existing medical benchmarks, suggesting that they lack depth, precision, and comprehensiveness in mastering medical knowledge. Consequently, current LLMs are not yet ready for application in real-world medical tasks. The code and datasets are available at https://github.com/THUMLP/MultifacetEval.