Abstract: Lip reading, the process of interpreting silent speech from visual lip movements, has attracted increasing attention for its wide range of real-world applications. Deep learning approaches have greatly improved current lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, poses a challenging problem due to inter-speaker variability: a well-trained lip reading system may perform poorly when handling a brand-new speaker. To learn a speaker-robust lip reading model, a key insight is to reduce visual variation across speakers, preventing the model from overfitting to specific speakers. In this work, considering both the input visual cues and the latent representations of a hybrid CTC/attention architecture, we propose to exploit lip landmark-guided fine-grained visual cues instead of the frequently used mouth-cropped images as input features, diminishing speaker-specific appearance characteristics. Furthermore, a max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations. Experimental evaluations on public lip reading datasets demonstrate the effectiveness of the proposed approach under both intra-speaker and inter-speaker conditions.
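As a rough illustration of how landmark-guided input features can suppress speaker-specific appearance, the following minimal sketch normalizes per-frame lip landmarks for position and scale; the array layout, landmark count, and normalization choices are assumptions for exposition, not the paper's exact pipeline.

```python
import numpy as np

def landmark_cues(landmarks):
    """Normalize lip landmarks into speaker-insensitive visual cues.

    landmarks: (T, K, 2) array of K 2D lip landmarks over T frames
               (hypothetical layout; the paper's extraction may differ).
    """
    center = landmarks.mean(axis=1, keepdims=True)        # per-frame lip centroid
    centred = landmarks - center                          # drop absolute mouth position
    scale = np.linalg.norm(centred, axis=2).mean(axis=1)  # per-frame mean radius
    return centred / (scale[:, None, None] + 1e-8)        # drop speaker-specific scale
```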
Abstract: To enhance localization accuracy in urban environments, an innovative LiDAR-Visual-Inertial odometry system, named HDA-LVIO, is proposed, employing hybrid data association. The proposed HDA-LVIO system can be divided into two subsystems: the LiDAR-Inertial subsystem (LIS) and the Visual-Inertial subsystem (VIS). In the LIS, the LiDAR point cloud is used to compute the Iterative Closest Point (ICP) error, which serves as the measurement of an Error State Iterated Kalman Filter (ESIKF) to construct the global map. In the VIS, an incremental method is first employed to adaptively extract planes from the global map, and the centroids of these planes are projected onto the image to obtain projection points. Then, feature points are extracted from the image and tracked along with the projection points using Lucas-Kanade (LK) optical flow. Next, leveraging the vehicle states from previous intervals, sliding window optimization is performed to estimate the depth of the feature points. Concurrently, a method based on epipolar geometric constraints is proposed to address tracking failures of feature points, improving the accuracy of depth estimation by ensuring sufficient parallax within the sliding window. Subsequently, the feature points and projection points are hybridly associated to construct the reprojection error, which serves as the measurement of the ESIKF to estimate the vehicle states. Finally, the localization accuracy of the proposed HDA-LVIO is validated using public datasets and data from our own equipment. The results demonstrate that the proposed algorithm achieves a clear improvement in localization accuracy over various existing algorithms.
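To make the tracking step concrete, here is a minimal sketch of tracking feature and projection points between consecutive frames with pyramidal LK optical flow via OpenCV; the window size, pyramid depth, and function name are illustrative assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def track_lk(prev_gray, cur_gray, prev_pts):
    """Track points from the previous frame into the current frame.

    prev_pts: (N, 1, 2) float32 pixel coordinates in the previous frame.
    Returns the surviving (previous, current) point pairs.
    """
    cur_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, cur_gray, prev_pts, None,
        winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1          # keep only successfully tracked points
    return prev_pts[ok], cur_pts[ok]
```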
Abstract: The task of vision-and-language navigation in continuous environments (VLN-CE) aims at training an autonomous agent to perform low-level actions to navigate through 3D continuous surroundings using visual observations and language instructions. The significant potential of VLN-CE for mobile robots has been demonstrated across a large number of studies. However, most existing works in VLN-CE focus primarily on transferring standard discrete vision-and-language navigation (VLN) methods to continuous environments, overlooking the problem of collisions. Such oversight often results in the agent deviating from the planned path or, in severe instances, becoming trapped in obstacle areas and failing the navigation task. To address these issues, this paper investigates various collision scenarios within VLN-CE and proposes a classification method to predict the underlying causes of collisions. Furthermore, a new VLN-CE algorithm, named Safe-VLN, is proposed to bolster collision avoidance capabilities via two key components, i.e., a waypoint predictor and a navigator. In particular, the waypoint predictor leverages a simulated 2D LiDAR occupancy mask to prevent the predicted waypoints from falling in obstacle-ridden areas. The navigator, on the other hand, employs a strategy of 're-selection after collision' to prevent the robot agent from becoming ensnared in a cycle of perpetual collisions. The proposed Safe-VLN is evaluated on the R2R-CE benchmark, and the results demonstrate enhanced navigation performance and a statistically significant reduction in collision incidents.
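The two components can be pictured together in one small sketch: occupancy masking of a top-down waypoint heatmap, plus re-selection that skips waypoints already tried after a collision. The grid representation and function name are assumptions for illustration, not the Safe-VLN code.

```python
import numpy as np

def select_waypoint(heatmap, occupancy, banned=frozenset()):
    """Pick the best candidate waypoint from a prediction heatmap.

    heatmap:   (H, W) waypoint scores over a local top-down grid.
    occupancy: (H, W) bool mask from a simulated 2D LiDAR, True = obstacle.
    banned:    set of (row, col) cells that previously led to a collision.
    """
    scores = heatmap.astype(float).copy()
    scores[occupancy] = -np.inf           # occupancy masking of obstacle cells
    for cell in banned:                   # 're-selection after collision'
        scores[cell] = -np.inf
    return np.unravel_index(np.argmax(scores), scores.shape)
```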
Abstract: Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or a single sub-instruction to the corresponding trajectory. However, another critical problem, achieving fine-grained alignment at the entity level, is seldom considered. To address this problem, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To enable this adaptive pre-training paradigm, we first introduce grounded entity-landmark human annotations into the Room-to-Room (R2R) dataset, yielding GEL-R2R. Additionally, we adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity phrases and environment landmarks. Finally, we validate our model on two downstream benchmarks: VLN with descriptive instructions (R2R) and dialogue instructions (CVDN). Comprehensive experiments show that our GELA model achieves state-of-the-art results on both tasks, demonstrating its effectiveness and generalizability.
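A plausible form of the third objective is a symmetric contrastive loss between matched entity-phrase and landmark-region embeddings, sketched below; the loss shape, temperature, and tensor names are assumptions, and the actual GELA objective may differ.

```python
import torch
import torch.nn.functional as F

def entity_landmark_alignment_loss(entity_emb, landmark_emb, tau=0.07):
    """Symmetric contrastive alignment over a batch of annotated pairs.

    Row i of entity_emb and landmark_emb is assumed to be a grounded
    entity-landmark pair; off-diagonal rows act as negatives.
    """
    e = F.normalize(entity_emb, dim=-1)
    l = F.normalize(landmark_emb, dim=-1)
    logits = e @ l.t() / tau                            # (B, B) similarities
    target = torch.arange(e.size(0), device=e.device)   # diagonal = matches
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))
```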
Abstract: Planimation is a modular and extensible open-source framework to visualise sequential solutions of planning problems specified in PDDL. We introduce a preliminary declarative, PDDL-like animation profile specification that is expressive enough to synthesise animations for arbitrary initial states and goals of a benchmark with just a single profile.