Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jia Pan

Aerial Vision-and-Language Navigation with Grid-based View Selection and Map Construction

Mar 14, 2025

Ganlong Zhao, Guanbin Li, Jia Pan, Yizhou Yu

Abstract:Aerial Vision-and-Language Navigation (Aerial VLN) aims to obtain an unmanned aerial vehicle agent to navigate aerial 3D environments following human instruction. Compared to ground-based VLN, aerial VLN requires the agent to decide the next action in both horizontal and vertical directions based on the first-person view observations. Previous methods struggle to perform well due to the longer navigation path, more complicated 3D scenes, and the neglect of the interplay between vertical and horizontal actions. In this paper, we propose a novel grid-based view selection framework that formulates aerial VLN action prediction as a grid-based view selection task, incorporating vertical action prediction in a manner that accounts for the coupling with horizontal actions, thereby enabling effective altitude adjustments. We further introduce a grid-based bird's eye view map for aerial space to fuse the visual information in the navigation history, provide contextual scene information, and mitigate the impact of obstacles. Finally, a cross-modal transformer is adopted to explicitly align the long navigation history with the instruction. We demonstrate the superiority of our method in extensive experiments.

* Under Submission

Via

Access Paper or Ask Questions

Generative Artificial Intelligence in Robotic Manipulation: A Survey

Mar 05, 2025

Kun Zhang, Peng Yun, Jun Cen, Junhao Cai, Didi Zhu, Hangjie Yuan, Chao Zhao, Tao Feng, Michael Yu Wang, Qifeng Chen(+3 more)

Abstract:This survey provides a comprehensive review on recent advancements of generative learning models in robotic manipulation, addressing key challenges in the field. Robotic manipulation faces critical bottlenecks, including significant challenges in insufficient data and inefficient data acquisition, long-horizon and complex task planning, and the multi-modality reasoning ability for robust policy learning performance across diverse environments. To tackle these challenges, this survey introduces several generative model paradigms, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), diffusion models, probabilistic flow models, and autoregressive models, highlighting their strengths and limitations. The applications of these models are categorized into three hierarchical layers: the Foundation Layer, focusing on data generation and reward generation; the Intermediate Layer, covering language, code, visual, and state generation; and the Policy Layer, emphasizing grasp generation and trajectory generation. Each layer is explored in detail, along with notable works that have advanced the state of the art. Finally, the survey outlines future research directions and challenges, emphasizing the need for improved efficiency in data utilization, better handling of long-horizon tasks, and enhanced generalization across diverse robotic scenarios. All the related resources, including research papers, open-source data, and projects, are collected for the community in https://github.com/GAI4Manipulation/AwesomeGAIManipulation

Via

Access Paper or Ask Questions

Col-OLHTR: A Novel Framework for Multimodal Online Handwritten Text Recognition

Feb 10, 2025

Chenyu Liu, Jinshui Hu, Baocai Yin, Jia Pan, Bing Yin, Jun Du, Qingfeng Liu

Abstract:Online Handwritten Text Recognition (OLHTR) has gained considerable attention for its diverse range of applications. Current approaches usually treat OLHTR as a sequence recognition task, employing either a single trajectory or image encoder, or multi-stream encoders, combined with a CTC or attention-based recognition decoder. However, these approaches face several drawbacks: 1) single encoders typically focus on either local trajectories or visual regions, lacking the ability to dynamically capture relevant global features in challenging cases; 2) multi-stream encoders, while more comprehensive, suffer from complex structures and increased inference costs. To tackle this, we propose a Collaborative learning-based OLHTR framework, called Col-OLHTR, that learns multimodal features during training while maintaining a single-stream inference process. Col-OLHTR consists of a trajectory encoder, a Point-to-Spatial Alignment (P2SA) module, and an attention-based decoder. The P2SA module is designed to learn image-level spatial features through trajectory-encoded features and 2D rotary position embeddings. During training, an additional image-stream encoder-decoder is collaboratively trained to provide supervision for P2SA features. At inference, the extra streams are discarded, and only the P2SA module is used and merged before the decoder, simplifying the process while preserving high performance. Extensive experimental results on several OLHTR benchmarks demonstrate the state-of-the-art (SOTA) performance, proving the effectiveness and robustness of our design.

* ICASSP 2025

Via

Access Paper or Ask Questions

Optimizing Efficiency of Mixed Traffic through Reinforcement Learning: A Topology-Independent Approach and Benchmark

Jan 28, 2025

Chuyang Xiao, Dawei Wang, Xinzheng Tang, Jia Pan, Yuexin Ma

Figure 1 for Optimizing Efficiency of Mixed Traffic through Reinforcement Learning: A Topology-Independent Approach and Benchmark

Figure 2 for Optimizing Efficiency of Mixed Traffic through Reinforcement Learning: A Topology-Independent Approach and Benchmark

Figure 3 for Optimizing Efficiency of Mixed Traffic through Reinforcement Learning: A Topology-Independent Approach and Benchmark

Figure 4 for Optimizing Efficiency of Mixed Traffic through Reinforcement Learning: A Topology-Independent Approach and Benchmark

Abstract:This paper presents a mixed traffic control policy designed to optimize traffic efficiency across diverse road topologies, addressing issues of congestion prevalent in urban environments. A model-free reinforcement learning (RL) approach is developed to manage large-scale traffic flow, using data collected by autonomous vehicles to influence human-driven vehicles. A real-world mixed traffic control benchmark is also released, which includes 444 scenarios from 20 countries, representing a wide geographic distribution and covering a variety of scenarios and road topologies. This benchmark serves as a foundation for future research, providing a realistic simulation environment for the development of effective policies. Comprehensive experiments demonstrate the effectiveness and adaptability of the proposed method, achieving better performance than existing traffic control methods in both intersection and roundabout scenarios. To the best of our knowledge, this is the first project to introduce a real-world complex scenarios mixed traffic control benchmark. Videos and code of our work are available at https://sites.google.com/berkeley.edu/mixedtrafficplus/home

* accepted to ICRA 2025

Via

Access Paper or Ask Questions

LP-ICP: General Localizability-Aware Point Cloud Registration for Robust Localization in Extreme Unstructured Environments

Jan 05, 2025

Haosong Yue, Qingyuan Xu, Fei Chen, Jia Pan, Weihai Chen

Figure 1 for LP-ICP: General Localizability-Aware Point Cloud Registration for Robust Localization in Extreme Unstructured Environments

Figure 2 for LP-ICP: General Localizability-Aware Point Cloud Registration for Robust Localization in Extreme Unstructured Environments

Figure 3 for LP-ICP: General Localizability-Aware Point Cloud Registration for Robust Localization in Extreme Unstructured Environments

Figure 4 for LP-ICP: General Localizability-Aware Point Cloud Registration for Robust Localization in Extreme Unstructured Environments

Abstract:The Iterative Closest Point (ICP) algorithm is a crucial component of LiDAR-based SLAM algorithms. However, its performance can be negatively affected in unstructured environments that lack features and geometric structures, leading to low accuracy and poor robustness in localization and mapping. It is known that degeneracy caused by the lack of geometric constraints can lead to errors in 6-DOF pose estimation along ill-conditioned directions. Therefore, there is a need for a broader and more fine-grained degeneracy detection and handling method. This paper proposes a new point cloud registration framework, LP-ICP, that combines point-to-line and point-to-plane distance metrics in the ICP algorithm, with localizability detection and handling. LP-ICP consists of a localizability detection module and an optimization module. The localizability detection module performs localizability analysis by utilizing the correspondences between edge points (with low local smoothness) to lines and planar points (with high local smoothness) to planes between the scan and the map. The localizability contribution of individual correspondence constraints can be applied to a broader range. The optimization module adds additional soft and hard constraints to the optimization equations based on the localizability category. This allows the pose to be constrained along ill-conditioned directions, with updates either tending towards the constraint value or leaving the initial estimate unchanged. This improves accuracy and reduces fluctuations. The proposed method is extensively evaluated through experiments on both simulation and real-world datasets, demonstrating higher or comparable accuracy than the state-of-the-art methods. The dataset and code of this paper will also be open-sourced at https://github.com/xuqingyuan2000/LP-ICP.

* 18 Pages, 8 Figures Submitted to IEEE Transactions on Automation Science and Engineering

Via

Access Paper or Ask Questions

Understanding Particles From Video: Property Estimation of Granular Materials via Visuo-Haptic Learning

Dec 03, 2024

Zeqing Zhang, Guangze Zheng, Xuebo Ji, Guanqi Chen, Ruixing Jia, Wentao Chen, Guanhua Chen, Liangjun Zhang, Jia Pan

Figure 1 for Understanding Particles From Video: Property Estimation of Granular Materials via Visuo-Haptic Learning

Figure 2 for Understanding Particles From Video: Property Estimation of Granular Materials via Visuo-Haptic Learning

Figure 3 for Understanding Particles From Video: Property Estimation of Granular Materials via Visuo-Haptic Learning

Figure 4 for Understanding Particles From Video: Property Estimation of Granular Materials via Visuo-Haptic Learning

Abstract:Granular materials (GMs) are ubiquitous in daily life. Understanding their properties is also important, especially in agriculture and industry. However, existing works require dedicated measurement equipment and also need large human efforts to handle a large number of particles. In this paper, we introduce a method for estimating the relative values of particle size and density from the video of the interaction with GMs. It is trained on a visuo-haptic learning framework inspired by a contact model, which reveals the strong correlation between GM properties and the visual-haptic data during the probe-dragging in the GMs. After training, the network can map the visual modality well to the haptic signal and implicitly characterize the relative distribution of particle properties in its latent embeddings, as interpreted in that contact model. Therefore, we can analyze GM properties using the trained encoder, and only visual information is needed without extra sensory modalities and human efforts for labeling. The presented GM property estimator has been extensively validated via comparison and ablation experiments. The generalization capability has also been evaluated and a real-world application on the beach is also demonstrated. Experiment videos are available at \url{https://sites.google.com/view/gmwork/vhlearning} .

* IEEE Robotics and Automation Letters, with ICRA 2025

Via

Access Paper or Ask Questions

A Haptic-Based Proximity Sensing System for Buried Object in Granular Material

Nov 26, 2024

Zeqing Zhang, Ruixing Jia, Youcan Yan, Ruihua Han, Shijie Lin, Qian Jiang, Liangjun Zhang, Jia Pan

Figure 1 for A Haptic-Based Proximity Sensing System for Buried Object in Granular Material

Figure 2 for A Haptic-Based Proximity Sensing System for Buried Object in Granular Material

Figure 3 for A Haptic-Based Proximity Sensing System for Buried Object in Granular Material

Figure 4 for A Haptic-Based Proximity Sensing System for Buried Object in Granular Material

Abstract:The proximity perception of objects in granular materials is significant, especially for applications like minesweeping. However, due to particles' opacity and complex properties, existing proximity sensors suffer from high costs from sophisticated hardware and high user-cost from unintuitive results. In this paper, we propose a simple yet effective proximity sensing system for underground stuff based on the haptic feedback of the sensor-granules interaction. We study and employ the unique characteristic of particles -- failure wedge zone, and combine the machine learning method -- Gaussian process regression, to identify the force signal changes induced by the proximity of objects, so as to achieve near-field perception. Furthermore, we design a novel trajectory to control the probe searching in granules for a wide range of perception. Also, our proximity sensing system can adaptively determine optimal parameters for robustness operation in different particles. Experiments demonstrate our system can perceive underground objects over 0.5 to 7 cm in advance among various materials.

* The 40th International Symposium of Robotics Research (ISRR). Long Beach, California, USA, December 8-12 2024

Via

Access Paper or Ask Questions

EROAM: Event-based Camera Rotational Odometry and Mapping in Real-time

Nov 17, 2024

Wanli Xing, Shijie Lin, Linhan Yang, Zeqing Zhang, Yanjun Du, Maolin Lei, Yipeng Pan, Jia Pan

Abstract:This paper presents EROAM, a novel event-based rotational odometry and mapping system that achieves real-time, accurate camera rotation estimation. Unlike existing approaches that rely on event generation models or contrast maximization, EROAM employs a spherical event representation by projecting events onto a unit sphere and introduces Event Spherical Iterative Closest Point (ES-ICP), a novel geometric optimization framework designed specifically for event camera data. The spherical representation simplifies rotational motion formulation while enabling continuous mapping for enhanced spatial resolution. Combined with parallel point-to-line optimization, EROAM achieves efficient computation without compromising accuracy. Extensive experiments on both synthetic and real-world datasets show that EROAM significantly outperforms state-of-the-art methods in terms of accuracy, robustness, and computational efficiency. Our method maintains consistent performance under challenging conditions, including high angular velocities and extended sequences, where other methods often fail or show significant drift. Additionally, EROAM produces high-quality panoramic reconstructions with preserved fine structural details.

Via

Access Paper or Ask Questions

DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions

Nov 11, 2024

Shu-Tong Niu, Jun Du, Ruo-Yu Wang, Gao-Bin Yang, Tian Gao, Jia Pan, Yu Hu

Figure 1 for DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions

Figure 2 for DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions

Figure 3 for DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions

Figure 4 for DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions

Abstract:We propose a single-channel Deep Cascade Fusion of Diarization and Separation (DCF-DS) framework for back-end speech recognition, combining neural speaker diarization (NSD) and speech separation (SS). First, we sequentially integrate the NSD and SS modules within a joint training framework, enabling the separation module to leverage speaker time boundaries from the diarization module effectively. Then, to complement DCF-DS training, we introduce a window-level decoding scheme that allows the DCF-DS framework to handle the sparse data convergence instability (SDCI) problem. We also explore using an NSD system trained on real datasets to provide more accurate speaker boundaries during decoding. Additionally, we incorporate an optional multi-input multi-output speech enhancement module (MIMO-SE) within the DCF-DS framework, which offers further performance gains. Finally, we enhance diarization results by re-clustering DCF-DS outputs, improving ASR accuracy. By incorporating the DCF-DS method, we achieved first place in the realistic single-channel track of the CHiME-8 NOTSOFAR-1 challenge. We also perform the evaluation on the open LibriCSS dataset, achieving a new state-of-the-art performance on single-channel speech recognition.

Via

Access Paper or Ask Questions

DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Oct 17, 2024

Hanbo Cheng, Limin Lin, Chenyu Liu, Pengcheng Xia, Pengfei Hu, Jiefeng Ma, Jun Du, Jia Pan

Abstract:Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly at https://github.com/Hanbo-Cheng/DAWN-pytorch.

Via

Access Paper or Ask Questions