Abstract:Cooperative path planning, a crucial aspect of multi-agent systems research, serves a variety of sectors, including military, agriculture, and industry. Many existing algorithms, however, come with certain limitations, such as simplified kinematic models and inadequate support for multiple group scenarios. Focusing on the planning problem associated with a nonholonomic Ackermann model for Unmanned Ground Vehicles (UGV), we propose a leaderless, hierarchical Search-Based Cooperative Motion Planning (SCMP) method. The high-level utilizes a binary conflict search tree to minimize runtime, while the low-level fabricates kinematically feasible, collision-free paths that are shape-constrained. Our algorithm can adapt to scenarios featuring multiple groups with different shapes, outlier agents, and elaborate obstacles. We conduct algorithm comparisons, performance testing, simulation, and real-world testing, verifying the effectiveness and applicability of our algorithm. The implementation of our method will be open-sourced at https://github.com/WYCUniverStar/SCMP.
Abstract:Graph foundation models (GFMs) have recently gained significant attention. However, the unique data processing and evaluation setups employed by different studies hinder a deeper understanding of their progress. Additionally, current research tends to focus on specific subsets of graph learning tasks, such as structural tasks, node-level tasks, or classification tasks. As a result, they often incorporate specialized modules tailored to particular task types, losing their applicability to other graph learning tasks and contradicting the original intent of foundation models to be universal. Therefore, to enhance consistency, coverage, and diversity across domains, tasks, and research interests within the graph learning community in the evaluation of GFMs, we propose GFMBench-a systematic and comprehensive benchmark comprising 26 datasets. Moreover, we introduce LangGFM, a novel GFM that relies entirely on large language models. By revisiting and exploring the effective graph textualization principles, as well as repurposing successful techniques from graph augmentation and graph self-supervised learning within the language space, LangGFM achieves performance on par with or exceeding the state of the art across GFMBench, which can offer us new perspectives, experiences, and baselines to drive forward the evolution of GFMs.
Abstract:Large Language Models (LLMs) could struggle to fully understand legal theories and perform complex legal reasoning tasks. In this study, we introduce a challenging task (confusing charge prediction) to better evaluate LLMs' understanding of legal theories and reasoning capabilities. We also propose a novel framework: Multi-Agent framework for improving complex Legal Reasoning capability (MALR). MALR employs non-parametric learning, encouraging LLMs to automatically decompose complex legal tasks and mimic human learning process to extract insights from legal rules, helping LLMs better understand legal theories and enhance their legal reasoning abilities. Extensive experiments on multiple real-world datasets demonstrate that the proposed framework effectively addresses complex reasoning issues in practical scenarios, paving the way for more reliable applications in the legal domain.
Abstract:LiDAR sensors are critical for autonomous driving and robotics applications due to their ability to provide accurate range measurements and their robustness to lighting conditions. However, airborne particles, such as fog, rain, snow, and dust, will degrade its performance and it is inevitable to encounter these inclement environmental conditions outdoors. It would be a straightforward approach to remove them by supervised semantic segmentation. But annotating these particles point wisely is too laborious. To address this problem and enhance the perception under inclement conditions, we develop two dynamic filtering methods called Dynamic Multi-threshold Noise Removal (DMNR) and DMNR-H by accurate analysis of the position distribution and intensity characteristics of noisy points and clean points on publicly available WADS and DENSE datasets. Both DMNR and DMNR-H outperform state-of-the-art unsupervised methods by a significant margin on the two datasets and are slightly better than supervised deep learning-based methods. Furthermore, our methods are more robust to different LiDAR sensors and airborne particles, such as snow and fog.
Abstract:Deep normal estimators have made great strides on synthetic benchmarks. Unfortunately, their performance dramatically drops on the real scan data since they are supervised only on synthetic datasets. The point-wise annotation of ground truth normals is vulnerable to inefficiency and inaccuracies, which totally makes it impossible to build perfect real datasets for supervised deep learning. To overcome the challenge, we propose a multi-sample consensus paradigm for unsupervised normal estimation. The paradigm consists of multi-candidate sampling, candidate rejection, and mode determination. The latter two are driven by neighbor point consensus and candidate consensus respectively. Two primary implementations of the paradigm, MSUNE and MSUNE-Net, are proposed. MSUNE minimizes a candidate consensus loss in mode determination. As a robust optimization method, it outperforms the cutting-edge supervised deep learning methods on real data at the cost of longer runtime for sampling enough candidate normals for each query point. MSUNE-Net, the first unsupervised deep normal estimator as far as we know, significantly promotes the multi-sample consensus further. It transfers the three online stages of MSUNE to offline training. Thereby its inference time is 100 times faster. Besides that, more accurate inference is achieved, since the candidates of query points from similar patches can form a sufficiently large candidate set implicitly in MSUNE-Net. Comprehensive experiments demonstrate that the two proposed unsupervised methods are noticeably superior to some supervised deep normal estimators on the most common synthetic dataset. More importantly, they show better generalization ability and outperform all the SOTA conventional and deep methods on three real datasets: NYUV2, KITTI, and a dataset from PCV [1].
Abstract:Most of the existing audio-driven 3D facial animation methods suffered from the lack of detailed facial expression and head pose, resulting in unsatisfactory experience of human-robot interaction. In this paper, a novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention. To synthesize real and detailed expression, a hierarchical decomposition strategy is proposed to encode the audio signal into both a global latent feature and a local vertex-wise control feature. Then the local and global audio features combined with vertex spatial features are used to predict the final consistent facial animation via a graph convolutional neural network by fusing the intrinsic spatial topology structure of the face model and the corresponding semantic feature of the audio. To accomplish pose-controllable animation, we introduce a novel pose attribute augmentation method by utilizing the 2D talking face technique. Experimental results indicate that the proposed method can produce more realistic facial expressions and head posture movements. Qualitative and quantitative experiments show that the proposed method achieves competitive performance against state-of-the-art methods.
Abstract:Previous works on key information extraction from visually rich documents (VRDs) mainly focus on labeling the text within each bounding box (i.e., semantic entity), while the relations in-between are largely unexplored. In this paper, we adapt the popular dependency parsing model, the biaffine parser, to this entity relation extraction task. Being different from the original dependency parsing model which recognizes dependency relations between words, we identify relations between groups of words with layout information instead. We have compared different representations of the semantic entity, different VRD encoders, and different relation decoders. The results demonstrate that our proposed model achieves 65.96% F1 score on the FUNSD dataset. As for the real-world application, our model has been applied to the in-house customs data, achieving reliable performance in the production setting.
Abstract:It is challenging learning from demonstrated observation-only trajectories in a non-time-aligned environment because most imitation learning methods aim to imitate experts by following the demonstration step-by-step. However, aligned demonstrations are seldom obtainable in real-world scenarios. In this work, we propose a new imitation learning approach called Hierarchical Imitation Learning from Observation(HILONet), which adopts a hierarchical structure to choose feasible sub-goals from demonstrated observations dynamically. Our method can solve all kinds of tasks by achieving these sub-goals, whether it has a single goal position or not. We also present three different ways to increase sample efficiency in the hierarchical structure. We conduct extensive experiments using several environments. The results show the improvement in both performance and learning efficiency.
Abstract:Multi-agent path finding in formation has many potential real-world applications like mobile warehouse robots. However, previous multi-agent path finding (MAPF) methods hardly take formation into consideration. Furthermore, they are usually centralized planners and require the whole state of the environment. Other decentralized partially observable approaches to MAPF are reinforcement learning (RL) methods. However, these RL methods encounter difficulties when learning path finding and formation problem at the same time. In this paper, we propose a novel decentralized partially observable RL algorithm that uses a hierarchical structure to decompose the multi objective task into unrelated ones. It also calculates a theoretical weight that makes every task reward has equal influence on the final RL value function. Additionally, we introduce a communication method that helps agents cooperate with each other. Experiments in simulation show that our method outperforms other end-to-end RL methods and our method can naturally scale to large world sizes where centralized planner struggles. We also deploy and validate our method in a real world scenario.
Abstract:Multi-person pose estimation is one of the mainstream tasks of computer vision. Existing methods include the top-down methods which need additional human detector and the bottom-up methods which need to complete heuristic grouping after predicting all human keypoints. They all need to deal with the grouping and detection of keypoints separately, resulting in low efficiency. In this work, we propose an end-to-end network framework for multi-person pose regression to predict the instance-aware keypoints directly. This framework uses a cascaded manner: the first stage provides basic estimation. Then we propose the OKSFilter which is used to remove low-quality predictions, so that the second stage could focus on better results for further optimization. In addition, in order to quantify the quality of the predicted poses, we also propose the pose scoring module(PSM), so that when using non-maximum suppression(NMS) in the inference, the correct type and high-quality poses are preserved. We have verified on the COCO keypoint benchmark. The experiments show that our multi-person pose regression network is feasible and effective, and the two newly proposed modules are helpful to improve the performance of the model.