Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yixing Gao

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics

Jan 13, 2025

Tze Ho Elden Tse, Runyang Feng, Linfang Zheng, Jiho Park, Yixing Gao, Jihie Kim, Ales Leonardis, Hyung Jin Chang

Abstract:With the availability of egocentric 3D hand-object interaction datasets, there is increasing interest in developing unified models for hand-object pose estimation and action recognition. However, existing methods still struggle to recognise seen actions on unseen objects due to the limitations in representing object shape and movement using 3D bounding boxes. Additionally, the reliance on object templates at test time limits their generalisability to unseen objects. To address these challenges, we propose to leverage superquadrics as an alternative 3D object representation to bounding boxes and demonstrate their effectiveness on both template-free object reconstruction and action recognition tasks. Moreover, as we find that pure appearance-based methods can outperform the unified methods, the potential benefits from 3D geometric information remain unclear. Therefore, we study the compositionality of actions by considering a more challenging task where the training combinations of verbs and nouns do not overlap with the testing split. We extend H2O and FPHA datasets with compositional splits and design a novel collaborative learning framework that can explicitly reason about the geometric relations between hands and the manipulated object. Through extensive quantitative and qualitative evaluations, we demonstrate significant improvements over the state-of-the-arts in (compositional) action recognition.

* Accepted to AAAI 2025

Via

Access Paper or Ask Questions

JointLoc: A Real-time Visual Localization Framework for Planetary UAVs Based on Joint Relative and Absolute Pose Estimation

May 13, 2024

Xubo Luo, Xue Wan, Yixing Gao, Yaolin Tian, Wei Zhang, Leizheng Shu

Abstract:Unmanned aerial vehicles (UAVs) visual localization in planetary aims to estimate the absolute pose of the UAV in the world coordinate system through satellite maps and images captured by on-board cameras. However, since planetary scenes often lack significant landmarks and there are modal differences between satellite maps and UAV images, the accuracy and real-time performance of UAV positioning will be reduced. In order to accurately determine the position of the UAV in a planetary scene in the absence of the global navigation satellite system (GNSS), this paper proposes JointLoc, which estimates the real-time UAV position in the world coordinate system by adaptively fusing the absolute 2-degree-of-freedom (2-DoF) pose and the relative 6-degree-of-freedom (6-DoF) pose. Extensive comparative experiments were conducted on a proposed planetary UAV image cross-modal localization dataset, which contains three types of typical Martian topography generated via a simulation engine as well as real Martian UAV images from the Ingenuity helicopter. JointLoc achieved a root-mean-square error of 0.237m in the trajectories of up to 1,000m, compared to 0.594m and 0.557m for ORB-SLAM2 and ORB-SLAM3 respectively. The source code will be available at https://github.com/LuoXubo/JointLoc.

* 8 pages

Via

Access Paper or Ask Questions

Revealing the Two Sides of Data Augmentation: An Asymmetric Distillation-based Win-Win Solution for Open-Set Recognition

Apr 28, 2024

Yunbing Jia, Xiaoyu Kong, Fan Tang, Yixing Gao, Weiming Dong, Yi Yang

Figure 1 for Revealing the Two Sides of Data Augmentation: An Asymmetric Distillation-based Win-Win Solution for Open-Set Recognition

Figure 2 for Revealing the Two Sides of Data Augmentation: An Asymmetric Distillation-based Win-Win Solution for Open-Set Recognition

Figure 3 for Revealing the Two Sides of Data Augmentation: An Asymmetric Distillation-based Win-Win Solution for Open-Set Recognition

Figure 4 for Revealing the Two Sides of Data Augmentation: An Asymmetric Distillation-based Win-Win Solution for Open-Set Recognition

Abstract:In this paper, we reveal the two sides of data augmentation: enhancements in closed-set recognition correlate with a significant decrease in open-set recognition. Through empirical investigation, we find that multi-sample-based augmentations would contribute to reducing feature discrimination, thereby diminishing the open-set criteria. Although knowledge distillation could impair the feature via imitation, the mixed feature with ambiguous semantics hinders the distillation. To this end, we propose an asymmetric distillation framework by feeding teacher model extra raw data to enlarge the benefit of teacher. Moreover, a joint mutual information loss and a selective relabel strategy are utilized to alleviate the influence of hard mixed samples. Our method successfully mitigates the decline in open-set and outperforms SOTAs by 2%~3% AUROC on the Tiny-ImageNet dataset and experiments on large-scale dataset ImageNet-21K demonstrate the generalization of our method.

Via

Access Paper or Ask Questions

DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation

Aug 05, 2023

Runyang Feng, Yixing Gao, Tze Ho Elden Tse, Xueqing Ma, Hyung Jin Chang

Figure 1 for DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation

Figure 2 for DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation

Figure 3 for DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation

Figure 4 for DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation

Abstract:Denoising diffusion probabilistic models that were initially proposed for realistic image generation have recently shown success in various perception tasks (e.g., object detection and image segmentation) and are increasingly gaining attention in computer vision. However, extending such models to multi-frame human pose estimation is non-trivial due to the presence of the additional temporal dimension in videos. More importantly, learning representations that focus on keypoint regions is crucial for accurate localization of human joints. Nevertheless, the adaptation of the diffusion-based methods remains unclear on how to achieve such objective. In this paper, we present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem. First, to better leverage temporal information, we propose SpatioTemporal Representation Learner which aggregates visual evidences across frames and uses the resulting features in each denoising step as a condition. In addition, we present a mechanism called Lookup-based MultiScale Feature Interaction that determines the correlations between local joints and global contexts across multiple scales. This mechanism generates delicate representations that focus on keypoint regions. Altogether, by extending diffusion models, we show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model. DiffPose sets new state-of-the-art results on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21.

* This paper is accepted to ICCV 2023

Via

Access Paper or Ask Questions

Clothes Grasping and Unfolding Based on RGB-D Semantic Segmentation

May 08, 2023

Xingyu Zhu, Xin Wang, Jonathan Freer, Hyung Jin Chang, Yixing Gao

Abstract:Clothes grasping and unfolding is a core step in robotic-assisted dressing. Most existing works leverage depth images of clothes to train a deep learning-based model to recognize suitable grasping points. These methods often utilize physics engines to synthesize depth images to reduce the cost of real labeled data collection. However, the natural domain gap between synthetic and real images often leads to poor performance of these methods on real data. Furthermore, these approaches often struggle in scenarios where grasping points are occluded by the clothing item itself. To address the above challenges, we propose a novel Bi-directional Fractal Cross Fusion Network (BiFCNet) for semantic segmentation, enabling recognition of graspable regions in order to provide more possibilities for grasping. Instead of using depth images only, we also utilize RGB images with rich color features as input to our network in which the Fractal Cross Fusion (FCF) module fuses RGB and depth data by considering global complex features based on fractal geometry. To reduce the cost of real data collection, we further propose a data augmentation method based on an adversarial strategy, in which the color and geometric transformations simultaneously process RGB and depth data while maintaining the label correspondence. Finally, we present a pipeline for clothes grasping and unfolding from the perspective of semantic segmentation, through the addition of a strategy for grasp point selection from segmentation regions based on clothing flatness measures, while taking into account the grasping direction. We evaluate our BiFCNet on the public dataset NYUDv2 and obtained comparable performance to current state-of-the-art models. We also deploy our model on a Baxter robot, running extensive grasping and unfolding experiments as part of our ablation studies, achieving an 84% success rate.

* This paper is accepted to ICRA 2023

Via

Access Paper or Ask Questions

Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video

Mar 15, 2023

Runyang Feng, Yixing Gao, Xueqing Ma, Tze Ho Elden Tse, Hyung Jin Chang

Figure 1 for Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video

Figure 2 for Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video

Figure 3 for Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video

Figure 4 for Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video

Abstract:Temporal modeling is crucial for multi-frame human pose estimation. Most existing methods directly employ optical flow or deformable convolution to predict full-spectrum motion fields, which might incur numerous irrelevant cues, such as a nearby person or background. Without further efforts to excavate meaningful motion priors, their results are suboptimal, especially in complicated spatiotemporal interactions. On the other hand, the temporal difference has the ability to encode representative motion information which can potentially be valuable for pose estimation but has not been fully exploited. In this paper, we present a novel multi-frame human pose estimation framework, which employs temporal differences across frames to model dynamic contexts and engages mutual information objectively to facilitate useful motion information disentanglement. To be specific, we design a multi-stage Temporal Difference Encoder that performs incremental cascaded learning conditioned on multi-stage feature difference sequences to derive informative motion representation. We further propose a Representation Disentanglement module from the mutual information perspective, which can grasp discriminative task-relevant motion signals by explicitly defining useful and noisy constituents of the raw motion features and minimizing their mutual information. These place us to rank No.1 in the Crowd Pose Estimation in Complex Events Challenge on benchmark dataset HiEve, and achieve state-of-the-art performance on three benchmarks PoseTrack2017, PoseTrack2018, and PoseTrack21.

* This paper is accepted to CVPR 2023

Via

Access Paper or Ask Questions

Learning the Network of Graphs for Graph Neural Networks

Oct 08, 2022

Yixiang Shan, Jielong Yang, Xing Liu, Yixing Gao, Hechang Chen, Shuzhi Sam Ge

Figure 1 for Learning the Network of Graphs for Graph Neural Networks

Figure 2 for Learning the Network of Graphs for Graph Neural Networks

Figure 3 for Learning the Network of Graphs for Graph Neural Networks

Figure 4 for Learning the Network of Graphs for Graph Neural Networks

Abstract:Graph neural networks (GNNs) have achieved great success in many scenarios with graph-structured data. However, in many real applications, there are three issues when applying GNNs: graphs are unknown, nodes have noisy features, and graphs contain noisy connections. Aiming at solving these problems, we propose a new graph neural network named as GL-GNN. Our model includes multiple sub-modules, each sub-module selects important data features and learn the corresponding key relation graph of data samples when graphs are unknown. GL-GNN further obtains the network of graphs by learning the network of sub-modules. The learned graphs are further fused using an aggregation method over the network of graphs. Our model solves the first issue by simultaneously learning multiple relation graphs of data samples as well as a relation network of graphs, and solves the second and the third issue by selecting important data features as well as important data sample relations. We compare our method with 14 baseline methods on seven datasets when the graph is unknown and 11 baseline methods on two datasets when the graph is known. The results show that our method achieves better accuracies than the baseline methods and is capable of selecting important features and graph edges from the dataset. Our code will be publicly available at \url{https://github.com/Looomo/GL-GNN}.

Via

Access Paper or Ask Questions

Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation

Apr 03, 2022

Zhenguang Liu, Runyang Feng, Haoming Chen, Shuang Wu, Yixing Gao, Yunjun Gao, Xiang Wang

Figure 1 for Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation

Figure 2 for Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation

Figure 3 for Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation

Figure 4 for Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation

Abstract:Multi-frame human pose estimation has long been a compelling and fundamental problem in computer vision. This task is challenging due to fast motion and pose occlusion that frequently occur in videos. State-of-the-art methods strive to incorporate additional visual evidences from neighboring frames (supporting frames) to facilitate the pose estimation of the current frame (key frame). One aspect that has been obviated so far, is the fact that current methods directly aggregate unaligned contexts across frames. The spatial-misalignment between pose features of the current frame and neighboring frames might lead to unsatisfactory results. More importantly, existing approaches build upon the straightforward pose estimation loss, which unfortunately cannot constrain the network to fully leverage useful information from neighboring frames. To tackle these problems, we present a novel hierarchical alignment framework, which leverages coarse-to-fine deformations to progressively update a neighboring frame to align with the current frame at the feature level. We further propose to explicitly supervise the knowledge extraction from neighboring frames, guaranteeing that useful complementary cues are extracted. To achieve this goal, we theoretically analyzed the mutual information between the frames and arrived at a loss that maximizes the task-relevant mutual information. These allow us to rank No.1 in the Multi-frame Person Pose Estimation Challenge on benchmark dataset PoseTrack2017, and obtain state-of-the-art performance on benchmarks Sub-JHMDB and Pose-Track2018. Our code is released at https://github. com/Pose-Group/FAMI-Pose, hoping that it will be useful to the community.

* This paper is accepted to CVPR2022 (ORAL presentation)

Via

Access Paper or Ask Questions