Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hui Liang

Efficient Video Instance Segmentation via Tracklet Query and Proposal

Mar 03, 2022

Jialian Wu, Sudhir Yarram, Hui Liang, Tian Lan, Junsong Yuan, Jayan Eledath, Gerard Medioni

Figure 1 for Efficient Video Instance Segmentation via Tracklet Query and Proposal

Figure 2 for Efficient Video Instance Segmentation via Tracklet Query and Proposal

Figure 3 for Efficient Video Instance Segmentation via Tracklet Query and Proposal

Figure 4 for Efficient Video Instance Segmentation via Tracklet Query and Proposal

Abstract:Video Instance Segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos. Recent clip-level VIS takes a short video clip as input each time showing stronger performance than frame-level VIS (tracking-by-segmentation), as more temporal context from multiple frames is utilized. Yet, most clip-level methods are neither end-to-end learnable nor real-time. These limitations are addressed by the recent VIS transformer (VisTR) which performs VIS end-to-end within a clip. However, VisTR suffers from long training time due to its frame-wise dense attention. In addition, VisTR is not fully end-to-end learnable in multiple video clips as it requires a hand-crafted data association to link instance tracklets between successive clips. This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference. At the core are tracklet query and tracklet proposal that associate and segment regions-of-interest (RoIs) across space and time by an iterative query-video interaction. We further propose a correspondence learning that makes tracklets linking between clips end-to-end learnable. Compared to VisTR, EfficientVIS requires 15x fewer training epochs while achieving state-of-the-art accuracy on the YouTube-VIS benchmark. Meanwhile, our method enables whole video instance segmentation in a single end-to-end pass without data association at all.

* Accepted to CVPR 2022

Via

Access Paper or Ask Questions

Robust 3D Hand Pose Estimation in Single Depth Images: from Single-View CNN to Multi-View CNNs

Dec 27, 2016

Liuhao Ge, Hui Liang, Junsong Yuan, Daniel Thalmann

Figure 1 for Robust 3D Hand Pose Estimation in Single Depth Images: from Single-View CNN to Multi-View CNNs

Figure 2 for Robust 3D Hand Pose Estimation in Single Depth Images: from Single-View CNN to Multi-View CNNs

Figure 3 for Robust 3D Hand Pose Estimation in Single Depth Images: from Single-View CNN to Multi-View CNNs

Figure 4 for Robust 3D Hand Pose Estimation in Single Depth Images: from Single-View CNN to Multi-View CNNs

Abstract:Articulated hand pose estimation plays an important role in human-computer interaction. Despite the recent progress, the accuracy of existing methods is still not satisfactory, partially due to the difficulty of embedded high-dimensional and non-linear regression problem. Different from the existing discriminative methods that regress for the hand pose with a single depth image, we propose to first project the query depth image onto three orthogonal planes and utilize these multi-view projections to regress for 2D heat-maps which estimate the joint positions on each plane. These multi-view heat-maps are then fused to produce final 3D hand pose estimation with learned pose priors. Experiments show that the proposed method largely outperforms state-of-the-art on a challenging dataset. Moreover, a cross-dataset experiment also demonstrates the good generalization ability of the proposed method.

* 9 pages, 9 figures, published at Computer Vision and Pattern Recognition (CVPR) 2016

Via

Access Paper or Ask Questions