Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Cheng-Yen Yang

PackDiT: Joint Human Motion and Text Generation via Mutual Prompting

Jan 27, 2025

Zhongyu Jiang, Wenhao Chai, Zhuoran Zhou, Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang

Figure 1 for PackDiT: Joint Human Motion and Text Generation via Mutual Prompting

Figure 2 for PackDiT: Joint Human Motion and Text Generation via Mutual Prompting

Figure 3 for PackDiT: Joint Human Motion and Text Generation via Mutual Prompting

Figure 4 for PackDiT: Joint Human Motion and Text Generation via Mutual Prompting

Abstract:Human motion generation has advanced markedly with the advent of diffusion models. Most recent studies have concentrated on generating motion sequences based on text prompts, commonly referred to as text-to-motion generation. However, the bidirectional generation of motion and text, enabling tasks such as motion-to-text alongside text-to-motion, has been largely unexplored. This capability is essential for aligning diverse modalities and supports unconditional generation. In this paper, we introduce PackDiT, the first diffusion-based generative model capable of performing various tasks simultaneously, including motion generation, motion prediction, text generation, text-to-motion, motion-to-text, and joint motion-text generation. Our core innovation leverages mutual blocks to integrate multiple diffusion transformers (DiTs) across different modalities seamlessly. We train PackDiT on the HumanML3D dataset, achieving state-of-the-art text-to-motion performance with an FID score of 0.106, along with superior results in motion prediction and in-between tasks. Our experiments further demonstrate that diffusion models are effective for motion-to-text generation, achieving performance comparable to that of autoregressive models.

Via

Access Paper or Ask Questions

BEV-SUSHI: Multi-Target Multi-Camera 3D Detection and Tracking in Bird's-Eye View

Dec 01, 2024

Yizhou Wang, Tim Meinhardt, Orcun Cetintas, Cheng-Yen Yang, Sameer Satish Pusegaonkar, Benjamin Missaoui, Sujit Biswas, Zheng Tang, Laura Leal-Taixé

Figure 1 for BEV-SUSHI: Multi-Target Multi-Camera 3D Detection and Tracking in Bird's-Eye View

Figure 2 for BEV-SUSHI: Multi-Target Multi-Camera 3D Detection and Tracking in Bird's-Eye View

Figure 3 for BEV-SUSHI: Multi-Target Multi-Camera 3D Detection and Tracking in Bird's-Eye View

Figure 4 for BEV-SUSHI: Multi-Target Multi-Camera 3D Detection and Tracking in Bird's-Eye View

Abstract:Object perception from multi-view cameras is crucial for intelligent systems, particularly in indoor environments, e.g., warehouses, retail stores, and hospitals. Most traditional multi-target multi-camera (MTMC) detection and tracking methods rely on 2D object detection, single-view multi-object tracking (MOT), and cross-view re-identification (ReID) techniques, without properly handling important 3D information by multi-view image aggregation. In this paper, we propose a 3D object detection and tracking framework, named BEV-SUSHI, which first aggregates multi-view images with necessary camera calibration parameters to obtain 3D object detections in bird's-eye view (BEV). Then, we introduce hierarchical graph neural networks (GNNs) to track these 3D detections in BEV for MTMC tracking results. Unlike existing methods, BEV-SUSHI has impressive generalizability across different scenes and diverse camera settings, with exceptional capability for long-term association handling. As a result, our proposed BEV-SUSHI establishes the new state-of-the-art on the AICity'24 dataset with 81.22 HOTA, and 95.6 IDF1 on the WildTrack dataset.

Via

Access Paper or Ask Questions

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Nov 18, 2024

Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang

Abstract:The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments. Code and results are available at https://github.com/yangchris11/samurai.

Via

Access Paper or Ask Questions

GTA: Global Tracklet Association for Multi-Object Tracking in Sports

Nov 12, 2024

Jiacheng Sun, Hsiang-Wei Huang, Cheng-Yen Yang, Zhongyu Jiang, Jenq-Neng Hwang

Abstract:Multi-object tracking in sports scenarios has become one of the focal points in computer vision, experiencing significant advancements through the integration of deep learning techniques. Despite these breakthroughs, challenges remain, such as accurately re-identifying players upon re-entry into the scene and minimizing ID switches. In this paper, we propose an appearance-based global tracklet association algorithm designed to enhance tracking performance by splitting tracklets containing multiple identities and connecting tracklets seemingly from the same identity. This method can serve as a plug-and-play refinement tool for any multi-object tracker to further boost their performance. The proposed method achieved a new state-of-the-art performance on the SportsMOT dataset with HOTA score of 81.04%. Similarly, on the SoccerNet dataset, our method enhanced multiple trackers' performance, consistently increasing the HOTA score from 79.41% to 83.11%. These significant and consistent improvements across different trackers and datasets underscore our proposed method's potential impact on the application of sports player tracking. We open-source our project codebase at https://github.com/sjc042/gta-link.git.

* Accepted by ACCV 2024 MLCSA Workshop

Via

Access Paper or Ask Questions

ToddlerAct: A Toddler Action Recognition Dataset for Gross Motor Development Assessment

Aug 31, 2024

Hsiang-Wei Huang, Jiacheng Sun, Cheng-Yen Yang, Zhongyu Jiang, Li-Yu Huang, Jenq-Neng Hwang, Yu-Ching Yeh

Abstract:Assessing gross motor development in toddlers is crucial for understanding their physical development and identifying potential developmental delays or disorders. However, existing datasets for action recognition primarily focus on adults, lacking the diversity and specificity required for accurate assessment in toddlers. In this paper, we present ToddlerAct, a toddler gross motor action recognition dataset, aiming to facilitate research in early childhood development. The dataset consists of video recordings capturing a variety of gross motor activities commonly observed in toddlers aged under three years old. We describe the data collection process, annotation methodology, and dataset characteristics. Furthermore, we benchmarked multiple state-of-the-art methods including image-based and skeleton-based action recognition methods on our datasets. Our findings highlight the importance of domain-specific datasets for accurate assessment of gross motor development in toddlers and lay the foundation for future research in this critical area. Our dataset will be available at https://github.com/ipl-uw/ToddlerAct.

* Accepted by 2024 ECCV ABAW Workshop

Via

Access Paper or Ask Questions

Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check

Jul 18, 2024

Sheng-Yao Kuan, Jen-Hao Cheng, Hsiang-Wei Huang, Wenhao Chai, Cheng-Yen Yang, Hugo Latapie, Gaowen Liu, Bing-Fei Wu, Jenq-Neng Hwang

Figure 1 for Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check

Figure 2 for Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check

Figure 3 for Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check

Figure 4 for Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check

Abstract:In the domain of autonomous driving, the integration of multi-modal perception techniques based on data from diverse sensors has demonstrated substantial progress. Effectively surpassing the capabilities of state-of-the-art single-modality detectors through sensor fusion remains an active challenge. This work leverages the respective advantages of cameras in perspective view and radars in Bird's Eye View (BEV) to greatly enhance overall detection and tracking performance. Our approach, Camera-Radar Associated Fusion Tracking Booster (CRAFTBooster), represents a pioneering effort to enhance radar-camera fusion in the tracking stage, contributing to improved 3D MOT accuracy. The superior experimental results on the K-Radaar dataset, which exhibit 5-6% on IDF1 tracking performance gain, validate the potential of effective sensor fusion in advancing autonomous driving.

* 2024 IEEE Intelligent Vehicles Symposium (IV)

Via

Access Paper or Ask Questions

Exploring Learning-based Motion Models in Multi-Object Tracking

Mar 16, 2024

Hsiang-Wei Huang, Cheng-Yen Yang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang

Figure 1 for Exploring Learning-based Motion Models in Multi-Object Tracking

Figure 2 for Exploring Learning-based Motion Models in Multi-Object Tracking

Figure 3 for Exploring Learning-based Motion Models in Multi-Object Tracking

Figure 4 for Exploring Learning-based Motion Models in Multi-Object Tracking

Abstract:In the field of multi-object tracking (MOT), traditional methods often rely on the Kalman Filter for motion prediction, leveraging its strengths in linear motion scenarios. However, the inherent limitations of these methods become evident when confronted with complex, nonlinear motions and occlusions prevalent in dynamic environments like sports and dance. This paper explores the possibilities of replacing the Kalman Filter with various learning-based motion model that effectively enhances tracking accuracy and adaptability beyond the constraints of Kalman Filter-based systems. In this paper, we proposed MambaTrack, an online motion-based tracker that outperforms all existing motion-based trackers on the challenging DanceTrack and SportsMOT datasets. Moreover, we further exploit the potential of the state-space-model in trajectory feature extraction to boost the tracking performance and proposed MambaTrack+, which achieves the state-of-the-art performance on DanceTrack dataset with 56.1 HOTA and 54.9 IDF1.

Via

Access Paper or Ask Questions

Tree Counting by Bridging 3D Point Clouds with Imagery

Mar 12, 2024

Lei Li, Tianfang Zhang, Zhongyu Jiang, Cheng-Yen Yang, Jenq-Neng Hwang, Stefan Oehmcke, Dimitri Pierre Johannes Gominski, Fabian Gieseke, Christian Igel

Figure 1 for Tree Counting by Bridging 3D Point Clouds with Imagery

Figure 2 for Tree Counting by Bridging 3D Point Clouds with Imagery

Figure 3 for Tree Counting by Bridging 3D Point Clouds with Imagery

Figure 4 for Tree Counting by Bridging 3D Point Clouds with Imagery

Abstract:Accurate and consistent methods for counting trees based on remote sensing data are needed to support sustainable forest management, assess climate change mitigation strategies, and build trust in tree carbon credits. Two-dimensional remote sensing imagery primarily shows overstory canopy, and it does not facilitate easy differentiation of individual trees in areas with a dense canopy and does not allow for easy separation of trees when the canopy is dense. We leverage the fusion of three-dimensional LiDAR measurements and 2D imagery to facilitate the accurate counting of trees. We compare a deep learning approach to counting trees in forests using 3D airborne LiDAR data and 2D imagery. The approach is compared with state-of-the-art algorithms, like operating on 3D point cloud and 2D imagery. We empirically evaluate the different methods on the NeonTreeCount data set, which we use to define a tree-counting benchmark. The experiments show that FuseCountNet yields more accurate tree counts.

* need more experiments

Via

Access Paper or Ask Questions

A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video

Mar 06, 2024

Cheng-Yen Yang, Hsiang-Wei Huang, Zhongyu Jiang, Hao Wang, Farron Wallace, Jenq-Neng Hwang

Figure 1 for A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video

Figure 2 for A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video

Figure 3 for A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video

Figure 4 for A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video

Abstract:Dense object counting or crowd counting has come a long way thanks to the recent development in the vision community. However, indiscernible object counting, which aims to count the number of targets that are blended with respect to their surroundings, has been a challenge. Image-based object counting datasets have been the mainstream of the current publicly available datasets. Therefore, we propose a large-scale dataset called YoutubeFish-35, which contains a total of 35 sequences of high-definition videos with high frame-per-second and more than 150,000 annotated center points across a selected variety of scenes. For benchmarking purposes, we select three mainstream methods for dense object counting and carefully evaluate them on the newly collected dataset. We propose TransVidCount, a new strong baseline that combines density and regression branches along the temporal domain in a unified framework and can effectively tackle indiscernible object counting with state-of-the-art performance on YoutubeFish-35 dataset.

* Accepted by ICASSP 2024 (IEEE International Conference on Acoustics, Speech, and Signal Processing)

Via

Access Paper or Ask Questions

UniHPE: Towards Unified Human Pose Estimation via Contrastive Learning

Nov 24, 2023

Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang, Jenq-Neng Hwang

Abstract:In recent times, there has been a growing interest in developing effective perception techniques for combining information from multiple modalities. This involves aligning features obtained from diverse sources to enable more efficient training with larger datasets and constraints, as well as leveraging the wealth of information contained in each modality. 2D and 3D Human Pose Estimation (HPE) are two critical perceptual tasks in computer vision, which have numerous downstream applications, such as Action Recognition, Human-Computer Interaction, Object tracking, etc. Yet, there are limited instances where the correlation between Image and 2D/3D human pose has been clearly researched using a contrastive paradigm. In this paper, we propose UniHPE, a unified Human Pose Estimation pipeline, which aligns features from all three modalities, i.e., 2D human pose estimation, lifting-based and image-based 3D human pose estimation, in the same pipeline. To align more than two modalities at the same time, we propose a novel singular value based contrastive learning loss, which better aligns different modalities and further boosts the performance. In our evaluation, UniHPE achieves remarkable performance metrics: MPJPE $50.5$mm on the Human3.6M dataset and PAMPJPE $51.6$mm on the 3DPW dataset. Our proposed method holds immense potential to advance the field of computer vision and contribute to various applications.

Via

Access Paper or Ask Questions