Frederick W. B. Li

PHI: Bridging Domain Shift in Long-Term Action Quality Assessment via Progressive Hierarchical Instruction

May 26, 2025

Kanglei Zhou, Hubert P. H. Shum, Frederick W. B. Li, Xingxing Zhang, Xiaohui Liang

Abstract: Long-term Action Quality Assessment (AQA) aims to evaluate the quantitative performance of actions in long videos. However, existing methods face challenges due to domain shifts between the pre-trained large-scale action recognition backbones and the specific AQA task, which hinder their performance. These shifts arise because fine-tuning resource-intensive backbones on small AQA datasets is impractical. We address this by identifying two levels of domain shift: task-level, regarding differences in task objectives, and feature-level, regarding differences in important features. For feature-level shifts, which are more detrimental, we propose Progressive Hierarchical Instruction (PHI) with two strategies. First, Gap Minimization Flow (GMF) leverages flow matching to progressively learn a fast flow path that reduces the domain gap between initial and desired features across shallow to deep layers. Additionally, a temporally-enhanced attention module captures long-range dependencies essential for AQA. Second, List-wise Contrastive Regularization (LCR) facilitates coarse-to-fine alignment by comprehensively comparing batch pairs to learn fine-grained cues while mitigating domain shift. Integrating these modules, PHI offers an effective solution. Experiments demonstrate that PHI achieves state-of-the-art performance on three representative long-term AQA datasets, demonstrating its superiority in addressing the domain shift for long-term AQA.

* Accepted by IEEE Transactions on Image Processing
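
As a rough illustration of the flow matching idea that GMF builds on, the sketch below uses the common linear-path formulation: a point on the path is x_t = (1 - t) x_0 + t x_1, and the regression target for the velocity field is x_1 - x_0. All function names and the toy vectors are hypothetical, the "oracle" velocity stands in for a learned network, and none of this is taken from the paper's implementation.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (initial feature) to x1 (desired feature)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """For a linear path the velocity is constant along the path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between a predicted velocity and the linear-path target."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_integrate(velocity_fn, x0, steps=4):
    """Follow the velocity field from t=0 to t=1 with a few Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy example with made-up numbers: a backbone feature x0 and a desired feature x1.
x0 = [0.0, 1.0, -2.0]
x1 = [1.0, 3.0, 0.0]
oracle = lambda x, t: target_velocity(x0, x1)  # placeholder for a trained model
loss = flow_matching_loss(oracle, x0, x1, t=0.3)
x_end = euler_integrate(oracle, x0)
```

With a perfect velocity field the loss is zero and a handful of Euler steps carries x0 onto x1, which is the sense in which a straight path is "fast" to traverse.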

Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation

Feb 08, 2025

Yin Wang, Mu Li, Jiapeng Liu, Zhiying Leng, Frederick W. B. Li, Ziyao Zhang, Xiaohui Liang

Abstract: We address the challenging problem of fine-grained text-driven human motion generation. Existing works generate imprecise motions that fail to accurately capture relationships specified in text due to: (1) a lack of effective text parsing for detailed semantic cues regarding body parts, and (2) incomplete modeling of the linguistic structures between words, which is needed to comprehend text comprehensively. To tackle these limitations, we propose a novel fine-grained framework, Fg-T2M++, that consists of: (1) an LLM-based semantic parsing module to extract body part descriptions and semantics from text, (2) a hyperbolic text representation module to encode relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a multi-modal fusion module to hierarchically fuse text and motion features. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms SOTA methods, validating its ability to accurately generate motions adhering to comprehensive text semantics.
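
To give a feel for why hyperbolic space suits tree-like structures such as dependency graphs, here is a minimal sketch of the distance function in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model Fg-T2M++ uses, so the choice of the Poincaré ball, the function name, and the example points are all assumptions.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm = lambda x: sum(c * c for c in x)
    if sq_norm(u) >= 1.0 or sq_norm(v) >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# Hypothetical embeddings: a root word near the origin, dependents nearer the rim.
root = [0.0, 0.0]
child = [0.5, 0.0]
leaf = [0.9, 0.0]
d_root_child = poincare_distance(root, child)
d_root_leaf = poincare_distance(root, leaf)
```

Distances grow rapidly toward the boundary of the ball, so a hierarchy can place general tokens near the origin and specific ones near the rim while keeping siblings well separated.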

ST-SACLF: Style Transfer Informed Self-Attention Classifier for Bias-Aware Painting Classification

Aug 03, 2024

Mridula Vijendran, Frederick W. B. Li, Jingjing Deng, Hubert P. H. Shum

Abstract: Painting classification plays a vital role in organizing, finding, and suggesting artwork for digital and classic art galleries. Existing methods struggle with adapting knowledge from the real world to artistic images during training, leading to poor performance when dealing with different datasets. Our innovation lies in addressing these challenges through a two-step process. First, we generate more data using Style Transfer with Adaptive Instance Normalization (AdaIN), bridging the gap between diverse styles. Then, our classifier gains a boost with feature-map adaptive spatial attention modules, improving its understanding of artistic details. Moreover, we tackle the problem of imbalanced class representation by dynamically adjusting augmented samples. Through a dual-stage process involving careful hyperparameter search and model fine-tuning, we achieve 87.24% accuracy using the ResNet-50 backbone over 40 training epochs. Our study explores quantitative analyses that compare different pretrained backbones, investigates model optimization through ablation studies, and examines how varying augmentation levels affect model performance. Complementing this, our qualitative experiments offer valuable insights into the model's decision-making process using spatial attention and its ability to differentiate between easy and challenging samples based on confidence ranking.
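
The AdaIN operation mentioned above can be sketched in a few lines: each content channel is normalized to zero mean and unit variance, then rescaled to the mean and standard deviation of the matching style channel. The list-of-channels layout, the function names, and the toy activations below are illustrative assumptions; a real pipeline would apply this to encoder feature maps.

```python
import math

def channel_stats(values, eps=1e-5):
    """Mean and standard deviation of one channel (eps avoids division by zero)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)

def adain(content, style, eps=1e-5):
    """AdaIN over features given as a list of channels, each a flat list of activations."""
    out = []
    for c_ch, s_ch in zip(content, style):
        c_mean, c_std = channel_stats(c_ch, eps)
        s_mean, s_std = channel_stats(s_ch, eps)
        out.append([s_std * (v - c_mean) / c_std + s_mean for v in c_ch])
    return out

# Toy single-channel feature maps (made-up numbers).
content = [[1.0, 2.0, 3.0, 4.0]]
style = [[10.0, 20.0, 30.0, 40.0]]
stylized = adain(content, style)
out_mean, out_std = channel_stats(stylized[0])
style_mean, style_std = channel_stats(style[0])
```

After the transform, the output channel carries the style statistics while preserving the relative ordering of the content activations.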

From Category to Scenery: An End-to-End Framework for Multi-Person Human-Object Interaction Recognition in Videos

Jul 01, 2024

Tanqiu Qiao, Ruochen Li, Frederick W. B. Li, Hubert P. H. Shum

Abstract: Video-based Human-Object Interaction (HOI) recognition explores the intricate dynamics between humans and objects, which are essential for a comprehensive understanding of human behavior and intentions. While previous work has made significant strides, effectively integrating geometric and visual features to model dynamic relationships between humans and objects in a graph framework remains a challenge. In this work, we propose a novel end-to-end category to scenery framework, CATS, which first generates geometric features for each category through its own graph and then fuses them with the corresponding visual features. Subsequently, we construct a scenery interactive graph with these enhanced geometric-visual features as nodes to learn the relationships among human and object categories. This methodological advance facilitates a deeper, more structured comprehension of interactions, bridging category-specific insights with broad scenery dynamics. Our method demonstrates state-of-the-art performance on two pivotal HOI benchmarks: the MPHOI-72 dataset for multi-person HOIs and the single-person HOI CAD-120 dataset.

* Accepted by ICPR 2024

MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment

Mar 07, 2024

Kanglei Zhou, Liyuan Wang, Xingxing Zhang, Hubert P. H. Shum, Frederick W. B. Li, Jianguo Li, Xiaohui Liang

Abstract: Action Quality Assessment (AQA) evaluates diverse skills but models struggle with non-stationary data. We propose Continual AQA (CAQA) to refine models using sparse new data. Feature replay preserves memory without storing raw inputs. However, the misalignment between static old features and the dynamically changing feature manifold causes severe catastrophic forgetting. To address this novel problem, we propose Manifold-Aligned Graph Regularization (MAGR), which first aligns deviated old features to the current feature manifold, ensuring representation consistency. It then constructs a graph jointly arranging old and new features aligned with quality scores. Experiments show MAGR outperforms recent strong baselines with up to 6.56%, 5.66%, 15.64%, and 9.05% correlation gains on the MTL-AQA, FineDiving, UNLV-Dive, and JDM-MSA split datasets, respectively. This validates MAGR for continual assessment challenges arising from non-stationary skill variations.
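
The abstract reports "correlation gains" without naming the metric. AQA work commonly scores models with Spearman's rank correlation between predicted and ground-truth quality scores, so the sketch below assumes that metric; the function names and the example scores are hypothetical.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(pred, true):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(pred), ranks(true))

# Made-up predicted and judged scores for four performances.
predicted = [80.5, 62.0, 91.0, 70.0]
judged = [78.0, 65.0, 88.0, 72.0]
rho = spearman(predicted, judged)
```

Because only the ordering matters, a model can be off in absolute score yet still achieve a high correlation if it ranks performances correctly.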

Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model

Sep 12, 2023

Yin Wang, Zhiying Leng, Frederick W. B. Li, Shun-Cheng Wu, Xiaohui Liang

Abstract: Text-driven human motion generation in computer vision is both significant and challenging. However, current methods are limited to producing either deterministic or imprecise motion sequences, failing to effectively control the temporal and spatial relationships required to conform to a given text description. In this work, we propose a fine-grained method for generating high-quality, conditional human motion sequences that support precise text descriptions. Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language features to fully utilize text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistics features from shallow and deep graph neural networks to achieve multi-step inference. Experiments show that our approach outperforms text-driven motion generation methods on the HumanML3D and KIT test sets and, as confirmed visually, generates motions that better match the text conditions.

Tackling Data Bias in Painting Classification with Style Transfer

Jan 06, 2023

Mridula Vijendran, Frederick W. B. Li, Hubert P. H. Shum

Abstract: It is difficult to train classifiers on painting collections due to model bias from domain gaps and data bias from the uneven distribution of artistic styles. Previous techniques like data distillation, traditional data augmentation and style transfer improve classifier training using task-specific training datasets or domain adaptation. We propose a system to handle data bias in small painting datasets like the Kaokore dataset while simultaneously accounting for domain adaptation in fine-tuning a model trained on real-world images. Our system consists of two stages: style transfer and classification. In the style transfer stage, we generate the stylized training samples per class with uniformly sampled content and style images and train the style transformation network per domain. In the classification stage, we can interpret the effectiveness of the style and content layers at the attention layers when training on the original training dataset and the stylized images. We can trade off model performance against convergence by dynamically varying the proportion of augmented samples in the majority and minority classes. We achieve comparable results to the SOTA with fewer training epochs and a classifier with fewer training parameters.

* International Conference on Computer Vision Theory and Applications (VISAPP), 2023, 12 pages, 9 figures

Geometric Features Informed Multi-person Human-object Interaction Recognition in Videos

Jul 19, 2022

Tanqiu Qiao, Qianhui Men, Frederick W. B. Li, Yoshiki Kubotani, Shigeo Morishima, Hubert P. H. Shum

Abstract: Human-Object Interaction (HOI) recognition in videos is important for analyzing human activity. Most existing work focuses on visual features and usually suffers from occlusion in real-world scenarios. This problem is further complicated when multiple people and objects are involved in HOIs. Considering that geometric features such as human pose and object position provide meaningful information to understand HOIs, we argue for combining the benefits of both visual and geometric features in HOI recognition, and propose a novel Two-level Geometric feature-informed Graph Convolutional Network (2G-GCN). The geometric-level graph models the interdependency between geometric features of humans and objects, while the fusion-level graph further fuses them with visual features of humans and objects. To demonstrate the novelty and effectiveness of our method in challenging scenarios, we propose a new multi-person HOI dataset (MPHOI-72). Extensive experiments on the MPHOI-72 (multi-person HOI), CAD-120 (single-human HOI) and Bimanual Actions (two-hand HOI) datasets demonstrate our superior performance compared to state-of-the-art methods.

* Accepted by ECCV 2022
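
For readers unfamiliar with graph convolutions, the sketch below shows one generic propagation step in the widely used symmetric-normalization form, ReLU(D^{-1/2} (A + I) D^{-1/2} H W). The abstract does not specify the layer used in 2G-GCN, so this is a stand-in rather than the paper's method; the three-node graph (two humans linked to one object), the 2-D "geometric" features, and the identity weight matrix are all hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalize_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(adj, feats, weight):
    """One graph convolution: aggregate neighbours, project, apply ReLU."""
    aggregated = matmul(normalize_adjacency(adj), feats)
    projected = matmul(aggregated, weight)
    return [[max(0.0, v) for v in row] for row in projected]

# Nodes: human 1, human 2, object; both humans interact with the object.
adj = [[0.0, 0.0, 1.0],
       [0.0, 0.0, 1.0],
       [1.0, 1.0, 0.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # made-up 2-D geometric features
weight = [[1.0, 0.0], [0.0, 1.0]]              # identity, for readability
norm = normalize_adjacency(adj)
out = gcn_layer(adj, feats, weight)
```

Each human node ends up with a mixture of its own feature and the object's feature, which is the basic mechanism by which a graph layer shares information between interacting entities.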

Multi-Scale Dual-Branch Fully Convolutional Network for Hand Parsing

May 24, 2019

Yang Lu, Xiaohui Liang, Frederick W. B. Li

Abstract: Recently, fully convolutional neural networks (FCNs) have shown strong performance in image parsing, including scene parsing and object parsing. Unlike generic object parsing tasks, hand parsing is more challenging due to small size, complex structure, heavy self-occlusion and ambiguous texture. In this paper, we propose a novel parsing framework, Multi-Scale Dual-Branch Fully Convolutional Network (MSDB-FCN), for hand parsing tasks. Our network employs a Dual-Branch architecture to extract features of the hand area, focusing attention on the hand itself. These features are used to generate multi-scale features with a pyramid pooling strategy. To better encode multi-scale features, we design a Deconvolution and Bilinear Interpolation Block (DB-Block) for upsampling and merging the features of different scales. To address data imbalance, which is a common problem in many computer vision tasks including hand parsing, we propose a generalization of Focal Loss, namely Multi-Class Balanced Focal Loss, to tackle data imbalance in multi-class classification. Extensive experiments on the RHD-PARSING dataset demonstrate that our MSDB-FCN achieves state-of-the-art performance for hand parsing.
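
The exact form of Multi-Class Balanced Focal Loss is not given in the abstract. As background, the sketch below implements the standard alpha-weighted focal loss, -alpha_c (1 - p_c)^gamma log(p_c), applied to softmax probabilities, together with a simple inverse-frequency choice of alpha. The inverse-frequency weighting, the function names, and the toy class counts are assumptions and may differ from the paper's formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, alpha, gamma=2.0):
    """Focal loss for one sample: down-weights examples the model already gets right."""
    p_t = softmax(logits)[target]
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)

def inverse_frequency_weights(class_counts):
    """Per-class weights proportional to 1 / frequency (an assumed balancing scheme)."""
    total = sum(class_counts)
    n = len(class_counts)
    return [total / (n * c) for c in class_counts]

# Toy two-class setting: 90 background pixels, 10 hand-part pixels.
alpha = inverse_frequency_weights([90, 10])
easy = focal_loss([5.0, 0.0], target=0, alpha=[1.0, 1.0])   # confident and correct
hard = focal_loss([0.0, 5.0], target=0, alpha=[1.0, 1.0])   # confident and wrong
```

Setting gamma to 0 and all alpha values to 1 recovers ordinary cross-entropy, which makes the two extra factors easy to reason about separately.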

DOOBNet: Deep Object Occlusion Boundary Detection from an Image

Sep 13, 2018

Guoxia Wang, Xiaohui Liang, Frederick W. B. Li

Abstract: Object occlusion boundary detection is a fundamental and crucial research problem in computer vision. It is challenging because training an object occlusion boundary detector encounters an extreme boundary/non-boundary class imbalance. In this paper, we propose to address this class imbalance by up-weighting the loss contribution of false negative and false positive examples with our novel Attention Loss function. We also propose a unified end-to-end multi-task deep object occlusion boundary detection network (DOOBNet) that shares convolutional features to simultaneously predict object boundary and occlusion orientation. DOOBNet adopts an encoder-decoder structure with skip connections in order to automatically learn multi-scale and multi-level features. We significantly surpass the state-of-the-art on the PIOD dataset (ODS F-score of .702) and the BSDS ownership dataset (ODS F-score of .555), and improve the detection speed to 0.037 s per image on the PIOD dataset.
