Haozhe Qi

LLaVAction: evaluating and training multi-modal large language models for action recognition

Mar 24, 2025

Shaokai Ye, Haozhe Qi, Alexander Mathis, Mackenzie W. Mathis

Abstract: Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. Recently developed multi-modal large language models (MLLMs) are a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, into the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art results on the EPIC-KITCHENS-100 validation set and outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.

* https://github.com/AdaptiveMotorControlLab/LLaVAction
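The abstract above mentions sampling difficult incorrect answers as distractors for the multiple-choice benchmark. A minimal sketch of that idea follows, assuming the distractors are the highest-scoring wrong labels from some recognition model. The function name `build_mqa_item`, the score dictionary, and the example labels are hypothetical and are not taken from the LLaVAction code.

```python
import random


def build_mqa_item(ground_truth, candidate_scores, num_distractors=4, seed=0):
    """Build one multiple-choice item with hard distractors.

    candidate_scores maps an action label to a confidence score from a
    hypothetical recognition model (an assumption, not the paper's pipeline).
    """
    # Rank incorrect labels by score: high-scoring wrong answers are "hard".
    wrong = [(label, s) for label, s in candidate_scores.items()
             if label != ground_truth]
    wrong.sort(key=lambda pair: pair[1], reverse=True)
    distractors = [label for label, _ in wrong[:num_distractors]]

    # Shuffle so the position of the correct answer carries no information.
    options = distractors + [ground_truth]
    random.Random(seed).shuffle(options)
    return {"options": options, "answer_index": options.index(ground_truth)}


# Hypothetical scores for one video clip.
scores = {
    "open drawer": 0.41,
    "close drawer": 0.38,
    "open cupboard": 0.12,
    "wash plate": 0.05,
    "cut onion": 0.03,
    "take knife": 0.01,
}
item = build_mqa_item("open drawer", scores, num_distractors=3)
```

With these scores, the three distractors are the wrong labels most easily confused with the correct one, which is what makes the resulting question harder than one built from randomly drawn labels.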

HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields

Feb 26, 2024

Haozhe Qi, Chen Zhao, Mathieu Salzmann, Alexander Mathis

Abstract: Human hands are highly articulated and versatile at handling objects. Jointly estimating the 3D poses of a hand and the object it manipulates from a monocular camera is challenging due to frequent occlusions. Thus, existing methods often rely on intermediate 3D shape representations to increase performance. These representations are typically explicit, such as 3D point clouds or meshes, and thus provide information only in the direct surroundings of the intermediate hand pose estimate. To address this, we introduce HOISDF, a Signed Distance Field (SDF) guided hand-object pose estimation network, which jointly exploits hand and object SDFs to provide a global, implicit representation over the complete reconstruction volume. Specifically, the role of the SDFs is threefold: equip the visual encoder with implicit shape information, help to encode hand-object interactions, and guide the hand and object pose regression via SDF-based sampling and by augmenting the feature representations. We show that HOISDF achieves state-of-the-art results on hand-object pose estimation benchmarks (DexYCB and HO3Dv2). Code is available at https://github.com/amathislab/HOISDF

* Accepted at CVPR 2024. 9 figures, many tables
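To illustrate what a signed distance field provides as a global, implicit representation, here is a minimal sketch that uses analytic spheres as stand-ins for the hand and the object. This is an assumption made for illustration only: HOISDF learns its SDFs with a network, and the names `sphere_sdf`, `query`, `HAND`, and `OBJECT` are hypothetical.

```python
import math


def sphere_sdf(point, center, radius):
    """Signed distance from point to a sphere: negative inside, positive outside."""
    return math.dist(point, center) - radius


# Hypothetical stand-ins for the hand and object shapes: (center, radius).
HAND = ((0.0, 0.0, 0.0), 1.0)
OBJECT = ((1.5, 0.0, 0.0), 1.0)


def query(point):
    """Return (hand_sdf, object_sdf) at any point in the volume."""
    return sphere_sdf(point, *HAND), sphere_sdf(point, *OBJECT)


# Unlike a point cloud or mesh, the field is defined everywhere, including far
# from either surface.
far_hand, far_object = query((5.0, 0.0, 0.0))

# A point where both values are negative lies inside both shapes, which is one
# simple way a pair of SDFs can expose hand-object interaction.
mid_hand, mid_object = query((0.75, 0.0, 0.0))
overlapping = mid_hand < 0 and mid_object < 0
```

Because every query point returns a meaningful distance, samples can be drawn anywhere in the reconstruction volume rather than only near an intermediate surface estimate.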

P2B: Point-to-Box Network for 3D Object Tracking in Point Clouds

May 28, 2020

Haozhe Qi, Chen Feng, Zhiguo Cao, Feng Zhao, Yang Xiao

Abstract: Towards 3D object tracking in point clouds, a novel point-to-box network termed P2B is proposed in an end-to-end learning manner. Our main idea is to first localize potential target centers in a 3D search area embedded with target information. Then, point-driven 3D target proposal and verification are executed jointly. In this way, the time-consuming exhaustive 3D search can be avoided. Specifically, we first sample seeds from the point clouds in the template and the search area, respectively. Then, we execute permutation-invariant feature augmentation to embed target clues from the template into the search area seeds and represent them with target-specific features. Consequently, the augmented search area seeds regress the potential target centers via Hough voting. The centers are further strengthened with seed-wise targetness scores. Finally, each center clusters its neighbors to leverage the ensemble power for joint 3D target proposal and verification. We apply PointNet++ as our backbone, and experiments on the KITTI tracking dataset demonstrate P2B's superiority (~10% improvement over the state of the art). Note that P2B runs at 40 FPS on a single NVIDIA 1080Ti GPU. Our code and model are available at https://github.com/HaozheQi/P2B.

* Accepted by CVPR 2020 (oral)
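The voting step described in the abstract above can be sketched in a simplified form: each seed adds a predicted offset to its own position to cast a vote for the target center, and the votes are combined using seed-wise targetness scores. In P2B the offsets and scores are learned and the votes are clustered by neighborhood; the weighted average below is a swapped-in simplification, and the function name `vote_center` and all numbers are hypothetical.

```python
def vote_center(seeds, offsets, targetness):
    """Cast one vote per seed and combine votes by a targetness-weighted mean."""
    # Each vote is the seed position shifted by its predicted offset.
    votes = [
        tuple(s + o for s, o in zip(seed, offset))
        for seed, offset in zip(seeds, offsets)
    ]
    # Seeds with low targetness (likely background) contribute little.
    total = sum(targetness)
    center = tuple(
        sum(w * vote[d] for w, vote in zip(targetness, votes)) / total
        for d in range(3)
    )
    return votes, center


# Two seeds on the target vote for the same center; one background seed is
# given zero targetness and therefore does not affect the result.
seeds = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
offsets = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
targetness = [0.5, 0.5, 0.0]

votes, center = vote_center(seeds, offsets, targetness)
```

Voting from seeds in this way replaces an exhaustive scan over candidate boxes in the search area, which is the efficiency argument the abstract makes.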
