Tz-Ying Wu

EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs

Jun 06, 2025

Ivan Rodin, Tz-Ying Wu, Kyle Min, Sharath Nittur Sridhar, Antonino Furnari, Subarna Tripathi, Giovanni Maria Farinella

Abstract: We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We observe a performance gap in language-only and video-LLMs, especially on questions focusing on temporal ordering, thus identifying a research gap in the area of long-context video understanding. To promote the reproducibility of our findings and facilitate further research, the benchmark and accompanying code are available at the following GitHub page: https://github.com/fpv-iplab/EASG-bench.

Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation

Jul 28, 2024

Tz-Ying Wu, Kyle Min, Subarna Tripathi, Nuno Vasconcelos

Abstract: Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage the egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation for each video frame/text feature using the basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, it models context fusion and cross-modal transfer in an efficient fashion. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), largely improving over baselines and reaching the performance of full fine-tuning.

Single-Stage Visual Relationship Learning using Conditional Queries

Jun 09, 2023

Alakh Desai, Tz-Ying Wu, Subarna Tripathi, Nuno Vasconcelos

Abstract: Research in scene graph generation (SGG) usually considers two-stage models, that is, detecting a set of entities, followed by combining them and labeling all possible relationships. While showing promising results, the pipeline structure induces large parameter and computation overhead, and typically hinders end-to-end optimization. To address this, recent research attempts to train single-stage models that are computationally efficient. With the advent of DETR, a set-based detection model, one-stage models attempt to predict a set of subject-predicate-object triplets directly in a single shot. However, SGG is inherently a multi-task learning problem that requires modeling entity and predicate distributions simultaneously. In this paper, we propose Transformers with conditional queries for SGG, namely TraCQ, with a new formulation for SGG that avoids the multi-task learning problem and the combinatorial entity pair distribution. We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space as well, which leads to 20% fewer parameters compared to state-of-the-art single-stage models. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset, while remaining capable of end-to-end training and faster inference.

* Accepted to NeurIPS 2022

ProTeCt: Prompt Tuning for Hierarchical Consistency

Jun 04, 2023

Tz-Ying Wu, Chih-Hui Ho, Nuno Vasconcelos

Abstract: Large visual-language models, like CLIP, learn generalized representations and have shown promising zero-shot performance. Few-shot adaptation methods, based on prompt tuning, have also been shown to further improve performance on downstream datasets. However, these models are not hierarchically consistent. Frequently, they infer incorrect labels at coarser taxonomic class levels, even when the inference at the leaf level (original class labels) is correct. This is problematic, given their support for open set classification and, in particular, open-grained classification, where practitioners define label sets at various levels of granularity. To address this problem, we propose a prompt tuning technique to calibrate the hierarchical consistency of model predictions. A set of metrics of hierarchical consistency, the Hierarchical Consistent Accuracy (HCA) and the Mean Treecut Accuracy (MTA), are first proposed to benchmark model performance in the open-granularity setting. A prompt tuning technique, denoted as Prompt Tuning for Hierarchical Consistency (ProTeCt), is then proposed to calibrate classification across all possible label set granularities. Results show that ProTeCt can be combined with existing prompt tuning methods to significantly improve open-granularity classification performance without degradation of the original classification performance at the leaf level.
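
The inconsistency this abstract describes (a correct leaf label paired with a wrong coarser label) can be illustrated with a small sketch. It is not the paper's definition of HCA or MTA, which the abstract does not spell out. It assumes a hypothetical two-level toy taxonomy and counts a sample as consistent only when the prediction is correct at every level. The function names and labels are illustrative.

```python
from typing import Dict, List

# Each sample maps a taxonomy level name to a label (hypothetical toy data).
Labels = Dict[str, str]


def leaf_accuracy(preds: List[Labels], truths: List[Labels], leaf: str) -> float:
    """Fraction of samples whose leaf-level prediction is correct."""
    correct = sum(p[leaf] == t[leaf] for p, t in zip(preds, truths))
    return correct / len(truths)


def consistent_accuracy(preds: List[Labels], truths: List[Labels]) -> float:
    """Fraction of samples predicted correctly at every taxonomy level."""
    correct = sum(
        all(p[level] == t[level] for level in t) for p, t in zip(preds, truths)
    )
    return correct / len(truths)


truths = [
    {"coarse": "animal", "leaf": "dog"},
    {"coarse": "animal", "leaf": "cat"},
    {"coarse": "vehicle", "leaf": "car"},
    {"coarse": "vehicle", "leaf": "truck"},
]
preds = [
    {"coarse": "animal", "leaf": "dog"},
    {"coarse": "vehicle", "leaf": "cat"},  # leaf correct, coarse level wrong
    {"coarse": "vehicle", "leaf": "car"},
    {"coarse": "vehicle", "leaf": "car"},  # leaf wrong
]

leaf_acc = leaf_accuracy(preds, truths, "leaf")  # 3 of 4 leaves correct
cons_acc = consistent_accuracy(preds, truths)    # 2 of 4 correct at all levels
print(leaf_acc, cons_acc)
```

In this toy case the second sample is what the abstract calls hierarchically inconsistent: it counts toward leaf accuracy but not toward the all-levels score.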

Class-Incremental Learning with Strong Pre-trained Models

Apr 07, 2022

Tz-Ying Wu, Gurumurthy Swaminathan, Zhizhong Li, Avinash Ravichandran, Nuno Vasconcelos, Rahul Bhotika, Stefano Soatto

Abstract: Class-incremental learning (CIL) has been widely studied under the setting of starting from a small number of classes (base classes). Instead, we explore an understudied real-world setting of CIL that starts with a strong model pre-trained on a large number of base classes. We hypothesize that a strong base model can provide a good representation for novel classes and incremental learning can be done with small adaptations. We propose a 2-stage training scheme, i) feature augmentation -- cloning part of the backbone and fine-tuning it on the novel data, and ii) fusion -- combining the base and novel classifiers into a unified classifier. Experiments show that the proposed method significantly outperforms state-of-the-art CIL methods on the large-scale ImageNet dataset (e.g., +10% overall accuracy over the best). We also propose and analyze understudied practical CIL scenarios, such as base-novel overlap with distribution shift. Our proposed method is robust and generalizes to all analyzed CIL settings.

* Accepted at CVPR 2022, code to be released soon

Learning of Visual Relations: The Devil is in the Tails

Aug 22, 2021

Alakh Desai, Tz-Ying Wu, Subarna Tripathi, Nuno Vasconcelos

Abstract: Significant effort has recently been devoted to modeling visual relations. This has mostly addressed the design of architectures, typically by adding parameters and increasing model complexity. However, visual relation learning is a long-tailed problem, due to the combinatorial nature of joint reasoning about groups of objects. Increasing model complexity is, in general, ill-suited for long-tailed problems due to their tendency to overfit. In this paper, we explore an alternative hypothesis, denoted the Devil is in the Tails. Under this hypothesis, better performance is achieved by keeping the model simple but improving its ability to cope with long-tailed distributions. To test this hypothesis, we devise a new approach for training visual relationship models, which is inspired by state-of-the-art long-tailed recognition literature. This is based on an iterative decoupled training scheme, denoted Decoupled Training for Devil in the Tails (DT2). DT2 employs a novel sampling approach, Alternating Class-Balanced Sampling (ACBS), to capture the interplay between the long-tailed entity and predicate distributions of visual relations. Results show that, with an extremely simple architecture, DT2-ACBS significantly outperforms much more complex state-of-the-art methods on scene graph generation tasks. This suggests that the development of sophisticated models must be considered in tandem with the long-tailed nature of the problem.

* Accepted to ICCV 2021
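
For readers unfamiliar with class-balanced sampling, the general idea behind the sampler named in this abstract can be sketched as follows. This is generic class-balanced sampling (pick a class uniformly, then pick an instance of that class), not the paper's ACBS, which according to the abstract additionally alternates between the entity and predicate distributions. The toy data and the function name `class_balanced_sample` are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (item id, class label)


def class_balanced_sample(data: List[Sample], n: int, rng: random.Random) -> List[Sample]:
    """Draw n samples so that every class is equally likely, whatever its size."""
    by_class: Dict[str, List[Sample]] = defaultdict(list)
    for item, label in data:
        by_class[label].append((item, label))
    classes = sorted(by_class)  # fixed order keeps draws reproducible
    draws = []
    for _ in range(n):
        label = rng.choice(classes)                 # step 1: uniform over classes
        draws.append(rng.choice(by_class[label]))   # step 2: uniform within class
    return draws


# Hypothetical long-tailed toy set: 95 "on" relations, 5 "riding" relations.
data = [(f"img{i}", "on") for i in range(95)]
data += [(f"img{i}", "riding") for i in range(95, 100)]

rng = random.Random(0)
draws = class_balanced_sample(data, 1000, rng)
counts = Counter(label for _, label in draws)
print(counts)
```

Although "riding" makes up only 5% of the toy data, it is drawn about as often as "on", which is the effect a long-tailed training scheme relies on.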

Solving Long-tailed Recognition with Deep Realistic Taxonomic Classifier

Jul 20, 2020

Tz-Ying Wu, Pedro Morgado, Pei Wang, Chih-Hui Ho, Nuno Vasconcelos

Abstract: Long-tail recognition tackles the natural non-uniformly distributed data in real-world scenarios. While modern classifiers perform well on populated classes, their performance degrades significantly on tail classes. Humans, however, are less affected by this since, when confronted with uncertain examples, they simply opt to provide coarser predictions. Motivated by this, a deep realistic taxonomic classifier (Deep-RTC) is proposed as a new solution to the long-tail problem, combining realism with hierarchical predictions. The model has the option to reject classifying samples at different levels of the taxonomy, once it cannot guarantee the desired performance. Deep-RTC is implemented with a stochastic tree sampling during training to simulate all possible classification conditions at finer or coarser levels and a rejection mechanism at inference time. Experiments on the long-tailed versions of four datasets, CIFAR100, AWA2, ImageNet, and iNaturalist, demonstrate that the proposed approach preserves more information on all classes with different popularity levels. Deep-RTC also outperforms the state-of-the-art methods in the long-tailed recognition, hierarchical classification, and learning with rejection literature using the proposed correctly predicted bits (CPB) metric.

* Accepted to ECCV 2020
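
The idea of falling back to a coarser prediction when the classifier is unsure can be illustrated with a minimal sketch. It is not the Deep-RTC implementation: the taxonomy, the per-node scores, the threshold, and the function name `predict_with_rejection` are all hypothetical, and the stochastic tree sampling used during training is not modeled.

```python
from typing import Dict, List

# Hypothetical toy taxonomy: parent node -> child nodes.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "vehicle": ["car", "truck"],
}


def predict_with_rejection(scores: Dict[str, float], threshold: float) -> str:
    """Descend from the root, taking the best-scoring child at each level.

    Stop at the current node (a coarser answer) as soon as the best child's
    score falls below the threshold.
    """
    node = "root"
    while node in TAXONOMY:
        best = max(TAXONOMY[node], key=lambda child: scores.get(child, 0.0))
        if scores.get(best, 0.0) < threshold:
            break  # reject the finer level, keep the coarser prediction
        node = best
    return node


confident = {"animal": 0.9, "vehicle": 0.1, "dog": 0.8, "cat": 0.2}
uncertain = {"animal": 0.9, "vehicle": 0.1, "dog": 0.55, "cat": 0.45}
unknown = {"animal": 0.5, "vehicle": 0.5}

fine = predict_with_rejection(confident, 0.7)    # reaches a leaf
coarse = predict_with_rejection(uncertain, 0.7)  # stops one level above the leaf
nothing = predict_with_rejection(unknown, 0.7)   # rejects even the first level
print(fine, coarse, nothing)
```

With these toy scores the three calls return a leaf, an internal node, and the root respectively, mirroring the abstract's description of rejecting at different levels of the taxonomy.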

Exploit Clues from Views: Self-Supervised and Regularized Learning for Multiview Object Recognition

Mar 28, 2020

Chih-Hui Ho, Bo Liu, Tz-Ying Wu, Nuno Vasconcelos

Abstract: Multiview recognition has been well studied in the literature and achieves decent performance in object recognition and retrieval tasks. However, most previous works rely on supervised learning and some impractical underlying assumptions, such as the availability of all views at training and inference time. In this work, the problem of multiview self-supervised learning (MV-SSL) is investigated, where only image to object association is given. Given this setup, a novel surrogate task for self-supervised learning is proposed by pursuing an "object invariant" representation. This is solved by randomly selecting an image feature of an object as the object prototype, accompanied by multiview consistency regularization, which results in view invariant stochastic prototype embedding (VISPE). Experiments show that the recognition and retrieval results using VISPE outperform those of other self-supervised learning methods on seen and unseen data. VISPE can also be applied to the semi-supervised scenario and demonstrates robust performance with limited data available. Code is available at https://github.com/chihhuiho/VISPE

* Accepted to CVPR 2020

Explainable Object-induced Action Decision for Autonomous Vehicles

Mar 20, 2020

Yiran Xu, Xiaoyin Yang, Lihang Gong, Hsuan-Chu Lin, Tz-Ying Wu, Yunsheng Li, Nuno Vasconcelos

Abstract: A new paradigm is proposed for autonomous driving. The new paradigm lies between the end-to-end and pipelined approaches, and is inspired by how humans solve the problem. While it relies on scene understanding, the latter only considers objects that could give rise to hazards. These are denoted as action-inducing, since changes in their state should trigger vehicle actions. They also define a set of explanations for these actions, which should be produced jointly with the latter. An extension of the BDD100K dataset, annotated for a set of 4 actions and 21 explanations, is proposed. A new multi-task formulation of the problem, which optimizes the accuracy of both action commands and explanations, is then introduced. A CNN architecture is finally proposed to solve this problem, by combining reasoning about action-inducing objects and global scene context. Experimental results show that the requirement of explanations improves the recognition of action-inducing objects, which in turn leads to better action predictions.

Liquid Pouring Monitoring via Rich Sensory Inputs

Aug 06, 2018

Tz-Ying Wu, Juan-Ting Lin, Tsun-Hsuang Wang, Chan-Wei Hu, Juan Carlos Niebles, Min Sun

Abstract: Humans have the amazing ability to perform very subtle manipulation tasks using a closed-loop control system with imprecise mechanics (i.e., our body parts) but rich sensory information (e.g., vision, tactile, etc.). In the closed-loop system, the ability to monitor the state of the task via rich sensory information is important but often less studied. In this work, we take liquid pouring as a concrete example and aim at learning to continuously monitor whether liquid pouring is successful (e.g., no spilling) or not via rich sensory inputs. We mimic humans' rich senses using synchronized observations from a chest-mounted camera and a wrist-mounted IMU sensor. Given many success and failure demonstrations of liquid pouring, we train a hierarchical LSTM with late fusion for monitoring. To improve the robustness of the system, we propose two auxiliary tasks during training: (1) inferring the initial state of containers and (2) forecasting the one-step future 3D trajectory of the hand with an adversarial training procedure. These tasks encourage our method to learn representations sensitive to container states and how objects are manipulated in 3D. With these novel components, our method achieves ~8% and ~11% better monitoring accuracy than the baseline method without auxiliary tasks on unseen containers and unseen users, respectively.
