Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Antonis Argyros

Institute of Computer Science, FORTH, Computer Science Department, University of Crete

Reflections on Diversity: A Real-time Virtual Mirror for Inclusive 3D Face Transformations

Mar 25, 2025

Paraskevi Valergaki, Antonis Argyros, Giorgos Giannakakis, Anastasios Roussos

Abstract:Real-time 3D face manipulation has significant applications in virtual reality, social media and human-computer interaction. This paper introduces a novel system, which we call Mirror of Diversity (MOD), that combines Generative Adversarial Networks (GANs) for texture manipulation and 3D Morphable Models (3DMMs) for facial geometry to achieve realistic face transformations that reflect various demographic characteristics, emphasizing the beauty of diversity and the universality of human features. As participants sit in front of a computer monitor with a camera positioned above, their facial characteristics are captured in real time and can further alter their digital face reconstruction with transformations reflecting different demographic characteristics, such as gender and ethnicity (e.g., a person from Africa, Asia, Europe). Another feature of our system, which we call Collective Face, generates an averaged face representation from multiple participants' facial data. A comprehensive evaluation protocol is implemented to assess the realism and demographic accuracy of the transformations. Qualitative feedback is gathered through participant questionnaires, which include comparisons of MOD transformations with similar filters on platforms like Snapchat and TikTok. Additionally, quantitative analysis is conducted using a pretrained Convolutional Neural Network that predicts gender and ethnicity, to validate the accuracy of demographic transformations.

Via

Access Paper or Ask Questions

Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

Oct 28, 2024

Manuel Benavent-Lledo, David Mulero-Pérez, David Ortiz-Perez, Jose Garcia-Rodriguez, Antonis Argyros

Figure 1 for Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

Figure 2 for Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

Figure 3 for Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

Figure 4 for Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

Abstract:The sequential execution of actions and their hierarchical structure consisting of different levels of abstraction, provide features that remain unexplored in the task of action recognition. In this study, we present a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and prior actions to reflect the sequential context. To achieve this goal, we introduce a novel transformer architecture tailored for action recognition that utilizes both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse and fine-grained action recognition, thereby exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset to introduce action hierarchies, introducing the Hierarchical TSU dataset. We also conduct an ablation study to assess the impact of different methods for integrating contextual and hierarchical data on action recognition performance. Results show that the proposed approach outperforms pre-trained SOTA methods when trained with the same hyperparameters. Moreover, they also show a 17.12% improvement in top-1 accuracy over the equivalent fine-grained RGB version when using ground-truth contextual information, and a 5.33% improvement when contextual information is obtained from actual predictions.

Via

Access Paper or Ask Questions

D-PoSE: Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation

Oct 07, 2024

Nikolaos Vasilikopoulos, Drosakis Drosakis, Antonis Argyros

Abstract:We present D-PoSE (Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation), a one-stage method that estimates human pose and SMPL-X shape parameters from a single RGB image. Recent works use larger models with transformer backbones and decoders to improve the accuracy in human pose and shape (HPS) benchmarks. D-PoSE proposes a vision based approach that uses the estimated human depth-maps as an intermediate representation for HPS and leverages training with synthetic data and the ground-truth depth-maps provided with them for depth supervision during training. Although trained on synthetic datasets, D-PoSE achieves state-of-the-art performance on the real-world benchmark datasets, EMDB and 3DPW. Despite its simple lightweight design and the CNN backbone, it outperforms ViT-based models that have a number of parameters that is larger by almost an order of magnitude. D-PoSE code is available at: https://github.com/nvasilik/D-PoSE

Via

Access Paper or Ask Questions

ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers

Sep 11, 2024

Giorgos Savathrakis, Antonis Argyros

Abstract:Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. However, they require considerable computational resources due to the quadratic size of the attention weights. In this work, we propose to cluster the transformer input on the basis of its entropy. The reason for this is that the self-information of each pixel (whose sum is the entropy), is likely to be similar among pixels corresponding to the same objects. Clustering reduces the size of data given as input to the transformer and therefore reduces training time and GPU memory usage, while at the same time preserves meaningful information to be passed through the remaining parts of the network. The proposed process is organized in a module called ENACT, that can be plugged-in any transformer architecture that consists of a multi-head self-attention computation in its encoder. We ran extensive experiments using the COCO object detection dataset, and three detection transformers. The obtained results demonstrate that in all tested cases, there is consistent reduction in the required computational resources, while the precision of the detection task is only slightly reduced. The code of the ENACT module will become available at https://github.com/GSavathrakis/ENACT

Via

Access Paper or Ask Questions

Anticipating Object State Changes

May 21, 2024

Victoria Manousaki, Konstantinos Bacharidis, Filippos Gouidis, Konstantinos Papoutsakis, Dimitris Plexousakis, Antonis Argyros

Abstract:Anticipating object state changes in images and videos is a challenging problem whose solution has important implications in vision-based scene understanding, automated monitoring systems, and action planning. In this work, we propose the first method for solving this problem. The proposed method predicts object state changes that will occur in the near future as a result of yet unseen human actions. To address this new problem, we propose a novel framework that integrates learnt visual features that represent the recent visual information, with natural language (NLP) features that represent past object state changes and actions. Leveraging the extensive and challenging Ego4D dataset which provides a large-scale collection of first-person perspective videos across numerous interaction scenarios, we introduce new curated annotation data for the object state change anticipation task (OSCA), noted as Ego4D-OSCA. An extensive experimental evaluation was conducted that demonstrates the efficacy of the proposed method in predicting object state changes in dynamic scenarios. The proposed work underscores the potential of integrating video and linguistic cues to enhance the predictive performance of video understanding systems. Moreover, it lays the groundwork for future research on the new task of object state change anticipation. The source code and the new annotation data (Ego4D-OSCA) will be made publicly available.

* 9 pages, 3 figures

Via

Access Paper or Ask Questions

Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification

Mar 25, 2024

Filippos Gouidis, Katerina Papantoniou, Konstantinos Papoutsakis Theodore Patkos, Antonis Argyros, Dimitris Plexousakis

Figure 1 for Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification

Figure 2 for Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification

Figure 3 for Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification

Figure 4 for Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification

Abstract:Domain-specific knowledge can significantly contribute to addressing a wide variety of vision tasks. However, the generation of such knowledge entails considerable human labor and time costs. This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information through semantic embeddings. To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors in the context of the Vision-based Zero-shot Object State Classification task. We thoroughly examine the behavior of the LLM through an extensive ablation study. Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements. Drawing insights from this ablation study, we conduct a comparative analysis against competing models, thereby highlighting the state-of-the-art performance achieved by the proposed approach.

* Accepted at the AAAI-MAKE 24

Via

Access Paper or Ask Questions

Leveraging Knowledge Graphs for Zero-Shot Object-agnostic State Classification

Jul 22, 2023

Filipos Gouidis, Theodore Patkos, Antonis Argyros, Dimitris Plexousakis

Abstract:We investigate the problem of Object State Classification (OSC) as a zero-shot learning problem. Specifically, we propose the first Object-agnostic State Classification (OaSC) method that infers the state of a certain object without relying on the knowledge or the estimation of the object class. In that direction, we capitalize on Knowledge Graphs (KGs) for structuring and organizing knowledge, which, in combination with visual information, enable the inference of the states of objects in object/state pairs that have not been encountered in the method's training set. A series of experiments investigate the performance of the proposed method in various settings, against several hypotheses and in comparison with state of the art approaches for object attribute classification. The experimental results demonstrate that the knowledge of an object class is not decisive for the prediction of its state. Moreover, the proposed OaSC method outperforms existing methods in all datasets and benchmarks by a great margin.

Via

Access Paper or Ask Questions

TAPE: Temporal Attention-based Probabilistic human pose and shape Estimation

Apr 29, 2023

Nikolaos Vasilikopoulos, Nikos Kolotouros, Aggeliki Tsoli, Antonis Argyros

Abstract:Reconstructing 3D human pose and shape from monocular videos is a well-studied but challenging problem. Common challenges include occlusions, the inherent ambiguities in the 2D to 3D mapping and the computational complexity of video processing. Existing methods ignore the ambiguities of the reconstruction and provide a single deterministic estimate for the 3D pose. In order to address these issues, we present a Temporal Attention based Probabilistic human pose and shape Estimation method (TAPE) that operates on an RGB video. More specifically, we propose to use a neural network to encode video frames to temporal features using an attention-based neural network. Given these features, we output a per-frame but temporally-informed probability distribution for the human pose using Normalizing Flows. We show that TAPE outperforms state-of-the-art methods in standard benchmarks and serves as an effective video-based prior for optimization-based human pose and shape estimation. Code is available at: https: //github.com/nikosvasilik/TAPE

* Scandinavian Conference on Image Analysis (SCIA) 2023

Via

Access Paper or Ask Questions

Graphing the Future: Activity and Next Active Object Prediction using Graph-based Activity Representations

Sep 12, 2022

Victoria Manousaki, Konstantinos Papoutsakis, Antonis Argyros

Figure 1 for Graphing the Future: Activity and Next Active Object Prediction using Graph-based Activity Representations

Figure 2 for Graphing the Future: Activity and Next Active Object Prediction using Graph-based Activity Representations

Figure 3 for Graphing the Future: Activity and Next Active Object Prediction using Graph-based Activity Representations

Figure 4 for Graphing the Future: Activity and Next Active Object Prediction using Graph-based Activity Representations

Abstract:We present a novel approach for the visual prediction of human-object interactions in videos. Rather than forecasting the human and object motion or the future hand-object contact points, we aim at predicting (a)the class of the on-going human-object interaction and (b) the class(es) of the next active object(s) (NAOs), i.e., the object(s) that will be involved in the interaction in the near future as well as the time the interaction will occur. Graph matching relies on the efficient Graph Edit distance (GED) method. The experimental evaluation of the proposed approach was conducted using two well-established video datasets that contain human-object interactions, namely the MSR Daily Activities and the CAD120. High prediction accuracy was obtained for both action prediction and NAO forecasting.

* 13 pages, Conference: In Advances in Visual Computing (ISVC 2022), Springer, San Diego, USA, October 2022

Via

Access Paper or Ask Questions

Detecting Object States vs Detecting Objects: A New Dataset and a Quantitative Experimental Study

Dec 15, 2021

Filippos Gouidis, Theodoris Patkos, Antonis Argyros, Dimitris Plexousakis

Figure 1 for Detecting Object States vs Detecting Objects: A New Dataset and a Quantitative Experimental Study

Figure 2 for Detecting Object States vs Detecting Objects: A New Dataset and a Quantitative Experimental Study

Figure 3 for Detecting Object States vs Detecting Objects: A New Dataset and a Quantitative Experimental Study

Figure 4 for Detecting Object States vs Detecting Objects: A New Dataset and a Quantitative Experimental Study

Abstract:The detection of object states in images (State Detection - SD) is a problem of both theoretical and practical importance and it is tightly interwoven with other important computer vision problems, such as action recognition and affordance detection. It is also highly relevant to any entity that needs to reason and act in dynamic domains, such as robotic systems and intelligent agents. Despite its importance, up to now, the research on this problem has been limited. In this paper, we attempt a systematic study of the SD problem. First, we introduce the Object State Detection Dataset (OSDD), a new publicly available dataset consisting of more than 19,000 annotations for 18 object categories and 9 state classes. Second, using a standard deep learning framework used for Object Detection (OD), we conduct a number of appropriately designed experiments, towards an in-depth study of the behavior of the SD problem. This study enables the setup of a baseline on the performance of SD, as well as its relative performance in comparison to OD, in a variety of scenarios. Overall, the experimental outcomes confirm that SD is harder than OD and that tailored SD methods need to be developed for addressing effectively this significant problem.

Via

Access Paper or Ask Questions