Vasco Lopes

Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding

Mar 26, 2025

Joao Pereira, Vasco Lopes, David Semedo, Joao Neves

Abstract:Large Vision-Language Models (LVLMs) demonstrate remarkable performance in short-video tasks such as video question answering, but struggle in long-video understanding. The linear frame sampling strategy, conventionally used by LVLMs, fails to account for the non-linear distribution of key events in video data, often introducing redundant or irrelevant information in longer contexts while risking the omission of critical events in shorter ones. To address this, we propose SelfReS, a non-linear spatiotemporal self-reflective sampling method that dynamically selects key video fragments based on user prompts. Unlike prior approaches, SelfReS leverages the inherently sparse attention maps of LVLMs to define reflection tokens, enabling relevance-aware token selection without requiring additional training or external modules. Experiments demonstrate that SelfReS can be seamlessly integrated into strong base LVLMs, improving long-video task accuracy and achieving up to 46% faster inference speed within the same GPU memory budget.
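The contrast the abstract draws between linear frame sampling and relevance-aware selection can be illustrated with a toy sketch. This is not the SelfReS implementation: the `scores` list is a made-up stand-in for per-frame relevance (the paper derives relevance from the LVLM's attention maps), and both function names are hypothetical.

```python
def uniform_frame_indices(num_frames: int, budget: int) -> list[int]:
    """Linear sampling: pick `budget` evenly spaced frame indices."""
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(i * step) for i in range(budget)]


def top_k_indices(scores: list[float], k: int) -> list[int]:
    """Relevance-aware selection: keep the k highest-scoring positions,
    returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical relevance scores for a 10-frame clip (illustrative values only).
scores = [0.01, 0.02, 0.90, 0.85, 0.03, 0.01, 0.02, 0.70, 0.01, 0.02]

linear = uniform_frame_indices(len(scores), 3)   # ignores content
selected = top_k_indices(scores, 3)              # follows the scores

print("linear:", linear)
print("relevance-aware:", selected)
```

With these toy scores, the evenly spaced indices miss two of the three high-relevance frames, while the score-based selection keeps all three under the same budget.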

Zero-Shot Action Recognition in Surveillance Videos

Oct 28, 2024

Joao Pereira, Vasco Lopes, David Semedo, Joao Neves

Abstract:The growing demand for surveillance in public spaces presents significant challenges due to the shortage of human resources. Current AI-based video surveillance systems rely heavily on core computer vision models that require extensive finetuning, which is particularly difficult in surveillance settings due to limited datasets and difficult conditions (viewpoint, low quality, etc.). In this work, we propose leveraging Large Vision-Language Models (LVLMs), known for their strong zero- and few-shot generalization, to tackle video understanding tasks in surveillance. Specifically, we explore VideoLLaMA2, a state-of-the-art LVLM, and an improved token-level sampling method, Self-Reflective Sampling (Self-ReS). Our experiments on the UCF-Crime dataset show that VideoLLaMA2 represents a significant leap in zero-shot performance, with a 20% boost over the baseline. Self-ReS additionally increases zero-shot action recognition performance to 44.6%. These results highlight the potential of LVLMs, paired with improved sampling techniques, for advancing surveillance video analysis in diverse scenarios.

Are Neural Architecture Search Benchmarks Well Designed? A Deeper Look Into Operation Importance

Mar 29, 2023

Vasco Lopes, Bruno Degardin, Luís A. Alexandre

Abstract:Neural Architecture Search (NAS) benchmarks have significantly improved the capability of developing and comparing NAS methods while drastically reducing the computational overhead by providing meta-information about thousands of trained neural networks. However, tabular benchmarks have several drawbacks that can hinder fair comparisons and provide unreliable results. These usually focus on providing a small pool of operations in heavily constrained search spaces -- usually cell-based neural networks with pre-defined outer-skeletons. In this work, we conducted an empirical analysis of the widely used NAS-Bench-101, NAS-Bench-201 and TransNAS-Bench-101 benchmarks in terms of their generability and how different operations influence the performance of the generated architectures. We found that only a subset of the operation pool is required to generate architectures close to the upper-bound of the performance range. Also, the performance distribution is negatively skewed, having a higher density of architectures in the upper-bound range. We consistently found convolution layers to have the highest impact on the architecture's performance, and that specific combinations of operations favor top-scoring architectures. These findings offer insights into the correct evaluation and comparison of NAS methods using NAS benchmarks, showing that directly searching on NAS-Bench-201, ImageNet16-120 and TransNAS-Bench-101 produces more reliable results than searching only on CIFAR-10. Furthermore, with this work we provide suggestions for future benchmark evaluations and design. The code used to conduct the evaluations is available at https://github.com/VascoLopes/NAS-Benchmark-Evaluation.

* 15 pages; 11 figures; 10 tables

Towards Less Constrained Macro-Neural Architecture Search

Mar 10, 2022

Vasco Lopes, Luís A. Alexandre

Abstract:Networks found with Neural Architecture Search (NAS) achieve state-of-the-art performance in a variety of tasks, out-performing human-designed networks. However, most NAS methods heavily rely on human-defined assumptions that constrain the search: architecture's outer-skeletons, number of layers, parameter heuristics and search spaces. Additionally, common search spaces consist of repeatable modules (cells) instead of fully exploring the architecture's search space by designing entire architectures (macro-search). Imposing such constraints requires deep human expertise and restricts the search to pre-defined settings. In this paper, we propose LCMNAS, a method that pushes NAS to less constrained search spaces by performing macro-search without relying on pre-defined heuristics or bounded search spaces. LCMNAS introduces three components for the NAS pipeline: i) a method that leverages information about well-known architectures to autonomously generate complex search spaces based on Weighted Directed Graphs with hidden properties, ii) an evolutionary search strategy that generates complete architectures from scratch, and iii) a mixed-performance estimation approach that combines information about architectures at initialization stage and lower fidelity estimates to infer their trainability and capacity to model complex functions. We present experiments showing that LCMNAS generates state-of-the-art architectures from scratch with minimal GPU computation. We study the importance of different NAS components on a macro-search setting. Code for reproducibility is public at https://github.com/VascoLopes/LCMNAS.

* 8 pages double-column, 6 tables, 3 figures

Guided Evolution for Neural Architecture Search

Oct 28, 2021

Vasco Lopes, Miguel Santos, Bruno Degardin, Luís A. Alexandre

Abstract:Neural Architecture Search (NAS) methods have been successfully applied to image tasks with excellent results. However, NAS methods are often complex and tend to converge to local minima as soon as generated architectures seem to yield good results. In this paper, we propose G-EA, a novel approach for guided evolutionary NAS. The rationale behind G-EA is to explore the search space by generating and evaluating several architectures in each generation at initialization stage using a zero-proxy estimator, where only the highest-scoring network is trained and kept for the next generation. This evaluation at initialization stage allows continuous extraction of knowledge from the search space without increasing computation, thus allowing the search to be efficiently guided. Moreover, G-EA forces exploitation of the most performant networks by descendant generation while at the same time forcing exploration by parent mutation and by favouring younger architectures to the detriment of older ones. Experimental results demonstrate the effectiveness of the proposed method, showing that G-EA achieves state-of-the-art results in the NAS-Bench-201 search space on CIFAR-10, CIFAR-100 and ImageNet16-120, with mean accuracies of 93.98%, 72.12% and 45.94% respectively.

* Paper accepted at 35th Conference on Neural Information Processing Systems (NeurIPS) - New In ML. 9 pages, 2 figures, 1 table
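The loop described in the abstract (score several candidates cheaply at initialization, train only the best one, and favour younger architectures) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the G-EA code: the operation pool, `proxy_score`, and `train_and_eval` are hypothetical stand-ins, and aging is approximated by dropping the oldest member each generation.

```python
import random
from collections import deque

OPS = ["conv3x3", "conv1x1", "skip", "avgpool", "none"]  # illustrative operation pool


def random_arch(rng, n_edges=6):
    """Sample a toy architecture: one operation per edge."""
    return tuple(rng.choice(OPS) for _ in range(n_edges))


def mutate(arch, rng):
    """Replace the operation on one randomly chosen edge."""
    i = rng.randrange(len(arch))
    child = list(arch)
    child[i] = rng.choice(OPS)
    return tuple(child)


def proxy_score(arch):
    """Hypothetical stand-in for a zero-cost estimator at initialization."""
    return arch.count("conv3x3") + 0.5 * arch.count("conv1x1")


def train_and_eval(arch):
    """Hypothetical stand-in for real training; returns a toy accuracy in [0, 1]."""
    return proxy_score(arch) / len(arch)


def guided_evolution(generations=20, pop_size=5, n_children=4, seed=0):
    rng = random.Random(seed)
    population = deque()
    for _ in range(pop_size):
        arch = random_arch(rng)
        population.append((arch, train_and_eval(arch)))
    best = max(population, key=lambda p: p[1])
    for _ in range(generations):
        parent = max(population, key=lambda p: p[1])[0]      # exploit the best member
        children = [mutate(parent, rng) for _ in range(n_children)]
        chosen = max(children, key=proxy_score)              # cheap scoring, no training
        entry = (chosen, train_and_eval(chosen))             # only one child is trained
        population.append(entry)
        population.popleft()                                 # drop the oldest member
        best = max(best, entry, key=lambda p: p[1])
    return best, len(population)


best, size = guided_evolution()
print(best, size)
```

The point of the sketch is the cost profile: each generation evaluates `n_children` candidates with the cheap proxy but calls the expensive `train_and_eval` only once.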

Generative Adversarial Graph Convolutional Networks for Human Action Synthesis

Oct 25, 2021

Bruno Degardin, João Neves, Vasco Lopes, João Brito, Ehsan Yaghoubi, Hugo Proença

Abstract:Synthesising the spatial and temporal dynamics of the human body skeleton remains a challenging task, not only in terms of the quality of the generated shapes, but also of their diversity, particularly to synthesise realistic body movements of a specific action (action conditioning). In this paper, we propose Kinetic-GAN, a novel architecture that leverages the benefits of Generative Adversarial Networks and Graph Convolutional Networks to synthesise the kinetics of the human body. The proposed adversarial architecture can condition up to 120 different actions over local and global body movements while improving sample quality and diversity through latent space disentanglement and stochastic variations. Our experiments were carried out on three well-known datasets, where Kinetic-GAN notably surpasses the state-of-the-art methods in terms of distribution quality metrics while being able to synthesise more than an order of magnitude more distinct actions. Our code and models are publicly available at https://github.com/DegardinBruno/Kinetic-GAN.

* Published as a conference paper at WACV 2022. Code and pretrained models available at https://github.com/DegardinBruno/Kinetic-GAN

REGINA - Reasoning Graph Convolutional Networks in Human Action Recognition

May 14, 2021

Bruno Degardin, Vasco Lopes, Hugo Proença

Abstract:It is known that the kinematics of the human body skeleton reveals valuable information in action recognition. Recently, modeling skeletons as spatio-temporal graphs with Graph Convolutional Networks (GCNs) has been reported to solidly advance the state-of-the-art performance. However, GCN-based approaches exclusively learn from raw skeleton data, and are expected to extract the inherent structural information on their own. This paper describes REGINA, a novel approach to REasoning Graph convolutional networks IN Human Action recognition. The rationale is to provide the GCNs with additional knowledge about the skeleton data, obtained by handcrafted features, in order to facilitate the learning process, while guaranteeing that it remains fully trainable in an end-to-end manner. The challenge is to capture complementary information over the dynamics between consecutive frames, which is the key information extracted by state-of-the-art GCN techniques. Moreover, the proposed strategy can be easily integrated into existing GCN-based methods, which we also regard positively. Our experiments were carried out on well-known action recognition datasets and allow us to conclude that REGINA contributes to solid improvements in performance when incorporated into other GCN-based approaches, without any other adjustment to the original method. For reproducibility, the REGINA code and all the experiments carried out will be publicly available at https://github.com/DegardinBruno.

EPE-NAS: Efficient Performance Estimation Without Training for Neural Architecture Search

Feb 16, 2021

Vasco Lopes, Saeid Alirezazadeh, Luís A. Alexandre

Abstract:Neural Architecture Search (NAS) has shown excellent results in designing architectures for computer vision problems. NAS alleviates the need for human-defined settings by automating architecture design and engineering. However, NAS methods tend to be slow, as they require large amounts of GPU computation. This bottleneck is mainly due to the performance estimation strategy, which requires the evaluation of the generated architectures, mainly by training them, to update the sampler method. In this paper, we propose EPE-NAS, an efficient performance estimation strategy, that mitigates the problem of evaluating networks, by scoring untrained networks and creating a correlation with their trained performance. We perform this process by looking at intra and inter-class correlations of an untrained network. We show that EPE-NAS can produce a robust correlation and that by incorporating it into a simple random sampling strategy, we are able to search for competitive networks, without requiring any training, in a matter of seconds using a single GPU. Moreover, EPE-NAS is agnostic to the search method, since it focuses on the evaluation of untrained networks, making it easy to integrate into almost any NAS method.

An AutoML-based Approach to Multimodal Image Sentiment Analysis

Feb 16, 2021

Vasco Lopes, António Gaspar, Luís A. Alexandre, João Cordeiro

Abstract:Sentiment analysis is a research topic focused on analysing data to extract information related to the sentiment that it causes. Applications of sentiment analysis are wide, ranging from recommendation systems and marketing to customer satisfaction. Recent approaches evaluate textual content using Machine Learning techniques that are trained over large corpora. However, as social media has grown, other data types have emerged in large quantities, such as images. Sentiment analysis in images has been shown to be a valuable complement to textual data since it enables the inference of the underlying message polarity by creating context and connections. Multimodal sentiment analysis approaches intend to leverage information of both textual and image content to perform an evaluation. Despite recent advances, current solutions still flounder in combining both image and textual information to classify social media data, mainly due to subjectivity, inter-class homogeneity and fusion data differences. In this paper, we propose a method that combines both textual and image individual sentiment analysis into a final fused classification based on AutoML, that performs a random search to find the best model. Our method achieved state-of-the-art performance in the B-T4SA dataset, with 95.19% accuracy.

Auto-Classifier: A Robust Defect Detector Based on an AutoML Head

Sep 03, 2020

Vasco Lopes, Luís A. Alexandre

Abstract:The dominant approach for surface defect detection is the use of hand-crafted feature-based methods. However, this falls short when varying conditions affect the extracted images. So, in this paper, we sought to determine how well several state-of-the-art Convolutional Neural Networks perform in the task of surface defect detection. Moreover, we propose two methods: CNN-Fusion, which fuses the predictions of all the networks into a final one, and Auto-Classifier, which is a novel proposal that improves a Convolutional Neural Network by modifying its classification component using AutoML. We carried out experiments to evaluate the proposed methods in the task of surface defect detection using different datasets from DAGM2007. We show that the use of Convolutional Neural Networks achieves better results than traditional methods, and also that Auto-Classifier outperforms all other methods, achieving 100% accuracy and 100% AUC results throughout all the datasets.

* 12 pages, 2 figures. Published in ICONIP2020, proceedings published in the Springer's series of Lecture Notes in Computer Science
