Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Joao Neves

Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding

Mar 26, 2025

Joao Pereira, Vasco Lopes, David Semedo, Joao Neves

Abstract:Large Vision-Language Models (LVLMs) demonstrate remarkable performance in short-video tasks such as video question answering, but struggle in long-video understanding. The linear frame sampling strategy, conventionally used by LVLMs, fails to account for the non-linear distribution of key events in video data, often introducing redundant or irrelevant information in longer contexts while risking the omission of critical events in shorter ones. To address this, we propose SelfReS, a non-linear spatiotemporal self-reflective sampling method that dynamically selects key video fragments based on user prompts. Unlike prior approaches, SelfReS leverages the inherently sparse attention maps of LVLMs to define reflection tokens, enabling relevance-aware token selection without requiring additional training or external modules. Experiments demonstrate that SelfReS can be seamlessly integrated into strong base LVLMs, improving long-video task accuracy and achieving up to 46% faster inference speed within the same GPU memory budget.

Via

Access Paper or Ask Questions

Zero-Shot Action Recognition in Surveillance Videos

Oct 28, 2024

Joao Pereira, Vasco Lopes, David Semedo, Joao Neves

Figure 1 for Zero-Shot Action Recognition in Surveillance Videos

Figure 2 for Zero-Shot Action Recognition in Surveillance Videos

Figure 3 for Zero-Shot Action Recognition in Surveillance Videos

Figure 4 for Zero-Shot Action Recognition in Surveillance Videos

Abstract:The growing demand for surveillance in public spaces presents significant challenges due to the shortage of human resources. Current AI-based video surveillance systems heavily rely on core computer vision models that require extensive finetuning, which is particularly difficult in surveillance settings due to limited datasets and difficult setting (viewpoint, low quality, etc.). In this work, we propose leveraging Large Vision-Language Models (LVLMs), known for their strong zero and few-shot generalization, to tackle video understanding tasks in surveillance. Specifically, we explore VideoLLaMA2, a state-of-the-art LVLM, and an improved token-level sampling method, Self-Reflective Sampling (Self-ReS). Our experiments on the UCF-Crime dataset show that VideoLLaMA2 represents a significant leap in zero-shot performance, with 20% boost over the baseline. Self-ReS additionally increases zero-shot action recognition performance to 44.6%. These results highlight the potential of LVLMs, paired with improved sampling techniques, for advancing surveillance video analysis in diverse scenarios.

Via

Access Paper or Ask Questions

A Laplacian-based Quantum Graph Neural Network for Semi-Supervised Learning

Aug 13, 2024

Hamed Gholipour, Farid Bozorgnia, Kailash Hambarde, Hamzeh Mohammadigheymasi, Javier Mancilla, Andre Sequeira, Joao Neves, Hugo Proença

Figure 1 for A Laplacian-based Quantum Graph Neural Network for Semi-Supervised Learning

Figure 2 for A Laplacian-based Quantum Graph Neural Network for Semi-Supervised Learning

Figure 3 for A Laplacian-based Quantum Graph Neural Network for Semi-Supervised Learning

Figure 4 for A Laplacian-based Quantum Graph Neural Network for Semi-Supervised Learning

Abstract:Laplacian learning method is a well-established technique in classical graph-based semi-supervised learning, but its potential in the quantum domain remains largely unexplored. This study investigates the performance of the Laplacian-based Quantum Semi-Supervised Learning (QSSL) method across four benchmark datasets -- Iris, Wine, Breast Cancer Wisconsin, and Heart Disease. Further analysis explores the impact of increasing Qubit counts, revealing that adding more Qubits to a quantum system doesn't always improve performance. The effectiveness of additional Qubits depends on the quantum algorithm and how well it matches the dataset. Additionally, we examine the effects of varying entangling layers on entanglement entropy and test accuracy. The performance of Laplacian learning is highly dependent on the number of entangling layers, with optimal configurations varying across different datasets. Typically, moderate levels of entanglement offer the best balance between model complexity and generalization capabilities. These observations highlight the crucial need for precise hyperparameter tuning tailored to each dataset to achieve optimal performance in Laplacian learning methods.

Via

Access Paper or Ask Questions