Alan F. Smeaton

Dublin City University

Reinforcement Learning meets Masked Video Modeling : Trajectory-Guided Adaptive Token Selection

May 13, 2025

Ayush K. Rai, Kyle Min, Tarun Krishna, Feiyan Hu, Alan F. Smeaton, Noel E. O'Connor

Abstract: Masked video modeling (MVM) has emerged as a highly effective pre-training strategy for visual foundation models, whereby the model reconstructs masked spatiotemporal tokens using information from visible tokens. However, a key challenge in such approaches lies in selecting an appropriate masking strategy. Previous studies have explored predefined masking techniques, including random and tube-based masking, as well as approaches that leverage key motion priors, optical flow and semantic cues from externally pre-trained models. In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization (PPO). We show that our model allows for aggressive masking without compromising performance on the downstream task of action recognition while also ensuring that the pre-training remains memory efficient. Extensive experiments of the proposed approach across four benchmarks, including Something-Something v2, Kinetics-400, UCF101, and HMDB51, demonstrate the effectiveness, transferability, generalization, and efficiency of our work compared to other state-of-the-art methods.
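
To make the idea of selecting motion-centric tokens concrete, here is a minimal sketch. It is not TATS: the abstract describes a sampler trained jointly with the MAE using PPO, whereas this stand-in ranks token positions by a simple frame-difference heuristic. The function names, the scalar per-token features, and the top-k rule are all assumptions for illustration, and how the selected set is then used by the MAE is not specified in the abstract.

```python
from typing import List, Tuple


def motion_scores(frames: List[List[float]]) -> List[float]:
    """Mean absolute change of each token position across consecutive frames.

    `frames[t][i]` is a hypothetical scalar feature for token i at time t.
    """
    num_tokens = len(frames[0])
    scores = [0.0] * num_tokens
    for prev, curr in zip(frames, frames[1:]):
        for i in range(num_tokens):
            scores[i] += abs(curr[i] - prev[i])
    steps = max(len(frames) - 1, 1)
    return [s / steps for s in scores]


def select_motion_tokens(
    frames: List[List[float]], mask_ratio: float
) -> Tuple[List[int], List[int]]:
    """Split token indices into (selected, remaining) by motion score.

    A fraction (1 - mask_ratio) of positions with the highest motion score
    is selected; at least one position is always selected.
    """
    scores = motion_scores(frames)
    num_selected = max(1, round(len(scores) * (1.0 - mask_ratio)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:num_selected])
    remaining = sorted(ranked[num_selected:])
    return selected, remaining


# Toy clip: 3 frames, 4 token positions; only positions 1 and 3 change.
clip = [[0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 3.0], [0.0, 2.0, 0.0, 6.0]]
selected, remaining = select_motion_tokens(clip, mask_ratio=0.5)
```

In the toy clip the two moving positions are selected and the two static ones are left over. A learned sampler would replace the fixed heuristic in `motion_scores` with a trainable scoring function.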

The Effects of Grouped Structural Global Pruning of Vision Transformers on Domain Generalisation

Apr 05, 2025

Hamza Riaz, Alan F. Smeaton

Abstract: With the growing sizes of AI models like large language models (LLMs) and vision transformers, deploying them on devices with limited computational resources is a significant challenge, particularly when addressing domain generalisation (DG) tasks. This paper introduces a novel grouped structural pruning method for pre-trained vision transformers (ViT, BeiT, and DeiT), evaluated on the PACS and Office-Home DG benchmarks. Our method uses dependency graph analysis to identify and remove redundant groups of neurons, weights, filters, or attention heads within transformers, using a range of selection metrics. Grouped structural pruning is applied at pruning ratios of 50%, 75% and 95% and the models are then fine-tuned on selected distributions from DG benchmarks to evaluate their overall performance in DG tasks. Results show significant improvements in inference speed and fine-tuning time with minimal trade-offs in accuracy and DG task performance. For instance, on the PACS benchmark, pruning ViT, BeiT, and DeiT models by 50% using the Hessian metric resulted in accuracy drops of only -2.94%, -1.42%, and -1.72%, respectively, while achieving speed boosts of 2.5x, 1.81x, and 2.15x. These findings demonstrate the effectiveness of our approach in balancing model efficiency with domain generalisation performance.

* 9 pages
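
As a rough illustration of pruning at a global ratio, the sketch below ranks parameter groups by an importance score and drops the lowest-scoring fraction across the whole model. It is a simplification under stated assumptions: the group names and scores are hypothetical, and the dependency graph analysis and selection metrics (such as the Hessian metric) mentioned in the abstract are not implemented here.

```python
from typing import Dict, List


def prune_groups(importance: Dict[str, float], ratio: float) -> List[str]:
    """Return the groups kept after removing the least important ones.

    `importance` maps a hypothetical group name (e.g. an attention head) to a
    score from some selection metric. Groups are ranked globally, so layers
    can lose different numbers of groups.
    """
    num_remove = int(len(importance) * ratio)
    ranked = sorted(importance, key=lambda name: importance[name])  # least important first
    removed = set(ranked[:num_remove])
    # Preserve the original ordering of the surviving groups.
    return [name for name in importance if name not in removed]


# Hypothetical importance scores for four attention heads.
scores = {"head_0": 0.9, "head_1": 0.1, "head_2": 0.5, "head_3": 0.3}
kept_50 = prune_groups(scores, ratio=0.50)
kept_75 = prune_groups(scores, ratio=0.75)
```

In a real transformer, removing a group also means removing every parameter coupled to it in adjacent layers, which is the role the abstract assigns to dependency graph analysis.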

Resilience of Vision Transformers for Domain Generalisation in the Presence of Out-of-Distribution Noisy Images

Apr 05, 2025

Hamza Riaz, Alan F. Smeaton

Abstract: Modern AI models excel in controlled settings but often fail in real-world scenarios where data distributions shift unpredictably - a challenge known as domain generalisation (DG). This paper tackles this limitation by rigorously evaluating vision transformers, specifically the BEIT architecture, a model pre-trained with masked image modelling (MIM), against synthetic out-of-distribution (OOD) benchmarks designed to mimic real-world noise and occlusions. We introduce a novel framework to generate OOD test cases by strategically masking object regions in images using grid patterns (25%, 50%, 75% occlusion) and leveraging cutting-edge zero-shot segmentation via Segment Anything and Grounding DINO to ensure precise object localisation. Experiments across three benchmarks (PACS, Office-Home, DomainNet) demonstrate BEIT's known robustness: it maintains 94% accuracy on PACS and 87% on Office-Home despite significant occlusions, outperforming CNNs and other vision transformers by margins of up to 37%. Analysis of self-attention distances reveals that BEIT's dependence on global features correlates with its resilience. Furthermore, our synthetic benchmarks expose critical failure modes: performance degrades sharply when occlusions disrupt object shapes, e.g. a 68% drop for external grid masking vs. 22% for internal masking. This work provides two key advances: (1) a scalable method to generate OOD benchmarks using controllable noise, and (2) empirical evidence that MIM and the self-attention mechanism in vision transformers enhance DG by learning invariant features. These insights bridge the gap between lab-trained models and real-world deployment and offer a blueprint for building AI systems that generalise reliably under uncertainty.

* 31 pages
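
The grid-pattern occlusion described in the abstract can be sketched as follows. This is an illustrative assumption, not the paper's pipeline: the object box is given directly as input (the paper localises objects with Segment Anything and Grounding DINO), and the diagonal cell pattern, the function name, and the `cells` parameter are hypothetical choices.

```python
from typing import List, Tuple


def grid_occlusion_mask(
    height: int,
    width: int,
    box: Tuple[int, int, int, int],
    occlusion: float,
    cells: int = 4,
) -> List[List[int]]:
    """Build a height x width mask (1 = occluded) over an object box.

    The box (top, left, bottom, right) is split into a cells x cells grid and
    the same number of cells is occluded in every grid row, shifted by one
    column per row. Assumes each box side spans at least `cells` pixels.
    """
    top, left, bottom, right = box
    mask = [[0] * width for _ in range(height)]
    per_row = round(cells * occlusion)  # occluded cells in each grid row
    for y in range(top, bottom):
        row = (y - top) * cells // (bottom - top)
        for x in range(left, right):
            col = (x - left) * cells // (right - left)
            if (row + col) % cells < per_row:
                mask[y][x] = 1
    return mask


# 8x8 image with a hypothetical 4x4 object box in the centre, 50% occluded.
mask = grid_occlusion_mask(8, 8, (2, 2, 6, 6), occlusion=0.5)
occluded_pixels = sum(map(sum, mask))
```

With the default 4x4 grid, the three occlusion levels from the abstract (25%, 50%, 75%) map to one, two, and three occluded cells per grid row. Pixels outside the box are never touched, which corresponds to masking inside the object region only.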

Efficient Object-centric Representation Learning with Pre-trained Geometric Prior

Dec 16, 2024

Phúc H. Le Khac, Graham Healy, Alan F. Smeaton

Abstract: This paper addresses key challenges in object-centric representation learning of video. While existing approaches struggle with complex scenes, we propose a novel weakly-supervised framework that emphasises geometric understanding and leverages pre-trained vision models to enhance object discovery. Our method introduces an efficient slot decoder specifically designed for object-centric learning, enabling effective representation of multi-object scenes without requiring explicit depth information. Results on synthetic video benchmarks with increasing complexity in terms of objects and their movement, object occlusion and camera motion demonstrate that our approach achieves comparable performance to supervised methods while maintaining computational efficiency. This advances the field towards more practical applications in complex real-world scenarios.

* 6 pages, 4 Figures, 2 Tables

Generative Outpainting To Enhance the Memorability of Short-Form Videos

Nov 21, 2024

Alan Byju, Aman Sudhindra Ladwa, Lorin Sweeney, Alan F. Smeaton

Abstract: With the expanding use of the short-form video format in advertising, social media, entertainment, education and more, there is a need for such media to both captivate and be remembered. Video memorability indicates how likely a video is to be remembered by a viewer who has no emotional or personal connection with its content. This paper presents the results of using generative outpainting to expand the screen size of a short-form video with a view to improving its memorability. Advances in machine learning and deep learning are compared and leveraged to understand how extending the borders of video screen sizes can affect their memorability to viewers. Using quantitative evaluation we determine the best-performing model for outpainting and the impact of outpainting based on image saliency on video memorability scores.

Capturing Bias Diversity in LLMs

Oct 09, 2024

Purva Prasad Gosavi, Vaishnavi Murlidhar Kulkarni, Alan F. Smeaton

Abstract: This paper presents research on enhancements to Large Language Models (LLMs) through the addition of diversity in their generated outputs. Our study introduces a configuration of multiple LLMs which demonstrates the diversity possible with a single LLM. By developing multiple customised instances of a GPT model, each reflecting biases in specific demographic characteristics including gender, age, and race, we propose, develop and evaluate a framework for a more nuanced and representative AI dialogue which we call BiasGPT. The customised GPT models will ultimately collaborate, merging their diverse perspectives on a topic into an integrated response that captures a broad spectrum of human experiences and viewpoints. In this paper, through experiments, we demonstrate the capabilities of a GPT model to embed different biases which, when combined, can open the possibilities of more inclusive AI technologies.

* 2nd International Conference on Foundation and Large Language Models (FLLM2024), 26-29 November, 2024 | Dubai, UAE

Understanding Foundation Models: Are We Back in 1924?

Sep 11, 2024

Alan F. Smeaton

Abstract: This position paper explores the rapid development of Foundation Models (FMs) in AI and their implications for intelligence and reasoning. It examines the characteristics of FMs, including their training on vast datasets and use of embedding spaces to capture semantic relationships. The paper discusses recent advancements in FMs' reasoning abilities which we argue cannot be attributed to increased model size but to novel training techniques which yield learning phenomena like grokking. It also addresses the challenges in benchmarking FMs and compares their structure to the human brain. We argue that while FMs show promising developments in reasoning and knowledge representation, understanding their inner workings remains a significant challenge, similar to ongoing efforts in neuroscience to comprehend human brain function. Despite having some similarities, fundamental differences between FMs and the structure of the human brain warn us against making direct comparisons or expecting neuroscience to provide immediate insights into FM function.

* 7 pages, 4 Figures, to appear in Proceedings of the 2nd International Conference on Foundation and Large Language Models (FLLM2024) 26-29 November, 2024, Dubai, UAE

A Review of Multi-Modal Large Language and Vision Models

Mar 28, 2024

Kilian Carolan, Laura Fennelly, Alan F. Smeaton

Abstract: Large Language Models (LLMs) have recently emerged as a focal point of research and application, driven by their unprecedented ability to understand and generate text with human-like quality. Even more recently, LLMs have been extended into multi-modal large language models (MM-LLMs) which extend their capabilities to deal with image, video and audio information, in addition to text. This opens up applications like text-to-video generation, image captioning, text-to-speech, and more, and is achieved either by retro-fitting an LLM with multi-modal capabilities, or building an MM-LLM from scratch. This paper provides an extensive review of the current state of those LLMs with multi-modal capabilities as well as the very recent MM-LLMs. It covers the historical development of LLMs, especially the advances enabled by transformer-based architectures like OpenAI's GPT series and Google's BERT, as well as the role of attention mechanisms in enhancing model performance. The paper includes coverage of the major and most important of the LLMs and MM-LLMs and also covers the techniques of model tuning, including fine-tuning and prompt engineering, which tailor pre-trained models to specific tasks or domains. Ethical considerations and challenges, such as data bias and model misuse, are also analysed to underscore the importance of responsible AI development and deployment. Finally, we discuss the implications of open-source versus proprietary models in AI research. Through this review, we provide insights into the transformative potential of MM-LLMs in various applications.

* 33 pages, 1 figure

A Systematic Review of Available Datasets in Additive Manufacturing

Jan 27, 2024

Xiao Liu, Alessandra Mileo, Alan F. Smeaton

Abstract: In-situ monitoring incorporating data from visual and other sensor technologies allows the collection of extensive datasets during the Additive Manufacturing (AM) process. These datasets have potential for determining the quality of the manufactured output and the detection of defects through the use of Machine Learning during the manufacturing process. Open and annotated datasets derived from AM processes are necessary for the machine learning community to address this opportunity, and their scarcity creates difficulties in the application of computer vision-related machine learning in AM. This systematic review investigates the availability of open image-based datasets originating from AM processes that align with a number of pre-defined selection criteria. The review identifies existing gaps among the current image-based datasets in the domain of AM, and points to the need for greater availability of open datasets in order to allow quality assessment and defect detection during additive manufacturing to develop.

* 24 pages

Lifelogging As An Extreme Form of Personal Information Management -- What Lessons To Learn

Jan 11, 2024

Ly-Duyen Tran, Cathal Gurrin, Alan F. Smeaton

Abstract: Personal data includes the digital footprints that we leave behind as part of our everyday activities, both online and offline in the real world. It includes data we collect ourselves, such as from wearables, as well as the data collected by others about our online behaviour and activities. Sometimes we are able to use the personal data we ourselves collect in order to examine some parts of our lives, but for the most part our personal data is leveraged by third parties, including internet companies, for services like targeted advertising and recommendations. Lifelogging is a form of extreme personal data gathering and in this article we present an overview of the tools used to manage access to lifelogs as demonstrated at the most recent of the annual Lifelog Search Challenge benchmarking workshops. Here, experimental systems are showcased in live, real-time information seeking tasks by real users. This overview of these systems' capabilities shows the range of possibilities for accessing our own personal data which may, in time, become more easily available as consumer-level services.

* IEEE Data Engineering Bulletin 47 (4), 18-29, 2023
