Abstract: Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late fusion techniques, which can hinder scalability and flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. This method treats all inputs and outputs as sequential frames in a video, enabling seamless integration of modalities and effective knowledge transfer across tasks. Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text, demonstrating the model's ability to generalize across modalities with minimal adaptation. We show that task reformulation can significantly simplify multimodal model design across various tasks, laying the groundwork for more generalized multimodal foundation models.
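To make the reformulation concrete, the sketch below shows one way heterogeneous inputs could be rendered as a single sequence of fixed-size frames for a next-frame predictor; the frame resolution, the toy text rasterisation, and the stand-in predictor are illustrative assumptions rather than the paper's actual pipeline.

```python
# Minimal sketch (assumed names and sizes): reformulating heterogeneous inputs
# as one sequence of fixed-size "frames" so a single next-frame predictor can
# serve text, image, video, and audio tasks alike.
import numpy as np

FRAME_H, FRAME_W = 64, 64  # assumed common frame resolution

def text_to_frames(text: str) -> np.ndarray:
    """Toy rasterisation: write character codes into frames as pixel intensities."""
    chunk = FRAME_H * FRAME_W
    codes = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    padded = np.zeros(((len(codes) + chunk - 1) // chunk) * chunk, dtype=np.uint8)
    padded[: len(codes)] = codes
    return padded.reshape(-1, FRAME_H, FRAME_W).astype(np.float32) / 255.0

def video_to_frames(video: np.ndarray) -> np.ndarray:
    """A video (or a single image) is already a frame sequence; just normalise."""
    return video.astype(np.float32) / 255.0

def predict_next_frame(frames: np.ndarray) -> np.ndarray:
    """Stand-in for the unified sequence model: simply repeats the last frame.
    A real model would be an autoregressive video predictor."""
    return frames[-1]

# Example: an image-to-text task becomes "question frames + image frame -> answer frame(s)".
question = text_to_frames("What colour is the car?")
image = video_to_frames(np.random.randint(0, 256, (1, FRAME_H, FRAME_W)))
sequence = np.concatenate([question, image], axis=0)
print(predict_next_frame(sequence).shape)  # (64, 64): the predicted (answer) frame
```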
Abstract: Bilinear pooling (BLP) refers to a family of fusion operations, developed predominantly for VQA models, that combine features from different modalities. A bilinear (outer-product) expansion is thought to encourage models to learn interactions between two feature spaces, and it has experimentally outperformed `simpler' vector operations (concatenation and element-wise addition/multiplication) on VQA benchmarks. Successive BLP techniques have yielded higher performance with lower computational expense and are often implemented alongside attention mechanisms. However, despite significant progress in VQA, BLP methods have not been widely applied to more recently explored video question answering (video-QA) tasks. In this paper, we begin to bridge this research gap by applying BLP techniques to various video-QA benchmarks, namely TVQA, TGIF-QA, Ego-VQA and MSVD-QA. We share our results on the TVQA baseline model and the recently proposed heterogeneous-memory-enhanced multimodal attention (HME) model. Our experiments include both simply replacing feature concatenation in the existing models with BLP, and a modified version of the TVQA baseline, which we name the `dual-stream' model, designed to accommodate BLP. We find that our relatively simple integration of BLP does not increase, and mostly harms, performance on these video-QA benchmarks. Using recently proposed theoretical multimodal fusion taxonomies, we offer insight into why BLP-driven performance gains may be more difficult to achieve for video-QA benchmarks than in earlier VQA models. We suggest a few additional `best practices' to consider when applying BLP to video-QA. We stress that video-QA models should carefully consider where the complex representational potential of BLP is actually needed, in order to avoid computational expense on `redundant' fusion.
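For reference, the sketch below contrasts the `simpler' vector operations with a full bilinear (outer-product) fusion of two modality feature vectors; the batch size, feature dimensions, and projection layers are assumed for illustration and do not correspond to any particular BLP variant or model in the paper.

```python
# Illustrative comparison of simple fusion operators and full bilinear pooling
# for two modality feature vectors, e.g. a question encoding and a video encoding.
import torch

B, D_Q, D_V, D_OUT = 32, 128, 256, 256  # assumed batch and feature sizes
q = torch.randn(B, D_Q)   # textual features
v = torch.randn(B, D_V)   # visual features

# `Simpler' vector operations (addition/multiplication need a shared dimension).
proj_q = torch.nn.Linear(D_Q, D_OUT)
proj_v = torch.nn.Linear(D_V, D_OUT)
fused_concat = torch.cat([q, v], dim=-1)   # (B, D_Q + D_V)
fused_add = proj_q(q) + proj_v(v)          # (B, D_OUT)
fused_mul = proj_q(q) * proj_v(v)          # (B, D_OUT)

# Full bilinear pooling: the outer product captures every pairwise interaction
# between the two feature spaces, then a learned projection maps it down.
outer = torch.einsum("bi,bj->bij", q, v).reshape(B, -1)  # (B, D_Q * D_V)
bilinear_proj = torch.nn.Linear(D_Q * D_V, D_OUT)        # D_Q * D_V weights per output
fused_bilinear = bilinear_proj(outer)                    # (B, D_OUT)
```

The quadratic D_Q * D_V parameter cost of that final projection is what successive compact and low-rank BLP variants are designed to approximate more cheaply.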
Abstract: TVQA is a large-scale video question answering (video-QA) dataset based on popular TV shows. The questions were specifically designed to require "both vision and language understanding to answer". In this work, we demonstrate an inherent bias in the dataset towards the textual subtitle modality. We infer this bias both directly and indirectly, notably finding that models trained with subtitles learn, on average, to suppress video feature contribution. Our results demonstrate that models trained on only the visual information can answer ~45% of the questions, while using only the subtitles achieves ~68%. We find that a bilinear-pooling-based joint representation of the modalities damages model performance by 9%, implying a reliance on modality-specific information. We also show that TVQA fails to benefit from the RUBi modality bias reduction technique popularised in VQA. By simply improving text processing using BERT embeddings in the simple model first proposed for TVQA, we achieve state-of-the-art results (72.13%), surpassing the highly complex STAGE model (70.50%). We recommend a multimodal evaluation framework that can highlight biases in models and isolate visually and textually reliant subsets of data. Using this framework, we propose subsets of TVQA that respond exclusively to either or both modalities, in order to facilitate the multimodal modelling that TVQA originally intended.
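As a rough illustration of the kind of evaluation framework recommended above, the sketch below isolates visually reliant, textually reliant, and jointly answerable question subsets from the per-question correctness of two single-modality models; the correctness arrays are randomly generated stand-ins and the variable names are our own, not the paper's released code.

```python
# Sketch: derive modality-reliant subsets from per-question correctness of a
# video-only model and a subtitle-only model (both arrays are synthetic here).
import numpy as np

rng = np.random.default_rng(0)
n_questions = 1000
video_only_correct = rng.random(n_questions) < 0.45  # hypothetical video-only accuracy
subs_only_correct = rng.random(n_questions) < 0.68   # hypothetical subtitle-only accuracy

visual_reliant = video_only_correct & ~subs_only_correct   # answered only with vision
textual_reliant = subs_only_correct & ~video_only_correct  # answered only with subtitles
either_modality = video_only_correct & subs_only_correct   # either modality suffices

print(f"visually reliant subset:  {visual_reliant.sum()} questions")
print(f"textually reliant subset: {textual_reliant.sum()} questions")
print(f"answerable by either:     {either_modality.sum()} questions")
```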
Abstract: We consider the task of incorporating real-world commonsense knowledge into deep Natural Language Inference (NLI) models. Existing external knowledge incorporation methods are limited to lexical-level knowledge and lack generalization across NLI models, datasets, and commonsense knowledge sources. To address these issues, we propose BiCAM, a novel NLI model-independent neural framework that incorporates real-world commonsense knowledge into NLI models. Combined with convolutional feature detectors and bilinear feature fusion, BiCAM provides a conceptually simple mechanism that generalizes well. Quantitative evaluations with two state-of-the-art NLI baselines on the SNLI and SciTail datasets, in conjunction with the ConceptNet and Aristo Tuple knowledge graphs (KGs), show that BiCAM considerably improves the accuracy of the incorporated NLI baselines. For example, our BiECAM model, an instance of BiCAM, improves the accuracy of the incorporated baselines on the challenging SciTail dataset by 7.0% with ConceptNet and 8.0% with the Aristo Tuple KG.
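As a rough illustration of bilinear feature fusion in this knowledge-incorporation setting, the sketch below combines a sentence representation with a pooled commonsense-knowledge representation through a learned bilinear layer; the feature sizes and variable names are assumptions for illustration, not the BiCAM implementation itself.

```python
# Sketch: fuse an NLI sentence encoding with an external knowledge encoding so
# that every pair of sentence/knowledge feature dimensions can interact.
import torch

B, D_SENT, D_KNOW, D_FUSED = 16, 300, 100, 200  # assumed feature sizes
sentence_feats = torch.randn(B, D_SENT)   # e.g. from the NLI baseline's encoder
knowledge_feats = torch.randn(B, D_KNOW)  # e.g. pooled embeddings of retrieved KG triples

# torch.nn.Bilinear computes x1^T W_k x2 + b_k for each of the D_FUSED outputs.
bilinear_fusion = torch.nn.Bilinear(D_SENT, D_KNOW, D_FUSED)
fused = bilinear_fusion(sentence_feats, knowledge_feats)  # (B, D_FUSED)

# The fused representation would then feed the baseline's downstream classifier
# (entailment / contradiction / neutral).
print(fused.shape)
```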