Edward Fish

VALLR: Visual ASR Language Model for Lip Reading

Mar 27, 2025

Marshall Thomas, Edward Fish, Richard Bowden

Abstract: Lip reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. The task is especially challenging because no auditory information is available and because visually distinguishing phonemes is inherently ambiguous: phonemes with overlapping visemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently incurs high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for V-ASR that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing the task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly, often faltering on visually similar phonemes, or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, where our method significantly reduces Word Error Rate (WER), reaching a state-of-the-art WER of 18.7 on LRS3 despite using 99.4% less labelled data than the next best approach.
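
The abstract mentions a CTC head that emits a compact phoneme sequence. As a rough illustration of how per-frame CTC scores are usually turned into such a sequence, here is a minimal greedy-decoding sketch. It is not the authors' code: the function name, the blank index of 0, and the toy score layout are all assumptions made for this example.

```python
from typing import List, Optional, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Collapse per-frame scores into a label sequence with greedy CTC rules.

    For each frame take the highest-scoring class, merge consecutive
    repeats, then drop blanks. A blank between two identical labels keeps
    them as two separate emissions.
    """
    decoded: List[int] = []
    prev: Optional[int] = None
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != prev and best != blank:
            decoded.append(best)
        prev = best
    return decoded
```

With a hypothetical vocabulary such as ["<blank>", "HH", "AY"], the returned indices would be mapped to phoneme strings before being passed to the language model stage.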

PLOT-TAL -- Prompt Learning with Optimal Transport for Few-Shot Temporal Action Localization

Mar 27, 2024

Edward Fish, Jon Weinbren, Andrew Gilbert

Abstract: This paper introduces a novel approach to temporal action localization (TAL) in few-shot learning. Our work addresses the inherent limitations of conventional single-prompt learning methods, which often overfit because they cannot generalize across the varying contexts of real-world videos. Recognizing the diversity of camera views, backgrounds, and objects in videos, we propose a multi-prompt learning framework enhanced with optimal transport. This design allows the model to learn a set of diverse prompts for each action, capturing general characteristics more effectively and distributing the representation to mitigate the risk of overfitting. Furthermore, by employing optimal transport theory, we efficiently align these prompts with action features, optimizing for a comprehensive representation that adapts to the multifaceted nature of video data. Our experiments demonstrate significant improvements in action localization accuracy and robustness in few-shot settings on the challenging standard datasets THUMOS-14 and EpicKitchens100, highlighting the efficacy of our multi-prompt optimal transport approach in overcoming the challenges of conventional few-shot TAL methods.

* Under Review
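
The PLOT-TAL abstract describes aligning a set of prompts with action features through optimal transport. The sketch below shows one common way such an alignment is computed, entropy-regularised transport solved with Sinkhorn iterations. The abstract does not state which solver the authors use, so the choice of Sinkhorn, the uniform marginals, and the parameter names `eps` and `n_iters` are assumptions for illustration only.

```python
import math
from typing import List


def sinkhorn_plan(cost: List[List[float]],
                  eps: float = 0.5,
                  n_iters: int = 200) -> List[List[float]]:
    """Entropy-regularised transport plan between prompts (rows) and
    features (columns), assuming uniform mass on both sides."""
    m, n = len(cost), len(cost[0])
    a = [1.0 / m] * m                      # mass per prompt (assumed uniform)
    b = [1.0 / n] * n                      # mass per feature (assumed uniform)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * m, [1.0] * n
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(m)]
```

In a prompt-learning setting the cost would typically be something like one minus the cosine similarity between a prompt embedding and a feature, so low-cost pairs receive more of the transported mass.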

Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization

Oct 05, 2023

Edward Fish, Jon Weinbren, Andrew Gilbert

Abstract: Temporal Action Localization (TAL) aims to identify the start time, end time, and class label of each action in untrimmed videos. While recent advancements using transformer networks and Feature Pyramid Networks (FPN) have enhanced visual feature recognition in TAL tasks, less progress has been made in integrating audio features into such frameworks. This paper introduces Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), a method for merging audio-visual data across different temporal resolutions. Central to our approach is a hierarchical gated cross-attention mechanism, which weighs the importance of audio information at diverse temporal scales. This technique not only refines the precision of regression boundaries but also bolsters classification confidence. Importantly, MRAV-FF is versatile: it is compatible with existing FPN TAL architectures and offers a significant enhancement in performance when audio data is available.

* Under Review
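
The MRAV-FF abstract describes a hierarchical gated cross-attention mechanism applied at several temporal scales. The following is a simplified sketch of that idea, not the published architecture: it omits learned projections, uses a single head and one scalar gate per level, and builds the pyramid by average pooling with stride 2. All function names and these design choices are assumptions.

```python
import math
from typing import List

Sequence2D = List[List[float]]  # time steps x channels


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_cross_attention(video: Sequence2D, audio: Sequence2D,
                          gate_logit: float) -> Sequence2D:
    """Video frames query audio frames; a sigmoid gate scales the audio
    context before it is added back to the video features."""
    d = len(video[0])
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    fused = []
    for q in video:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in audio]
        weights = softmax(scores)
        context = [sum(w * frame[c] for w, frame in zip(weights, audio))
                   for c in range(d)]
        fused.append([qi + gate * ci for qi, ci in zip(q, context)])
    return fused


def average_pool(seq: Sequence2D, stride: int = 2) -> Sequence2D:
    """Halve the temporal resolution by averaging neighbouring frames."""
    pooled = []
    for start in range(0, len(seq), stride):
        window = seq[start:start + stride]
        pooled.append([sum(col) / len(window) for col in zip(*window)])
    return pooled


def fuse_pyramid(video: Sequence2D, audio: Sequence2D,
                 gate_logits: List[float]) -> List[Sequence2D]:
    """Apply gated fusion at each pyramid level, coarsening between levels."""
    outputs = []
    for gate_logit in gate_logits:
        outputs.append(gated_cross_attention(video, audio, gate_logit))
        video, audio = average_pool(video), average_pool(audio)
    return outputs
```

A strongly negative gate logit leaves the video features almost unchanged, which is one way a model could fall back to visual-only behaviour when audio is uninformative at a given scale.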

A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization

Jul 24, 2023

Edward Fish, Umberto Michieli, Mete Ozay

Abstract: Recent advances in Automatic Speech Recognition (ASR) have produced large AI models that are impractical to deploy on mobile devices. Model quantization is effective at producing compressed general-purpose models; however, such models may only be deployed to a restricted sub-domain of interest. We show that ASR models can be personalized during quantization while relying on just a small set of unlabelled samples from the target domain. To this end, we propose myQASR, a mixed-precision quantization method that generates tailored quantization schemes for diverse users under any memory requirement with no fine-tuning. myQASR automatically evaluates the quantization sensitivity of network layers by analysing the full-precision activation values. We are then able to generate a personalised mixed-precision quantization scheme for any pre-determined memory budget. Results for large-scale ASR models show how myQASR improves performance for specific genders, languages, and speakers.

* INTERSPEECH 2023
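
The myQASR abstract describes scoring each layer's quantization sensitivity and then producing a mixed-precision scheme that fits a given memory budget. The sketch below illustrates that general pattern with a simple greedy rule: lower the bit-width of the least sensitive layer first until the budget is met. The greedy rule, the bit-width range, and the assumption that sensitivity scores are already computed are hypothetical; the abstract does not specify the actual procedure.

```python
from typing import List


def allocate_bits(sensitivity: List[float],
                  param_counts: List[int],
                  budget_bits: int,
                  max_bits: int = 8,
                  min_bits: int = 2) -> List[int]:
    """Assign a bit-width per layer so that total weight storage fits
    within budget_bits, reducing the least sensitive layers first."""
    n = len(sensitivity)
    bits = [max_bits] * n
    # Least sensitive layers come first and lose precision first.
    order = sorted(range(n), key=lambda i: sensitivity[i])

    def total_bits() -> int:
        return sum(b * p for b, p in zip(bits, param_counts))

    while total_bits() > budget_bits:
        reducible = [i for i in order if bits[i] > min_bits]
        if not reducible:
            raise ValueError("budget cannot be met even at min_bits")
        bits[reducible[0]] -= 1
    return bits
```

In a personalised setting the sensitivity scores would be derived from activations on a few unlabelled samples from the target user, so two users with the same budget could end up with different bit-width assignments.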

Two-Stream Transformer Architecture for Long Video Understanding

Aug 02, 2022

Edward Fish, Jon Weinbren, Andrew Gilbert

Abstract: Pure vision transformer architectures are highly effective for short video classification and action recognition tasks. However, due to the quadratic complexity of self-attention and the lack of inductive bias, transformers are resource intensive and suffer from data inefficiencies. Long-form video understanding tasks amplify the data and memory efficiency problems of transformers, making current approaches infeasible in data- or memory-restricted domains. This paper introduces an efficient Spatio-Temporal Attention Network (STAN), which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features. Our proposed approach can classify videos up to two minutes in length on a single GPU, is data efficient, and achieves SOTA performance on several long video understanding tasks.
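
To make the quadratic cost of self-attention concrete, the short sketch below counts pairwise attention scores for a long clip. It is a back-of-the-envelope illustration only and does not describe STAN's internals. The function names are hypothetical, and any frame or patch counts passed in (for example 120 frames with 196 patches each) are assumptions rather than figures from the paper.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key scores one self-attention layer computes."""
    return num_tokens * num_tokens


def joint_cost(frames: int, patches_per_frame: int) -> int:
    """Every patch of every frame attends to every other patch."""
    return attention_pairs(frames * patches_per_frame)


def separated_cost(frames: int, patches_per_frame: int) -> int:
    """Attention within each frame, plus attention over one token per frame."""
    return frames * attention_pairs(patches_per_frame) + attention_pairs(frames)
```

Comparing `joint_cost(120, 196)` with `separated_cost(120, 196)` shows a gap of roughly two orders of magnitude, which gives a sense of why handling spatial and temporal context separately can make long videos tractable on a single GPU.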

Rethinking movie genre classification with fine-grained semantic clustering

Dec 07, 2020

Edward Fish, Andrew Gilbert, Jon Weinbren

Abstract: Movie genre classification is an active research area in machine learning. However, due to the limited labels available, there can be large semantic variations between movies within a single genre definition. We expand these 'coarse' genre labels by identifying 'fine-grained' semantic information within the multi-modal content of movies. By leveraging pre-trained 'expert' networks, we learn the influence of different combinations of modes for multi-label genre classification. Using a contrastive loss, we continue to fine-tune this 'coarse' genre classification network to identify high-level intertextual similarities between the movies across all genre labels. This leads to a more 'fine-grained' and detailed clustering based on semantic similarities, while still retaining some genre information. Our approach is demonstrated on a newly introduced multi-modal dataset of 8,800 movie trailers (37,866,450 frames), MMX-Trailer-20, which includes pre-computed audio, location, motion, and image embeddings.
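
The abstract above mentions fine-tuning with a contrastive loss but does not say which formulation is used. As one possible reading, here is a sketch of the classic margin-based pairwise contrastive loss over two trailer embeddings. The choice of this particular loss, the Euclidean distance, and the default margin of 1.0 are assumptions for illustration.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a: Sequence[float], b: Sequence[float],
                     similar: bool, margin: float = 1.0) -> float:
    """Pull similar pairs together; push dissimilar pairs at least
    `margin` apart. Pairs already beyond the margin contribute nothing."""
    distance = euclidean(a, b)
    if similar:
        return distance ** 2
    return max(0.0, margin - distance) ** 2
```

Minimising a loss of this kind over many pairs moves semantically related trailers closer in embedding space, which is the sort of behaviour that would support the finer-grained clustering the abstract describes.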
