Andrew Senior

Large-Scale Visual Speech Recognition

Oct 01, 2018

Brendan Shillingford, Yannis Assael, Matthew W. Hoffman, Thomas Paine, Cían Hughes, Utsav Prabhu, Hank Liao, Hasim Sak, Kanishka Rao, Lorrayne Bennett (+5 more)

Abstract: This work presents a scalable solution to open-vocabulary visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset when having access to additional types of contextual information. Our approach significantly improves on other lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which are only capable of 89.8% and 76.8% WER respectively.
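The results above are all reported as word error rate (WER). As a reading aid, here is a minimal sketch of how WER is conventionally computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration; the paper's own scoring pipeline is not described in the abstract and may normalise text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenisation
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER is not bounded by 100%: a hypothesis that is much longer than the reference can score above 1.0.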

Deep Audio-Visual Speech Recognition

Sep 06, 2018

Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

Abstract: The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss. Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models that we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.

Lip Reading Sentences in the Wild

Jan 30, 2017

Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

Abstract: The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS) network that learns to transcribe videos of mouth motion to characters; (2) a curriculum learning strategy to accelerate training and to reduce overfitting; (3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition, consisting of over 100,000 natural sentences from British television. The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin. This lip reading performance beats a professional lip reader on videos from BBC television, and we also demonstrate that visual information helps to improve speech recognition performance even when the audio is available.

Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes

Oct 27, 2016

Jack W Rae, Jonathan J Hunt, Tim Harley, Ivo Danihelka, Andrew Senior, Greg Wayne, Alex Graves, Timothy P Lillicrap

Abstract: Neural networks augmented with external memory have the ability to learn algorithmic solutions to complex tasks. These models appear promising for applications such as language modeling and machine translation. However, they scale poorly in both space and time as the amount of memory grows, limiting their applicability to real-world domains. Here, we present an end-to-end differentiable memory access scheme, which we call Sparse Access Memory (SAM), that retains the representational power of the original approaches whilst training efficiently with very large memories. We show that SAM achieves asymptotic lower bounds in space and time complexity, and find that an implementation runs 1,000× faster and with 3,000× less physical memory than non-sparse models. SAM learns with comparable data efficiency to existing models on a range of synthetic tasks and one-shot Omniglot character recognition, and can scale to tasks requiring 100,000s of time steps and memories. As well, we show how our approach can be adapted for models that maintain temporal associations between memories, as with the recently introduced Differentiable Neural Computer.

* in 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain
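The abstract does not spell out how sparse access works, so the following is only an illustrative sketch of the general idea of a sparse read: instead of averaging over every memory row, the read attends to the `k` rows that best match a query. The function name `sparse_read`, the dot-product scoring, and the softmax over the selected rows are all assumptions for illustration and should not be read as the SAM implementation.

```python
import math
from typing import List, Sequence, Tuple


def sparse_read(
    memory: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> Tuple[List[float], List[int]]:
    """Return a weighted average of the k best-matching rows and their indices."""
    if not 1 <= k <= len(memory):
        raise ValueError("k must be between 1 and the number of memory rows")

    # Assumption: similarity is a plain dot product between query and row.
    scores = [sum(m * q for m, q in zip(row, query)) for row in memory]

    # Keep only the k highest-scoring rows; every other row gets weight zero.
    top = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax restricted to the selected rows (shifted for numerical stability).
    max_score = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - max_score) for i in top}
    total = sum(exps.values())

    read = [0.0] * len(memory[0])
    for i in top:
        weight = exps[i] / total
        for d, value in enumerate(memory[i]):
            read[d] += weight * value
    return read, top
```

Note that this sketch still scores every row with an exhaustive scan, so it illustrates only the sparsity of the read weights, not the space and time complexity results claimed in the abstract.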

WaveNet: A Generative Model for Raw Audio

Sep 19, 2016

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu

Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
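The phrase "autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones" corresponds to the standard chain-rule factorisation of a joint distribution. Written out for a waveform of T samples (the symbols x_t and T are notation introduced here for illustration, not taken from the abstract):

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

Each factor is the predictive distribution for sample x_t given every earlier sample, so generating audio means sampling one value at a time and feeding it back as context for the next.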

Latent Predictor Networks for Code Generation

Jun 08, 2016

Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom

Abstract: Many language generation tasks require the production of text conditioned on both structured and unstructured inputs. We present a novel neural network architecture which generates an output sequence conditioned on an arbitrary number of input functions. Crucially, our approach allows both the choice of conditioning context and the granularity of generation, for example characters or tokens, to be marginalised, thus permitting scalable and effective training. Using this framework, we address the problem of generating programming code from a mixed natural language and structured specification. We create two new data sets for this paradigm derived from the collectible trading card games Magic the Gathering and Hearthstone. On these, and a third preexisting corpus, we demonstrate that marginalising multiple predictors allows our model to outperform strong benchmarks.

Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition

Jul 24, 2015

Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays

Abstract: We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.

* To be published in the INTERSPEECH 2015 proceedings
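"Frame stacking and reduced frame rate" can be pictured as concatenating several consecutive feature frames into one wider input vector and then advancing by more than one frame between inputs, so the network processes fewer, richer time steps. The sketch below illustrates that idea only; the function name `stack_and_subsample` and its `stack` and `stride` parameters are hypothetical, and the abstract does not state the values the authors used.

```python
from typing import List, Sequence


def stack_and_subsample(
    frames: Sequence[Sequence[float]], stack: int, stride: int
) -> List[List[float]]:
    """Concatenate `stack` consecutive frames, advancing `stride` frames each step."""
    if stack < 1 or stride < 1:
        raise ValueError("stack and stride must be positive")

    stacked_frames: List[List[float]] = []
    # Only full windows are emitted; trailing frames that cannot fill a
    # window are dropped (padding would be an alternative design choice).
    for start in range(0, len(frames) - stack + 1, stride):
        stacked: List[float] = []
        for frame in frames[start:start + stack]:
            stacked.extend(frame)
        stacked_frames.append(stacked)
    return stacked_frames
```

With `stride` greater than 1 the output sequence is shorter than the input by roughly that factor, which is where the faster decoding mentioned in the abstract would come from.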

Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition

Feb 05, 2014

Haşim Sak, Andrew Senior, Françoise Beaufays

Abstract: Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.
