Abstract:Selective attention is an essential mechanism to filter sensory input and to select only its most important components, allowing the capacity-limited cognitive structures of the brain to process them in detail. The saliency map model, originally developed to understand the process of selective attention in the primate visual system, has also been extensively used in computer vision. Due to the wide-spread use of frame-based video, this is how dynamic input from non-stationary scenes is commonly implemented in saliency maps. However, the temporal structure of this input modality is very different from that of the primate visual system. Retinal input to the brain is massively parallel, local rather than frame-based, asynchronous rather than synchronous, and transmitted in the form of discrete events, neuronal action potentials (spikes). These features are captured by event-based cameras. We show that a computational saliency model can be obtained organically from such vision sensors, at minimal computational cost. We assess the performance of the model by comparing its predictions with the distribution of overt attention (fixations) of human observers, and we make available an event-based dataset that can be used as ground truth for future studies.
Abstract:The ability to attend to salient regions of a visual scene is an innate and necessary preprocessing step for both biological and engineered systems performing high-level visual tasks (e.g. object detection, tracking, and classification). Computational efficiency, in regard to processing bandwidth and speed, is improved by only devoting computational resources to salient regions of the visual stimuli. In this paper, we first present a biologically-plausible, bottom-up, dynamic visual saliency model based on the notion of proto-objects. This is achieved by incorporating the temporal characteristics of the visual stimulus into the model, similarly to the manner in which early stages of the human visual system extracts temporal information. This model outperforms state-of-the-art dynamic visual saliency models in predicting human eye fixations on a commonly-used video dataset with associated eye tracking data. Secondly, for this model to have practical applications, it must be capable of performing its computations in real-time under lowpower, small-size, and lightweight constraints. To address this, we introduce a Field-Programmable Gate Array implementation of the model on an Opal Kelly 7350 Kintex-7 board. This novel hardware implementation allows for processing of up to 23.35 frames per second running on a 100 MHz clock -- better than 26x speedup from the software implementation.