Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Walterio Mayol-Cuevas

X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding

Jan 12, 2025

Wenqi Zhou, Kai Cao, Hao Zheng, Xinyi Zheng, Miao Liu, Per Ola Kristensson, Walterio Mayol-Cuevas, Fan Zhang, Weizhe Lin, Junxiao Shen

Abstract:Long-form egocentric video understanding provides rich contextual information and unique insights into long-term human behaviors, holding significant potential for applications in embodied intelligence, long-term activity analysis, and personalized assistive technologies. However, existing benchmark datasets primarily focus on single, short-duration videos or moderately long videos up to dozens of minutes, leaving a substantial gap in evaluating extensive, ultra-long egocentric video recordings. To address this, we introduce X-LeBench, a novel benchmark dataset specifically crafted for evaluating tasks on extremely long egocentric video recordings. Leveraging the advanced text processing capabilities of large language models (LLMs), X-LeBench develops a life-logging simulation pipeline that produces realistic, coherent daily plans aligned with real-world video data. This approach enables the flexible integration of synthetic daily plans with real-world footage from Ego4D-a massive-scale egocentric video dataset covers a wide range of daily life scenarios-resulting in 432 simulated video life logs that mirror realistic daily activities in contextually rich scenarios. The video life-log durations span from 23 minutes to 16.4 hours. The evaluation of several baseline systems and multimodal large language models (MLLMs) reveals their poor performance across the board, highlighting the inherent challenges of long-form egocentric video understanding and underscoring the need for more advanced models.

Via

Access Paper or Ask Questions

Re-localization acceleration with Medoid Silhouette Clustering

Jul 30, 2024

Hongyi Zhang, Walterio Mayol-Cuevas

Abstract:Two crucial performance criteria for the deployment of visual localization are speed and accuracy. Current research on visual localization with neural networks is limited to examining methods for enhancing the accuracy of networks across various datasets. How to expedite the re-localization process within deep neural network architectures still needs further investigation. In this paper, we present a novel approach for accelerating visual re-localization in practice. A tree-like search strategy, built on the keyframes extracted by a visual clustering algorithm, is designed for matching acceleration. Our method has been validated on two tasks across three public datasets, allowing for 50 up to 90 percent time saving over the baseline while not reducing location accuracy.

* 11 pages, 6 figures

Via

Access Paper or Ask Questions

Are you Struggling? Dataset and Baselines for Struggle Determination in Assembly Videos

Feb 28, 2024

Shijia Feng, Michael Wray, Brian Sullivan, Youngkyoon Jang, Casimir Ludwig, Iain Gilchrist, Walterio Mayol-Cuevas

Abstract:Determining when people are struggling from video enables a finer-grained understanding of actions and opens opportunities for building intelligent support visual interfaces. In this paper, we present a new dataset with three assembly activities and corresponding performance baselines for the determination of struggle from video. Three real-world problem-solving activities including assembling plumbing pipes (Pipes-Struggle), pitching camping tents (Tent-Struggle) and solving the Tower of Hanoi puzzle (Tower-Struggle) are introduced. Video segments were scored w.r.t. the level of struggle as perceived by annotators using a forced choice 4-point scale. Each video segment was annotated by a single expert annotator in addition to crowd-sourced annotations. The dataset is the first struggle annotation dataset and contains 5.1 hours of video and 725,100 frames from 73 participants in total. We evaluate three decision-making tasks: struggle classification, struggle level regression, and struggle label distribution learning. We provide baseline results for each of the tasks utilising several mainstream deep neural networks, along with an ablation study and visualisation of results. Our work is motivated toward assistive systems that analyze struggle, support users during manual activities and encourage learning, as well as other video understanding competencies.

Via

Access Paper or Ask Questions

SuperTran: Reference Based Video Transformer for Enhancing Low Bitrate Streams in Real Time

Nov 22, 2022

Tejas Khot, Nataliya Shapovalova, Silviu Andrei, Walterio Mayol-Cuevas

Abstract:This work focuses on low bitrate video streaming scenarios (e.g. 50 - 200Kbps) where the video quality is severely compromised. We present a family of novel deep generative models for enhancing perceptual video quality of such streams by performing super-resolution while also removing compression artifacts. Our model, which we call SuperTran, consumes as input a single high-quality, high-resolution reference images in addition to the low-quality, low-resolution video stream. The model thus learns how to borrow or copy visual elements like textures from the reference image and fill in the remaining details from the low resolution stream in order to produce perceptually enhanced output video. The reference frame can be sent once at the start of the video session or be retrieved from a gallery. Importantly, the resulting output has substantially better detail than what has been otherwise possible with methods that only use a low resolution input such as the SuperVEGAN method. SuperTran works in real-time (up to 30 frames/sec) on the cloud alongside standard pipelines.

* 4 pages

Via

Access Paper or Ask Questions

On-Sensor Binarized Fully Convolutional Neural Network with A Pixel Processor Array

Feb 02, 2022

Yanan Liu, Laurie Bose, Yao Lu, Piotr Dudek, Walterio Mayol-Cuevas

Figure 1 for On-Sensor Binarized Fully Convolutional Neural Network with A Pixel Processor Array

Figure 2 for On-Sensor Binarized Fully Convolutional Neural Network with A Pixel Processor Array

Figure 3 for On-Sensor Binarized Fully Convolutional Neural Network with A Pixel Processor Array

Figure 4 for On-Sensor Binarized Fully Convolutional Neural Network with A Pixel Processor Array

Abstract:This work presents a method to implement fully convolutional neural networks (FCNs) on Pixel Processor Array (PPA) sensors, and demonstrates coarse segmentation and object localisation tasks. We design and train binarized FCN for both binary weights and activations using batchnorm, group convolution, and learnable threshold for binarization, producing networks small enough to be embedded on the focal plane of the PPA, with limited local memory resources, and using parallel elementary add/subtract, shifting, and bit operations only. We demonstrate the first implementation of an FCN on a PPA device, performing three convolution layers entirely in the pixel-level processors. We use this architecture to demonstrate inference generating heat maps for object segmentation and localisation at over 280 FPS using the SCAMP-5 PPA vision chip.

Via

Access Paper or Ask Questions

Bringing A Robot Simulator to the SCAMP Vision System

May 21, 2021

Yanan Liu, Jianing Chen, Laurie Bose, Piotr Dudek, Walterio Mayol-Cuevas

Figure 1 for Bringing A Robot Simulator to the SCAMP Vision System

Figure 2 for Bringing A Robot Simulator to the SCAMP Vision System

Figure 3 for Bringing A Robot Simulator to the SCAMP Vision System

Figure 4 for Bringing A Robot Simulator to the SCAMP Vision System

Abstract:This work develops and demonstrates the integration of the SCAMP-5d vision system into the CoppeliaSim robot simulator, creating a semi-simulated environment. By configuring a camera in the simulator and setting up communication with the SCAMP python host through remote API, sensor images from the simulator can be transferred to the SCAMP vision sensor, where on-sensor image processing such as CNN inference can be performed. SCAMP output is then fed back into CoppeliaSim. This proposed platform integration enables rapid prototyping validations of SCAMP algorithms for robotic systems. We demonstrate a car localisation and tracking task using this proposed semi-simulated platform, with a CNN inference on SCAMP to command the motion of a robot. We made this platform available online.

Via

Access Paper or Ask Questions

Filter Distribution Templates in Convolutional Networks for Image Classification Tasks

Apr 28, 2021

Ramon Izquierdo-Cordova, Walterio Mayol-Cuevas

Figure 1 for Filter Distribution Templates in Convolutional Networks for Image Classification Tasks

Figure 2 for Filter Distribution Templates in Convolutional Networks for Image Classification Tasks

Figure 3 for Filter Distribution Templates in Convolutional Networks for Image Classification Tasks

Figure 4 for Filter Distribution Templates in Convolutional Networks for Image Classification Tasks

Abstract:Neural network designers have reached progressive accuracy by increasing models depth, introducing new layer types and discovering new combinations of layers. A common element in many architectures is the distribution of the number of filters in each layer. Neural network models keep a pattern design of increasing filters in deeper layers such as those in LeNet, VGG, ResNet, MobileNet and even in automatic discovered architectures such as NASNet. It remains unknown if this pyramidal distribution of filters is the best for different tasks and constrains. In this work we present a series of modifications in the distribution of filters in four popular neural network models and their effects in accuracy and resource consumption. Results show that by applying this approach, some models improve up to 8.9% in accuracy showing reductions in parameters up to 54%.

* arXiv admin note: text overlap with arXiv:2104.08446

Via

Access Paper or Ask Questions

Towards Efficient Convolutional Network Models with Filter Distribution Templates

Apr 17, 2021

Ramon Izquierdo-Cordova, Walterio Mayol-Cuevas

Figure 1 for Towards Efficient Convolutional Network Models with Filter Distribution Templates

Figure 2 for Towards Efficient Convolutional Network Models with Filter Distribution Templates

Figure 3 for Towards Efficient Convolutional Network Models with Filter Distribution Templates

Figure 4 for Towards Efficient Convolutional Network Models with Filter Distribution Templates

Abstract:Increasing number of filters in deeper layers when feature maps are decreased is a widely adopted pattern in convolutional network design. It can be found in classical CNN architectures and in automatic discovered models. Even CNS methods commonly explore a selection of multipliers derived from this pyramidal pattern. We defy this practice by introducing a small set of templates consisting of easy to implement, intuitive and aggressive variations of the original pyramidal distribution of filters in VGG and ResNet architectures. Experiments on CIFAR, CINIC10 and TinyImagenet datasets show that models produced by our templates, are more efficient in terms of fewer parameters and memory needs.

Via

Access Paper or Ask Questions

Agile Reactive Navigation for A Non-Holonomic Mobile Robot Using A Pixel Processor Array

Sep 27, 2020

Yanan Liu, Laurie Bose, Colin Greatwood, Jianing Chen, Rui Fan, Thomas Richardson, Stephen J. Carey, Piotr Dudek, Walterio Mayol-Cuevas

Figure 1 for Agile Reactive Navigation for A Non-Holonomic Mobile Robot Using A Pixel Processor Array

Figure 2 for Agile Reactive Navigation for A Non-Holonomic Mobile Robot Using A Pixel Processor Array

Figure 3 for Agile Reactive Navigation for A Non-Holonomic Mobile Robot Using A Pixel Processor Array

Figure 4 for Agile Reactive Navigation for A Non-Holonomic Mobile Robot Using A Pixel Processor Array

Abstract:This paper presents an agile reactive navigation strategy for driving a non-holonomic ground vehicle around a preset course of gates in a cluttered environment using a low-cost processor array sensor. This enables machine vision tasks to be performed directly upon the sensor's image plane, rather than using a separate general-purpose computer. We demonstrate a small ground vehicle running through or avoiding multiple gates at high speed using minimal computational resources. To achieve this, target tracking algorithms are developed for the Pixel Processing Array and captured images are then processed directly on the vision sensor acquiring target information for controlling the ground vehicle. The algorithm can run at up to 2000 fps outdoors and 200fps at indoor illumination levels. Conducting image processing at the sensor level avoids the bottleneck of image transfer encountered in conventional sensors. The real-time performance of on-board image processing and robustness is validated through experiments. Experimental results demonstrate that the algorithm's ability to enable a ground vehicle to navigate at an average speed of 2.20 m/s for passing through multiple gates and 3.88 m/s for a 'slalom' task in an environment featuring significant visual clutter.

* 7 pages

Via

Access Paper or Ask Questions

Fully Embedding Fast Convolutional Networks on Pixel Processor Arrays

Apr 27, 2020

Laurie Bose, Jianing Chen, Stephen J. Carey, Piotr Dudek, Walterio Mayol-Cuevas

Figure 1 for Fully Embedding Fast Convolutional Networks on Pixel Processor Arrays

Figure 2 for Fully Embedding Fast Convolutional Networks on Pixel Processor Arrays

Figure 3 for Fully Embedding Fast Convolutional Networks on Pixel Processor Arrays

Figure 4 for Fully Embedding Fast Convolutional Networks on Pixel Processor Arrays

Abstract:We present a novel method of CNN inference for pixel processor array (PPA) vision sensors, designed to take advantage of their massive parallelism and analog compute capabilities. PPA sensors consist of an array of processing elements (PEs), with each PE capable of light capture, data storage and computation, allowing various computer vision processing to be executed directly upon the sensor device. The key idea behind our approach is storing network weights "in-pixel" within the PEs of the PPA sensor itself to allow various computations, such as multiple different image convolutions, to be carried out in parallel. Our approach can perform convolutional layers, max pooling, ReLu, and a final fully connected layer entirely upon the PPA sensor, while leaving no untapped computational resources. This is in contrast to previous works that only use a sensor-level processing to sequentially compute image convolutions, and must transfer data to an external digital processor to complete the computation. We demonstrate our approach on the SCAMP-5 vision system, performing inference of a MNIST digit classification network at over 3000 frames per second and over 93% classification accuracy. This is the first work demonstrating CNN inference conducted entirely upon the processor array of a PPA vision sensor device, requiring no external processing.

Via

Access Paper or Ask Questions