Abstract:This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL{}, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests; assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.
Abstract:Neuromorphic or event cameras, inspired by biological vision systems, capture changes in illumination with high temporal resolution and efficiency, producing streams of events rather than traditional images. In this paper, we explore the use of neuromorphic cameras for passive optical wireless communication (OWC), leveraging their asynchronous detection of illumination changes to decode data transmitted through reflections of light from objects. We propose a novel system that utilizes neuromorphic cameras for passive visible light communication (VLC), extending the concept to Non Line-of-Sight (NLoS) scenarios through passive reflections from everyday objects. Our experiments demonstrate the feasibility and advantages of using neuromorphic cameras for VLC, characterizing the performance of various modulation schemes, including traditional On-Off Keying (OOK) and advanced N-pulse modulation. We introduce an adaptive N-pulse modulation scheme that dynamically adjusts encoding based on the packet's bit composition, achieving higher data rates and robustness in different scenarios. Our results show that lighter-colored, glossy objects are better for NLoS communication, while larger objects and those with matte finishes experience higher error rates due to multipath reflections.
Abstract:We present `Revelio', a real-world screen-camera communication system leveraging temporal flicker fusion in the OKLAB color space. Using spatially-adaptive flickering and encoding information in pixel region shapes, Revelio achieves visually imperceptible data embedding while remaining robust against noise, asynchronicity, and distortions in screen-camera channels, ensuring reliable decoding by standard smartphone cameras. The decoder, driven by a two-stage neural network, uses a weighted differential accumulator for precise frame detection and symbol recognition. Initial experiments demonstrate Revelio's effectiveness in interactive television, offering an unobtrusive method for meta-information transmission.