Francois Bremond

INRIA Sophia Antipolis

Just Dance with $π$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection

May 19, 2025

Snehashis Majhi, Giacomo D'Amicantonio, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarev, Francois Bremond

Abstract: Weakly-supervised methods for video anomaly detection (VAD) conventionally rely only on RGB spatio-temporal features, which continues to limit their reliability in real-world scenarios. RGB features are not sufficiently distinctive to set apart categories such as shoplifting from visually similar events. Therefore, for robust VAD in complex real-world settings, it is essential to augment RGB spatio-temporal features with additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD, "PI-VAD", a novel approach that augments RGB representations with five additional modalities. Specifically, the modalities capture fine-grained motion (Pose), three-dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (optical flow), and language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. PI-VAD includes two plug-in modules, the Pseudo-modality Generation module and the Cross Modal Induction module, which generate modality-specific prototypical representations and thereby induce multi-modal information into RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and require the five modality backbones only during training. Notably, PI-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without requiring the computational overhead of five modality backbones at inference.

SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living

Feb 05, 2025

Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang, Srijan Das

Abstract:The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.

CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets

Jan 06, 2025

Tanay Agrawal, Mohammed Guermal, Michal Balazia, Francois Bremond

Abstract: Challenges in cross-learning include inhomogeneous or even inadequate amounts of training data and a lack of resources for retraining large pretrained models. Inspired by transfer learning techniques in NLP, namely adapters and prefix tuning, this paper presents a new model-agnostic plugin architecture for cross-learning, called CM3T, that adapts transformer-based models to new or missing information. We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially more efficient because the backbone and other plugins do not need to be finetuned along with these additions. Comparative and ablation studies on three datasets (Epic-Kitchens-100, MPIIGroupInteraction and UDIVA v0.5) show the efficacy of this framework across different recording settings and tasks. With only 12.8% trainable parameters compared to the backbone to process video input and only 22.3% trainable parameters for two additional modalities, we achieve results comparable to or better than the state-of-the-art. CM3T has no specific requirements for training or pretraining and is a step towards bridging the gap between a general model and specific practical applications of video classification.

* Preprint. Final paper accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, February, 2025. 10 pages

Identifying Surgical Instruments in Pedagogical Cataract Surgery Videos through an Optimized Aggregation Network

Jan 05, 2025

Sanya Sinha, Michal Balazia, Francois Bremond

Abstract: Instructional cataract surgery videos are crucial for ophthalmologists and trainees to observe surgical details repeatedly. This paper presents a deep learning model for real-time identification of surgical instruments in these videos, using a custom dataset scraped from open-access sources. Inspired by the architecture of YOLOV9, the model employs a Programmable Gradient Information (PGI) mechanism and a novel Generally-Optimized Efficient Layer Aggregation Network (Go-ELAN) to address the information bottleneck problem, enhancing mean Average Precision (mAP) at higher Non-Maximum Suppression Intersection over Union (NMS IoU) scores. The Go-ELAN YOLOV9 model, evaluated against YOLO v5, v7, v8, v9 vanilla, Laptool and DETR, achieves a superior mAP of 73.74 at IoU 0.5 on a dataset of 615 images with 10 instrument classes, demonstrating the effectiveness of the proposed model.

* Preprint. Full paper accepted at the IEEE International Conference on Image Processing Applications and Systems (IPAS), Lyon, France, Jan 2025. 6 pages
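
The abstract above refers to IoU thresholds and Non-Maximum Suppression. As a minimal sketch of those two standard operations (not the paper's code), the snippet below computes the IoU of axis-aligned boxes and applies greedy NMS. The box format `(x1, y1, x2, y2)`, the function names, and the example boxes and scores are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Illustrative (made-up) detections: the first two overlap heavily.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
print(kept)
```

Raising `iou_threshold` keeps more overlapping boxes, which is why detection quality is usually reported together with the threshold used.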

Anti-Forgetting Adaptation for Unsupervised Person Re-identification

Nov 22, 2024

Hao Chen, Francois Bremond, Nicu Sebe, Shiliang Zhang

Abstract: Regular unsupervised domain adaptive person re-identification (ReID) focuses on adapting a model from a source domain to a fixed target domain. However, an adapted ReID model can hardly retain previously acquired knowledge and generalize to unseen data. In this paper, we propose a Dual-level Joint Adaptation and Anti-forgetting (DJAA) framework, which incrementally adapts a model to new domains without forgetting the source domain or any previously adapted target domain. We explore the possibility of using prototype-level and instance-level consistency to mitigate forgetting during adaptation. Specifically, we store a small number of representative image samples and corresponding cluster prototypes in a memory buffer, which is updated at each adaptation step. With the buffered images and prototypes, we regularize the image-to-image similarity and image-to-prototype similarity to rehearse old knowledge. After the multi-step adaptation, the model is tested on all seen domains and several unseen domains to validate the generalization ability of our method. Extensive experiments demonstrate that our proposed method significantly improves the anti-forgetting ability, generalization and backward compatibility of an unsupervised person ReID model.

* Accepted to TPAMI
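
The image-to-prototype regularization described in the abstract above can be illustrated with a small sketch. This is not the DJAA implementation: the mean-feature prototype, the softmax temperature, the KL-divergence form of the consistency term, and all names and numbers below are assumptions chosen only to show the general idea of penalizing drift in how a buffered image relates to stored prototypes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def prototype(features):
    """Cluster prototype taken as the mean of its member features (assumption)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def softmax(xs, temperature=0.1):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(old_feat, new_feat, prototypes, temperature=0.1):
    """KL divergence between the old-model and new-model distributions of
    image-to-prototype similarity for one buffered image."""
    p = softmax([cosine(old_feat, c) for c in prototypes], temperature)
    q = softmax([cosine(new_feat, c) for c in prototypes], temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two hypothetical clusters kept in the memory buffer.
prototypes = [
    prototype([[1.0, 0.0], [0.9, 0.1]]),
    prototype([[0.0, 1.0], [0.1, 0.9]]),
]
old_feature = [1.0, 0.0]      # feature of a buffered image under the old model
drifted_feature = [0.6, 0.8]  # feature of the same image after adaptation

no_drift = consistency_loss(old_feature, old_feature, prototypes)
with_drift = consistency_loss(old_feature, drifted_feature, prototypes)
print(no_drift, with_drift)
```

The loss is zero when the adapted model places the image in the same relation to the prototypes as before, and grows as the feature drifts towards another cluster.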

AM Flow: Adapters for Temporal Processing in Action Recognition

Nov 04, 2024

Tanay Agrawal, Abid Ali, Antitza Dantcheva, Francois Bremond

Abstract: Deep learning models, in particular image models, have recently gained generalisability and robustness. In this work, we propose to exploit such advances in the realm of video classification. Video foundation models require extensive pretraining and long training times. To mitigate these limitations, we propose "Attention Map (AM) Flow" for image models, a method for identifying pixels relevant to motion in each input video frame. In this context, we propose two methods to compute AM flow, depending on camera motion. AM flow allows the separation of spatial and temporal processing, while providing improved results over combined spatio-temporal processing (as in video models). Adapters, one of the popular techniques in parameter-efficient transfer learning, facilitate the incorporation of AM flow into pretrained image models, mitigating the need for full finetuning. We extend adapters to "temporal processing adapters" by incorporating a temporal processing unit into the adapters. Our work achieves faster convergence, thereby reducing the number of epochs needed for training. Moreover, we endow an image model with the ability to achieve state-of-the-art results on popular action recognition datasets. This reduces training time and simplifies pretraining. We present experiments on Kinetics-400, Something-Something v2, and Toyota Smarthome datasets, showcasing state-of-the-art or comparable results.

Loose Social-Interaction Recognition in Real-world Therapy Scenarios

Sep 30, 2024

Abid Ali, Rui Dai, Ashish Marisetty, Guillaume Astruc, Monique Thonnat, Jean-Marc Odobez, Susanne Thümmler, Francois Bremond

Abstract: The computer vision community has explored dyadic interactions for atomic actions such as pushing and carrying an object. However, with the advancement of deep learning models, there is a need to explore more complex dyadic situations such as loose interactions. These are interactions where two people perform certain atomic activities to complete a global action irrespective of temporal synchronisation and physical engagement, for example cooking together. Analysing these types of dyadic interactions has several useful applications in the medical domain for social-skills development and mental health diagnosis. To achieve this, we propose a novel dual-path architecture to capture the loose interaction between two individuals. Our model learns global abstract features from each stream via a CNN backbone and fuses them using a new Global-Layer-Attention module based on a cross-attention strategy. We evaluate our model on real-world autism diagnosis data: our Loose-Interaction dataset and the publicly available Autism dataset for loose interactions. Our network achieves baseline results on the Loose-Interaction dataset and SOTA results on the Autism dataset. Moreover, we study different social interactions by experimenting on a publicly available dataset, namely NTU-RGB+D (interactive classes from both NTU-60 and NTU-120). We have found that different interactions require different network designs. We also evaluate a variant of our method that incorporates temporal information to address tight interactions, achieving SOTA results.

Masks and Boxes: Combining the Best of Both Worlds for Multi-Object Tracking

Sep 26, 2024

Tomasz Stanczyk, Francois Bremond

Abstract:Multi-object tracking (MOT) involves identifying and consistently tracking objects across video sequences. Traditional tracking-by-detection methods, while effective, often require extensive tuning and lack generalizability. On the other hand, segmentation mask-based methods are more generic but struggle with tracking management, making them unsuitable for MOT. We propose a novel approach, McByte, which incorporates a temporally propagated segmentation mask as a strong association cue within a tracking-by-detection framework. By combining bounding box and mask information, McByte enhances robustness and generalizability without per-sequence tuning. Evaluated on four benchmark datasets - DanceTrack, MOT17, SoccerNet-tracking 2022, and KITTI-tracking - McByte demonstrates performance gain in all cases examined. At the same time, it outperforms existing mask-based methods. Implementation code will be provided upon acceptance.
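
To make the idea of using a propagated mask as an extra association cue more concrete, here is a minimal sketch. It is not McByte's actual association rule, which the abstract does not specify. The blend weight `alpha`, the `mask_coverage` cue, the greedy matching, the `min_score` cutoff, and the dictionary layout of tracks and detections are all hypothetical choices for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def mask_coverage(mask, box):
    """Fraction of a track's propagated mask pixels, given as a set of
    (row, col) pairs, that fall inside a detection box (hypothetical cue)."""
    if not mask:
        return 0.0
    x1, y1, x2, y2 = box
    inside = sum(1 for (r, c) in mask if x1 <= c < x2 and y1 <= r < y2)
    return inside / len(mask)


def associate(tracks, detections, alpha=0.5, min_score=0.3):
    """Greedy track-to-detection matching on a blend of box IoU and mask coverage."""
    candidates = []
    for ti, track in enumerate(tracks):
        for di, det in enumerate(detections):
            score = (alpha * box_iou(track["box"], det["box"])
                     + (1 - alpha) * mask_coverage(track["mask"], det["box"]))
            if score >= min_score:
                candidates.append((score, ti, di))
    candidates.sort(reverse=True)  # best-scoring pairs first
    matches, used_tracks, used_dets = [], set(), set()
    for score, ti, di in candidates:
        if ti not in used_tracks and di not in used_dets:
            matches.append((ti, di))
            used_tracks.add(ti)
            used_dets.add(di)
    return matches


# Made-up example: two tracks, two detections listed in swapped order.
tracks = [
    {"box": (0, 0, 4, 4), "mask": {(r, c) for r in range(1, 3) for c in range(1, 3)}},
    {"box": (10, 10, 14, 14), "mask": {(11, 11), (12, 12)}},
]
detections = [{"box": (10, 10, 14, 14)}, {"box": (0, 0, 4, 4)}]
matches = associate(tracks, detections)
print(sorted(matches))
```

In a sketch like this, the mask term can still favor the right pairing when two boxes overlap heavily and box IoU alone is ambiguous.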

Introducing Gating and Context into Temporal Action Detection

Sep 06, 2024

Aglind Reka, Diana Laura Borza, Dominick Reilly, Michal Balazia, Francois Bremond

Abstract: Temporal Action Detection (TAD), the task of localizing and classifying actions in untrimmed video, remains challenging due to action overlaps and variable action durations. Recent findings suggest that TAD performance depends on the structural design of transformers rather than on the self-attention mechanism. Building on this insight, we propose a refined feature extraction process through lightweight yet effective operations. First, we use a local branch that applies parallel convolutions with varying window sizes to capture both fine-grained and coarse-grained temporal features. This branch incorporates a gating mechanism to select the most relevant features. Second, we introduce a context branch that uses boundary frames as key-value pairs to analyze their relationship with the central frame through cross-attention. The proposed method captures temporal dependencies and improves contextual understanding. Evaluations of the gating mechanism and context branch on challenging datasets (THUMOS14 and EPIC-KITCHENS 100) show a consistent improvement over the baseline and existing methods.

* Accepted for publication at the ECCV 2024 ABAW Workshop
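
The local branch described in the abstract above (parallel temporal convolutions with different window sizes, followed by gating) can be sketched in a few lines. This is an illustrative toy, not the paper's architecture: features are reduced to one scalar per frame, the kernels are fixed averaging filters rather than learned weights, and the sigmoid self-gate with `gate_weight` and `gate_bias` is an assumed form of gating.

```python
import math


def conv1d_same(x, kernel):
    """1D convolution over time with zero padding, so output length equals input length."""
    half = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_local_branch(x, window_sizes=(1, 3, 5), gate_weight=1.0, gate_bias=0.0):
    """Run parallel convolutions with several window sizes, then let a sigmoid
    gate scale each branch's response before summing the branches."""
    # Fixed averaging kernels stand in for learned convolution weights.
    branches = [conv1d_same(x, [1.0 / w] * w) for w in window_sizes]
    out = []
    for t in range(len(x)):
        gated = [sigmoid(gate_weight * b[t] + gate_bias) * b[t] for b in branches]
        out.append(sum(gated))
    return out


# A single spike of activity in the middle of a five-frame clip.
frame_features = [0.0, 0.0, 1.0, 0.0, 0.0]
output = gated_local_branch(frame_features)
print(output)
```

Small windows respond sharply at the spike while larger windows spread the response to neighboring frames, and the gate weights the stronger responses more heavily.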

Weakly-supervised Autism Severity Assessment in Long Videos

Jul 12, 2024

Abid Ali, Mahmoud Ali, Jean-Marc Odobez, Camilla Barbini, Séverine Dubuisson, Francois Bremond, Susanne Thümmler

Abstract:Autism Spectrum Disorder (ASD) is a diverse collection of neurobiological conditions marked by challenges in social communication and reciprocal interactions, as well as repetitive and stereotypical behaviors. Atypical behavior patterns in a long, untrimmed video can serve as biomarkers for children with ASD. In this paper, we propose a video-based weakly-supervised method that takes spatio-temporal features of long videos to learn typical and atypical behaviors for autism detection. On top of that, we propose a shallow TCN-MLP network, which is designed to further categorize the severity score. We evaluate our method on actual evaluation videos of children with autism collected and annotated (for severity score) by clinical professionals. Experimental results demonstrate the effectiveness of behavioral biomarkers that could help clinicians in autism spectrum analysis.

* https://cbmi2024.org/
