Abstract: Image registration (IR) is the process of deforming images so that they align with a reference space, making it easier for medical practitioners to examine different medical images in a standardized frame, e.g., with the same rotation and scale. This document introduces image registration with a simple numeric example and provides a definition along with a space-oriented symbolic representation. The review covers various types of image transformations, including affine, deformable, invertible, and bidirectional transformations, as well as medical image registration algorithms such as VoxelMorph, Demons, SyN, Iterative Closest Point, and SynthMorph. It also explores atlas-based registration and multistage registration techniques, including coarse-to-fine and pyramid approaches. Furthermore, the survey discusses medical image registration taxonomies, datasets, and evaluation measures such as correlation-based metrics, segmentation-based metrics, processing time, and model size, and it reviews applications in image-guided surgery, motion tracking, and tumor diagnosis. Finally, the document addresses future research directions, including the further development of transformers.
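A minimal sketch of intensity-based registration to make the idea concrete (this is an illustrative toy, not any of the algorithms named above): a moving image is rotated over a grid of candidate angles and the angle maximizing normalized cross-correlation with the fixed image is kept. The image sizes and angle range are arbitrary assumptions.

```python
# Toy rigid registration by exhaustive search over rotation angles.
import numpy as np
from scipy.ndimage import rotate

def ncc(a, b):
    """Normalized cross-correlation between two images."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def register_rotation(fixed, moving, angles=np.linspace(-30, 30, 61)):
    """Pick the rotation of `moving` that best matches `fixed` (a toy 'registration')."""
    scores = [ncc(fixed, rotate(moving, ang, reshape=False, order=1)) for ang in angles]
    best = int(np.argmax(scores))
    return angles[best], scores[best]

# Usage: recover a known 12-degree rotation of a synthetic image.
fixed = np.zeros((64, 64)); fixed[20:44, 28:36] = 1.0
moving = rotate(fixed, -12, reshape=False, order=1)
angle, score = register_rotation(fixed, moving)
print(f"estimated rotation: {angle:.1f} deg (NCC={score:.3f})")
```

Real methods such as VoxelMorph or SyN optimize far richer (deformable, invertible) transforms, but the fixed/moving/similarity structure is the same.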
Abstract: Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query. Researchers have already proposed several well-performing solutions for this task, but most focus on enhancing embeddings through approaches such as triplet loss, quadruplet loss, data augmentation, and edge extraction. In this work, we tackle the problem from several angles. We start by examining the quality of the training data and show some of its limitations. We then introduce the Relative Triplet Loss (RTL), an adapted triplet loss that overcomes these limitations by weighting the loss according to anchor similarity. Through a series of experiments, we demonstrate that replacing the triplet loss with RTL outperforms the previous state of the art without any data augmentation. In addition, we show why batch normalization is better suited than l2-normalization for SBIR embeddings and that it significantly improves the performance of our models. We further investigate the model capacity required for the photo and sketch domains and show that the photo encoder requires a higher capacity than the sketch encoder, which validates the hypothesis formulated in [34]. Finally, we propose a straightforward knowledge-distillation approach to train small models such as ShuffleNetv2 [22] efficiently with only a marginal loss of accuracy. The same approach applied to larger models enabled us to outperform previous state-of-the-art results and achieve a recall of 62.38% at k = 1 on The Sketchy Database [30].
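A hedged PyTorch sketch of a similarity-weighted triplet loss in the spirit of the RTL described above. The exact weighting used in the paper may differ; here the per-triplet weights (derived from anchor similarity) are simply assumed to be given as an input.

```python
import torch
import torch.nn.functional as F

def relative_triplet_loss(anchor, positive, negative, weights, margin=0.2):
    """anchor/positive/negative: (B, D) embeddings; weights: (B,) per-triplet weights."""
    d_ap = F.pairwise_distance(anchor, positive)   # matching sketch-photo distance
    d_an = F.pairwise_distance(anchor, negative)   # non-matching sketch-photo distance
    per_triplet = F.relu(d_ap - d_an + margin)     # standard triplet hinge
    return (weights * per_triplet).sum() / (weights.sum() + 1e-8)

# Usage with random embeddings; uniform weights reduce this to the plain triplet loss.
B, D = 8, 512
a, p, n = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
print(relative_triplet_loss(a, p, n, weights=torch.ones(B)))
```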
Abstract: Sketch-based image retrieval (SBIR) is the task of retrieving natural images (photos) that match the semantics and the spatial configuration of hand-drawn sketch queries. The universality of sketches extends the scope of possible applications and increases the demand for efficient SBIR solutions. In this paper, we study classic triplet-based SBIR solutions and show that a persistent invariance to horizontal flips (even after model finetuning) harms performance. To overcome this limitation, we propose several approaches and evaluate each of them in depth to assess its effectiveness. Our main contributions are twofold: we propose and evaluate several intuitive modifications that give SBIR solutions better flip equivariance, and we show that vision transformers are better suited for the SBIR task, outperforming CNNs by a large margin. Through numerous experiments, we introduce the first models to outperform human performance on a large-scale SBIR benchmark (Sketchy). Our best model achieves a recall of 62.25% at k = 1 on the Sketchy benchmark, compared to 46.2% for the previous state of the art.
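A hedged sketch (not the paper's protocol) of how the flip invariance mentioned above can be quantified: embed each photo and its horizontal flip and measure their cosine similarity. Values close to 1 indicate the encoder is nearly invariant to horizontal flips; the encoder used here is a random placeholder.

```python
import torch

def flip_invariance(encoder, images):
    """images: (B, C, H, W) tensor; encoder: any module returning (B, D) embeddings."""
    with torch.no_grad():
        z = encoder(images)
        z_flip = encoder(torch.flip(images, dims=[-1]))  # horizontal flip
    return torch.nn.functional.cosine_similarity(z, z_flip, dim=-1).mean().item()

# Usage with a random "encoder"; a real test would use a trained SBIR photo encoder.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 128))
print(flip_invariance(encoder, torch.randn(4, 3, 224, 224)))
```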
Abstract: The development of virtual agents has enabled human-avatar interactions to become increasingly rich and varied. Moreover, an expressive virtual agent, i.e., one that mimics the natural expression of emotions, enhances the social interaction between a user (human) and an agent (intelligent machine). The set of non-verbal behaviors of a virtual character is therefore an important component of human-machine interaction. Laughter is not just an audio signal but an intrinsically multimodal form of non-verbal communication: in addition to audio, it includes facial expressions and body movements. Motion analysis often relies on a relevant motion capture dataset, but acquiring such a dataset is expensive and time-consuming. This work studies the relationship between laughter and body movements in dyadic conversations. Body movements were extracted from videos using a deep-learning-based pose estimation model. We found that, in the explored NDC-ME dataset, a single statistical feature (i.e., the maximum value, or the maximum of the Fourier transform) of a joint movement correlates only weakly (about 30%) with laughter intensity. However, we did not find a direct correlation between audio features and body movements. We discuss the challenges of using such a dataset for the audio-driven co-laughter motion synthesis task.
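A minimal NumPy sketch of the kind of analysis described above: compute one statistical feature per joint trajectory (its maximum, or the peak magnitude of its Fourier transform) and correlate it with laughter intensity across segments. The feature extraction details of the NDC-ME pipeline are assumptions; the data below is synthetic.

```python
import numpy as np

def joint_features(trajectory):
    """trajectory: (T,) 1-D movement signal of one joint for one segment."""
    max_value = np.max(np.abs(trajectory))
    max_fft = np.max(np.abs(np.fft.rfft(trajectory - trajectory.mean())))
    return max_value, max_fft

def correlate_with_laughter(trajectories, laughter_intensity):
    """trajectories: list of (T,) arrays; laughter_intensity: (N,) array."""
    feats = np.array([joint_features(t) for t in trajectories])  # (N, 2)
    return [np.corrcoef(feats[:, k], laughter_intensity)[0, 1] for k in range(2)]

# Usage with synthetic segments.
rng = np.random.default_rng(0)
segments = [rng.standard_normal(100) * s for s in rng.uniform(0.5, 2.0, size=50)]
intensity = rng.uniform(0, 1, size=50)
print(correlate_with_laughter(segments, intensity))  # Pearson r for each feature
```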
Abstract: This work aims to generate captions for soccer videos using deep learning. In this context, the paper introduces a dataset, a model, and a three-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, inpainting) for ~500 hours of SoccerNet videos. The model is divided into three parts: a transformer learns language, ConvNets learn vision, and a fusion of linguistic and visual features generates captions. The paper suggests evaluating generated captions at three levels: syntax (commonly used evaluation metrics such as BLEU and CIDEr), meaning (the quality of descriptions as judged by a domain expert), and corpus (the diversity of generated captions). The paper shows that the diversity of generated captions improved (from 0.07 to 0.18) with semantics-related losses that prioritize selected words. Semantics-related losses and the use of additional visual features (optical flow, inpainting) improved the normalized captioning score by 28%. The web page of this work: https://sites.google.com/view/soccercaptioning
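A hedged sketch of one common corpus-level diversity measure for generated captions (a distinct-unigram ratio). This is an assumption for illustration and not necessarily the exact diversity score reported above.

```python
from collections import Counter

def distinct_1(captions):
    """Ratio of unique unigrams to total unigrams over all generated captions."""
    tokens = [tok for cap in captions for tok in cap.lower().split()]
    return len(Counter(tokens)) / max(len(tokens), 1)

# Usage: repetitive captions score low, varied captions score higher.
print(distinct_1(["goal by the striker", "goal by the striker", "goal by the striker"]))
print(distinct_1(["goal scored from the corner", "yellow card after a late tackle"]))
```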
Abstract: Event classification is inherently sequential and multimodal. Therefore, deep neural models need to dynamically focus on the most relevant time window and/or modality of a video. In this study, we propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition. Inspired by prior studies in neuroscience, we couple the two modalities at different levels of the visual and audio paths. Furthermore, the network dynamically highlights the modality that is relevant for classifying events in a given time window. Experimental results on the AVE (Audio-Visual Event), UCF51, and Kinetics-Sounds datasets show that the approach can effectively improve accuracy in audio-visual event classification. Code is available at: https://github.com/numediart/MAFnet
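A hedged PyTorch sketch of dynamic audio-visual weighting in the spirit of the idea above (not the released MAFnet implementation; see the GitHub link for the real code). A small gating layer scores each modality per time window, and the fused representation is the attention-weighted sum of the two streams.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)  # one score per modality

    def forward(self, video_feat, audio_feat):
        """video_feat, audio_feat: (B, T, D) features per time window."""
        scores = self.gate(torch.cat([video_feat, audio_feat], dim=-1))  # (B, T, 2)
        w = torch.softmax(scores, dim=-1)                                # modality weights
        fused = w[..., 0:1] * video_feat + w[..., 1:2] * audio_feat      # (B, T, D)
        return fused, w

# Usage with random features.
fusion = ModalityAttentionFusion(dim=128)
fused, weights = fusion(torch.randn(2, 10, 128), torch.randn(2, 10, 128))
print(fused.shape, weights.shape)
```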
Abstract: In this paper, we propose a study of multimodal (audio and video) action spotting and classification in soccer videos. Action spotting and classification are the tasks of finding the temporal anchors of events in a video and determining which events they are; this is an important application of general activity understanding. Here, we present an experimental study on combining audio and video information at different stages of deep neural network architectures. We used the SoccerNet benchmark dataset, which contains annotated events for 500 soccer game videos from the Big Five European leagues. Through this work, we evaluated several ways to integrate the audio stream into video-only architectures. We observed an average absolute improvement in the mean Average Precision (mAP) metric of 7.43% for the action classification task and 4.19% for the action spotting task.
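A minimal sketch of the mAP metric reported above (an illustrative assumption, not the SoccerNet evaluation code, which additionally uses temporal tolerance windows for spotting): average precision is computed per class from scores and binary labels, then averaged over classes.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true: (N, C) binary labels; y_score: (N, C) predicted scores."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1]) if y_true[:, c].any()]
    return float(np.mean(aps))

# Usage with 3 synthetic event classes.
rng = np.random.default_rng(0)
labels = (rng.uniform(size=(100, 3)) > 0.8).astype(int)
scores = rng.uniform(size=(100, 3))
print(mean_average_precision(labels, scores))
```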
Abstract: This paper presents a new lightweight yet powerful solution for the tasks of Emotion Recognition and Sentiment Analysis. Our motivation is to propose two architectures based on Transformers and modulation that combine linguistic and acoustic inputs from a wide range of datasets to challenge, and sometimes surpass, the state of the art in the field. To demonstrate the efficiency of our models, we carefully evaluate their performance on the IEMOCAP, MOSI, MOSEI, and MELD datasets. The experiments can be directly replicated, and the code is fully open for future research.
Abstract: Understanding expressed sentiment and emotions is crucial in human multimodal language. This paper describes a Transformer-based joint-encoding (TBJE) approach for the tasks of Emotion Recognition and Sentiment Analysis. In addition to using the Transformer architecture, our approach relies on a modular co-attention and a glimpse layer to jointly encode one or more modalities. The proposed solution was also submitted to the ACL20: Second Grand-Challenge on Multimodal Language to be evaluated on the CMU-MOSEI dataset. The code to replicate the presented experiments is open source: https://github.com/jbdel/MOSEI_UMONS.
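A hedged sketch of cross-modal joint encoding loosely following the idea above (the actual TBJE layers, including the glimpse layer, are in the linked repository): each modality attends to the other with standard multi-head attention before pooling for classification.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context):
        """x attends to context; both are (B, T, D) token sequences."""
        attended, _ = self.attn(query=x, key=context, value=context)
        return self.norm(x + attended)

# Usage: text tokens attend to acoustic frames and vice versa.
text, audio = torch.randn(2, 20, 256), torch.randn(2, 50, 256)
t2a, a2t = CrossModalBlock(256), CrossModalBlock(256)
text_joint, audio_joint = t2a(text, audio), a2t(audio, text)
print(text_joint.shape, audio_joint.shape)
```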
Abstract: As new datasets for real-world visual reasoning and compositional question answering emerge, it may become necessary to treat visual feature extraction as an end-to-end process during training. This small contribution aims to suggest new ideas for improving the visual processing of traditional convolutional networks for visual question answering (VQA). In this paper, we propose to modulate a CNN augmented with self-attention by a linguistic input. We show encouraging relative improvements to motivate future research in this direction.
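A hedged sketch of linguistic modulation of CNN features in the spirit of the idea above (a FiLM-style feature-wise affine modulation; the paper's exact architecture, including the self-attention component, may differ). The question embedding predicts per-channel scale and shift applied to the convolutional feature maps.

```python
import torch
import torch.nn as nn

class LinguisticModulation(nn.Module):
    def __init__(self, q_dim, channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(q_dim, 2 * channels)

    def forward(self, feat_map, question_emb):
        """feat_map: (B, C, H, W) CNN features; question_emb: (B, q_dim)."""
        gamma, beta = self.to_gamma_beta(question_emb).chunk(2, dim=-1)  # (B, C) each
        gamma = gamma[..., None, None]  # broadcast over spatial dimensions
        beta = beta[..., None, None]
        return (1 + gamma) * feat_map + beta

# Usage with random features and a random question embedding.
mod = LinguisticModulation(q_dim=300, channels=64)
out = mod(torch.randn(2, 64, 14, 14), torch.randn(2, 300))
print(out.shape)
```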