Wassim Bouachir

Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications

Aug 06, 2025

Noreen Anwar, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Abstract:Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is at: https://github.com/DET-LIP/DAMM

* 10 pages

InceptoFormer: A Multi-Signal Neural Framework for Parkinson's Disease Severity Evaluation from Gait

Aug 06, 2025

Safwen Naimi, Arij Said, Wassim Bouachir, Guillaume-Alexandre Bilodeau

Abstract:We present InceptoFormer, a multi-signal neural framework designed for Parkinson's Disease (PD) severity evaluation via gait dynamics analysis. Our architecture introduces a 1D adaptation of the Inception model, which we refer to as Inception1D, along with a Transformer-based framework to stage PD severity according to the Hoehn and Yahr (H&Y) scale. The Inception1D component captures multi-scale temporal features by employing parallel 1D convolutional filters with varying kernel sizes. The transformer component efficiently models long-range dependencies within gait sequences, providing a comprehensive understanding of both local and global patterns. To address the issue of class imbalance in PD severity staging, we propose a data structuring and preprocessing strategy based on oversampling to enhance the representation of underrepresented severity levels. The overall design captures fine-grained temporal variations and global dynamics in gait signals, significantly improving classification performance for PD severity evaluation. Through extensive experimentation, InceptoFormer achieves an accuracy of 96.6%, outperforming existing state-of-the-art methods in PD severity assessment. The source code for our implementation is publicly available at https://github.com/SafwenNaimi/InceptoFormer

* 11 pages; 5 figures. Published in the proceedings of the 2025 Canadian AI conference
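
The multi-scale idea in the abstract above (parallel 1D filters with different kernel sizes, stacked per time step) can be illustrated with a minimal sketch. This is not the authors' Inception1D: the function names, the kernel sizes (3, 5, 7), and the fixed averaging filters are assumptions chosen for illustration, whereas a trained model would learn its filter weights.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Cross-correlate signal with kernel, zero-padded so output length equals input length."""
    pad_left = (len(kernel) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - pad_left
            if 0 <= idx < len(signal):  # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_scale_features(signal: Sequence[float],
                         kernel_sizes: Sequence[int] = (3, 5, 7)) -> List[List[float]]:
    """Run one branch per kernel size and stack the branch outputs per time step."""
    branches = []
    for k in kernel_sizes:
        # Hypothetical fixed averaging filter; a real network learns these weights.
        kernel = [1.0 / k] * k
        branches.append(conv1d_same(signal, kernel))
    # Channel-wise concatenation: one feature vector (one value per branch) per time step.
    return [list(step) for step in zip(*branches)]


gait = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]  # made-up toy signal
features = multi_scale_features(gait)
print(len(features), len(features[0]))
```

Each time step ends up with one value per branch, so short kernels respond to local changes while longer kernels summarize a wider window of the same signal.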

ReL-SAR: Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL

Sep 09, 2024

Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau

Abstract:Extracting robust and generalizable features for skeleton action recognition typically requires large amounts of well-curated data, which are costly to annotate and process. Therefore, unsupervised representation learning is of prime importance to leverage unlabeled skeleton data. In this work, we investigate unsupervised representation learning for skeleton action recognition. For this purpose, we designed a lightweight convolutional transformer framework, named ReL-SAR, exploiting the complementarity of convolutional and attention layers for jointly modeling spatial and temporal cues in skeleton sequences. We also use a Selection-Permutation strategy for skeleton joints to ensure more informative descriptions from skeletal data. Finally, we capitalize on Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data. We achieved very competitive results on limited-size datasets: MCAD, IXMAS, JHMDB, and NW-UCLA, showing the effectiveness of our proposed method against state-of-the-art methods in terms of both performance and computational efficiency. To ensure reproducibility and reusability, the source code including all implementation parameters is provided at: https://github.com/SafwenNaimi/Representation-Learning-for-Skeleton-Action-Recognition-with-Convolutional-Transformers-and-BYOL

* 8 pages, 4 figures, 6 tables
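
For readers unfamiliar with BYOL, mentioned in the abstract above, its two core operations are a regression loss between unit-normalized vectors and an exponential-moving-average (EMA) update of a target network from an online network. The sketch below shows only those two pieces on plain lists; the function names and the value of `tau` are illustrative assumptions, and nothing here reflects the ReL-SAR encoder or its training settings.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def byol_loss(online_prediction: Sequence[float],
              target_projection: Sequence[float]) -> float:
    """Squared L2 distance between normalized vectors, i.e. 2 - 2 * cosine similarity."""
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def ema_update(target_params: Sequence[float],
               online_params: Sequence[float],
               tau: float = 0.99) -> List[float]:
    """Move target weights slowly toward online weights; tau=0.99 is an assumed value."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


# Two views whose embeddings point in the same direction give zero loss.
print(byol_loss([1.0, 0.0], [2.0, 0.0]))
# Orthogonal embeddings give a loss of 2.
print(byol_loss([1.0, 0.0], [0.0, 1.0]))
```

In BYOL the loss is computed in both directions (each augmented view passes through the online and the target branch) and only the online network receives gradients; the target network changes solely through the EMA step.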

Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition

Jun 01, 2024

Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara

Abstract:Speech emotion recognition (SER) has gained significant attention due to its many application fields, such as mental health, education, and human-computer interaction. However, the accuracy of SER systems is hindered by high-dimensional feature sets that may contain irrelevant and redundant information. To overcome this challenge, this study proposes an iterative feature boosting approach for SER that emphasizes feature relevance and explainability to enhance machine learning model performance. Our approach involves meticulous feature selection and analysis to build efficient SER systems. In addressing our main problem through model explainability, we employ a feature evaluation loop with Shapley values to iteratively refine feature sets. This process strikes a balance between model performance and transparency, which enables a comprehensive understanding of the model's predictions. The proposed approach offers several advantages, including the identification and removal of irrelevant and redundant features, leading to a more effective model. Additionally, it promotes explainability, facilitating comprehension of the model's predictions and the identification of crucial features for emotion determination. The effectiveness of the proposed method is validated on the SER benchmarks of the Toronto emotional speech set (TESS), Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion (SAVEE) datasets, outperforming state-of-the-art methods. These results highlight the potential of the proposed technique in developing accurate and explainable SER systems. To the best of our knowledge, this is the first work to incorporate model explainability into an SER framework.

* Applied Intelligence (2024)
* Published in: Springer Nature International Journal of Applied Intelligence (2024)
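
The feature evaluation loop described in the abstract above can be sketched as: compute a Shapley value per feature, drop the feature that contributes least, and repeat until every remaining feature contributes positively. The code below is an assumption-laden toy, not the paper's pipeline: `toy_score` is a made-up stand-in for training and scoring a model on a feature subset, the feature names are invented, and exact Shapley values are enumerated over all subsets, which is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List, Sequence


def shapley_values(features: Sequence[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley value of each feature for a subset-scoring function."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def refine_features(features: Sequence[str],
                    value: Callable[[FrozenSet[str]], float],
                    min_contribution: float = 1e-9) -> List[str]:
    """Repeatedly drop the lowest-contributing feature until all contribute positively."""
    selected = list(features)
    while len(selected) > 1:
        phi = shapley_values(selected, value)
        worst = min(selected, key=lambda f: phi[f])
        if phi[worst] > min_contribution:
            break
        selected.remove(worst)
    return selected


def toy_score(subset: FrozenSet[str]) -> float:
    """Hypothetical score in points; stands in for evaluating a trained model."""
    gains = {"pitch": 30, "energy": 20, "mfcc": 25}  # invented numbers
    score = sum(gains.get(f, 0) for f in subset)
    if "noise" in subset:
        score -= 5  # an irrelevant feature that hurts the score
    return score


kept = refine_features(["pitch", "energy", "mfcc", "constant", "noise"], toy_score)
print(kept)
```

In practice, Shapley values are usually approximated rather than enumerated, since the number of subsets doubles with each added feature.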

Iterative Feature Boosting for Explainable Speech Emotion Recognition

May 31, 2024

Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara

Abstract:In speech emotion recognition (SER), using predefined features without considering their practical importance may lead to high-dimensional datasets, including redundant and irrelevant information. Consequently, high-dimensional learning often reduces model accuracy while increasing computational complexity. Our work underlines the importance of carefully considering and analyzing features in order to build efficient SER systems. We present a new supervised SER method based on an efficient feature engineering approach. We pay particular attention to the explainability of results to evaluate feature relevance and refine feature sets. This is performed iteratively through a feature evaluation loop, using Shapley values to boost feature selection and improve overall framework performance. Our approach thus balances model performance and transparency. The proposed method outperforms human-level performance (HLP) and state-of-the-art machine learning methods in emotion recognition on the TESS dataset.

* 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 2023, pp. 543-549
* Published in: 2023 International Conference on Machine Learning and Applications (ICMLA)

Detection of Micromobility Vehicles in Urban Traffic Videos

Feb 28, 2024

Khalil Sabri, Célia Djilali, Guillaume-Alexandre Bilodeau, Nicolas Saunier, Wassim Bouachir

Abstract:Urban traffic environments present unique challenges for object detection, particularly with the increasing presence of micromobility vehicles like e-scooters and bikes. To address this object detection problem, this work introduces an adapted detection model that combines the accuracy and speed of single-frame object detection with the richer features offered by video object detection frameworks. This is done by applying aggregated feature maps from consecutive frames processed through motion flow to the YOLOX architecture. This fusion brings a temporal perspective to YOLOX detection abilities, allowing for a better understanding of urban mobility patterns and substantially improving detection reliability. Tested on a custom dataset curated for urban micromobility scenarios, our model showcases substantial improvement over existing state-of-the-art methods, demonstrating the need to consider spatio-temporal information for detecting such small and thin objects. Our approach enhances detection in challenging conditions, including occlusions, ensuring temporal consistency, and effectively mitigating motion blur.

STF: Spatio-Temporal Fusion Module for Improving Video Object Detection

Feb 16, 2024

Noreen Anwar, Guillaume-Alexandre Bilodeau, Wassim Bouachir

Abstract:Consecutive frames in a video contain redundancy, but they may also contain relevant complementary information for the detection task. The objective of our work is to leverage this complementary information to improve detection. Therefore, we propose a spatio-temporal fusion framework (STF). We first introduce multi-frame and single-frame attention modules that allow a neural network to share feature maps between nearby frames to obtain more robust object representations. Second, we introduce a dual-frame fusion module that merges feature maps in a learnable manner to improve them. Our evaluation is conducted on three different benchmarks including video sequences of moving road users. The performed experiments demonstrate that the proposed spatio-temporal fusion module leads to improved detection performance compared to baseline object detectors. Code is available at https://github.com/noreenanwar/STF-module

* 8 pages, 3 figures

1D-Convolutional transformer for Parkinson disease diagnosis from gait

Nov 06, 2023

Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau

Abstract:This paper presents an efficient deep neural network model for diagnosing Parkinson's disease from gait. More specifically, we introduce a hybrid ConvNet-Transformer architecture to accurately diagnose the disease by detecting the severity stage. The proposed architecture exploits the strengths of both Convolutional Neural Networks and Transformers in a single end-to-end model, where the former extracts relevant local features from the Vertical Ground Reaction Force (VGRF) signal, while the latter captures long-term spatio-temporal dependencies in data. In this manner, our hybrid architecture achieves an improved performance compared to using either model individually. Our experimental results show that our approach is effective for detecting the different stages of Parkinson's disease from gait data, with a final accuracy of 88%, outperforming other state-of-the-art AI methods on the Physionet gait dataset. Moreover, our method can be generalized and adapted for other classification problems to jointly address the feature relevance and spatio-temporal dependency problems in 1D signals. Our source code and pre-trained models are publicly available at https://github.com/SafwenNaimi/1D-Convolutional-transformer-for-Parkinson-disease-diagnosis-from-gait.

* 17 pages, 5 Figures, 6 Tables. Accepted for publication in Neural Computing and Applications (NCAA) 2023

Automatic counting of planting microsites via local visual detection and global count estimation

Nov 01, 2023

Ahmed Zgaren, Wassim Bouachir, Nizar Bouguila

Abstract:In the forest industry, mechanical site preparation by mounding is widely used prior to planting operations. One of the main problems when planning planting operations is the difficulty in estimating the number of mounds present on a planting block, as their number may vary greatly depending on site characteristics. This estimation is often carried out through field surveys by several forestry workers. However, this procedure is slow and error-prone. Motivated by recent advances in UAV imagery and artificial intelligence, we propose a fully automated framework to estimate the number of mounds on a planting block. Using computer vision and machine learning, we formulate the counting task as a supervised learning problem using two prediction models. A local detection model is first used to detect visible mounds based on deep features, while a global prediction function is subsequently applied to provide a final estimation based on block-level features. To evaluate the proposed method, we constructed a challenging UAV dataset representing several plantation blocks with different characteristics. The performed experiments demonstrated the robustness of the proposed method, which outperforms manual methods in precision, while significantly reducing time and cost.

* IEEE Transactions on Emerging Topics in Computational Intelligence, 2023, 1-15

HCT: Hybrid Convnet-Transformer for Parkinson's disease detection and severity prediction from gait

Oct 26, 2023

Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau

Abstract:In this paper, we propose a novel deep learning method based on a new Hybrid ConvNet-Transformer architecture to detect and stage Parkinson's disease (PD) from gait data. We adopt a two-step approach by dividing the problem into two sub-problems. Our Hybrid ConvNet-Transformer model first distinguishes healthy versus parkinsonian patients. If the patient is parkinsonian, a multi-class Hybrid ConvNet-Transformer model determines the Hoehn and Yahr (H&Y) score to assess the PD severity stage. Our hybrid architecture exploits the strengths of both Convolutional Neural Networks (ConvNets) and Transformers to accurately detect PD and determine the severity stage. In particular, we take advantage of ConvNets to capture local patterns and correlations in the data, while we exploit Transformers for handling long-term dependencies in the input signal. We show that our hybrid method achieves superior performance when compared to other state-of-the-art methods, with a PD detection accuracy of 97% and a severity staging accuracy of 87%. Our source code is available at: https://github.com/SafwenNaimi

* 6 pages, 6 figures, 3 tables, Accepted for publication in IEEE International Conference on Machine Learning and Applications (ICMLA), copyright IEEE
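
The two-step approach in the abstract above (a binary detector first, then a multi-class severity model only for positive cases) amounts to a simple cascade. The sketch below shows that control flow only; `toy_detector` and `toy_stager` are hypothetical placeholders with arbitrary thresholds and no clinical meaning, standing in for the two trained Hybrid ConvNet-Transformer models.

```python
from statistics import pvariance
from typing import Callable, Optional, Sequence

Signal = Sequence[float]


def stage_patient(signal: Signal,
                  detect_pd: Callable[[Signal], bool],
                  predict_stage: Callable[[Signal], float]) -> Optional[float]:
    """Return None when the detector says healthy, otherwise the predicted severity stage."""
    if not detect_pd(signal):
        return None  # step 1: healthy, so the severity model is never called
    return predict_stage(signal)  # step 2: stage only the detected cases


def toy_detector(signal: Signal) -> bool:
    """Placeholder rule (arbitrary threshold), NOT a real diagnostic criterion."""
    return pvariance(signal) > 0.5


def toy_stager(signal: Signal) -> float:
    """Placeholder rule returning a made-up H&Y-style stage."""
    return 2.0 if pvariance(signal) < 2.0 else 3.0


print(stage_patient([0.1, 0.2, 0.1, 0.2], toy_detector, toy_stager))
print(stage_patient([0.0, 2.0, 0.0, 2.0], toy_detector, toy_stager))
```

Splitting the task this way lets each model be trained and evaluated on its own sub-problem, which matches the separate detection and staging accuracies reported in the abstract.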
