Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ajinkya Kulkarni

MULTISPEECH

Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing

Dec 29, 2025

Panagiotis Theocharopoulos, Ajinkya Kulkarni, Mathew Magimai. -Doss

Abstract:Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is injected with semantically equivalent instructions in four different languages and reviewed using an LLM. We find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight the susceptibility of LLM-based reviewing systems to document-level prompt injection and reveal notable differences in vulnerability across languages.

Via

Access Paper or Ask Questions

Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems

Mar 02, 2025

Ajinkya Kulkarni, Atharva Kulkarni, Miguel Couceiro, Isabel Trancoso

Abstract:In this paper, we present a bias and sustainability focused investigation of Automatic Speech Recognition (ASR) systems, namely Whisper and Massively Multilingual Speech (MMS), which have achieved state-of-the-art (SOTA) performances. Despite their improved performance in controlled settings, there remains a critical gap in understanding their efficacy and equity in real-world scenarios. We analyze ASR biases w.r.t. gender, accent, and age group, as well as their effect on downstream tasks. In addition, we examine the environmental impact of ASR systems, scrutinizing the use of large acoustic models on carbon emission and energy consumption. We also provide insights into our empirical analyses, offering a valuable contribution to the claims surrounding bias and sustainability in ASR systems.

* Interspeech 2024

Via

Access Paper or Ask Questions

Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR

Jan 17, 2025

Karl El Hajal, Enno Hermann, Ajinkya Kulkarni, Mathew Magimai. -Doss

Abstract:Automatic speech recognition (ASR) systems are well known to perform poorly on dysarthric speech. Previous works have addressed this by speaking rate modification to reduce the mismatch with typical speech. Unfortunately, these approaches rely on transcribed speech data to estimate speaking rates and phoneme durations, which might not be available for unseen speakers. Therefore, we combine unsupervised rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric to typical speech. We evaluate the outputs with a large ASR model pre-trained on healthy speech without further fine-tuning and find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe cases of dysarthria. Code and audio samples are available at https://idiap.github.io/RnV .

* Accepted at ICASSP 2025 Satellite Workshop: Workshop on Speech Pathology Analysis and DEtection (SPADE)

Via

Access Paper or Ask Questions

SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS

Aug 20, 2024

Karl El Hajal, Ajinkya Kulkarni, Enno Hermann, Mathew Magimai. -Doss

Abstract:While recent zero-shot multispeaker text-to-speech (TTS) models achieve impressive results, they typically rely on extensive transcribed speech datasets from numerous speakers and intricate training pipelines. Meanwhile, self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS. It was also observed that SSL features from different speakers that are linearly close share phonetic information while maintaining individual speaker identity, which enables straight-forward and robust voice cloning. In this study, we introduce SSL-TTS, a lightweight and efficient zero-shot TTS framework trained on transcribed speech from a single speaker. SSL-TTS leverages SSL features and retrieval methods for simple and robust zero-shot multi-speaker synthesis. Objective and subjective evaluations show that our approach achieves performance comparable to state-of-the-art models that require significantly larger training datasets. The low training data requirements mean that SSL-TTS is well suited for the development of multi-speaker TTS systems for low-resource domains and languages. We also introduce an interpolation parameter which enables fine control over the output speech by blending voices. Demo samples are available at https://idiap.github.io/ssl-tts

* Submitted to IEEE Signal Processing Letters

Via

Access Paper or Ask Questions

What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

Jun 14, 2024

Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed

Figure 1 for What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

Figure 2 for What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

Figure 3 for What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

Figure 4 for What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

Abstract:Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements in specific emotional datasets, there is still a research gap in SER's capability to generalize across real-world situations. In this paper, we investigate approaches to generalize the SER system across different emotion datasets. In particular, we incorporate 11 emotional speech datasets and illustrate a comprehensive benchmark on the SER task. We also address the challenge of imbalanced data distribution using over-sampling methods when combining SER datasets for training. Furthermore, we explore various evaluation protocols for adeptness in the generalization of SER. Building on this, we explore the potential of Whisper for SER, emphasizing the importance of thorough evaluation. Our approach is designed to advance SER technology by integrating speaker-independent methods.

* ACCEPTED AT INTERSPEECH 2024, GREECE

Via

Access Paper or Ask Questions

The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese

Feb 12, 2024

Ajinkya Kulkarni, Anna Tokareva, Rameez Qureshi, Miguel Couceiro

Abstract:In the field of spoken language understanding, systems like Whisper and Multilingual Massive Speech (MMS) have shown state-of-the-art performances. This study is dedicated to a comprehensive exploration of the Whisper and MMS systems, with a focus on assessing biases in automatic speech recognition (ASR) inherent to casual conversation speech specific to the Portuguese language. Our investigation encompasses various categories, including gender, age, skin tone color, and geo-location. Alongside traditional ASR evaluation metrics such as Word Error Rate (WER), we have incorporated p-value statistical significance for gender bias analysis. Furthermore, we extensively examine the impact of data distribution and empirically show that oversampling techniques alleviate such stereotypical biases. This research represents a pioneering effort in quantifying biases in the Portuguese language context through the application of MMS and Whisper, contributing to a better understanding of ASR systems' performance in multilingual settings.

* EACL-2024 LT-EDI Workshop

Via

Access Paper or Ask Questions

Training Convolutional Neural Networks with the Forward-Forward algorithm

Jan 07, 2024

Riccardo Scodellaro, Ajinkya Kulkarni, Frauke Alves, Matthias Schröter

Figure 1 for Training Convolutional Neural Networks with the Forward-Forward algorithm

Figure 2 for Training Convolutional Neural Networks with the Forward-Forward algorithm

Figure 3 for Training Convolutional Neural Networks with the Forward-Forward algorithm

Figure 4 for Training Convolutional Neural Networks with the Forward-Forward algorithm

Abstract:The recent successes in analyzing images with deep neural networks are almost exclusively achieved with Convolutional Neural Networks (CNNs). The training of these CNNs, and in fact of all deep neural network architectures, uses the backpropagation algorithm where the output of the network is compared with the desired result and the difference is then used to tune the weights of the network towards the desired outcome. In a 2022 preprint, Geoffrey Hinton suggested an alternative way of training which passes the desired results together with the images at the input of the network. This so called Forward Forward (FF) algorithm has up to now only been used in fully connected networks. In this paper, we show how the FF paradigm can be extended to CNNs. Our FF-trained CNN, featuring a novel spatially-extended labeling technique, achieves a classification accuracy of 99.16% on the MNIST hand-written digits dataset. We show how different hyperparameters affect the performance of the proposed algorithm and compare the results with CNN trained with the standard backpropagation approach. Furthermore, we use Class Activation Maps to investigate which type of features are learnt by the FF algorithm.

* 19 pages, 9 figures

Via

Access Paper or Ask Questions

ArTST: Arabic Text and Speech Transformer

Oct 25, 2023

Hawau Olamide Toyin, Amirbek Djanibekov, Ajinkya Kulkarni, Hanan Aldarmaki

Abstract:We present ArTST, a pre-trained Arabic text and speech transformer for supporting open-source speech technologies for the Arabic language. The model architecture follows the unified-modal framework, SpeechT5, that was recently released for English, and is focused on Modern Standard Arabic (MSA), with plans to extend the model for dialectal and code-switched Arabic in future editions. We pre-trained the model from scratch on MSA speech and text data, and fine-tuned it for the following tasks: Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS), and spoken dialect identification. In our experiments comparing ArTST with SpeechT5, as well as with previously reported results in these tasks, ArTST performs on a par with or exceeding the current state-of-the-art in all three tasks. Moreover, we find that our pre-training is conducive for generalization, which is particularly evident in the low-resource TTS task. The pre-trained model as well as the fine-tuned ASR and TTS models are released for research use.

* 11 pages, 1 figure, SIGARAB ArabicNLP 2023

Via

Access Paper or Ask Questions

Yet Another Model for Arabic Dialect Identification

Oct 20, 2023

Ajinkya Kulkarni, Hanan Aldarmaki

Figure 1 for Yet Another Model for Arabic Dialect Identification

Figure 2 for Yet Another Model for Arabic Dialect Identification

Abstract:In this paper, we describe a spoken Arabic dialect identification (ADI) model for Arabic that consistently outperforms previously published results on two benchmark datasets: ADI-5 and ADI-17. We explore two architectural variations: ResNet and ECAPA-TDNN, coupled with two types of acoustic features: MFCCs and features exratected from the pre-trained self-supervised model UniSpeech-SAT Large, as well as a fusion of all four variants. We find that individually, ECAPA-TDNN network outperforms ResNet, and models with UniSpeech-SAT features outperform models with MFCCs by a large margin. Furthermore, a fusion of all four variants consistently outperforms individual models. Our best models outperform previously reported results on both datasets, with accuracies of 84.7% and 96.9% on ADI-5 and ADI-17, respectively.

* ACCEPTED AT ArabicNLP 2023

Via

Access Paper or Ask Questions

Adapting the adapters for code-switching in multilingual ASR

Oct 11, 2023

Atharva Kulkarni, Ajinkya Kulkarni, Miguel Couceiro, Hanan Aldarmaki

Figure 1 for Adapting the adapters for code-switching in multilingual ASR

Figure 2 for Adapting the adapters for code-switching in multilingual ASR

Abstract:Recently, large pre-trained multilingual speech models have shown potential in scaling Automatic Speech Recognition (ASR) to many low-resource languages. Some of these models employ language adapters in their formulation, which helps to improve monolingual performance and avoids some of the drawbacks of multi-lingual modeling on resource-rich languages. However, this formulation restricts the usability of these models on code-switched speech, where two languages are mixed together in the same utterance. In this work, we propose ways to effectively fine-tune such models on code-switched speech, by assimilating information from both language adapters at each language adaptation point in the network. We also model code-switching as a sequence of latent binary sequences that can be used to guide the flow of information from each language adapter at the frame level. The proposed approaches are evaluated on three code-switched datasets encompassing Arabic, Mandarin, and Hindi languages paired with English, showing consistent improvements in code-switching performance with at least 10\% absolute reduction in CER across all test sets.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions