Javier Hernando

Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios

May 30, 2025

Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando

Abstract: We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly reducing high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.

* Accepted at Interspeech 2025

Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization

Mar 28, 2025

Iñigo Pikabea, Iñaki Lacunza, Oriol Pareras, Carlos Escolano, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas

Abstract: Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding, but these models often generate English responses regardless of the input language. This phenomenon has been termed Image-induced Fidelity Loss (IFL) and stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model's original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degradation in visual performance. We also explore model merging, which improves language fidelity but comes at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.

Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge

Feb 04, 2025

Daniel Tamayo, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas

Abstract: Recent research has explored methods for updating and modifying factual knowledge in large language models, often focusing on specific multi-layer perceptron blocks. This study expands on this work by examining the effectiveness of existing knowledge editing methods across languages and delving into the role of attention mechanisms in this process. Drawing from the insights gained, we propose Mass-Editing Memory with Attention in Transformers (MEMAT), a method that achieves significant improvements in all metrics while requiring minimal parameter modifications. MEMAT delivers a remarkable 10% increase in magnitude metrics, benefits languages not included in the training data and also demonstrates a high degree of portability. Our code and data are at https://github.com/dtamayo-nlp/MEMAT.

* Findings of the Association for Computational Linguistics: ACL 2024. Pages: 5831-5847

Language Modelling for Speaker Diarization in Telephonic Interviews

Jan 28, 2025

Miquel India, Javier Hernando, José A. R. Fonollosa

Abstract: The aim of this paper is to investigate the benefit of combining language and acoustic modelling for speaker diarization. Although conventional systems only use acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic information. In this study we analyze how an appropriate fusion of both kinds of features is able to obtain good results in these cases. The proposed system is based on an iterative algorithm where an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a Call-Center database, which is composed of telephone interview recordings. The combination of acoustic features and linguistic content shows an 84.29% improvement in word-level DER compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks.

On the Use of Audio to Improve Dialogue Policies

Oct 17, 2024

Daniel Roncel, Federico Costa, Javier Hernando

Abstract: With the significant progress of speech technologies, spoken goal-oriented dialogue systems are becoming increasingly popular. One of the main modules of a dialogue system is typically the dialogue policy, which is responsible for determining system actions. This component usually relies only on audio transcriptions, which makes it strongly dependent on their quality and causes it to ignore important extralinguistic information embedded in the user's speech. In this paper, we propose new architectures to add audio information by combining speech and text embeddings using a Double Multi-Head Attention component. Our experiments show that audio embedding-aware dialogue policies outperform text-based ones, particularly in noisy transcription scenarios, and that how text and audio embeddings are combined is crucial to improve performance. We obtained a 9.8% relative improvement in the User Request Score compared to a text-only dialogue system on the DSTC2 dataset.

* IberSpeech 2024

BSC-UPC at EmoSPeech-IberLEF2024: Attention Pooling for Emotion Recognition

Jul 17, 2024

Marc Casals-Salvador, Federico Costa, Miquel India, Javier Hernando

Abstract: Speech emotion recognition (SER) has long been a frontier of machine learning. It is an active field that has been revolutionized in the last few decades and whose implementations are remarkable in multiple applications that could affect daily life. Consequently, the Iberian Languages Evaluation Forum (IberLEF) of 2024 held a competitive challenge to advance SER results on a Spanish corpus. This paper presents the approach we followed in this competition. The main architecture uses different pre-trained speech and text models to extract features from both modalities, together with an attention pooling mechanism. The proposed system achieved first position in the challenge with a Macro F1-Score of 86.69%.
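
For readers unfamiliar with the metric reported above, the following minimal sketch shows how a macro-averaged F1 score is computed: F1 is calculated per class and then averaged with equal weight per class. The function name `macro_f1` and the emotion labels are illustrative assumptions, not the challenge's official scoring code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative sketch)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical labels, chosen only to exercise the function.
truth = ["joy", "joy", "anger", "anger", "sad"]
preds = ["joy", "anger", "anger", "anger", "sad"]
score = macro_f1(truth, preds)
```

Because every class contributes equally, a rare emotion that is classified poorly lowers the score as much as a frequent one would.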

Double Multi-Head Attention Multimodal System for Odyssey 2024 Speech Emotion Recognition Challenge

Jun 15, 2024

Federico Costa, Miquel India, Javier Hernando

Abstract: As computer-based applications are becoming more integrated into our daily lives, the importance of Speech Emotion Recognition (SER) has increased significantly. To promote research on innovative approaches in SER, the Odyssey 2024 Speech Emotion Recognition Challenge was organized as part of the Odyssey 2024 Speaker and Language Recognition Workshop. In this paper we describe the Double Multi-Head Attention Multimodal System developed for this challenge. Pre-trained self-supervised models were used to extract informative acoustic and text features. An early fusion strategy was adopted, where a Multi-Head Attention layer transforms these mixed features into complementary contextualized representations. A second attention mechanism is then applied to pool these representations into an utterance-level vector. Our proposed system achieved third position in the categorical task ranking, among 31 participating teams, with a 34.41% Macro-F1 score.

* Odyssey 2024: The Speaker and Language Recognition Workshop

Speaker Characterization by means of Attention Pooling

May 07, 2024

Federico Costa, Miquel India, Javier Hernando

Abstract: State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach to efficiently select the most relevant features captured by the front-end from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.

* Proc. IberSPEECH 2022, 166-170

Self-attention encoding and pooling for speaker recognition

Aug 03, 2020

Pooyan Safari, Miquel India, Javier Hernando

Abstract: The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory and energy consumption. These limitations motivate the design of more efficient deep models. On the other hand, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capabilities and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances. SAEP is a stack of identical blocks relying solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both VoxCeleb1 & 2 datasets. The proposed architecture is able to outperform the baseline x-vector, and shows competitive performance to some other benchmarks based on convolutions, with a significant reduction in model size. It employs 94%, 95%, and 73% fewer parameters compared to ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient in extracting time-invariant features from speaker utterances.
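
The pooling idea in the abstract above, mapping an utterance of any length to a fixed-length embedding, can be illustrated with a minimal sketch. This is not the SAEP architecture itself: it shows single-head attention pooling with a fixed query vector, whereas a trained system would learn the query and apply stacked self-attention encoder blocks first. The function name `attention_pool` and the toy inputs are assumptions made for illustration.

```python
import math


def attention_pool(frames, query):
    """Pool a variable-length list of frame vectors into one fixed-length vector.

    frames: list of equal-length feature vectors, one per time step.
    query:  vector of the same dimension (learned in a real model, fixed here).
    """
    if not frames:
        raise ValueError("need at least one frame")
    # Scaled dot-product score between each frame and the query.
    scale = math.sqrt(len(query))
    scores = [sum(f * q for f, q in zip(frame, query)) / scale for frame in frames]
    # Numerically stable softmax over time steps.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over time gives a vector whose size ignores utterance length.
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Two toy "utterances" of different lengths yield embeddings of the same size.
short_emb, _ = attention_pool([[1.0, 2.0]], [0.5, -0.5])
long_emb, long_w = attention_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.0, 0.0])
```

With an all-zero query every frame gets the same weight, so the pooled vector reduces to the plain temporal mean; a learned query lets the model emphasize the frames that are most informative about the speaker.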

End-to-end User Recognition using Touchscreen Biometrics

Jun 09, 2020

Michał Krzemiński, Javier Hernando

Abstract: We study touchscreen data as a behavioural biometric. The goal was to create an end-to-end system that can transparently identify users using raw data from mobile devices. Touchscreen biometrics has been studied only a few times, in works that differ in methodology and databases. In the proposed system, data from the touchscreen goes directly, without any processing, to the input of a deep neural network, which decides on the identity of the user. No hand-crafted features are used. The implemented classification algorithm tries to find patterns on its own from raw data. The achieved results show that the proposed deep model is sufficient for the given identification task. The performed tests indicate high accuracy of user identification and better EER results compared to state-of-the-art systems. The best result achieved by our system is 0.65% EER.
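
The equal error rate (EER) quoted above is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping a decision threshold over the observed scores; the function name `equal_error_rate` and the score lists are hypothetical, and the paper's own evaluation procedure may differ.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping a threshold over all observed scores.

    genuine:  scores for trials where the claimed identity is correct.
    impostor: scores for trials where it is not. Higher score = more likely genuine.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        # False acceptance rate: impostor trials scored at or above the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of the two rates where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical similarity scores, for illustration only.
eer_overlap = equal_error_rate([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.3, 0.7])
eer_clean = equal_error_rate([0.9, 0.8], [0.1, 0.2])
```

A lower EER means the genuine and impostor score distributions overlap less, which is why it is a common single-number summary for biometric systems.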
