Université Grenoble Alpes, Laboratory: LIG - Getalp Group
Abstract: Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is expensive in terms of both computation and memory. This is in part due to the quadratic complexity of multi-head self-attention (MHSA). Alternatives to MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by measuring their speed, the amount of VRAM they consume, and their performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed by 7% to 65% for input sequences ranging from 20 to 80 seconds.
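To make the complexity contrast concrete, below is a minimal sketch of a SummaryMixing-style token mixer: each token is combined with a single mean-pooled summary of the whole sequence, so the cost grows linearly with the number of frames instead of quadratically as in MHSA. The module, dimensions, and frame rate are illustrative assumptions, not the implementation evaluated in the paper.

```python
# Minimal sketch of a SummaryMixing-style linear-complexity token mixer (illustrative only).
import torch
import torch.nn as nn

class SummaryMixingBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.local = nn.Linear(d_model, d_model)    # per-token ("local") transform
        self.summary = nn.Linear(d_model, d_model)  # transform fed into the sequence summary
        self.combine = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        s = self.summary(x).mean(dim=1, keepdim=True)  # (batch, 1, d_model), O(T) cost
        s = s.expand(-1, x.size(1), -1)                # broadcast the summary to every token
        return self.combine(torch.cat([self.local(x), s], dim=-1))

# Example: an 80-second utterance at roughly 50 frames/s gives T = 4000 frames.
x = torch.randn(2, 4000, 256)
print(SummaryMixingBlock(256)(x).shape)  # torch.Size([2, 4000, 256])
```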
Abstract: Self-Supervised Learning (SSL) has proven to be useful in various speech tasks. However, these methods are generally very demanding in terms of data, memory, and computational resources. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) is an SSL method that has shown great performance on Automatic Speech Recognition (ASR) while being simpler than other SSL methods such as wav2vec 2.0. Despite BEST-RQ's strong performance, details are lacking in the original paper, such as the number of GPU/TPU hours used for pre-training, and there is no official, easy-to-use open-source implementation. Furthermore, BEST-RQ has not been evaluated on downstream tasks other than ASR and speech translation. In this work, we describe a re-implementation of a random-projection quantizer and perform a preliminary study comparing it to wav2vec 2.0 on four downstream tasks. We discuss the details of and differences in our implementation. We show that a random-projection quantizer can achieve downstream performance similar to wav2vec 2.0 while reducing training time by more than a factor of two.
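A random-projection quantizer is simple enough to sketch in a few lines. The following is a hedged illustration of the general idea behind BEST-RQ targets: a frozen random projection maps speech features to a low-dimensional space, and the target is the index of the nearest entry in a frozen random codebook. The dimensions, seed, and normalization details are assumptions and do not reproduce the paper's exact re-implementation.

```python
# Sketch of a random-projection quantizer in the spirit of BEST-RQ (simplified, not exact).
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer:
    def __init__(self, feat_dim=320, code_dim=16, codebook_size=8192, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.proj = torch.randn(feat_dim, code_dim, generator=g)            # frozen projection
        codebook = torch.randn(codebook_size, code_dim, generator=g)        # frozen codebook
        self.codebook = F.normalize(codebook, dim=-1)

    def __call__(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) -> discrete target ids: (batch, time)
        z = F.normalize(feats @ self.proj, dim=-1)
        # nearest codebook entry under cosine distance = argmax of the dot product
        return (z @ self.codebook.T).argmax(dim=-1)

targets = RandomProjectionQuantizer()(torch.randn(2, 100, 320))
print(targets.shape)  # torch.Size([2, 100])
```

Because both the projection and the codebook stay frozen, the quantizer itself has no trainable parameters; it only provides discrete targets for a masked-prediction pre-training loss.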
Abstract: Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different domains, including computer vision and natural language processing. Speech processing has drastically benefited from SSL, as most current domain-related tasks are now being approached with pre-trained models. This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale, heterogeneous corpora with up to 14,000 hours of speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community, and an evaluation protocol made of six downstream tasks to complement existing benchmarks. LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for speech, with the investigation of frozen versus fine-tuned downstream models and task-agnostic versus task-specific pre-trained models, as well as a discussion of the carbon footprint of large-scale model training.
Abstract: Context-aware translation can be achieved by processing a concatenation of consecutive sentences with the standard translation approach. This paper investigates the intuitive idea of adopting segment embeddings for this task to help the Transformer discern the position of each sentence in the concatenation sequence. We compare various segment embeddings and propose novel methods to encode sentence position into token representations, showing that they do not benefit the vanilla concatenation approach except in a specific setting.
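As an illustration of what "segment embeddings" means here, the sketch below adds a learned embedding indexed by each token's sentence position within the concatenated window. The names and sizes are illustrative assumptions rather than the specific variants compared in the paper.

```python
# Sketch: segment embeddings added to token representations in a concatenation-based setup.
import torch
import torch.nn as nn

class SegmentEmbedding(nn.Module):
    def __init__(self, d_model=512, max_segments=4):
        super().__init__()
        self.seg = nn.Embedding(max_segments, d_model)

    def forward(self, tok_emb: torch.Tensor, seg_ids: torch.Tensor) -> torch.Tensor:
        # tok_emb: (batch, seq_len, d_model); seg_ids: (batch, seq_len), where 0 marks the
        # oldest context sentence and the largest id marks the current sentence.
        return tok_emb + self.seg(seg_ids)

# Toy example: three context sentences plus the current one, 10 tokens in total.
tok_emb = torch.randn(1, 10, 512)
seg_ids = torch.tensor([[0, 0, 1, 1, 1, 2, 2, 3, 3, 3]])
print(SegmentEmbedding()(tok_emb, seg_ids).shape)  # torch.Size([1, 10, 512])
```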
Abstract: A straightforward approach to context-aware neural machine translation consists in feeding the standard encoder-decoder architecture with a window of consecutive sentences, formed by the current sentence and a number of sentences from its context concatenated to it. In this work, we propose an improved concatenation approach that encourages the model to focus on the translation of the current sentence by discounting the loss generated by the target context. We also propose an additional improvement that strengthens the notion of sentence boundaries and of relative sentence distance, facilitating model compliance with the context-discounted objective. We evaluate our approach with both average translation-quality metrics and contrastive test sets for the translation of inter-sentential discourse phenomena, demonstrating its superiority over the vanilla concatenation approach and other more sophisticated context-aware systems.
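The context-discounting idea can be sketched as a weighted token-level cross-entropy in which target tokens belonging to the context sentences contribute with a weight below one, so the optimization pressure concentrates on the current sentence. The discount value and masking conventions below are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of a context-discounted loss for concatenation-based context-aware NMT.
import torch
import torch.nn.functional as F

def context_discounted_loss(logits, targets, is_context, discount=0.2, pad_id=0):
    # logits: (batch, tgt_len, vocab); targets, is_context: (batch, tgt_len)
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=pad_id, reduction="none"
    )
    # down-weight tokens that belong to the target context, keep full weight elsewhere
    weights = torch.where(is_context.bool(),
                          torch.full_like(per_token, discount),
                          torch.ones_like(per_token))
    mask = (targets != pad_id).float()
    return (per_token * weights * mask).sum() / mask.sum()

logits = torch.randn(2, 6, 100)
targets = torch.randint(1, 100, (2, 6))
is_context = torch.tensor([[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]])
print(context_discounted_loss(logits, targets, is_context))
```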
Abstract: In this paper we present the final result of a project on Tunisian Arabic encoded in Arabizi, the Latin-based writing system for digital conversations. The project led to the creation of two integrated yet independent resources: a corpus and an NLP tool created to annotate the former with various levels of linguistic information: word classification, transliteration, tokenization, POS-tagging, and lemmatization. We discuss our choices in terms of computational and linguistic methodology and the strategies adopted to improve our results. We report on the experiments performed in order to outline our research path. Finally, we explain why we believe in the potential of these resources for both computational and linguistic research. Keywords: Tunisian Arabizi, Annotated Corpus, Neural Network Architecture
Abstract: Recent advances in spoken language understanding have benefited from Self-Supervised models trained on large speech corpora. For French, the LeBenchmark project has made such models available and has led to impressive progress on several tasks, including spoken language understanding. These advances have a non-negligible cost in terms of computation time and energy consumption. In this paper, we compare several learning strategies that aim to reduce this cost while keeping competitive performance. At the same time, we propose an extensive analysis in which we measure the cost of our models in terms of training time and electrical energy consumption, hopefully promoting a comprehensive evaluation procedure. The experiments are performed on the FSC and MEDIA corpora, and show that it is possible to reduce the learning cost while maintaining state-of-the-art performance and still using SSL models.
Abstract: Recent advances in spoken language understanding have benefited from Self-Supervised models trained on large speech corpora. For French, the LeBenchmark project has made such models available and has led to impressive progress on several tasks, including spoken language understanding. These advances have a non-negligible cost in terms of computation time and energy consumption. In this paper, we compare several learning strategies aimed at reducing this cost while keeping competitive performance. The experiments are performed on the MEDIA corpus, and show that it is possible to reduce the learning cost while maintaining state-of-the-art performance.
Abstract: Self-Supervised Learning (SSL) using huge amounts of unlabeled data has been successfully explored for image and natural language processing. Recent works have also investigated SSL from speech. They were notably successful in improving performance on downstream tasks such as automatic speech recognition (ASR). While these works suggest it is possible to reduce dependence on labeled data for building efficient speech systems, their evaluation was mostly conducted on ASR and using multiple, heterogeneous experimental settings (most of them for English). This makes it difficult to objectively compare SSL approaches and to evaluate their impact on building speech systems. In this paper, we propose LeBenchmark: a reproducible framework for assessing SSL from speech. It includes not only ASR (high- and low-resource) tasks but also spoken language understanding, speech translation, and emotion recognition. We also target speech technologies in a language other than English: French. SSL models of different sizes are trained from carefully sourced and documented datasets. Experiments show that SSL is beneficial for most but not all tasks, which confirms the need for exhaustive and reliable benchmarks to evaluate its real impact. LeBenchmark is shared with the scientific community for reproducible research in SSL from speech.
Abstract: Multi-encoder models are a broad family of context-aware Neural Machine Translation (NMT) systems that aim to improve translation quality by encoding document-level contextual information alongside the current sentence. The context encoding is undertaken by contextual parameters, trained on document-level data. In this work, we show that training these parameters requires a large amount of data, since the contextual training signal is sparse. We propose an efficient alternative, based on splitting sentence pairs, that enriches the training signal of a set of parallel sentences by breaking intra-sentential syntactic links, thus frequently pushing the model to search the context for disambiguating clues. We evaluate our approach with BLEU and contrastive test sets, showing that it allows multi-encoder models to achieve performance comparable to a setting where they are trained with 10x the document-level data. We also show that our approach is a viable option for context-aware NMT for language pairs with zero document-level parallel data.
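The sentence-splitting idea can be illustrated as follows: a regular parallel sentence pair is cut into two consecutive fragments, the first of which is treated as context for the second, so that disambiguating clues (agreement, anaphora) frequently fall outside the "current" segment and the context encoder must be used. The split-point heuristic below is an assumption for illustration, not the paper's exact procedure.

```python
# Sketch: turning one (src, tgt) sentence pair into a (context, current) training example.
import random

def split_pair(src_tokens, tgt_tokens, rng=random):
    """Cut a parallel pair into a context fragment and a current fragment."""
    if len(src_tokens) < 2 or len(tgt_tokens) < 2:
        return None  # too short to split
    s_cut = rng.randrange(1, len(src_tokens))  # random interior split point on the source side
    # proportional split on the target side, clamped so both fragments stay non-empty
    t_cut = min(max(1, round(s_cut * len(tgt_tokens) / len(src_tokens))), len(tgt_tokens) - 1)
    return {
        "src_context": src_tokens[:s_cut], "src_current": src_tokens[s_cut:],
        "tgt_context": tgt_tokens[:t_cut], "tgt_current": tgt_tokens[t_cut:],
    }

example = split_pair("the cat sat on the mat".split(),
                     "le chat était assis sur le tapis".split())
print(example)
```

Because the split examples are produced from ordinary parallel sentences, the approach needs no document-level parallel data at all, which is what makes it applicable to language pairs without such resources.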