Alina Karakanta

Direct Speech Translation for Automatic Subtitling

Sep 27, 2022

Sara Papi, Marco Gaido, Alina Karakanta, Mauro Cettolo, Matteo Negri, Marco Turchi

Abstract: Automatic subtitling is the task of automatically translating the speech of an audiovisual product into short pieces of timed text, in other words, subtitles and their corresponding timestamps. The generated subtitles need to conform to multiple space and time requirements (length, reading speed) while being synchronised with the speech and segmented in a way that facilitates comprehension. Given its considerable complexity, automatic subtitling has so far been addressed through a pipeline of components that deal separately with transcribing, translating, segmenting into subtitles and predicting timestamps. In this paper, we propose the first direct automatic subtitling model that generates target language subtitles and their timestamps from the source speech in a single solution. Comparisons with state-of-the-art cascaded models trained with both in-domain and out-of-domain data show that our system provides high-quality subtitles while also being competitive in terms of conformity, with all the advantages of maintaining a single model.
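
The space and time requirements mentioned in the abstract (length, reading speed) can be illustrated with a small conformity check. This is only a sketch, not the paper's evaluation code: the `Subtitle` class, the function names, and the thresholds (42 characters per line, 21 characters per second, two lines per subtitle) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

# Assumed thresholds for illustration; real guidelines vary by provider.
MAX_CPL = 42      # maximum characters per line
MAX_CPS = 21.0    # maximum characters per second (reading speed)
MAX_LINES = 2     # maximum lines per subtitle block


@dataclass
class Subtitle:
    start: float       # display start time in seconds
    end: float         # display end time in seconds
    lines: List[str]   # text lines shown on screen


def chars_per_second(sub: Subtitle) -> float:
    """Reading speed: total characters divided by display duration."""
    duration = sub.end - sub.start
    if duration <= 0:
        return float("inf")
    return sum(len(line) for line in sub.lines) / duration


def is_conformant(sub: Subtitle) -> bool:
    """True if the subtitle respects the line count, length and speed limits."""
    return (
        len(sub.lines) <= MAX_LINES
        and all(len(line) <= MAX_CPL for line in sub.lines)
        and chars_per_second(sub) <= MAX_CPS
    )
```

Conformity over a whole file could then be reported as the share of subtitles for which `is_conformant` returns true.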

Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora

Sep 21, 2022

Sara Papi, Alina Karakanta, Matteo Negri, Marco Turchi

Abstract: Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks compliant with specific displaying guidelines. Similar to speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text must also be annotated with subtitle breaks. So far, this requirement has represented a bottleneck for system development, as confirmed by the dearth of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments with SubST systems trained on manual and automatic segmentations, respectively, result in similar performance, showing the effectiveness of our approach.

* Accepted to AACL 2022

Evaluating Subtitle Segmentation for End-to-end Generation Systems

May 19, 2022

Alina Karakanta, François Buet, Mauro Cettolo, François Yvon

Abstract: Subtitles appear on screen as short pieces of text, segmented based on formal constraints (length) and syntactic/semantic criteria. Subtitle segmentation can be evaluated with sequence segmentation metrics against a human reference. However, standard segmentation metrics cannot be applied when systems generate outputs different from the reference, e.g. with end-to-end subtitling systems. In this paper, we study ways to conduct reference-based evaluations of segmentation accuracy irrespective of the textual content. We first conduct a systematic analysis of existing metrics for evaluating subtitle segmentation. We then introduce $Sigma$, a new Subtitle Segmentation Score derived from an approximate upper-bound of BLEU on segmentation boundaries, which allows us to disentangle the effect of good segmentation from text quality. To compare $Sigma$ with existing metrics, we further propose a boundary projection method from imperfect hypotheses to the true reference. Results show that all metrics are able to reward high-quality output, but for similar outputs, system ranking depends on each metric's sensitivity to error type. Our thorough analyses suggest $Sigma$ is a promising segmentation candidate, but its reliability over other segmentation metrics remains to be validated through correlations with human judgements.
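
To make the limitation of standard segmentation metrics concrete, here is a minimal sketch of a boundary F1 score. It is not $Sigma$ and not the paper's implementation: it is a plain precision/recall measure over break positions, and the break symbols `<eol>` and `<eob>` as well as all function names are assumptions for illustration. The score is only defined when hypothesis and reference contain the same words, which is exactly the condition that end-to-end systems violate.

```python
BREAKS = {"<eol>", "<eob>"}  # assumed break symbols, treated as one class


def words(text):
    """Return the tokens of a segmented text without break symbols."""
    return [tok for tok in text.split() if tok not in BREAKS]


def boundary_positions(text):
    """Return the set of word counts after which a break occurs."""
    positions = set()
    n_words = 0
    for tok in text.split():
        if tok in BREAKS:
            positions.add(n_words)
        else:
            n_words += 1
    return positions


def boundary_f1(hyp, ref):
    """F1 over break positions; requires identical wording in hyp and ref."""
    if words(hyp) != words(ref):
        raise ValueError("texts differ: boundary positions are not comparable")
    h, r = boundary_positions(hyp), boundary_positions(ref)
    if not h and not r:
        return 1.0
    true_pos = len(h & r)
    precision = true_pos / len(h) if h else 0.0
    recall = true_pos / len(r) if r else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Calling `boundary_f1` on a hypothesis whose wording differs from the reference raises an error here, which is the case the boundary projection method in the abstract is meant to handle.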

* Accepted at LREC 2022

Simultaneous Speech Translation for Live Subtitling: from Delay to Display

Jul 20, 2021

Alina Karakanta, Sara Papi, Matteo Negri, Marco Turchi

Abstract: With the increased audiovisualisation of communication, the need for live subtitles in multilingual events is more relevant than ever. In an attempt to automatise the process, we aim at exploring the feasibility of simultaneous speech translation (SimulST) for live subtitling. However, the word-for-word rate of generation of SimulST systems is not optimal for displaying the subtitles in a comprehensible and readable way. In this work, we adapt SimulST systems to predict subtitle breaks along with the translation. We then propose a display mode that exploits the predicted break structure by presenting the subtitles in scrolling lines. We compare our proposed mode with two alternatives, 1) word-for-word display and 2) display in blocks, in terms of reading speed and delay. Experiments on three language pairs (en$\rightarrow$it, de, fr) show that scrolling lines is the only mode achieving an acceptable reading speed while keeping delay close to a 4-second threshold. We argue that simultaneous translation for readable live subtitles still faces challenges, the main one being poor translation quality, and propose directions for steering future research.

* Proceedings of MT Summit 2021 at Automatic Spoken Language Translation in Real-World Settings

Between Flexibility and Consistency: Joint Generation of Captions and Subtitles

Jul 13, 2021

Alina Karakanta, Marco Gaido, Matteo Negri, Marco Turchi

Abstract: Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing (i.e. captions). However, the joint generation of source captions and target subtitles not only brings potential output quality advantages when the two decoding processes inform each other, but is also often required in multilingual scenarios. In this work, we focus on ST models which generate consistent captions-subtitles in terms of structure and lexical content. We further introduce new metrics for evaluating subtitling consistency. Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing for sufficient flexibility to produce subtitles conforming to language-specific needs and norms.

* Accepted at IWSLT 2021

Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference?

Jun 02, 2021

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, Marco Turchi

Abstract: Five years after the first published proofs of concept, direct approaches to speech translation (ST) are now competing with traditional cascade solutions. In light of this steady progress, can we claim that the performance gap between the two is closed? Starting from this question, we present a systematic comparison between state-of-the-art systems representative of the two paradigms. Focusing on three language directions (English-German/Italian/Spanish), we conduct automatic and manual evaluations, exploiting high-quality professional post-edits and annotations. Our multi-faceted analysis on one of the few publicly available ST benchmarks attests for the first time that: i) the gap between the two paradigms is now closed, and ii) the subtle differences observed in their behavior are not sufficient for humans either to distinguish them or to prefer one over the other.

* Accepted at ACL 2021

Is 42 the Answer to Everything in Subtitling-oriented Speech Translation?

Jun 01, 2020

Alina Karakanta, Matteo Negri, Marco Turchi

Abstract: Subtitling is becoming increasingly important for disseminating information, given the enormous amounts of audiovisual content becoming available daily. Although Neural Machine Translation (NMT) can speed up the process of translating audiovisual content, considerable manual effort is still required for transcribing the source language, and for spotting and segmenting the text into proper subtitles. Creating proper subtitles in terms of timing and segmentation depends heavily on information present in the audio (utterance duration, natural pauses). In this work, we explore two methods for applying Speech Translation (ST) to subtitling: a) a direct end-to-end and b) a classical cascade approach. We discuss the benefit of having access to the source language speech for improving the conformity of the generated subtitles to the spatial and temporal subtitling constraints and show that length is not the answer to everything in the case of subtitling-oriented ST.

* Accepted at IWSLT 2020

MuST-Cinema: a Speech-to-Subtitles corpus

Feb 25, 2020

Alina Karakanta, Matteo Negri, Marco Turchi

Abstract: The growing need to localise audiovisual content in multiple languages through subtitles calls for the development of automatic solutions for human subtitling. Neural Machine Translation (NMT) can contribute to the automatisation of subtitling, facilitating the work of human subtitlers and reducing turn-around times and related costs. NMT requires high-quality, large, task-specific training data. The existing subtitling corpora, however, are missing both alignments to the source language audio and important information about subtitle breaks. This poses a significant limitation for developing efficient automatic approaches for subtitling, since the length and form of a subtitle depend directly on the duration of the utterance. In this work, we present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles. The corpus comprises (audio, transcription, translation) triplets. Subtitle breaks are preserved by inserting special symbols. We show that the corpus can be used to build models that efficiently segment sentences into subtitles and propose a method for annotating existing subtitling corpora with subtitle breaks, conforming to the constraint of length.
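
As an illustration of how text annotated with break symbols can be turned back into subtitles, the sketch below splits a sentence into blocks and lines and flags lines that exceed a length limit. The symbol names `<eob>` (end of subtitle block) and `<eol>` (end of line), the 42-character limit, and the function names are assumptions made for this example; the abstract itself only states that special symbols are inserted.

```python
EOB = "<eob>"  # assumed symbol closing a subtitle block
EOL = "<eol>"  # assumed symbol closing a line inside a block


def split_subtitles(annotated):
    """Split annotated text into subtitle blocks, each a list of lines."""
    blocks = []
    for block in annotated.split(EOB):
        lines = [line.strip() for line in block.split(EOL)]
        lines = [line for line in lines if line]  # drop empty fragments
        if lines:
            blocks.append(lines)
    return blocks


def too_long_lines(blocks, max_len=42):
    """Return every line longer than max_len characters (assumed limit)."""
    return [line for block in blocks for line in block if len(line) > max_len]
```

For example, `split_subtitles("I wanted to show you <eol> how this works. <eob> It is simple. <eob>")` yields two blocks, the first with two lines and the second with one.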

* Accepted at LREC 2020

Adapting Multilingual Neural Machine Translation to Unseen Languages

Oct 30, 2019

Surafel M. Lakew, Alina Karakanta, Marcello Federico, Matteo Negri, Marco Turchi

Abstract: Multilingual Neural Machine Translation (MNMT) for low-resource languages (LRL) can be enhanced by the presence of related high-resource languages (HRL), but the relatedness of HRL usually relies on predefined linguistic assumptions about language similarity. Recently, adapting MNMT to an LRL has been shown to greatly improve performance. In this work, we explore the problem of adapting an MNMT model to an unseen LRL using data selection and model adaptation. In order to improve NMT for LRL, we employ perplexity to select HRL data that are most similar to the LRL on the basis of language distance. We extensively explore data selection in popular multilingual NMT settings, namely in (zero-shot) translation, and in adaptation from a multilingual pre-trained model, for both directions (LRL-en). We further show that dynamic adaptation of the model's vocabulary results in a more favourable segmentation for the LRL in comparison with direct adaptation. Experiments show reductions in training time and significant performance gains over LRL baselines, even with zero LRL data (+13.0 BLEU), up to +17.0 BLEU for pre-trained multilingual model dynamic adaptation with related data selection. Our method outperforms current approaches, such as massively multilingual models and data augmentation, on four LRL.
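
The idea of perplexity-based data selection can be sketched with a toy language model. The abstract does not specify the model used, so this example swaps in a character bigram model with add-one smoothing purely for illustration; `CharBigramLM` and `select_closest` are hypothetical names, not the authors' code.

```python
import math
from collections import Counter


class CharBigramLM:
    """Toy character bigram model with add-one smoothing (an assumption)."""

    def __init__(self, sentences):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"</s>"}
        for sentence in sentences:
            chars = ["<s>"] + list(sentence) + ["</s>"]
            self.vocab.update(chars[1:])
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def perplexity(self, sentence):
        """Per-character perplexity of a sentence under the model."""
        chars = ["<s>"] + list(sentence) + ["</s>"]
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen chars
        log_prob = 0.0
        for prev, cur in zip(chars, chars[1:]):
            count = self.bigrams[(prev, cur)] + 1
            log_prob += math.log(count / (self.contexts[prev] + vocab_size))
        return math.exp(-log_prob / (len(chars) - 1))


def select_closest(lm, candidates, k):
    """Keep the k candidate sentences with the lowest perplexity."""
    return sorted(candidates, key=lm.perplexity)[:k]
```

In this sketch the model would be trained on the available LRL text and `select_closest` applied to HRL sentences, so that the HRL data that looks most like the LRL is kept for training.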

* Accepted at the 16th International Workshop on Spoken Language Translation (IWSLT), November, 2019
