Abstract:In this work, we introduce sequence-to-sequence language resources for Catalan, a moderately under-resourced language, towards two tasks, namely: Summarization and Machine Translation (MT). We present two new abstractive summarization datasets in the domain of newswire. We also introduce a parallel Catalan-English corpus, paired with three different brand new test sets. Finally, we evaluate the data presented with competing state of the art models, and we develop baselines for these tasks using a newly created Catalan BART. We release the resulting resources of this work under open license to encourage the development of language technology in Catalan.
Abstract:Multilingual Neural Machine Translation architectures mainly differ in the amount of sharing modules and parameters among languages. In this paper, and from an algorithmic perspective, we explore if the chosen architecture, when trained with the same data, influences the gender bias accuracy. Experiments in four language pairs show that Language-Specific encoders-decoders exhibit less bias than the Shared encoder-decoder architecture. Further interpretability analysis of source embeddings and the attention shows that, in the Language-Specific case, the embeddings encode more gender information, and its attention is more diverted. Both behaviors help in mitigating gender bias.