Title: XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages

Jun 25, 2021

Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Samin, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, Rifat Shahriyar

Abstract: Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, with XL-Sum and experiment on multilingual and low-resource summarization tasks. XL-Sum induces competitive results compared to the ones obtained using similar monolingual datasets: we show higher than 11 ROUGE-2 scores on 10 languages we benchmark on, with some of them exceeding 15, as obtained by multilingual training. Additionally, training on low-resource languages individually also provides competitive performance. To the best of our knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered. We are releasing our dataset and models to encourage future research on multilingual abstractive summarization. The resources can be found at https://github.com/csebuetnlp/xl-sum.

* Findings of the Association for Computational Linguistics, ACL 2021 (camera-ready)
