Abstract: Cross-lingual summarization aims to bridge language barriers by summarizing documents in different languages. However, ensuring semantic coherence across languages is an overlooked challenge and can be critical in several contexts. To fill this gap, we introduce multi-target cross-lingual summarization as the task of summarizing a document into multiple target languages while ensuring that the produced summaries are semantically similar. We propose a principled re-ranking approach to this problem and a multi-criteria evaluation protocol to assess semantic coherence across target languages, marking a first step that will hopefully stimulate further research on this problem.
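A minimal sketch of the re-ranking idea, assuming candidate summaries per target language are already available and using a multilingual sentence encoder; the model name, the exhaustive search, and the mean-pairwise-similarity score are illustrative assumptions, not the paper's exact setup:

```python
# Illustrative sketch: pick one candidate summary per target language so that the
# selected summaries are maximally similar to each other in a shared multilingual
# embedding space. Model name and scoring function are assumptions.
from itertools import product

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def rerank(candidates_per_language: dict[str, list[str]]) -> dict[str, str]:
    """Maps each language code (with its candidate summaries) to the chosen summary."""
    languages = list(candidates_per_language)
    assert len(languages) >= 2, "multi-target setting needs at least two languages"
    # Embed every candidate once, with unit-normalized vectors.
    embeddings = {
        lang: encoder.encode(cands, normalize_embeddings=True)
        for lang, cands in candidates_per_language.items()
    }
    best_choice, best_score = None, -np.inf
    # Exhaustively score every combination (fine for a handful of candidates).
    for combo in product(*(range(len(candidates_per_language[l])) for l in languages)):
        vecs = np.stack([embeddings[lang][idx] for lang, idx in zip(languages, combo)])
        sims = vecs @ vecs.T                           # cosine similarities (normalized)
        n = len(languages)
        coherence = (sims.sum() - n) / (n * (n - 1))   # mean pairwise similarity
        if coherence > best_score:
            best_score, best_choice = coherence, combo
    return {lang: candidates_per_language[lang][idx]
            for lang, idx in zip(languages, best_choice)}
```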
Abstract: The centroid method is a simple approach to extractive multi-document summarization, and many improvements to its pipeline have been proposed. We further refine it by adding a beam search process to the sentence selection, as well as a centroid estimation attention model, which leads to improved results. We demonstrate these improvements on several multi-document summarization datasets, including in a multilingual scenario.
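A minimal sketch of centroid-based extractive selection with a beam search over sentence subsets; TF-IDF vectors stand in for the paper's learned centroid estimation, and the budget and beam width are arbitrary:

```python
# Illustrative sketch: greedy centroid-based selection generalised to a beam search.
# TF-IDF stands in for the paper's centroid-estimation attention model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences: list[str], budget: int = 3, beam_width: int = 4) -> list[str]:
    budget = min(budget, len(sentences))
    vec = TfidfVectorizer().fit(sentences)
    X = vec.transform(sentences).toarray()
    centroid = X.mean(axis=0, keepdims=True)           # "centroid" of the document cluster

    def score(idx_set):
        # Similarity between the centroid and the mean vector of the chosen sentences.
        summary_vec = X[list(idx_set)].mean(axis=0, keepdims=True)
        return cosine_similarity(summary_vec, centroid)[0, 0]

    beams = [(frozenset(), 0.0)]                        # (chosen sentence indices, score)
    for _ in range(budget):
        expanded = []
        for chosen, _ in beams:
            for i in range(len(sentences)):
                if i not in chosen:
                    new = chosen | {i}
                    expanded.append((new, score(new)))
        # Keep the best `beam_width` partial summaries, deduplicated by index set.
        unique = dict(expanded)
        beams = sorted(unique.items(), key=lambda t: t[1], reverse=True)[:beam_width]
    best = max(beams, key=lambda t: t[1])[0]
    return [sentences[i] for i in sorted(best)]
```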
Abstract: Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, automatic evaluation metrics such as CTC scores have been recently proposed that exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging the recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
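A minimal sketch of the re-ranking idea: a small energy model scores (source, candidate) pairs and is trained so that its ranking agrees with a reference quality metric; the frozen-encoder setup, the pairwise hinge loss, and the dimensions are stand-ins, not the paper's exact architecture:

```python
# Illustrative sketch: train an energy model so that candidates judged better by a
# quality metric receive lower energy, then pick the lowest-energy candidate.
# The embedding source, the metric, and the margin loss are assumptions.
import torch
import torch.nn as nn

class EnergyReranker(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, source_emb: torch.Tensor, cand_emb: torch.Tensor) -> torch.Tensor:
        # Lower energy = better candidate for this source.
        return self.scorer(torch.cat([source_emb, cand_emb], dim=-1)).squeeze(-1)

def margin_loss(energies: torch.Tensor, metric_scores: torch.Tensor, margin: float = 0.1):
    """Pairwise hinge loss: candidates with higher metric scores should get lower energy."""
    diff_metric = metric_scores[:, None] - metric_scores[None, :]   # > 0 where i beats j
    diff_energy = energies[:, None] - energies[None, :]             # want this < -margin there
    mask = (diff_metric > 0).float()
    return (mask * torch.clamp(margin + diff_energy, min=0)).sum() / mask.sum().clamp(min=1)

def rerank(model: EnergyReranker, source_emb: torch.Tensor, cand_embs: torch.Tensor) -> int:
    """Return the index of the lowest-energy candidate for one source document."""
    with torch.no_grad():
        return int(model(source_emb.expand_as(cand_embs), cand_embs).argmin())
```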
Abstract: The task of organizing and clustering multilingual news articles for media monitoring is essential for following news stories in real time. Most approaches to this task focus on high-resource languages (mostly English), while low-resource languages are disregarded. With that in mind, we present a much simpler online system that is able to cluster an incoming stream of documents without depending on language-specific features. We empirically demonstrate that using multilingual contextual embeddings as the document representation significantly improves clustering quality. We challenge previous cross-lingual approaches by removing the precondition of building monolingual clusters. We model the clustering process as a set of linear classifiers that aggregate similar documents, and correct closely related multilingual clusters through merging in an online fashion. Our system achieves state-of-the-art results on a multilingual news stream clustering dataset, and we introduce a new evaluation for zero-shot news clustering in multiple languages. We make our code available as open source.
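A minimal sketch of the online clustering loop: each incoming document embedding is either attached to its most similar existing cluster or starts a new one. The encoder name, the fixed similarity threshold, and the running-mean centroid update are simplifications of the paper's learned linear classifiers and cluster-merging step:

```python
# Illustrative sketch: threshold-based online clustering over multilingual document
# embeddings. Model name and threshold are assumptions; the paper uses learned
# classifiers and an online merging step for closely related clusters.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

class OnlineClusterer:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.centroids: list[np.ndarray] = []   # running mean embedding per cluster
        self.sizes: list[int] = []

    def add(self, document: str) -> int:
        """Assign the incoming document to a cluster and return its cluster id."""
        emb = encoder.encode(document, normalize_embeddings=True)
        if self.centroids:
            sims = np.array([c @ emb / np.linalg.norm(c) for c in self.centroids])
            best = int(sims.argmax())
            if sims[best] >= self.threshold:
                # Update the running mean of the matched cluster.
                n = self.sizes[best]
                self.centroids[best] = (self.centroids[best] * n + emb) / (n + 1)
                self.sizes[best] += 1
                return best
        # No sufficiently similar cluster: open a new one.
        self.centroids.append(emb)
        self.sizes.append(1)
        return len(self.centroids) - 1
```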
Abstract: The project aimed to define the rules and develop a technological solution to automatically identify a set of attributes within free-text clinical records written in Portuguese. The first application developed and implemented on this basis was a structured summary of a patient's clinical history, including previous diagnoses and procedures, usual medication, and characteristics or conditions relevant to clinical decisions, such as allergies or being under anticoagulant therapy. The project's goal was achieved by a multidisciplinary team that included clinicians, epidemiologists, computational linguists, machine learning researchers and software engineers, bringing together the expertise and perspectives of a public hospital, the university and the private sector. The main benefits to users and patients relate to easier access to the patient's history, which translates into a more complete grasp of the patient's clinical past and greater efficiency through time savings.
Abstract: Medical articles provide practitioners and professionals with the current state of the art in treatments and diagnostics. Existing public databases such as MEDLINE contain over 27 million articles, making it difficult to extract relevant content without efficient search engines. Information retrieval tools are therefore crucial to navigate these collections and provide meaningful recommendations of articles and treatments. Classifying articles into broader medical topics can improve the retrieval of related articles. The set of medical labels considered for the MESINESP task is on the order of several thousand (DeCS codes), which makes this an extreme multi-label classification problem. The heterogeneous and highly hierarchical structure of medical topics makes manually classifying articles extremely laborious and costly, so it is crucial to automate the process. Typical machine learning algorithms become computationally demanding with such a large number of labels, and achieving good recall on such datasets remains an open problem. This work presents Priberam's participation in the BioASQ MESINESP task. We address this large multi-label classification problem with four different models: a Support Vector Machine (SVM), a customised search engine (Priberam Search), a BERT-based classifier, and an SVM-rank ensemble of the previous models. Results show that all three individual models perform well and that the best performance is achieved by their ensemble, granting Priberam 6th place in the challenge and making it the 2nd best team.
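A minimal sketch of the SVM baseline mentioned above: a TF-IDF representation with a one-vs-rest linear SVM over the label set. The data loading, feature configuration, and hyperparameters are assumptions, and the search-engine, BERT, and SVM-rank components are not shown:

```python
# Illustrative sketch of a one-vs-rest linear SVM baseline for multi-label
# classification of article texts into DeCS codes. Feature settings are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

def train_svm_baseline(texts: list[str], labels: list[list[str]]):
    """texts: raw article abstracts; labels: the DeCS codes assigned to each article."""
    binarizer = MultiLabelBinarizer()
    Y = binarizer.fit_transform(labels)                  # (n_docs, n_labels) indicator matrix
    vectorizer = TfidfVectorizer(max_features=200_000, ngram_range=(1, 2))
    X = vectorizer.fit_transform(texts)
    classifier = OneVsRestClassifier(LinearSVC(C=1.0), n_jobs=-1)  # one binary SVM per label
    classifier.fit(X, Y)
    return vectorizer, binarizer, classifier

def predict_labels(vectorizer, binarizer, classifier, texts: list[str]):
    """Return the predicted sets of DeCS codes for new articles."""
    X = vectorizer.transform(texts)
    return binarizer.inverse_transform(classifier.predict(X))
```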
Abstract: Wikipedia is an online encyclopedia available in 285 languages. It constitutes an extremely relevant Knowledge Base (KB) that could be leveraged by automatic systems for several purposes. However, the structure and organisation of this information are not amenable to automatic parsing and understanding, and it is therefore necessary to structure this knowledge. The goal of the SHINRA2020-ML task is to leverage Wikipedia pages in order to categorise their corresponding entities across 268 hierarchical categories belonging to the Extended Named Entity (ENE) ontology. In this work, we propose three distinct models based on the contextualised embeddings produced by Multilingual BERT. We explore the performance of a linear layer with and without explicit use of the ontology's hierarchy, and of a Gated Recurrent Units (GRU) layer. We also test several pooling strategies to leverage BERT's embeddings, as well as label selection criteria based on the labels' scores. We achieve good performance across a large variety of languages, including languages not seen during fine-tuning (zero-shot languages).
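A minimal sketch of one of the model variants: Multilingual BERT embeddings, a pooling step, and a linear layer producing scores over the ENE categories. The pooling choice, the decision threshold, and the absence of the hierarchy-aware and GRU variants are simplifications:

```python
# Illustrative sketch: mBERT encoder + mean pooling + linear layer scoring the
# 268 ENE categories. Pooling strategy and threshold are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EntityCategorizer(nn.Module):
    def __init__(self, num_categories: int = 268):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_categories)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Mean pooling over non-padding tokens (one of several possible strategies).
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.classifier(pooled)                   # one logit per ENE category

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = EntityCategorizer()
batch = tokenizer(["Lisbon is the capital of Portugal."], return_tensors="pt",
                  truncation=True, padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
# Select all categories whose score clears a (hypothetical) threshold of 0.5.
predicted = (torch.sigmoid(logits) > 0.5).nonzero(as_tuple=True)[1]
```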
Abstract: We present a new neural model for text summarization that first extracts sentences from a document and then compresses them. The proposed model offers a balance that sidesteps the difficulties of abstractive methods while generating more concise summaries than extractive methods. In addition, our model dynamically determines the length of the output summary based on the gold summaries it observes during training, and does not require the length constraints typical of extractive summarization. The model achieves state-of-the-art results on the CNN/DailyMail and Newsroom datasets, improving over current extractive and abstractive methods. Human evaluations demonstrate that our model generates concise and informative summaries. We also make available a new dataset of oracle compressive summaries derived automatically from the CNN/DailyMail reference summaries.
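A minimal sketch of the extract-then-compress pipeline, with placeholder scoring and compression functions standing in for the paper's neural extractor and compressor; the stopping rule shown here only emulates the dynamic output length described in the abstract:

```python
# Illustrative sketch of the two-stage pipeline: score and extract sentences, then
# compress each extracted sentence. The components passed in are placeholders for
# the paper's neural extractor and compressor.
from typing import Callable

def summarize(
    sentences: list[str],
    extract_score: Callable[[str], float],   # stands in for the neural extractor
    compress: Callable[[str], str],          # stands in for the neural compressor
    stop_threshold: float = 0.5,
) -> str:
    summary = []
    # Extract sentences in order of score until the score falls below a threshold,
    # which mimics a dynamically determined summary length.
    for sentence in sorted(sentences, key=extract_score, reverse=True):
        if extract_score(sentence) < stop_threshold:
            break
        summary.append(compress(sentence))
    return " ".join(summary)

if __name__ == "__main__":
    doc = ["The rover landed safely on Mars yesterday afternoon.",
           "Engineers, who watched from mission control, celebrated.",
           "Weather at the landing site was calm."]
    score = lambda s: min(1.0, len(s) / 60)                  # toy relevance score
    shorten = lambda s: s.split(",")[0].rstrip(".") + "."    # toy compression
    print(summarize(doc, score, shorten))
```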
Abstract: Fact checking is an essential task in journalism, and its importance has been highlighted by recently increased concerns and efforts in combating misinformation. In this paper, we present an automated fact-checking platform which, given a claim, retrieves relevant textual evidence from a document collection, predicts whether each piece of evidence supports or refutes the claim, and returns a final verdict. We describe the architecture of the system and the user interface, focusing on the choices made to improve its user-friendliness and transparency. We conducted a user study of the fact-checking platform in a journalistic setting: we integrated it with a collection of news articles and evaluated the platform using feedback from journalists in their workflow. We found that the predictions of our platform were correct 58% of the time, and that 59% of the returned evidence was relevant.
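A minimal sketch of the claim-to-verdict pipeline: retrieve evidence for a claim, classify each evidence passage as supporting or refuting it, and aggregate a verdict. The TF-IDF retriever, the off-the-shelf NLI model, and the majority-vote aggregation are stand-ins for the platform's own components:

```python
# Illustrative sketch of the claim -> evidence -> verdict pipeline. Retriever,
# NLI model name, and the majority-vote verdict are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

nli = pipeline("text-classification", model="facebook/bart-large-mnli")

def check_claim(claim: str, documents: list[str], top_k: int = 5) -> dict:
    # 1) Retrieve the passages most relevant to the claim.
    vectorizer = TfidfVectorizer().fit(documents + [claim])
    sims = cosine_similarity(vectorizer.transform([claim]),
                             vectorizer.transform(documents))[0]
    evidence = [documents[i] for i in sims.argsort()[::-1][:top_k]]

    # 2) Classify each evidence passage as supporting or refuting the claim.
    labels = []
    for passage in evidence:
        label = nli([{"text": passage, "text_pair": claim}])[0]["label"].lower()
        labels.append("supports" if label == "entailment"
                      else "refutes" if label == "contradiction"
                      else "neutral")

    # 3) Aggregate a final verdict by simple majority over non-neutral evidence.
    votes = [l for l in labels if l != "neutral"]
    verdict = max(set(votes), key=votes.count) if votes else "not enough evidence"
    return {"verdict": verdict, "evidence": list(zip(evidence, labels))}
```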