Alon Lavie

School of Computer Science, Carnegie Mellon University

Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs

Aug 20, 2024

John Mendonça, Isabel Trancoso, Alon Lavie

Abstract: Although human evaluation remains the gold standard for open-domain dialogue evaluation, the growing popularity of automated evaluation using Large Language Models (LLMs) has also extended to dialogue. However, most frameworks leverage benchmarks that assess older chatbots on aspects such as fluency and relevance, which are not reflective of the challenges associated with contemporary models. In fact, a qualitative analysis of Soda, a GPT-3.5-generated dialogue dataset, suggests that current chatbots may exhibit several recurring issues related to coherence and commonsense knowledge, but generally produce highly fluent and relevant responses. Noting the aforementioned limitations, this paper introduces Soda-Eval, an annotated dataset based on Soda that covers over 120K turn-level assessments across 10K dialogues, where the annotations were generated by GPT-4. Using Soda-Eval as a benchmark, we then study the performance of several open-access instruction-tuned LLMs, finding that dialogue evaluation remains challenging. Fine-tuning these models improves performance over few-shot inference, both in terms of correlation and explanation.

* 22 pages, 10 figures

ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues

Jul 16, 2024

John Mendonça, Isabel Trancoso, Alon Lavie

Abstract: Despite being heralded as the new standard for dialogue evaluation, the closed-source nature of GPT-4 poses challenges for the community. Motivated by the need for lightweight, open-source, and multilingual dialogue evaluators, this paper introduces GenResCoh (Generated Responses targeting Coherence). GenResCoh is a novel LLM-generated dataset comprising over 130k negative and positive responses and accompanying explanations seeded from XDailyDialog and XPersona covering English, French, German, Italian, and Chinese. Leveraging GenResCoh, we propose ECoh (Evaluation of Coherence), a family of evaluators trained to assess response coherence across multiple languages. Experimental results demonstrate that ECoh achieves multilingual detection capabilities superior to the teacher model (GPT-3.5-Turbo) on GenResCoh, despite being based on a much smaller architecture. Furthermore, the explanations provided by ECoh closely align in terms of quality with those generated by the teacher model.

* Accepted to SIGDIAL 2024

On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation

Jul 04, 2024

John Mendonça, Alon Lavie, Isabel Trancoso

Abstract: Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been seamlessly integrated into evaluation frameworks and, together with human evaluation, compose the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and evaluate aspects like Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fails to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots.

* Accepted to the 6th NLP for Conversational AI workshop at ACL

Dialogue Quality and Emotion Annotations for Customer Support Conversations

Nov 23, 2023

John Mendonça, Patrícia Pereira, Miguel Menezes, Vera Cabarrão, Ana C. Farinha, Helena Moniz, João Paulo Carvalho, Alon Lavie, Isabel Trancoso

Abstract: Task-oriented conversational datasets often lack topic variability and linguistic diversity. However, with the advent of Large Language Models (LLMs) pretrained on extensive, multilingual and diverse text data, these limitations seem to have been overcome. Nevertheless, their generalisability to different languages and domains in dialogue applications remains uncertain without benchmarking datasets. This paper presents a holistic annotation approach for emotion and conversational quality in the context of bilingual customer support conversations. By performing annotations that take into consideration the complete instances that compose a conversation, one can form a broader perspective of the dialogue as a whole. Furthermore, it provides a unique and valuable resource for the development of text classification models. To this end, we present benchmarks for Emotion Recognition and Dialogue Quality Estimation and show that further research is needed to leverage these models in a production setting.

* Accepted at GEM (EMNLP Workshop)

Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation

Sep 08, 2023

John Mendonça, Patrícia Pereira, Helena Moniz, João Paulo Carvalho, Alon Lavie, Isabel Trancoso

Abstract: Despite significant research effort in the development of automatic dialogue evaluation metrics, little thought is given to evaluating dialogues other than in English. At the same time, ensuring metrics are invariant to semantically similar responses is also an overlooked topic. In order to achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that combines the strengths of current evaluation models with the newly established paradigm of prompting Large Language Models (LLMs). Empirical results show our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks and ranks first on both the Robust and Multilingual tasks of the DSTC11 Track 4 "Automatic Evaluation Metrics for Open-Domain Dialogue Systems", demonstrating the evaluation capabilities of prompted LLMs.

* DSTC11 best paper for Track 4
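
For readers unfamiliar with the headline metric, the following is a minimal sketch of a mean Spearman correlation computed across benchmarks. It is not the authors' evaluation code: the function names, the benchmark names, and all scores are hypothetical, and the sketch assumes each benchmark supplies paired metric scores and human ratings.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of equal values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(benchmarks):
    """Average the per-benchmark Spearman correlations with equal weight."""
    return mean(spearman(metric, human) for metric, human in benchmarks.values())


# Hypothetical data: (metric scores, human ratings) per benchmark.
benchmarks = {
    "bench_a": ([0.1, 0.4, 0.35, 0.8], [1, 3, 2, 5]),
    "bench_b": ([0.9, 0.2, 0.5], [1, 2, 3]),
}
print(mean_spearman(benchmarks))
```

Averaging per-benchmark correlations weights every benchmark equally regardless of its size, which is one common reading of "mean Spearman correlation"; the paper may aggregate differently.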

Towards Multilingual Automatic Dialogue Evaluation

Aug 31, 2023

John Mendonça, Alon Lavie, Isabel Trancoso

Abstract: The main limiting factor in the development of robust multilingual dialogue evaluation metrics is the lack of multilingual data and the limited availability of open-source multilingual dialogue systems. In this work, we propose a workaround for this lack of data by leveraging a strong multilingual pretrained LLM and augmenting existing English dialogue data using Machine Translation. We empirically show that the naive approach of finetuning a pretrained multilingual encoder model with translated data is insufficient to outperform the strong baseline of finetuning a multilingual model with only source data. Instead, the best approach consists in the careful curation of translated data using MT Quality Estimation metrics, excluding low-quality translations that hinder performance.

* SIGDIAL23
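
The curation step described in this abstract can be illustrated with a minimal sketch, assuming each translated example already carries a quality estimation (QE) score where higher means better. The class, field names, threshold value, and example data are all hypothetical; the abstract does not state how the cutoff was chosen, and the paper's actual selection procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class TranslatedDialogue:
    source: str        # original English text
    translation: str   # machine-translated text
    qe_score: float    # hypothetical reference-free QE score, higher is better


def curate(examples, min_qe_score):
    """Keep only the examples whose QE score meets the threshold."""
    return [ex for ex in examples if ex.qe_score >= min_qe_score]


# Hypothetical examples with made-up QE scores.
examples = [
    TranslatedDialogue("How are you?", "Comment ça va ?", 0.91),
    TranslatedDialogue("See you later.", "Voir toi plus tard.", 0.42),
    TranslatedDialogue("Thanks a lot!", "Merci beaucoup !", 0.88),
]

# The 0.7 cutoff is an assumption for illustration only.
kept = curate(examples, min_qe_score=0.7)
print([ex.translation for ex in kept])
```

A fixed threshold is only one possible design; keeping a top-ranked fraction of the translated data per language would be an equally plausible variant.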

The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics

May 19, 2023

Ricardo Rei, Nuno M. Guerreiro, Marcos Treviso, Luisa Coheur, Alon Lavie, André F. T. Martins

Abstract: Neural metrics for machine translation evaluation, such as COMET, exhibit significant improvements in their correlation with human judgments, as compared to traditional metrics based on lexical overlap, such as BLEU. Yet, neural metrics are, to a great extent, "black boxes" returning a single sentence-level score without transparency about the decision-making process. In this work, we develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics. Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors, as assessed through comparison of token-level neural saliency maps with Multidimensional Quality Metrics (MQM) annotations and with synthetically-generated critical translation errors. To ease future research, we release our code at: https://github.com/Unbabel/COMET/tree/explainable-metrics.

* Accepted at ACL 2023

Appropriateness is all you need!

Apr 27, 2023

Hendrik Kempt, Alon Lavie, Saskia K. Nagel

Abstract: The drive to make AI applications "safe" has led to the development of safety measures as the main or even sole normative requirement for their permissible use. The same can be said of the latest generation of chatbots, such as ChatGPT. In this view, if they are "safe", they are supposed to be permissible to deploy. This approach, which we call "safety-normativity", is rather limited in solving the emerging issues that ChatGPT and other chatbots have caused thus far. In response to this limitation, in this paper we argue for limiting chatbots in the range of topics they can chat about according to the normative concept of appropriateness. We argue that rather than looking for "safety" in a chatbot's utterances to determine what they may and may not say, we ought to assess those utterances according to three forms of appropriateness: technical-discursive, social, and moral. We then spell out what requirements for chatbots follow from these forms of appropriateness to avoid the limits of previous accounts: positionality, acceptability, and value alignment (PAVA). With these in mind, we may be able to determine what a chatbot may and may not say. Lastly, one initial suggestion is to use challenge sets, specifically designed for appropriateness, as a validation method.

CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task

Sep 13, 2022

Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C. Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte M. Alves, Alon Lavie (+2 more)

Abstract: We present the joint contribution of IST and Unbabel to the WMT 2022 Shared Task on Quality Estimation (QE). Our team participated in all three subtasks: (i) Sentence and Word-level Quality Prediction; (ii) Explainable QE; and (iii) Critical Error Detection. For all tasks we build on top of the COMET framework, connecting it with the predictor-estimator architecture of OpenKiwi, and equipping it with a word-level sequence tagger and an explanation extractor. Our results suggest that incorporating references during pretraining improves performance across several language pairs on downstream tasks, and that jointly training with sentence and word-level objectives yields a further boost. Furthermore, combining attention and gradient information proved to be the top strategy for extracting good explanations of sentence-level QE models. Overall, our submissions achieved the best results for all three tasks for almost all language pairs by a considerable margin.

* WMT 2022 Quality Estimation shared task

Unbabel's Participation in the WMT20 Metrics Shared Task

Oct 29, 2020

Ricardo Rei, Craig Stewart, Catarina Farinha, Alon Lavie

Abstract: We present the contribution of the Unbabel team to the WMT 2020 Shared Task on Metrics. We intend to participate in the segment-level, document-level and system-level tracks on all language pairs, as well as the 'QE as a Metric' track. Accordingly, we illustrate results of our models in these tracks with reference to test sets from the previous year. Our submissions build upon the recently proposed COMET framework: we train several estimator models to regress on different human-generated quality scores and a novel ranking model trained on relative ranks obtained from Direct Assessments. We also propose a simple technique for converting segment-level predictions into a document-level score. Overall, our systems achieve strong results for all language pairs on previous test sets and in many cases set a new state-of-the-art.

* WMT Metrics Shared Task 2020
