Yunmeng Li

Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset

Mar 31, 2025

Diana Galvan-Sosa, Gabrielle Gaudeau, Pride Kavumba, Yunmeng Li, Hongyi Gu, Zheng Yuan, Keisuke Sakaguchi, Paula Buttery

Abstract: The performance and usability of Large Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than from cohesion and word choice. The full dataset, rubric, and code will be made available upon acceptance.

* 9 main pages (21 appendix pages), 7 figures, submitted to ACL 2025

MQM-Chat: Multidimensional Quality Metrics for Chat Translation

Aug 29, 2024

Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Kentaro Inui

Abstract: The complexities of chats pose significant challenges for machine translation models. Recognizing the need for a precise evaluation metric to address the issues of chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat). Through experiments with five models using MQM-Chat, we observed that all models generated certain fundamental errors, while each had different shortcomings, such as omission, overly correcting ambiguous source content, and buzzword issues, resulting in the loss of stylized information. Our findings underscore the effectiveness of MQM-Chat in evaluating chat translation, emphasizing the importance of stylized content and dialogue consistency for future studies.

An Investigation of Warning Erroneous Chat Translations in Cross-lingual Communication

Aug 28, 2024

Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Kentaro Inui

* IJCNLP-AACL 2023 Student Research Workshop

Chat Translation Error Detection for Assisting Cross-lingual Communications

Aug 02, 2023

Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Ryoko Tokuhisa, Ana Brassard, Kentaro Inui

Abstract: In this paper, we describe the development of a communication support system that detects erroneous translations to facilitate cross-lingual communication, given the limitations of current machine chat translation methods. We trained an error detector as the baseline of the system and constructed a new Japanese-English bilingual chat corpus, BPersona-chat, which comprises multi-turn colloquial chats augmented with crowdsourced quality ratings. The error detector can serve as an encouraging foundation for more advanced erroneous translation detection systems.

* Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems, pages 88-95, November 2022, Online. Association for Computational Linguistics
