Liubov Kovriguina

RoMe: A Robust Metric for Evaluating Natural Language Generation

Mar 17, 2022

Md Rashad Al Hasan Rony, Liubov Kovriguina, Debanjan Chaudhuri, Ricardo Usbeck, Jens Lehmann

Abstract: Evaluating Natural Language Generation (NLG) systems is a challenging task. Firstly, the metric should ensure that the generated hypothesis reflects the reference's semantics. Secondly, it should consider the grammatical quality of the generated sentence. Thirdly, it should be robust enough to handle various surface forms of the generated sentence. Thus, an effective evaluation metric has to be multifaceted. In this paper, we propose an automatic evaluation metric incorporating several core aspects of natural language understanding (language competence, syntactic and semantic variation). Our proposed metric, RoMe, is trained on language features such as semantic similarity combined with tree edit distance and grammatical acceptability, using a self-supervised neural network to assess the overall quality of the generated sentence. Moreover, we perform an extensive robustness analysis of the state-of-the-art methods and RoMe. Empirical results suggest that RoMe correlates more strongly with human judgment than state-of-the-art metrics in evaluating system-generated sentences across several NLG tasks.

* Accepted by the Association for Computational Linguistics (ACL) 2022

Knowledge Graph Question Answering Leaderboard: A Community Resource to Prevent a Replication Crisis

Jan 20, 2022

Aleksandr Perevalov, Xi Yan, Liubov Kovriguina, Longquan Jiang, Andreas Both, Ricardo Usbeck

Abstract: Data-driven systems need to be evaluated to establish trust in the scientific approach and its applicability. In particular, this is true for Knowledge Graph (KG) Question Answering (QA), where complex data structures are made accessible via natural-language interfaces. Evaluating the capabilities of these systems has driven the community for more than ten years and has led to the creation of several KGQA benchmark datasets. However, comparing different approaches is cumbersome. The lack of curated leaderboards means there is no global view of the research field, which could undermine trust in the results. In particular, the latest and most-used datasets in the KGQA community, LC-QuAD and QALD, lack central and up-to-date points of trust. In this paper, we survey and analyze a wide range of evaluation results with significant coverage of 100 publications and 98 systems from the last decade. We provide a new central and open leaderboard for any KGQA benchmark dataset as a focal point for the community: https://kgqa.github.io/leaderboard. Our analysis highlights existing problems in the evaluation of KGQA systems. Thus, we point to possible improvements for future evaluations.
