Don Tuggener

Favi-Score: A Measure for Favoritism in Automated Preference Ratings for Generative AI Evaluation

Jun 03, 2024

Pius von Däniken, Jan Deriu, Don Tuggener, Mark Cieliebak

Abstract:Generative AI systems have become ubiquitous for all kinds of modalities, which makes the issue of the evaluation of such models more pressing. One popular approach is preference ratings, where the generated outputs of different systems are shown to evaluators who choose their preferences. In recent years the field shifted towards the development of automated (trained) metrics to assess generated outputs, which can be used to create preference ratings automatically. In this work, we investigate the evaluation of the metrics themselves, which currently rely on measuring the correlation to human judgments or computing sign accuracy scores. These measures only assess how well the metric agrees with the human ratings. However, our research shows that this does not tell the whole story. Most metrics exhibit a disagreement with human system assessments which is often skewed in favor of particular text generation systems, exposing a degree of favoritism in automated metrics. This paper introduces a formal definition of favoritism in preference metrics, and derives the Favi-Score, which measures this phenomenon. In particular we show that favoritism is strongly related to errors in final system rankings. Thus, we propose that preference-based metrics ought to be evaluated on both sign accuracy scores and favoritism.

* Accepted at ACL Main Conference
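
To make the two evaluation criteria in the abstract above concrete, here is a minimal Python sketch. It computes sign accuracy as the fraction of comparisons where the metric's preference matches the human preference, plus a simple illustrative skew over the disagreements. The skew function is an assumption made for illustration only; it is not the paper's Favi-Score definition, which the abstract does not spell out. The function names and the encoding (+1 prefers system A, 0 is a tie, -1 prefers system B) are likewise hypothetical.

```python
from typing import Sequence


def sign_accuracy(human: Sequence[int], metric: Sequence[int]) -> float:
    """Fraction of comparisons where the metric's preference equals the human one."""
    if len(human) != len(metric) or len(human) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    agree = sum(1 for h, m in zip(human, metric) if h == m)
    return agree / len(human)


def disagreement_skew(human: Sequence[int], metric: Sequence[int]) -> float:
    """Illustrative skew in [-1, 1] (NOT the paper's Favi-Score).

    +1: every disagreement shifts the rating toward system A.
    -1: every disagreement shifts the rating toward system B.
     0: disagreements are balanced, or there are none.
    """
    toward_a = sum(1 for h, m in zip(human, metric) if m > h)
    toward_b = sum(1 for h, m in zip(human, metric) if m < h)
    total = toward_a + toward_b
    return 0.0 if total == 0 else (toward_a - toward_b) / total


# Made-up ratings: +1 prefers system A, 0 is a tie, -1 prefers system B.
human_ratings = [1, -1, 0, -1, 1, -1]
metric_ratings = [1, 1, 1, -1, 1, 0]

accuracy = sign_accuracy(human_ratings, metric_ratings)
skew = disagreement_skew(human_ratings, metric_ratings)
print(f"sign accuracy: {accuracy:.2f}, disagreement skew: {skew:+.2f}")
```

In this toy data the metric agrees with the humans on half of the comparisons, yet all of its errors lean toward system A, which is the kind of one-sided disagreement that an agreement score alone would not reveal.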

Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

Jun 06, 2023

Jan Deriu, Pius von Däniken, Don Tuggener, Mark Cieliebak

Abstract:A major challenge in the field of Text Generation is evaluation: Human evaluations are cost-intensive, and automated metrics often display considerable disagreement with human judgments. In this paper, we propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics when used to generate preference rankings between system outputs. We show that existing automated metrics are generally over-confident in assigning significant differences between systems in this setting. However, our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics. We show that using this combination, we only require about 50% of the human annotations typically used in evaluations to arrive at robust and statistically significant results while yielding the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of our approach for three text generation tasks: dialogue systems, machine translation, and text summarization.

On the Effectiveness of Automated Metrics for Text Generation Systems

Oct 24, 2022

Pius von Däniken, Jan Deriu, Don Tuggener, Mark Cieliebak

Abstract:A major challenge in the field of Text Generation is evaluation because we lack a sound theory that can be leveraged to extract guidelines for evaluation campaigns. In this work, we propose a first step towards such a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets. The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems in a given setting. We showcase the application of the theory on the WMT 21 and Spot-The-Bot evaluation data and outline how it can be leveraged to improve the evaluation protocol regarding the reliability, robustness, and significance of the evaluation outcome.

Probing the Robustness of Trained Metrics for Conversational Dialogue Systems

Feb 28, 2022

Jan Deriu, Don Tuggener, Pius von Däniken, Mark Cieliebak

Abstract:This paper introduces an adversarial method to stress-test trained metrics to evaluate conversational dialogue systems. The method leverages Reinforcement Learning to find response strategies that elicit optimal scores from the trained metrics. We apply our method to test recently proposed trained metrics. We find that they all are susceptible to giving high scores to responses generated by relatively simple and obviously flawed strategies that our method converges on. For instance, simply copying parts of the conversation context to form a response yields competitive scores or even outperforms responses written by humans.

Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems

Oct 05, 2020

Jan Deriu, Don Tuggener, Pius von Däniken, Jon Ander Campos, Alvaro Rodrigo, Thiziri Belkacem, Aitor Soroa, Eneko Agirre, Mark Cieliebak

Abstract:The lack of time-efficient and reliable evaluation methods hampers the development of conversational dialogue systems (chatbots). Evaluations requiring humans to converse with chatbots are time- and cost-intensive, put high cognitive demands on the human judges, and yield low-quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether they think it is human or not (assuming there are human participants in these conversations). These annotations then allow us to rank chatbots regarding their ability to mimic the conversational behavior of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chatbot can uphold human-like behavior the longest, i.e., Survival Analysis. This metric has the ability to correlate a bot's performance with certain of its characteristics (e.g., fluency or sensibleness), yielding interpretable results. The comparably low cost of our framework allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chatbots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.
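
The survival idea in the abstract above can be illustrated with a short Python sketch. It uses the standard Kaplan-Meier estimator as a stand-in, which is an assumption: the abstract names Survival Analysis but not a specific estimator. The function name, the data layout (number of exchanges until a bot was spotted, with a flag for conversations that ended before it was spotted), and the numbers are all hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(
    durations: Sequence[int], spotted: Sequence[bool]
) -> List[Tuple[int, float]]:
    """Kaplan-Meier survival curve as (time, probability of not yet being spotted).

    durations[i]: number of exchanges observed for conversation i.
    spotted[i]:   True if the bot was identified as a bot at that point,
                  False if the conversation ended first (censored).
    """
    if len(durations) != len(spotted):
        raise ValueError("durations and spotted must have equal length")
    event_times = sorted({d for d, s in zip(durations, spotted) if s})
    survival = 1.0
    curve = []
    for t in event_times:
        # Conversations still running (bot not yet spotted) just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Bots spotted exactly at time t.
        events = sum(1 for d, s in zip(durations, spotted) if s and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up data for one bot: four conversations, one censored at 3 exchanges.
example_durations = [2, 3, 3, 5]
example_spotted = [True, True, False, True]

for time, prob in kaplan_meier(example_durations, example_spotted):
    print(f"after {time} exchanges: P(not yet spotted) = {prob:.2f}")
```

Comparing such curves across bots would show which one keeps up human-like behavior for more exchanges; a curve that drops later indicates a bot that is harder to spot.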
