Title: Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models

Aug 21, 2024

Sepehr Kamahi, Yadollah Yaghoobzadeh

Abstract: Despite the widespread adoption of autoregressive language models, explainability evaluation research has predominantly focused on span infilling and masked language models (MLMs). Evaluating the faithfulness of an explanation method -- how accurately the method explains the inner workings and decision-making of the model -- is challenging because it is hard to separate the model from its explanation. Most faithfulness evaluation techniques corrupt or remove the input tokens that a particular attribution (feature importance) method considers important and observe the change in the model's output. This approach creates out-of-distribution inputs for causal language models (CLMs), because their training objective is next-token prediction. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods in autoregressive language modeling scenarios. Our technique creates fluent, in-distribution counterfactuals, which makes the evaluation protocol more reliable. Code is available at https://github.com/Sepehr-Kamahi/faith
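The perturbation-style evaluation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_model_prob`, the attribution scores, and the `COUNTERFACTUAL` lookup table are hypothetical stand-ins for a real language model, a real attribution method, and a real counterfactual generator. The sketch only shows the general shape of the protocol: replace the top-attributed tokens, then measure how far the model output falls.

```python
from typing import Callable, List, Sequence

Tokens = List[str]


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k highest attribution scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def output_drop(
    tokens: Tokens,
    scores: Sequence[float],
    model_prob: Callable[[Tokens], float],
    replace: Callable[[str], str],
    k: int = 1,
) -> float:
    """Replace the k top-attributed tokens and measure the fall in model output."""
    targets = set(top_k_indices(scores, k))
    perturbed = [replace(t) if i in targets else t for i, t in enumerate(tokens)]
    return model_prob(tokens) - model_prob(perturbed)


# Hypothetical stand-in for a language model: the share of "positive" tokens.
POSITIVE = {"great", "loved"}


def toy_model_prob(tokens: Tokens) -> float:
    return sum(t in POSITIVE for t in tokens) / len(tokens)


# Erasure-style replacement: produces text unlike the model's training data.
def mask(token: str) -> str:
    return "[MASK]"


# Hypothetical counterfactual editor: swaps in a fluent alternative word.
COUNTERFACTUAL = {"great": "terrible", "loved": "hated"}


def counterfactual(token: str) -> str:
    return COUNTERFACTUAL.get(token, token)


tokens = ["the", "movie", "was", "great"]
scores = [0.0, 0.1, 0.0, 0.9]  # made-up attribution scores, one per token

mask_drop = output_drop(tokens, scores, toy_model_prob, mask)
cf_drop = output_drop(tokens, scores, toy_model_prob, counterfactual)
```

In this toy setting both replacements produce the same drop, because the stand-in model only counts positive words. The difference the abstract points to shows up with a real causal language model: "the movie was [MASK]" is not text the model was trained on, whereas "the movie was terrible" stays fluent and in-distribution.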

* 17 pages, 6 figures
