Title: Are self-explanations from Large Language Models faithful?

Jan 17, 2024

Andreas Madsen, Sarath Chandar, Siva Reddy

Abstract: Instruction-tuned large language models (LLMs) excel at many tasks, and will even provide explanations for their behavior. Since these models are directly accessible to the public, there is a risk that convincing and wrong explanations can lead to unsupported confidence in LLMs. Therefore, interpretability-faithfulness of self-explanations is an important consideration for AI Safety. Assessing the interpretability-faithfulness of these explanations, termed self-explanations, is challenging as the models are too complex for humans to annotate what is a correct explanation. To address this, we propose employing self-consistency checks as a measure of faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been applied to LLMs' self-explanations. We apply self-consistency checks to three types of self-explanations: counterfactuals, importance measures, and redactions. Our work demonstrates that faithfulness is both task and model dependent, e.g., for sentiment classification, counterfactual explanations are more faithful for Llama2, importance measures for Mistral, and redaction for Falcon 40B. Finally, our findings are robust to prompt variations.
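The self-consistency check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`redact`, `is_self_consistent`), the `[REDACTED]` mask, and the keyword-based toy classifier and explainers are all hypothetical stand-ins for calls to a real LLM. The sketch only shows the logic of the check: if the words a model names as important are removed, the original prediction should no longer be reproduced.

```python
from typing import Callable, List

# Hypothetical toy lexicons standing in for what an LLM has learned.
POSITIVE = {"great", "love"}
NEGATIVE = {"awful", "hate"}


def _tokens(text: str) -> List[str]:
    """Lowercase whitespace tokens with trailing/leading punctuation stripped."""
    return [tok.lower().strip(".,!?") for tok in text.split()]


def toy_predict(text: str) -> str:
    """Stand-in for an LLM sentiment prediction (assumption: keyword lookup)."""
    toks = set(_tokens(text))
    if toks & POSITIVE:
        return "positive"
    if toks & NEGATIVE:
        return "negative"
    return "unknown"


def faithful_explain(text: str) -> List[str]:
    """Stand-in self-explanation that names the words the toy model really uses."""
    return [t for t in _tokens(text) if t in POSITIVE | NEGATIVE]


def unfaithful_explain(text: str) -> List[str]:
    """Stand-in self-explanation that names an irrelevant word."""
    return ["movie"]


def redact(text: str, words: List[str], mask: str = "[REDACTED]") -> str:
    """Replace every token matching one of `words` with a mask token."""
    targets = {w.lower() for w in words}
    return " ".join(
        mask if tok.lower().strip(".,!?") in targets else tok
        for tok in text.split()
    )


def is_self_consistent(
    predict: Callable[[str], str],
    explain: Callable[[str], List[str]],
    text: str,
) -> bool:
    """Return True if removing the 'important' words changes the prediction."""
    original = predict(text)
    redacted = redact(text, explain(text))
    return predict(redacted) != original


review = "The movie was great"
print(is_self_consistent(toy_predict, faithful_explain, review))
print(is_self_consistent(toy_predict, unfaithful_explain, review))
```

In this toy setting the first explainer passes the check and the second fails it, because masking "movie" leaves the prediction unchanged. With a real LLM, `predict` and `explain` would both be prompts to the same model, which is what makes the check a test of the model's consistency with itself.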
