Title: Is Moral Self-correction An Innate Capability of Large Language Models? A Mechanistic Analysis to Self-correction

Oct 27, 2024

Zimo Qi, Guangliang Liu, Kristen Marie Johnson, Lu Chen

Abstract: Despite intensive attention to the self-correction capability of Large Language Models (LLMs), the underlying mechanism of this capability remains under-explored. In this paper, we aim to answer two fundamental questions about moral self-correction: (1) how do different components of self-correction, such as Chain-of-Thought (CoT) reasoning, external feedback, and instructional prompts, interact to enable moral self-correction; and (2) is self-correction one of LLMs' innate capabilities? To answer the first question, we examine how different self-correction components interact to intervene in the morality embedded within hidden states, thereby contributing to different performance. For the second question, we (i) evaluate the robustness of moral self-correction by introducing natural language interventions of weak evidence into prompts; and (ii) propose a validation framework, self-distinguish, which requires effective self-correction to enable LLMs to distinguish between desirable and undesirable outputs. Our experimental results indicate that there is no universally optimal self-correction method for the tasks considered, although external feedback and CoT can contribute additional performance gains. However, our mechanistic analysis reveals negative interactions among instructional prompts, CoT, and external feedback, suggesting a conflict between internal knowledge and external feedback. The self-distinguish experiments demonstrate that while LLMs can self-correct their responses, they are unable to reliably distinguish between desired and undesired outputs. Based on this empirical evidence, we conclude that moral self-correction is not an innate capability of LLMs acquired during pretraining.
