Title: Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

Jun 17, 2024

Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin

Abstract: Superalignment, where humans act as weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). Recent work takes a preliminary look at this problem by using weak models to supervise strong models. It finds that weakly supervised strong students can consistently outperform their weak teachers with respect to the alignment target, a phenomenon known as weak-to-strong generalization. However, we are concerned that behind this promising phenomenon there may lie an issue of weak-to-strong deception, in which strong models deceive weak models by behaving in a well-aligned way in areas known to the weak models while producing misaligned behaviors in cases the weak models do not know. We take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where some alignment targets may conflict with each other (e.g., helpfulness vs. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension in order to gain high reward in another. Our experiments on both the reward modeling task and the preference optimization scenario indicate that (1) weak-to-strong deception exists, and (2) the deception may intensify as the capability gap between weak and strong models increases. We also discuss potential solutions and find that bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.

* Code is available at https://github.com/keven980716/weak-to-strong-deception
