Guangyao Shen

Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

Jun 17, 2024

Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin

Abstract: Superalignment, where humans are weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). Recent work has preliminarily studied this problem by using weak models to supervise strong models. It finds that weakly supervised strong students can consistently outperform their weak teachers on the alignment target, a phenomenon known as weak-to-strong generalization. However, we are concerned that behind this promising phenomenon there may be an issue of weak-to-strong deception, where strong models deceive weak models by appearing well aligned in areas known to the weak models while producing misaligned behaviors in cases the weak models do not know. We take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where some alignment targets may conflict with each other (e.g., helpfulness vs. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension in order to gain high reward in another. Our experiments on both the reward modeling task and the preference optimization scenario indicate that (1) weak-to-strong deception exists, and (2) the deception may intensify as the capability gap between weak and strong models increases. We also discuss potential solutions and find that bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.

* Code is available at https://github.com/keven980716/weak-to-strong-deception

Unsupervised Graph Neural Architecture Search with Disentangled Self-supervision

Mar 08, 2024

Zeyang Zhang, Xin Wang, Ziwei Zhang, Guangyao Shen, Shiqi Shen, Wenwu Zhu

Abstract: Existing graph neural architecture search (GNAS) methods rely heavily on supervised labels during the search process, and so fail to handle the ubiquitous scenarios where supervision is not available. In this paper, we study the problem of unsupervised graph neural architecture search, which remains unexplored in the literature. The key problem is to discover the latent graph factors that drive the formation of graph data, as well as the underlying relations between these factors and the optimal neural architectures. Handling this problem is challenging because the latent graph factors and the architectures are highly entangled, owing to the nature of graphs and the complexity of the neural architecture search process. To address the challenge, we propose a novel Disentangled Self-supervised Graph Neural Architecture Search (DSGAS) model, which is able to discover the optimal architectures capturing various latent graph factors in a self-supervised fashion based on unlabeled graph data. Specifically, we first design a disentangled graph super-network capable of incorporating multiple architectures with factor-wise disentanglement, which are optimized simultaneously. Then, we estimate the performance of architectures under different factors using our proposed self-supervised training with joint architecture-graph disentanglement. Finally, we propose a contrastive search with architecture augmentations to discover architectures with factor-specific expertise. Extensive experiments on 11 real-world datasets demonstrate that the proposed model achieves state-of-the-art performance against several baseline methods in an unsupervised manner.

* NeurIPS'23
