Title: Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

Apr 22, 2024

Jan-Philipp Fränken, Eric Zelikman, Rafael Rafailov, Kanishk Gandhi, Tobias Gerstenberg, Noah D. Goodman

Abstract: When prompting a language model (LM), users frequently expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles into a model can be resource-intensive and technically challenging, generally requiring human preference labels or examples. We introduce SAMI, a method for teaching a pretrained LM to follow behavioral principles that does not require any preference labels or demonstrations. SAMI is an iterative algorithm that finetunes a pretrained LM to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a "principle writer" model; to avoid dependence on stronger models, we further evaluate aligning a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct). The SAMI-trained mixtral-8x7b outperforms both the initial model and the instruction-finetuned model, achieving a 65% win rate on summarization. Our results indicate that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
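The abstract does not spell out how the conditional mutual information is estimated. As a rough illustration only, the sketch below assumes an InfoNCE-style contrastive lower bound, a common estimator for mutual information, and is not taken from the paper. The names `log_softmax`, `infonce_estimate`, and `logp` are hypothetical. For a single query, `logp[i][j]` is assumed to hold the model's log-probability of response j (sampled under constitution j) when conditioned on constitution i.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]


def infonce_estimate(logp):
    """InfoNCE-style estimate of the mutual information between
    constitutions and responses for one query (an assumed estimator).

    logp[i][j] = log p(response_j | constitution_i, query), where
    response_j was generated under constitution_j, so the diagonal
    holds the matched pairs. The estimate is at most log(n).
    """
    n = len(logp)
    # Row direction: can each constitution identify its own response?
    row_term = sum(log_softmax(logp[i])[i] for i in range(n)) / n
    # Column direction: can each response identify its own constitution?
    columns = [[logp[i][j] for i in range(n)] for j in range(n)]
    col_term = sum(log_softmax(columns[j])[j] for j in range(n)) / n
    # Average the two directions; log(n) is the usual InfoNCE offset.
    return math.log(n) + 0.5 * (row_term + col_term)


# Responses that ignore the constitution carry no information about it.
uninformative = [[0.0, 0.0], [0.0, 0.0]]
# Responses that are far more likely under their own constitution.
informative = [[0.0, -10.0], [-10.0, 0.0]]

print(infonce_estimate(uninformative))
print(infonce_estimate(informative))
```

Under this assumption, finetuning would minimize the negative of the estimate, averaged over queries, so that each self-generated response becomes easier to attribute to the constitution it was written under.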
