Title: Personas as a Way to Model Truthfulness in Language Models

Oct 30, 2023

Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He

Abstract: Large Language Models are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and share similar features. For example, trustworthy sources like Wikipedia and Science usually use formal writing styles and make consistent claims. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science" because they share a persona. We first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements and generalize truthfulness across agents, but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
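
To make the idea of agents sharing a truthful generative process more concrete, here is a minimal toy sketch in Python. It is not the paper's actual environment, which the abstract does not specify: the agent names, the `true_op` and `false_op` functions, and the statement format are all hypothetical choices made for illustration. Truthful agents all apply the same correct operation, while untruthful agents all apply the same consistently wrong one.

```python
import random

# Hypothetical toy setup for illustration only; not the paper's environment.

def true_op(x, y):
    """Correct interpretation of '+', shared by every truthful agent."""
    return x + y

def false_op(x, y):
    """A consistently wrong interpretation of '+', shared by untruthful agents."""
    return x + y + 1

def make_agents(n_truthful, n_untruthful):
    """Map each agent name to the operation that agent uses when writing text."""
    agents = {}
    for i in range(n_truthful):
        agents[f"truthful_{i}"] = true_op
    for i in range(n_untruthful):
        agents[f"untruthful_{i}"] = false_op
    return agents

def sample_statement(agent_name, agents, rng):
    """Generate one arithmetic statement attributed to an agent, plus its truth label."""
    x, y = rng.randint(0, 9), rng.randint(0, 9)
    result = agents[agent_name](x, y)
    text = f"{agent_name}: {x} + {y} = {result}"
    is_true = result == x + y
    return text, is_true

rng = random.Random(0)
agents = make_agents(n_truthful=2, n_untruthful=2)
# Three statements per agent, each paired with a ground-truth label.
corpus = [sample_statement(name, agents, rng) for name in agents for _ in range(3)]
```

In a corpus like this, the agent name is a feature that correlates with truthfulness across every statement, which is the kind of shared structure the abstract suggests a model could exploit to form a truthful persona.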
