Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ben Prystawski

Scaling up the think-aloud method

May 29, 2025

Daniel Wurgaft, Ben Prystawski, Kanishk Gandhi, Cedegao E. Zhang, Joshua B. Tenenbaum, Noah D. Goodman

Abstract:The think-aloud method, where participants voice their thoughts as they solve a task, is a valuable source of rich data about human reasoning processes. Yet, it has declined in popularity in contemporary cognitive science, largely because labor-intensive transcription and annotation preclude large sample sizes. Here, we develop methods to automate the transcription and annotation of verbal reports of reasoning using natural language processing tools, allowing for large-scale analysis of think-aloud data. In our study, 640 participants thought aloud while playing the Game of 24, a mathematical reasoning task. We automatically transcribed the recordings and coded the transcripts as search graphs, finding moderate inter-rater reliability with humans. We analyze these graphs and characterize consistency and variation in human reasoning traces. Our work demonstrates the value of think-aloud data at scale and serves as a proof of concept for the automated analysis of verbal reports.

* 8 pages, 4 figures. Daniel Wurgaft and Ben Prystawski contributed equally

Via

Access Paper or Ask Questions

Why think step-by-step? Reasoning emerges from the locality of experience

Apr 07, 2023

Ben Prystawski, Noah D. Goodman

Abstract:Humans have a powerful and mysterious capacity to reason. By working through a series of purely mental steps, we can make inferences we would not be capable of making directly -- despite that fact that we get no additional data from the world. Similarly, large language models can perform better at complex tasks through chain-of-thought reasoning, where they generate intermediate steps before answering a question. We use language models to investigate the questions of when and why reasoning is helpful, testing the hypothesis that reasoning is effective when training data consisting of local clusters of variables that influence each other strongly. These training conditions enable the chaining of accurate local inferences in order to estimate relationships between variables that were not seen together in training. We train an autoregressive transformer on samples from joint distributions defined by Bayes nets, but only include a subset of all the variables in each sample. We compare language models' ability to match conditional probabilities both with and without intermediate reasoning steps, finding that intermediate steps help only when the training data is locally structured with respect to dependencies between variables. Furthermore, intermediate variables need to be relevant to the relationship between observed information and target inferences. Our results illustrate how the statistical structure of training data drives the effectiveness of reasoning step by step.

* 8 pages, 3 figures

Via

Access Paper or Ask Questions

Toward a normative theory of management by goal-setting

Feb 06, 2023

Nishad Singhi, Florian Mohnert, Ben Prystawski, Falk Lieder

Abstract:People are often confronted with problems whose complexity exceeds their cognitive capacities. To deal with this complexity, individuals and managers can break complex problems down into a series of subgoals. Which subgoals are most effective depends on people's cognitive constraints and the cognitive mechanisms of goal pursuit. This creates an untapped opportunity to derive practical recommendations for which subgoals managers and individuals should set from cognitive models of bounded rationality. To seize this opportunity, we apply the principle of resource-rationality to formulate a mathematically precise normative theory of (self-)management by goal-setting. We leverage this theory to computationally derive optimal subgoals from a resource-rational model of human goal pursuit. Finally, we show that the resulting subgoals improve the problem-solving performance of bounded agents and human participants. This constitutes a first step towards grounding prescriptive theories of management and practical recommendations for goal-setting in computational models of the relevant psychological processes and cognitive limitations.

Via

Access Paper or Ask Questions

Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models

Sep 16, 2022

Ben Prystawski, Paul Thibodeau, Noah Goodman

Figure 1 for Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models

Figure 2 for Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models

Figure 3 for Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models

Figure 4 for Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models

Abstract:Probabilistic models of language understanding are interpretable and structured, for instance models of metaphor understanding describe inference about latent topics and features. However, these models are manually designed for a specific task. Large language models (LLMs) can perform many tasks through in-context learning, but they lack the clear structure of probabilistic models. In this paper, we use chain-of-thought prompts to introduce structures from probabilistic models into LLMs. These prompts lead the model to infer latent variables and reason about their relationships to choose appropriate paraphrases for metaphors. The latent variables and relationships chosen are informed by theories of metaphor understanding from cognitive psychology. We apply these prompts to the two largest versions of GPT-3 and show that they can improve paraphrase selection.

* 12 pages, 1 figure

Via

Access Paper or Ask Questions

Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization

Dec 21, 2021

Jacob Nogas, Tong Li, Fernando J. Yanez, Arghavan Modiri, Nina Deliu, Ben Prystawski, Sofia S. Villar, Anna Rafferty, Joseph J. Williams

Figure 1 for Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization

Figure 2 for Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization

Figure 3 for Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization

Figure 4 for Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization

Abstract:Multi-armed bandit algorithms like Thompson Sampling can be used to conduct adaptive experiments, in which maximizing reward means that data is used to progressively assign more participants to more effective arms. Such assignment strategies increase the risk of statistical hypothesis tests identifying a difference between arms when there is not one, and failing to conclude there is a difference in arms when there truly is one. We present simulations for 2-arm experiments that explore two algorithms that combine the benefits of uniform randomization for statistical analysis, with the benefits of reward maximization achieved by Thompson Sampling (TS). First, Top-Two Thompson Sampling adds a fixed amount of uniform random allocation (UR) spread evenly over time. Second, a novel heuristic algorithm, called TS PostDiff (Posterior Probability of Difference). TS PostDiff takes a Bayesian approach to mixing TS and UR: the probability a participant is assigned using UR allocation is the posterior probability that the difference between two arms is `small' (below a certain threshold), allowing for more UR exploration when there is little or no reward to be gained. We find that TS PostDiff method performs well across multiple effect sizes, and thus does not require tuning based on a guess for the true effect size.

Via

Access Paper or Ask Questions