Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Iván Arcuschin

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Mar 13, 2025

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy

Abstract:Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e. CoT reasoning does not always reflect how models arrive at conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, we show that unfaithful CoT can occur on realistic prompts with no artificial bias. Our results reveal non-negligible rates of several forms of unfaithful reasoning in frontier models: Sonnet 3.7 (16.3%), DeepSeek R1 (5.3%) and ChatGPT-4o (7.0%) all answer a notable proportion of question pairs unfaithfully. Specifically, we find that models rationalize their implicit biases in answers to binary questions ("implicit post-hoc rationalization"). For example, when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We also investigate restoration errors (Dziri et al., 2023), where models make and then silently correct errors in their reasoning, and unfaithful shortcuts, where models use clearly illogical reasoning to simplify solving problems in Putnam questions (a hard benchmark). Our findings raise challenges for AI safety work that relies on monitoring CoT to detect undesired behavior.

* Accepted to the Reasoning and Planning for Large Language Models Workshop (ICLR 25), 10 main paper pages, 38 appendix pages

Via

Access Paper or Ask Questions

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Jul 19, 2024

Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso

Figure 1 for InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Figure 2 for InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Figure 3 for InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Figure 4 for InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Abstract:Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train these neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.

Via

Access Paper or Ask Questions