Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Keito Kudo

Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference

Jan 27, 2025

Go Kamoda, Benjamin Hienzerling, Tatsuro Inaba, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

Figure 1 for Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference

Figure 2 for Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference

Figure 3 for Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference

Figure 4 for Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference

Abstract:According to the stages-of-inference hypothesis, early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the model's ``inner vocabulary''. Prior analysis of this detokenization stage has predominantly relied on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components that will be patched, and then observing changes in model behavior. Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference steps. Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2. Our decomposition yields interpretable terms that quantify the relative contributions of position-related, token-related, and mixed effects. By focusing on terms in this decomposition, we discover weight-based explanations of attention bias toward close tokens and attention for detokenization.

* 22 pages, 14 figures, to appear in NAACL Findings 2025

Via

Access Paper or Ask Questions

Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning

Dec 02, 2024

Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Ana Brassard, Keisuke Sakaguchi, Kentaro Inui

Abstract:This study investigates the internal reasoning mechanism of language models during symbolic multi-step reasoning, motivated by the question of whether chain-of-thought (CoT) outputs are faithful to the model's internals. Specifically, we inspect when they internally determine their answers, particularly before or after CoT begins, to determine whether models follow a post-hoc "think-to-talk" mode or a step-by-step "talk-to-think" mode of explanation. Through causal probing experiments in controlled arithmetic reasoning tasks, we found systematic internal reasoning patterns across models; for example, simple subproblems are solved before CoT begins, and more complicated multi-hop calculations are performed during CoT.

Via

Access Paper or Ask Questions

First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning

Jun 23, 2024

Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Keisuke Sakaguchi, Kentaro Inui

Figure 1 for First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning

Figure 2 for First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning

Figure 3 for First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning

Figure 4 for First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning

Abstract:Multi-step reasoning is widely adopted in the community to explore the better performance of language models (LMs). We report on the systematic strategy that LMs use in this process. Our controlled experiments reveal that LMs rely more heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning when more steps are required to reach an answer. Conversely, as LMs progress closer to the final answer, their reliance on heuristics decreases. This suggests that LMs track only a limited number of future steps and dynamically combine heuristic strategies with logical ones in tasks involving multi-step reasoning.

Via

Access Paper or Ask Questions

ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

May 08, 2024

Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

Figure 1 for ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Figure 2 for ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Figure 3 for ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Figure 4 for ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Abstract:Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanations. We observed that replacing one of the human ratings sometimes maintained, but more often lowered the inter-annotator agreement across different settings and quality aspects, suggesting that their judgments are not always consistent with human raters. We further quantified this difference by comparing the correlation between LLM-generated ratings with majority-voted human ratings across different quality aspects. With the best system, Spearman's rank correlation ranged between 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered the alternative of using an LLM as an additional rater when human raters are scarce, and measured the correlation between majority-voted labels with a limited human pool and LLMs as an additional rater, compared to the original gold labels. While GPT-4 improved the outcome when there were only two human raters, in all other observed cases, LLMs were neutral to detrimental when there were three or more human raters. We publicly release the dataset to support future improvements in LLM-in-the-loop evaluation here: https://github.com/a-brassard/ACORN.

* 18 pages, 7 figures, under review. Data available here: https://github.com/a-brassard/ACORN

Via

Access Paper or Ask Questions

A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video

Dec 04, 2023

Keito Kudo, Haruki Nagasawa, Jun Suzuki, Nobuyuki Shimizu

Abstract:This paper proposes a practical multimodal video summarization task setting and a dataset to train and evaluate the task. The target task involves summarizing a given video into a predefined number of keyframe-caption pairs and displaying them in a listable format to grasp the video content quickly. This task aims to extract crucial scenes from the video in the form of images (keyframes) and generate corresponding captions explaining each keyframe's situation. This task is useful as a practical application and presents a highly challenging problem worthy of study. Specifically, achieving simultaneous optimization of the keyframe selection performance and caption quality necessitates careful consideration of the mutual dependence on both preceding and subsequent keyframes and captions. To facilitate subsequent research in this field, we also construct a dataset by expanding upon existing datasets and propose an evaluation framework. Furthermore, we develop two baseline systems and report their respective performance.

Via

Access Paper or Ask Questions

Empirical Investigation of Neural Symbolic Reasoning Strategies

Feb 16, 2023

Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Abstract:Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer. Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation. Our results indicate the importance of further exploring effective strategies for neural reasoning models.

* This paper is accepted as the findings at EACL 2023, and the earlier version (non-archival) of this work got the Best Paper Award in the Student Research Workshop of AACL 2022

Via

Access Paper or Ask Questions

Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?

Feb 15, 2023

Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Abstract:Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps.

* accepted by EACL 2023

Via

Access Paper or Ask Questions