Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sosuke Kobayashi

Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

Mar 17, 2026

Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, Jun Suzuki

Abstract:We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.

* 25 pages, accepted by ICLR 2026 as a conference paper

Via

Access Paper or Ask Questions

When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

Apr 24, 2025

Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji Suzuki

Figure 1 for When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

Figure 2 for When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

Figure 3 for When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

Figure 4 for When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

Abstract:The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task's prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve model's performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.

Via

Access Paper or Ask Questions

Efficient Construction of Model Family through Progressive Training Using Model Expansion

Apr 01, 2025

Kazuki Yano, Sho Takase, Sosuke Kobayashi, Shun Kiyono, Jun Suzuki

Abstract:As Large Language Models (LLMs) gain widespread practical application, providing the model family of different parameter sizes has become standard practice to address diverse computational requirements. Conventionally, each model in a family is trained independently, resulting in computational costs that scale additively with the number of models. We propose an efficient method for constructing the model family through progressive training, where smaller models are incrementally expanded to larger sizes to create a complete model family. Through extensive experiments with a model family ranging from 1B to 8B parameters, we demonstrate that our method reduces computational costs by approximately 25% while maintaining comparable performance to independently trained models. Furthermore, by strategically adjusting maximum learning rates based on model size, our method outperforms the independent training across various metrics. Beyond performance gains, our approach offers an additional advantage: models in our family tend to yield more consistent behavior across different model sizes.

Via

Access Paper or Ask Questions

Spike No More: Stabilizing the Pre-training of Large Language Models

Dec 28, 2023

Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki

Figure 1 for Spike No More: Stabilizing the Pre-training of Large Language Models

Figure 2 for Spike No More: Stabilizing the Pre-training of Large Language Models

Figure 3 for Spike No More: Stabilizing the Pre-training of Large Language Models

Figure 4 for Spike No More: Stabilizing the Pre-training of Large Language Models

Abstract:The loss spike often occurs during pre-training of a large language model. The spikes degrade the performance of a large language model, and sometimes ruin the pre-training. Since the pre-training needs a vast computational budget, we should avoid such spikes. To investigate a cause of loss spikes, we focus on gradients of internal layers in this study. Through theoretical analyses, we introduce two causes of the exploding gradients, and provide requirements to prevent the explosion. In addition, we introduce the combination of the initialization method and a simple modification to embeddings as a method to satisfy the requirements. We conduct various experiments to verify our theoretical analyses empirically. Experimental results indicate that the combination is effective in preventing spikes during pre-training.

* Work in progress

Via

Access Paper or Ask Questions

On Layer Normalizations and Residual Connections in Transformers

Jun 01, 2022

Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki

Figure 1 for On Layer Normalizations and Residual Connections in Transformers

Figure 2 for On Layer Normalizations and Residual Connections in Transformers

Figure 3 for On Layer Normalizations and Residual Connections in Transformers

Figure 4 for On Layer Normalizations and Residual Connections in Transformers

Abstract:In the perspective of a layer normalization (LN) position, the architecture of Transformers can be categorized into two types: Post-LN and Pre-LN. Recent Transformers prefer to select Pre-LN because the training in Post-LN with deep Transformers, e.g., ten or more layers, often becomes unstable, resulting in useless models. However, in contrast, Post-LN has also consistently achieved better performance than Pre-LN in relatively shallow Transformers, e.g., six or fewer layers. This study first investigates the reason for these discrepant observations empirically and theoretically and discovers 1, the LN in Post-LN is the source of the vanishing gradient problem that mainly leads the unstable training whereas Pre-LN prevents it, and 2, Post-LN tends to preserve larger gradient norms in higher layers during the back-propagation that may lead an effective training. Exploiting the new findings, we propose a method that can equip both higher stability and effective training by a simple modification from Post-LN. We conduct experiments on a wide range of text generation tasks and demonstrate that our method outperforms Pre-LN, and stable training regardless of the shallow or deep layer settings.

Via

Access Paper or Ask Questions

Decomposing NeRF for Editing via Feature Field Distillation

May 31, 2022

Sosuke Kobayashi, Eiichi Matsumoto, Vincent Sitzmann

Figure 1 for Decomposing NeRF for Editing via Feature Field Distillation

Figure 2 for Decomposing NeRF for Editing via Feature Field Distillation

Figure 3 for Decomposing NeRF for Editing via Feature Field Distillation

Figure 4 for Decomposing NeRF for Editing via Feature Field Distillation

Abstract:Emerging neural radiance fields (NeRF) are a promising scene representation for computer graphics, enabling high-quality 3D reconstruction and novel view synthesis from image observations. However, editing a scene represented by a NeRF is challenging, as the underlying connectionist representations such as MLPs or voxel grids are not object-centric or compositional. In particular, it has been difficult to selectively edit specific regions or objects. In this work, we tackle the problem of semantic scene decomposition of NeRFs to enable query-based local editing of the represented 3D scenes. We propose to distill the knowledge of off-the-shelf, self-supervised 2D image feature extractors such as CLIP-LSeg or DINO into a 3D feature field optimized in parallel to the radiance field. Given a user-specified query of various modalities such as text, an image patch, or a point-and-click selection, 3D feature fields semantically decompose 3D space without the need for re-training and enable us to semantically select and edit regions in the radiance field. Our experiments validate that the distilled feature fields (DFFs) can transfer recent progress in 2D vision and language foundation models to 3D scene representations, enabling convincing 3D segmentation and selective editing of emerging neural graphics representations.

* https://pfnet-research.github.io/distilled-feature-fields/

Via

Access Paper or Ask Questions

Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model

May 24, 2022

Sosuke Kobayashi, Shun Kiyono, Jun Suzuki, Kentaro Inui

Figure 1 for Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model

Figure 2 for Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model

Figure 3 for Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model

Figure 4 for Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model

Abstract:Ensembling is a popular method used to improve performance as a last resort. However, ensembling multiple models finetuned from a single pretrained model has been not very effective; this could be due to the lack of diversity among ensemble members. This paper proposes Multi-Ticket Ensemble, which finetunes different subnetworks of a single pretrained model and ensembles them. We empirically demonstrated that winning-ticket subnetworks produced more diverse predictions than dense networks, and their ensemble outperformed the standard ensemble on some tasks.

* Workshop on Challenges & Perspectives in Creating Large Language Models (BigScience) 2022

Via

Access Paper or Ask Questions

Instance-Based Neural Dependency Parsing

Sep 28, 2021

Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi, Sho Yokoi, Tatsuki Kuribayashi, Masashi Yoshikawa, Kentaro Inui

Figure 1 for Instance-Based Neural Dependency Parsing

Figure 2 for Instance-Based Neural Dependency Parsing

Figure 3 for Instance-Based Neural Dependency Parsing

Figure 4 for Instance-Based Neural Dependency Parsing

Abstract:Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve competitive accuracy with standard neural models and have the reasonable plausibility of instance-based explanations.

* 15 pages, accepted to TACL 2021

Via

Access Paper or Ask Questions

SHAPE: Shifted Absolute Position Embedding for Transformers

Sep 13, 2021

Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, Kentaro Inui

Figure 1 for SHAPE: Shifted Absolute Position Embedding for Transformers

Figure 2 for SHAPE: Shifted Absolute Position Embedding for Transformers

Figure 3 for SHAPE: Shifted Absolute Position Embedding for Transformers

Figure 4 for SHAPE: Shifted Absolute Position Embedding for Transformers

Abstract:Position representation is crucial for building position-aware representations in Transformers. Existing position representations suffer from a lack of generalization to test data with unseen lengths or high computational cost. We investigate shifted absolute position embedding (SHAPE) to address both issues. The basic idea of SHAPE is to achieve shift invariance, which is a key property of recent successful position representations, by randomly shifting absolute positions during training. We demonstrate that SHAPE is empirically comparable to its counterpart while being simpler and faster.

* EMNLP 2021 (short paper, main conference)

Via

Access Paper or Ask Questions

Efficient Estimation of Influence of a Training Instance

Dec 08, 2020

Sosuke Kobayashi, Sho Yokoi, Jun Suzuki, Kentaro Inui

Figure 1 for Efficient Estimation of Influence of a Training Instance

Figure 2 for Efficient Estimation of Influence of a Training Instance

Figure 3 for Efficient Estimation of Influence of a Training Instance

Figure 4 for Efficient Estimation of Influence of a Training Instance

Abstract:Understanding the influence of a training instance on a neural network model leads to improving interpretability. However, it is difficult and inefficient to evaluate the influence, which shows how a model's prediction would be changed if a training instance were not used. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents the sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that learned or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset for improving generalization.

* This is an extended version of the paper presented at SustaiNLP 2020

Via

Access Paper or Ask Questions