Yasuhiro Fujita

Preferred Elements, Inc.

PLaMo 2 Technical Report

Sep 05, 2025

Preferred Networks: Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger (+14 more)

Abstract: In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly sized open models in instruction-following, language fluency, and Japanese-specific knowledge.

Experience Replay with Random Reshuffling

Mar 04, 2025

Yasuhiro Fujita

Abstract: Experience replay is a key component in reinforcement learning for stabilizing learning and improving sample efficiency. Its typical implementation samples transitions with replacement from a replay buffer. In contrast, in supervised learning with a fixed dataset, it is a common practice to shuffle the dataset every epoch and consume data sequentially, which is called random reshuffling (RR). RR enjoys theoretically better convergence properties and has been shown to outperform with-replacement sampling empirically. To leverage the benefits of RR in reinforcement learning, we propose sampling methods that extend RR to experience replay, both in uniform and prioritized settings. We evaluate our sampling methods on Atari benchmarks, demonstrating their effectiveness in deep reinforcement learning.
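
The contrast drawn in this abstract between with-replacement sampling and random reshuffling can be illustrated with a minimal Python sketch. It assumes a fixed-size buffer of placeholder transitions; the function names are hypothetical, and it does not reproduce the paper's extension of RR to a growing or prioritized replay buffer.

```python
import random


def with_replacement_batches(buffer, batch_size, num_batches, rng):
    """Typical replay sampling: each draw is independent, so items can repeat."""
    return [
        [rng.choice(buffer) for _ in range(batch_size)]
        for _ in range(num_batches)
    ]


def random_reshuffling_batches(buffer, batch_size, num_epochs, rng):
    """Random reshuffling: shuffle once per epoch, then consume sequentially."""
    batches = []
    for _ in range(num_epochs):
        order = list(range(len(buffer)))
        rng.shuffle(order)  # new permutation for every epoch
        for start in range(0, len(order), batch_size):
            batches.append([buffer[i] for i in order[start:start + batch_size]])
    return batches


rng = random.Random(0)
buffer = list(range(8))  # stand-ins for stored transitions
rr_batches = random_reshuffling_batches(buffer, batch_size=4, num_epochs=2, rng=rng)
wr_batches = with_replacement_batches(buffer, batch_size=4, num_batches=4, rng=rng)
```

Within each epoch of `rr_batches`, every stored item appears exactly once, whereas `wr_batches` carries no such guarantee.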

Entropy Controllable Direct Preference Optimization

Nov 12, 2024

Motoki Omura, Yasuhiro Fujita, Toshiki Kataoka

Abstract: In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we show that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy's performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution's sharpness and thereby enabling more effective mode-seeking fitting. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.
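
As a rough illustration of the "simple binary cross-entropy loss" mentioned in this abstract, the following Python sketch computes the standard DPO loss for a single preference pair from sequence log-probabilities. The `alpha` coefficient is an assumption about how an entropy-control term might enter the loss; the abstract does not give the H-DPO formula, so treat that parameter as hypothetical. With `alpha=1.0` the function reduces to plain DPO.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, alpha=1.0):
    """Binary cross-entropy preference loss for one (chosen, rejected) pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference policy.
    alpha: hypothetical entropy-control coefficient (assumption); 1.0 gives DPO.
    """
    chosen = alpha * logp_w - ref_logp_w
    rejected = alpha * logp_l - ref_logp_l
    margin = beta * (chosen - rejected)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


# When the policy equals the reference, the margin is zero and the loss is log 2.
loss_at_reference = preference_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the chosen response's log-probability lowers the loss.
loss_after_update = preference_loss(-0.5, -2.0, -1.0, -2.0)
```

No reward model appears anywhere in the computation; only log-probabilities from the policy and the reference are needed.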

PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency

Oct 10, 2024

Kenshin Abe, Kaizaburo Chubachi, Yasuhiro Fujita, Yuta Hirokawa, Kentaro Imajo, Toshiki Kataoka, Hiroyoshi Komatsu, Hiroaki Mikami, Tsuguo Mogami, Shogo Murai (+9 more)

Abstract: We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency. The model was trained from scratch on 2 trillion tokens, using techniques such as QK Normalization and Z-Loss to ensure training stability. Post-training techniques, including Supervised Fine-Tuning and Direct Preference Optimization, were applied to refine the model's performance. Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks, achieving results that are competitive with frontier models like GPT-4.
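
For readers unfamiliar with Z-Loss, here is a minimal Python sketch of the commonly used auxiliary formulation, which penalizes the squared log of the softmax normalizer. The abstract does not specify the exact form or coefficient used for PLaMo-100B, so the function names and the `z_coef` value are illustrative assumptions.

```python
import math


def log_partition(logits):
    """Stable log-sum-exp of the logits, i.e. log Z of the softmax."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def cross_entropy_with_z_loss(logits, target, z_coef=1e-4):
    """Return (total, cross_entropy, z_loss) for a single token.

    z_coef is an illustrative value (assumption), not taken from the report.
    """
    log_z = log_partition(logits)
    cross_entropy = log_z - logits[target]
    z_loss = z_coef * log_z ** 2  # pushes log Z toward zero
    return cross_entropy + z_loss, cross_entropy, z_loss


# Shifting all logits by a constant leaves cross-entropy unchanged
# but increases the z-loss, which discourages logit drift.
_, ce_a, z_a = cross_entropy_with_z_loss([0.0, 0.0], target=0)
_, ce_b, z_b = cross_entropy_with_z_loss([5.0, 5.0], target=0)
```

Because softmax is invariant to a uniform shift of the logits, cross-entropy alone cannot constrain their scale; the auxiliary term adds that constraint.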

Surface-Aligned Neural Radiance Fields for Controllable 3D Human Synthesis

Jan 05, 2022

Tianhan Xu, Yasuhiro Fujita, Eiichi Matsumoto

Abstract: We propose a new method for reconstructing controllable implicit 3D human models from sparse multi-view RGB videos. Our method defines the neural scene representation on the mesh surface points and signed distances from the surface of a human body mesh. We identify an indistinguishability issue that arises when a point in 3D space is mapped to its nearest surface point on a mesh for learning surface-aligned neural scene representation. To address this issue, we propose projecting a point onto a mesh surface using barycentric interpolation with modified vertex normals. Experiments with the ZJU-MoCap and Human3.6M datasets show that our approach achieves higher quality in novel-view and novel-pose synthesis than existing methods. We also demonstrate that our method easily supports the control of body shape and clothes.

Distributed Reinforcement Learning of Targeted Grasping with Active Vision for Mobile Manipulators

Jul 16, 2020

Yasuhiro Fujita, Kota Uenishi, Avinash Ummadisingu, Prabhat Nagarajan, Shimpei Masuda, Mario Ynocente Castro

Abstract: Developing personal robots that can perform a diverse range of manipulation tasks in unstructured environments necessitates solving several challenges for robotic grasping systems. We take a step towards this broader goal by presenting the first RL-based system, to our knowledge, for a mobile manipulator that can (a) achieve targeted grasping generalizing to unseen target objects, (b) learn complex grasping strategies for cluttered scenes with occluded objects, and (c) perform active vision through its movable wrist camera to better locate objects. The system is informed of the desired target object in the form of a single, arbitrary-pose RGB image of that object, enabling the system to generalize to unseen objects without retraining. To achieve such a system, we combine several advances in deep reinforcement learning and present a large-scale distributed training system using synchronous SGD that seamlessly scales to multi-node, multi-GPU infrastructure to make rapid prototyping easier. We train and evaluate our system in a simulated environment, identify key components for improving performance, analyze its behaviors, and transfer to a real-world setup.

* Accepted at IROS 2020

Learning Latent State Spaces for Planning through Reward Prediction

Dec 09, 2019

Aaron Havens, Yi Ouyang, Prabhat Nagarajan, Yasuhiro Fujita

Abstract: Model-based reinforcement learning methods typically learn models for high-dimensional state spaces by aiming to reconstruct and predict the original observations. However, drawing inspiration from model-free reinforcement learning, we propose learning a latent dynamics model directly from rewards. In this work, we introduce a model-based planning framework that learns a latent reward prediction model and then plans in the latent state-space. The latent representation is learned exclusively from multi-step reward prediction, which we show to be the only necessary information for successful planning. With this framework, we are able to benefit from the concise model-free representation, while still enjoying the data-efficiency of model-based algorithms. We demonstrate our framework in multi-pendulum and multi-cheetah environments where several pendulums or cheetahs are shown to the agent but only one of them produces rewards. In these environments, it is important for the agent to construct a concise latent representation to filter out irrelevant observations. We find that our method can successfully learn an accurate latent reward prediction model in the presence of the irrelevant information while existing model-based methods fail. Planning in the learned latent state-space shows strong performance and high sample efficiency over model-free and model-based baselines.

* Deep RL Workshop, NeurIPS 2019, Vancouver

ChainerRL: A Deep Reinforcement Learning Library

Dec 09, 2019

Yasuhiro Fujita, Toshiki Kataoka, Prabhat Nagarajan, Takahiro Ishikawa

Abstract: In this paper, we introduce ChainerRL, an open-source Deep Reinforcement Learning (DRL) library built using Python and the Chainer deep learning framework. ChainerRL implements a comprehensive set of DRL algorithms and techniques drawn from state-of-the-art research in the field. To foster reproducible research, and for instructional purposes, ChainerRL provides scripts that closely replicate the original papers' experimental settings and reproduce published benchmark results for several algorithms. Lastly, ChainerRL offers a visualization tool that enables the qualitative inspection of trained agents. The ChainerRL source code can be found on GitHub: https://github.com/chainer/chainerrl

* Accepted at NeurIPS 2019 Deep Reinforcement Learning Workshop

A Differentiable Gaussian-like Distribution on Hyperbolic Space for Gradient-Based Learning

Feb 08, 2019

Yoshihiro Nagano, Shoichiro Yamaguchi, Yasuhiro Fujita, Masanori Koyama

Abstract: Hyperbolic space is a geometry that is known to be well-suited for representation learning of data with an underlying hierarchical structure. In this paper, we present a novel hyperbolic distribution called the pseudo-hyperbolic Gaussian, a Gaussian-like distribution on hyperbolic space whose density can be evaluated analytically and differentiated with respect to the parameters. Our distribution enables gradient-based learning of probabilistic models on hyperbolic space that could not have been considered before. Also, we can sample from this hyperbolic probability distribution without resorting to auxiliary means like rejection sampling. As applications of our distribution, we develop a hyperbolic analog of the variational autoencoder and a method of probabilistic word embedding on hyperbolic space. We demonstrate the efficacy of our distribution on various datasets including MNIST, Atari 2600 Breakout, and WordNet.

* 17 pages, 12 figures

Model-Based Reinforcement Learning via Meta-Policy Optimization

Sep 14, 2018

Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, Pieter Abbeel

Abstract: Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that forgoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamics models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.

* First 2 authors contributed equally. Accepted for Conference on Robot Learning (CoRL)
