Yoshiyuki Ohmura

Unsupervised Learning of Inter-Object Relationships via Group Homomorphism

Apr 22, 2026

Kyotaro Ushida, Takayuki Komatsu, Yoshiyuki Ohmura, Yasuo Kuniyoshi

Abstract:Current deep learning models achieve high performance by learning statistical correlations from vast datasets, which stands in stark contrast to human learning. They lack the flexibility of humans, particularly preverbal infants, to autonomously acquire the underlying structure of the world from limited experience and adapt to novel situations. In this study, we propose an unsupervised representation learning method based on a hierarchical relationship in group operations, rather than statistical independence, aiming to build a computational model of the cognitive development of infants. The proposed model features an integrated architecture that simultaneously performs object segmentation and the extraction of motion laws from dynamic image sequences. By introducing a homomorphism from algebra as a structural constraint within a neural network, the model structurally separates pixel-level changes into meaningful, decomposed transformation components, such as translation and deformation. Using interaction scenes (chasing and evading tasks) based on developmental science findings, we experimentally demonstrate that the model can segment multiple objects into individual slots without any ground-truth labels. Furthermore, we confirmed that relative movements between objects, such as approaching or receding, are accurately mapped and structured into a one-dimensional additive latent space. These results suggest that by introducing algebraic geometric constraints rather than relying solely on statistical correlation learning, physically interpretable "disentangled representations" can be acquired. This study contributes to the understanding of the process by which infants internalize environmental laws as structures and provides a new perspective for constructing artificial systems with developmental intelligence.
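
For readers unfamiliar with the term, the homomorphism constraint mentioned in this abstract can be stated compactly. The notation below is illustrative and not taken from the paper: G is assumed to be a group of relative motions between objects under composition, and φ is a hypothetical learned map into the one-dimensional additive latent space.

```latex
\varphi : (G, \circ) \to (\mathbb{R}, +), \qquad
\varphi(g_2 \circ g_1) = \varphi(g_2) + \varphi(g_1), \qquad
\varphi(e) = 0, \qquad
\varphi(g^{-1}) = -\varphi(g)
```

Under this reading, composing two "approach" motions adds their latent values, and a "recede" motion that undoes an approach maps to the negated value.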

* Preprint. Under review at ICDL 2026

Exploration-assisted Bottleneck Transition Toward Robust and Data-efficient Deformable Object Manipulation

Mar 14, 2026

Yujiro Onishi, Ryo Takizawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi

Abstract:Imitation learning has demonstrated impressive results in robotic manipulation but fails under out-of-distribution (OOD) states. This limitation is particularly critical in Deformable Object Manipulation (DOM), where the near-infinite possible configurations render comprehensive data collection infeasible. Although several methods address OOD states, they typically require exhaustive data or highly precise perception. Such requirements are often impractical for DOM owing to its inherent complexities, including self-occlusion. To address the OOD problem in DOM, we propose a novel framework, Exploration-assisted Bottleneck Transition for Deformable Object Manipulation (ExBot), which addresses the OOD challenge through two key advantages. First, we introduce bottleneck states, standardized configurations that serve as starting points for task execution. This enables the reconceptualization of OOD challenges as the problem of transitioning diverse initial states to these bottleneck states, significantly reducing demonstration requirements. Second, to account for imperfect perception, we partition the OOD state space based on recognizability and employ dual action primitives. This approach enables ExBot to manipulate even unrecognizable states without requiring accurate perception. By concentrating demonstrations around bottleneck states and leveraging exploration to alter perceptual conditions, ExBot achieves both data efficiency and robustness to severe OOD scenarios. Real-world experiments on rope and cloth manipulation demonstrate successful task completion from diverse OOD states, including severe self-occlusions.

Why Consciousness Should Explain Physical Phenomena: Toward a Testable Theory

Nov 19, 2025

Yoshiyuki Ohmura, Yasuo Kuniyoshi

Abstract:The reductionist approach commonly employed in scientific methods presupposes that both macro and micro phenomena can be explained by micro-level laws alone. This assumption implies intra-level causal closure, rendering all macro phenomena epiphenomenal. However, the integrative nature of consciousness suggests that it is a macro phenomenon. To ensure scientific testability and reject epiphenomenalism, the reductionist assumption of intra-level causal closure must be rejected. This implies that even neural-level behavior cannot be explained by observable neural-level laws alone. Therefore, a new methodology is necessary to acknowledge the causal efficacy of macro-level phenomena. We model the brain as operating under dual laws at different levels. This model includes hypothetical macro-level psychological laws that are not determined solely by micro-level neural laws, as well as the causal effects from macro to micro levels. In this study, we propose a constructive approach that explains both mental and physical phenomena through the interaction between these two sets of laws.

Feature-Based Lie Group Transformer for Real-World Applications

Jun 06, 2025

Takayuki Komatsu, Yoshiyuki Ohmura, Kayato Nishitsunoi, Yasuo Kuniyoshi

Abstract:The main goal of representation learning is to acquire meaningful representations from real-world sensory inputs without supervision. Representation learning explains some aspects of human development. Various neural network (NN) models have been proposed that acquire empirically good representations. However, the formulation of a good representation has not been established. We recently proposed a method for categorizing changes between a pair of sensory inputs. A unique feature of this approach is that transformations between two sensory inputs are learned to satisfy algebraic structural constraints. Conventional representation learning often assumes that disentangled independent feature axes are a good representation; however, we found that such a representation cannot account for conditional independence. To overcome this problem, we proposed a new method using group decomposition in Galois algebra theory. Although this method is promising for defining a more general representation, it assumes pixel-to-pixel translation without feature extraction, and can only process low-resolution images with no background, which prevents real-world application. In this study, we provide a simple method to apply our group decomposition theory to a more realistic scenario by combining feature extraction and object segmentation. We replace pixel translation with feature translation and formulate object segmentation as grouping features under the same transformation. We validated the proposed method on a practical dataset containing both real-world objects and backgrounds. We believe that our model will lead to a better understanding of human development of object recognition in the real world.

Learning Conditionally Independent Transformations using Normal Subgroups in Group Theory

Apr 06, 2025

Kayato Nishitsunoi, Yoshiyuki Ohmura, Takayuki Komatsu, Yasuo Kuniyoshi

Abstract:Humans develop certain cognitive abilities to recognize objects and their transformations without explicit supervision, highlighting the importance of unsupervised representation learning. A fundamental challenge in unsupervised representation learning is to separate different transformations in learned feature representations. Although algebraic approaches have been explored, a comprehensive theoretical framework remains underdeveloped. Existing methods decompose transformations based on algebraic independence, but these methods primarily focus on commutative transformations and do not extend to cases where transformations are conditionally independent but noncommutative. To extend current representation learning frameworks, we draw inspiration from Galois theory, where the decomposition of groups through normal subgroups provides an approach for the analysis of structured transformations. Normal subgroups naturally extend commutativity under certain conditions and offer a foundation for the categorization of transformations, even when they do not commute. In this paper, we propose a novel approach that leverages normal subgroups to enable the separation of conditionally independent transformations, even in the absence of commutativity. Through experiments on geometric transformations in images, we show that our method successfully categorizes conditionally independent transformations, such as rotation and translation, in an unsupervised manner, suggesting a close link between group decomposition via normal subgroups and transformation categorization in representation learning.
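
The group-theoretic fact this abstract relies on can be checked numerically. The sketch below is not the paper's method. It only illustrates, with 3x3 homogeneous matrices for planar rigid motions, that rotation and translation do not commute, yet conjugating a translation by a rotation yields another translation, which is the defining property of a normal subgroup. The helper names are hypothetical.

```python
import numpy as np

def rotation(theta):
    """Homogeneous 3x3 matrix for a planar rotation by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def translation(tx, ty):
    """Homogeneous 3x3 matrix for a planar translation by (tx, ty)."""
    return np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])

def is_pure_translation(m, tol=1e-9):
    """True if the linear part is the identity and the last row is (0, 0, 1)."""
    linear_ok = np.allclose(m[:2, :2], np.eye(2), atol=tol)
    last_row_ok = np.allclose(m[2], [0.0, 0.0, 1.0], atol=tol)
    return bool(linear_ok and last_row_ok)

R = rotation(np.pi / 2)
T = translation(1.0, 0.0)

# Rotation and translation do not commute in general.
commute = bool(np.allclose(R @ T, T @ R))

# Conjugating a translation by a rotation stays inside the translations.
conjugate = R @ T @ np.linalg.inv(R)

print("commute:", commute)
print("conjugate is a translation:", is_pure_translation(conjugate))
print("conjugated shift:", np.round(conjugate[:2, 2], 6))
```

Here the conjugated translation moves points along the rotated direction, so the translations stay closed under conjugation even though the two transformations do not commute.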

* 8 pages, 10 figures, conference paper

Enhancing Reusability of Learned Skills for Robot Manipulation via Gaze and Bottleneck

Feb 26, 2025

Ryo Takizawa, Izumi Karino, Koki Nakagawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi

Abstract:Autonomous agents capable of diverse object manipulations should be able to acquire a wide range of manipulation skills with high reusability. Although advances in deep learning have made it increasingly feasible to replicate the dexterity of human teleoperation in robots, generalizing these acquired skills to previously unseen scenarios remains a significant challenge. In this study, we propose a novel algorithm, Gaze-based Bottleneck-aware Robot Manipulation (GazeBot), which enables high reusability of the learned motions even when the object positions and end-effector poses differ from those in the provided demonstrations. By leveraging gaze information and motion bottlenecks, both crucial features for object manipulation, GazeBot achieves high generalization performance compared with state-of-the-art imitation learning methods, without sacrificing its dexterity and reactivity. Furthermore, the training process of GazeBot is entirely data-driven once a demonstration dataset with gaze data is provided. Videos and code are available at https://crumbyrobotics.github.io/gazebot.

Unsupervised categorization of similarity measures

Feb 12, 2025

Yoshiyuki Ohmura, Wataru Shimaya, Yasuo Kuniyoshi

Abstract:In general, objects can be distinguished on the basis of their features, such as color or shape. In particular, it is assumed that similarity judgments about such features can be processed independently in different metric spaces. However, the unsupervised categorization mechanism of metric spaces corresponding to object features remains unknown. Here, we show that an artificial neural network system can autonomously categorize metric spaces through representation learning to satisfy the algebraic independence between neural networks, and project sensory information onto multiple high-dimensional metric spaces to independently evaluate the differences and similarities between features. Conventional methods often constrain the axes of the latent space to be mutually independent or orthogonal. However, independent axes are not suitable for categorizing metric spaces. High-dimensional metric spaces that are independent of each other are not uniquely determined by mutually independent axes, because any combination of independent axes can form mutually independent spaces. In other words, mutually independent axes cannot be used to naturally categorize different feature spaces, such as color space and shape space. Therefore, constraining the axes to be mutually independent makes it difficult to categorize high-dimensional metric spaces. To overcome this problem, we developed a method that constrains only the spaces, and not the axes composing them, to be mutually independent. Our theory provides general conditions for the unsupervised categorization of independent metric spaces, thus advancing the mathematical theory of functional differentiation of neural networks.

* arXiv admin note: substantial text overlap with arXiv:2306.00239

Understanding via Gaze: Gaze-based Task Decomposition for Imitation Learning of Robot Manipulation

Jan 25, 2025

Ryo Takizawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi

Abstract:In imitation learning for robotic manipulation, decomposing object manipulation tasks into multiple semantic actions is essential. This decomposition enables the reuse of learned skills in varying contexts and the combination of acquired skills to perform novel tasks, rather than merely replicating demonstrated motions. Gaze, an evolutionary tool for understanding ongoing events, plays a critical role in human object manipulation, where it strongly correlates with motion planning. In this study, we propose a simple yet robust task decomposition method based on gaze transitions. We hypothesize that an imitation agent's gaze control, fixating on specific landmarks and transitioning between them, naturally segments demonstrated manipulations into sub-tasks. Notably, our method achieves consistent task decomposition across all demonstrations, which is desirable in contexts such as machine learning. Using teleoperation, a common modality in imitation learning for robotic manipulation, we collected demonstration data for various tasks, applied our segmentation method, and evaluated the characteristics and consistency of the resulting sub-tasks. Furthermore, through extensive testing across a wide range of hyperparameter variations, we demonstrated that the proposed method possesses the robustness necessary for application to different robotic systems.
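
The idea of segmenting a demonstration at gaze transitions can be illustrated with a minimal sketch. It assumes, hypothetically, that each timestep has already been labeled with the landmark being fixated. The abstract does not specify the input format or the landmark detection step, so `segment_by_gaze` and the label strings are illustrative only.

```python
from itertools import groupby

def segment_by_gaze(landmarks):
    """Split a demonstration into sub-tasks at gaze transitions.

    landmarks: per-timestep label of the fixated landmark (assumed input).
    Returns a list of (landmark, start, end) tuples with end exclusive.
    """
    segments = []
    start = 0
    for landmark, run in groupby(landmarks):
        length = sum(1 for _ in run)  # number of consecutive timesteps on this landmark
        segments.append((landmark, start, start + length))
        start += length
    return segments

# Toy demonstration: gaze moves from the cup to the plate and back.
demo = ["cup", "cup", "cup", "plate", "plate", "cup"]
segments = segment_by_gaze(demo)
for landmark, start, end in segments:
    print(f"{landmark}: timesteps [{start}, {end})")
```

Because the split depends only on where the fixated landmark changes, two demonstrations with the same gaze sequence but different timing produce the same ordered list of sub-tasks.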

Multi-task robot data for dual-arm fine manipulation

Jan 26, 2024

Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi

Abstract:In the field of robotic manipulation, deep imitation learning is recognized as a promising approach for acquiring manipulation skills. Additionally, learning from diverse robot datasets is considered a viable method to achieve versatility and adaptability. In such research, by learning various tasks, robots achieved generality across multiple objects. However, such multi-task robot datasets have mainly focused on single-arm tasks that are relatively imprecise, not addressing the fine-grained object manipulation that robots are expected to perform in the real world. This paper introduces a dataset of diverse object manipulations that includes dual-arm tasks and/or tasks requiring fine manipulation. To this end, we have generated a dataset with 224k episodes (150 hours, 1,104 language instructions) which includes dual-arm fine tasks such as bowl-moving, pencil-case opening or banana-peeling, and this data is publicly available. Additionally, this dataset includes visual attention signals as well as dual-action labels, a signal that separates actions into a robust reaching trajectory and precise interaction with objects, and language instructions to achieve robust and precise object manipulation. We applied the dataset to our Dual-Action and Attention (DAA), a model designed for fine-grained dual-arm manipulation tasks and robust against covariate shifts. The model was tested with over 7k total trials in real robot manipulation tasks, demonstrating its capability in fine manipulation. The dataset is available at https://sites.google.com/view/multi-task-fine.

* 10 pages, The dataset is available at https://sites.google.com/view/multi-task-fine

Ablation Study to Clarify the Mechanism of Object Segmentation in Multi-Object Representation Learning

Oct 05, 2023

Takayuki Komatsu, Yoshiyuki Ohmura, Yasuo Kuniyoshi

Abstract:Multi-object representation learning aims to represent complex real-world visual input using the composition of multiple objects. Representation learning methods have often used unsupervised learning to segment an input image into individual objects and encode these objects into each latent vector. However, it is not clear how previous methods have achieved the appropriate segmentation of individual objects. Additionally, most of the previous methods regularize the latent vectors using a Variational Autoencoder (VAE). Therefore, it is not clear whether VAE regularization contributes to appropriate object segmentation. To elucidate the mechanism of object segmentation in multi-object representation learning, we conducted an ablation study on MONet, which is a typical method. MONet represents multiple objects using pairs that consist of an attention mask and the latent vector corresponding to the attention mask. Each latent vector is encoded from the input image and attention mask. Then, the component image and attention mask are decoded from each latent vector. The loss function of MONet consists of 1) the sum of reconstruction losses between the input image and decoded component image, 2) the VAE regularization loss of the latent vector, and 3) the reconstruction loss of the attention mask to explicitly encode shape information. We conducted an ablation study on these three loss functions to investigate the effect on segmentation performance. Our results showed that the VAE regularization loss did not affect segmentation performance, whereas the other losses did. Based on this result, we hypothesize that it is important to maximize the attention mask of the image region best represented by a single latent vector corresponding to the attention mask. We confirmed this hypothesis by evaluating a new loss function with the same mechanism as the hypothesis.
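
The three-term loss structure described in this abstract, and what ablating a term means, can be sketched as follows. This is a simplified illustration and not the MONet implementation. A mask-weighted squared error stands in for the reconstruction likelihood, the mask term is a per-pixel KL divergence between attention masks and decoded masks, and the function name, argument shapes, and weights are all assumptions made for the example.

```python
import numpy as np

def monet_style_loss(image, masks, components, mu, logvar, decoded_masks,
                     w_recon=1.0, w_kl=1.0, w_mask=1.0):
    """Weighted sum of three loss terms; set a weight to 0 to ablate that term.

    image:         (H, W, C) input image
    masks:         (K, H, W) attention masks, summing to 1 over K
    components:    (K, H, W, C) decoded component images
    mu, logvar:    (K, D) parameters of the diagonal Gaussian posterior
    decoded_masks: (K, H, W) masks decoded from the latents, summing to 1 over K
    """
    eps = 1e-8
    # 1) Reconstruction: mask-weighted squared error, summed over slots.
    sq_err = ((components - image[None]) ** 2).sum(axis=-1)
    recon = (masks * sq_err).sum()
    # 2) VAE regularization: KL(N(mu, sigma^2) || N(0, I)).
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).sum()
    # 3) Mask reconstruction: KL(attention masks || decoded masks).
    mask_term = (masks * (np.log(masks + eps) - np.log(decoded_masks + eps))).sum()
    terms = {"recon": recon, "kl": kl, "mask": mask_term}
    total = w_recon * recon + w_kl * kl + w_mask * mask_term
    return total, terms

# Toy call with K=2 slots on a 2x2 RGB image; w_kl=0.0 ablates the VAE term.
rng = np.random.default_rng(0)
image = rng.random((2, 2, 3))
masks = np.full((2, 2, 2), 0.5)
components = rng.random((2, 2, 2, 3))
total, terms = monet_style_loss(image, masks, components,
                                np.zeros((2, 4)), np.zeros((2, 4)), masks,
                                w_kl=0.0)
print(total, terms)
```

In this sketch, an ablation run corresponds to training with one weight set to zero and comparing segmentation quality against the full loss.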
