Hironobu Fujiyoshi

DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention

Oct 11, 2024

Nguyen Huu Bao Long, Chenyu Zhang, Yuzhi Shi, Tsubasa Hirakawa, Takayoshi Yamashita, Tohgoroh Matsui, Hironobu Fujiyoshi

Abstract: Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness. Code is available at https://github.com/maclong01/DeBiFormer

* ACCV 2024
* 20 pages, 7 figures. arXiv admin note: text overlap with arXiv:2303.08810 by other authors
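
The top-k region routing that the abstract attributes to BiFormer can be illustrated with a minimal sketch. It assumes that each region is summarized by one descriptor vector and that region affinity is a plain dot product; the function name `topk_routed_regions` and the toy data are hypothetical, and this is not the DBRA module itself.

```python
def topk_routed_regions(region_queries, region_keys, k):
    """For each query region, return the indices of the k key regions
    with the highest dot-product affinity (assumed affinity measure)."""
    routed = []
    for q in region_queries:
        # Region-level affinity between this query region and every key region.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in region_keys]
        # Indices of key regions sorted by descending affinity.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        routed.append(order[:k])
    return routed


# Toy example: two query regions, three key regions, 2-D descriptors.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(topk_routed_regions(queries, keys, k=2))
```

Fine-grained attention would then be computed only between the tokens of a query region and the tokens gathered from its routed regions, which is what makes the attention sparse.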

Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks

Jul 18, 2024

Takumi Komatsu, Motonari Kambara, Shumpei Hatanaka, Haruka Matsuo, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura

Abstract: Domestic service robots (DSRs) that support people in everyday environments have been widely investigated. However, their ability to predict and describe future risks resulting from their own actions remains insufficient. In this study, we focus on the linguistic explainability of DSRs. Most existing methods do not explicitly model the region of possible collisions; thus, they do not properly generate descriptions of these regions. In this paper, we propose the Nearest Neighbor Future Captioning Model that introduces the Nearest Neighbor Language Model for future captioning of possible collisions, which enhances the model output with a nearest neighbors retrieval mechanism. Furthermore, we introduce the Collision Attention Module that attends to regions of possible collisions, which enables our model to generate descriptions that adequately reflect the objects associated with possible collisions. To validate our method, we constructed a new dataset containing samples of collisions that can occur when a DSR places an object in a simulation environment. The experimental results demonstrated that our method outperformed baseline methods on standard metrics. In particular, on CIDEr-D, the baseline method obtained 25.09 points, whereas our method obtained 33.08 points.

* Accepted for presentation at Advanced Robotics 24

Layer-Wise Relevance Propagation with Conservation Property for ResNet

Jul 12, 2024

Seitaro Otsuki, Tsumugi Iida, Félix Doublet, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura

Abstract: The transparent formulation of explanation methods is essential for elucidating the predictions of neural networks, which are typically black-box models. Layer-wise Relevance Propagation (LRP) is a well-established method that transparently traces the flow of a model's prediction backward through its architecture by backpropagating relevance scores. However, the conventional LRP does not fully consider the existence of skip connections, and thus its application to the widely used ResNet architecture has not been thoroughly explored. In this study, we extend LRP to ResNet models by introducing Relevance Splitting at points where the output from a skip connection converges with that from a residual block. Our formulation guarantees the conservation property throughout the process, thereby preserving the integrity of the generated explanations. To evaluate the effectiveness of our approach, we conduct experiments on ImageNet and the Caltech-UCSD Birds-200-2011 dataset. Our method achieves superior performance to that of baseline methods on standard evaluation metrics such as the Insertion-Deletion score while maintaining its conservation property. We will release our code for further research at https://5ei74r0.github.io/lrp-for-resnet.page/

* Accepted for presentation at ECCV2024
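
The conservation property at a skip-connection merge can be illustrated with a small sketch. The abstract does not state the splitting ratio, so the rule below (each branch receives relevance in proportion to the absolute value of its contribution, with an equal split when both are zero) is an assumption, and `split_relevance` is a hypothetical name rather than the paper's formulation.

```python
def split_relevance(relevance, skip_out, residual_out, eps=1e-12):
    """Split per-element relevance between the skip branch and the residual
    branch so that the two parts always sum to the incoming relevance."""
    skip_rel, res_rel = [], []
    for r, s, f in zip(relevance, skip_out, residual_out):
        total = abs(s) + abs(f)
        # Assumed ratio: share proportional to |contribution|; equal split at zero.
        share = abs(s) / total if total > eps else 0.5
        skip_rel.append(r * share)
        res_rel.append(r * (1.0 - share))
    return skip_rel, res_rel


relevance = [1.0, 2.0, -0.5]
skip_rel, res_rel = split_relevance(relevance, [3.0, 0.0, 0.0], [1.0, 2.0, 0.0])
# Conservation check: the two branches together carry the incoming relevance.
print(sum(skip_rel) + sum(res_rel), sum(relevance))
```

Whatever ratio is chosen, the two shares sum to one for every element, so the total relevance is unchanged by the split.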

Action Q-Transformer: Visual Explanation in Deep Reinforcement Learning with Encoder-Decoder Model using Action Query

Jun 24, 2023

Hidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura

Abstract: The excellent performance of Transformer in supervised learning has led to growing interest in its potential application to deep reinforcement learning (DRL) to achieve high performance on a wide variety of problems. However, the decision making of a DRL agent is a black box, which greatly hinders the application of the agent to real-world problems. To address this problem, we propose the Action Q-Transformer (AQT), which introduces a transformer encoder-decoder structure to Q-learning based DRL methods. In AQT, the encoder calculates the state value function and the decoder calculates the advantage function to promote the acquisition of different attentions indicating the agent's decision-making. The decoder in AQT utilizes action queries, which represent the information of each action, as queries. This enables us to obtain the attentions for the state value and for each action. By acquiring and visualizing these attentions that detail the agent's decision-making, we achieve a DRL model with high interpretability. In this paper, we show that visualization of attention in Atari 2600 games enables detailed analysis of agents' decision-making in various game tasks. Further, experimental results demonstrate that our method can achieve higher performance than the baseline in some games.

* 16 pages, 8 figures, 3 tables
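
The abstract says the encoder produces the state value and the decoder the per-action advantage, but not how the two are combined. A common choice in dueling architectures is Q(s, a) = V(s) + A(s, a) - mean(A(s, ·)); the sketch below assumes that rule, and the name `q_values` is hypothetical.

```python
def q_values(state_value, advantages):
    """Combine a scalar state value with per-action advantages using the
    mean-subtracted dueling rule (assumed; not confirmed by the abstract)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


# One state value from the encoder side, three advantages from the decoder side.
print(q_values(1.0, [0.0, 1.0, 2.0]))
```

Subtracting the mean advantage keeps the average Q-value equal to the state value, which makes the two quantities identifiable.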

Learning from AI: An Interactive Learning Method Using a DNN Model Incorporating Expert Knowledge as a Teacher

Jun 04, 2023

Kohei Hattori, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi

Abstract: Visual explanation is an approach for visualizing the grounds of judgment by deep learning; visualizing an attention map makes it possible to visually interpret the grounds of a judgment for a certain input. For deep-learning models that output erroneous decision-making grounds, a method has been proposed that incorporates expert human knowledge into the model via an attention map, improving explanatory power and recognition accuracy. In this study, based on a deep-learning model that incorporates the knowledge of experts, we propose a method by which a learner "learns from AI" the grounds for its decisions. An "attention branch network" (ABN), which has been fine-tuned with attention maps modified by experts, is prepared as a teacher. By using an interactive editing tool for the fine-tuned ABN and attention maps, the learner learns by editing the attention maps and changing the inference results. By repeatedly editing the attention maps and making inferences so that the correct recognition results are output, the learner can acquire the grounds for the expert's judgments embedded in the ABN. The results of an evaluation experiment with subjects show that learning using the proposed method is more efficient than the conventional method.

* 12 pages, 5 figures

Masking and Mixing Adversarial Training

Feb 16, 2023

Hiroki Adachi, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Yasunori Ishii, Kazuki Kozuka

Abstract: While convolutional neural networks (CNNs) have achieved excellent performance in various computer vision tasks, they often misclassify malicious samples, a.k.a. adversarial examples. Adversarial training is a popular and straightforward technique to defend against the threat of adversarial examples. Unfortunately, CNNs must sacrifice the accuracy of standard samples to improve robustness against adversarial examples when adversarial training is used. In this work, we propose Masking and Mixing Adversarial Training (M2AT) to mitigate the trade-off between accuracy and robustness. We focus on creating diverse adversarial examples during training. Specifically, our approach consists of two processes: 1) masking a perturbation with a binary mask and 2) mixing two partially perturbed images. Experimental results on the CIFAR-10 dataset demonstrate that our method achieves better robustness against several adversarial attacks than previous methods.
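
The two processes named in the abstract, masking a perturbation and mixing two partially perturbed images, can be sketched as follows. Several details are assumptions not given in the abstract: the second image uses the complementary mask, the mask is sampled per pixel, and the mixing is a convex combination with weight `lam`. Images are flat lists of floats, and all function names are hypothetical.

```python
import random


def mask_perturbation(image, perturbation, mask):
    """Add the perturbation only at positions where the binary mask is 1."""
    return [x + m * d for x, d, m in zip(image, perturbation, mask)]


def mix(image_a, image_b, lam):
    """Convex combination of two images (assumed mixing rule)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]


def masked_mixed_example(image, perturbation, lam, rng):
    """Build one training example from two complementary partial perturbations."""
    mask = [rng.randint(0, 1) for _ in image]   # random binary mask
    inverse = [1 - m for m in mask]             # complementary mask (assumption)
    part_a = mask_perturbation(image, perturbation, mask)
    part_b = mask_perturbation(image, perturbation, inverse)
    return mix(part_a, part_b, lam), mask


image = [0.5, 0.5, 0.5, 0.5]
perturbation = [0.1, -0.1, 0.1, -0.1]
mixed, mask = masked_mixed_example(image, perturbation, 0.5, random.Random(0))
print(mask, mixed)
```

Under these assumptions each pixel ends up perturbed by either `lam` or `1 - lam` times its original perturbation, so varying the mask and `lam` yields many distinct adversarial examples from one perturbation.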

Data Augmentation by Selecting Mixed Classes Considering Distance Between Classes

Sep 12, 2022

Shungo Fujii, Yasunori Ishii, Kazuki Kozuka, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi

Abstract: Data augmentation is an essential technique for improving recognition accuracy in object recognition using deep learning. Methods that generate mixed data from multiple data sets, such as mixup, can acquire new diversity that is not included in the training data, and thus contribute significantly to accuracy improvement. However, since the data selected for mixing are randomly sampled throughout the training process, there are cases where appropriate classes or data are not selected. In this study, we propose a data augmentation method that calculates the distance between classes based on class probabilities and can select data from suitable classes to be mixed in the training process. Mixture data is dynamically adjusted according to the training trend of each class to facilitate training. The proposed method is applied in combination with conventional methods for generating mixed data. Evaluation experiments show that the proposed method improves recognition performance on general and long-tailed image recognition datasets.
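
One way to picture selecting a mixing partner from class distances is sketched below. The abstract does not specify the distance or the selection rule, so both are assumptions here: distance is Euclidean between per-class mean predicted probability vectors, and the partner is the nearest other class. The names `class_distances` and `nearest_class` and the toy probabilities are hypothetical.

```python
import math


def class_distances(mean_probs):
    """Pairwise Euclidean distance between per-class mean probability vectors."""
    n = len(mean_probs)
    return [[math.dist(mean_probs[i], mean_probs[j]) for j in range(n)]
            for i in range(n)]


def nearest_class(distances, cls):
    """Pick the closest other class as the mixing partner (assumed rule)."""
    candidates = [j for j in range(len(distances)) if j != cls]
    return min(candidates, key=lambda j: distances[cls][j])


# Toy mean softmax outputs for samples of classes 0, 1 and 2.
mean_probs = [
    [0.80, 0.15, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.20, 0.70],
]
distances = class_distances(mean_probs)
print([nearest_class(distances, c) for c in range(3)])
```

Because the mean probabilities change as training progresses, recomputing the distances periodically would let the chosen partner classes follow the training trend of each class.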

Visual Explanation of Deep Q-Network for Robot Navigation by Fine-tuning Attention Branch

Aug 18, 2022

Yuya Maruyama, Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura

Abstract: Robot navigation with deep reinforcement learning (RL) achieves high performance and performs well in complex environments. Meanwhile, interpreting the decision-making of deep RL models has become a critical problem for the safety and reliability of autonomous robots. In this paper, we propose a visual explanation method based on an attention branch for deep RL models. We connect an attention branch to a pre-trained deep RL model and train the branch in a supervised manner, using the action selected by the trained deep RL model as the correct label. Because the attention branch is trained to output the same result as the deep RL model, the obtained attention maps correspond to the agent's actions and are more interpretable. Experimental results on a robot navigation task show that the proposed method can generate interpretable attention maps for visual explanation.

* 8 pages, 8 figures, 1 table
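
The supervised signal described in the abstract, where the action chosen by the trained deep RL model serves as the label for the attention branch, can be sketched in a few lines. The greedy argmax over Q-values and the cross-entropy loss are assumptions about the details, and the function names are hypothetical.

```python
import math


def action_label(q_values):
    """Action selected by the trained deep RL model (greedy argmax, assumed)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def cross_entropy(logits, label):
    """Cross-entropy of the attention branch's logits against the label,
    computed with a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


q_from_rl_model = [0.1, 0.9, 0.3]      # toy Q-values from the frozen RL model
branch_logits = [0.0, 3.0, 0.0]        # toy output of the attention branch
label = action_label(q_from_rl_model)
print(label, cross_entropy(branch_logits, label))
```

Minimizing this loss pushes the branch to reproduce the RL model's action, so the attention map it learns along the way reflects the regions relevant to that action.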

Object-ABN: Learning to Generate Sharp Attention Maps for Action Recognition

Jul 27, 2022

Tomoya Nitta, Tsubasa Hirakawa, Hironobu Fujiyoshi, Toru Tamaki

Abstract: In this paper, we propose an extension of the Attention Branch Network (ABN) by using instance segmentation for generating sharper attention maps for action recognition. Methods for visual explanation such as Grad-CAM usually generate blurry maps which are not intuitive for humans to understand, particularly in recognizing actions of people in videos. Our proposed method, Object-ABN, tackles this issue by introducing a new mask loss that makes the generated attention maps close to the instance segmentation result. Further, the PC loss and multiple attention maps are introduced to enhance the sharpness of the maps and improve the performance of classification. Experimental results with UCF101 and SSv2 show that the maps generated by the proposed method are much clearer qualitatively and quantitatively than those of the original ABN.

* 9 pages

ST-ABN: Visual Explanation Taking into Account Spatio-temporal Information for Video Recognition

Oct 29, 2021

Masahiro Mitsuhara, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi

Abstract: It is difficult for people to interpret the decision-making in the inference process of deep neural networks. Visual explanation is one method for interpreting the decision-making of deep learning. It analyzes the decision-making of 2D CNNs by visualizing an attention map that highlights discriminative regions. Visual explanation for interpreting the decision-making process in video recognition is more difficult because it is necessary to consider not only spatial but also temporal information, which is different from the case of still images. In this paper, we propose a visual explanation method called spatio-temporal attention branch network (ST-ABN) for video recognition. It enables visual explanation for both spatial and temporal information. ST-ABN acquires the importance of spatial and temporal information during network inference and applies it to recognition processing to improve recognition performance and visual explainability. Experimental results with Something-Something datasets V1 & V2 demonstrated that ST-ABN enables visual explanation that takes into account spatial and temporal information simultaneously and improves recognition performance.

* 15 pages, 3 figures
