Sekitoshi Kanai

Post-pre-training for Modality Alignment in Vision-Language Foundation Models

Apr 17, 2025

Shin'ya Yamaguchi, Dewei Feng, Sekitoshi Kanai, Kazuki Adachi, Daiki Chijiwa

Abstract:Contrastive language image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces still suffer from a modality gap, which is a gap between image and text feature clusters and limits downstream task performance. Although existing works attempt to address the modality gap by modifying pre-training or fine-tuning, they struggle with heavy training costs with large datasets or degradations of zero-shot performance. This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. CLIP-Refine aims to align the feature space with 1 epoch training on small image-text datasets without zero-shot performance degradations. To this end, we introduce two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA aligns the image and text features to follow a shared prior distribution by minimizing the distance to random reference vectors sampled from the prior. HyCD updates the model with hybrid soft labels generated by combining ground-truth image-text pair labels and outputs from the pre-trained CLIP model. This contributes to achieving both maintaining the past knowledge and learning new knowledge to align features. Our extensive experiments with multiple classification and retrieval tasks show that CLIP-Refine succeeds in mitigating the modality gap and improving the zero-shot performance.

* Accepted to CVPR 2025; Code: https://github.com/yshinya6/clip-refine
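
The hybrid soft labels described for HyCD can be illustrated with a minimal sketch. The abstract only says that ground-truth pair labels are combined with the pre-trained model's outputs; the convex mixing rule, the weight `alpha`, and every function name below are assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hybrid_soft_labels(teacher_logits, true_index, alpha=0.5):
    """Blend a one-hot pair label with the frozen teacher's softmax output.

    alpha is a hypothetical mixing weight (assumption, not from the abstract).
    """
    teacher_probs = softmax(teacher_logits)
    return [
        alpha * (1.0 if j == true_index else 0.0) + (1.0 - alpha) * p
        for j, p in enumerate(teacher_probs)
    ]

def soft_cross_entropy(student_logits, target_probs):
    # Cross-entropy between soft targets and the student's softmax.
    student_probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, student_probs))

# Toy example: one image scored against 3 captions; caption 0 is the true pair.
teacher_logits = [2.0, 1.0, 0.0]   # similarities from the frozen pre-trained model
student_logits = [1.5, 1.2, 0.1]   # similarities from the model being updated
targets = hybrid_soft_labels(teacher_logits, true_index=0, alpha=0.5)
loss = soft_cross_entropy(student_logits, targets)
print(targets, loss)
```

With `alpha=1.0` the targets reduce to the one-hot pair labels, and with `alpha=0.0` they reduce to plain distillation from the pre-trained model.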

Evaluating Time-Series Training Dataset through Lens of Spectrum in Deep State Space Models

Aug 29, 2024

Sekitoshi Kanai, Yasutoshi Ida, Kazuki Adachi, Mihiro Uchida, Tsukasa Yoshida, Shin'ya Yamaguchi

Abstract:This study investigates a method to evaluate time-series datasets in terms of the performance of deep neural networks (DNNs) with state space models (deep SSMs) trained on the dataset. SSMs have attracted attention as components inside DNNs to address time-series data. Since deep SSMs have powerful representation capacities, training datasets play a crucial role in solving a new task. However, the effectiveness of training datasets cannot be known until deep SSMs are actually trained on them. This can increase the cost of data collection for new tasks, as a trial-and-error process of data collection and time-consuming training are needed to achieve the necessary performance. To advance the practical use of deep SSMs, the metric of datasets to estimate the performance early in the training can be one key element. To this end, we introduce the concept of data evaluation methods used in system identification. In system identification of linear dynamical systems, the effectiveness of datasets is evaluated by using the spectrum of input signals. We introduce this concept to deep SSMs, which are nonlinear dynamical systems. We propose the K-spectral metric, which is the sum of the top-K spectra of signals inside deep SSMs, by focusing on the fact that each layer of a deep SSM can be regarded as a linear dynamical system. Our experiments show that the K-spectral metric has a large absolute value of the correlation coefficient with the performance and can be used to evaluate the quality of training datasets.

* 11 pages, 5 figures
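
As a rough illustration of a "sum of the top-K spectra" score, the sketch below computes the power spectrum of a one-dimensional signal and sums its K largest values. It assumes that "spectra" means squared DFT magnitudes and that the toy sequences stand in for a signal inside a deep SSM; the paper's exact definition may differ, and the function names are hypothetical.

```python
import cmath
import math

def power_spectrum(signal):
    """Squared DFT magnitudes of a real-valued sequence (naive O(N^2) DFT)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def top_k_spectral_sum(signal, k):
    # Sum of the k largest power-spectrum values (assumed reading of "top-K spectra").
    return sum(sorted(power_spectrum(signal), reverse=True)[:k])

n = 8
# A single tone excites one pair of frequency bins.
tone = [math.cos(2 * math.pi * t / n) for t in range(n)]
# Two tones excite two pairs of bins, so the signal is spectrally richer.
rich = [math.cos(2 * math.pi * t / n) + math.cos(2 * math.pi * 3 * t / n) for t in range(n)]

score_tone = top_k_spectral_sum(tone, k=4)
score_rich = top_k_spectral_sum(rich, k=4)
print(score_tone, score_rich)
```

In this toy setting the two-tone signal scores higher at K=4 because it carries energy in more frequency bins, which mirrors the system-identification intuition that spectrally rich inputs are more informative.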

Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks

Mar 15, 2024

Shin'ya Yamaguchi, Sekitoshi Kanai, Kazuki Adachi, Daiki Chijiwa

Abstract:While fine-tuning is a de facto standard method for training deep neural networks, it still suffers from overfitting when using small target datasets. Previous methods improve fine-tuning performance by maintaining knowledge of the source datasets or introducing regularization terms such as contrastive loss. However, these methods require auxiliary source information (e.g., source labels or datasets) or heavy additional computations. In this paper, we propose a simple method called adaptive random feature regularization (AdaRand). AdaRand helps the feature extractors of training models to adaptively change the distribution of feature vectors for downstream classification tasks without auxiliary source information and with reasonable computation costs. To this end, AdaRand minimizes the gap between feature vectors and random reference vectors that are sampled from class conditional Gaussian distributions. Furthermore, AdaRand dynamically updates the conditional distribution to follow the currently updated feature extractors and balance the distance between classes in feature spaces. Our experiments show that AdaRand outperforms the other fine-tuning regularization, which requires auxiliary source information and heavy computation costs.

* Accepted to CVPR 2024
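
The idea of pulling features toward random reference vectors drawn from class-conditional Gaussians can be sketched as follows. The squared-distance penalty, the fixed standard deviation, and the moving-average update of the class mean are all assumptions chosen for illustration; the abstract does not specify the update rule, and the class-distance balancing it mentions is not covered here.

```python
import random

def sample_reference(mean, std, rng):
    # Draw a reference vector from an isotropic Gaussian around the class mean.
    return [rng.gauss(m, std) for m in mean]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def update_class_mean(mean, feature, momentum=0.9):
    """Hypothetical update rule: exponential moving average toward the latest feature."""
    return [momentum * m + (1.0 - momentum) * f for m, f in zip(mean, feature)]

rng = random.Random(0)
class_means = {0: [0.0, 0.0], 1: [1.0, 1.0]}  # toy per-class Gaussian means
feature, label = [0.8, 1.4], 1                  # toy feature vector and its class

# Regularization term: distance between the feature and a sampled reference.
reference = sample_reference(class_means[label], std=0.1, rng=rng)
reg_loss = squared_distance(feature, reference)

# Let the class-conditional distribution follow the current feature extractor.
class_means[label] = update_class_mean(class_means[label], feature)
print(reg_loss, class_means[label])
```

In training, `reg_loss` would be added to the classification loss with some weight; that weight is likewise not given in the abstract.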

Adversarial Finetuning with Latent Representation Constraint to Mitigate Accuracy-Robustness Tradeoff

Aug 31, 2023

Satoshi Suzuki, Shin'ya Yamaguchi, Shoichiro Takeda, Sekitoshi Kanai, Naoki Makishima, Atsushi Ando, Ryo Masumura

Abstract:This paper addresses the tradeoff between standard accuracy on clean examples and robustness against adversarial examples in deep neural networks (DNNs). Although adversarial training (AT) improves robustness, it degrades the standard accuracy, thus yielding the tradeoff. To mitigate this tradeoff, we propose a novel AT method called ARREST, which comprises three components: (i) adversarial finetuning (AFT), (ii) representation-guided knowledge distillation (RGKD), and (iii) noisy replay (NR). AFT trains a DNN on adversarial examples by initializing its parameters with a DNN that is standardly pretrained on clean examples. RGKD and NR respectively entail a regularization term and an algorithm to preserve latent representations of clean examples during AFT. RGKD penalizes the distance between the representations of the standardly pretrained and AFT DNNs. NR switches input adversarial examples to nonadversarial ones when the representation changes significantly during AFT. By combining these components, ARREST achieves both high standard accuracy and robustness. Experimental results demonstrate that ARREST mitigates the tradeoff more effectively than previous AT-based methods do.

* Accepted by International Conference on Computer Vision (ICCV) 2023
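
The two representation-preserving components can be sketched in a few lines: a distance penalty between the pretrained and finetuned representations (in the spirit of RGKD) and a switch that replaces the adversarial input once that distance grows too large (in the spirit of NR). The Euclidean distance, the threshold test, the loss weight, and all names are assumptions for illustration only.

```python
import math

def representation_distance(reference_repr, current_repr):
    # Euclidean distance between the pretrained and the finetuned representation.
    return math.sqrt(sum((r - c) ** 2 for r, c in zip(reference_repr, current_repr)))

def choose_input(nonadversarial_x, adversarial_x, reference_repr, current_repr, threshold):
    """Replay-style switch: fall back to the nonadversarial input on large drift.

    The threshold rule is a hypothetical stand-in for "changes significantly".
    """
    drift = representation_distance(reference_repr, current_repr)
    return nonadversarial_x if drift > threshold else adversarial_x

def total_loss(adversarial_loss, reference_repr, current_repr, weight=1.0):
    # Adversarial training loss plus a weighted representation-distance penalty.
    return adversarial_loss + weight * representation_distance(reference_repr, current_repr)

reference_repr = [1.0, 0.0]   # representation from the standardly pretrained DNN
current_repr = [1.0, 0.5]     # representation from the DNN being finetuned

picked_strict = choose_input("x_plain", "x_adv", reference_repr, current_repr, threshold=0.3)
picked_loose = choose_input("x_plain", "x_adv", reference_repr, current_repr, threshold=1.0)
loss_value = total_loss(2.0, reference_repr, current_repr, weight=2.0)
print(picked_strict, picked_loose, loss_value)
```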

Regularizing Neural Networks with Meta-Learning Generative Models

Jul 26, 2023

Shin'ya Yamaguchi, Daiki Chijiwa, Sekitoshi Kanai, Atsutoshi Kumagai, Hisashi Kashima

Abstract:This paper investigates methods for improving generative data augmentation for deep learning. Generative data augmentation leverages the synthetic samples produced by generative models as an additional dataset for classification with small dataset settings. A key challenge of generative data augmentation is that the synthetic data contain uninformative samples that degrade accuracy. This is because the synthetic samples do not perfectly represent class categories in real data and uniform sampling does not necessarily provide useful samples for tasks. In this paper, we present a novel strategy for generative data augmentation called meta generative regularization (MGR). To avoid the degradation of generative data augmentation, MGR utilizes synthetic samples in the regularization term for feature extractors instead of in the loss function, e.g., cross-entropy. These synthetic samples are dynamically determined to minimize the validation losses through meta-learning. We observed that MGR can avoid the performance degradation of naïve generative data augmentation and boost the baselines. Experiments on six datasets showed that MGR is effective particularly when datasets are smaller and stably outperforms baselines.

* Accepted to Data-centric Machine Learning Research (DMLR) Workshop at ICML 2023

Fast Regularized Discrete Optimal Transport with Group-Sparse Regularizers

Mar 14, 2023

Yasutoshi Ida, Sekitoshi Kanai, Kazuki Adachi, Atsutoshi Kumagai, Yasuhiro Fujiwara

Abstract:Regularized discrete optimal transport (OT) is a powerful tool to measure the distance between two discrete distributions that have been constructed from data samples on two different domains. While it has a wide range of applications in machine learning, in some cases the sampled data from only one of the domains will have class labels such as unsupervised domain adaptation. In this kind of problem setting, a group-sparse regularizer is frequently leveraged as a regularization term to handle class labels. In particular, it can preserve the label structure on the data samples by corresponding the data samples with the same class label to one group-sparse regularization term. As a result, we can measure the distance while utilizing label information by solving the regularized optimization problem with gradient-based algorithms. However, the gradient computation is expensive when the number of classes or data samples is large because the number of regularization terms and their respective sizes also turn out to be large. This paper proposes fast discrete OT with group-sparse regularizers. Our method is based on two ideas. The first is to safely skip the computations of the gradients that must be zero. The second is to efficiently extract the gradients that are expected to be nonzero. Our method is guaranteed to return the same value of the objective function as that of the original method. Experiments show that our method is up to 8.6 times faster than the original method without degrading accuracy.

* This is an extended version of the paper accepted by the 37th AAAI Conference on Artificial Intelligence (AAAI 2023)

Fast Saturating Gate for Learning Long Time Scales with Recurrent Neural Networks

Oct 04, 2022

Kentaro Ohno, Sekitoshi Kanai, Yasutoshi Ida

Abstract:Gate functions in recurrent models, such as an LSTM and GRU, play a central role in learning various time scales in modeling time series data by using a bounded activation function. However, it is difficult to train gates to capture extremely long time scales due to gradient vanishing of the bounded function for large inputs, which is known as the saturation problem. We closely analyze the relation between saturation of the gate function and efficiency of the training. We prove that the gradient vanishing of the gate function can be mitigated by accelerating the convergence of the saturating function, i.e., making the output of the function converge to 0 or 1 faster. Based on the analysis results, we propose a gate function called fast gate that has a doubly exponential convergence rate with respect to inputs by simple function composition. We empirically show that our method outperforms previous methods in accuracy and computational efficiency on benchmark tasks involving extremely long time scales.

* 9 pages of main texts with 4 pages appendices, 12 figures
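
To make "doubly exponential convergence by simple function composition" concrete, the sketch below composes a sigmoid with `sinh`. Since 1 - sigmoid(u) decays like exp(-u) and sinh(x) grows like exp(x)/2, the composed gate approaches 1 roughly like exp(-exp(x)/2). That this particular composition is the paper's fast gate is an assumption here; the abstract does not name the inner function.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def fast_gate(x):
    """Sigmoid composed with sinh (assumed form of a fast-saturating gate)."""
    return sigmoid(math.sinh(x))

# Compare how quickly each gate approaches 1 as the input grows.
for x in (1.0, 2.0, 3.0, 4.0):
    print(x, 1.0 - sigmoid(x), 1.0 - fast_gate(x))
```

The printed gaps to 1 shrink far faster for the composed gate than for the plain sigmoid, which is the saturation behavior the abstract describes.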

Switching One-Versus-the-Rest Loss to Increase the Margin of Logits for Adversarial Robustness

Jul 21, 2022

Sekitoshi Kanai, Shin'ya Yamaguchi, Masanori Yamada, Hiroshi Takahashi, Yasutoshi Ida

Abstract:Defending deep neural networks against adversarial examples is a key challenge for AI safety. To improve the robustness effectively, recent methods focus on important data points near the decision boundary in adversarial training. However, these methods are vulnerable to Auto-Attack, which is an ensemble of parameter-free attacks for reliable evaluation. In this paper, we experimentally investigate the causes of their vulnerability and find that existing methods reduce margins between logits for the true label and the other labels while their gradient norms remain non-small. Reduced margins and non-small gradient norms cause their vulnerability since the largest logit can be easily flipped by the perturbation. Our experiments also show that the histogram of the logit margins has two peaks, i.e., small and large logit margins. From the observations, we propose switching one-versus-the-rest loss (SOVR), which uses one-versus-the-rest loss when data have small logit margins so that it increases the margins. We find that SOVR increases logit margins more than existing methods while keeping gradient norms small and outperforms them in terms of the robustness against Auto-Attack.

* 20 pages, 16 figures
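
The switching idea can be sketched as a per-example rule: compute the logit margin, and use a one-versus-the-rest loss when the margin is small. The fixed threshold, the use of cross-entropy for large-margin examples, and the specific one-versus-the-rest form below are assumptions for illustration; the abstract does not state how "small" is decided.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def logit_margin(logits, label):
    # True-class logit minus the largest other logit.
    others = [z for k, z in enumerate(logits) if k != label]
    return logits[label] - max(others)

def cross_entropy(logits, label):
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

def one_vs_rest(logits, label):
    # -log sigmoid(z_y) - sum_{k != y} log(1 - sigmoid(z_k)), using 1 - sigmoid(z) = sigmoid(-z).
    loss = -log_sigmoid(logits[label])
    loss -= sum(log_sigmoid(-z) for k, z in enumerate(logits) if k != label)
    return loss

def switching_loss(logits, label, margin_threshold=1.0):
    """Use one-vs-rest for small margins, cross-entropy otherwise (threshold is hypothetical)."""
    if logit_margin(logits, label) < margin_threshold:
        return one_vs_rest(logits, label)
    return cross_entropy(logits, label)

small_margin = [2.0, 1.5, 0.0]   # margin 0.5: treated as near the decision boundary
large_margin = [5.0, 1.0, 0.0]   # margin 4.0: treated as far from it
print(switching_loss(small_margin, 0), switching_loss(large_margin, 0))
```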

Transfer Learning with Pre-trained Conditional Generative Models

Apr 27, 2022

Shin'ya Yamaguchi, Sekitoshi Kanai, Atsutoshi Kumagai, Daiki Chijiwa, Hisashi Kashima

Abstract:Transfer learning is crucial in training deep neural networks on new target tasks. Current transfer learning methods generally assume at least one of the following: (i) source and target task label spaces overlap, (ii) source datasets are available, and (iii) target network architectures are consistent with source ones. However, all of these assumptions are difficult to satisfy in practical settings because the target task rarely has the same labels as the source task, the source dataset access is restricted due to licensing and storage costs, and the target architecture is often specialized to each task. To transfer source knowledge without these assumptions, we propose a transfer learning method that uses deep generative models and is composed of the following two stages: pseudo pre-training (PP) and pseudo semi-supervised learning (P-SSL). PP trains a target architecture with a synthesized dataset by using conditional source generative models. P-SSL applies SSL algorithms to labeled target data and unlabeled pseudo samples, which are generated by cascading the source classifier and generative models to condition them with target samples. Our experimental results indicate that our method can outperform baselines of scratch training and knowledge distillation.

* 22 pages, 6 figures

F-Drop&Match: GANs with a Dead Zone in the High-Frequency Domain

Jun 04, 2021

Shin'ya Yamaguchi, Sekitoshi Kanai

Abstract:Generative adversarial networks built from deep convolutional neural networks (GANs) lack the ability to exactly replicate the high-frequency components of natural images. To alleviate this issue, we introduce two novel training techniques called frequency dropping (F-Drop) and frequency matching (F-Match). The key idea of F-Drop is to filter out unnecessary high-frequency components from the input images of the discriminators. This simple modification prevents the discriminators from being confused by perturbations of the high-frequency components. In addition, F-Drop makes the GANs focus on fitting in the low-frequency domain, in which there are the dominant components of natural images. F-Match minimizes the difference between real and fake images in the frequency domain for generating more realistic images. F-Match is implemented as a regularization term in the objective functions of the generators; it penalizes the batch mean error in the frequency domain. F-Match helps the generators to fit in the high-frequency domain filtered out by F-Drop to the real image. We experimentally demonstrate that the combination of F-Drop and F-Match improves the generative performance of GANs in both the frequency and spatial domain on multiple image benchmarks (CIFAR, TinyImageNet, STL-10, CelebA, and ImageNet).

* Under Review
