Abstract:Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods in training-free FSL (i.e., without the requirement of additional training) mainly learn different modalities independently, leading to two essential issues: 1) severe anomalous match in image modality; 2) varying quality of generated text prompts. To address these issues, we build a mutual guidance mechanism, that introduces an Image-Guided-Text (IGT) component to rectify varying quality of text prompts through image representations, and a Text-Guided-Image (TGI) component to mitigate the anomalous match of image modality through text representations. By integrating IGT and TGI, we adopt a perspective of Text-Image Mutual guidance Optimization, proposing TIMO. Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method. Additionally, by exploring the extent of mutual guidance, we propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% with approximately 100 times less time cost. Our code is available at https://github.com/lyymuwu/TIMO.
Abstract:Domain Generalization (DG) aims to enable models to generalize to unseen target domains by learning from multiple source domains. Existing DG methods primarily rely on convolutional neural networks (CNNs), which inherently learn texture biases due to their limited receptive fields, making them prone to overfitting source domains. While some works have introduced transformer-based methods (ViTs) for DG to leverage the global receptive field, these methods incur high computational costs due to the quadratic complexity of self-attention. Recently, advanced state space models (SSMs), represented by Mamba, have shown promising results in supervised learning tasks by achieving linear complexity in sequence length during training and fast RNN-like computation during inference. Inspired by this, we investigate the generalization ability of the Mamba model under domain shifts and find that input-dependent matrices within SSMs could accumulate and amplify domain-specific features, thus hindering model generalization. To address this issue, we propose a novel SSM-based architecture with saliency-based token-aware transformation (namely START), which achieves state-of-the-art (SOTA) performances and offers a competitive alternative to CNNs and ViTs. Our START can selectively perturb and suppress domain-specific features in salient tokens within the input-dependent matrices of SSMs, thus effectively reducing the discrepancy between different domains. Extensive experiments on five benchmarks demonstrate that START outperforms existing SOTA DG methods with efficient linear complexity. Our code is available at https://github.com/lingeringlight/START.
Abstract:Domain generalization (DG) aims to enhance the model robustness against domain shifts without accessing target domains. A prevalent category of methods for DG is data augmentation, which focuses on generating virtual samples to simulate domain shifts. However, existing augmentation techniques in DG are mainly tailored for convolutional neural networks (CNNs), with limited exploration in token-based architectures, i.e., vision transformer (ViT) and multi-layer perceptrons (MLP) models. In this paper, we study the impact of prior CNN-based augmentation methods on token-based models, revealing their performance is suboptimal due to the lack of incentivizing the model to learn holistic shape information. To tackle the issue, we propose the SEmantic-aware Token Augmentation (SETA) method. SETA transforms token features by perturbing local edge cues while preserving global shape features, thereby enhancing the model learning of shape information. To further enhance the generalization ability of the model, we introduce two stylized variants of our method combined with two state-of-the-art style augmentation methods in DG. We provide a theoretical insight into our method, demonstrating its effectiveness in reducing the generalization risk bound. Comprehensive experiments on five benchmarks prove that our method achieves SOTA performances across various ViT and MLP architectures. Our code is available at https://github.com/lingeringlight/SETA.
Abstract:Domain generalization (DG) intends to train a model on multiple source domains to ensure that it can generalize well to an arbitrary unseen target domain. The acquisition of domain-invariant representations is pivotal for DG as they possess the ability to capture the inherent semantic information of the data, mitigate the influence of domain shift, and enhance the generalization capability of the model. Adopting multiple perspectives, such as the sample and the feature, proves to be effective. The sample perspective facilitates data augmentation through data manipulation techniques, whereas the feature perspective enables the extraction of meaningful generalization features. In this paper, we focus on improving the generalization ability of the model by compelling it to acquire domain-invariant representations from both the sample and feature perspectives by disentangling spurious correlations and enhancing potential correlations. 1) From the sample perspective, we develop a frequency restriction module, guiding the model to focus on the relevant correlations between object features and labels, thereby disentangling spurious correlations. 2) From the feature perspective, the simple Tail Interaction module implicitly enhances potential correlations among all samples from all source domains, facilitating the acquisition of domain-invariant representations across multiple domains for the model. The experimental results show that Convolutional Neural Networks (CNNs) or Multi-Layer Perceptrons (MLPs) with a strong baseline embedded with these two modules can achieve superior results, e.g., an average accuracy of 92.30% on Digits-DG.
Abstract:Deep Neural Networks have exhibited considerable success in various visual tasks. However, when applied to unseen test datasets, state-of-the-art models often suffer performance degradation due to domain shifts. In this paper, we introduce a novel approach for domain generalization from a novel perspective of enhancing the robustness of channels in feature maps to domain shifts. We observe that models trained on source domains contain a substantial number of channels that exhibit unstable activations across different domains, which are inclined to capture domain-specific features and behave abnormally when exposed to unseen target domains. To address the issue, we propose a DomainDrop framework to continuously enhance the channel robustness to domain shifts, where a domain discriminator is used to identify and drop unstable channels in feature maps of each network layer during forward propagation. We theoretically prove that our framework could effectively lower the generalization bound. Extensive experiments on several benchmarks indicate that our framework achieves state-of-the-art performance compared to other competing methods. Our code is available at https://github.com/lingeringlight/DomainDrop.
Abstract:Domain generalization (DG) aims to learn a model that generalizes well to unseen target domains utilizing multiple source domains without re-training. Most existing DG works are based on convolutional neural networks (CNNs). However, the local operation of the convolution kernel makes the model focus too much on local representations (e.g., texture), which inherently causes the model more prone to overfit to the source domains and hampers its generalization ability. Recently, several MLP-based methods have achieved promising results in supervised learning tasks by learning global interactions among different patches of the image. Inspired by this, in this paper, we first analyze the difference between CNN and MLP methods in DG and find that MLP methods exhibit a better generalization ability because they can better capture the global representations (e.g., structure) than CNN methods. Then, based on a recent lightweight MLP method, we obtain a strong baseline that outperforms most state-of-the-art CNN-based methods. The baseline can learn global structure representations with a filter to suppress structure irrelevant information in the frequency space. Moreover, we propose a dynAmic LOw-Frequency spectrum Transform (ALOFT) that can perturb local texture features while preserving global structure features, thus enabling the filter to remove structure-irrelevant information sufficiently. Extensive experiments on four benchmarks have demonstrated that our method can achieve great performance improvement with a small number of parameters compared to SOTA CNN-based DG methods. Our code is available at https://github.com/lingeringlight/ALOFT/.
Abstract:By training a model on multiple observed source domains, domain generalization aims to generalize well to arbitrary unseen target domains without further training. Existing works mainly focus on learning domain-invariant features to improve the generalization ability. However, since target domain is not available during training, previous methods inevitably suffer from overfitting in source domains. To tackle this issue, we develop an effective dropout-based framework to enlarge the region of the model's attention, which can effectively mitigate the overfitting problem. Particularly, different from the typical dropout scheme, which normally conducts the dropout on the fixed layer, first, we randomly select one layer, and then we randomly select its channels to conduct dropout. Besides, we leverage the progressive scheme to add the ratio of the dropout during training, which can gradually boost the difficulty of training model to enhance the robustness of the model. Moreover, to further alleviate the impact of the overfitting issue, we leverage the augmentation schemes on image-level and feature-level to yield a strong baseline model. We conduct extensive experiments on multiple benchmark datasets, which show our method can outperform the state-of-the-art methods.