Min-Chun Hu

Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition

Dec 31, 2024

Edwin Arkel Rios, Jansen Christopher Yuanda, Vincent Leon Ghanz, Cheng-Wei Yu, Bo-Cheng Lai, Min-Chun Hu

Abstract: Ultra-fine-grained image recognition (UFGIR) is a challenging task that involves classifying images within a macro-category. While traditional FGIR deals with classifying different species, UFGIR goes further by classifying sub-categories within a species, such as cultivars of a plant. Recently, Vision Transformer-based backbones have allowed methods to obtain outstanding recognition performance in this task, but this comes at a significant computational cost, especially since the task benefits significantly from higher-resolution images. Techniques such as token reduction have therefore emerged to reduce the computational cost. However, dropping tokens leads to loss of essential information for fine-grained categories, especially as the token keep rate is reduced. To counteract the loss of information caused by token reduction, we propose a novel Cross-Layer Aggregation Classification Head and a Cross-Layer Cache mechanism to recover and access information from previous layers in later locations. Extensive experiments covering more than 2000 runs across diverse settings, including 5 datasets, 9 backbones, 7 token reduction methods, 5 keep rates, and 2 image sizes, demonstrate the effectiveness of the proposed plug-and-play modules and allow us to push the boundaries of accuracy vs. cost for UFGIR by reducing the kept tokens to extremely low ratios of up to 10% while maintaining accuracy competitive with state-of-the-art models. Code is available at: https://github.com/arkel23/CLCA

* Accepted to ICASSP 2025. Main: 5 pages, 4 figures, 1 table
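
To make the "token keep rate" in the abstract above concrete, here is a minimal sketch of generic score-based token pruning. It is not the proposed Cross-Layer Aggregation head or Cross-Layer Cache, and it is not taken from the CLCA repository. The function name `keep_tokens`, the per-token importance scores, and the 224-pixel image with 16-pixel patches are all assumptions chosen for illustration.

```python
import math


def keep_tokens(scores, keep_rate):
    """Return indices of the highest-scoring tokens, in their original order.

    `scores` is a hypothetical per-token importance list; how scores are
    computed differs between token reduction methods and is not shown here.
    """
    n_keep = max(1, math.ceil(len(scores) * keep_rate))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Assumed setting: a 224x224 image split into 16x16 patches gives 196 tokens.
num_patches = (224 // 16) ** 2
# At the 10% keep rate mentioned in the abstract, about 20 tokens survive.
n_kept = max(1, math.ceil(num_patches * 0.10))

# Toy example: keep 40% of five tokens.
kept = keep_tokens([0.1, 0.9, 0.3, 0.7, 0.5], keep_rate=0.4)
```

With so few tokens surviving at low keep rates, it is plausible that information from earlier layers is lost, which is the problem the abstract says the proposed modules address.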

Down-Sampling Inter-Layer Adapter for Parameter and Computation Efficient Ultra-Fine-Grained Image Recognition

Sep 17, 2024

Edwin Arkel Rios, Femiloye Oyerinde, Min-Chun Hu, Bo-Cheng Lai

Abstract: Ultra-fine-grained image recognition (UFGIR) categorizes objects with extremely small differences between classes, such as distinguishing between cultivars within the same species, as opposed to species-level classification in fine-grained image recognition (FGIR). The difficulty of this task is exacerbated by the scarcity of samples per category. To tackle these challenges, we introduce a novel approach employing down-sampling inter-layer adapters in a parameter-efficient setting, where the backbone parameters are frozen and we only fine-tune a small set of additional modules. By integrating dual-branch down-sampling, we significantly reduce the number of parameters and floating-point operations (FLOPs) required, making our method highly efficient. Comprehensive experiments on ten datasets demonstrate that our approach obtains outstanding accuracy-cost performance, highlighting its potential for practical applications in resource-constrained environments. In particular, our method increases the average accuracy by at least 6.8% compared to other methods in the parameter-efficient setting, while requiring at least 123x fewer trainable parameters than current state-of-the-art UFGIR methods and reducing FLOPs by 30% on average compared to other methods.

* Accepted to ECCV 2024 Workshop on Efficient Deep Learning for Foundation Models (EFM). Main: 13 pages, 3 figures, 2 tables. Appendix: 3 pages, 1 table. Total: 16 pages, 3 figures, 4 tables

Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers

Jul 17, 2024

Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai

Abstract: Fine-grained recognition involves the classification of images from subordinate categories within a macro-category, and it is challenging due to small inter-class differences. To overcome this, most methods perform discriminative feature selection, enabled by a feature extraction backbone followed by a high-level feature refinement step. Recently, many studies have shown the potential of vision transformers as a backbone for fine-grained recognition, but using their attention mechanism to select discriminative tokens can be computationally expensive. In this work, we propose a novel and computationally inexpensive metric to identify discriminative regions in an image. We compare the similarity between the global representation of an image given by the CLS token, a learnable token used by transformers for classification, and the local representation of individual patches. We select the regions with the highest similarity to obtain crops, which are forwarded through the same transformer encoder. Finally, high-level features of the original and cropped representations are refined together in order to make more robust predictions. Through extensive experimental evaluation, we demonstrate the effectiveness of our proposed method, obtaining favorable results in terms of accuracy across a variety of datasets. Furthermore, our method achieves these results at a much lower computational cost than the alternatives. Code and checkpoints are available at: https://github.com/arkel23/GLSim

* Main: 12 pages, 5 figures, 5 tables. Appendix: 9 pages, 9 figures, 10 tables. Total: 21 pages, 14 figures, 15 tables
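
The global-local comparison described in the abstract above can be sketched roughly as follows. This is an illustrative reading, not the code from the GLSim repository. Cosine similarity as the metric, a bounding box around the top-k patches as the crop rule, and the helper names `cosine`, `top_k_patches`, and `crop_box` are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_patches(cls_token, patch_tokens, k):
    """Indices of the k patches most similar to the global CLS representation."""
    sims = [cosine(cls_token, p) for p in patch_tokens]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return ranked[:k]


def crop_box(indices, grid_w):
    """Bounding box (row_min, col_min, row_max, col_max) in patch-grid units.

    Assumed crop rule: the smallest box covering all selected patches.
    """
    rows = [i // grid_w for i in indices]
    cols = [i % grid_w for i in indices]
    return min(rows), min(cols), max(rows), max(cols)


# Toy 2x2 patch grid with 2-dimensional embeddings (row-major order).
cls_token = [1.0, 0.0]
patch_tokens = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.0], [-1.0, 0.0]]
selected = top_k_patches(cls_token, patch_tokens, k=2)
box = crop_box(selected, grid_w=2)
```

Because this needs only one similarity computation per patch on features the encoder already produces, it avoids aggregating attention maps across layers, which is consistent with the abstract's claim of lower computational cost.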

STR-GQN: Scene Representation and Rendering for Unknown Cameras Based on Spatial Transformation Routing

Aug 06, 2021

Wen-Cheng Chen, Min-Chun Hu, Chu-Song Chen

Abstract: Geometry-aware modules are widely applied in recent deep learning architectures for scene representation and rendering. However, these modules require intrinsic camera information that might not be obtained accurately. In this paper, we propose a Spatial Transformation Routing (STR) mechanism to model the spatial properties without applying any geometric prior. The STR mechanism treats the spatial transformation as a message-passing process, and the relation between the view poses and the routing weights is modeled by an end-to-end trainable neural network. In addition, an Occupancy Concept Mapping (OCM) framework is proposed to provide explainable rationales for scene-fusion processes. We conducted experiments on several datasets and show that the proposed STR mechanism improves the performance of the Generative Query Network (GQN). The visualization results reveal that the routing process can pass the observed information from one location of some view to the associated location in the other view, which demonstrates the advantage of the proposed model in terms of spatial cognition.

SyncGAN: Synchronize the Latent Space of Cross-modal Generative Adversarial Networks

Apr 02, 2018

Wen-Cheng Chen, Chien-Wen Chen, Min-Chun Hu

Abstract: Generative adversarial networks (GANs) have achieved impressive success in cross-domain generation, but they face difficulty in cross-modal generation due to the lack of a common distribution between heterogeneous data. Most existing conditional cross-modal GANs adopt a one-directional transfer strategy and have achieved preliminary success on text-to-image transfer. Instead of learning the transfer between different modalities, we aim to learn a synchronous latent space representing the cross-modal common concept. A novel network component named the synchronizer is proposed in this work to judge whether paired data is synchronous/corresponding or not, which can constrain the latent space of the generators in the GANs. Our GAN model, named SyncGAN, can successfully generate synchronous data (e.g., a pair of image and sound) from identical random noise. To transform data from one modality to another, we recover the latent code by inverting the mappings of a generator and use it to generate data of a different modality. In addition, the proposed model can achieve semi-supervised learning, which makes it more flexible for practical applications.

* 9 pages. Part of this work was accepted by the IEEE International Conference on Multimedia and Expo 2018
