Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Qiang Yu

Combining Aggregated Attention and Transformer Architecture for Accurate and Efficient Performance of Spiking Neural Networks

Dec 18, 2024

Hangming Zhang, Alexander Sboev, Roman Rybka, Qiang Yu

Abstract:Spiking Neural Networks have attracted significant attention in recent years due to their distinctive low-power characteristics. Meanwhile, Transformer models, known for their powerful self-attention mechanisms and parallel processing capabilities, have demonstrated exceptional performance across various domains, including natural language processing and computer vision. Despite the significant advantages of both SNNs and Transformers, directly combining the low-power benefits of SNNs with the high performance of Transformers remains challenging. Specifically, while the sparse computing mode of SNNs contributes to reduced energy consumption, traditional attention mechanisms depend on dense matrix computations and complex softmax operations. This reliance poses significant challenges for effective execution in low-power scenarios. Given the tremendous success of Transformers in deep learning, it is a necessary step to explore the integration of SNNs and Transformers to harness the strengths of both. In this paper, we propose a novel model architecture, Spike Aggregation Transformer (SAFormer), that integrates the low-power characteristics of SNNs with the high-performance advantages of Transformer models. The core contribution of SAFormer lies in the design of the Spike Aggregated Self-Attention (SASA) mechanism, which significantly simplifies the computation process by calculating attention weights using only the spike matrices query and key, thereby effectively reducing energy consumption. Additionally, we introduce a Depthwise Convolution Module (DWC) to enhance the feature extraction capabilities, further improving overall accuracy. We evaluated and demonstrated that SAFormer outperforms state-of-the-art SNNs in both accuracy and energy consumption, highlighting its significant advantages in low-power and high-performance computing.

Via

Access Paper or Ask Questions

Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation

Oct 11, 2024

Kun Ding, Qiang Yu, Haojian Zhang, Gaofeng Meng, Shiming Xiang

Figure 1 for Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation

Figure 2 for Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation

Figure 3 for Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation

Figure 4 for Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation

Abstract:Cache-based approaches stand out as both effective and efficient for adapting vision-language models (VLMs). Nonetheless, the existing cache model overlooks three crucial aspects. 1) Pre-trained VLMs are mainly optimized for image-text similarity, neglecting the importance of image-image similarity, leading to a gap between pre-training and adaptation. 2) The current cache model is based on the Nadaraya-Watson (N-W) estimator, which disregards the intricate relationships among training samples while constructing weight function. 3) Under the condition of limited samples, the logits generated by cache model are of high uncertainty, directly using these logits without accounting for the confidence could be problematic. This work presents three calibration modules aimed at addressing the above challenges. Similarity Calibration refines the image-image similarity by using unlabeled images. We add a learnable projection layer with residual connection on top of the pre-trained image encoder of CLIP and optimize the parameters by minimizing self-supervised contrastive loss. Weight Calibration introduces a precision matrix into the weight function to adequately model the relation between training samples, transforming the existing cache model to a Gaussian Process (GP) regressor, which could be more accurate than N-W estimator. Confidence Calibration leverages the predictive variances computed by GP Regression to dynamically re-scale the logits of cache model, ensuring that the cache model's outputs are appropriately adjusted based on their confidence levels. Besides, to reduce the high complexity of GPs, we further propose a group-based learning strategy. Integrating the above designs, we propose both training-free and training-required variants. Extensive experiments on 11 few-shot classification datasets validate that the proposed methods can achieve state-of-the-art performance.

* submitted to IJCV

Via

Access Paper or Ask Questions

Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning

Mar 31, 2024

Kun Ding, Haojian Zhang, Qiang Yu, Ying Wang, Shiming Xiang, Chunhong Pan

Abstract:We propose a generalized method for boosting the generalization ability of pre-trained vision-language models (VLMs) while fine-tuning on downstream few-shot tasks. The idea is realized by exploiting out-of-distribution (OOD) detection to predict whether a sample belongs to a base distribution or a novel distribution and then using the score generated by a dedicated competition based scoring function to fuse the zero-shot and few-shot classifier. The fused classifier is dynamic, which will bias towards the zero-shot classifier if a sample is more likely from the distribution pre-trained on, leading to improved base-to-novel generalization ability. Our method is performed only in test stage, which is applicable to boost existing methods without time-consuming re-training. Extensive experiments show that even weak distribution detectors can still improve VLMs' generalization ability. Specifically, with the help of OOD detectors, the harmonic mean of CoOp and ProGrad increase by 2.6 and 1.5 percentage points over 11 recognition datasets in the base-to-novel setting.

* Accepted by AAAI2024

Via

Access Paper or Ask Questions

Compositional Kronecker Context Optimization for Vision-Language Models

Mar 18, 2024

Kun Ding, Xiaohui Li, Qiang Yu, Ying Wang, Haojian Zhang, Shiming Xiang

Abstract:Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning compact context with satisfactory base-to-new, domain and cross-task generalization ability while adapting to new tasks is still a challenge. To tackle such a challenge, we propose a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors, which are crafted by linearly combining base vectors sourced from a dictionary. These base vectors consist of a non-learnable component obtained by quantizing the weights in the token embedding layer, and a learnable component constructed by applying Kronecker product on several learnable tiny matrices. Intuitively, the compositional structure mitigates the risk of overfitting on training data by remembering more pre-trained knowledge. Meantime, the Kronecker product breaks the non-learnable restrictions of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that CK-CoOp achieves state-of-the-art performance under base-to-new, domain and cross-task generalization evaluation, but also has the metrics of fewer learnable parameters and efficient training and inference speed.

Via

Access Paper or Ask Questions

Prompt Tuning with Soft Context Sharing for Vision-Language Models

Aug 29, 2022

Kun Ding, Ying Wang, Pengzhang Liu, Qiang Yu, Haojian Zhang, Shiming Xiang, Chunhong Pan

Figure 1 for Prompt Tuning with Soft Context Sharing for Vision-Language Models

Figure 2 for Prompt Tuning with Soft Context Sharing for Vision-Language Models

Figure 3 for Prompt Tuning with Soft Context Sharing for Vision-Language Models

Figure 4 for Prompt Tuning with Soft Context Sharing for Vision-Language Models

Abstract:Vision-language models have recently shown great potential on many computer vision tasks. Meanwhile, prior work demonstrates prompt tuning designed for vision-language models could acquire superior performance on few-shot image recognition compared to linear probe, a strong baseline. In real-world applications, many few-shot tasks are correlated, particularly in a specialized area. However, such information is ignored by previous work. Inspired by the fact that modeling task relationships by multi-task learning can usually boost performance, we propose a novel method SoftCPT (Soft Context Sharing for Prompt Tuning) to fine-tune pre-trained vision-language models on multiple target few-shot tasks, simultaneously. Specifically, we design a task-shared meta network to generate prompt vector for each task using pre-defined task name together with a learnable meta prompt as input. As such, the prompt vectors of all tasks will be shared in a soft manner. The parameters of this shared meta network as well as the meta prompt vector are tuned on the joint training set of all target tasks. Extensive experiments on three multi-task few-shot datasets show that SoftCPT outperforms the representative single-task prompt tuning method CoOp [78] by a large margin, implying the effectiveness of multi-task learning in vision-language prompt tuning. The source code and data will be made publicly available.

Via

Access Paper or Ask Questions

Consensus Graph Representation Learning for Better Grounded Image Captioning

Dec 02, 2021

Wenqiao Zhang, Haochen Shi, Siliang Tang, Jun Xiao, Qiang Yu, Yueting Zhuang

Figure 1 for Consensus Graph Representation Learning for Better Grounded Image Captioning

Figure 2 for Consensus Graph Representation Learning for Better Grounded Image Captioning

Figure 3 for Consensus Graph Representation Learning for Better Grounded Image Captioning

Figure 4 for Consensus Graph Representation Learning for Better Grounded Image Captioning

Abstract:The contemporary visual captioning models frequently hallucinate objects that are not actually in a scene, due to the visual misclassification or over-reliance on priors that resulting in the semantic inconsistency between the visual information and the target lexical words. The most common way is to encourage the captioning model to dynamically link generated object words or phrases to appropriate regions of the image, i.e., the grounded image captioning (GIC). However, GIC utilizes an auxiliary task (grounding objects) that has not solved the key issue of object hallucination, i.e., the semantic inconsistency. In this paper, we take a novel perspective on the issue above - exploiting the semantic coherency between the visual and language modalities. Specifically, we propose the Consensus Rraph Representation Learning framework (CGRL) for GIC that incorporates a consensus representation into the grounded captioning pipeline. The consensus is learned by aligning the visual graph (e.g., scene graph) to the language graph that consider both the nodes and edges in a graph. With the aligned consensus, the captioning model can capture both the correct linguistic characteristics and visual relevance, and then grounding appropriate image regions further. We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset. Besides, our CGRL also evaluated by several automatic metrics and human evaluation, the results indicate that the proposed approach can simultaneously improve the performance of image captioning (+2.9 Cider) and grounding (+2.3 F1LOC).

* 9 pages, 5 figures, AAAI 2021

Via

Access Paper or Ask Questions

Disentangled Motif-aware Graph Learning for Phrase Grounding

Apr 13, 2021

Zongshen Mu, Siliang Tang, Jie Tan, Qiang Yu, Yueting Zhuang

Figure 1 for Disentangled Motif-aware Graph Learning for Phrase Grounding

Figure 2 for Disentangled Motif-aware Graph Learning for Phrase Grounding

Figure 3 for Disentangled Motif-aware Graph Learning for Phrase Grounding

Figure 4 for Disentangled Motif-aware Graph Learning for Phrase Grounding

Abstract:In this paper, we propose a novel graph learning framework for phrase grounding in the image. Developing from the sequential to the dense graph model, existing works capture coarse-grained context but fail to distinguish the diversity of context among phrases and image regions. In contrast, we pay special attention to different motifs implied in the context of the scene graph and devise the disentangled graph network to integrate the motif-aware contextual information into representations. Besides, we adopt interventional strategies at the feature and the structure levels to consolidate and generalize representations. Finally, the cross-modal attention network is utilized to fuse intra-modal features, where each phrase can be computed similarity with regions to select the best-grounded one. We validate the efficiency of disentangled and interventional graph network (DIGN) through a series of ablation studies, and our model achieves state-of-the-art performance on Flickr30K Entities and ReferIt Game benchmarks.

* 10 pages, 6 figures, AAAI 2021 conference

Via

Access Paper or Ask Questions

Synaptic Learning with Augmented Spikes

May 11, 2020

Qiang Yu, Shiming Song, Chenxiang Ma, Linqiang Pan, Kay Chen Tan

Figure 1 for Synaptic Learning with Augmented Spikes

Figure 2 for Synaptic Learning with Augmented Spikes

Figure 3 for Synaptic Learning with Augmented Spikes

Figure 4 for Synaptic Learning with Augmented Spikes

Abstract:Traditional neuron models use analog values for information representation and computation, while all-or-nothing spikes are employed in the spiking ones. With a more brain-like processing paradigm, spiking neurons are more promising for improvements on efficiency and computational capability. They extend the computation of traditional neurons with an additional dimension of time carried by all-or-nothing spikes. Could one benefit from both the accuracy of analog values and the time-processing capability of spikes? In this paper, we introduce a concept of augmented spikes to carry complementary information with spike coefficients in addition to spike latencies. New augmented spiking neuron model and synaptic learning rules are proposed to process and learn patterns of augmented spikes. We provide systematic insight into the properties and characteristics of our methods, including classification of augmented spike patterns, learning capacity, construction of causality, feature detection, robustness and applicability to practical tasks such as acoustic and visual pattern recognition. The remarkable results highlight the effectiveness and potential merits of our methods. Importantly, our augmented approaches are versatile and can be easily generalized to other spike-based systems, contributing to a potential development for them including neuromorphic computing.

* 13 pages

Via

Access Paper or Ask Questions

Constructing Accurate and Efficient Deep Spiking Neural Networks with Double-threshold and Augmented Schemes

May 05, 2020

Qiang Yu, Chenxiang Ma, Shiming Song, Gaoyan Zhang, Jianwu Dang, Kay Chen Tan

Figure 1 for Constructing Accurate and Efficient Deep Spiking Neural Networks with Double-threshold and Augmented Schemes

Figure 2 for Constructing Accurate and Efficient Deep Spiking Neural Networks with Double-threshold and Augmented Schemes

Figure 3 for Constructing Accurate and Efficient Deep Spiking Neural Networks with Double-threshold and Augmented Schemes

Figure 4 for Constructing Accurate and Efficient Deep Spiking Neural Networks with Double-threshold and Augmented Schemes

Abstract:Spiking neural networks (SNNs) are considered as a potential candidate to overcome current challenges such as the high-power consumption encountered by artificial neural networks (ANNs), however there is still a gap between them with respect to the recognition accuracy on practical tasks. A conversion strategy was thus introduced recently to bridge this gap by mapping a trained ANN to an SNN. However, it is still unclear that to what extent this obtained SNN can benefit both the accuracy advantage from ANN and high efficiency from the spike-based paradigm of computation. In this paper, we propose two new conversion methods, namely TerMapping and AugMapping. The TerMapping is a straightforward extension of a typical threshold-balancing method with a double-threshold scheme, while the AugMapping additionally incorporates a new scheme of augmented spike that employs a spike coefficient to carry the number of typical all-or-nothing spikes occurring at a time step. We examine the performance of our methods based on MNIST, Fashion-MNIST and CIFAR10 datasets. The results show that the proposed double-threshold scheme can effectively improve accuracies of the converted SNNs. More importantly, the proposed AugMapping is more advantageous for constructing accurate, fast and efficient deep SNNs as compared to other state-of-the-art approaches. Our study therefore provides new approaches for further integration of advanced techniques in ANNs to improve the performance of SNNs, which could be of great merit to applied developments with spike-based neuromorphic computing.

* 13 pages

Via

Access Paper or Ask Questions

Towards Efficient Processing and Learning with Spikes: New Approaches for Multi-Spike Learning

May 02, 2020

Qiang Yu, Shenglan Li, Huajin Tang, Longbiao Wang, Jianwu Dang, Kay Chen Tan

Figure 1 for Towards Efficient Processing and Learning with Spikes: New Approaches for Multi-Spike Learning

Figure 2 for Towards Efficient Processing and Learning with Spikes: New Approaches for Multi-Spike Learning

Figure 3 for Towards Efficient Processing and Learning with Spikes: New Approaches for Multi-Spike Learning

Figure 4 for Towards Efficient Processing and Learning with Spikes: New Approaches for Multi-Spike Learning

Abstract:Spikes are the currency in central nervous systems for information transmission and processing. They are also believed to play an essential role in low-power consumption of the biological systems, whose efficiency attracts increasing attentions to the field of neuromorphic computing. However, efficient processing and learning of discrete spikes still remains as a challenging problem. In this paper, we make our contributions towards this direction. A simplified spiking neuron model is firstly introduced with effects of both synaptic input and firing output on membrane potential being modeled with an impulse function. An event-driven scheme is then presented to further improve the processing efficiency. Based on the neuron model, we propose two new multi-spike learning rules which demonstrate better performance over other baselines on various tasks including association, classification, feature detection. In addition to efficiency, our learning rules demonstrate a high robustness against strong noise of different types. They can also be generalized to different spike coding schemes for the classification task, and notably single neuron is capable of solving multi-category classifications with our learning rules. In the feature detection task, we re-examine the ability of unsupervised STDP with its limitations being presented, and find a new phenomenon of losing selectivity. In contrast, our proposed learning rules can reliably solve the task over a wide range of conditions without specific constraints being applied. Moreover, our rules can not only detect features but also discriminate them. The improved performance of our methods would contribute to neuromorphic computing as a preferable choice.

* 13 pages

Via

Access Paper or Ask Questions