Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention

Mar 13, 2023

Wenxiao Wang, Wei Chen, Qibo Qiu, Long Chen, Boxi Wu, Binbin Lin, Xiaofei He, Wei Liu

Figure 1 for CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention

Figure 2 for CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention

Figure 3 for CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention

Figure 4 for CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention

Share this with someone who'll enjoy it:

Abstract:While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens. Moreover, through experiments on CrossFormer, we observe another two issues that affect vision transformers' performance, i.e. the enlarging self-attention maps and amplitude explosion. Thus, we further propose a progressive group size (PGS) paradigm and an amplitude cooling layer (ACL) to alleviate the two issues, respectively. The CrossFormer incorporating with PGS and ACL is called CrossFormer++. Extensive experiments show that CrossFormer++ outperforms the other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks. The code will be available at: https://github.com/cheerss/CrossFormer.

* 15 pages, 7 figures

View paper on

Share this with someone who'll enjoy it:

Title:CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention

Paper and Code