https://github.com/nkusyl/EViT.
Thanks to the advancement of deep learning technology, vision transformer has demonstrated competitive performance in various computer vision tasks. Unfortunately, vision transformer still faces some challenges such as high computational complexity and absence of desirable inductive bias. To alleviate these problems, a novel Bi-Fovea Self-Attention (BFSA) is proposed, inspired by the physiological structure and characteristics of bi-fovea vision in eagle eyes. This BFSA can simulate the shallow fovea and deep fovea functions of eagle vision, enable the network to extract feature representations of targets from coarse to fine, facilitate the interaction of multi-scale feature representations. Additionally, a Bionic Eagle Vision (BEV) block based on BFSA is designed in this study. It combines the advantages of CNNs and Vision Transformers to enhance the ability of global and local feature representations of networks. Furthermore, a unified and efficient general pyramid backbone network family is developed by stacking the BEV blocks in this study, called Eagle Vision Transformers (EViTs). Experimental results on various computer vision tasks including image classification, object detection, instance segmentation and other transfer learning tasks show that the proposed EViTs perform effectively by comparing with the baselines under same model size and exhibit higher speed on graphics processing unit than other models. Code is available at