Extremely large-scale massive multiple-input multiple-output (XL-MIMO) is one of the key technologies for next-generation wireless communication systems. However, accurately acquiring the high-dimensional XL-MIMO channel matrix remains a pressing challenge because of the intractable channel properties and the high complexity involved. In this paper, we develop a Mixed Attention Transformer based Channel Estimation Neural Network (MAT-CENet). Inspired by the Transformer encoder structure, MAT-CENet integrates feature map attention and spatial attention mechanisms to better capture the unique characteristics of the XL-MIMO channel. With the multi-head attention layer as its core component, the network captures and exploits feature importance effectively. A comprehensive complexity analysis of the proposed MAT-CENet is also provided. Simulation results show that MAT-CENet outperforms state-of-the-art methods in near-field, far-field, and hybrid-field propagation scenarios.
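
For reference, the multi-head attention layer named above as the core component is usually written in the standard Transformer form shown below. This is a sketch of the generic formulation only: the symbols Q, K, V (query, key, and value matrices), d_k (key dimension), h (number of heads), and the projection matrices W_i^Q, W_i^K, W_i^V, W^O are assumptions borrowed from the standard Transformer notation, and the exact variant used in MAT-CENet, including how the feature map and spatial attention are combined with it, is not specified in this abstract.

```latex
\begin{align}
  \mathrm{Attention}(Q, K, V)
    &= \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right) V, \\
  \mathrm{head}_i
    &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right),
       \qquad i = 1, \dots, h, \\
  \mathrm{MultiHead}(Q, K, V)
    &= \mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_h\right) W^{O}.
\end{align}
```

The softmax is applied row-wise, so each output row is a weighted combination of the rows of V, with the weights expressing the learned importance of each feature.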