Augmented Reality (AR) is a major immersive media technology that enriches our perception of reality by overlaying digital content (the foreground) onto physical environments (the background). It has far-reaching applications, from entertainment and gaming to education, healthcare, and industrial training. Nevertheless, challenges such as visual confusion and classical distortions can result in user discomfort when using the technology. Evaluating AR quality of experience becomes essential to measure user satisfaction and engagement, facilitating the refinement necessary for creating immersive and robust experiences. Though, the scarcity of data and the distinctive characteristics of AR technology render the development of effective quality assessment metrics challenging. This paper presents a deep learning-based objective metric designed specifically for assessing image quality for AR scenarios. The approach entails four key steps, (1) fine-tuning a self-supervised pre-trained vision transformer to extract prominent features from reference images and distilling this knowledge to improve representations of distorted images, (2) quantifying distortions by computing shift representations, (3) employing cross-attention-based decoders to capture perceptual quality features, and (4) integrating regularization techniques and label smoothing to address the overfitting problem. To validate the proposed approach, we conduct extensive experiments on the ARIQA dataset. The results showcase the superior performance of our proposed approach across all model variants, namely TransformAR, TransformAR-KD, and TransformAR-KD+ in comparison to existing state-of-the-art methods.