Smartphone-based indoor localization has emerged as a cost-effective and accurate solution for localizing mobile and IoT devices indoors. However, device heterogeneity and temporal variations have limited its accuracy and hindered its widespread adoption. To address both challenges jointly, we propose STELLAR, a novel framework that implements a contrastive learning approach using a Siamese multi-headed attention neural network. STELLAR is the first solution to simultaneously tackle device heterogeneity and temporal variations in indoor localization without retraining the model, i.e., it is re-calibration-free. Our evaluations across diverse indoor environments show 8–75% accuracy improvements over state-of-the-art techniques under device heterogeneity. Moreover, STELLAR outperforms existing methods by 18–165% over 2 years of temporal variations, demonstrating its robustness and adaptability.