Accurate spatio-temporal prediction is crucial for the sustainable development of smart cities. However, current approaches often struggle to capture important spatio-temporal relationships, particularly overlooking global relations among distant city regions. Most existing techniques predominantly rely on Convolutional Neural Networks (CNNs) to capture global relations. However, CNNs exhibit neighbourhood bias, making them insufficient for capturing distant relations. To address this limitation, we propose ST-SampleNet, a novel transformer-based architecture that combines CNNs with self-attention mechanisms to capture both local and global relations effectively. Moreover, as the number of regions increases, the quadratic complexity of self-attention becomes a challenge. To tackle this issue, we introduce a lightweight region sampling strategy that prunes non-essential regions and enhances the efficiency of our approach. Furthermore, we introduce a spatially constrained position embedding that incorporates spatial neighbourhood information into the self-attention mechanism, aiding in semantic interpretation and improving the performance of ST-SampleNet. Our experimental evaluation on three real-world datasets demonstrates the effectiveness of ST-SampleNet. Additionally, our efficient variant achieves a 40% reduction in computational costs with only a marginal compromise in performance, approximately 1%.