Spatial transcriptomics data is invaluable for understanding the spatial organization of gene expression in tissues. There have been consistent efforts in studying how to effectively utilize the associated spatial information for refining gene expression modeling. We introduce a class of distance-preserving generative models for spatial transcriptomics, which utilizes the provided spatial information to regularize the learned representation space of gene expressions to have a similar pair-wise distance structure. This helps the latent space to capture meaningful encodings of genes in spatial proximity. We carry out theoretical analysis over a tractable loss function for this purpose and formalize the overall learning objective as a regularized evidence lower bound. Our framework grants compatibility with any variational-inference-based generative models for gene expression modeling. Empirically, we validate our proposed method on the mouse brain tissues Visium dataset and observe improved performance with variational autoencoders and scVI used as backbone models.