Learning high-quality neural latent representations that reveal the intrinsic relationship between neural activity and behavior or sensory stimulation has attracted considerable interest. Many current deep latent variable models rely on behavioral information (e.g., movement direction and position) to build expressive embeddings, yet they remain restricted to fixed time scales. Visual neural activity recorded during passive viewing lacks clearly correlated behavioral or task information, and high-dimensional visual stimulation induces intricate neural dynamics. To address these conditions, we propose Time-Dependent SwapVAE (TiDe-SwapVAE). Following the approach of Swap-VAE, it separates the latent space into content and style subspaces, and it further introduces state variables that define temporally dependent conditional distributions over both subspaces. Our model progressively generates latent variables along neural activity sequences, and we apply self-supervised contrastive learning to shape its latent space. In this way, it can effectively analyze complex neural dynamics from sequences of arbitrary length, even without task or behavioral data as auxiliary inputs. We compare TiDe-SwapVAE with alternative models on synthetic data and on neural data from mouse visual cortex. The results show that our model not only accurately decodes complex visual stimuli but also extracts explicit temporal neural dynamics, demonstrating that it builds latent representations more relevant to visual stimulation.
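To make the core idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: a recurrent encoder whose hidden state carries temporal dependence across the sequence and whose output at each time step parameterizes separate conditional distributions for a content code and a style code. All names (TemporalSplitEncoder, n_neurons, d_content, d_style, d_state) are hypothetical placeholders, and the decoder, swapping operation, and contrastive objective are omitted.

```python
# Hypothetical sketch: temporally dependent content/style encoder.
import torch
import torch.nn as nn

class TemporalSplitEncoder(nn.Module):
    def __init__(self, n_neurons=100, d_content=16, d_style=16, d_state=64):
        super().__init__()
        # GRU state provides the temporal dependence between time steps.
        self.rnn = nn.GRU(n_neurons, d_state, batch_first=True)
        # Separate heads output mean and log-variance for the conditional
        # distributions over the content and style codes at each time step.
        self.content_head = nn.Linear(d_state, 2 * d_content)
        self.style_head = nn.Linear(d_state, 2 * d_style)

    def forward(self, x):
        # x: (batch, time, n_neurons) spike-count or firing-rate sequence.
        h, _ = self.rnn(x)                                # (batch, time, d_state)
        c_mu, c_logvar = self.content_head(h).chunk(2, dim=-1)
        s_mu, s_logvar = self.style_head(h).chunk(2, dim=-1)
        # Reparameterized samples, produced progressively along the sequence.
        z_content = c_mu + torch.randn_like(c_mu) * (0.5 * c_logvar).exp()
        z_style = s_mu + torch.randn_like(s_mu) * (0.5 * s_logvar).exp()
        return z_content, z_style

# Usage: the recurrent state allows sequences of arbitrary length.
x = torch.randn(8, 50, 100)              # 8 trials, 50 time bins, 100 neurons
z_c, z_s = TemporalSplitEncoder()(x)
print(z_c.shape, z_s.shape)              # (8, 50, 16) each
```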