User localization and tracking in the upcoming generation of wireless networks have the potential to be revolutionized by technologies such as Dynamic Metasurface Antennas (DMAs). Commonly proposed algorithmic approaches either assume relatively dominant Line-of-Sight (LoS) paths or require pilot transmission sequences whose length is comparable to the number of DMA elements, leading to limited effectiveness and considerable measurement overhead in blocked-LoS and dynamic multipath environments. In this paper, we present a two-stage machine-learning-based approach for user tracking, specifically designed for non-LoS multipath settings. In the first stage, a newly proposed attention-based Neural Network (NN) is trained to map noisy channel responses to potential user positions, regardless of user mobility patterns. This architecture is a modification of the prominent vision transformer, adapted for extracting information from high-dimensional frequency response signals. In the second stage, the NN's predictions for past user positions are passed through a learnable autoregressive model that exploits the time-correlated channel information to obtain the final position predictions. The channel estimation procedure leverages a DMA receive architecture with partially-connected radio frequency chains, which results in a reduced number of pilots. Numerical evaluation over an outdoor ray-tracing scenario illustrates that, despite LoS blockage, this methodology is capable of achieving high position accuracy across various multipath settings.
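
To make the second stage concrete, the following is a minimal sketch of how a learnable autoregressive model could refine a sequence of per-snapshot position estimates. It is an illustration under stated assumptions, not the model used in this work: the function names (`ar_refine`, `mse`, `fit_weights`), the model order of 3, the plain gradient-descent fit, and the toy trajectory are all hypothetical, and positions are assumed to be normalized to roughly unit scale so that a fixed learning rate remains stable.

```python
def ar_refine(window, weights):
    """Weighted sum of the given position estimates (oldest first).

    Assumes len(window) == len(weights); each position is a tuple of coordinates.
    """
    dims = len(window[0])
    return tuple(
        sum(w * pos[d] for w, pos in zip(weights, window)) for d in range(dims)
    )


def mse(nn_preds, true_pos, weights):
    """Mean squared position error of the AR-refined estimates."""
    order = len(weights)
    total, count = 0.0, 0
    for t in range(order - 1, len(nn_preds)):
        est = ar_refine(nn_preds[t - order + 1 : t + 1], weights)
        total += sum((e - g) ** 2 for e, g in zip(est, true_pos[t]))
        count += 1
    return total / count


def fit_weights(nn_preds, true_pos, order=3, lr=0.01, epochs=500):
    """Learn the AR weights by gradient descent on the mean squared error."""
    weights = [1.0 / order] * order  # start from a plain moving average
    for _ in range(epochs):
        grads = [0.0] * order
        count = 0
        for t in range(order - 1, len(nn_preds)):
            window = nn_preds[t - order + 1 : t + 1]
            est = ar_refine(window, weights)
            for d, e in enumerate(est):
                err = e - true_pos[t][d]
                for k in range(order):
                    grads[k] += 2.0 * err * window[k][d]
            count += 1
        weights = [w - lr * g / count for w, g in zip(weights, grads)]
    return weights


# Toy data (assumption): a slow straight-line trajectory, with hypothetical
# first-stage estimates that carry an alternating error on the x coordinate.
true_pos = [(0.01 * t, 0.5) for t in range(50)]
nn_preds = [
    (x + (0.05 if t % 2 else -0.05), y) for t, (x, y) in enumerate(true_pos)
]

weights = fit_weights(nn_preds, true_pos)
refined = ar_refine(nn_preds[-3:], weights)  # final estimate for the last step
```

Here `nn_preds` stands in for the first-stage NN outputs. In practice the weights would be trained jointly with, or on the outputs of, the position-estimation network, and the order would be chosen according to how strongly consecutive channel snapshots are correlated.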