Building spatiotemporal activity models for people's activities in urban spaces is important for understanding the ever-increasing complexity of urban dynamics. With the emergence of Geo-Tagged Social Media (GTSM) records, previous studies demonstrate the potential of GTSM records for spatiotemporal activity modeling. State-of-the-art methods for this task embed different modalities (location, time, and text) of GTSM records into a single embedding space. However, they ignore Non-GeoTagged Social Media (NGTSM) records, which generally account for the majority of posts (e.g., more than 95\% in Twitter), and could represent a great source of information to alleviate the sparsity of GTSM records. Furthermore, in the current spatiotemporal embedding techniques, less focus has been given to the users, who exhibit spatially motivated behaviors. To bridge this research gap, this work proposes USTAR, a novel online learning method for User-guided SpatioTemporal Activity Representation, which (1) embeds locations, time, and text along with users into the same embedding space to capture their correlations; (2) uses a novel collaborative filtering approach based on two different empirically studied user behaviors to incorporate both NGTSM and GTSM records in learning; and (3) introduces a novel sampling technique to learn spatiotemporal representations in an online fashion to accommodate recent information into the embedding space, while avoiding overfitting to recent records and frequently appearing units in social media streams. Our results show that USTAR substantially improves the state-of-the-art for region retrieval and keyword retrieval and its potential to be applied to other downstream applications such as local event detection.