With the rapid development of location based services, multimodal spatio-temporal (ST) data including trajectories, transportation modes, traffic flow and social check-ins are being collected for deep learning based methods. These deep learning based methods learn ST correlations to support the downstream tasks in the fields such as smart mobility, smart city and other intelligent transportation systems. Despite their effectiveness, ST data fusion and forecasting methods face practical challenges in real-world scenarios. First, forecasting performance for ST data-insufficient area is inferior, making it necessary to transfer meta knowledge from heterogeneous area to enhance the sparse representations. Second, it is nontrivial to accurately forecast in multi-transportation-mode scenarios due to the fine-grained ST features of similar transportation modes, making it necessary to distinguish and measure the ST correlations to alleviate the influence caused by entangled ST features. At last, partial data modalities (e.g., transportation mode) are lost due to privacy or technical issues in certain scenarios, making it necessary to effectively fuse the multimodal sparse ST features and enrich the ST representations. To tackle these challenges, our research work aim to develop effective fusion and forecasting methods for multimodal ST data in smart mobility scenario. In this paper, we will introduce our recent works that investigates the challenges in terms of various real-world applications and establish the open challenges in this field for future work.