Traffic flow forecasting is a key problem in intelligent transport systems. In this work, we propose a hybrid multimodal deep learning method for short-term traffic flow forecasting that jointly learns the spatial-temporal correlation features and the interdependence of multi-modality traffic data. To handle the highly nonlinear characteristics of such data, the base module of our method combines a one-dimensional Convolutional Neural Network (CNN) with Gated Recurrent Units (GRU): the former captures local trend features, and the latter captures long-term temporal dependencies. Building on multiple CNN-GRU modules, we then design a hybrid multimodal deep learning framework (HMDLF) that fuses the shared representation features of the different traffic data modalities. Experimental results indicate that the proposed framework can handle complex nonlinear urban traffic flow forecasting with satisfactory accuracy and effectiveness.
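To make the architecture described above concrete, the following is a minimal forward-pass sketch in plain Python, not the authors' implementation. It assumes one CNN-GRU branch per modality, fusion by simple concatenation of the branch outputs, and a single linear output layer; the abstract does not specify any of these details. The class names (`CnnGruModule`, `HybridMultimodalForecaster`), the layer sizes, the choice of flow and speed as modalities, and the synthetic input series are all hypothetical. Weights are random and untrained, so the output value carries no meaning.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rng, rows, cols, scale=0.1):
    # Small random weights; a real model would learn these by training.
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class CnnGruModule:
    """One modality branch: 1-D convolution (local trends), then a GRU
    (temporal dependencies). Biases are omitted for brevity."""

    def __init__(self, rng, n_filters=4, kernel_size=3, hidden_size=8):
        self.kernel_size = kernel_size
        self.hidden_size = hidden_size
        self.kernels = rand_matrix(rng, n_filters, kernel_size)
        # Each GRU gate acts on the concatenation [input, hidden].
        in_dim = n_filters + hidden_size
        self.w_z = rand_matrix(rng, hidden_size, in_dim)  # update gate
        self.w_r = rand_matrix(rng, hidden_size, in_dim)  # reset gate
        self.w_n = rand_matrix(rng, hidden_size, in_dim)  # candidate state

    def conv1d(self, series):
        # "Valid" sliding-window correlation followed by ReLU:
        # one feature vector (one value per filter) per window position.
        k = self.kernel_size
        features = []
        for t in range(len(series) - k + 1):
            window = series[t:t + k]
            features.append(
                [max(0.0, sum(w * x for w, x in zip(kern, window)))
                 for kern in self.kernels]
            )
        return features

    def gru_step(self, x, h):
        z = [sigmoid(a) for a in matvec(self.w_z, x + h)]
        r = [sigmoid(a) for a in matvec(self.w_r, x + h)]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a) for a in matvec(self.w_n, x + gated_h)]
        # Blend the candidate state with the previous hidden state.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def forward(self, series):
        h = [0.0] * self.hidden_size
        for x in self.conv1d(series):
            h = self.gru_step(x, h)
        return h  # final hidden state summarises the whole series


class HybridMultimodalForecaster:
    """Assumed fusion scheme: concatenate branch outputs, then a linear layer."""

    def __init__(self, n_modalities, seed=0, hidden_size=8):
        rng = random.Random(seed)
        self.branches = [CnnGruModule(rng, hidden_size=hidden_size)
                         for _ in range(n_modalities)]
        self.w_out = rand_matrix(rng, 1, n_modalities * hidden_size)

    def forward(self, modalities):
        fused = []
        for branch, series in zip(self.branches, modalities):
            fused += branch.forward(series)
        return matvec(self.w_out, fused)[0]


# Synthetic stand-ins for two modalities (illustrative only, not real traffic data).
flow = [0.5 + 0.4 * math.sin(0.3 * t) for t in range(12)]
speed = [0.6 - 0.3 * math.sin(0.3 * t) for t in range(12)]

model = HybridMultimodalForecaster(n_modalities=2)
prediction = model.forward([flow, speed])
print("untrained next-step forecast:", prediction)
```

Concatenation is only one possible way to fuse per-modality features. It is used here because it keeps the sketch short while still showing that each modality gets its own CNN-GRU branch before a joint prediction is made.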