We present a novel framework for modeling traffic congestion events over road networks based on new mutually exciting spatio-temporal point process models with attention mechanisms and neural network embeddings. Using multi-modal data by combining count data from traffic sensors with police reports that report traffic incidents, we aim to capture two types of triggering effect for congestion events. Current traffic congestion at one location may cause future congestion over the road network, and traffic incidents may cause spread traffic congestion. To capture the non-homogeneous temporal dependence of the event on the past, we introduce a novel attention-based mechanism based on neural networks embedding for the point process model. To incorporate the directional spatial dependence induced by the road network, we adapt the "tail-up" model from the context of spatial statistics to the traffic network setting. We demonstrate the superior performance of our approach compared to the state-of-the-art methods for both synthetic and real data.