This paper proposes a novel deep learning framework for multi-modal motion prediction. The framework consists of three parts: recurrent neural networks that encode the target agent's motion history, convolutional neural networks that encode the rasterized environment representation, and a distance-based attention mechanism that models the interactions among different agents. We validate the proposed framework on a large-scale real-world driving dataset, the Waymo Open Motion Dataset, and compare its performance against other methods on the standard testing benchmark. The qualitative results show that the trajectories predicted by our model are accurate, diverse, and consistent with the road structure. The quantitative results on the standard benchmark show that our model outperforms the baseline methods in prediction accuracy and other evaluation metrics. The proposed framework placed second in the 2021 Waymo Open Dataset Motion Prediction Challenge.
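
To make the idea of distance-based attention concrete, the following is a minimal illustrative sketch and not the implementation used in this work. It assumes one possible form, a softmax over negative Euclidean distances, so that agents closer to the target receive larger weights. The function name `distance_attention`, the `temperature` parameter, and the plain-list feature representation are all hypothetical choices made for this example.

```python
import math


def distance_attention(target_pos, neighbor_pos, neighbor_feats, temperature=1.0):
    """Pool neighbor features with weights that decay with distance to the target.

    Assumed form (hypothetical, for illustration only): weights are a softmax
    over negative Euclidean distances divided by `temperature`.
    """
    if not neighbor_pos or len(neighbor_pos) != len(neighbor_feats):
        raise ValueError("need one feature vector per neighbor, and at least one neighbor")

    # Euclidean distance from the target agent to every neighboring agent.
    dists = [math.dist(target_pos, p) for p in neighbor_pos]

    # Closer agents get larger logits; subtract the max for numerical stability.
    logits = [-d / temperature for d in dists]
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Interaction context: attention-weighted sum of the neighbor feature vectors.
    dim = len(neighbor_feats[0])
    context = [
        sum(w * feat[k] for w, feat in zip(weights, neighbor_feats))
        for k in range(dim)
    ]
    return weights, context


# Usage: one neighbor 1 m away and one 10 m away from a target at the origin.
weights, context = distance_attention(
    target_pos=[0.0, 0.0],
    neighbor_pos=[[1.0, 0.0], [10.0, 0.0]],
    neighbor_feats=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this sketch the nearer agent dominates the pooled context vector; a learned model would typically combine such distance cues with learned feature projections rather than rely on distance alone.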