State-of-the-art music recommender systems rely mainly on either matrix factorization-based collaborative filtering or deep learning architectures. Deep learning models typically use metadata for content-based filtering, or predict the next user interaction by learning from temporal sequences of user actions. Despite these advances in deep learning for song recommendation, no existing approach has exploited the sequential nature of songs themselves by learning sequence models over their content. Beyond prediction accuracy, other aspects also matter, such as explainability and the cold-start problem. In this work, we propose a hybrid deep learning architecture, called "SeER", that combines collaborative filtering (CF) with deep learning sequence models applied to the MIDI content of songs. Its goals are to provide more accurate personalized recommendations, to solve the item cold-start problem, and to generate a relevant explanation for each song recommendation. Our evaluation experiments show promising results in terms of ranking metrics when compared with state-of-the-art baseline and hybrid song recommender systems.
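To make the hybrid idea concrete, the following is a minimal sketch, not the SeER implementation. It assumes that each song is available as a sequence of numeric frames derived from its MIDI content, that a plain tanh recurrent network stands in for whatever sequence model is actually used, and that a user's preference is scored by a dot product between a latent user vector and the encoded song. All names (`encode_song`, `score`, `W_in`, `W_rec`), dimensions, and the random weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_song(frames, W_in, W_rec):
    """Fold a (T, d_in) sequence of MIDI-derived frames into one latent vector.

    Assumption: a plain tanh RNN is used here purely for illustration.
    """
    h = np.zeros(W_rec.shape[0])
    for x in frames:
        h = np.tanh(W_in @ x + W_rec @ h)
    return h

def score(user_vec, frames, W_in, W_rec):
    """CF-style preference score: sigmoid of user vector dotted with song encoding."""
    song_vec = encode_song(frames, W_in, W_rec)
    return 1.0 / (1.0 + np.exp(-(user_vec @ song_vec)))

# Hypothetical sizes: 4 features per frame, 8 latent dimensions.
d_in, d_latent = 4, 8
W_in = rng.normal(scale=0.1, size=(d_latent, d_in))
W_rec = rng.normal(scale=0.1, size=(d_latent, d_latent))
user_vec = rng.normal(size=d_latent)

# A song with no interaction history can still be scored,
# because its representation comes from content alone.
new_song = rng.random((16, d_in))
s = score(user_vec, new_song, W_in, W_rec)
```

In a trained system the user vectors and encoder weights would be learned jointly from interaction data rather than drawn at random. The sketch only illustrates why a content-derived song representation sidesteps the item cold-start problem.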