In today's day and age when almost every industry has an online presence with users interacting in online marketplaces, personalized recommendations have become quite important. Traditionally, the problem of collaborative filtering has been tackled using Matrix Factorization which is linear in nature. We extend the work of [11] on using variational autoencoders (VAEs) for collaborative filtering with implicit feedback by proposing a hybrid, multi-modal approach. Our approach combines movie embeddings (learned from a sibling VAE network) with user ratings from the Movielens 20M dataset and applies it to the task of movie recommendation. We empirically show how the VAE network is empowered by incorporating movie embeddings. We also visualize movie and user embeddings by clustering their latent representations obtained from a VAE.