In this work, we propose to model the interaction between visual and textual features for multi-modal neural machine translation (MMT) through a latent variable model. This latent variable can be seen as a stochastic embedding; it is used in the target-language decoder and also to predict image features. Importantly, even though our model formulation captures correlations between visual and textual features, it does not require images to be available at test time. We show that our latent variable MMT formulation improves considerably over strong baselines, including the multi-task learning approach of Elliott and Kadar (2017) and the conditional variational auto-encoder approach of Toyama et al. (2016). Finally, in an ablation study we show that (i) predicting image features in addition to conditioning on them and (ii) imposing a constraint on the minimum amount of information encoded in the latent variable both lead to slight improvements in translation quality.
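As a rough sketch of the training objective this description implies (the notation below is ours and the exact factorisation may differ from the model's): let $x$ denote the source sentence, $y$ the target sentence, $v$ the image features, and $z$ the latent variable. A conditional latent-variable formulation in which $z$ both feeds the target-language decoder and predicts image features would be trained with an evidence lower bound of the form

$$
\log p_\theta(y, v \mid x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x, y, v)}\!\left[\log p_\theta(y \mid x, z) + \log p_\theta(v \mid z)\right] \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x, y, v) \,\|\, p_\theta(z \mid x)\right),
$$

where the constraint on the minimum amount of information encoded in $z$ can be realised by replacing the KL term with $\max\!\left(\lambda,\ \mathrm{KL}(q_\phi \,\|\, p_\theta)\right)$ for some minimum rate $\lambda$ (a free-bits-style constraint; $\lambda$ is an assumed hyperparameter). Because the approximate posterior $q_\phi$, which conditions on $v$, is only needed during training, decoding at test time can rely on the prior $p_\theta(z \mid x)$ and therefore does not require images.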