We treat 6-DoF localisation and dense 3D reconstruction in spatial environments as approximate Bayesian inference in a deep generative model that combines learned and engineered components. This principled treatment of uncertainty and probabilistic inference overcomes a shortcoming of current state-of-the-art solutions, namely their reliance on heavily engineered, heterogeneous pipelines. Variational inference enables us to use neural networks for system identification, while a differentiable raycaster serves as the emission model. This makes the model amenable to end-to-end gradient-based optimisation. We evaluate our approach on realistic unmanned aerial vehicle flight data, where it approaches the performance of a state-of-the-art visual-inertial odometry system. We also investigate the applicability of the learned model to downstream tasks such as generative prediction and planning.