Many problems in machine learning and related application areas are fundamentally variants of conditional modeling and sampling across multi-aspect data, either multi-view, multi-modal, or simply multi-group. For example, sampling from the distribution of English sentences conditioned on a given French sentence or sampling audio waveforms conditioned on a given piece of text. Central to many of these problems is the issue of missing data: we can observe many English, French, or German sentences individually but only occasionally do we have data for a sentence pair. Motivated by these applications and inspired by recent progress in variational autoencoders for grouped data, we develop factVAE, a deep generative model capable of handling multi-aspect data, robust to missing observations, and with a prior that encourages disentanglement between the groups and the latent dimensions. The effectiveness of factVAE is demonstrated on a variety of rich real-world datasets, including motion capture poses and pictures of faces captured from varying poses and perspectives.