We present multimodal neural posterior estimation (MultiNPE), a method for integrating heterogeneous data from multiple sources in simulation-based inference with neural networks. Inspired by advances in attention-based deep fusion learning, MultiNPE enables researchers to analyze data from different domains and to infer the parameters of complex mathematical models with increased accuracy. We formulate three multimodal fusion approaches for MultiNPE (early, late, and hybrid) and evaluate their performance in three challenging numerical experiments. MultiNPE not only outperforms na\"ive baselines on a benchmark model, but also achieves superior inference on representative scientific models from neuroscience and cardiology. In addition, we systematically investigate how partially missing data affects each fusion strategy. Across our experiments, late and hybrid fusion emerge as the methods of choice for practical applications of multimodal simulation-based inference.