The COVID-19 pandemic has caused millions of cases and deaths and the AI-related scientific community, after being involved with detecting COVID-19 signs in medical images, has been now directing the efforts towards the development of methods that can predict the progression of the disease. This task is multimodal by its very nature and, recently, baseline results achieved on the publicly available AIforCOVID dataset have shown that chest X-ray scans and clinical information are useful to identify patients at risk of severe outcomes. While deep learning has shown superior performance in several medical fields, in most of the cases it considers unimodal data only. In this respect, when, which and how to fuse the different modalities is an open challenge in multimodal deep learning. To cope with these three questions here we present a novel approach optimizing the setup of a multimodal end-to-end model. It exploits Pareto multi-objective optimization working with a performance metric and the diversity score of multiple candidate unimodal neural networks to be fused. We test our method on the AIforCOVID dataset, attaining state-of-the-art results, not only outperforming the baseline performance but also being robust to external validation. Moreover, exploiting XAI algorithms we figure out a hierarchy among the modalities and we extract the features' intra-modality importance, enriching the trust on the predictions made by the model.