For numerous earth observation applications, one may benefit from various satellite sensors to address the reconstruction of some process or information of interest. A variety of satellite sensors deliver observation data with different sampling patterns due satellite orbits and/or their sensitivity to atmospheric conditions (e.g., clour cover, heavy rains,...). Beyond the ability to account for irregularly-sampled observations, the definition of model-driven inversion methods is often limited to specific case-studies where one can explicitly derive a physical model to relate the different observation sources. Here, we investigate how end-to-end learning schemes provide new means to address multimodal inversion problems. The proposed scheme combines a variational formulation with trainable observation operators, {\em a priori} terms and solvers. Through an application to space oceanography, we show how this scheme can successfully extract relevant information from satellite-derived sea surface temperature images and enhance the reconstruction of sea surface currents issued from satellite altimetry data.