In this paper, we present XST-GCNN (eXplainable Spatio-Temporal Graph Convolutional Neural Network), a novel architecture for processing heterogeneous and irregular Multivariate Time Series (MTS) data. Our approach captures temporal and feature dependencies within a unified spatio-temporal pipeline by leveraging a GCNN that uses a spatio-temporal graph aimed at optimizing predictive accuracy and interoperability. For graph estimation, we introduce techniques, including one based on the (heterogeneous) Gower distance. Once estimated, we propose two methods for graph construction: one based on the Cartesian product, treating temporal instants homogeneously, and another spatio-temporal approach with distinct graphs per time step. We also propose two GCNN architectures: a standard GCNN with a normalized adjacency matrix and a higher-order polynomial GCNN. In addition to accuracy, we emphasize explainability by designing an inherently interpretable model and performing a thorough interpretability analysis, identifying key feature-time combinations that drive predictions. We evaluate XST-GCNN using real-world Electronic Health Record data from University Hospital of Fuenlabrada to predict Multidrug Resistance (MDR) in ICU patients, a critical healthcare challenge linked to high mortality and complex treatments. Our architecture outperforms traditional models, achieving a mean ROC-AUC score of 81.03 +- 2.43. Furthermore, the interpretability analysis provides actionable insights into clinical factors driving MDR predictions, enhancing model transparency. This work sets a benchmark for tackling complex inference tasks with heterogeneous MTS, offering a versatile, interpretable solution for real-world applications.