Objective: In this work, we study the problem of cross-subject motor imagery (MI) decoding from electroenchephalography (EEG) data. Multi-subject EEG datasets present several kinds of domain shifts due to various inter-individual differences (e.g. brain anatomy, personality and cognitive profile). These domain shifts render multi-subject training a challenging task and also impede robust cross-subject generalization. Method: We propose a two-stage model ensemble architecture, built with multiple feature extractors (first stage) and a shared classifier (second stage), which we train end-to-end with two loss terms. The first loss applies curriculum learning, forcing each feature extractor to specialize to a subset of the training subjects and promoting feature diversity. The second loss is an intra-ensemble distillation objective that allows collaborative exchange of knowledge between the models of the ensemble. Results: We compare our method against several state-of-the-art techniques, conducting subject-independent experiments on two large MI datasets, namely Physionet and OpenBMI. Our algorithm outperforms all of the methods in both 5-fold cross-validation and leave-one-subject-out evaluation settings, using a substantially lower number of trainable parameters. Conclusion: We demonstrate that our model ensembling approach combining the powers of curriculum learning and collaborative training, leads to high learning capacity and robust performance. Significance: Our work addresses the issue of domain shifts in multi-subject EEG datasets, paving the way for calibration-free BCI systems.