Ensemble methods are generally regarded to be better than a single model if the base learners are deemed to be "accurate" and "diverse." Here we investigate a semi-supervised ensemble learning strategy to produce generalizable blind image quality assessment models. We train a multi-head convolutional network for quality prediction by maximizing the accuracy of the ensemble (as well as the base learners) on labeled data, and the disagreement (i.e., diversity) among them on unlabeled data, both implemented by the fidelity loss. We conduct extensive experiments to demonstrate the advantages of employing unlabeled data for BIQA, especially in model generalization and failure identification.