Recently, there has been a growing interest in applying machine learning methods to problems in engineering mechanics. In particular, there has been significant interest in applying deep learning techniques to predicting the mechanical behavior of heterogeneous materials and structures. Researchers have shown that deep learning methods are able to effectively predict mechanical behavior with low error for systems ranging from engineered composites, to geometrically complex metamaterials, to heterogeneous biological tissue. However, there has been comparatively little attention paid to deep learning model calibration, i.e., the match between predicted probabilities of outcomes and the true probabilities of outcomes. In this work, we perform a comprehensive investigation into ML model calibration across seven open access engineering mechanics datasets that cover three distinct types of mechanical problems. Specifically, we evaluate both model and model calibration error for multiple machine learning methods, and investigate the influence of ensemble averaging and post hoc model calibration via temperature scaling. Overall, we find that ensemble averaging of deep neural networks is both an effective and consistent tool for improving model calibration, while temperature scaling has comparatively limited benefits. Looking forward, we anticipate that this investigation will lay the foundation for future work in developing mechanics specific approaches to deep learning model calibration.