Machine Learning (ML) has emerged as a promising approach in healthcare, outperforming traditional statistical techniques. However, to establish ML as a reliable tool in clinical practice, adherence to best practices in data handling, experimental design, and model evaluation is crucial. This work summarizes and strictly observes such practices to ensure reproducible and reliable ML. Specifically, we focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare. We investigate the impact of different data augmentation techniques and of model complexity on overall performance. We use MRI data from the ADNI dataset to address a classification problem with a 3D Convolutional Neural Network (CNN). The experiments are designed to compensate for data scarcity and for random parameter initialization by using cross-validation and multiple training trials. Within this framework, we train 15 predictive models, combining three data augmentation strategies with five 3D CNN architectures that differ in the number of convolutional layers. The augmentation strategies are based on affine transformations, such as zoom, shift, and rotation, applied either concurrently or separately. The combined effect of data augmentation and model complexity leads to a variation in prediction accuracy of up to 10%. When the affine transformations are applied separately, the model is more accurate, regardless of the adopted architecture. For all strategies, model accuracy follows a concave trend as the number of convolutional layers increases, peaking at an intermediate number of layers. The best model (8 convolutional layers, augmentation strategy (B)) is the most stable across cross-validation folds and training trials, reaching excellent performance on both the testing set and an external test set.
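
The experimental design described above (three augmentation strategies crossed with five architectures, each evaluated over cross-validation folds and repeated training trials) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the strategy labels, the list of layer counts, the number of folds and trials, and the `train_and_score` callable are all hypothetical placeholders, since the abstract fixes only the counts (3 and 5) and names one configuration (8 convolutional layers, strategy (B)).

```python
from itertools import product
from statistics import mean, stdev

# Hypothetical placeholders: only the counts (3 strategies, 5 depths) come from the text.
AUGMENTATIONS = ["A", "B", "C"]   # assumed labels for the three strategies
CONV_LAYERS = [2, 4, 6, 8, 10]    # assumed depths; only 8 is named in the abstract
N_FOLDS = 5                       # assumed number of cross-validation folds
N_TRIALS = 3                      # assumed number of training trials per fold


def run_experiment(train_and_score, augmentations=AUGMENTATIONS,
                   depths=CONV_LAYERS, n_folds=N_FOLDS, n_trials=N_TRIALS):
    """Evaluate every (augmentation, depth) pair over all folds and trials.

    train_and_score(aug, depth, fold, trial) is a user-supplied callable that
    trains one model and returns its accuracy as a float.
    Returns a dict mapping (aug, depth) -> (mean accuracy, standard deviation).
    """
    results = {}
    for aug, depth in product(augmentations, depths):
        accuracies = [
            train_and_score(aug, depth, fold, trial)
            for fold in range(n_folds)
            for trial in range(n_trials)
        ]
        # The spread across folds and trials is a simple proxy for model stability.
        results[(aug, depth)] = (mean(accuracies), stdev(accuracies))
    return results


def best_configuration(results):
    """Return the (aug, depth) key with the highest mean accuracy."""
    return max(results, key=lambda key: results[key][0])
```

Reporting the standard deviation alongside the mean makes it possible to compare configurations on stability as well as accuracy, which is the sense in which the abstract calls the best model "the most stable".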