In the fields of Experimental and Computational Aesthetics, numerous image datasets have been created over the last two decades. In the present work, we provide a comparative overview of twelve image datasets that include aesthetic ratings (beauty, liking or aesthetic quality) and investigate the reproducibility of results across different datasets. Specifically, we examine how consistently the ratings can be predicted by using either (A) a set of 20 previously studied statistical image properties, or (B) the layers of a convolutional neural network developed for object recognition. Our findings reveal substantial variation in the predictability of aesthetic ratings across the different datasets. However, consistent similarities were found for datasets containing either photographs or paintings, suggesting different relevant features in the aesthetic evaluation of these two image genres. To our surprise, statistical image properties and the convolutional neural network predict aesthetic ratings with similar accuracy, highlighting a significant overlap in the image information captured by the two methods. Nevertheless, the discrepancies between the datasets call into question the generalizability of previous research findings on single datasets. Our study underscores the importance of considering multiple datasets to improve the validity and generalizability of research results in the fields of experimental and computational aesthetics.