Abstract:To combat the rising energy consumption of recommender systems, we implement a novel alternative to k-fold cross validation. This alternative, named e-fold cross validation, aims to minimize the number of folds to reduce power usage while keeping the reliability and robustness of the test results high. We tested our method with 5 recommender system algorithms across 6 datasets and compared it with 10-fold cross validation. On average, e-fold cross validation needed only 41.5% of the energy that 10-fold cross validation would need, while its results differed by only 1.81%. We conclude that e-fold cross validation is a promising approach with the potential to be an energy-efficient but still reliable alternative to k-fold cross validation.
Abstract:This paper introduces e-fold cross-validation, an energy-efficient alternative to k-fold cross-validation. It dynamically adjusts the number of folds based on a stopping criterion. The criterion checks after each fold whether the standard deviation of the evaluated folds has consistently decreased or remained stable. Once met, the process stops early. We tested e-fold cross-validation on 15 datasets and 10 machine-learning algorithms. On average, it required 4 fewer folds than 10-fold cross-validation, reducing evaluation time, computational resources, and energy use by about 40%. Performance differences between e-fold and 10-fold cross-validation were less than 2% for larger datasets. More complex models showed even smaller discrepancies. In 96% of iterations, the results were within the confidence interval, confirming statistical significance. E-fold cross-validation offers a reliable and efficient alternative to k-fold, reducing computational costs while maintaining accuracy.
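As a rough illustration of the stopping criterion described above, the following sketch runs folds one by one and stops once the standard deviation of the fold scores has decreased or remained stable for a few consecutive folds. The tolerance, patience, and minimum number of folds are assumed values for illustration, not the settings used in the paper.

```python
# Minimal sketch of an e-fold-style early-stopping loop (illustrative only).
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def e_fold_cv(model, X, y, score_fn, max_folds=10, patience=2, tol=1e-3):
    kf = KFold(n_splits=max_folds, shuffle=True, random_state=0)
    scores, stds, stable = [], [], 0
    for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
        m = clone(model).fit(X[train_idx], y[train_idx])
        scores.append(score_fn(y[test_idx], m.predict(X[test_idx])))
        stds.append(np.std(scores))
        if fold >= 3:  # assumed minimum number of folds before the std estimate is meaningful
            # criterion: std decreased or stayed (roughly) stable vs. the previous fold
            stable = stable + 1 if stds[-1] <= stds[-2] + tol else 0
            if stable >= patience:  # stop early once the criterion holds repeatedly
                break
    return float(np.mean(scores)), fold
```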
Abstract:As recommender systems become increasingly prevalent, the environmental impact and energy efficiency of training large-scale models have come under scrutiny. This paper investigates the potential for energy-efficient algorithm performance by optimizing dataset sizes through downsampling techniques in the context of Green Recommender Systems. We conducted experiments on the MovieLens 100K, 1M, 10M, and Amazon Toys and Games datasets, analyzing the performance of various recommender algorithms at different fractions of the original dataset size. Our results indicate that while more training data generally leads to higher algorithm performance, certain algorithms, such as FunkSVD and BiasedMF, maintain high-quality recommendations with up to a 50% reduction in training data, particularly on unbalanced and sparse datasets such as Amazon Toys and Games, achieving nDCG@10 scores within approximately 13% of full-dataset performance. These findings suggest that strategic dataset reduction can decrease computational and environmental costs without substantially compromising recommendation quality. This study advances sustainable and green recommender systems by providing insights for reducing energy consumption while maintaining effectiveness.
Abstract:The recommender systems algorithm selection problem for ranking prediction on implicit feedback datasets is under-explored. Traditional approaches in recommender systems algorithm selection focus predominantly on rating prediction on explicit feedback datasets, leaving a research gap for ranking prediction on implicit feedback datasets. Algorithm selection is a critical challenge for nearly every practitioner in recommender systems. In this work, we take the first steps toward addressing this research gap. We evaluate the NDCG@10 of 24 recommender systems algorithms, each with two hyperparameter configurations, on 72 recommender systems datasets. We train four optimized machine-learning meta-models and one automated machine-learning meta-model with three different settings on the resulting meta-dataset. Our results show that the predictions of all tested meta-models exhibit a median Spearman correlation ranging from 0.857 to 0.918 with the ground truth. We show that the median Spearman correlation between meta-model predictions and the ground truth increases by an average of 0.124 when the meta-model is optimized to predict the ranking of algorithms instead of their performance. Furthermore, in terms of predicting the best algorithm for an unknown dataset, we demonstrate that the best optimized traditional meta-model, i.e., XGBoost, achieves a recall of 48.6%, outperforming the best tested automated machine-learning meta-model, i.e., AutoGluon, which achieves a recall of 47.2%.
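For illustration, the Spearman-correlation comparison between a meta-model's predicted algorithm performance and the ground truth can be computed as in the sketch below; the algorithm names and NDCG@10 values are made up for demonstration and are not results from the paper.

```python
# Compare a hypothetical predicted ranking with a hypothetical ground-truth ranking.
from scipy.stats import spearmanr

ground_truth = {"ItemKNN": 0.31, "BPR": 0.38, "ALS": 0.35, "Popularity": 0.22}
meta_model_prediction = {"ItemKNN": 0.29, "BPR": 0.40, "ALS": 0.33, "Popularity": 0.25}

algorithms = list(ground_truth)
rho, _ = spearmanr([ground_truth[a] for a in algorithms],
                   [meta_model_prediction[a] for a in algorithms])
print(f"Spearman correlation with the ground truth: {rho:.3f}")
```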
Abstract:As global warming accelerates, evaluating the environmental impact of research is more critical now than ever before. However, we find that few to no recommender systems research papers document their impact on the environment. Consequently, in this paper, we conduct a comprehensive analysis of the environmental impact of recommender systems research by reproducing a characteristic recommender systems experimental pipeline. We focus on estimating the carbon footprint of recommender systems research papers, highlighting the evolution of the environmental impact of recommender systems research experiments over time. We thoroughly evaluated all 79 full papers from the ACM RecSys conference in the years 2013 and 2023 to analyze representative experimental pipelines for papers utilizing traditional, so-called good old-fashioned AI algorithms and deep learning algorithms, respectively. We reproduced these representative experimental pipelines, measured electricity consumption using a hardware energy meter, and converted the measured energy consumption into CO2 equivalents to estimate the environmental impact. Our results show that a recommender systems research paper utilizing deep learning algorithms emits approximately 42 times more CO2 equivalents than a paper utilizing traditional algorithms. Furthermore, on average, such a paper produces 3,297 kilograms of CO2 equivalents, which is more than one person produces by flying from New York City to Melbourne or the amount one tree sequesters in 300 years.
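The conversion from measured energy to CO2 equivalents mentioned above amounts to multiplying the metered consumption by a grid carbon-intensity factor; the sketch below uses assumed example numbers rather than the paper's measurements or conversion factor.

```python
# Back-of-the-envelope energy-to-CO2e conversion (all values are assumed examples).
energy_kwh = 120.0        # hypothetical consumption read from a hardware energy meter
kg_co2e_per_kwh = 0.4     # assumed grid carbon intensity; varies by country and year
co2e_kg = energy_kwh * kg_co2e_per_kwh
print(f"Estimated footprint: {co2e_kg:.1f} kg CO2e")  # 48.0 kg CO2e in this example
```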
Abstract:Automated Machine Learning (AutoML) has greatly advanced applications of Machine Learning (ML), including model compression, machine translation, and computer vision. Recommender Systems (RecSys) can be seen as an application of ML. Yet, AutoML has found little attention in the RecSys community; nor has RecSys found notable attention in the AutoML community. Only a few relatively simple Automated Recommender Systems (AutoRecSys) libraries exist that adopt AutoML techniques. However, these libraries are based on student projects and do not offer the features and thorough development of AutoML libraries. We set out to determine how AutoML libraries perform in the scenario of an inexperienced user who wants to implement a recommender system. We compared the predictive performance of 60 AutoML, AutoRecSys, ML, and RecSys algorithms from 15 libraries, including a mean predictor baseline, on 14 explicit feedback RecSys datasets. To simulate the perspective of an inexperienced user, the algorithms were evaluated with default hyperparameters. We found that AutoML and AutoRecSys libraries performed best. AutoML libraries performed best for six of the 14 datasets (43%), but it was not always the same AutoML library performing best. The single best library was the AutoRecSys library Auto-Surprise, which performed best on five datasets (36%). On three datasets (21%), AutoML libraries performed poorly, and RecSys libraries with default parameters performed best. Although RecSys algorithms obtained 50% of all placements in the top five per dataset, they fell behind AutoML on average. ML algorithms generally performed the worst.
Abstract:Automated machine learning (AutoML) systems commonly ensemble models post hoc to improve predictive performance, typically via greedy ensemble selection (GES). However, we believe that GES may not always be optimal, as it performs a simple deterministic greedy search. In this work, we introduce two novel population-based ensemble selection methods, QO-ES and QDO-ES, and compare them to GES. While QO-ES optimises solely for predictive performance, QDO-ES also considers the diversity of ensembles within the population, maintaining a diverse set of well-performing ensembles during optimisation based on ideas of quality diversity optimisation. The methods are evaluated using 71 classification datasets from the AutoML benchmark, demonstrating that QO-ES and QDO-ES often outperform GES, although the differences are statistically significant only on validation data. Our results further suggest that diversity can be beneficial for post hoc ensembling but also increases the risk of overfitting.
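For reference, the deterministic greedy search performed by GES can be sketched roughly as below, in the spirit of Caruana et al. (2004); QO-ES and QDO-ES replace this loop with population-based optimisation. The array shapes and the fixed iteration budget are assumptions for illustration.

```python
# Minimal sketch of greedy ensemble selection (GES) on stored validation predictions.
import numpy as np

def greedy_ensemble_selection(val_preds, y_val, loss_fn, n_iterations=50):
    # val_preds: shape (n_models, n_samples, n_classes); loss_fn: e.g., log loss
    counts = np.zeros(len(val_preds), dtype=int)   # how often each model was selected
    running_sum = np.zeros_like(val_preds[0])      # sum of predictions of selected models
    for _ in range(n_iterations):
        # pick, with replacement, the model whose addition minimises the validation loss
        losses = [loss_fn(y_val, (running_sum + p) / (counts.sum() + 1)) for p in val_preds]
        best = int(np.argmin(losses))
        running_sum += val_preds[best]
        counts[best] += 1
    return counts / counts.sum()                   # final ensemble weights
```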
Abstract:Automated Machine Learning (AutoML) frameworks regularly use ensembles. Developers need to compare different ensemble techniques to select appropriate ones for an AutoML framework from the many potential candidates. So far, the comparison of ensemble techniques is often computationally expensive, because many base models must be trained and evaluated one or multiple times. Therefore, we present Assembled-OpenML, a Python tool that builds meta-datasets for ensembles using OpenML. A meta-dataset, called Metatask, consists of the data of an OpenML task, the task's dataset, and prediction data from model evaluations for the task. We can make the comparison of ensemble techniques computationally cheaper by using the predictions stored in a Metatask instead of training and evaluating base models. To introduce Assembled-OpenML, we describe the first version of our tool. Moreover, we present an example of using Assembled-OpenML to compare a set of ensemble techniques. For this example comparison, we built a benchmark using Assembled-OpenML and implemented ensemble techniques that expect predictions instead of base models as input. In our example comparison, we gathered the prediction data of 1,523 base models for 31 datasets. Obtaining the prediction data for all base models using Assembled-OpenML took approximately 1 hour in total. In comparison, obtaining the prediction data by training and evaluating just one base model on the most computationally expensive dataset took approximately 37 minutes.
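The core idea, evaluating ensemble techniques on stored prediction data instead of retraining base models, can be sketched as follows. The data structure and function names here are hypothetical and do not reflect the actual Assembled-OpenML API.

```python
# Illustrative stand-in for a metatask: stored labels plus base-model predictions.
from dataclasses import dataclass
import numpy as np

@dataclass
class MetataskSketch:
    y_true: np.ndarray      # ground-truth labels of the task's evaluation split
    base_preds: np.ndarray  # shape (n_models, n_samples, n_classes), stored prediction data

def evaluate_uniform_average(mt: MetataskSketch) -> float:
    # a trivial "ensemble technique": average all stored base-model probabilities
    ensemble = mt.base_preds.mean(axis=0)
    # accuracy is computed without training or evaluating any base model again
    return float((ensemble.argmax(axis=1) == mt.y_true).mean())
```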
Abstract:Many state-of-the-art automated machine learning (AutoML) systems use greedy ensemble selection (GES) by Caruana et al. (2004) to ensemble models found during model selection post hoc. This boosts predictive performance and likely follows Auto-Sklearn 1's insight that alternatives, such as stacking or gradient-free numerical optimization, overfit. Overfitting in Auto-Sklearn 1 is much more likely than in other AutoML systems because it uses only low-quality validation data for post hoc ensembling. Therefore, we were motivated to analyze whether Auto-Sklearn 1's insight holds true for systems with higher-quality validation data. Consequently, we compared the performance of covariance matrix adaptation evolution strategy (CMA-ES), a state-of-the-art gradient-free numerical optimizer, to GES on the 71 classification datasets from the AutoML benchmark for AutoGluon. We found that Auto-Sklearn's insight depends on the chosen metric. For the metric ROC AUC, CMA-ES overfits drastically and is outperformed by GES -- statistically significantly for multi-class classification. For the metric balanced accuracy, CMA-ES does not overfit and outperforms GES significantly. Motivated by the successful application of CMA-ES for balanced accuracy, we explored methods to stop CMA-ES from overfitting for ROC AUC. We propose a method to normalize the weights produced by CMA-ES, inspired by GES, that avoids overfitting and makes CMA-ES perform better than, or comparably to, GES for ROC AUC.
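The abstract does not spell out the normalization scheme, so the sketch below only illustrates one plausible GES-inspired mapping of raw CMA-ES weight vectors onto non-negative weights that sum to one; the paper's actual method may differ.

```python
# One possible GES-inspired normalization of a raw CMA-ES weight vector (illustrative).
import numpy as np

def normalize_like_ges(raw_weights: np.ndarray) -> np.ndarray:
    w = np.clip(raw_weights, 0.0, None)      # GES never assigns negative weight to a model
    if w.sum() == 0.0:                       # degenerate case: fall back to a uniform ensemble
        return np.full(len(raw_weights), 1.0 / len(raw_weights))
    return w / w.sum()                       # weights sum to one, like GES's selection counts
```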
Abstract:Recommendation algorithms perform differently if the users, recommendation contexts, applications, and user interfaces vary even slightly. It is similarly observed in other fields, such as combinatorial problem solving, that algorithms perform differently for each instance presented. In those fields, meta-learning is successfully used to predict an optimal algorithm for each instance and thereby improve overall system performance. Per-instance algorithm selection has thus far been unsuccessful for recommender systems. In this paper, we propose a per-instance meta-learner that clusters data instances and predicts the best algorithm for unseen instances according to cluster membership. We test our approach using 10 collaborative- and 4 content-based filtering algorithms, for varying clustering parameters, and find a significant improvement over the best-performing base algorithm at alpha=0.053 (MAE: 0.7107 vs. LightGBM 0.7214; t-test). We also explore the performance of our base algorithms on a ratings dataset and empirically show that the error of a perfect algorithm selector decreases monotonically for larger pools of algorithms. To the best of our knowledge, this is the first effective meta-learning technique for per-instance algorithm selection in recommender systems.
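A minimal sketch of the clustering-based selection idea, assuming precomputed per-instance absolute errors for every base algorithm and simplified instance features, could look as follows; the feature construction and hyperparameters are illustrative assumptions rather than the paper's setup.

```python
# Cluster training instances, pick the best base algorithm per cluster, and route
# unseen instances to that algorithm via cluster membership (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

def fit_cluster_selector(X_train, per_algo_abs_errors, n_clusters=8, seed=0):
    # per_algo_abs_errors: shape (n_algorithms, n_train_instances)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(X_train)
    best_algo_per_cluster = np.array([
        per_algo_abs_errors[:, km.labels_ == c].mean(axis=1).argmin()
        for c in range(n_clusters)
    ])
    return km, best_algo_per_cluster

def select_algorithm(km, best_algo_per_cluster, X_new):
    # returns one base-algorithm index per unseen instance
    return best_algo_per_cluster[km.predict(X_new)]
```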