Title: Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

Apr 01, 2021

Florian Pargent, Florian Pfisterer, Janek Thomas, Bernd Bischl

Abstract: Because most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial step in data analysis. A frequently encountered problem is high cardinality features, i.e. unordered categorical predictor variables with a large number of levels. We study techniques that yield numeric representations of categorical variables, which can then be used in subsequent ML applications. We focus on the impact of those techniques on a subsequent algorithm's predictive performance and, where possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbours, support vector machine) using datasets from regression, binary classification, and multiclass classification settings. Throughout our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditional encodings were not as effective. These either make unreasonable assumptions to map levels to integers (e.g. integer encoding) or reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding).
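
To make the idea of regularized target encoding concrete, the following is a minimal sketch for a numeric or 0/1 target. It is not the authors' implementation. The function names, the additive `smoothing` parameter that shrinks each level's mean toward the global mean, and the round-robin fold assignment are all assumptions chosen for illustration. Each training row is encoded using only the other folds, so a row's own target value never enters its encoding.

```python
from collections import defaultdict


def fit_smoothed_means(levels, targets, smoothing=1.0):
    """Return (level -> shrunken target mean, global mean).

    Hypothetical helper: each level's mean is pulled toward the global
    mean, more strongly for levels with few observations.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    encoding = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return encoding, global_mean


def out_of_fold_target_encode(levels, targets, n_folds=5, smoothing=1.0):
    """Encode each row with statistics computed on the other folds only.

    Folds are assigned round-robin by row index (an assumption made to
    keep the sketch deterministic; shuffle the rows first in practice).
    """
    n = len(levels)
    if n_folds < 2 or n < n_folds:
        raise ValueError("need at least 2 folds and at least one row per fold")
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        encoding, global_mean = fit_smoothed_means(
            [levels[i] for i in train],
            [targets[i] for i in train],
            smoothing,
        )
        for i in held_out:
            # Levels not seen in the training folds fall back to the global mean.
            encoded[i] = encoding.get(levels[i], global_mean)
    return encoded


if __name__ == "__main__":
    levels = ["a", "a", "b", "b"]
    targets = [1.0, 0.0, 1.0, 1.0]
    print(out_of_fold_target_encode(levels, targets, n_folds=2))
```

For new data at prediction time, one would typically call `fit_smoothed_means` on the full training set and map unseen levels to the global mean. Larger `smoothing` values give stronger regularization for rare levels.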
