Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Juan Martín Román

Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees

Oct 20, 2022

Moacir Antonelli Ponti, Lucas de Angelis Oliveira, Juan Martín Román, Luis Argerich

Figure 1 for Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees

Figure 2 for Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees

Figure 3 for Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees

Figure 4 for Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees

Abstract:Real world datasets contain incorrectly labeled instances that hamper the performance of the model and, in particular, the ability to generalize out of distribution. Also, each example might have different contribution towards learning. This motivates studies to better understanding of the role of data instances with respect to their contribution in good metrics in models. In this paper we propose a method based on metrics computed from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which the use of Decision Trees ensembles are still the state-of-the-art in terms of performance. We show results on detecting noisy labels in order to either remove them, improving models' metrics in synthetic and real datasets, as well as a productive dataset. Our methods achieved the best results overall when compared with confident learning and heuristics.

Via

Access Paper or Ask Questions