While data selection methods have been studied extensively in active learning, data pruning, and data augmentation settings, there is little evidence for the efficacy of these methods at industry scale, particularly for low-resource languages. Our work presents ways of assessing prospective training examples in such settings for their "usefulness" or "difficulty". We also demonstrate how these measures can be used to select important examples for training supervised machine learning models. We primarily experiment with entropy and Error L2-Norm (EL2N) scores. We use these metrics to curate high-quality datasets from a large pool of \textit{Weak Signal Labeled} data, in which high-confidence, no-defect hypotheses produced at inference time are treated as ground-truth labels. We then conduct training data augmentation experiments using these de-identified datasets and demonstrate that score-based selection can yield a 2% decrease in semantic error rate and a 4%-7% decrease in domain classification error rate relative to the baseline technique of random selection.
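For reference, a sketch of the two scores under their commonly used definitions (the paper's exact formulation may differ): assuming a model with softmax output $\hat{p}(x)$ over $C$ classes and a one-hot ground-truth label $y$,
\[
H(x) = -\sum_{c=1}^{C} \hat{p}_c(x) \log \hat{p}_c(x),
\qquad
\mathrm{EL2N}(x, y) = \left\lVert \hat{p}(x) - y \right\rVert_2 .
\]
Higher entropy indicates greater model uncertainty on an example, and a larger EL2N score indicates a larger per-example prediction error; both serve as proxies for example difficulty when ranking candidates for selection.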