Christophe G. Lambert

PULSNAR -- Positive unlabeled learning selected not at random: class proportion estimation when the SCAR assumption does not hold

Mar 14, 2023

Praveen Kumar, Christophe G. Lambert

Abstract: Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification in which the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the selected completely at random (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed). This leads to a poor estimate of the proportion, $\alpha$, of positives among unlabeled examples and to poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms can estimate $\alpha$, the probability that an individual unlabeled instance is positive, or both. We propose two PU learning algorithms to estimate $\alpha$, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR uses a divide-and-conquer approach that creates and solves several SCAR-like sub-problems using PULSCAR. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.
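
To make the SCAR assumption concrete, the following is a minimal simulation sketch. It is not the PULSCAR or PULSNAR algorithm. It only generates toy data in which positives are labeled either independently of a feature (SCAR) or with a probability that grows with the feature (not at random), and then compares the labeled positives with the positives left in the unlabeled set. The single "severity" feature, the Gaussian class distributions, the labeling functions, and all names (`simulate`, `scar`, `snar`) are assumptions made for illustration and do not come from the paper.

```python
import math
import random


def scar(x):
    """Hypothetical SCAR labeler: constant labeling probability, ignores x."""
    return 0.3


def snar(x):
    """Hypothetical non-SCAR labeler: higher x ("severity") is labeled more often."""
    return 1.0 / (1.0 + math.exp(-(x - 2.0)))


def simulate(labeler, n_pos=5000, n_neg=15000, seed=0):
    """Return feature lists (labeled_pos, unlabeled_pos, unlabeled_neg)."""
    rng = random.Random(seed)
    labeled, unl_pos = [], []
    for _ in range(n_pos):
        x = rng.gauss(2.0, 1.0)  # assumed feature distribution of positives
        if rng.random() < labeler(x):
            labeled.append(x)
        else:
            unl_pos.append(x)
    unl_neg = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]  # assumed negatives
    return labeled, unl_pos, unl_neg


def mean(values):
    return sum(values) / len(values)


def summarize(labeler):
    """Return (true alpha, mean feature gap between labeled and unlabeled positives)."""
    labeled, unl_pos, unl_neg = simulate(labeler)
    # alpha: true fraction of positives among the unlabeled examples
    alpha = len(unl_pos) / (len(unl_pos) + len(unl_neg))
    gap = mean(labeled) - mean(unl_pos)
    return alpha, gap


alpha_scar, gap_scar = summarize(scar)
alpha_snar, gap_snar = summarize(snar)

print(f"SCAR: alpha={alpha_scar:.3f}, labeled-vs-unlabeled positive gap={gap_scar:.3f}")
print(f"SNAR: alpha={alpha_snar:.3f}, labeled-vs-unlabeled positive gap={gap_snar:.3f}")
```

Under the SCAR labeler the gap should be close to zero, so the labeled positives are a representative sample of all positives. Under the non-SCAR labeler the labeled positives are shifted toward high feature values, which is the kind of mismatch the abstract describes as degrading $\alpha$ estimates for methods that assume SCAR.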
