Abstract: Human data annotation is critical in shaping the quality of machine learning (ML) and artificial intelligence (AI) systems. One significant challenge in this context is posed by annotation errors, as their effects can degrade the performance of ML models. This paper presents a predictive error model trained to detect potential errors in search relevance annotation tasks for three industry-scale ML applications (music streaming, video streaming, and mobile apps) and assesses its potential to enhance the quality and efficiency of the data annotation process. Drawing on real-world data from an extensive search relevance annotation program, we show that errors can be predicted with moderate model performance (AUC=0.65-0.75) and that model performance generalizes well across applications (i.e., a global, task-agnostic model performs on par with task-specific models). We present model explainability analyses to identify which types of features are the main drivers of predictive performance. Additionally, we demonstrate the usefulness of the model in the context of auditing, where prioritizing tasks with high predicted error probabilities considerably increases the number of corrected annotation errors (e.g., 40% efficiency gains for the music streaming application). These results underscore that automated error detection models can yield considerable improvements in the efficiency and quality of data annotation processes. Thus, our findings reveal critical insights into effective error management in the data annotation process, thereby contributing to the broader field of human-in-the-loop ML.
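The abstract does not include an implementation, so the following is only a minimal sketch of the auditing idea it describes: train a binary error classifier on task-level features and rank tasks by predicted error probability instead of auditing a random sample. The feature names, the gradient-boosted classifier, and the synthetic data are all assumptions, not the paper's actual model.

```python
# Sketch (not the paper's implementation): rank annotation tasks for auditing
# by predicted error probability and compare against random auditing.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_tasks = 5000
# Hypothetical task-level features (e.g., annotator agreement, task duration, query length).
X = rng.normal(size=(n_tasks, 8))
# Synthetic labels: 1 = annotation later found to be erroneous.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=n_tasks) > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

p_error = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, p_error))

# Audit the top-k tasks with the highest predicted error probability;
# the efficiency gain is how many more errors this catches than a random audit.
k = int(0.2 * len(y_test))
top_k = np.argsort(p_error)[::-1][:k]
random_k = rng.choice(len(y_test), size=k, replace=False)
print("errors caught (model-prioritized):", int(y_test[top_k].sum()))
print("errors caught (random audit):", int(y_test[random_k].sum()))
```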
Abstract: We extend the graph convolutional network method for deep learning on graph data to higher-order neighborhoods. To construct the representation of a node, we include not only the features of the node and its immediate neighbors but also those of more distant nodes. In experiments on several publicly available citation graph datasets, we show that visiting higher-order neighbors pays off: the extended model outperforms the original one, especially when only a limited number of labeled data points are available for training.
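As a rough illustration of the higher-order idea (and only that; the authors' exact formulation is not given in the abstract), a standard GCN layer propagates features once through the normalized adjacency matrix, while a second-order variant can additionally mix in two-hop propagation. The weight-sharing scheme and toy graph below are assumptions for the sketch.

```python
# Sketch of higher-order neighborhood aggregation: combine 1-hop and 2-hop
# propagation through the symmetrically normalized adjacency matrix.
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def higher_order_gcn_layer(A_hat, H, W1, W2):
    """One layer: first-order plus second-order propagation, then ReLU."""
    H1 = A_hat @ H @ W1            # immediate neighbors
    H2 = (A_hat @ A_hat) @ H @ W2  # neighbors of neighbors (2-hop)
    return np.maximum(H1 + H2, 0.0)

# Toy graph: 4 nodes on a path, 3 input features, 2 hidden units.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(3, 2))

print(higher_order_gcn_layer(normalize_adjacency(A), H, W1, W2))
```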
Abstract: In this paper, we employed a transfer learning technique to predict the Nusselt number for natural convection flows in enclosures. Specifically, we numerically simulated a benchmark problem in square enclosures described by the Rayleigh and Prandtl numbers using the finite volume method. Given that the ideal grid size depends on the values of these parameters, we performed our simulations using a combination of different grid systems. This allowed us to train an artificial neural network in a cost-effective manner. We adopted two approaches to this problem. First, we generated a multi-grid training dataset that included both the Rayleigh and Prandtl numbers as input variables. By monitoring the training losses for this dataset, we were able to detect significant anomalies that stemmed from an insufficient grid size. We then revised the grid size or added more data points to denoise the dataset and transferred the learning from our original dataset to build a computational metamodel that predicts the Nusselt number. Furthermore, we sought to endow our neural network model with the ability to account for additional input features. Therefore, in our second approach, we applied a deep neural network architecture for transfer learning to this problem. Initially, we trained a neural network with a single input feature (Rayleigh), and then extended the network to incorporate the effects of a second feature (Prandtl). This learning framework can be applied to other systems of natural convection in enclosures with presumably higher physical complexity, while reducing computational and training costs.
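A minimal sketch of the second, two-stage approach follows: train a small network on the Rayleigh number alone, then widen the input layer to also accept the Prandtl number while reusing the learned weights as initialization. The network sizes, the synthetic placeholder target, and the weight-transfer scheme are assumptions for illustration, not the paper's architecture.

```python
# Sketch of the two-stage transfer idea: single-input network first,
# then a two-input network initialized from the single-input weights.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stage 1: single input feature (log Rayleigh number) -> Nusselt number.
net1 = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
ra = torch.rand(200, 1) * 6 + 3                     # log10(Ra) in [3, 9], synthetic
nu = 0.2 * ra ** 1.2 + 0.05 * torch.randn_like(ra)  # purely synthetic placeholder target
opt = torch.optim.Adam(net1.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net1(ra), nu)
    loss.backward()
    opt.step()

# Stage 2: extend to two inputs (Rayleigh and Prandtl), transferring weights.
net2 = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))
with torch.no_grad():
    # Copy the learned Rayleigh weights into the first input column and start
    # the Prandtl column at zero, so stage-2 training begins from the stage-1 model.
    net2[0].weight[:, :1] = net1[0].weight
    net2[0].weight[:, 1:] = 0.0
    net2[0].bias.copy_(net1[0].bias)
    net2[2].weight.copy_(net1[2].weight)
    net2[2].bias.copy_(net1[2].bias)
# net2 can now be fine-tuned on (Ra, Pr) pairs with far fewer epochs.
```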