Abstract: Representing nodes in a network as dense vectors (node embeddings) is important for understanding a given network and solving many downstream tasks. In particular, for weighted homophilous graphs, where similar nodes are connected with larger edge weights, we want node embeddings in which node pairs with strong weights are placed closer together. Although random walk based node embedding methods like node2vec and node2vec+ can handle weighted networks by including edge weights in the walk transition probabilities, our experiments show that the resulting embeddings do not adequately reflect edge weights. In this paper, we propose ARGEW (Augmentation of Random walks by Graph Edge Weights), a novel augmentation method for random walks that expands the corpus in such a way that nodes connected by larger edge weights end up with closer embeddings. ARGEW can work with any random walk based node embedding method, because it is independent of the random sampling strategy itself and operates on top of the already-performed walks. With several real-world networks, we demonstrate that with ARGEW, compared to not using it, the desired pattern that node pairs with larger edge weights have closer embeddings is much clearer. We also examine ARGEW's performance in node classification: node2vec with ARGEW outperforms pure node2vec and is not sensitive to hyperparameters (i.e., it is consistently good). In fact, it achieves results comparable to supervised GCN, even without any node feature or label information during training. Finally, we explain why ARGEW works consistently well by exploring the coappearance distributions using a synthetic graph with clear structural roles.
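The abstract does not spell out ARGEW's exact augmentation rule, but its stated goal is to expand an already-sampled walk corpus so that node pairs joined by heavier edges co-occur more often before the embedding model is trained. The sketch below illustrates that general idea only, not the paper's algorithm: it assumes a networkx graph with a "weight" edge attribute and walks from any random walk sampler, and the function name, max_copies parameter, and proportional-copy rule are all illustrative assumptions.

```python
import random
import networkx as nx

def augment_walks_by_edge_weight(graph, walks, max_copies=5, seed=0):
    """Illustrative corpus augmentation (not the exact ARGEW procedure):
    for each consecutive node pair in a walk, append extra copies of that
    pair to the corpus in proportion to its edge weight, so that heavier
    edges co-occur more often in the skip-gram training corpus."""
    rng = random.Random(seed)
    weights = nx.get_edge_attributes(graph, "weight")
    max_w = max(weights.values()) if weights else 1.0
    augmented = [list(walk) for walk in walks]  # keep the original walks
    for walk in walks:
        for u, v in zip(walk, walk[1:]):
            w = graph[u][v].get("weight", 1.0)
            # heavier edges receive more duplicated (u, v) "mini-walks"
            copies = int(round(max_copies * w / max_w))
            augmented.extend([u, v] for _ in range(copies))
    rng.shuffle(augmented)
    return augmented

# Example usage: the augmented corpus can be fed to the same skip-gram
# trainer (e.g., gensim Word2Vec) that a node2vec pipeline would use.
if __name__ == "__main__":
    g = nx.Graph()
    g.add_weighted_edges_from([("a", "b", 5.0), ("b", "c", 1.0), ("c", "d", 0.5)])
    sample_walks = [["a", "b", "c", "d"], ["d", "c", "b", "a"]]
    corpus = augment_walks_by_edge_weight(g, sample_walks)
    print(len(corpus), "sentences in the augmented corpus")
```

Because the augmentation only rewrites the corpus, it leaves the sampling strategy (uniform, node2vec, node2vec+, etc.) untouched, which is why it composes with any random walk based embedding method.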
Abstract: Self-training provides an effective means of using an extremely small amount of labeled data to create pseudo-labels for unlabeled data. Many state-of-the-art self-training approaches hinge on different regularization methods to prevent overfitting and improve generalization. Yet they still rely heavily on predictions from a model initially trained with the limited labeled data as pseudo-labels, and are likely to place overconfident label belief on erroneous classes depending on that first prediction. To tackle this issue in text classification, we introduce LST, a simple self-training method that uses a lexicon to guide the pseudo-labeling mechanism in a linguistically enriched manner. We continually refine the lexicon using the model's confidence on unseen data, so that pseudo-labels become more reliable over the training iterations. We demonstrate that this simple yet well-crafted lexical knowledge outperforms the current state-of-the-art approaches by 1.0-2.0% on five benchmark datasets with 30 labeled samples per class.
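LST's precise scoring and lexicon-refinement rules are not given in the abstract. The following is only an illustrative sketch of lexicon-guided pseudo-labeling under assumed names (lexicon_pseudo_label, alpha, threshold): model confidence is blended with a lexicon-match score, and only examples that remain confident are kept as pseudo-labels. The class probabilities are assumed to be ordered to match the sorted class names of the lexicon.

```python
from collections import Counter

def lexicon_pseudo_label(texts, model_probs, lexicon, alpha=0.5, threshold=0.8):
    """Illustrative sketch (not LST's exact method): blend the model's
    predicted class probabilities with a lexicon-match score and keep
    only confidently pseudo-labeled examples."""
    classes = sorted(lexicon)  # lexicon: {class_name: set of indicative words}
    pseudo = []
    for text, probs in zip(texts, model_probs):
        tokens = text.lower().split()
        counts = Counter(tokens)
        # fraction of tokens matching each class's lexicon
        lex_scores = [
            sum(counts[w] for w in lexicon[c]) / max(len(tokens), 1)
            for c in classes
        ]
        total = sum(lex_scores) or 1.0
        blended = [
            alpha * p + (1 - alpha) * (s / total)
            for p, s in zip(probs, lex_scores)
        ]
        best = max(range(len(classes)), key=lambda i: blended[i])
        if blended[best] >= threshold:
            pseudo.append((text, classes[best]))
    return pseudo

# Example usage with toy data; in a full self-training loop the lexicon
# would then be refined from newly confident pseudo-labeled examples
# before the next iteration.
if __name__ == "__main__":
    lex = {"pos": {"great", "love"}, "neg": {"awful", "hate"}}
    texts = ["i love this great movie", "what an awful plot"]
    probs = [[0.6, 0.4], [0.3, 0.7]]  # ordered as sorted(lex): ["neg", "pos"] reversed? see note
    # Note: sorted(lex) == ["neg", "pos"], so probs[i] must follow that order.
    print(lexicon_pseudo_label(texts, [[0.4, 0.6], [0.7, 0.3]], lex, threshold=0.5))
```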