Abstract:Continual learning remains challenging across various natural language understanding tasks. When models are updated with new training data, they risk catastrophic forgetting of prior knowledge. In the present work, we introduce a discrete key-value bottleneck for encoder-only language models, allowing for efficient continual learning by requiring only localized updates. Inspired by the success of a discrete key-value bottleneck in vision, we address new and NLP-specific challenges. We experiment with different bottleneck architectures to find the variants best suited to language, and present a generic, task-independent discrete key initialization technique for NLP. We evaluate the discrete key-value bottleneck in four continual learning NLP scenarios and demonstrate that it alleviates catastrophic forgetting. We show that its performance is competitive with other popular continual learning methods, at lower computational cost.
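The idea of localized updates in a discrete key-value bottleneck can be illustrated with a minimal sketch. It assumes fixed keys, nearest-key lookup by Euclidean distance, and a plain gradient step on the selected value; the function names `bottleneck_lookup` and `localized_update` are hypothetical and not taken from the paper.

```python
import numpy as np

def bottleneck_lookup(encoding, keys):
    """Return the index of the key closest to the encoder output (assumed Euclidean)."""
    distances = np.linalg.norm(keys - encoding, axis=1)
    return int(np.argmin(distances))

def localized_update(values, index, gradient, learning_rate=0.1):
    """Apply a gradient step to the selected value only; all other values stay untouched."""
    updated = values.copy()
    updated[index] -= learning_rate * gradient
    return updated

# Toy example: three frozen keys, three trainable values.
keys = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
values = np.zeros((3, 2))
index = bottleneck_lookup(np.array([0.9, 1.2]), keys)
values_after = localized_update(values, index, gradient=np.array([1.0, -1.0]))
```

Because only the retrieved value changes, entries associated with other regions of the key space are left intact, which is the intuition behind reduced forgetting.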
Abstract:Low isotropy in an embedding space impairs performance on tasks involving semantic inference. Our study investigates the impact of isotropy on semantic code search performance and explores post-processing techniques to mitigate this issue. We analyze various code language models, examine isotropy in their embedding spaces, and its influence on search effectiveness. We propose a modified ZCA whitening technique to control isotropy levels in embeddings. Our results demonstrate that Soft-ZCA whitening improves the performance of pre-trained code language models and can complement contrastive fine-tuning. The code for our experiments is available at https://github.com/drndr/code_isotropy
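A whitening step with a controllable strength can be sketched as follows. This is a minimal sketch that assumes the "soft" variant adds a regularizer `eps` to the covariance eigenvalues before inversion; the function name `soft_zca_whiten` and the default `eps` are illustrative assumptions, and the linked repository should be consulted for the authors' implementation.

```python
import numpy as np

def soft_zca_whiten(embeddings, eps=0.1):
    """ZCA-whiten row-vector embeddings; larger eps whitens less aggressively."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    scale = 1.0 / np.sqrt(eigvals + eps)
    # W = U diag(scale) U^T rotates back to the original axes (the "ZCA" part).
    whitening = (eigvecs * scale) @ eigvecs.T
    return centered @ whitening
```

With `eps` close to zero the output covariance approaches the identity (full whitening); increasing `eps` leaves more of the original variance structure in place.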
Abstract:Assigning a subset of labels from a fixed pool of labels to a given input text is a text classification problem with many real-world applications, such as in recommender systems. Two separate research streams address this issue. Hierarchical Text Classification (HTC) focuses on datasets with smaller label pools of hundreds of entries, accompanied by a semantic label hierarchy. In contrast, eXtreme Multi-Label Text Classification (XML) considers very large label pools with up to millions of entries, in which the labels are not arranged in any particular manner. However, in XML, a common approach is to construct an artificial hierarchy without any semantic information before or during the training process. Here, we investigate how state-of-the-art models from one domain perform when trained and tested on datasets from the other domain. The HBGL and HGLCR models from the HTC domain are trained and tested on the datasets Wiki10-31K, AmazonCat-13K, and Amazon-670K from the XML domain. Conversely, the XML models CascadeXML and XR-Transformer are trained and tested on the datasets Web of Science, The New York Times Annotated Corpus, and RCV1-V2 from the HTC domain. We find that HTC models are not equipped to handle the size of XML datasets and achieve poor transfer results. The code and the files needed to reproduce our results are available at https://github.com/FloHauss/XMC_HTC
Abstract:Contemporary machine learning models, such as language models, are powerful, but come with immense resource requirements both at training and inference time. It has been shown that decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight), facilitating efficient inference. Here, we start our exploration with non-transformer model architectures, investigating 1.58-bit training for multi-layer perceptrons and graph neural networks. Then, we explore 1.58-bit training in other transformer-based language models, namely encoder-only and encoder-decoder models. Our results show that in all of these settings, 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.
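To make the "1.58 bits per weight" figure concrete: a weight restricted to the three values {-1, 0, 1} carries log2(3) ≈ 1.58 bits. The sketch below assumes an absmean scheme (scale by the mean absolute weight, then round and clip), which is one common way to obtain ternary weights; the abstract does not state which scheme is used, and the function name `ternarize` is hypothetical.

```python
import numpy as np

def ternarize(weights, eps=1e-8):
    """Map a weight matrix to {-1, 0, 1} plus one floating-point scale (assumed absmean scheme)."""
    scale = np.abs(weights).mean()
    ternary = np.clip(np.round(weights / (scale + eps)), -1, 1)
    return ternary.astype(np.int8), scale

weights = np.array([[0.9, -0.05], [-1.2, 0.4]])
ternary, scale = ternarize(weights)
# Approximate reconstruction used at inference time.
approx = ternary * scale
bits_per_weight = np.log2(3)
```

During training, such schemes typically keep full-precision shadow weights and apply the quantization in the forward pass; that part is omitted here.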
Abstract:Morphology is a crucial factor for multilingual language modeling as it poses direct challenges for tokenization. Here, we seek to understand how tokenization influences the morphological knowledge encoded in multilingual language models. Specifically, we capture the impact of tokenization by contrasting two multilingual language models: mT5 and ByT5. The two models share the same architecture, training objective, and training data and only differ in their tokenization strategies: subword tokenization vs. character-level tokenization. Probing the morphological knowledge encoded in these models on four tasks and 17 languages, our analyses show that multilingual language models learn the morphological systems of some languages better than others despite similar average performance, and that morphological information is encoded in the middle and late layers, where character-based models need a few more layers to yield commensurate probing accuracy. Finally, we show that languages with more irregularities benefit more from having a higher share of the pre-training data.
Abstract:We consider the problem of \textit{true} open-world semi-supervised node classification, in which nodes in a graph belong either to known or to new classes, with the latter not present during training. Existing methods detect and reject new classes but fail to distinguish between different new classes. We adapt existing methods and show they do not solve the problem sufficiently. We introduce a novel end-to-end approach for classification into known classes and new classes based on class prototypes, which we call Prototypical Open-World Learning for Node Classification (POWN). Our method combines graph semi-supervised learning, self-supervised learning, and pseudo-labeling to learn prototype representations of new classes in a zero-shot way. In contrast to existing solutions from the vision domain, POWN does not require data augmentation techniques for node classification. Experiments on benchmark datasets demonstrate the effectiveness of POWN, where it outperforms baselines by up to $20\%$ accuracy on the small and up to $30\%$ on the large datasets. Source code is available at https://github.com/Bobowner/POWN.
Abstract:Language models and humans are two types of learning systems. Finding or facilitating commonalities could enable major breakthroughs in our understanding of the acquisition and evolution of language. Many theories of language evolution rely heavily on learning biases and learning pressures. Yet due to substantial differences in learning pressures, it is questionable whether the similarity between humans and machines is sufficient for insights to carry over and to be worth testing with human participants. Here, we review the emergent communication literature, a subfield of multi-agent reinforcement learning, from a language evolution perspective. We find that the emergent communication literature excels at designing and adapting models to recover initially absent linguistic phenomena of natural languages. Based on a short literature review, we identify key pressures that have recovered initially absent human patterns in emergent communication models: communicative success, efficiency, learnability, and other psycho-/sociolinguistic factors. We argue that this may serve as inspiration for how to design language models for language acquisition and language evolution research.
Abstract:Language models can serve as a valuable tool for software developers to increase productivity. Large generative models can be used for code generation and code completion, while smaller encoder-only models are capable of performing code search tasks using natural language queries. These capabilities are heavily influenced by the quality and diversity of the available training data. Source code datasets used for training usually focus on the most popular languages and testing is mostly conducted on the same distributions, often overlooking low-resource programming languages. Motivated by the NLP generalization taxonomy proposed by Hupkes et al., we propose a new benchmark dataset called GenCodeSearchNet (GeCS), which builds upon existing natural language code search datasets to systematically evaluate the programming language understanding generalization capabilities of language models. As part of the full dataset, we introduce a new, manually curated subset, StatCodeSearch, that focuses on R, a popular but so far underrepresented programming language that is often used by researchers outside the field of computer science. For evaluation and comparison, we collect several baseline results using fine-tuned BERT-style models and GPT-style large language models in a zero-shot setting.
Abstract:We study the problem of lifelong graph learning in an open-world scenario, where a model needs to deal with new tasks and potentially unknown classes. We utilize Out-of-Distribution (OOD) detection methods to recognize new classes and adapt existing non-graph OOD detection methods to graph data. Crucially, we suggest performing new class detection by combining OOD detection methods with information aggregated from the graph neighborhood. Most OOD detection methods avoid determining a crisp threshold for deciding whether a vertex is OOD. To tackle this problem, we propose a Weakly-supervised Relevance Feedback (Open-WRF) method, which decreases the sensitivity to thresholds in OOD detection. We evaluate our approach on six benchmark datasets. Our results show that the proposed neighborhood aggregation method for OOD scores outperforms existing methods independent of the underlying graph neural network. Furthermore, we demonstrate that our Open-WRF method is more robust to threshold selection and analyze the influence of graph neighborhood on OOD detection. The aggregation and threshold methods are compatible with arbitrary graph neural networks and OOD detection methods, making our approach versatile and applicable to many real-world applications.
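Combining a per-vertex OOD score with information from the graph neighborhood can be sketched as below. This is only an illustration under stated assumptions: the aggregation is taken to be a convex combination of a vertex's own score and the mean score of its neighbors, and the names `aggregate_ood_scores` and `alpha` are hypothetical. The paper's actual aggregation may differ.

```python
import numpy as np

def aggregate_ood_scores(scores, adjacency, alpha=0.5):
    """Blend each vertex's OOD score with the mean score of its neighbors."""
    scores = np.asarray(scores, dtype=float)
    adjacency = np.asarray(adjacency, dtype=float)
    degree = adjacency.sum(axis=1)
    # Isolated vertices (degree 0) fall back to their own score.
    neighbor_mean = np.divide(
        adjacency @ scores, degree, out=scores.copy(), where=degree > 0
    )
    return alpha * scores + (1.0 - alpha) * neighbor_mean

# Path graph 0 - 1 - 2 with a high OOD score on vertex 0.
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
smoothed = aggregate_ood_scores([0.9, 0.1, 0.2], adjacency)
```

The input scores can come from any detector (for example, one minus the maximum softmax probability), which matches the abstract's point that the aggregation is independent of the underlying OOD method and graph neural network.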
Abstract:Neural networks drive the success of natural language processing. A fundamental property of natural languages is their compositional structure, allowing us to describe new meanings systematically. However, neural networks notoriously struggle with systematic generalization and do not necessarily benefit from a compositional structure in emergent communication simulations. Here, we test how neural networks compare to humans in learning and generalizing a new language. We do this by closely replicating an artificial language learning study (conducted originally with human participants) and evaluating the memorization and generalization capabilities of deep neural networks with respect to the degree of structure in the input language. Our results show striking similarities between humans and deep neural networks: More structured linguistic input leads to more systematic generalization and better convergence between humans and neural network agents and between different neural agents. We then replicate this structure bias found in humans and our recurrent neural networks with a Transformer-based large language model (GPT-3), showing a similar benefit for structured linguistic input regarding generalization systematicity and memorization errors. These findings show that the underlying structure of languages is crucial for systematic generalization. Due to the correlation between community size and linguistic structure in natural languages, our findings underscore the challenge of automated processing of low-resource languages. Nevertheless, the similarity between humans and machines opens new avenues for language evolution research.