Abstract:New Intent Discovery (NID) aims at detecting known and previously undefined categories of user intent by utilizing limited labeled and massive unlabeled data. Most prior works often operate under the unrealistic assumption that the distribution of both familiar and new intent classes is uniform, overlooking the skewed and long-tailed distributions frequently encountered in real-world scenarios. To bridge the gap, our work introduces the imbalanced new intent discovery (i-NID) task, which seeks to identify familiar and novel intent categories within long-tailed distributions. A new benchmark (ImbaNID-Bench) comprised of three datasets is created to simulate the real-world long-tail distributions. ImbaNID-Bench ranges from broad cross-domain to specific single-domain intent categories, providing a thorough representation of practical use cases. Besides, a robust baseline model ImbaNID is proposed to achieve cluster-friendly intent representations. It includes three stages: model pre-training, generation of reliable pseudo-labels, and robust representation learning that strengthens the model performance to handle the intricacies of real-world data distributions. Our extensive experiments on previous benchmarks and the newly established benchmark demonstrate the superior performance of ImbaNID in addressing the i-NID task, highlighting its potential as a powerful baseline for uncovering and categorizing user intents in imbalanced and long-tailed distributions\footnote{\url{https://github.com/Zkdc/i-NID}}.
Abstract:Recently, there has been increasing interest in exploring the capabilities of advanced large language models (LLMs) in the field of information extraction (IE), specifically focusing on tasks related to named entity recognition (NER) and relation extraction (RE). Although researchers are exploring the use of few-shot information extraction through in-context learning with LLMs, they tend to focus only on using correct or positive examples for demonstration, neglecting the potential value of incorporating incorrect or negative examples into the learning process. In this paper, we present c-ICL, a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations. This approach enhances the ability of LLMs to extract entities and relations by utilizing prompts that incorporate not only the positive samples but also the reasoning behind them. This method allows for the identification and correction of potential interface errors. Specifically, our proposed method taps into the inherent contextual information and valuable information in hard negative samples and the nearest positive neighbors to the test and then applies the in-context learning demonstrations based on LLMs. Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods, delivering substantial enhancements in performance across a broad spectrum of related tasks. These improvements are noteworthy, showcasing the versatility of our approach in miscellaneous scenarios.
Abstract:Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora, especially for non-English data. While prior efforts mainly focus on data-driven transfer methods, a significant aspect that has not been fully explored is aligning both semantic and token-level representations across diverse languages. In this paper, we propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER). Specifically, we reframe the CrossNER task into a problem of recognizing relationships between pairs of tokens. This approach taps into the inherent contextual nuances of token-to-token connections within entities, allowing us to align representations across different languages. A multi-view contrastive learning framework is introduced to encompass semantic contrasts between source, codeswitched, and target sentences, as well as contrasts among token-to-token relations. By enforcing agreement within both semantic and relational spaces, we minimize the gap between source sentences and their counterparts of both codeswitched and target sentences. This alignment extends to the relationships between diverse tokens, enhancing the projection of entities across languages. We further augment CrossNER by combining self-training with labeled source data and unlabeled target data. Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches. It achieves a substantial increase of nearly +2.0 $F_1$ scores across a broad spectrum and establishes itself as the new state-of-the-art performer.
Abstract:Named entity recognition (NER) is an important research problem in natural language processing. There are three types of NER tasks, including flat, nested and discontinuous entity recognition. Most previous sequential labeling models are task-specific, while recent years have witnessed the rising of generative models due to the advantage of unifying all NER tasks into the seq2seq model framework. Although achieving promising performance, our pilot studies demonstrate that existing generative models are ineffective at detecting entity boundaries and estimating entity types. This paper proposes a multi-task Transformer, which incorporates an entity boundary detection task into the named entity recognition task. More concretely, we achieve entity boundary detection by classifying the relations between tokens within the sentence. To improve the accuracy of entity-type mapping during decoding, we adopt an external knowledge base to calculate the prior entity-type distributions and then incorporate the information into the model via the self and cross-attention mechanisms. We perform experiments on an extensive set of NER benchmarks, including two flat, three nested, and three discontinuous NER datasets. Experimental results show that our approach considerably improves the generative NER model's performance.
Abstract:Entity and relation extraction is a key task in information extraction, where the output can be used for downstream NLP tasks. Existing approaches for entity and relation extraction tasks mainly focus on the English corpora and ignore other languages. Thus, it is critical to improving performance in a multilingual setting. Meanwhile, multilingual training is usually used to boost cross-lingual performance by transferring knowledge from languages (e.g., high-resource) to other (e.g., low-resource) languages. However, language interference usually exists in multilingual tasks as the model parameters are shared among all languages. In this paper, we propose a two-stage multilingual training method and a joint model called Multilingual Entity and Relation Extraction framework (mERE) to mitigate language interference across languages. Specifically, we randomly concatenate sentences in different languages to train a Language-universal Aggregator (LA), which narrows the distance of embedding representations by obtaining the unified language representation. Then, we separate parameters to mitigate interference via tuning a Language-specific Switcher (LS), which includes several independent sub-modules to refine the language-specific feature representation. After that, to enhance the relational triple extraction, the sentence representations concatenated with the relation feature are used to recognize the entities. Extensive experimental results show that our method outperforms both the monolingual and multilingual baseline methods. Besides, we also perform detailed analysis to show that mERE is lightweight but effective on relational triple extraction and mERE{} is easy to transfer to other backbone models of multi-field tasks, which further demonstrates the effectiveness of our method.