Abstract:Citation field learning is to segment a citation string into fields of interest such as author, title, and venue. Extracting such fields from citations is crucial for citation indexing, researcher profile analysis, etc. User-generated resources like academic homepages and Curriculum Vitae, provide rich citation field information. However, extracting fields from these resources is challenging due to inconsistent citation styles, incomplete sentence syntax, and insufficient training data. To address these challenges, we propose a novel algorithm, CIFAL (citation field learning by anchor learning), to boost the citation field learning performance. CIFAL leverages the anchor learning, which is model-agnostic for any Pre-trained Language Model, to help capture citation patterns from the data of different citation styles. The experiments demonstrate that CIFAL outperforms state-of-the-art methods in citation field learning, achieving a 2.83% improvement in field-level F1-scores. Extensive analysis of the results further confirms the effectiveness of CIFAL quantitatively and qualitatively.
Abstract:Text classification is a very classic NLP task, but it has two prominent shortcomings: On the one hand, text classification is deeply domain-dependent. That is, a classifier trained on the corpus of one domain may not perform so well in another domain. On the other hand, text classification models require a lot of annotated data for training. However, for some domains, there may not exist enough annotated data. Therefore, it is valuable to investigate how to efficiently utilize text data from different domains to improve the performance of models in various domains. Some multi-domain text classification models are trained by adversarial training to extract shared features among all domains and the specific features of each domain. We noted that the distinctness of the domain-specific features is different, so in this paper, we propose to use a curriculum learning strategy based on keyword weight ranking to improve the performance of multi-domain text classification models. The experimental results on the Amazon review and FDU-MTL datasets show that our curriculum learning strategy effectively improves the performance of multi-domain text classification models based on adversarial learning and outperforms state-of-the-art methods.