Fanjin Zhang

Small Language Model Makes an Effective Long Text Extractor

Feb 11, 2025

Yelin Chen, Fanjin Zhang, Jie Tang

Abstract:Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). However, the task of extracting longer entity spans (e.g., awards) from extended texts (e.g., homepages) is barely explored. Current NER methods predominantly fall into two categories: span-based methods and generation-based methods. Span-based methods require the enumeration of all possible token-pair spans, followed by classification on each span, resulting in substantial redundant computations and excessive GPU memory usage. In contrast, generation-based methods involve prompting or fine-tuning large language models (LLMs) to adapt to downstream NER tasks. However, these methods struggle with the accurate generation of longer spans and often incur significant time costs for effective fine-tuning. To address these challenges, this paper introduces a lightweight span-based NER method called SeNER, which incorporates a bidirectional arrow attention mechanism coupled with LogN-Scaling on the [CLS] token to embed long texts effectively, and comprises a novel bidirectional sliding-window plus-shaped attention (BiSPA) mechanism to reduce redundant candidate token-pair spans significantly and model interactions between token-pair spans simultaneously. Extensive experiments demonstrate that our method achieves state-of-the-art extraction accuracy on three long NER datasets and is capable of extracting entities from long texts in a GPU-memory-friendly manner. Code: https://github.com/THUDM/scholar-profiling/tree/main/sener

* AAAI'25, 9 pages, 1 appendix page
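
The abstract's point about span enumeration cost can be made concrete with a small counting sketch. It is not SeNER's BiSPA mechanism: it only compares the number of candidate token-pair spans when every span is enumerated against the number left when span length is capped. The function names and the `max_width` parameter are illustrative assumptions, not taken from the paper or its repository.

```python
def count_all_spans(n: int) -> int:
    """Number of (start, end) token pairs with start <= end over n tokens."""
    return n * (n + 1) // 2


def count_windowed_spans(n: int, max_width: int) -> int:
    """Number of spans whose length in tokens is at most max_width (hypothetical cap)."""
    w = min(max_width, n)
    # For each width k in 1..w there are n - k + 1 spans; this is the closed-form sum.
    return w * n - w * (w - 1) // 2


def enumerate_windowed_spans(n: int, max_width: int) -> list[tuple[int, int]]:
    """Explicitly list the (start, end) index pairs kept under the width cap."""
    return [(s, e) for s in range(n) for e in range(s, min(n, s + max_width))]


if __name__ == "__main__":
    # Illustrative sizes only: a long document and an arbitrary cap on span length.
    n_tokens, cap = 4096, 128
    print(count_all_spans(n_tokens), count_windowed_spans(n_tokens, cap))
```

Full enumeration grows quadratically with text length, while the capped count grows roughly linearly for a fixed cap, which is the kind of saving a windowed span scheme aims for.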

MIND: Effective Incorrect Assignment Detection through a Multi-Modal Structure-Enhanced Language Model

Dec 05, 2024

Yunhe Pang, Bo Chen, Fanjin Zhang, Yanghui Rao, Jie Tang

Abstract:The rapid growth of academic publications has exacerbated the issue of author name ambiguity in online digital libraries. Despite advances in name disambiguation algorithms, cumulative errors continue to undermine the reliability of academic systems. It is estimated that over 10% of paper-author assignments were rectified when constructing the million-scale WhoIsWho benchmark. Existing approaches to detecting incorrect assignments are either semantic-based or graph-based, and fall short of making full use of the rich text attributes of papers and the implicit structural features defined via the co-occurrence of paper attributes. To this end, this paper introduces a structure-enhanced language model that combines key structural features from graph-based methods with fine-grained semantic features from rich paper attributes to detect incorrect assignments. The proposed model is trained with a highly effective multi-modal multi-turn instruction tuning framework, which incorporates task-guided instruction tuning, text-attribute modality, and structural modality. Experimental results demonstrate that our model outperforms previous approaches, achieving top performance on the leaderboard of KDD Cup 2024. Our code is publicly available.

BOND: Bootstrapping From-Scratch Name Disambiguation with Multi-task Promoting

Apr 12, 2024

Yuqing Cheng, Bo Chen, Fanjin Zhang, Jie Tang

Abstract:From-scratch name disambiguation is an essential task for establishing a reliable foundation for academic platforms. It involves partitioning documents authored by identically named individuals into groups representing distinct real-life experts. Canonically, the process is divided into two decoupled tasks: locally estimating the pairwise similarities between documents followed by globally grouping these documents into appropriate clusters. However, such a decoupled approach often inhibits optimal information exchange between these intertwined tasks. Therefore, we present BOND, which bootstraps the local and global informative signals to promote each other in an end-to-end regime. Specifically, BOND harnesses local pairwise similarities to drive global clustering, subsequently generating pseudo-clustering labels. These global signals further refine local pairwise characterizations. The experimental results establish BOND's superiority, outperforming other advanced baselines by a substantial margin. Moreover, an enhanced version, BOND+, incorporating ensemble and post-match techniques, rivals the top methods in the WhoIsWho competition.

* Proceedings of TheWebConf 2024 (WWW '24), May 13-17, 2024, Singapore
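
The canonical decoupled pipeline that the abstract contrasts BOND with (local pairwise similarity first, global grouping second) can be sketched as follows. This is not BOND itself, which couples the two stages end to end. The Jaccard similarity over co-author sets, the fixed threshold, and the connected-components grouping are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Local step: similarity between two documents' attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_threshold(docs: list[set], threshold: float) -> list[list[int]]:
    """Global step: link pairs at or above the threshold, return connected components."""
    parent = list(range(len(docs)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if jaccard(docs[i], docs[j]) >= threshold:
            parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Hypothetical co-author sets for five papers that share one author name.
    papers = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"y", "z"}, {"q"}]
    print(cluster_by_threshold(papers, 0.3))
```

Because the grouping step here never feeds back into the similarity step, an error in either stage cannot be corrected by the other, which is the limitation the abstract attributes to decoupled approaches.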

PST-Bench: Tracing and Benchmarking the Source of Publications

Feb 25, 2024

Fanjin Zhang, Kun Cao, Yukuo Cen, Jifan Yu, Da Yin, Jie Tang

Abstract:Tracing the source of research papers is a fundamental yet challenging task for researchers. The billion-scale citation relations between papers hinder researchers from understanding the evolution of science efficiently. To date, there is still a lack of an accurate and scalable dataset constructed by professional researchers to identify the direct source of their studied papers, based on which automatic algorithms can be developed to expand the evolutionary knowledge of science. In this paper, we study the problem of paper source tracing (PST) and construct a high-quality and ever-increasing dataset, PST-Bench, in computer science. Based on PST-Bench, we reveal several intriguing discoveries, such as the differing evolution patterns across various topics. An exploration of various methods underscores the difficulty of PST-Bench and pinpoints potential directions on this topic. The dataset and code are available at https://github.com/THUDM/paper-source-trace.

* 8 pages, 3 appendix pages

OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining

Feb 24, 2024

Fanjin Zhang, Shijie Shi, Yifan Zhu, Bo Chen, Yukuo Cen, Jifan Yu, Yelin Chen, Lulu Wang, Qingfei Zhao, Yuqing Cheng (+12 more)

Abstract:With the rapid proliferation of scientific literature, versatile academic knowledge services increasingly rely on comprehensive academic graph mining. Despite the availability of public academic graphs, benchmarks, and datasets, these resources often fall short in multi-aspect and fine-grained annotations, are constrained to specific task types and domains, or lack underlying real academic graphs. In this paper, we present OAG-Bench, a comprehensive, multi-aspect, and fine-grained human-curated benchmark based on the Open Academic Graph (OAG). OAG-Bench covers 10 tasks, 20 datasets, 70+ baselines, and 120+ experimental results to date. We propose new data annotation strategies for certain tasks and offer a suite of data pre-processing code, algorithm implementations, and standardized evaluation protocols to facilitate academic graph mining. Extensive experiments reveal that even advanced algorithms like large language models (LLMs) encounter difficulties in addressing key challenges in certain tasks, such as paper source tracing and scholar profiling. We also introduce the Open Academic Graph Challenge (OAG-Challenge) to encourage community input and sharing. We envisage that OAG-Bench can serve as a common ground for the community to evaluate and compare algorithms in academic graph mining, thereby accelerating algorithm development and advancement in this field. OAG-Bench is accessible at https://www.aminer.cn/data/.

* 8 pages, 5 appendix pages

Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit

Feb 23, 2023

Bo Chen, Jing Zhang, Fanjin Zhang, Tianyi Han, Yuqing Cheng, Xiaoyan Li, Yuxiao Dong, Jie Tang

Abstract:Name disambiguation, a fundamental problem in online academic systems, faces greater challenges as the number of research papers grows. For example, on AMiner, an online academic search platform, about 10% of names are shared by more than 100 authors. Such real-world hard cases cannot be fully addressed by existing research efforts because of the small-scale or low-quality datasets used to build the algorithms. The development of effective algorithms is further hampered by the variety of tasks and evaluation protocols designed on top of diverse datasets. To this end, we present WhoIsWho, which comprises a large-scale benchmark with over 1,000,000 papers built using an interactive annotation process, a regular leaderboard with comprehensive tasks, and an easy-to-use toolkit encapsulating the entire pipeline as well as the most powerful features and baseline models for tackling the tasks. Our strong baseline has already been deployed online in the AMiner system to enable daily arXiv paper assignments. The documentation and regular leaderboards are publicly available at http://whoiswho.biendata.xyz/.

Self-supervised Learning: Generative or Contrastive

Jun 29, 2020

Xiao Liu, Fanjin Zhang, Zhenyu Hou, Zhaoyu Wang, Li Mian, Jing Zhang, Jie Tang

Abstract:Deep supervised learning has achieved great success in the last decade. However, its dependence on manual labels and its vulnerability to attacks have driven people to explore a better solution. As an alternative, self-supervised learning has attracted many researchers for its soaring performance on representation learning in the last several years. Self-supervised representation learning leverages input data itself as supervision and benefits almost all types of downstream tasks. In this survey, we take a look into new self-supervised learning methods for representation in computer vision, natural language processing, and graph learning. We comprehensively review the existing empirical methods and summarize them into three main categories according to their objectives: generative, contrastive, and generative-contrastive (adversarial). We further investigate related theoretical analysis work to provide deeper thoughts on how self-supervised learning works. Finally, we briefly discuss open problems and future directions for self-supervised learning.
