Abstract:State-of-the-art deep learning entity linking methods rely on extensive human-labelled data, which is costly to acquire. Current datasets are limited in size, leading to inadequate coverage of biomedical concepts and diminished performance when applied to new data. In this work, we propose to automatically generate data to create large-scale training datasets, which allows the exploration of approaches originally developed for the task of extreme multi-label ranking in the biomedical entity linking task. We propose the hybrid X-Linker pipeline that includes different modules to link disease and chemical entity mentions to concepts in the MEDIC and the CTD-Chemical vocabularies, respectively. X-Linker was evaluated on several biomedical datasets: BC5CDR-Disease, BioRED-Disease, NCBI-Disease, BC5CDR-Chemical, BioRED-Chemical, and NLM-Chem, achieving top-1 accuracies of 0.8307, 0.7969, 0.8271, 0.9511, 0.9248, and 0.7895, respectively. X-Linker demonstrated superior performance in three datasets: BC5CDR-Disease, NCBI-Disease, and BioRED-Chemical. In contrast, SapBERT outperformed X-Linker in the remaining three datasets. Both models rely only on the mention string for their operations. The source code of X-Linker and its associated data are publicly available for performing biomedical entity linking without requiring pre-labelled entities with identifiers from specific knowledge organization systems.
Abstract:Advances in natural language processing techniques, such as named entity recognition and normalization to widely used standardized terminologies like UMLS or SNOMED-CT, along with the digitalization of electronic health records, have significantly advanced clinical text analysis. This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking that leverages the potential of in-domain adapted language models for biomedical text mining: initial candidate retrieval using a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish. This methodology, focused initially on content in Spanish, substantially outperforming multilingual language models designed for the same purpose. This is true even for complex scenarios involving heterogeneous medical terminologies and being trained on a subset of the original data. Our results, evaluated using top-k accuracy at 25 and other top-k metrics, demonstrate our approach's performance on two distinct clinical entity linking Gold Standard corpora, DisTEMIST (diseases) and MedProcNER (clinical procedures), outperforming previous benchmarks by 40 points in DisTEMIST and 43 points in MedProcNER, both normalized to SNOMED-CT codes. These findings highlight our approach's ability to address language-specific nuances and set a new benchmark in entity linking, offering a potent tool for enhancing the utility of digital medical records. The resulting system is of practical value, both for large scale automatic generation of structured data derived from clinical records, as well as for exhaustive extraction and harmonization of predefined clinical variables of interest.