Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Topic:Information Extraction

What is Information Extraction? Information extraction is the process of automatically extracting structured information from unstructured text data.

Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature

Jul 10, 2025

Hein Htet, Amgad Ahmed Ali Ibrahim, Yutaka Sasaki, Ryoji Asahi

Abstract:The oxygen reduction reaction (ORR) catalyst plays a critical role in enhancing fuel cell efficiency, making it a key focus in material science research. However, extracting structured information about ORR catalysts from vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature, which is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12 critical entities and two relationship types between pairs of the entities. Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to enhance information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, the comparison with human annotators highlights the reliability of fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable and automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models like BlueBERT for ORR catalyst extraction.

* 28 pages, 12 figures, 6 tables

Via

Access Paper or Ask Questions

SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data

Jul 10, 2025

Zonghai Yao, Youxia Zhao, Avijit Mitra, David A. Levy, Emily Druhl, Jack Tsai, Hong Yu

Abstract:Eviction is a significant yet understudied social determinants of health (SDoH), linked to housing instability, unemployment, and mental health. While eviction appears in unstructured electronic health records (EHRs), it is rarely coded in structured fields, limiting downstream applications. We introduce SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. Using this pipeline, we created the largest public eviction-related SDoH dataset to date, comprising 14 fine-grained categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human validated data, outperforming GPT-4o-APO (87.8%, 87.3%), GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling cost-effective deployment across various model sizes. The pipeline reduces annotation effort by over 80%, accelerates dataset creation, enables scalable eviction detection, and generalizes to other information extraction tasks.

* Equal contribution for the first two authors

Via

Access Paper or Ask Questions

"So, Tell Me About Your Policy...": Distillation of interpretable policies from Deep Reinforcement Learning agents

Jul 10, 2025

Giovanni Dispoto, Paolo Bonetti, Marcello Restelli

Abstract:Recent advances in Reinforcement Learning (RL) largely benefit from the inclusion of Deep Neural Networks, boosting the number of novel approaches proposed in the field of Deep Reinforcement Learning (DRL). These techniques demonstrate the ability to tackle complex games such as Atari, Go, and other real-world applications, including financial trading. Nevertheless, a significant challenge emerges from the lack of interpretability, particularly when attempting to comprehend the underlying patterns learned, the relative importance of the state features, and how they are integrated to generate the policy's output. For this reason, in mission-critical and real-world settings, it is often preferred to deploy a simpler and more interpretable algorithm, although at the cost of performance. In this paper, we propose a novel algorithm, supported by theoretical guarantees, that can extract an interpretable policy (e.g., a linear policy) without disregarding the peculiarities of expert behavior. This result is obtained by considering the advantage function, which includes information about why an action is superior to the others. In contrast to previous works, our approach enables the training of an interpretable policy using previously collected experience. The proposed algorithm is empirically evaluated on classic control environments and on a financial trading scenario, demonstrating its ability to extract meaningful information from complex expert policies.

Via

Access Paper or Ask Questions

Large Language Model for Extracting Complex Contract Information in Industrial Scenes

Jul 09, 2025

Yunyang Cao, Yanjun Li, Silong Dai

Abstract:This paper proposes a high-quality dataset construction method for complex contract information extraction tasks in industrial scenarios and fine-tunes a large language model based on this dataset. Firstly, cluster analysis is performed on industrial contract texts, and GPT-4 and GPT-3.5 are used to extract key information from the original contract data, obtaining high-quality data annotations. Secondly, data augmentation is achieved by constructing new texts, and GPT-3.5 generates unstructured contract texts from randomly combined keywords, improving model robustness. Finally, the large language model is fine-tuned based on the high-quality dataset. Experimental results show that the model achieves excellent overall performance while ensuring high field recall and precision and considering parsing efficiency. LoRA, data balancing, and data augmentation effectively enhance model accuracy and robustness. The proposed method provides a novel and efficient solution for industrial contract information extraction tasks.

* Proceedings of the 42nd International Conference on Machine Learning Workshop NewInML, Vancouver, Canada. PMLR 267, 2025

Via

Access Paper or Ask Questions

Enhancing Vaccine Safety Surveillance: Extracting Vaccine Mentions from Emergency Department Triage Notes Using Fine-Tuned Large Language Models

Jul 10, 2025

Sedigh Khademi, Jim Black, Christopher Palmer, Muhammad Javed, Hazel Clothier, Jim Buttery, Gerardo Luis Dimaguila

Abstract:This study evaluates fine-tuned Llama 3.2 models for extracting vaccine-related information from emergency department triage notes to support near real-time vaccine safety surveillance. Prompt engineering was used to initially create a labeled dataset, which was then confirmed by human annotators. The performance of prompt-engineered models, fine-tuned models, and a rule-based approach was compared. The fine-tuned Llama 3 billion parameter model outperformed other models in its accuracy of extracting vaccine names. Model quantization enabled efficient deployment in resource-constrained environments. Findings demonstrate the potential of large language models in automating data extraction from emergency department notes, supporting efficient vaccine safety surveillance and early detection of emerging adverse events following immunization issues.

* Medinfo 2025
* 5 pages

Via

Access Paper or Ask Questions

Thermodynamic Prediction Enabled by Automatic Dataset Building and Machine Learning

Jul 09, 2025

Juejing Liu, Haydn Anderson, Noah I. Waxman, Vsevolod Kovalev, Byron Fisher, Elizabeth Li, Xiaofeng Guo

Abstract:New discoveries in chemistry and materials science, with increasingly expanding volume of requisite knowledge and experimental workload, provide unique opportunities for machine learning (ML) to take critical roles in accelerating research efficiency. Here, we demonstrate (1) the use of large language models (LLMs) for automated literature reviews, and (2) the training of an ML model to predict chemical knowledge (thermodynamic parameters). Our LLM-based literature review tool (LMExt) successfully extracted chemical information and beyond into a machine-readable structure, including stability constants for metal cation-ligand interactions, thermodynamic properties, and other broader data types (medical research papers, and financial reports), effectively overcoming the challenges inherent in each domain. Using the autonomous acquisition of thermodynamic data, an ML model was trained using the CatBoost algorithm for accurately predicting thermodynamic parameters (e.g., enthalpy of formation) of minerals. This work highlights the transformative potential of integrated ML approaches to reshape chemistry and materials science research.

Via

Access Paper or Ask Questions

Revealing the Hidden Temporal Structure of HubertSoft Embeddings based on the Russian Phonetic Corpus

Jul 09, 2025

Anastasia Ananeva, Anton Tomilov, Marina Volkova

Abstract:Self-supervised learning (SSL) models such as Wav2Vec 2.0 and HuBERT have shown remarkable success in extracting phonetic information from raw audio without labelled data. While prior work has demonstrated that SSL embeddings encode phonetic features at the frame level, it remains unclear whether these models preserve temporal structure, specifically, whether embeddings at phoneme boundaries reflect the identity and order of adjacent phonemes. This study investigates the extent to which boundary-sensitive embeddings from HubertSoft, a soft-clustering variant of HuBERT, encode phoneme transitions. Using the CORPRES Russian speech corpus, we labelled 20 ms embedding windows with triplets of phonemes corresponding to their start, centre, and end segments. A neural network was trained to predict these positions separately, and multiple evaluation metrics, such as ordered, unordered accuracy and a flexible centre accuracy, were used to assess temporal sensitivity. Results show that embeddings extracted at phoneme boundaries capture both phoneme identity and temporal order, with especially high accuracy at segment boundaries. Confusion patterns further suggest that the model encodes articulatory detail and coarticulatory effects. These findings contribute to our understanding of the internal structure of SSL speech representations and their potential for phonological analysis and fine-grained transcription tasks.

* 11 pages, 5 figures, Specom 2025 conference

Via

Access Paper or Ask Questions

Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams

Jul 09, 2025

Matthew Anderson Hendricks, Alice Cicirello

Abstract:This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy for exploiting domain and expert knowledge for the automated generation of dynamical system computational model starting from a corpus of document relevant to the dynamical system of interest and an input document describing the specific system. This strategy is implemented in five steps and, crucially, it uses system modeling language diagrams (SysML) to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve intermediate outputs of the SySML diagrams automated generation, such as: list of key nouns; list of extracted relationships; list of key phrases and key relationships; block attribute values; block relationships; and BDD diagram generation. The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems from SysML diagrams are then obtained via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. The applicability of the proposed approach is shown via an end-to-end example from text to model of a simple pendulum, showing improved performance compared to results yielded by LLMs only.

Via

Access Paper or Ask Questions

Integrating Pathology Foundation Models and Spatial Transcriptomics for Cellular Decomposition from Histology Images

Jul 09, 2025

Yutong Sun, Sichen Zhu, Peng Qiu

Abstract:The rapid development of digital pathology and modern deep learning has facilitated the emergence of pathology foundation models that are expected to solve general pathology problems under various disease conditions in one unified model, with or without fine-tuning. In parallel, spatial transcriptomics has emerged as a transformative technology that enables the profiling of gene expression on hematoxylin and eosin (H&E) stained histology images. Spatial transcriptomics unlocks the unprecedented opportunity to dive into existing histology images at a more granular, cellular level. In this work, we propose a lightweight and training-efficient approach to predict cellular composition directly from H&E-stained histology images by leveraging information-enriched feature embeddings extracted from pre-trained pathology foundation models. By training a lightweight multi-layer perceptron (MLP) regressor on cell-type abundances derived via cell2location, our method efficiently distills knowledge from pathology foundation models and demonstrates the ability to accurately predict cell-type compositions from histology images, without physically performing the costly spatial transcriptomics. Our method demonstrates competitive performance compared to existing methods such as Hist2Cell, while significantly reducing computational complexity.

Via

Access Paper or Ask Questions

MeD-3D: A Multimodal Deep Learning Framework for Precise Recurrence Prediction in Clear Cell Renal Cell Carcinoma (ccRCC)

Jul 10, 2025

Hasaan Maqsood, Saif Ur Rehman Khan

Abstract:Accurate prediction of recurrence in clear cell renal cell carcinoma (ccRCC) remains a major clinical challenge due to the disease complex molecular, pathological, and clinical heterogeneity. Traditional prognostic models, which rely on single data modalities such as radiology, histopathology, or genomics, often fail to capture the full spectrum of disease complexity, resulting in suboptimal predictive accuracy. This study aims to overcome these limitations by proposing a deep learning (DL) framework that integrates multimodal data, including CT, MRI, histopathology whole slide images (WSI), clinical data, and genomic profiles, to improve the prediction of ccRCC recurrence and enhance clinical decision-making. The proposed framework utilizes a comprehensive dataset curated from multiple publicly available sources, including TCGA, TCIA, and CPTAC. To process the diverse modalities, domain-specific models are employed: CLAM, a ResNet50-based model, is used for histopathology WSIs, while MeD-3D, a pre-trained 3D-ResNet18 model, processes CT and MRI images. For structured clinical and genomic data, a multi-layer perceptron (MLP) is used. These models are designed to extract deep feature embeddings from each modality, which are then fused through an early and late integration architecture. This fusion strategy enables the model to combine complementary information from multiple sources. Additionally, the framework is designed to handle incomplete data, a common challenge in clinical settings, by enabling inference even when certain modalities are missing.

Via

Access Paper or Ask Questions

Topic:Information Extraction

Papers and Code