Abstract: Objective: To develop and evaluate machine learning (ML) models that predict incident liver cirrhosis one, two, and three years prior to diagnosis using routinely collected electronic health record (EHR) data, and to benchmark their performance against the FIB-4 score. Methods: We conducted a retrospective cohort study using de-identified EHR data from a large academic health system. Patients with fatty liver disease were identified and categorized into cirrhosis and non-cirrhosis cohorts based on ICD-9/10 codes. Prediction scenarios were constructed with observation and prediction windows to emulate real-world clinical use. Demographics, diagnoses, laboratory results, vital signs, and comorbidity indices were aggregated over the observation window. XGBoost models were trained for 1-, 2-, and 3-year prediction horizons and evaluated on held-out test sets. Model performance was compared with FIB-4 using the area under the receiver operating characteristic curve (AUC). Results: Final cohorts included 3,043 patients for the 1-year prediction, 1,981 for the 2-year prediction, and 1,470 for the 3-year prediction. Across all prediction windows, the ML models consistently outperformed FIB-4: the XGBoost models achieved AUCs of 0.81, 0.73, and 0.69 for 1-, 2-, and 3-year predictions, respectively, compared with 0.71, 0.63, and 0.57 for FIB-4. The performance advantage persisted at longer prediction horizons, indicating improved early risk discrimination. Conclusions: Machine learning models leveraging routine EHR data substantially outperform the traditional FIB-4 score for early prediction of liver cirrhosis. These models enable earlier and more accurate risk stratification and can be integrated into clinical workflows as automated decision-support tools for proactive cirrhosis prevention and management.
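The benchmark above rests on the FIB-4 index, a standard formula computed from age, AST, ALT, and platelet count: FIB-4 = (age × AST) / (platelets × √ALT). The sketch below illustrates, under assumptions, how an XGBoost model and the FIB-4 score might be compared by AUC on a held-out split; the file name, column names, and hyperparameters are hypothetical and do not reproduce the study's actual pipeline.

```python
# Illustrative sketch only (not the study's code): benchmarking an XGBoost model
# against the FIB-4 score by AUC on a held-out test set. The file name, column
# names (age, ast, alt, platelets, cirrhosis_1yr), and hyperparameters are assumptions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def fib4(age_years, ast, alt, platelets):
    """Standard FIB-4 index: (age [yr] * AST [U/L]) / (platelets [10^9/L] * sqrt(ALT [U/L]))."""
    return (age_years * ast) / (platelets * np.sqrt(alt))

df = pd.read_csv("cohort_1yr_features.csv")      # hypothetical observation-window feature table
X = df.drop(columns=["cirrhosis_1yr"])           # label: incident cirrhosis within 1 year
y = df["cirrhosis_1yr"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                      eval_metric="logloss")
model.fit(X_train, y_train)

auc_ml = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
auc_fib4 = roc_auc_score(
    y_test, fib4(X_test["age"], X_test["ast"], X_test["alt"], X_test["platelets"]))
print(f"XGBoost AUC: {auc_ml:.2f} | FIB-4 AUC: {auc_fib4:.2f}")
```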

Abstract: Extracting medical history entities (MHEs) related to a patient's chief complaint (CC), history of present illness (HPI), and past, family, and social history (PFSH) helps structure free-text clinical notes into standardized EHRs, streamlining downstream tasks such as continuity of care, medical coding, and quality metrics. Fine-tuned clinical large language models (cLLMs) can assist in this process while protecting sensitive data through on-premises deployment. This study evaluates the performance of cLLMs in recognizing CC/HPI/PFSH-related MHEs and examines how note characteristics affect model accuracy. We annotated 1,449 MHEs across 61 outpatient-related clinical notes from the MTSamples repository. To recognize these entities, we fine-tuned seven state-of-the-art cLLMs. Additionally, we assessed the models' performance when enhanced by integrating problems, tests, treatments, and other basic medical entities (BMEs). We compared the performance of these models against GPT-4o in a zero-shot setting. To further understand the textual characteristics affecting model accuracy, we conducted an error analysis focused on note length, entity length, and segmentation. The cLLMs showed potential to reduce the time required for extracting MHEs by over 20%. However, detecting many types of MHEs remained challenging due to their polysemous nature and the frequent involvement of non-medical vocabulary. Fine-tuned GatorTron and GatorTronS, two of the most extensively trained cLLMs, achieved the highest performance. Integrating pre-identified BME information improved model performance for certain entities. Regarding the impact of textual characteristics, we found that longer entities were harder to identify, note length did not correlate with a higher error rate, and well-organized segments with headings were beneficial for extraction.
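The fine-tuning step summarized above can be framed as token classification over BIO-tagged medical history entities. The sketch below is illustrative only: it assumes the publicly available UFNLP/gatortron-base checkpoint, a simplified CC/HPI/PFSH label set, a toy annotated example, and generic hyperparameters; it does not reproduce the study's actual label scheme, data, or training configuration.

```python
# Illustrative sketch only: fine-tuning a clinical LLM encoder for MHE recognition
# as BIO token classification. Checkpoint, label set, data, and hyperparameters are
# assumptions, not the study's actual configuration.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

labels = ["O", "B-CC", "I-CC", "B-HPI", "I-HPI", "B-PFSH", "I-PFSH"]  # simplified label set
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

checkpoint = "UFNLP/gatortron-base"  # assumed public GatorTron checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id)

# Toy annotated example; the study used 1,449 annotated MHEs across 61 MTSamples notes.
examples = [{"tokens": ["Chest", "pain", "for", "the", "past", "two", "days", "."],
             "tags":   ["B-CC", "I-CC", "B-HPI", "I-HPI", "I-HPI", "I-HPI", "I-HPI", "O"]}]

def tokenize_and_align(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, max_length=512)
    word_ids = enc.word_ids()
    # Label only the first sub-token of each word; mask the rest with -100.
    enc["labels"] = [
        label2id[example["tags"][w]]
        if w is not None and (i == 0 or word_ids[i - 1] != w) else -100
        for i, w in enumerate(word_ids)]
    return enc

train_ds = Dataset.from_list(examples).map(tokenize_and_align,
                                           remove_columns=["tokens", "tags"])
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mhe_ner", num_train_epochs=5,
                           per_device_train_batch_size=8, learning_rate=2e-5),
    train_dataset=train_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```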
Abstract: Function is increasingly recognized as an important indicator of whole-person health, although it receives little attention in clinical natural language processing research. We introduce the first public annotated dataset specifically on the Mobility domain of the International Classification of Functioning, Disability and Health (ICF), aiming to facilitate automatic extraction and analysis of functioning information from free-text clinical notes. We utilize the National NLP Clinical Challenges (n2c2) research dataset to construct a pool of candidate sentences using keyword expansion. Our active learning approach, using query-by-committee sampling weighted by density representativeness, selects informative sentences for human annotation. We train BERT and CRF models, and use their predictions to guide the selection of new sentences for subsequent annotation iterations. Our final dataset consists of 4,265 sentences with a total of 11,784 entities, including 5,511 Action entities, 5,328 Mobility entities, 306 Assistance entities, and 639 Quantification entities. The inter-annotator agreement (IAA), averaged over all entity types, is 0.72 for exact matching and 0.91 for partial matching. We also train and evaluate common BERT models and state-of-the-art nested NER models. The best F1 scores are 0.84 for Action, 0.70 for Mobility, 0.62 for Assistance, and 0.71 for Quantification. Empirical results demonstrate the promising potential of NER models to accurately extract mobility functioning information from clinical text. The public availability of our annotated dataset will facilitate further research to comprehensively capture functioning information in electronic health records (EHRs).
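The sentence-selection strategy described above, query-by-committee sampling weighted by density representativeness, can be sketched as follows. This is an illustrative implementation under assumptions: vote entropy as the committee disagreement measure, mean TF-IDF cosine similarity to the candidate pool as the density term, and a tunable weighting exponent; it is not the authors' exact formulation.

```python
# Illustrative sketch only: query-by-committee sampling weighted by density
# representativeness. Disagreement = vote entropy over committee predictions;
# density = mean TF-IDF cosine similarity to the candidate pool. The feature
# choice and the beta exponent are assumptions, not the authors' exact setup.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def vote_entropy(committee_votes):
    """committee_votes: (n_models, n_sentences) array of discrete predictions."""
    n_models, n_sentences = committee_votes.shape
    entropy = np.zeros(n_sentences)
    for j in range(n_sentences):
        _, counts = np.unique(committee_votes[:, j], return_counts=True)
        p = counts / n_models
        entropy[j] = -np.sum(p * np.log(p))
    return entropy

def select_for_annotation(sentences, committee_votes, k=100, beta=1.0):
    """Pick the k candidate sentences that are both disputed and representative."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    density = cosine_similarity(tfidf).mean(axis=1)   # representativeness of each sentence
    scores = vote_entropy(committee_votes) * density ** beta
    return np.argsort(scores)[::-1][:k]

# Toy usage with a 3-sentence pool and a 3-model committee (labels are arbitrary):
pool = ["Patient ambulates with a walker.", "Denies chest pain.", "Unable to climb stairs."]
votes = np.array([["Mobility", "O", "Mobility"],
                  ["Mobility", "O", "O"],
                  ["O",        "O", "Mobility"]])
print(select_for_annotation(pool, votes, k=2))
```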