Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wolfgang Nejdl

Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency

Jan 29, 2026

Alexander Blezinger, Wolfgang Nejdl, Ming Tang

Abstract:Foundation models pretrained on large-scale histopathology data have found great success in various fields of computational pathology, but their impact on regressive biomarker prediction remains underexplored. In this work, we systematically evaluate histopathological foundation models for regression-based tasks, demonstrated through the prediction of homologous recombination deficiency (HRD) score - a critical biomarker for personalized cancer treatment. Within multiple instance learning frameworks, we extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models, and evaluate their impact compared to contrastive learning-based features. Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts from two public medical data collections. Extensive experiments demonstrate that models trained on foundation model features consistently outperform the baseline in terms of predictive accuracy and generalization capabilities while exhibiting systematic differences among the foundation models. Additionally, we propose a distribution-based upsampling strategy to mitigate target imbalance in these datasets, significantly improving the recall and balanced accuracy for underrepresented but clinically important patient populations. Furthermore, we investigate the impact of different sampling strategies and instance bagsizes by ablation studies. Our results highlight the benefits of large-scale histopathological pretraining for more precise and transferable regressive biomarker prediction, showcasing its potential to advance AI-driven precision oncology.

* 9 pages, 7 figures and 5 tables. Initialy submitted for IJCAI 2026

Via

Access Paper or Ask Questions

MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Dec 12, 2025

Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, Wolfgang Nejdl

Figure 1 for MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Figure 2 for MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Figure 3 for MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Figure 4 for MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Abstract:Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.

* 7 pages, 3 figures

Via

Access Paper or Ask Questions

Towards Transparent Stance Detection: A Zero-Shot Approach Using Implicit and Explicit Interpretability

Nov 05, 2025

Apoorva Upadhyaya, Wolfgang Nejdl, Marco Fisichella

Abstract:Zero-Shot Stance Detection (ZSSD) identifies the attitude of the post toward unseen targets. Existing research using contrastive, meta-learning, or data augmentation suffers from generalizability issues or lack of coherence between text and target. Recent works leveraging large language models (LLMs) for ZSSD focus either on improving unseen target-specific knowledge or generating explanations for stance analysis. However, most of these works are limited by their over-reliance on explicit reasoning, provide coarse explanations that lack nuance, and do not explicitly model the reasoning process, making it difficult to interpret the model's predictions. To address these issues, in our study, we develop a novel interpretable ZSSD framework, IRIS. We provide an interpretable understanding of the attitude of the input towards the target implicitly based on sequences within the text (implicit rationales) and explicitly based on linguistic measures (explicit rationales). IRIS considers stance detection as an information retrieval ranking task, understanding the relevance of implicit rationales for different stances to guide the model towards correct predictions without requiring the ground-truth of rationales, thus providing inherent interpretability. In addition, explicit rationales based on communicative features help decode the emotional and cognitive dimensions of stance, offering an interpretable understanding of the author's attitude towards the given target. Extensive experiments on the benchmark datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10% training data prove the generalizability of our model, benefiting from the proposed architecture and interpretable design.

* Accepted in AAAI CONFERENCE ON WEB AND SOCIAL MEDIA (ICWSM 2026)

Via

Access Paper or Ask Questions

Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

Oct 14, 2025

Hailay Kidu Teklehaymanot, Wolfgang Nejdl

Figure 1 for Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

Figure 2 for Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

Figure 3 for Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

Figure 4 for Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

Abstract:Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence across linguistically diverse populations. This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages to systematically quantify computational inequities in large language models (LLMs). Using a standardized experimental framework, we applied consistent preprocessing and normalization protocols, followed by uniform tokenization through the tiktoken library across all language samples. Comprehensive tokenization statistics were collected using established evaluation metrics, including Tokens Per Sentence (TPS) and Relative Tokenization Cost (RTC), benchmarked against English baselines. Our cross-linguistic analysis reveals substantial and systematic disparities: Latin-script languages consistently exhibit higher tokenization efficiency, while non-Latin and morphologically complex languages incur significantly greater token inflation, often 3-5 times higher RTC ratios. These inefficiencies translate into increased computational costs and reduced effective context utilization for underrepresented languages. Overall, the findings highlight structural inequities in current AI systems, where speakers of low-resource and non-Latin languages face disproportionate computational disadvantages. Future research should prioritize the development of linguistically informed tokenization strategies and adaptive vocabulary construction methods that incorporate typological diversity, ensuring more inclusive and computationally equitable multilingual AI systems.

* 6 pages 4 figures

Via

Access Paper or Ask Questions

MoVoC: Morphology-Aware Subword Construction for Geez Script Languages

Sep 10, 2025

Hailay Kidu Teklehaymanot, Dren Fazlija, Wolfgang Nejdl

Figure 1 for MoVoC: Morphology-Aware Subword Construction for Geez Script Languages

Figure 2 for MoVoC: Morphology-Aware Subword Construction for Geez Script Languages

Figure 3 for MoVoC: Morphology-Aware Subword Construction for Geez Script Languages

Figure 4 for MoVoC: Morphology-Aware Subword Construction for Geez Script Languages

Abstract:Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Geez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Geez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in intrinsic metrics, MorphoScore, and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated datasets and tokenizer will be publicly available to support further research in low-resource, morphologically rich languages. Our code and data are available on GitHub: https://github.com/hailaykidu/MoVoC

* This submission is approximately 10 pages in length and includes 1 figure and 6 tables

Via

Access Paper or Ask Questions

"Where does it hurt?" -- Dataset and Study on Physician Intent Trajectories in Doctor Patient Dialogues

Aug 26, 2025

Tom Röhr, Soumyadeep Roy, Fares Al Mohamad, Jens-Michalis Papaioannou, Wolfgang Nejdl, Felix Gers, Alexander Löser

Abstract:In a doctor-patient dialogue, the primary objective of physicians is to diagnose patients and propose a treatment plan. Medical doctors guide these conversations through targeted questioning to efficiently gather the information required to provide the best possible outcomes for patients. To the best of our knowledge, this is the first work that studies physician intent trajectories in doctor-patient dialogues. We use the `Ambient Clinical Intelligence Benchmark' (Aci-bench) dataset for our study. We collaborate with medical professionals to develop a fine-grained taxonomy of physician intents based on the SOAP framework (Subjective, Objective, Assessment, and Plan). We then conduct a large-scale annotation effort to label over 5000 doctor-patient turns with the help of a large number of medical experts recruited using Prolific, a popular crowd-sourcing platform. This large labeled dataset is an important resource contribution that we use for benchmarking the state-of-the-art generative and encoder models for medical intent classification tasks. Our findings show that our models understand the general structure of medical dialogues with high accuracy, but often fail to identify transitions between SOAP categories. We also report for the first time common trajectories in medical dialogue structures that provide valuable insights for designing `differential diagnosis' systems. Finally, we extensively study the impact of intent filtering for medical dialogue summarization and observe a significant boost in performance. We make the codes and data, including annotation guidelines, publicly available at https://github.com/DATEXIS/medical-intent-classification.

* Accepted at ECAI 2025

Via

Access Paper or Ask Questions

FlexTSF: A Universal Forecasting Model for Time Series with Variable Regularities

Oct 30, 2024

Jingge Xiao, Yile Chen, Gao Cong, Wolfgang Nejdl, Simon Gottschalk

Abstract:Developing a foundation model for time series forecasting across diverse domains has attracted significant attention in recent years. Existing works typically assume regularly sampled, well-structured data, limiting their applicability to more generalized scenarios where time series often contain missing values, unequal sequence lengths, and irregular time intervals between measurements. To cover diverse domains and handle variable regularities, we propose FlexTSF, a universal time series forecasting model that possesses better generalization and natively support both regular and irregular time series. FlexTSF produces forecasts in an autoregressive manner and incorporates three novel designs: VT-Norm, a normalization strategy to ablate data domain barriers, IVP Patcher, a patching module to learn representations from flexibly structured time series, and LED attention, an attention mechanism to seamlessly integrate these two and propagate forecasts with awareness of domain and time information. Experiments on 12 datasets show that FlexTSF outperforms state-of-the-art forecasting models respectively designed for regular and irregular time series. Furthermore, after self-supervised pre-training, FlexTSF shows exceptional performance in both zero-shot and few-show settings for time series forecasting.

Via

Access Paper or Ask Questions

Towards Precision Healthcare: Robust Fusion of Time Series and Image Data

May 24, 2024

Ali Rasekh, Reza Heidari, Amir Hosein Haji Mohammad Rezaie, Parsa Sharifi Sedeh, Zahra Ahmadi, Prasenjit Mitra, Wolfgang Nejdl

Figure 1 for Towards Precision Healthcare: Robust Fusion of Time Series and Image Data

Figure 2 for Towards Precision Healthcare: Robust Fusion of Time Series and Image Data

Figure 3 for Towards Precision Healthcare: Robust Fusion of Time Series and Image Data

Figure 4 for Towards Precision Healthcare: Robust Fusion of Time Series and Image Data

Abstract:With the increasing availability of diverse data types, particularly images and time series data from medical experiments, there is a growing demand for techniques designed to combine various modalities of data effectively. Our motivation comes from the important areas of predicting mortality and phenotyping where using different modalities of data could significantly improve our ability to predict. To tackle this challenge, we introduce a new method that uses two separate encoders, one for each type of data, allowing the model to understand complex patterns in both visual and time-based information. Apart from the technical challenges, our goal is to make the predictive model more robust in noisy conditions and perform better than current methods. We also deal with imbalanced datasets and use an uncertainty loss function, yielding improved results while simultaneously providing a principled means of modeling uncertainty. Additionally, we include attention mechanisms to fuse different modalities, allowing the model to focus on what's important for each task. We tested our approach using the comprehensive multimodal MIMIC dataset, combining MIMIC-IV and MIMIC-CXR datasets. Our experiments show that our method is effective in improving multimodal deep learning for clinical applications. The code will be made available online.

Via

Access Paper or Ask Questions

TIGQA:An Expert Annotated Question Answering Dataset in Tigrinya

Apr 26, 2024

Hailay Teklehaymanot, Dren Fazlija, Niloy Ganguly, Gourab K. Patro, Wolfgang Nejdl

Abstract:The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources.This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pretrained models.The notable disparities between human performance and best model performance underscore the potential for further enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC.

* LREC-COLING 2024
* 9 pages,3 figures, 7 tables,2 listings

Via

Access Paper or Ask Questions

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Apr 20, 2024

Soumyadeep Roy, Aparup Khatua, Fatemeh Ghoochani, Uwe Hadler, Wolfgang Nejdl, Niloy Ganguly

Figure 1 for Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Figure 2 for Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Figure 3 for Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Figure 4 for Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Abstract:GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, around 14% of errors remain. Additionally, current works use GPT-4 to only predict the correct option without providing any explanation and thus do not provide any insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 GPT-4 correct responses and 919 incorrect responses to the United States Medical Licensing Examination (USMLE) respectively. These responses are quite long (258 words on average), containing detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 out of these 919 incorrect data points at a granular level for different classes and created a multi-label span to identify the reasons behind the error. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized as a "Reasonable response by GPT-4," by annotators. This sheds light on the challenge of discerning explanations that may lead to incorrect options, even among trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe that it will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy .

* 10 pages, 4 figures. Accepted for publication at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)

Via

Access Paper or Ask Questions