Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Georg Groh

It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge

Sep 19, 2025

Lukas Ellinger, Georg Groh

Abstract:Ambiguous words or underspecified references require interlocutors to resolve them, often by relying on shared context and commonsense knowledge. Therefore, we systematically investigate whether Large Language Models (LLMs) can leverage commonsense to resolve referential ambiguity in multi-turn conversations and analyze their behavior when ambiguity persists. Further, we study how requests for simplified language affect this capacity. Using a novel multilingual evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate that current LLMs struggle to resolve ambiguity effectively: they tend to commit to a single interpretation or cover all possible references, rather than hedging or seeking clarification. This limitation becomes more pronounced under simplification prompts, which drastically reduce the use of commonsense reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct Preference Optimization substantially improves ambiguity resolution across all request types. These results underscore the need for advanced fine-tuning to improve LLMs' handling of ambiguity and to ensure robust performance across diverse communication styles.

* Accepted by UncertaiNLP workshop @ EMNLP 2025

Via

Access Paper or Ask Questions

German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German

Aug 25, 2025

Miriam Anschütz, Thanh Mai Pham, Eslam Nasrallah, Maximilian Müller, Cristian-George Craciun, Georg Groh

Abstract:The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We opensource both the dataset and the model to encourage further research on multi-level paraphrasing

* Accepted to INLG 2025

Via

Access Paper or Ask Questions

Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions

Jul 16, 2025

Lukas Ellinger, Miriam Anschütz, Georg Groh

Abstract:Large Language Models (LLMs) can provide accurate word definitions and explanations for any context. However, the scope of the definition changes for different target groups, like children or language learners. This is especially relevant for homonyms, words with multiple meanings, where oversimplification might risk information loss by omitting key senses, potentially misleading users who trust LLM outputs. We investigate how simplification impacts homonym definition quality across three target groups: Normal, Simple, and ELI5. Using two novel evaluation datasets spanning multiple languages, we test DeepSeek v3, Llama 4 Maverick, Qwen3-30B A3B, GPT-4o mini, and Llama 3.1 8B via LLM-as-Judge and human annotations. Our results show that simplification drastically degrades definition completeness by neglecting polysemy, increasing the risk of misunderstanding. Fine-tuning Llama 3.1 8B with Direct Preference Optimization substantially improves homonym response quality across all prompt types. These findings highlight the need to balance simplicity and completeness in educational NLP to ensure reliable, context-aware definitions for all learners.

* Accepted by RANLP 2025

Via

Access Paper or Ask Questions

From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents

May 29, 2025

Tobias Lindenbauer, Georg Groh, Hinrich Schütze

Abstract:We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-Of-Experts (MoEs) inspired approach to create both a general-purpose and repository-level CTIM. We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scale to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation.

* Short Paper, REALM '25 camera-ready

Via

Access Paper or Ask Questions

To Bias or Not to Bias: Detecting bias in News with bias-detector

May 19, 2025

Himel Ghosh, Ahmed Mosharafa, Georg Groh

Abstract:Media bias detection is a critical task in ensuring fair and balanced information dissemination, yet it remains challenging due to the subjectivity of bias and the scarcity of high-quality annotated data. In this work, we perform sentence-level bias classification by fine-tuning a RoBERTa-based model on the expert-annotated BABE dataset. Using McNemar's test and the 5x2 cross-validation paired t-test, we show statistically significant improvements in performance when comparing our model to a domain-adaptively pre-trained DA-RoBERTa baseline. Furthermore, attention-based analysis shows that our model avoids common pitfalls like oversensitivity to politically charged terms and instead attends more meaningfully to contextually relevant tokens. For a comprehensive examination of media bias, we present a pipeline that combines our model with an already-existing bias-type classifier. Our method exhibits good generalization and interpretability, despite being constrained by sentence-level analysis and dataset size because of a lack of larger and more advanced bias corpora. We talk about context-aware modeling, bias neutralization, and advanced bias type classification as potential future directions. Our findings contribute to building more robust, explainable, and socially responsible NLP systems for media bias detection.

* 7 pages, 5 figures, 2 tables

Via

Access Paper or Ask Questions

Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

Nov 25, 2024

Carolin M. Schuster, Maria-Alexandra Dinisor, Shashwat Ghatiwala, Georg Groh

Figure 1 for Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

Figure 2 for Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

Figure 3 for Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

Figure 4 for Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

Abstract:Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI), however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions we investigate gender bias in contextual embeddings, across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuition and use case for exposing and visualizing bias.

Via

Access Paper or Ask Questions

Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey

Nov 25, 2024

Alexander Fichtl, Juraj Vladika, Georg Groh

Figure 1 for Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey

Figure 2 for Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey

Figure 3 for Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey

Figure 4 for Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey

Abstract:Knowledge-enhanced language models (KELMs) have emerged as promising tools to bridge the gap between large-scale language models and domain-specific knowledge. KELMs can achieve higher factual accuracy and mitigate hallucinations by leveraging knowledge graphs (KGs). They are frequently combined with adapter modules to reduce the computational load and risk of catastrophic forgetting. In this paper, we conduct a systematic literature review (SLR) on adapter-based approaches to KELMs. We provide a structured overview of existing methodologies in the field through quantitative and qualitative analysis and explore the strengths and potential shortcomings of individual approaches. We show that general knowledge and domain-specific approaches have been frequently explored along with various adapter architectures and downstream tasks. We particularly focused on the popular biomedical domain, where we provided an insightful performance comparison of existing KELMs. We outline the main trends and propose promising future directions.

* In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KEOD 2024; ISBN 978-989-758-716-0; ISSN 2184-3228, SciTePress, pages 95-107
* 12 pages, 4 figures. Published at KEOD24 via SciTePress

Via

Access Paper or Ask Questions

Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics

Oct 28, 2024

Florian Eichin, Carolin Schuster, Georg Groh, Michael A. Hedderich

Figure 1 for Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics

Figure 2 for Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics

Figure 3 for Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics

Figure 4 for Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics

Abstract:Topic modeling is a key method in text analysis, but existing approaches are limited by assuming one topic per document or fail to scale efficiently for large, noisy datasets of short texts. We introduce Semantic Component Analysis (SCA), a novel topic modeling technique that overcomes these limitations by discovering multiple, nuanced semantic components beyond a single topic in short texts which we accomplish by introducing a decomposition step to the clustering-based topic modeling framework. Evaluated on multiple Twitter datasets, SCA matches the state-of-the-art method BERTopic in coherence and diversity, while uncovering at least double the semantic components and maintaining a noise rate close to zero while staying scalable and effective across languages, including an underrepresented one.

* 5 pages, 3 figures, code: https://github.com/mainlp/semantic_components

Via

Access Paper or Ask Questions

A Comprehensive Evaluation of Cognitive Biases in LLMs

Oct 20, 2024

Simon Malberg, Roman Poletukhin, Carolin M. Schuster, Georg Groh

Figure 1 for A Comprehensive Evaluation of Cognitive Biases in LLMs

Figure 2 for A Comprehensive Evaluation of Cognitive Biases in LLMs

Figure 3 for A Comprehensive Evaluation of Cognitive Biases in LLMs

Figure 4 for A Comprehensive Evaluation of Cognitive Biases in LLMs

Abstract:We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code to encourage future research on biases in LLMs: https://github.com/simonmalberg/cognitive-biases-in-llms

Via

Access Paper or Ask Questions

Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication

Oct 04, 2024

Miriam Anschütz, Tringa Sylaj, Georg Groh

Figure 1 for Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication

Figure 2 for Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication

Figure 3 for Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication

Figure 4 for Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication

Abstract:Explanatory images play a pivotal role in accessible and easy-to-read (E2R) texts. However, the images available in online databases are not tailored toward the respective texts, and the creation of customized images is expensive. In this large-scale study, we investigated whether text-to-image generation models can close this gap by providing customizable images quickly and easily. We benchmarked seven, four open- and three closed-source, image generation models and provide an extensive evaluation of the resulting images. In addition, we performed a user study with people from the E2R target group to examine whether the images met their requirements. We find that some of the models show remarkable performance, but none of the models are ready to be used at a larger scale without human supervision. Our research is an important step toward facilitating the creation of accessible information for E2R creators and tailoring accessible images to the target group's needs.

* To be published at TSAR workshop 2024 (https://tsar-workshop.github.io/)

Via

Access Paper or Ask Questions