Abstract:Empathetic Conversational Systems (ECS) are built to respond empathetically to the user's emotions and sentiments, regardless of the application domain. Current ECS studies evaluation approaches are restricted to offline evaluation experiments primarily for gold standard comparison & benchmarking, and user evaluation studies for collecting human ratings on specific constructs. These methods are inadequate in measuring the actual quality of empathy in conversations. In this paper, we propose a multidimensional empathy evaluation framework with three new methods for measuring empathy at (i) structural level using three empathy-related dimensions, (ii) behavioral level using empathy behavioral types, and (iii) overall level using an empathy lexicon, thereby fortifying the evaluation process. Experiments were conducted with the state-of-the-art ECS models and large language models (LLMs) to show the framework's usefulness.
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various information-seeking and reasoning tasks. These computational systems drive state-of-the-art dialogue systems, such as ChatGPT and Bard. They also carry substantial promise in meeting the growing demands of mental health care, albeit relatively unexplored. As such, this study sought to examine LLMs' capability to generate empathetic responses in conversations that emulate those in a mental health counselling setting. We selected five LLMs: version 3.5 and version 4 of the Generative Pre-training (GPT), Vicuna FastChat-T5, Pathways Language Model (PaLM) version 2, and Falcon-7B-Instruct. Based on a simple instructional prompt, these models responded to utterances derived from the EmpatheticDialogues (ED) dataset. Using three empathy-related metrics, we compared their responses to those from traditional response generation dialogue systems, which were fine-tuned on the ED dataset, along with human-generated responses. Notably, we discovered that responses from the LLMs were remarkably more empathetic in most scenarios. We position our findings in light of catapulting advancements in creating empathetic conversational systems.