Abstract: The study of multimodal interaction in therapy can yield a comprehensive understanding of therapist and patient behavior that can be used to develop a multimodal virtual agent supporting therapy. This investigation aims to uncover how therapists skillfully blend therapy's task goal (employing classical steps of Motivational Interviewing) with the social goal (building a trusting relationship and expressing empathy). Furthermore, we seek to categorize patients into various "types" requiring tailored therapeutic approaches. To this end, we present multimodal annotations of a corpus consisting of simulated motivational interviewing conversations, wherein actors portray the roles of patients and therapists. We introduce EMMI, composed of two publicly available MI corpora, AnnoMI and the Motivational Interviewing Dataset, to which we add multimodal annotations. We analyze these annotations to characterize functional behavior for developing a virtual agent that performs motivational interviews with an emphasis on social and empathic behaviors. Our analysis found three clusters of patients that differ significantly in behavior, and it found that therapists adapt their behavior to those types. This shows the importance of a therapist being able to adapt their behavior depending on the current situation within the dialog and the type of user.
Abstract: Motivational Interviewing (MI) is an approach to therapy that emphasizes collaboration and encourages behavioral change. To evaluate the quality of an MI conversation, client utterances can be classified using the Motivational Interviewing Skill Code (MISC) as either change talk, sustain talk, or follow/neutral talk. The proportion of change talk in an MI conversation is positively correlated with therapy outcomes, making accurate classification of client utterances essential. In this paper, we present a classifier that accurately distinguishes between the three MISC classes (change talk, sustain talk, and follow/neutral talk) by leveraging multimodal features such as text, prosody, facial expressivity, and body expressivity. To train our model, we annotate the publicly available AnnoMI dataset to collect multimodal information, including text, audio, facial expressivity, and body expressivity. Furthermore, we identify the most important modalities in the decision-making process, providing valuable insights into the interplay of different modalities during an MI conversation.