Abstract: Advances in machine intelligence have enabled conversational interfaces that have the potential to radically change the way humans interact with machines. However, despite progress in the abilities of these agents, critical gaps remain in their capacity for natural interaction. One limitation is that agents are often monotonic in behavior and do not adapt to their partner. We built two end-to-end conversational agents: a voice-based agent that can engage in naturalistic, multi-turn dialogue and align with its interlocutor's conversational style, and a second, expressive, embodied conversational agent (ECA) that can recognize human behavior during open-ended conversations and automatically align its responses to the visual and conversational style of the other party. The embodied conversational agent leverages multimodal inputs to produce rich and perceptually valid vocal and facial responses (e.g., lip syncing and facial expressions) during the conversation. Based on empirical results from a set of user studies, we highlight several significant challenges in building such systems and provide design guidelines for future research on multi-turn dialogue interaction with style adaptation.