Abstract: This research examines the use of Reinforcement Learning from AI Feedback (RLAIF) to improve healthcare dialogue models, aiming to tackle the challenges of preference-aligned data annotation while reducing reliance on medical experts. We argue that the primary obstacles in current RLAIF research for healthcare are the limitations of automated evaluation methods and the difficulty of accurately representing physician preferences. To address these challenges, we present a new evaluation framework based on standardized patient examinations. This framework objectively assesses how effectively large language models (LLMs) guide users and follow instructions, enabling a comprehensive comparison across models. Furthermore, our investigation into expressing physician preferences within Constitutional AI algorithms found flowcharts to be particularly effective. Building on this finding, we introduce an agent-based approach for annotating preference data that autonomously creates medical dialogue flows tailored to the patient's condition, demonstrates strong generalization, and reduces the need for expert involvement. Our results show that the agent-based approach outperforms existing RLAIF annotation methods in standardized patient examinations and surpasses current open-source medical dialogue LLMs across various test scenarios.
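The abstract does not spell out how the annotation agent turns a flowchart into preference labels, so the following is only a minimal, hypothetical Python sketch: a toy medical dialogue flowchart is encoded as a lookup table, and candidate doctor replies are split into chosen/rejected preference pairs depending on whether they follow the flowchart's expected next step. The flowchart structure, action labels, and function names are assumptions for illustration, not the paper's implementation.

# Illustrative sketch only: the flowchart format, action labels, and scoring
# rule below are assumptions, not the paper's actual agent.
from dataclasses import dataclass

# A toy "medical dialogue flowchart": each node names the step the doctor
# is expected to take next, given what has already been established.
FLOWCHART = {
    "start": {"expected": "ask_symptom_duration", "next": "triage"},
    "triage": {"expected": "ask_red_flags", "next": "advise"},
    "advise": {"expected": "recommend_department", "next": "end"},
}

@dataclass
class Candidate:
    text: str
    action: str  # coarse action label, e.g. assigned by the annotation agent

def follows_flow(state: str, candidate: Candidate) -> bool:
    """A candidate is preferred when its action matches the flowchart's expected step."""
    return FLOWCHART[state]["expected"] == candidate.action

def make_preference_pairs(state: str, candidates: list[Candidate]):
    """Split candidates into (chosen, rejected) text pairs for preference alignment."""
    chosen = [c for c in candidates if follows_flow(state, c)]
    rejected = [c for c in candidates if not follows_flow(state, c)]
    return [(c.text, r.text) for c in chosen for r in rejected]

if __name__ == "__main__":
    cands = [
        Candidate("How long have you had the cough?", "ask_symptom_duration"),
        Candidate("You should take antibiotics immediately.", "recommend_treatment"),
    ]
    print(make_preference_pairs("start", cands))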
Abstract: The use of large language models in medical dialogue generation has garnered significant attention, with a focus on improving response quality and fluency. While previous studies have made progress in optimizing model performance for single-round medical Q&A tasks, the model's capability for multi-round conversations needs to be enhanced to avoid logical inconsistencies. To address this, we propose preference learning from process feedback (PLPF), which integrates the doctor's diagnostic logic into LLMs. PLPF involves rule modeling, preference data generation, and preference alignment to train the model to adhere to the diagnostic process. Experimental results using Standardized Patient Testing show that PLPF improves the diagnostic accuracy of the baseline model in medical conversations by 17.6%, outperforming traditional reinforcement learning from human feedback. Additionally, PLPF is effective in both multi-round and single-round dialogue tasks, showcasing its potential for improving medical dialogue generation.
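The three PLPF stages are only named in the abstract, so the sketch below is a hypothetical Python illustration of how they could fit together: diagnostic-process rules written as simple predicates (rule modeling), a rule-compliance score that ranks dialogues into chosen/rejected pairs (preference data generation), and a DPO-style pairwise loss standing in for the alignment objective (preference alignment). The concrete rules, names, and loss are assumptions, not the paper's method.

# Minimal, hypothetical sketch of the three stages named in the abstract.
# A DPO-style loss is used as a stand-in for whatever alignment objective
# the paper actually employs.
import math

# 1) Rule modeling: encode diagnostic-process rules as predicates over a dialogue.
RULES = [
    lambda dialog: any("how long" in turn.lower() for turn in dialog),          # asks symptom duration
    lambda dialog: not any("diagnosis:" in turn.lower() for turn in dialog[:2]),  # no premature diagnosis
]

def rule_score(dialog: list[str]) -> int:
    """2) Preference data generation: dialogues satisfying more rules are preferred."""
    return sum(rule(dialog) for rule in RULES)

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """3) Preference alignment: a DPO-style pairwise loss over log-probabilities."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

if __name__ == "__main__":
    good = ["How long have you had the fever?", "Any chest pain?"]
    bad = ["Diagnosis: pneumonia.", "Take antibiotics."]
    assert rule_score(good) > rule_score(bad)   # 'good' becomes the chosen trajectory
    print(dpo_loss(-10.0, -12.0, -10.5, -11.5))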
Abstract: Modeling the emergence of swarm intelligence is a grand challenge, and many principles and models have been proposed. However, existing models do not capture the nature of swarm intelligence, and they are not generic enough to describe the various types of emergence phenomena. In this work, we propose a contradiction-centric model for the emergence of swarm intelligence, in which an individual's contradictions dominate its appearance while associated individuals interact to update one another's contradictions. This model hypothesizes that 1) the emergence of swarm intelligence is rooted in the development of individuals' contradictions and in the interactions among associated individuals, and 2) swarm intelligence is essentially a combinative reflection of the configuration of contradictions inside individuals and the distribution of contradictions among individuals. To verify the feasibility of the model, we simulate four types of swarm intelligence. The simulations show that the model is generic enough to describe the emergence of a variety of swarm intelligence and simple enough to demonstrate that emergence without complicated computation.
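To make the hypotheses concrete, the following is a toy Python rendering of the contradiction-centric loop as described in the abstract: each individual holds one contradiction (two opposing aspects), its observable appearance is the dominant aspect, and associated individuals interact to update each other's contradictions. The ring topology, the averaging update rule, and the parameter values are assumptions made only for illustration; they are not the paper's four simulated cases.

# Toy, hypothetical rendering of the contradiction-centric idea: the update
# rule and neighbourhood structure below are assumptions, chosen only to make
# the loop concrete.
import random

N, STEPS, RATE = 30, 50, 0.2

# contradiction[i] = strength of aspect A minus strength of aspect B, in [-1, 1]
contradiction = [random.uniform(-1.0, 1.0) for _ in range(N)]

def appearance(x: float) -> str:
    """The dominant aspect of the contradiction determines what is observable."""
    return "A" if x >= 0 else "B"

for _ in range(STEPS):
    updated = contradiction[:]
    for i in range(N):
        left, right = contradiction[(i - 1) % N], contradiction[(i + 1) % N]
        # Interaction with associated individuals (here: ring neighbours)
        # pulls the individual's contradiction toward theirs.
        updated[i] += RATE * ((left + right) / 2.0 - contradiction[i])
    contradiction = updated

# A shared appearance spreading across individuals is the kind of collective
# pattern the model attributes to the distribution of contradictions.
print("".join(appearance(x) for x in contradiction))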