Abstract:The integration of Large Language Models (LLMs) into medical applications has sparked widespread interest across the healthcare industry, from drug discovery and development to clinical decision support, assisting telemedicine, medical devices, and healthcare insurance applications. This perspective paper aims to discuss the inner workings of building LLM-powered medical AI applications and introduces a comprehensive framework for their development. We review existing literature and outline the unique challenges of applying LLMs in specialized medical contexts. Additionally, we introduce a three-step framework to organize medical LLM research activities: 1) Modeling: breaking down complex medical workflows into manageable steps for developing medical-specific models; 2) Optimization: optimizing the model performance with crafted prompts and integrating external knowledge and tools, and 3) System engineering: decomposing complex tasks into subtasks and leveraging human expertise for building medical AI applications. Furthermore, we offer a detailed use case playbook that describes various LLM-powered medical AI applications, such as optimizing clinical trial design, enhancing clinical decision support, and advancing medical imaging analysis. Finally, we discuss various challenges and considerations for building medical AI applications with LLMs, such as handling hallucination issues, data ownership and compliance, privacy, intellectual property considerations, compute cost, sustainability issues, and responsible AI requirements.
Abstract:Data science plays a critical role in clinical research, but it requires professionals with expertise in coding and medical data analysis. Large language models (LLMs) have shown great potential in supporting medical tasks and performing well in general coding tests. However, these tests do not assess LLMs' ability to handle data science tasks in medicine, nor do they explore their practical utility in clinical research. To address this, we developed a dataset consisting of 293 real-world data science coding tasks, based on 39 published clinical studies, covering 128 tasks in Python and 165 tasks in R. This dataset simulates realistic clinical research scenarios using patient data. Our findings reveal that cutting-edge LLMs struggle to generate perfect solutions, frequently failing to follow input instructions, understand target data, and adhere to standard analysis practices. Consequently, LLMs are not yet ready to fully automate data science tasks. We benchmarked advanced adaptation methods and found two to be particularly effective: chain-of-thought prompting, which provides a step-by-step plan for data analysis, which led to a 60% improvement in code accuracy; and self-reflection, enabling LLMs to iteratively refine their code, yielding a 38% accuracy improvement. Building on these insights, we developed a platform that integrates LLMs into the data science workflow for medical professionals. In a user study with five medical doctors, we found that while LLMs cannot fully automate coding tasks, they significantly streamline the programming process. We found that 80% of their submitted code solutions were incorporated from LLM-generated code, with up to 96% reuse in some cases. Our analysis highlights the potential of LLMs, when integrated into expert workflows, to enhance data science efficiency in clinical research.
Abstract:Automatic medical discovery by AI is a dream of many. One step toward that goal is to create an AI model to understand clinical studies and synthesize clinical evidence from the literature. Clinical evidence synthesis currently relies on systematic reviews of clinical trials and retrospective analyses from medical literature. However, the rapid expansion of publications presents challenges in efficiently identifying, summarizing, and updating evidence. We introduce TrialMind, a generative AI-based pipeline for conducting medical systematic reviews, encompassing study search, screening, and data extraction phases. We utilize large language models (LLMs) to drive each pipeline component while incorporating human expert oversight to minimize errors. To facilitate evaluation, we also create a benchmark dataset TrialReviewBench, a custom dataset with 870 annotated clinical studies from 25 meta-analysis papers across various medical treatments. Our results demonstrate that TrialMind significantly improves the literature review process, achieving high recall rates (0.897-1.000) in study searching from over 20 million PubMed studies and outperforming traditional language model embeddings-based methods in screening (Recall@20 of 0.227-0.246 vs. 0.000-0.102). Furthermore, our approach surpasses direct GPT-4 performance in result extraction, with accuracy ranging from 0.65 to 0.84. We also support clinical evidence synthesis in forest plots, as validated by eight human annotators who preferred TrialMind over the GPT-4 baseline with a winning rate of 62.5%-100% across the involved reviews. Our findings suggest that an LLM-based clinical evidence synthesis approach, such as TrialMind, can enable reliable and high-quality clinical evidence synthesis to improve clinical research efficiency.