Abstract: The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision significantly enhances the medical multimodal capabilities of current MLLMs, yielding notable improvements on benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train HuatuoGPT-Vision, a 34B medical MLLM that shows superior performance in medical multimodal scenarios among open-source MLLMs.
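As an illustration only (not the released PubMedVision pipeline), the sketch below shows how a noisy PubMed figure-caption pair could be reformatted into a VQA sample by prompting an 'unblinded' MLLM, i.e. one that sees the image together with its caption. The function name `query_mllm` and the prompt wording are hypothetical stand-ins for whatever GPT-4V-style API and instructions are actually used.

```python
# Hedged sketch: reformatting one image-caption pair into a medical VQA sample.
# `query_mllm(image_path, prompt)` is a hypothetical callback wrapping any
# GPT-4V-style chat API that accepts an image plus a text prompt.
import json
from typing import Callable, Optional

REFORMAT_PROMPT = (
    "You are shown a medical figure and its original caption:\n"
    "Caption: {caption}\n\n"
    "Using both the image and the caption, write one clinically meaningful "
    "question about the image and a faithful answer. Discard content that "
    "cannot be verified from the image. Reply as JSON with keys "
    "'question' and 'answer'."
)

def build_vqa_sample(image_path: str, caption: str,
                     query_mllm: Callable[[str, str], str]) -> Optional[dict]:
    """Turn one noisy image-caption pair into a (question, answer) VQA sample."""
    prompt = REFORMAT_PROMPT.format(caption=caption.strip())
    # The model is "unblinded": it receives the image and its caption together.
    raw = query_mllm(image_path, prompt)
    try:
        qa = json.loads(raw)
        return {"image": image_path,
                "question": qa["question"],
                "answer": qa["answer"]}
    except (json.JSONDecodeError, KeyError):
        return None  # drop pairs the model failed to denoise/reformat
```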
Abstract: To enable earlier medical intervention for infants with cerebral palsy (CP), early diagnosis of brain damage is critical. Although general movements assessment (GMA) has shown promising results in early CP detection, it is laborious. Most existing works that automate GMA take videos as input and classify fidgety movements (FMs). These methods require the complete video to be observed and cannot localize the frames containing normal FMs. We therefore propose WO-GMA, a novel approach that performs FMs localization in a weakly supervised online setting. Infant body keypoints are first extracted as the inputs to WO-GMA. WO-GMA then performs local spatio-temporal feature extraction, followed by two network branches that generate pseudo clip-level labels and model online actions. With the clip-level pseudo labels, the action modeling branch learns to detect FMs in an online fashion. Experimental results on a dataset with 757 videos of different infants show that WO-GMA achieves state-of-the-art video-level classification and clip-level detection results. Moreover, only the first 20% of each video is needed to obtain classification results as good as those from full observation, implying a significantly shortened FMs diagnosis time. Code is available at: https://github.com/scofiedluo/WO-GMA.
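The sketch below is a minimal, hypothetical illustration of the two-branch idea described in the abstract, not the authors' released WO-GMA code: keypoint sequences pass through a local spatio-temporal extractor (a temporal convolution here), one branch produces clip-level scores supervised only by video-level labels (whose thresholded outputs serve as pseudo clip labels), and a causal GRU branch models actions online against those pseudo labels. All layer choices and hyperparameters are assumptions.

```python
# Hedged sketch of a weakly supervised, two-branch online FMs detector over keypoints.
import torch
import torch.nn as nn

class TwoBranchFMDetector(nn.Module):
    def __init__(self, num_joints=17, coord_dim=2, hidden=128):
        super().__init__()
        in_dim = num_joints * coord_dim
        # Local spatio-temporal feature extraction over the keypoint sequence.
        self.local_extractor = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Branch A: clip-level scoring trained from video-level labels (MIL style);
        # its thresholded scores act as pseudo clip labels.
        self.clip_scorer = nn.Conv1d(hidden, 1, kernel_size=1)
        # Branch B: online action modeling; the GRU only sees past frames,
        # so predictions are available while the video streams in.
        self.online_gru = nn.GRU(hidden, hidden, batch_first=True)
        self.online_head = nn.Linear(hidden, 1)

    def forward(self, keypoints):
        # keypoints: (batch, time, num_joints, coord_dim)
        b, t, j, c = keypoints.shape
        x = keypoints.reshape(b, t, j * c).transpose(1, 2)        # (b, in_dim, t)
        feats = self.local_extractor(x)                           # (b, hidden, t)
        clip_logits = self.clip_scorer(feats).squeeze(1)          # (b, t)
        # Video-level prediction via top-k mean pooling over clip scores.
        k = max(1, t // 8)
        video_logit = clip_logits.topk(k, dim=1).values.mean(1)   # (b,)
        # Online branch consumes the same features causally.
        h, _ = self.online_gru(feats.transpose(1, 2))             # (b, t, hidden)
        online_logits = self.online_head(h).squeeze(-1)           # (b, t)
        return video_logit, clip_logits, online_logits

# Training signal: video-level BCE for branch A, plus pseudo clip labels
# (thresholded, detached clip scores) supervising the online branch.
model = TwoBranchFMDetector()
kps = torch.randn(2, 300, 17, 2)                  # 2 videos, 300 frames of keypoints
video_logit, clip_logits, online_logits = model(kps)
video_labels = torch.tensor([1.0, 0.0])
pseudo = (torch.sigmoid(clip_logits).detach() > 0.5).float()
loss = nn.functional.binary_cross_entropy_with_logits(video_logit, video_labels) \
     + nn.functional.binary_cross_entropy_with_logits(online_logits, pseudo)
loss.backward()
```

Because the online branch is causal, its per-frame scores can be read out as soon as each frame arrives, which is what allows a classification decision after observing only an early portion of the video.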