Abstract:Despite significant advancements in general artificial intelligence models such as GPT-4, their effectiveness in the medical domain (general medical AI, GMAI) remains constrained by the absence of specialized medical knowledge. To address this challenge, we present GMAI-VL-5.5M, a comprehensive multimodal medical dataset created by converting hundreds of specialized medical datasets into meticulously constructed image-text pairs. The dataset offers comprehensive task coverage, diverse modalities, and high-quality image-text data. Building on this multimodal dataset, we propose GMAI-VL, a general medical vision-language model trained with a progressive three-stage strategy. This approach strengthens the integration of visual and textual information, improving the model's ability to process multimodal data and to support accurate diagnosis and clinical decision-making. Experimental evaluations show that GMAI-VL achieves state-of-the-art results across a wide range of multimodal medical tasks, such as visual question answering and medical image diagnosis. Our contributions include the GMAI-VL-5.5M dataset, the GMAI-VL model, and new benchmarks in multiple medical domains. Code and dataset will be released at https://github.com/uni-medical/GMAI-VL.
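The abstract does not spell out the three training stages; the sketch below shows one common progressive schedule for vision-language models (alignment of the projector, multimodal pretraining, instruction tuning). The stage names, trainable modules, data mixes, and epoch counts are illustrative assumptions, not the GMAI-VL paper's published recipe.

```python
# Minimal sketch of a progressive three-stage training schedule for a
# vision-language model. Stage names, trainable modules, and data mixes
# are illustrative assumptions, not GMAI-VL's exact configuration.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    trainable: list   # which modules receive gradients in this stage
    data: str         # data mix used in this stage
    epochs: int

STAGES = [
    Stage("1-alignment", ["projector"], "image-caption pairs", 1),
    Stage("2-multimodal-pretraining", ["projector", "llm"], "GMAI-VL-5.5M mixed tasks", 1),
    Stage("3-instruction-tuning", ["projector", "llm"], "medical VQA / dialogue", 2),
]

def run_schedule(stages, train_fn):
    """Run each stage in order, training only its listed modules."""
    for s in stages:
        print(f"[{s.name}] trainable={s.trainable}, data={s.data}")
        for epoch in range(s.epochs):
            train_fn(s, epoch)   # user-supplied training step

if __name__ == "__main__":
    run_schedule(STAGES, lambda s, e: print(f"  epoch {e}: training on {s.data}"))
```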
Abstract:The 3D reconstruction of the surgical field in minimally invasive endoscopic surgery has posed a formidable challenge when using conventional monocular endoscopes. Existing 3D reconstruction methods frequently suffer from suboptimal accuracy and limited generalization. In this study, we introduce a pipeline that uses Neural Radiance Fields (NeRF) for 3D reconstruction. Our approach first performs a preliminary NeRF reconstruction to obtain a coarse model, then creates a binocular scene within the reconstructed environment and derives an initial depth map via stereo vision. This depth map serves as depth supervision for subsequent NeRF iterations, progressively refining the 3D reconstruction. The binocular depth is recalculated at each iteration, and the refinement continues until the depth map converges and exhibits negligible variation. Through this recursive process, high-fidelity depth maps are generated from monocular endoscopic video of a realistic cranial phantom. Repeated measurements of the final 3D reconstruction against X-ray computed tomography show that all relevant clinical distances agree to sub-millimeter accuracy.
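The control flow of this recursion can be sketched as below. The functions train_nerf, render_stereo_pair, and stereo_depth stand in for components the abstract only names conceptually (NeRF trainer, novel-view renderer, stereo matcher); only the iteration and the convergence test on the depth map are shown.

```python
# Schematic of the iterative depth-supervised NeRF refinement: coarse NeRF ->
# synthetic binocular pair -> stereo depth -> depth-supervised NeRF, repeated
# until the depth map exhibits negligible change.
import numpy as np

def refine_depth(frames, train_nerf, render_stereo_pair, stereo_depth,
                 max_iters=5, tol=1e-3):
    depth_prior = None            # first pass: plain NeRF, no depth supervision
    prev_depth = None
    nerf, depth = None, None
    for it in range(max_iters):
        nerf = train_nerf(frames, depth_supervision=depth_prior)
        left, right, baseline = render_stereo_pair(nerf)   # synthetic binocular views
        depth = stereo_depth(left, right, baseline)        # depth via stereo matching
        if prev_depth is not None:
            change = np.mean(np.abs(depth - prev_depth))
            print(f"iter {it}: mean depth change = {change:.4f}")
            if change < tol:      # negligible variation -> converged
                break
        prev_depth, depth_prior = depth, depth
    return nerf, depth
```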
Abstract:Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data. Specifically, we introduce a Semi-Policy Preference Optimization (SePPO) method. SePPO leverages previous checkpoints as reference models and uses them to generate on-policy reference samples, which replace the "losing images" in preference pairs. This approach allows us to optimize using only off-policy "winning images." Furthermore, we design a strategy for reference model selection that expands the exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether the reference samples are likely to be winning or losing images, allowing the model to selectively learn from the generated reference samples. This approach mitigates performance degradation caused by the uncertainty in reference sample quality. We validate SePPO across both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks. Code will be released at https://github.com/DwanZhang-AI/SePPO.
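The anchor-based idea can be illustrated with a toy DPO-style update in which the role of the generated reference sample flips depending on how its implicit reward compares to the off-policy winning image. The exact criterion and loss used in SePPO may differ; the formulation below is an assumed simplification with plain floats in place of model log-probabilities.

```python
# Illustrative sketch of an anchor-based preference update in the spirit of
# SePPO: a reference-model sample plays the role of either the "losing" or the
# "winning" image depending on how its implicit reward compares to an anchor.
import math

def seppo_style_loss(logp_win, logp_ref_sample, logp_win_ref, logp_ref_sample_ref,
                     beta=0.1):
    # Implicit rewards: policy log-prob minus reference-model log-prob.
    r_win = beta * (logp_win - logp_win_ref)
    r_ref = beta * (logp_ref_sample - logp_ref_sample_ref)

    # Anchor-based criterion (assumed form): if the reference sample scores
    # below the off-policy winning image, treat it as the loser in a DPO-style
    # pair; otherwise learn from it as an additional winner instead of
    # pushing it down.
    if r_ref < r_win:
        margin = r_win - r_ref    # standard preference pair (win vs. reference)
    else:
        margin = r_ref            # reference sample itself treated as a winner

    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # logistic loss on the margin

# Example: the policy prefers the winning image over the reference sample.
print(seppo_style_loss(logp_win=-10.0, logp_ref_sample=-12.0,
                       logp_win_ref=-11.0, logp_ref_sample_ref=-11.5))
```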
Abstract:Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have high potential to offer substantial assistance for diagnosis and treatment. Before such deployment, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Current benchmarks are often built upon specific academic literature, focus mainly on a single domain, and lack varying perceptual granularities. As a result, they face specific challenges, including limited clinical relevance, incomplete evaluations, and insufficient guidance for interactive LVLMs. To address these limitations, we developed GMAI-MMBench, the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple perceptual granularities. It is constructed from 285 datasets across 39 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format. Additionally, we implemented a lexical tree structure that allows users to customize evaluation tasks, accommodating various assessment needs and substantially supporting medical AI research and applications. We evaluated 50 LVLMs, and the results show that even the advanced GPT-4o achieves an accuracy of only 52%, indicating significant room for improvement. Moreover, we identified five key insufficiencies in current cutting-edge LVLMs that need to be addressed to advance the development of better medical applications. We believe that GMAI-MMBench will stimulate the community to build the next generation of LVLMs toward GMAI.
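The customizable evaluation enabled by the lexical tree can be pictured as a filter over VQA items along the benchmark's axes (modality, task, department, perceptual granularity). The field names and sample entries below are assumptions for illustration, not the benchmark's actual schema.

```python
# Minimal sketch of a lexical-tree-style selection over VQA items, mirroring
# the customizable evaluation idea in GMAI-MMBench. Field names and example
# records are illustrative only.
SAMPLES = [
    {"modality": "CT", "task": "disease diagnosis", "department": "oncology",
     "granularity": "image level", "question": "...", "options": ["A", "B"], "answer": "A"},
    {"modality": "fundus photography", "task": "attribute recognition",
     "department": "ophthalmology", "granularity": "lesion level",
     "question": "...", "options": ["A", "B"], "answer": "B"},
]

def select(samples, **criteria):
    """Return the VQA items matching every (axis, value) pair in criteria."""
    return [s for s in samples
            if all(s.get(axis) == value for axis, value in criteria.items())]

# Customize an evaluation run to CT-based diagnosis questions only.
subset = select(SAMPLES, modality="CT", task="disease diagnosis")
print(len(subset), "items selected")
```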
Abstract:In this paper, we propose a movable antenna (MA) enhanced scheme for a wireless powered mobile edge computing (WP-MEC) system, where a hybrid access point (HAP) equipped with multiple MAs first emits wireless energy to charge wireless devices (WDs) and then receives the offloaded tasks from the WDs for edge computing. The MAs deployed at the HAP enhance the spatial degrees of freedom (DoFs) by flexibly adjusting their positions within an available region, thereby improving the efficiency of both downlink wireless power transfer (WPT) and uplink task offloading. To balance performance enhancement against implementation complexity, we further propose three types of MA positioning configurations, i.e., dynamic MA positioning, semi-dynamic MA positioning, and static MA positioning. In addition, the non-linear power conversion of the energy harvesting (EH) circuits at the WDs and the finite computing capability of the edge server are taken into account. Our objective is to maximize the sum computational rate (SCR) by jointly optimizing the time allocation, positions of MAs, energy beamforming matrix, receive combining vectors, and offloading strategies of the WDs. To solve the resulting non-convex problems, efficient alternating optimization (AO) frameworks are proposed. Moreover, we propose a hybrid algorithm of particle swarm optimization with variable local search (PSO-VLS) to solve the sub-problem of MA positioning. Numerical results validate the superiority of exploiting MAs over fixed-position antennas (FPAs) for enhancing the SCR performance of WP-MEC systems.
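A PSO loop augmented with a local search around the global best is sketched below for the MA-positioning sub-problem. The objective is a toy placeholder for the sum computational rate, and the swarm size, inertia, and shrinking-radius local-search rule are illustrative assumptions rather than the paper's PSO-VLS settings.

```python
# Schematic PSO with a variable local search (PSO-VLS) step for optimizing
# movable-antenna positions inside a square region.
import numpy as np

def pso_vls(objective, n_antennas=4, region=1.0, n_particles=20, iters=100,
            w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    dim = 2 * n_antennas                                   # (x, y) per antenna
    pos = rng.uniform(0, region, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([objective(p) for p in pos])
    gbest = pbest[np.argmax(pbest_val)].copy()

    for t in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0, region)                # keep antennas in the region
        vals = np.array([objective(p) for p in pos])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[np.argmax(pbest_val)].copy()

        # Variable local search: perturb the global best with a shrinking radius.
        radius = 0.1 * region * (1 - t / iters)
        cand = np.clip(gbest + rng.normal(0, radius, dim), 0, region)
        if objective(cand) > objective(gbest):
            gbest = cand
    return gbest, objective(gbest)

# Toy objective standing in for the sum computational rate.
best, val = pso_vls(lambda p: -np.sum((p - 0.5) ** 2))
print("best positions:", best.round(3), "objective:", round(val, 4))
```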
Abstract:Wireless powered and backscattering mobile edge computing (WPB-MEC) network is a novel network paradigm for supplying energy and computing resources to wireless sensors (WSs). However, its performance is severely degraded by channel attenuation and by the unrealistic assumption of infinite computing capability at the hybrid access point (HAP). To address these issues, in this paper we propose a simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) aided scheme for boosting the performance of the WPB-MEC network under a finite computing capability constraint. Specifically, energy-constrained WSs can offload tasks to the HAP either actively or passively. In this process, the STAR-RIS is used to increase the harvested energy and improve the offloading efficiency by adapting its operating protocol. We then maximize the sum computational bits (SCBs) under the finite computing capability constraint. To handle the resulting challenges, we first derive closed-form results and then design a block coordinate descent (BCD) based algorithm that yields a near-optimal solution. Finally, simulation results confirm that our proposed scheme improves the SCBs by 9.9 times compared to the local-computing-only scheme.
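The overall solution structure can be sketched as a generic block coordinate descent loop: the variables are split into blocks and each block is updated with the others held fixed until the objective stops improving. The block names and update functions below are placeholders, not the paper's closed-form solutions.

```python
# Skeleton of a block coordinate descent (BCD) loop of the kind described above.
def bcd(blocks, init, objective, update, max_iters=50, tol=1e-4):
    x = dict(init)
    prev = objective(x)
    cur = prev
    for it in range(max_iters):
        for b in blocks:              # e.g. "time", "star_ris", "offloading"
            x[b] = update(b, x)       # optimize block b with the rest fixed
        cur = objective(x)
        if abs(cur - prev) < tol:     # objective has converged
            break
        prev = cur
    return x, cur

# Toy usage: maximize -(a-1)^2 - (b-2)^2 by alternating exact block updates.
x, val = bcd(
    blocks=["a", "b"],
    init={"a": 0.0, "b": 0.0},
    objective=lambda x: -(x["a"] - 1) ** 2 - (x["b"] - 2) ** 2,
    update=lambda b, x: 1.0 if b == "a" else 2.0,
)
print(x, val)
```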
Abstract:OpenAI's latest large vision-language model (LVLM), GPT-4V(ision), has piqued considerable interest for its potential in medical applications. Despite its promise, recent studies and internal reviews highlight its underperformance in specialized medical tasks. This paper explores the boundary of GPT-4V's capabilities in medicine, particularly in processing complex imaging data such as endoscopy, CT, and MRI scans. Leveraging open-source datasets, we assessed its foundational competencies and identified substantial areas for improvement. Our research emphasizes prompt engineering, an often-underutilized strategy for improving AI responsiveness. Through iterative testing, we refined the model's prompts, significantly improving its interpretative accuracy and relevance in medical imaging. From our comprehensive evaluations, we distilled 10 effective prompt engineering techniques, each fortifying GPT-4V's medical acumen. These methodical enhancements facilitate more reliable, precise, and clinically valuable insights from GPT-4V, advancing its operability in critical healthcare environments. Our findings are pivotal for those employing AI in medicine, providing clear, actionable guidance on harnessing GPT-4V's full diagnostic potential.
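As a concrete illustration of what such prompt engineering looks like, the snippet below assembles a structured medical-imaging prompt combining common ingredients (role assignment, modality context, step-by-step instruction, constrained output format). The wording is an assumption for illustration and is not quoted from the paper's ten techniques.

```python
# Illustrative structured prompt for medical image interpretation; the exact
# phrasing is an example, not one of the paper's prompts.
def build_prompt(modality: str, question: str) -> str:
    return (
        f"You are an experienced radiologist reviewing a {modality} image.\n"
        "First describe the visible anatomy, then list any abnormal findings, "
        "and only then answer the question.\n"
        f"Question: {question}\n"
        "Answer with a short impression followed by a confidence level "
        "(low / medium / high)."
    )

print(build_prompt("contrast-enhanced abdominal CT",
                   "Is there evidence of a focal liver lesion?"))
```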
Abstract:In this paper, we propose an active reconfigurable intelligent surface (RIS) enabled hybrid relaying scheme for a multi-antenna wireless powered communication network (WPCN), where the active RIS is employed to assist both wireless energy transfer (WET) from the power station (PS) to energy-constrained users and wireless information transmission (WIT) from the users to the receiving station (RS). For further performance enhancement, we employ both transmit beamforming at the PS and receive beamforming at the RS. We formulate a sum-rate maximization problem that jointly optimizes the RIS phase shifts and amplitude reflection coefficients for both the WET and the WIT, the transmit and receive beamforming vectors, and the network resource allocation. To solve this non-convex problem, we propose an efficient alternating optimization algorithm based on the linear minimum mean squared error criterion, semi-definite relaxation (SDR), and successive convex approximation. In particular, we prove that the SDR is tight. Simulation results demonstrate that our proposed scheme with 10 reflecting elements (REs) and 4 antennas achieves 17.78% and 415.48% performance gains over the single-antenna scheme with 10 REs and the passive-RIS scheme with 100 REs, respectively.
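To give a flavor of the SDR step in isolation, the sketch below solves a single simplified transmit-beamforming subproblem (maximize received power subject to a power budget) with the rank-one constraint dropped, then recovers a beamforming vector by eigendecomposition. The channel model, CVXPY formulation, and recovery step are standard textbook choices and stand apart from the paper's full joint design, which also couples RIS coefficients, receive beamforming, and resource allocation.

```python
# SDR sketch for one energy-beamforming subproblem: maximize tr(H W) s.t. a
# transmit power budget, with W = w w^H relaxed to any PSD matrix.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, P = 4, 1.0                                   # PS antennas, power budget
h = (rng.normal(size=(N, 1)) + 1j * rng.normal(size=(N, 1))) / np.sqrt(2)
H = h @ h.conj().T                              # effective channel covariance

W = cp.Variable((N, N), hermitian=True)
prob = cp.Problem(cp.Maximize(cp.real(cp.trace(H @ W))),
                  [W >> 0, cp.real(cp.trace(W)) <= P])
prob.solve()

# Recover a beamforming vector from the (here rank-one) relaxed solution.
eigval, eigvec = np.linalg.eigh(W.value)
w = np.sqrt(eigval[-1]) * eigvec[:, -1]
print("received power:", np.real(w.conj().T @ H @ w))
```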