Senior Member, IEEE
Abstract:Text-to-image (T2I) generation has made significant advances in recent years, but challenges still remain in the generation of perceptual artifacts, misalignment with complex prompts, and safety. The prevailing approach to address these issues involves collecting human feedback on generated images, training reward models to estimate human feedback, and then fine-tuning T2I models based on the reward models to align them with human preferences. However, while existing reward fine-tuning methods can produce images with higher rewards, they may change model behavior in unexpected ways. For example, fine-tuning for one quality aspect (e.g., safety) may degrade other aspects (e.g., prompt alignment), or may lead to reward hacking (e.g., finding a way to increase rewards without having the intended effect). In this paper, we propose Focus-N-Fix, a region-aware fine-tuning method that trains models to correct only previously problematic image regions. The resulting fine-tuned model generates images with the same high-level structure as the original model but shows significant improvements in regions where the original model was deficient in safety (over-sexualization and violence), plausibility, or other criteria. Our experiments demonstrate that Focus-N-Fix improves these localized quality aspects with little or no degradation to others and typically imperceptible changes in the rest of the image. Disclaimer: This paper contains images that may be overly sexual, violent, offensive, or harmful.
Abstract:With the rapid advancements in large language model (LLM) technology and the emergence of bioinformatics-specific language models (BioLMs), there is a growing need for a comprehensive analysis of the current landscape, computational characteristics, and diverse applications. This survey aims to address this need by providing a thorough review of BioLMs, focusing on their evolution, classification, and distinguishing features, alongside a detailed examination of training methodologies, datasets, and evaluation frameworks. We explore the wide-ranging applications of BioLMs in critical areas such as disease diagnosis, drug discovery, and vaccine development, highlighting their impact and transformative potential in bioinformatics. We identify key challenges and limitations inherent in BioLMs, including data privacy and security concerns, interpretability issues, biases in training data and model outputs, and domain adaptation complexities. Finally, we highlight emerging trends and future directions, offering valuable insights to guide researchers and clinicians toward advancing BioLMs for increasingly sophisticated biological and clinical applications.
Abstract:Enterprises possess a vast array of API assets scattered across various functions, forming the backbone of existing business processes. By leveraging these APIs as functional tools, enterprises can design diverse, scenario-specific agent applications, driven by on-premise function-calling models as the core engine. However, generic models often fail to meet enterprise requirements in terms of computational efficiency, output accuracy, and stability, necessitating scenario-specific adaptation. In this paper, we propose a training pipeline for function-calling capabilities tailored to real-world business scenarios. This pipeline includes the synthesis and augmentation of scenario-specific function-calling data, model fine-tuning, and performance evaluation and analysis. Using this pipeline, we generated 1,260 fully AI-generated samples and 1,035 augmented manually-labeled samples in digital HR agent scenario. The Qwen2.5-Coder-7B-Instruct model was employed as the base model and fine-tuned using the LoRA method on four GPUs with 24GB VRAM. Our fine-tuned model demonstrated outstanding performance in evaluations and practical applications, surpassing GPT-4 and GPT-4o in accuracy on the test set. These results validate the reliability of the proposed pipeline for training scenario-specific function-calling models.
Abstract:Recent advancements in text-to-video (T2V) generation have leveraged diffusion models to enhance the visual coherence of videos generated from textual descriptions. However, most research has primarily focused on object motion, with limited attention given to cinematic language in videos, which is crucial for cinematographers to convey emotion and narrative pacing. To address this limitation, we propose a threefold approach to enhance the ability of T2V models to generate controllable cinematic language. Specifically, we introduce a cinematic language dataset that encompasses shot framing, angle, and camera movement, enabling models to learn diverse cinematic styles. Building on this, to facilitate robust cinematic alignment evaluation, we present CameraCLIP, a model fine-tuned on the proposed dataset that excels in understanding complex cinematic language in generated videos and can further provide valuable guidance in the multi-shot composition process. Finally, we propose CLIPLoRA, a cost-guided dynamic LoRA composition method that facilitates smooth transitions and realistic blending of cinematic language by dynamically fusing multiple pre-trained cinematic LoRAs within a single video. Our experiments demonstrate that CameraCLIP outperforms existing models in assessing the alignment between cinematic language and video, achieving an R@1 score of 0.81. Additionally, CLIPLoRA improves the ability for multi-shot composition, potentially bridging the gap between automatically generated videos and those shot by professional cinematographers.
Abstract:Coronary artery disease poses a significant global health challenge, often necessitating percutaneous coronary intervention (PCI) with stent implantation. Assessing stent apposition holds pivotal importance in averting and identifying PCI complications that lead to in-stent restenosis. Here we proposed a novel three-dimensional (3D) distance-color-coded assessment (DccA)for PCI stent apposition via deep-learning-based 3D multi-object segmentation in intravascular optical coherence tomography (IV-OCT). Our proposed 3D DccA accurately segments 3D vessel lumens and stents in IV-OCT images, using a spatial matching network and dual-layer training with style transfer. It quantifies and maps stent-lumen distances into a 3D color space, facilitating 3D visual assessment of PCI stent apposition. Achieving over 95% segmentation precision, our proposed DccA enhances clinical evaluation of PCI stent deployment and supports personalized treatment planning.
Abstract:In the real world, a learning-enabled system usually undergoes multiple cycles of model development to enhance the system's ability to handle difficult or emerging tasks. This continual model development process raises a significant issue that the model development for acquiring new or improving existing capabilities may inadvertently lose capabilities of the old model, also known as catastrophic forgetting. Existing continual learning studies focus on mitigating catastrophic forgetting by trading off performance on previous tasks and new tasks to ensure good average performance. However, they are inadequate for many applications especially in safety-critical domains, as failure to strictly preserve the performance of the old model not only introduces safety risks and uncertainties but also imposes substantial expenses in the re-improving and re-validation of existing properties. To address this issue, we introduce model developmental safety as a guarantee of a learning system such that in the model development process the new model should strictly preserve the existing protected capabilities of the old model while improving its performance on target tasks. To ensure the model developmental safety, we present a safety-centric framework by formulating the model developmental safety as data-dependent constraints. Under this framework, we study how to develop a pretrained vision-language model (aka the CLIP model) for acquiring new capabilities or improving existing capabilities of image classification. We propose an efficient constrained optimization algorithm with theoretical guarantee and use its insights to finetune a CLIP model with task-dependent heads for promoting the model developmental safety. Our experiments on improving vision perception capabilities on autonomous driving and scene recognition datasets demonstrate the efficacy of the proposed approach.
Abstract:With the improvement in the quantity and quality of remote sensing images, content-based remote sensing object retrieval (CBRSOR) has become an increasingly important topic. However, existing CBRSOR methods neglect the utilization of global statistical information during both training and test stages, which leads to the overfitting of neural networks to simple sample pairs of samples during training and suboptimal metric performance. Inspired by the Neyman-Pearson theorem, we propose a generalized likelihood ratio test-based metric learning (GLRTML) approach, which can estimate the relative difficulty of sample pairs by incorporating global data distribution information during training and test phases. This guides the network to focus more on difficult samples during the training process, thereby encourages the network to learn more discriminative feature embeddings. In addition, GLRT is a more effective than traditional metric space due to the utilization of global data distribution information. Accurately estimating the distribution of embeddings is critical for GLRTML. However, in real-world applications, there is often a distribution shift between the training and target domains, which diminishes the effectiveness of directly using the distribution estimated on training data. To address this issue, we propose the clustering pseudo-labels-based fast parameter adaptation (CPLFPA) method. CPLFPA efficiently estimates the distribution of embeddings in the target domain by clustering target domain instances and re-estimating the distribution parameters for GLRTML. We reorganize datasets for CBRSOR tasks based on fine-grained ship remote sensing image slices (FGSRSI-23) and military aircraft recognition (MAR20) datasets. Extensive experiments on these datasets demonstrate the effectiveness of our proposed GLRTML and CPLFPA.
Abstract:Obstructive Sleep Apnea-Hypopnea Syndrome (OSAHS) is a sleep-related breathing disorder associated with significant morbidity and mortality worldwide. The gold standard for OSAHS diagnosis, polysomnography (PSG), faces challenges in popularization due to its high cost and complexity. Recently, radar has shown potential in detecting sleep apnea-hypopnea events (SAE) with the advantages of low cost and non-contact monitoring. However, existing studies, especially those using deep learning, employ segment-based classification approach for SAE detection, making the task of event quantity estimation difficult. Additionally, radar-based SAE detection is susceptible to interference from body movements and the environment. Oxygen saturation (SpO2) can offer valuable information about OSAHS, but it also has certain limitations and cannot be used alone for diagnosis. In this study, we propose a method using millimeter-wave radar and pulse oximeter to detect SAE, called ROSA. It fuses information from both sensors, and directly predicts the temporal localization of SAE. Experimental results demonstrate a high degree of consistency (ICC=0.9864) between AHI from ROSA and PSG. This study presents an effective method with low-load device for the diagnosis of OSAHS.
Abstract:Study Objectives: To evaluate the agreement between the millimeter-wave radar-based device and polysomnography (PSG) in diagnosis of obstructive sleep apnea (OSA) and classification of sleep stage in children. Methods: 281 children, aged 1 to 18 years, who underwent sleep monitoring between September and November 2023 at the Sleep Center of Beijing Children's Hospital, Capital Medical University, were recruited in the study. All enrolled children underwent sleep monitoring by PSG and the millimeter-wave radar-based device, QSA600, simultaneously. QSA600 recordings were automatically analyzed using a deep learning model meanwhile the PSG data was manually scored. Results: The Obstructive Apnea-Hypopnea Index (OAHI) obtained from QSA600 and PSG demonstrates a high level of agreement with an intraclass correlation coefficient of 0.945 (95% CI: 0.93 to 0.96). Bland-Altman analysis indicates that the mean difference of OAHI between QSA600 and PSG is -0.10 events/h (95% CI: -11.15 to 10.96). The deep learning model evaluated through cross-validation showed good sensitivity (81.8%, 84.3% and 89.7%) and specificity (90.5%, 95.3% and 97.1%) values for diagnosing children with OAHI>1, OAHI>5 and OAHI>10. The area under the receiver operating characteristic curve is 0.923, 0.955 and 0.988, respectively. For sleep stage classification, the model achieved Kappa coefficients of 0.854, 0.781, and 0.734, with corresponding overall accuracies of 95.0%, 84.8%, and 79.7% for Wake-sleep classification, Wake-REM-Light-Deep classification, and Wake-REM-N1-N2 N3 classification, respectively. Conclusions: QSA600 has demonstrated high agreement with PSG in diagnosing OSA and performing sleep staging in children. The device is portable, low-load and suitable for follow up and long-term pediatric sleep assessment.
Abstract:This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: -83.3% success rate in solving complex competitive programming problems, surpassing many human experts. -Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models. -100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions. -Advanced natural language inference capabilities across general and specialized domains like medicine. -Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis. -Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields. -Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills. -Effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.