Mingcheng Li

Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

Nov 05, 2024

Mingcheng Li, Dingkang Yang, Yang Liu, Shunli Wang, Jiawei Chen, Shuaibing Wang, Jinjie Wei, Yue Jiang, Qingyao Xu, Xiaolu Hou (+4 more)

Abstract: Multimodal Sentiment Analysis (MSA) is an important research area that aims to understand and recognize human sentiment through multiple modalities. The complementary information provided by multimodal fusion enables better sentiment analysis than a single modality alone. Nevertheless, in real-world applications, many unavoidable factors can leave modalities missing in unpredictable ways, hindering multimodal modeling and degrading the model's performance. To this end, we propose a Hierarchical Representation Learning Framework (HRLF) for the MSA task under uncertain missing modalities. Specifically, we propose a fine-grained representation factorization module that extracts valuable sentiment information by factorizing each modality into sentiment-relevant and modality-specific representations through crossmodal translation and sentiment semantic reconstruction. Moreover, a hierarchical mutual information maximization mechanism is introduced to incrementally maximize the mutual information between multi-scale representations to align and reconstruct the high-level semantics in the representations. Finally, we propose a hierarchical adversarial learning mechanism that further aligns and adapts the latent distribution of sentiment-relevant representations to produce robust joint multimodal representations. Comprehensive experiments on three datasets demonstrate that HRLF significantly improves MSA performance under uncertain missing modalities.

* Accepted by NeurIPS 2024

MedAide: Towards an Omni Medical Aide via Specialized LLM-based Multi-Agent Collaboration

Oct 17, 2024

Jinjie Wei, Dingkang Yang, Yanshu Li, Qingyao Xu, Zhaoyu Chen, Mingcheng Li, Yue Jiang, Xiaolu Hou, Lihua Zhang

Abstract: Large Language Model (LLM)-driven interactive systems currently show promise in healthcare domains. Despite their remarkable capabilities, LLMs typically lack personalized recommendations and diagnosis analysis in sophisticated medical applications, causing hallucinations and performance bottlenecks. To address these challenges, this paper proposes MedAide, an LLM-based omni medical multi-agent collaboration framework for specialized healthcare services. Specifically, MedAide first performs query rewriting through retrieval-augmented generation to accomplish accurate medical intent understanding. We then devise a contextual encoder to obtain intent prototype embeddings, which are used to recognize fine-grained intents by similarity matching. According to the intent relevance, the activated agents collaborate to provide integrated decision analysis. Extensive experiments are conducted on four medical benchmarks with composite intents. Experimental results from automated metrics and expert doctor evaluations show that MedAide outperforms current LLMs and improves their medical proficiency and strategic reasoning.

* LLM-based Multi-Agent Collaboration for Medical Applications
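
The similarity-matching step mentioned in the MedAide abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure, threshold, or embedding size is used, so cosine similarity, the `threshold` value, the intent names, and the toy three-dimensional vectors below are all assumptions made for illustration, not details of the authors' system.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_intents(query_emb, prototypes, threshold=0.5):
    """Return (intent, score) pairs scoring at least `threshold`, best match first.

    `prototypes` maps a hypothetical intent name to its prototype embedding.
    """
    scored = [(name, cosine(query_emb, proto)) for name, proto in prototypes.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy prototypes; a real system would obtain these from an encoder.
prototypes = {
    "diagnosis": [1.0, 0.0, 0.0],
    "medication": [0.0, 1.0, 0.0],
    "follow_up": [0.0, 0.0, 1.0],
}
query = [0.8, 0.6, 0.0]  # a query that mixes two intents
matches = match_intents(query, prototypes)
for name, score in matches:
    print(f"{name}: {score:.2f}")
```

Returning every intent above a threshold, rather than only the single best match, is one simple way to handle the composite intents the abstract mentions; each returned intent could then activate a corresponding agent.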

Improving Factuality in Large Language Models via Decoding-Time Hallucinatory and Truthful Comparators

Aug 22, 2024

Dingkang Yang, Dongling Xiao, Jinjie Wei, Mingcheng Li, Zhaoyu Chen, Ke Li, Lihua Zhang

Abstract: Despite their remarkable capabilities, Large Language Models (LLMs) are prone to generating responses that contradict verifiable facts, i.e., unfaithful hallucinated content. Existing efforts generally focus on optimizing model parameters or editing semantic representations, which compromise the internal factual knowledge of target LLMs. In addition, hallucinations typically exhibit multifaceted patterns in downstream tasks, limiting the model's holistic performance across tasks. In this paper, we propose a Comparator-driven Decoding-Time (CDT) framework to alleviate response hallucination. First, we construct hallucinatory and truthful comparators with multi-task fine-tuning samples. For these comparators, we present an instruction prototype-guided mixture of experts strategy to enhance their ability to capture different hallucination or truthfulness patterns in distinct task instructions. CDT constrains next-token predictions to factuality-robust distributions by contrasting the logit differences between the target LLMs and these comparators. Systematic experiments on multiple downstream tasks show that our framework can significantly improve model performance and response factuality.

* Hallucination Mitigation in LLMs
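
The idea of contrasting logits at decoding time can be sketched in a few lines. The abstract does not give the exact combination rule, so the additive form `target + alpha * (truthful - hallucinatory)`, the weight `alpha`, and the toy three-token vocabulary below are assumptions chosen for illustration; they should not be read as the CDT formulation itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrast_logits(target, truthful, hallucinatory, alpha=1.0):
    """Shift target logits toward the truthful comparator and away from the hallucinatory one.

    Assumed rule (not taken from the paper): target + alpha * (truthful - hallucinatory).
    """
    return [t + alpha * (tr - h) for t, tr, h in zip(target, truthful, hallucinatory)]


# Toy next-token logits over a three-token vocabulary.
target = [2.0, 2.0, 0.0]          # the target model is undecided between tokens 0 and 1
truthful = [2.5, 1.0, 0.0]        # the truthful comparator prefers token 0
hallucinatory = [1.0, 2.5, 0.0]   # the hallucinatory comparator prefers token 1

adjusted = contrast_logits(target, truthful, hallucinatory)
probs = softmax(adjusted)
best = max(range(len(probs)), key=probs.__getitem__)
print(adjusted, best)
```

In this toy case the tie in the target logits is broken in favour of the token the truthful comparator prefers. The target model's parameters are never modified, which matches the decoding-time framing in the abstract.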

HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction

Aug 17, 2024

Xiao Zhao, Bo Chen, Mingyang Sun, Dingkang Yang, Youxing Wang, Xukun Zhang, Mingcheng Li, Dongliang Kou, Xiaoyi Wei, Lihua Zhang

Abstract: Vision-based 3D semantic scene completion (SSC) describes autonomous driving scenes through 3D volume representations. However, the occlusion of invisible voxels by scene surfaces poses challenges to current SSC methods in hallucinating refined 3D geometry. This paper proposes HybridOcc, a hybrid 3D volume query proposal method generated by a Transformer framework and a NeRF representation and refined in a coarse-to-fine SSC prediction framework. HybridOcc aggregates contextual features through the Transformer paradigm based on hybrid query proposals while combining them with a NeRF representation to obtain depth supervision. The Transformer branch contains multiple scales and uses spatial cross-attention for 2D to 3D transformation. The newly designed NeRF branch implicitly infers scene occupancy through volume rendering, including visible and invisible voxels, and explicitly captures scene depth rather than generating RGB color. Furthermore, we present an occupancy-aware ray sampling method to orient the SSC task instead of focusing on the scene surface, further improving the overall performance. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate the effectiveness of HybridOcc on the SSC task.

* Accepted to IEEE RAL

MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation

Aug 17, 2024

Xiao Zhao, Xukun Zhang, Dingkang Yang, Mingyang Sun, Mingcheng Li, Shunli Wang, Lihua Zhang

Abstract: Accurate and robust multimodal multi-task perception is crucial for modern autonomous driving systems. However, current multimodal perception research follows independent paradigms designed for specific perception tasks, leading to a lack of complementary learning among tasks and decreased performance in multi-task learning (MTL) due to joint training. In this paper, we propose MaskBEV, a masked attention-based MTL paradigm that unifies 3D object detection and bird's eye view (BEV) map segmentation. MaskBEV introduces a task-agnostic Transformer decoder to process these diverse tasks, enabling MTL to be completed in a unified decoder without requiring additional design of specific task heads. To fully exploit the complementary information between BEV map segmentation and 3D object detection tasks in BEV space, we propose spatial modulation and scene-level context aggregation strategies. These strategies consider the inherent dependencies between BEV segmentation and 3D detection, naturally boosting MTL performance. Extensive experiments on the nuScenes dataset show that, compared with previous state-of-the-art MTL methods, MaskBEV achieves a 1.3 NDS improvement in 3D object detection and a 2.7 mIoU improvement in BEV map segmentation, while also offering slightly faster inference.

* Accepted to ACM MM 2024

Faster Diffusion Action Segmentation

Aug 04, 2024

Shuaibing Wang, Shunli Wang, Mingcheng Li, Dingkang Yang, Haopeng Kuang, Ziyun Qian, Lihua Zhang

Abstract: Temporal Action Segmentation (TAS) is an essential task in video analysis, aiming to segment and classify continuous frames into distinct action segments. However, the ambiguous boundaries between actions pose a significant challenge for high-precision segmentation. Recent advances in diffusion models have demonstrated substantial success in TAS tasks due to their stable training process and high-quality generation capabilities. However, the many sampling steps required by diffusion models pose a substantial computational burden, limiting their practicality in real-time applications. Additionally, most related works utilize Transformer-based encoder architectures. Although these architectures excel at capturing long-range dependencies, they incur high computational costs and face feature-smoothing issues when processing long video sequences. To address these challenges, we propose EffiDiffAct, an efficient and high-performance TAS algorithm. Specifically, we develop a lightweight temporal feature encoder that reduces computational overhead and mitigates the rank collapse phenomenon associated with traditional self-attention mechanisms. Furthermore, we introduce an adaptive skip strategy that allows for dynamic adjustment of timestep lengths based on computed similarity metrics during inference, thereby further enhancing computational efficiency. Comprehensive experiments on the 50Salads, Breakfast, and GTEA datasets demonstrate the effectiveness of the proposed algorithm.

* 25 pages, 6 figures

Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations

Jul 06, 2024

Dingkang Yang, Mingcheng Li, Linhao Qu, Kun Yang, Peng Zhai, Song Wang, Lihua Zhang

Abstract: Understanding human intentions (e.g., emotions) from videos has received considerable attention recently. Video streams generally constitute a blend of temporal data stemming from distinct modalities, including natural language, facial expressions, and auditory cues. Despite the impressive advancements of previous works via attention-based paradigms, the inherent temporal asynchrony and modality heterogeneity challenges remain in multimodal sequence fusion, causing performance bottlenecks. To tackle these issues, we propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations (MEA) to refine multimodal features and leverage the complementarity across distinct modalities. On the one hand, MEA introduces a predictive self-attention module to capture reliable context dynamics within modalities and reinforce unique features over the modality-exclusive spaces. On the other hand, a hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities over the modality-agnostic space. Meanwhile, a double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner. Finally, we propose a decoupled graph fusion mechanism to enhance knowledge exchange across heterogeneous modalities and learn robust multimodal representations for downstream tasks. Extensive experiments are conducted on three multimodal datasets with asynchronous sequences. Systematic analyses show the necessity of our approach.

* TCSVT 2024

MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More

Jun 18, 2024

Yue Jiang, Jiawei Chen, Dingkang Yang, Mingcheng Li, Shunli Wang, Tong Wu, Ke Li, Lihua Zhang

Abstract: When Large Vision Language Models (LVLMs) are applied to multimodal medical generative tasks, they suffer from significant model hallucination issues. This severely impairs the model's generative accuracy, making it challenging for LVLMs to be deployed in real-world medical scenarios to assist doctors in diagnosis. Enhancing the training data for downstream medical generative tasks is an effective way to address model hallucination. However, the limited availability of training data in the medical field and privacy concerns greatly hinder the model's accuracy and generalization capabilities. In this paper, we introduce a method that mimics human cognitive processes to construct fine-grained instruction pairs and apply the concept of chain-of-thought (CoT) from inference scenarios to training scenarios, thereby proposing a method called MedThink. Our experiments on various LVLMs demonstrate that our novel data construction method tailored for the medical domain significantly improves the model's performance in medical image report generation tasks and substantially mitigates hallucinations. All resources of this work will be released soon.

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Jun 14, 2024

Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, Lihua Zhang

Abstract: Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations, a significant concern in high-stakes medical contexts where the margin for error is minimal. However, there are currently no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection, which employs multitask training for hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impacts compared to traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work will be released soon.

PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

May 29, 2024

Dingkang Yang, Jinjie Wei, Dongling Xiao, Shunli Wang, Tong Wu, Gang Li, Mingcheng Li, Shuaibing Wang, Jiawei Chen, Yue Jiang (+4 more)

Abstract: Developing intelligent pediatric consultation systems offers promising prospects for improving diagnostic efficiency, especially in China, where healthcare resources are scarce. Despite recent advances in Large Language Models (LLMs) for Chinese medicine, their performance is sub-optimal in pediatric applications due to inadequate instruction data and vulnerable training procedures. To address the above issues, this paper builds PedCorpus, a high-quality dataset of over 300,000 multi-task instructions from pediatric textbooks, guidelines, and knowledge graph resources to fulfil diverse diagnostic demands. Building on the well-designed PedCorpus, we propose PediatricsGPT, the first Chinese pediatric LLM assistant built on a systematic and robust training pipeline. In the continuous pre-training phase, we introduce a hybrid instruction pre-training mechanism to mitigate the internal-injected knowledge inconsistency of LLMs for medical domain adaptation. Next, full-parameter Supervised Fine-Tuning (SFT) is utilized to incorporate the general medical knowledge schema into the models. After that, we devise a direct following preference optimization to enhance the generation of pediatrician-like humanistic responses. In the parameter-efficient secondary SFT phase, a mixture of universal-specific experts strategy is presented to resolve the competency conflict between medical generalist and pediatric expertise mastery. Extensive results based on the metrics, GPT-4, and doctor evaluations on distinct doctor downstream tasks show that PediatricsGPT consistently outperforms previous Chinese medical LLMs. Our model and dataset will be open-sourced for community development.

* A Technical Report on a Powerful Chinese Medical Large Language Model
