Abstract:Ultra High Resolution (UHR) remote sensing imagery (RSI) (e.g. 100,000 $\times$ 100,000 pixels or more) poses a significant challenge for current Remote Sensing Multimodal Large Language Models (RSMLLMs). If choose to resize the UHR image to standard input image size, the extensive spatial and contextual information that UHR images contain will be neglected. Otherwise, the original size of these images often exceeds the token limits of standard RSMLLMs, making it difficult to process the entire image and capture long-range dependencies to answer the query based on the abundant visual context. In this paper, we introduce ImageRAG for RS, a training-free framework to address the complexities of analyzing UHR remote sensing imagery. By transforming UHR remote sensing image analysis task to image's long context selection task, we design an innovative image contextual retrieval mechanism based on the Retrieval-Augmented Generation (RAG) technique, denoted as ImageRAG. ImageRAG's core innovation lies in its ability to selectively retrieve and focus on the most relevant portions of the UHR image as visual contexts that pertain to a given query. Fast path and slow path are proposed in this framework to handle this task efficiently and effectively. ImageRAG allows RSMLLMs to manage extensive context and spatial information from UHR RSI, ensuring the analysis is both accurate and efficient.
Abstract:Large language models (LLMs), with advanced linguistic capabilities, have been employed in reranking tasks through a sequence-to-sequence approach. In this paradigm, multiple passages are reranked in a listwise manner and a textual reranked permutation is generated. However, due to the limited context window of LLMs, this reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets. This not only increases computational costs but also restricts the LLM from fully capturing all the comparison information for all candidates. To address these challenges, we propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking. To achieve it, we first propose the relevance-aware listwise reranking framework, which incorporates explicit list-view relevance scores to improve reranking efficiency and enable global comparison across the entire candidate set. Second, to ensure the comparability of the computed scores, we propose self-calibrated training that uses point-view relevance assessments generated internally by the LLM itself to calibrate the list-view relevance assessments. Extensive experiments and comprehensive analysis on the BEIR benchmark and TREC Deep Learning Tracks demonstrate the effectiveness and efficiency of our proposed method.
Abstract:Dimensionality reduction (DR) plays a crucial role in various fields, including data engineering and visualization, by simplifying complex datasets while retaining essential information. However, the challenge of balancing DR accuracy and interpretability remains crucial, particularly for users dealing with high-dimensional data. Traditional DR methods often face a trade-off between precision and transparency, where optimizing for performance can lead to reduced interpretability, and vice versa. This limitation is especially prominent in real-world applications such as image, tabular, and text data analysis, where both accuracy and interpretability are critical. To address these challenges, this work introduces the MOE-based Hyperbolic Interpretable Deep Manifold Transformation (DMT-HI). The proposed approach combines hyperbolic embeddings, which effectively capture complex hierarchical structures, with Mixture of Experts (MOE) models, which dynamically allocate tasks based on input features. DMT-HI enhances DR accuracy by leveraging hyperbolic embeddings to represent the hierarchical nature of data, while also improving interpretability by explicitly linking input data, embedding outcomes, and key features through the MOE structure. Extensive experiments demonstrate that DMT-HI consistently achieves superior performance in both DR accuracy and model interpretability, making it a robust solution for complex data analysis. The code is available at \url{https://github.com/zangzelin/code_dmthi}.
Abstract:In certain brain volumetric studies, synthetic T1-weighted magnetization-prepared rapid gradient-echo (MP-RAGE) contrast, derived from quantitative T1 MRI (T1-qMRI), proves highly valuable due to its clear white/gray matter boundaries for brain segmentation. However, generating synthetic MP-RAGE (syn-MP-RAGE) typically requires pairs of high-quality, artifact-free, multi-modality inputs, which can be challenging in retrospective studies, where missing or corrupted data is common. To overcome this limitation, our research explores the feasibility of employing a deep learning-based approach to synthesize syn-MP-RAGE contrast directly from a single channel turbo spin-echo (TSE) input, renowned for its resistance to metal artifacts. We evaluated this deep learning-based synthetic MP-RAGE (DL-Syn-MPR) on 31 non-artifact and 11 metal-artifact subjects. The segmentation results, measured by the Dice Similarity Coefficient (DSC), consistently achieved high agreement (DSC values above 0.83), indicating a strong correlation with reference segmentations, with lower input requirements. Also, no significant difference in segmentation performance was observed between the artifact and non-artifact groups.
Abstract:The vision-language modeling capability of multi-modal large language models has attracted wide attention from the community. However, in medical domain, radiology report generation using vision-language models still faces significant challenges due to the imbalanced data distribution caused by numerous negated descriptions in radiology reports and issues such as rough alignment between radiology reports and radiography. In this paper, we propose a truthful radiology report generation framework, namely TRRG, based on stage-wise training for cross-modal disease clue injection into large language models. In pre-training stage, During the pre-training phase, contrastive learning is employed to enhance the ability of visual encoder to perceive fine-grained disease details. In fine-tuning stage, the clue injection module we proposed significantly enhances the disease-oriented perception capability of the large language model by effectively incorporating the robust zero-shot disease perception. Finally, through the cross-modal clue interaction module, our model effectively achieves the multi-granular interaction of visual embeddings and an arbitrary number of disease clue embeddings. This significantly enhances the report generation capability and clinical effectiveness of multi-modal large language models in the field of radiology reportgeneration. Experimental results demonstrate that our proposed pre-training and fine-tuning framework achieves state-of-the-art performance in radiology report generation on datasets such as IU-Xray and MIMIC-CXR. Further analysis indicates that our proposed method can effectively enhance the model to perceive diseases and improve its clinical effectiveness.
Abstract:In current web environment, fake news spreads rapidly across online social networks, posing serious threats to society. Existing multimodal fake news detection (MFND) methods can be classified into knowledge-based and semantic-based approaches. However, these methods are overly dependent on human expertise and feedback, lacking flexibility. To address this challenge, we propose a Dynamic Analysis and Adaptive Discriminator (DAAD) approach for fake news detection. For knowledge-based methods, we introduce the Monte Carlo Tree Search (MCTS) algorithm to leverage the self-reflective capabilities of large language models (LLMs) for prompt optimization, providing richer, domain-specific details and guidance to the LLMs, while enabling more flexible integration of LLM comment on news content. For semantic-based methods, we define four typical deceit patterns: emotional exaggeration, logical inconsistency, image manipulation, and semantic inconsistency, to reveal the mechanisms behind fake news creation. To detect these patterns, we carefully design four discriminators and expand them in depth and breadth, using the soft-routing mechanism to explore optimal detection models. Experimental results on three real-world datasets demonstrate the superiority of our approach. The code will be available at: https://github.com/SuXinqi/DAAD.
Abstract:The application of the Multi-modal Large Language Models (MLLMs) in medical clinical scenarios remains underexplored. Previous benchmarks only focus on the capacity of the MLLMs in medical visual question-answering (VQA) or report generation and fail to assess the performance of the MLLMs on complex clinical multi-modal tasks. In this paper, we propose a novel Medical Personalized Multi-modal Consultation (Med-PMC) paradigm to evaluate the clinical capacity of the MLLMs. Med-PMC builds a simulated clinical environment where the MLLMs are required to interact with a patient simulator to complete the multi-modal information-gathering and decision-making task. Specifically, the patient simulator is decorated with personalized actors to simulate diverse patients in real scenarios. We conduct extensive experiments to access 12 types of MLLMs, providing a comprehensive view of the MLLMs' clinical performance. We found that current MLLMs fail to gather multimodal information and show potential bias in the decision-making task when consulted with the personalized patient simulators. Further analysis demonstrates the effectiveness of Med-PMC, showing the potential to guide the development of robust and reliable clinical MLLMs. Code and data are available at https://github.com/LiuHC0428/Med-PMC.
Abstract:Language, as an information medium created by advanced organisms, has always been a concern of neuroscience regarding how it is represented in the brain. Decoding linguistic representations in the evoked brain has shown groundbreaking achievements, thanks to the rapid improvement of neuroimaging, medical technology, life sciences and artificial intelligence. In this work, we present a taxonomy of brain-to-language decoding of both textual and speech formats. This work integrates two types of research: neuroscience focusing on language understanding and deep learning-based brain decoding. Generating discernible language information from brain activity could not only help those with limited articulation, especially amyotrophic lateral sclerosis (ALS) patients but also open up a new way for the next generation's brain-computer interface (BCI). This article will help brain scientists and deep-learning researchers to gain a bird's eye view of fine-grained language perception, and thus facilitate their further investigation and research of neural process and language decoding.
Abstract:Predicting influencers' views and public sentiment on social media is crucial for anticipating societal trends and guiding strategic responses. This study introduces a novel computational framework to predict opinion leaders' perspectives and the emotive reactions of the populace, addressing the inherent challenges posed by the unstructured, context-sensitive, and heterogeneous nature of online communication. Our research introduces an innovative module that starts with the automatic 5W1H (Where, Who, When, What, Why, and How) questions formulation engine, tailored to emerging news stories and trending topics. We then build a total of 60 anonymous opinion leader agents in six domains and realize the views generation based on an enhanced large language model (LLM) coupled with retrieval-augmented generation (RAG). Subsequently, we synthesize the potential views of opinion leaders and predicted the emotional responses to different events. The efficacy of our automated 5W1H module is corroborated by an average GPT-4 score of 8.83/10, indicative of high fidelity. The influencer agents exhibit a consistent performance, achieving an average GPT-4 rating of 6.85/10 across evaluative metrics. Utilizing the 'Russia-Ukraine War' as a case study, our methodology accurately foresees key influencers' perspectives and aligns emotional predictions with real-world sentiment trends in various domains.
Abstract:To facilitate the research on large language models (LLMs), this paper presents a comprehensive and unified library, LLMBox, to ease the development, use, and evaluation of LLMs. This library is featured with three main merits: (1) a unified data interface that supports the flexible implementation of various training strategies, (2) a comprehensive evaluation that covers extensive tasks, datasets, and models, and (3) more practical consideration, especially on user-friendliness and efficiency. With our library, users can easily reproduce existing methods, train new models, and conduct comprehensive performance comparisons. To rigorously test LLMBox, we conduct extensive experiments in a diverse coverage of evaluation settings, and experimental results demonstrate the effectiveness and efficiency of our library in supporting various implementations related to LLMs. The detailed introduction and usage guidance can be found at https://github.com/RUCAIBox/LLMBox.