Abstract:Artificial intelligence (AI) is revolutionizing scientific research, particularly in computational materials science, by enabling more accurate and efficient simulations. Machine learning force fields (MLFFs) have emerged as powerful tools for molecular dynamics (MD) simulations, potentially offering quantum-mechanical accuracy with the efficiency of classical MD. This Perspective evaluates the viability of universal MLFFs for simulating complex materials systems from the standpoint of a potential practitioner. Using the temperature-driven ferroelectric-paraelectric phase transition of PbTiO$_3$ as a benchmark, we assess leading universal force fields, including CHGNet, MACE, M3GNet, and GPTFF, alongside specialized models like UniPero. While universal MLFFs trained on PBE-derived datasets perform well in predicting equilibrium properties, they largely fail to capture realistic finite-temperature phase transitions under constant-pressure MD, often exhibiting unphysical instabilities. These shortcomings stem from inherited biases in exchange-correlation functionals and limited generalization to anharmonic interactions governing dynamic behavior. However, fine-tuning universal models or employing system-specific MLFFs like UniPero successfully restores predictive accuracy. We advocates for hybrid approaches combining universal pretraining with targeted optimization, improved error quantification frameworks, and community-driven benchmarks to advance MLFFs as robust tools for computational materials discovery.
Abstract:Recent advancements in pre-trained Vision-Language Models (VLMs) have highlighted the significant potential of prompt tuning for adapting these models to a wide range of downstream tasks. However, existing prompt tuning methods typically map an image to a single representation, limiting the model's ability to capture the diverse ways an image can be described. To address this limitation, we investigate the impact of visual prompts on the model's generalization capability and introduce a novel method termed Multi-Representation Guided Prompt Tuning (MePT). Specifically, MePT employs a three-branch framework that focuses on diverse salient regions, uncovering the inherent knowledge within images which is crucial for robust generalization. Further, we employ efficient self-ensemble techniques to integrate these versatile image representations, allowing MePT to learn all conditional, marginal, and fine-grained distributions effectively. We validate the effectiveness of MePT through extensive experiments, demonstrating significant improvements on both base-to-novel class prediction and domain generalization tasks.
Abstract:Existing Large Vision-Language Models (LVLMs) primarily align image features of vision encoder with Large Language Models (LLMs) to leverage their superior text generation capabilities. However, the scale disparity between vision encoder and language model may led to LLMs assuming a predominant role in multi-modal comprehension. This imbalance in LVLMs may result in the instances of hallucinatory. Concretely, LVLMs may generate consistent descriptions with or without visual input, indicating that certain outputs are influenced solely by context text. We refer to this phenomenon as "text inertia." To counteract this issue, we introduce a training-free algorithm to find an equilibrium point between image comprehension and language inference. Specifically, we adaptively involve adjusting and amplifying the attention weights assigned to image tokens, thereby granting greater prominence to visual elements. Meanwhile, we subtract the logits of multi-modal inputs from ones of pure text input, which can help LVLMs be not biased towards LLMs. By enhancing images tokens and reducing the stubborn output of LLM, we can let LVLM pay more attention to images, towards alleviating text inertia and reducing the hallucination in LVLMs. Our extensive experiments shows that this method substantially reduces the frequency of hallucinatory outputs in various LVLMs in terms of different metrics. Project page is available at https://lalbj.github.io/projects/PAI/.
Abstract:Natural language interfaces (NLIs) enable users to flexibly specify analytical intentions in data visualization. However, diagnosing the visualization results without understanding the underlying generation process is challenging. Our research explores how to provide explanations for NLIs to help users locate the problems and further revise the queries. We present XNLI, an explainable NLI system for visual data analysis. The system introduces a Provenance Generator to reveal the detailed process of visual transformations, a suite of interactive widgets to support error adjustments, and a Hint Generator to provide query revision hints based on the analysis of user queries and interactions. Two usage scenarios of XNLI and a user study verify the effectiveness and usability of the system. Results suggest that XNLI can significantly enhance task accuracy without interrupting the NLI-based analysis process.