Sameera Horawalavithana

Surprisingly Fragile: Assessing and Addressing Prompt Instability in Multimodal Foundation Models

Aug 26, 2024

Ian Stewart, Sameera Horawalavithana, Brendan Kennedy, Sai Munikoti, Karl Pazdernik

Abstract: Multimodal foundation models (MFMs) such as OFASys show the potential to unlock analysis of complex data such as images, videos, and audio data via text prompts alone. However, their performance may suffer in the face of text input that differs even slightly from their training distribution, which is surprising considering the use of modality-specific data to "ground" the text input. This study demonstrates that prompt instability is a major concern for MFMs, leading to a consistent drop in performance across all modalities, but that instability can be mitigated with additional training with augmented data. We evaluate several methods for grounded prompt perturbation, where we generate perturbations and filter based on similarity to text and/or modality data. After re-training the models on the augmented data, we find improved accuracy and more stable performance on the perturbed test data regardless of perturbation condition, suggesting that the data augmentation strategy helps the models handle domain shifts more effectively. In error analysis, we find consistent patterns of performance improvement across domains, suggesting that retraining on prompt perturbations tends to help general reasoning capabilities in MFMs.

* in submission

PermitQA: A Benchmark for Retrieval Augmented Generation in Wind Siting and Permitting domain

Aug 21, 2024

Rounak Meyur, Hung Phan, Sridevi Wagle, Jan Strube, Mahantesh Halappanavar, Sameera Horawalavithana, Anurag Acharya, Sai Munikoti

Abstract: In the rapidly evolving landscape of Natural Language Processing (NLP) and text generation, the emergence of Retrieval Augmented Generation (RAG) presents a promising avenue for improving the quality and reliability of generated text by leveraging information retrieved from a user-specified database. Benchmarking is essential to evaluate and compare the performance of different RAG configurations in terms of retriever and generator, providing insights into their effectiveness, scalability, and suitability for specific domains and applications. In this paper, we present a comprehensive framework to generate a domain-relevant RAG benchmark. Our framework is based on automatic question-answer generation with Human (domain experts)-AI Large Language Model (LLM) teaming. As a case study, we demonstrate the framework by introducing PermitQA, a first-of-its-kind benchmark on the wind siting and permitting domain, which comprises multiple scientific documents/reports related to the environmental impact of wind energy projects. Our framework systematically evaluates RAG performance using diverse metrics and multiple question types with varying complexity levels. We also demonstrate the performance of different models on our benchmark.

RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension

Jul 10, 2024

Hung Phan, Anurag Acharya, Sarthak Chaturvedi, Shivam Sharma, Mike Parker, Dan Nally, Ali Jannesari, Karl Pazdernik, Mahantesh Halappanavar, Sai Munikoti (+1 more)

Abstract: Large Language Models (LLMs) have been applied to many research problems across various domains. One of the applications of LLMs is providing question-answering systems that cater to users from different fields. The effectiveness of LLM-based question-answering systems has already been established at an acceptable level for users posing questions in popular and public domains such as trivia and literature. However, it has not often been established in niche domains that traditionally require specialized expertise. To this end, we construct the NEPAQuAD1.0 benchmark to evaluate the performance of three frontier LLMs -- Claude Sonnet, Gemini, and GPT-4 -- when answering questions originating from Environmental Impact Statements prepared by U.S. federal government agencies in accordance with the National Environmental Policy Act (NEPA). We specifically measure the ability of LLMs to understand the nuances of legal, technical, and compliance-related information present in NEPA documents in different contextual scenarios. For example, we test the LLMs' internal prior NEPA knowledge by providing questions without any context, and we assess how LLMs synthesize the contextual information present in long NEPA documents to facilitate the question-answering task. We compare the performance of long-context LLMs and RAG-powered models in handling different types of questions (e.g., problem-solving, divergent). Our results suggest that RAG-powered models significantly outperform the long-context models in answer accuracy regardless of the choice of frontier LLM. Our further analysis reveals that many models perform better when answering closed questions than divergent and problem-solving questions.

* 14 pages

Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities

Jun 08, 2024

Sai Munikoti, Ian Stewart, Sameera Horawalavithana, Henry Kvinge, Tegan Emerson, Sandra E Thompson, Karl Pazdernik

Abstract: Multimodal models are expected to be a critical component of future advances in artificial intelligence. This field is starting to grow rapidly with a surge of new design elements motivated by the success of foundation models in natural language processing (NLP) and vision. It is widely hoped that further extending the foundation models to multiple modalities (e.g., text, image, video, sensor, time series, graph, etc.) will ultimately lead to generalist multimodal models, i.e., one model across different data modalities and tasks. However, there is little research that systematically analyzes recent multimodal models (particularly the ones that work beyond text and vision) with respect to the underlying architecture proposed. Therefore, this work provides a fresh perspective on generalist multimodal models (GMMs) via a novel taxonomy specific to architecture and training configuration. This includes factors such as Unifiability, Modularity, and Adaptability that are pertinent and essential to the wide adoption and application of GMMs. The review further highlights key challenges and prospects for the field and guides researchers toward new advancements.

* 25 pages, 3 figures, 5 tables

ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for Interdisciplinary Science

Nov 21, 2023

Sai Munikoti, Anurag Acharya, Sridevi Wagle, Sameera Horawalavithana

Abstract: Large language models record impressive performance on many natural language processing tasks. However, their knowledge capacity is limited to the pretraining corpus. Retrieval augmentation offers an effective solution by retrieving context from external knowledge sources to complement the language model. However, existing retrieval augmentation techniques ignore the structural relationships between these documents. Furthermore, retrieval models are not explored much in scientific tasks, especially in regard to the faithfulness of retrieved documents. In this paper, we propose a novel structure-aware retrieval augmented language model that accommodates document structure during retrieval augmentation. We create a heterogeneous document graph capturing multiple types of relationships (e.g., citation, co-authorship, etc.) that connect documents from more than 15 scientific disciplines (e.g., Physics, Medicine, Chemistry, etc.). We train a graph neural network on the curated document graph to act as a structural encoder for the corresponding passages retrieved during the model pretraining. In particular, along with text embeddings of the retrieved passages, we obtain structural embeddings of the documents (passages) and fuse them together before feeding them to the language model. We evaluate our model extensively on various scientific benchmarks that include science question-answering and scientific document classification tasks. Experimental results demonstrate that structure-aware retrieval yields more coherent, faithful, and contextually relevant passages, while showing comparable overall accuracy.

Empirical evaluation of Uncertainty Quantification in Retrieval-Augmented Language Models for Science

Nov 15, 2023

Sridevi Wagle, Sai Munikoti, Anurag Acharya, Sara Smith, Sameera Horawalavithana

Abstract: Large language models (LLMs) have shown remarkable achievements in natural language processing tasks, producing high-quality outputs. However, LLMs still exhibit limitations, including the generation of factually incorrect information. In safety-critical applications, it is important to assess the confidence of LLM-generated content to make informed decisions. Retrieval Augmented Language Models (RALMs) are a relatively new area of research in NLP. RALMs offer potential benefits for scientific NLP tasks, as retrieved documents can serve as evidence to support model-generated content. This inclusion of evidence enhances trustworthiness, as users can verify and explore the retrieved documents to validate model outputs. Quantifying uncertainty in RALM generations further improves trustworthiness, with retrieved text and confidence scores contributing to a comprehensive and reliable model for scientific applications. However, there is limited to no research on uncertainty quantification (UQ) for RALMs, particularly in scientific contexts. This study aims to address this gap by conducting a comprehensive evaluation of UQ in RALMs, focusing on scientific tasks. This research investigates how uncertainty scores vary when scientific knowledge is incorporated as pretraining and retrieval data and explores the relationship between uncertainty scores and the accuracy of model-generated outputs. We observe that an existing RALM finetuned with scientific knowledge as the retrieval data tends to be more confident in generating predictions compared to the model pretrained only with scientific knowledge. We also found that RALMs are overconfident in their predictions, making inaccurate predictions more confidently than accurate ones. Scientific knowledge provided either as pretraining or retrieval corpus does not help alleviate this issue. We released our code, data and dashboards at https://github.com/pnnl/EXPERT2.
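
The overconfidence finding in this abstract (inaccurate predictions made more confidently than accurate ones) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `confidence_by_correctness` is hypothetical, and the confidence scores and correctness flags below are made-up placeholder values, not results from the paper.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def confidence_by_correctness(
    confidences: Sequence[float], correct: Sequence[bool]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (mean confidence on correct, mean confidence on incorrect).

    A group with no members yields None instead of a mean.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    return (mean(right) if right else None, mean(wrong) if wrong else None)


# Placeholder data (assumption): four predictions with a confidence score each.
scores = [0.9, 0.6, 0.95, 0.7]
is_correct = [False, True, False, True]

conf_right, conf_wrong = confidence_by_correctness(scores, is_correct)
# A model is overconfident in the abstract's sense when its wrong answers
# carry higher average confidence than its right answers.
overconfident = conf_wrong is not None and conf_right is not None and conf_wrong > conf_right
print(conf_right, conf_wrong, overconfident)
```

Comparing the two group means is only a coarse check; a fuller analysis would typically bin predictions by confidence and compare each bin's accuracy to its average confidence.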

Evaluating the Effectiveness of Retrieval-Augmented Large Language Models in Scientific Document Reasoning

Nov 07, 2023

Sai Munikoti, Anurag Acharya, Sridevi Wagle, Sameera Horawalavithana

Abstract: Despite the dramatic progress in Large Language Model (LLM) development, LLMs often provide seemingly plausible but not factual information, often referred to as hallucinations. Retrieval-augmented LLMs provide a non-parametric approach to solve these issues by retrieving relevant information from external data sources and augmenting the training process. These models help to trace evidence from an externally provided knowledge base, allowing the model predictions to be better interpreted and verified. In this work, we critically evaluate these models on their ability to perform scientific document reasoning tasks. To this end, we tuned multiple such model variants with science-focused instructions and evaluated them on a scientific document reasoning benchmark for the usefulness of the retrieved document passages. Our findings suggest that models justify predictions in science tasks with fabricated evidence and that leveraging a scientific corpus as pretraining data does not alleviate the risk of evidence fabrication.

* 5 pages

NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain

Oct 17, 2023

Anurag Acharya, Sai Munikoti, Aaron Hellinger, Sara Smith, Sridevi Wagle, Sameera Horawalavithana

Abstract: As LLMs have become increasingly popular, they have been used in almost every field. But as the application of LLMs expands from generic fields to narrow, focused science domains, there exists an ever-increasing gap in ways to evaluate their efficacy in those fields. Of the benchmarks that do exist, many focus on questions that don't require proper understanding of the subject in question. In this paper, we present NuclearQA, a human-made benchmark of 100 questions to evaluate language models in the nuclear domain, consisting of a varying collection of questions that have been specifically designed by experts to test the abilities of language models. We detail our approach and show how the mix of several types of questions makes our benchmark uniquely capable of evaluating models in the nuclear domain. We also present our own evaluation metric for assessing LLM performance due to the limitations of existing ones. Our experiments on state-of-the-art models suggest that even the best LLMs perform less than satisfactorily on our benchmark, demonstrating the scientific knowledge gap of existing LLMs.

* 9 pages

Anticipating Technical Expertise and Capability Evolution in Research Communities using Dynamic Graph Transformers

Jul 18, 2023

Sameera Horawalavithana, Ellyn Ayton, Anastasiya Usenko, Robin Cosbey, Svitlana Volkova

Abstract: The ability to anticipate technical expertise and capability evolution trends globally is essential for national and global security, especially in safety-critical domains like nuclear nonproliferation (NN) and rapidly emerging fields like artificial intelligence (AI). In this work, we extend traditional statistical relational learning approaches (e.g., link prediction in collaboration networks) and formulate a problem of anticipating technical expertise and capability evolution using dynamic heterogeneous graph representations. We develop novel capabilities to forecast collaboration patterns, authorship behavior, and technical capability evolution at different granularities (e.g., scientist and institution levels) in two distinct research fields. We implement a dynamic graph transformer (DGT) neural architecture, which advances state-of-the-art graph neural network models by (a) forecasting heterogeneous (rather than homogeneous) nodes and edges, and (b) relying on both discrete- and continuous-time inputs. We demonstrate that our DGT models predict collaboration, partnership, and expertise patterns with 0.26, 0.73, and 0.53 mean reciprocal rank values for the AI domain and 0.48, 0.93, and 0.22 for the NN domain. DGT model performance exceeds the best-performing static graph baseline models by 30-80% across AI and NN domains. Our findings demonstrate that DGT models boost inductive task performance, when previously unseen nodes appear in the test data, for the domains with emerging collaboration patterns (e.g., AI). Specifically, models accurately predict which established scientists will collaborate with early career scientists and vice-versa in the AI domain.
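
The abstract reports results as mean reciprocal rank (MRR) values. As a minimal sketch of how that metric is commonly computed for ranked link predictions (not the authors' code; the function name and the toy rankings below are illustrative assumptions), each query contributes 1/rank of its true target, or 0 if the target is absent from the ranking:

```python
from typing import Hashable, List, Sequence


def mean_reciprocal_rank(
    ranked_lists: Sequence[List[Hashable]], targets: Sequence[Hashable]
) -> float:
    """Average of 1/rank of the true target over all queries.

    A query whose target does not appear in its ranking contributes 0.
    """
    if len(ranked_lists) != len(targets):
        raise ValueError("ranked_lists and targets must have the same length")
    if not targets:
        return 0.0
    total = 0.0
    for ranking, target in zip(ranked_lists, targets):
        if target in ranking:
            rank = ranking.index(target) + 1  # ranks are 1-based
            total += 1.0 / rank
    return total / len(targets)


# Toy example (assumption): three queries whose true collaborator is "a".
rankings = [["b", "a", "c"], ["a", "b", "c"], ["c", "b", "a"]]
truth = ["a", "a", "a"]
mrr = mean_reciprocal_rank(rankings, truth)  # ranks 2, 1, 3
print(round(mrr, 4))
```

Under this definition, an MRR of 0.5 corresponds to the true target sitting at rank 2 on average in reciprocal terms, which gives a rough sense of scale for the values quoted in the abstract.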

SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions

Jul 03, 2023

Sameera Horawalavithana, Sai Munikoti, Ian Stewart, Henry Kvinge

Abstract: Instruction finetuning is a popular paradigm to align large language models (LLMs) with human intent. Despite its popularity, this idea is less explored for aligning existing foundation models with scientific disciplines, concepts, and goals. In this work, we present SciTune as a tuning framework to improve the ability of LLMs to follow scientific multimodal instructions. To test our methodology, we use a human-generated scientific instruction tuning dataset and train a large multimodal model, LLaMA-SciTune, that connects a vision encoder and an LLM for science-focused visual and language understanding. In comparison to the models that are finetuned with machine-generated data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories on the ScienceQA benchmark.

* Preprint. Work in progress
