Lidiya Murakhovs'ka

Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment

Nov 14, 2023

Philippe Laban, Lidiya Murakhovs'ka, Caiming Xiong, Chien-Sheng Wu

Abstract: The interactive nature of Large Language Models (LLMs) theoretically allows models to refine and improve their answers, yet systematic analysis of the multi-turn behavior of LLMs remains limited. In this paper, we propose the FlipFlop experiment: in the first round of the conversation, an LLM responds to a prompt containing a classification task. In a second round, the LLM is challenged with a follow-up phrase like "Are you sure?", offering an opportunity for the model to reflect on its initial answer, and decide whether to confirm or flip its answer. A systematic study of nine LLMs on seven classification tasks reveals that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17%. The FlipFlop experiment illustrates the universality of sycophantic behavior in LLMs and provides a robust framework to analyze model behavior and evaluate potential solutions.
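The two headline quantities in this abstract, the flip rate and the accuracy change between the first and final prediction, can be illustrated with a minimal sketch. This is not the authors' code: the `Trial` record, the `flipflop_metrics` function, and the toy data are hypothetical, and the sketch assumes the accuracy drop is a simple difference between final and initial accuracy, which the abstract does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One two-round conversation (hypothetical record layout)."""
    gold: str     # correct label for the classification task
    initial: str  # model's answer in the first round
    final: str    # model's answer after the "Are you sure?" challenge


def flipflop_metrics(trials):
    """Return flip rate and accuracy before/after the challenge."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    # A flip is any change of answer, whether it helps or hurts.
    flips = sum(t.initial != t.final for t in trials)
    acc_initial = sum(t.initial == t.gold for t in trials) / n
    acc_final = sum(t.final == t.gold for t in trials) / n
    return {
        "flip_rate": flips / n,
        "initial_accuracy": acc_initial,
        "final_accuracy": acc_final,
        # Assumption: change measured as a plain difference in accuracy.
        "accuracy_change": acc_final - acc_initial,
    }


# Toy data, invented for illustration only.
trials = [
    Trial(gold="A", initial="A", final="B"),  # correct, flipped to wrong
    Trial(gold="B", initial="B", final="B"),  # correct, confirmed
    Trial(gold="C", initial="A", final="C"),  # wrong, flipped to correct
    Trial(gold="A", initial="A", final="C"),  # correct, flipped to wrong
]
metrics = flipflop_metrics(trials)
print(metrics)
```

Counting flips separately from accuracy matters because a flip can also repair a wrong first answer, as in the third toy trial; a net drop only appears when harmful flips outnumber helpful ones.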

Salespeople vs SalesBot: Exploring the Role of Educational Value in Conversational Recommender Systems

Oct 26, 2023

Lidiya Murakhovs'ka, Philippe Laban, Tian Xie, Caiming Xiong, Chien-Sheng Wu

Abstract: Making big purchases requires consumers to research or consult a salesperson to gain domain expertise. However, existing conversational recommender systems (CRS) often overlook users' lack of background knowledge, focusing solely on gathering preferences. In this work, we define a new problem space for conversational agents that aim to provide both product recommendations and educational value through mixed-type mixed-initiative dialog. We introduce SalesOps, a framework that facilitates the simulation and evaluation of such systems by leveraging recent advancements in large language models (LLMs). We build SalesBot and ShopperBot, a pair of LLM-powered agents that can simulate either side of the framework. A comprehensive human study compares SalesBot against professional salespeople, revealing that although SalesBot approaches professional performance in terms of fluency and informativeness, it lags behind in recommendation quality. We emphasize the distinct limitations both face in providing truthful information, highlighting the challenges of ensuring faithfulness in the CRS context. We release our code and make all data available.

XGen-7B Technical Report

Sep 07, 2023

Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause (+15 more)

Abstract: Large Language Models (LLMs) have become ubiquitous across various domains, transforming the way we interact with information and conduct research. However, most high-performing LLMs remain confined behind proprietary walls, hindering scientific progress. Most open-source LLMs, on the other hand, are limited in their ability to support longer sequence lengths, which is a key requirement for many tasks that require inference over an input context. To address this, we have trained XGen, a series of 7B parameter models on up to 8K sequence length for up to 1.5T tokens. We have also finetuned the XGen models on public-domain instructional data, creating their instruction-tuned counterparts (XGen-Inst). We open-source our models for both research advancements and commercial applications. Our evaluation on standard benchmarks shows that XGen models achieve comparable or better results when compared with state-of-the-art open-source LLMs. Our targeted evaluation on long sequence modeling tasks shows the benefits of our 8K-sequence models over 2K-sequence open-source LLMs.

Designing and Evaluating Interfaces that Highlight News Coverage Diversity Using Discord Questions

Feb 17, 2023

Philippe Laban, Chien-Sheng Wu, Lidiya Murakhovs'ka, Xiang 'Anthony' Chen, Caiming Xiong

Abstract: Modern news aggregators do the hard work of organizing a large news stream, creating collections for a given news story with tens of source options. This paper shows that navigating large source collections for a news story can be challenging without further guidance. In this work, we design three interfaces -- the Annotated Article, the Recomposed Article, and the Question Grid -- aimed at accompanying news readers in discovering coverage diversity while they read. A first usability study with 10 journalism experts confirms that the designed interfaces all reveal coverage diversity and determines each interface's potential use cases and audiences. In a second usability study, we developed and implemented a reading exercise with 95 novice news readers to measure exposure to coverage diversity. Results show that Annotated Article users are able to answer questions 34% more completely than with two existing interfaces while finding the interface equally easy to use.

* CHI 2023 Accepted Paper

Discord Questions: A Computational Approach To Diversity Analysis in News Coverage

Nov 09, 2022

Philippe Laban, Chien-Sheng Wu, Lidiya Murakhovs'ka, Xiang 'Anthony' Chen, Caiming Xiong

Abstract: There are many potential benefits to news readers accessing diverse sources. Modern news aggregators do the hard work of organizing the news, offering readers a plethora of source options, but choosing which source to read remains challenging. We propose a new framework to assist readers in identifying source differences and gaining an understanding of news coverage diversity. The framework is based on the generation of Discord Questions: questions with a diverse answer pool, explicitly illustrating source differences. To assemble a prototype of the framework, we focus on two components: (1) discord question generation, the task of generating questions answered differently by sources, for which we propose an automatic scoring method and create a model that improves performance over current question generation (QG) methods by 5%, and (2) answer consolidation, the task of grouping answers to a question that are semantically similar, for which we collect data and repurpose a method that achieves 81% balanced accuracy on our realistic test set. We illustrate the framework's feasibility through a prototype interface. Even though model performance at discord QG still lags human performance by more than 15%, generated questions are judged to be more interesting than factoid questions and can reveal differences in the level of detail, sentiment, and reasoning of sources in news coverage.
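The answer consolidation result above is reported as balanced accuracy. As a rough illustration, and assuming the standard definition (the mean of per-class recall), the metric can be sketched as below. The function name, the labels, and the toy data are hypothetical; framing consolidation as a "same"/"different" decision over answer pairs is an assumption, not something the abstract states.

```python
from collections import defaultdict


def balanced_accuracy(gold, pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    total = defaultdict(int)    # number of examples per gold class
    correct = defaultdict(int)  # correctly predicted examples per gold class
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy labels, invented for illustration: are two answers semantically the same?
gold = ["same", "same", "same", "same", "different"]
pred = ["same", "same", "same", "different", "different"]

score = balanced_accuracy(gold, pred)
plain = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(score, plain)
```

On imbalanced data the two numbers diverge: here recall is 0.75 for "same" and 1.0 for "different", giving a balanced accuracy of 0.875 against a plain accuracy of 0.8.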

* EMNLP 2022 Findings - Long Paper

Quiz Design Task: Helping Teachers Create Quizzes with Automated Question Generation

May 03, 2022

Philippe Laban, Chien-Sheng Wu, Lidiya Murakhovs'ka, Wenhao Liu, Caiming Xiong

Abstract: Question generation (QGen) models are often evaluated with standardized NLG metrics that are based on n-gram overlap. In this paper, we measure whether these metric improvements translate to gains in a practical setting, focusing on the use case of helping teachers automate the generation of reading comprehension quizzes. In our study, teachers building a quiz receive question suggestions, which they can either accept or refuse with a reason. Even though we find that recent progress in QGen leads to a significant increase in question acceptance rates, there is still large room for improvement, with the best model having only 68.4% of its questions accepted by the ten teachers who participated in our study. We then leverage the annotations we collected to analyze standard NLG metrics and find that model performance has reached projected upper-bounds, suggesting new automatic metrics are needed to guide QGen research forward.

* Accepted at NAACL 2022 Special HCI Theme (Findings, short paper), 10 pages, 6 figures

MixQG: Neural Question Generation with Mixed Answer Types

Oct 15, 2021

Lidiya Murakhovs'ka, Chien-Sheng Wu, Tong Niu, Wenhao Liu, Caiming Xiong

Abstract: Asking good questions is an essential ability for both human and machine intelligence. However, existing neural question generation approaches mainly focus on the short factoid type of answers. In this paper, we propose a neural question generator, MixQG, to bridge this gap. We combine 9 question answering datasets with diverse answer types, including yes/no, multiple-choice, extractive, and abstractive answers, to train a single generative model. We show with empirical results that our model outperforms existing work in both seen and unseen domains and can generate questions with different cognitive levels when conditioned on different answer types. Our code is released and well-integrated with the Huggingface library to facilitate various downstream applications.
