Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jennifer Healey

Adobe

Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries

May 28, 2025

Vishakh Padmakumar, Zichao Wang, David Arbour, Jennifer Healey

Abstract:While large language models (LLMs) are increasingly capable of handling longer contexts, recent work has demonstrated that they exhibit the "lost in the middle" phenomenon (Liu et al., 2024) of unevenly attending to different parts of the provided context. This hinders their ability to cover diverse source material in multi-document summarization, as noted in the DiverseSumm benchmark (Huang et al., 2024). In this work, we contend that principled content selection is a simple way to increase source coverage on this task. As opposed to prompting an LLM to perform the summarization in a single step, we explicitly divide the task into three steps -- (1) reducing document collections to atomic key points, (2) using determinantal point processes (DPP) to perform select key points that prioritize diverse content, and (3) rewriting to the final summary. By combining prompting steps, for extraction and rewriting, with principled techniques, for content selection, we consistently improve source coverage on the DiverseSumm benchmark across various LLMs. Finally, we also show that by incorporating relevance to a provided user intent into the DPP kernel, we can generate personalized summaries that cover relevant source information while retaining coverage.

* To appear at ACL 2025 - Main Conference

Via

Access Paper or Ask Questions

Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text

May 05, 2025

Jennifer Healey, Laurie Byrum, Md Nadeem Akhtar, Surabhi Bhargava, Moumita Sinha

Abstract:LLM evaluation is challenging even the case of base models. In real world deployments, evaluation is further complicated by the interplay of task specific prompts and experiential context. At scale, bias evaluation is often based on short context, fixed choice benchmarks that can be rapidly evaluated, however, these can lose validity when the LLMs' deployed context differs. Large scale human evaluation is often seen as too intractable and costly. Here we present our journey towards developing a semi-automated bias evaluation framework for free text responses that has human insights at its core. We discuss how we developed an operational definition of bias that helped us automate our pipeline and a methodology for classifying bias beyond multiple choice. We additionally comment on how human evaluation helped us uncover problematic templates in a bias benchmark.

* 6 pages, no figures, presented at CHI 2025 workshop for Human Evaluation and Auditing of Language Models

Via

Access Paper or Ask Questions

Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs

Apr 30, 2025

Jinyan Su, Jennifer Healey, Preslav Nakov, Claire Cardie

Abstract:Large language models (LLMs) are increasingly optimized for long reasoning, under the assumption that more reasoning leads to better performance. However, emerging evidence suggests that longer responses can sometimes degrade accuracy rather than improve it. In this paper, we conduct a systematic empirical study of the relationship between reasoning length and answer correctness. We find that LLMs tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones, failing to extend their reasoning when it is most needed. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Furthermore, we investigate the effects of length reduction with a preference optimization algorithm when simply preferring the shorter responses regardless of answer correctness. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy. Our findings highlight generation length as a meaningful signal for reasoning behavior and motivate further exploration into LLMs' self-awareness in reasoning length adaptation.

Via

Access Paper or Ask Questions

Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control

Feb 17, 2025

Jinyan Su, Jennifer Healey, Preslav Nakov, Claire Cardie

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to mitigate large language model (LLM) hallucinations by incorporating external knowledge retrieval. However, existing RAG frameworks often apply retrieval indiscriminately,leading to inefficiencies-over-retrieving when unnecessary or failing to retrieve iteratively when required for complex reasoning. Recent adaptive retrieval strategies, though adaptively navigates these retrieval strategies, predict only based on query complexity and lacks user-driven flexibility, making them infeasible for diverse user application needs. In this paper, we introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off. Our approach leverages two classifiers: one trained to prioritize accuracy and another to prioritize retrieval efficiency. Via an interpretable control parameter $\alpha$, users can seamlessly navigate between minimal-cost retrieval and high-accuracy retrieval based on their specific requirements. We empirically demonstrate that our approach effectively balances accuracy, retrieval cost, and user controllability, making it a practical and adaptable solution for real-world applications.

Via

Access Paper or Ask Questions

TextLap: Customizing Language Models for Text-to-Layout Planning

Oct 09, 2024

Jian Chen, Ruiyi Zhang, Yufan Zhou, Jennifer Healey, Jiuxiang Gu, Zhiqiang Xu, Changyou Chen

Abstract:Automatic generation of graphical layouts is crucial for many real-world applications, including designing posters, flyers, advertisements, and graphical user interfaces. Given the incredible ability of Large language models (LLMs) in both natural language understanding and generation, we believe that we could customize an LLM to help people create compelling graphical layouts starting with only text instructions from the user. We call our method TextLap (text-based layout planning). It uses a curated instruction-based layout planning dataset (InsLap) to customize LLMs as a graphic designer. We demonstrate the effectiveness of TextLap and show that it outperforms strong baselines, including GPT-4 based methods, for image generation and graphical design benchmarks.

* Accepted to the EMNLP Findings

Via

Access Paper or Ask Questions

Evaluating Nuanced Bias in Large Language Model Free Response Answers

Jul 11, 2024

Jennifer Healey, Laurie Byrum, Md Nadeem Akhtar, Moumita Sinha

Abstract:Pre-trained large language models (LLMs) can now be easily adapted for specific business purposes using custom prompts or fine tuning. These customizations are often iteratively re-engineered to improve some aspect of performance, but after each change businesses want to ensure that there has been no negative impact on the system's behavior around such critical issues as bias. Prior methods of benchmarking bias use techniques such as word masking and multiple choice questions to assess bias at scale, but these do not capture all of the nuanced types of bias that can occur in free response answers, the types of answers typically generated by LLM systems. In this paper, we identify several kinds of nuanced bias in free text that cannot be similarly identified by multiple choice tests. We describe these as: confidence bias, implied bias, inclusion bias and erasure bias. We present a semi-automated pipeline for detecting these types of bias by first eliminating answers that can be automatically classified as unbiased and then co-evaluating name reversed pairs using crowd workers. We believe that the nuanced classifications our method generates can be used to give better feedback to LLMs, especially as LLM reasoning capabilities become more advanced.

* 14 pages, 0 figures, submitted to NLDB 2024, Turin, Italy

Via

Access Paper or Ask Questions

Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models

Apr 23, 2024

Wanrong Zhu, Jennifer Healey, Ruiyi Zhang, William Yang Wang, Tong Sun

Figure 1 for Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models

Figure 2 for Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models

Figure 3 for Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models

Figure 4 for Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models

Abstract:Recent advancements in instruction-following models have made user interactions with models more user-friendly and efficient, broadening their applicability. In graphic design, non-professional users often struggle to create visually appealing layouts due to limited skills and resources. In this work, we introduce a novel multimodal instruction-following framework for layout planning, allowing users to easily arrange visual elements into tailored layouts by specifying canvas size and design purpose, such as for book covers, posters, brochures, or menus. We developed three layout reasoning tasks to train the model in understanding and executing layout instructions. Experiments on two benchmarks show that our method not only simplifies the design process for non-professionals but also surpasses the performance of few-shot GPT-4V models, with mIoU higher by 12% on Crello. This progress highlights the potential of multimodal instruction-following models to automate and simplify the design process, providing an approachable solution for a wide range of design tasks on visually-rich documents.

Via

Access Paper or Ask Questions

Gaudí: Conversational Interactions with Deep Representations to Generate Image Collections

Dec 05, 2021

Victor S. Bursztyn, Jennifer Healey, Vishwa Vinay

Figure 1 for Gaudí: Conversational Interactions with Deep Representations to Generate Image Collections

Figure 2 for Gaudí: Conversational Interactions with Deep Representations to Generate Image Collections

Abstract:Based on recent advances in realistic language modeling (GPT-3) and cross-modal representations (CLIP), Gaud\'i was developed to help designers search for inspirational images using natural language. In the early stages of the design process, with the goal of eliciting a client's preferred creative direction, designers will typically create thematic collections of inspirational images called "mood-boards". Creating a mood-board involves sequential image searches which are currently performed using keywords or images. Gaud\'i transforms this process into a conversation where the user is gradually detailing the mood-board's theme. This representation allows our AI to generate new search queries from scratch, straight from a project briefing, following a theme hypothesized by GPT-3. Compared to previous computational approaches to mood-board creation, to the best of our knowledge, ours is the first attempt to represent mood-boards as the stories that designers tell when presenting a creative direction to a client.

* Accepted at the NeurIPS 2021 Workshop on Machine Learning for Creativity and Design

Via

Access Paper or Ask Questions

Multiscale Manifold Warping

Sep 19, 2021

Sridhar Mahadevan, Anup Rao, Georgios Theocharous, Jennifer Healey

Figure 1 for Multiscale Manifold Warping

Figure 2 for Multiscale Manifold Warping

Figure 3 for Multiscale Manifold Warping

Figure 4 for Multiscale Manifold Warping

Abstract:Many real-world applications require aligning two temporal sequences, including bioinformatics, handwriting recognition, activity recognition, and human-robot coordination. Dynamic Time Warping (DTW) is a popular alignment method, but can fail on high-dimensional real-world data where the dimensions of aligned sequences are often unequal. In this paper, we show that exploiting the multiscale manifold latent structure of real-world data can yield improved alignment. We introduce a novel framework called Warping on Wavelets (WOW) that integrates DTW with a a multi-scale manifold learning framework called Diffusion Wavelets. We present a theoretical analysis of the WOW family of algorithms and show that it outperforms previous state of the art methods, such as canonical time warping (CTW) and manifold warping, on several real-world datasets.

* 18 pages

Via

Access Paper or Ask Questions

"It doesn't look good for a date": Transforming Critiques into Preferences for Conversational Recommendation Systems

Sep 15, 2021

Victor S. Bursztyn, Jennifer Healey, Nedim Lipka, Eunyee Koh, Doug Downey, Larry Birnbaum

Figure 1 for "It doesn't look good for a date": Transforming Critiques into Preferences for Conversational Recommendation Systems

Figure 2 for "It doesn't look good for a date": Transforming Critiques into Preferences for Conversational Recommendation Systems

Figure 3 for "It doesn't look good for a date": Transforming Critiques into Preferences for Conversational Recommendation Systems

Abstract:Conversations aimed at determining good recommendations are iterative in nature. People often express their preferences in terms of a critique of the current recommendation (e.g., "It doesn't look good for a date"), requiring some degree of common sense for a preference to be inferred. In this work, we present a method for transforming a user critique into a positive preference (e.g., "I prefer more romantic") in order to retrieve reviews pertaining to potentially better recommendations (e.g., "Perfect for a romantic dinner"). We leverage a large neural language model (LM) in a few-shot setting to perform critique-to-preference transformation, and we test two methods for retrieving recommendations: one that matches embeddings, and another that fine-tunes an LM for the task. We instantiate this approach in the restaurant domain and evaluate it using a new dataset of restaurant critiques. In an ablation study, we show that utilizing critique-to-preference transformation improves recommendations, and that there are at least three general cases that explain this improved performance.

* Accepted to EMNLP 2021's main conference

Via

Access Paper or Ask Questions