Noah Smith

LlamaPIE: Proactive In-Ear Conversation Assistants

May 07, 2025

Tuochao Chen, Nicholas Batchelder, Alisa Liu, Noah Smith, Shyamnath Gollakota

Abstract: We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive model, highlighting the potential of LlamaPIE to enhance live conversations.
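The two-model pipeline named in the abstract can be pictured as a cheap gate in front of an expensive generator. The sketch below is only an illustration of that control flow: `proactive_step`, `should_respond`, and `generate` are hypothetical names, and the two callables stand in for the small and large models, whose actual interfaces the abstract does not describe.

```python
from typing import Callable, Optional


def proactive_step(
    transcript: str,
    should_respond: Callable[[str], bool],  # hypothetical stand-in for the small model
    generate: Callable[[str], str],         # hypothetical stand-in for the larger model
) -> Optional[str]:
    """Run one pipeline step on the conversation heard so far.

    The small model is consulted first; the larger model is invoked only
    when the small model decides that a response is warranted.
    """
    if not should_respond(transcript):
        return None  # stay silent and do not pay for generation
    return generate(transcript)
```

The point of this arrangement is that the gate runs on every update while the generator runs rarely, which is one way to keep the assistant quiet by default.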

The Leaderboard Illusion

Apr 29, 2025

Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith (+3 more)

Abstract: Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
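The selective-disclosure effect described in the abstract can be reproduced with a toy simulation: if several equally skilled variants are scored with noise and only the best score is published, the published number is biased upward. This is a rough sketch, not the paper's analysis. The skill value, noise level, and function names are assumptions made for illustration; only the count of 27 variants comes from the abstract.

```python
import random
import statistics


def reported_score(true_skill, n_variants, noise_sd, rng):
    """Score published when only the best of n_variants noisy estimates is disclosed."""
    estimates = [rng.gauss(true_skill, noise_sd) for _ in range(n_variants)]
    return max(estimates)


def mean_reported_score(n_variants, trials=2000, true_skill=1200.0,
                        noise_sd=15.0, seed=0):
    """Average published score over many simulated releases.

    true_skill and noise_sd are illustrative values, not figures from the paper.
    Every variant has the same true skill, so any gain is pure selection bias.
    """
    rng = random.Random(seed)
    return statistics.fmean(
        reported_score(true_skill, n_variants, noise_sd, rng)
        for _ in range(trials)
    )


if __name__ == "__main__":
    for n in (1, 3, 10, 27):
        print(n, round(mean_reported_score(n), 1))
```

With one variant the average stays near the true skill; as the number of privately tested variants grows, the average published score rises even though no variant is actually better.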

* 68 pages, 18 figures, 9 tables

Decoding-Time Language Model Alignment with Multiple Objectives

Jun 27, 2024

Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Noah Smith, Hannaneh Hajishirzi, Simon Du

Abstract: Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives. Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions of all base models, for any given weightings over different objectives. We exploit a common form among a family of $f$-divergence regularized alignment approaches (such as PPO, DPO, and their variants) to identify a closed-form solution by Legendre transform, and derive an efficient decoding strategy. Theoretically, we show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method. Empirical results demonstrate the effectiveness of the algorithm. For example, compared to a parameter-merging baseline, MOD achieves 12.8% overall reward improvement when equally optimizing towards 3 objectives. Moreover, we experiment with MOD on combining three fully-finetuned LLMs of different model sizes, each aimed at different objectives such as safety, coding, and general user preference. Unlike traditional methods that require careful curation of a mixture of datasets to achieve comprehensive improvement, we can quickly experiment with preference weightings using MOD to find the best combination of models. Our best combination reduces toxicity on Toxigen to nearly 0% and achieves 7.9--33.3% improvement across three other metrics (i.e., Codex@1, GSM-COT, BBH-COT).
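To make "a linear combination of predictions of all base models" concrete, here is a minimal sketch that combines per-model next-token logits with user-chosen weights and renormalizes. Taking the combination over logits is an assumption of this sketch; the abstract does not say which quantity is combined, and the general $f$-divergence case may differ. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_next_token(logits_per_model, weights):
    """Next-token distribution from a weighted sum of each model's logits.

    logits_per_model: one list of vocabulary logits per base model.
    weights: one preference weight per model (assumed to be chosen by the user).
    """
    if len(logits_per_model) != len(weights):
        raise ValueError("need exactly one weight per model")
    vocab_size = len(logits_per_model[0])
    combined = [
        sum(w * logits[i] for w, logits in zip(weights, logits_per_model))
        for i in range(vocab_size)
    ]
    return softmax(combined)


def greedy_token(logits_per_model, weights):
    """Index of the most probable token under the combined distribution."""
    probs = combine_next_token(logits_per_model, weights)
    return max(range(len(probs)), key=probs.__getitem__)
```

Changing `weights` at decoding time shifts the output between the base models' preferences without any retraining, which is the property the abstract emphasizes.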

Competency Problems: On Finding and Removing Artifacts in Language Data

Apr 17, 2021

Matt Gardner, William Merrill, Jesse Dodge, Matthew E. Peters, Alexis Ross, Sameer Singh, Noah Smith

Abstract: Much recent work in NLP has documented dataset artifacts, bias, and spurious correlations between input features and output labels. However, how to tell which features have "spurious" instead of legitimate correlations is typically left unspecified. In this work we argue that for complex language understanding tasks, all simple feature correlations are spurious, and we formalize this notion into a class of problems which we call competency problems. For example, the word "amazing" on its own should not give information about a sentiment label independent of the context in which it appears, which could include negation, metaphor, sarcasm, etc. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account, showing that realistic datasets will increasingly deviate from competency problems as dataset size increases. This analysis gives us a simple statistical test for dataset artifacts, which we use to show more subtle biases than were described in prior work, including demonstrating that models are inappropriately affected by these less extreme biases. Our theoretical treatment of this problem also allows us to analyze proposed solutions, such as making local edits to dataset instances, and to give recommendations for future data collection and model design efforts that target competency problems.
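The abstract does not spell out the statistical test. One natural candidate, assumed here purely for illustration, is a one-proportion z-test: under the competency assumption a single feature such as the word "amazing" should leave the label distribution uniform, so a large deviation from 1/num_labels flags a possible artifact. The function name and interface below are hypothetical.

```python
import math


def artifact_z_score(count_with_label, count_total, num_labels=2):
    """z-statistic for how far a feature's label rate departs from uniform.

    count_total: number of examples that contain the feature.
    count_with_label: how many of those examples carry the label of interest.
    The null hypothesis (an assumption of this sketch) is a rate of 1/num_labels.
    """
    if count_total <= 0:
        raise ValueError("count_total must be positive")
    p0 = 1.0 / num_labels
    p_hat = count_with_label / count_total
    standard_error = math.sqrt(p0 * (1.0 - p0) / count_total)
    return (p_hat - p0) / standard_error
```

For example, a word that appears in 100 examples, 90 of them positive, gets z = 8, far outside what chance would explain. When many features are tested at once, the significance threshold should be corrected for multiple comparisons.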

Promoting Graph Awareness in Linearized Graph-to-Text Generation

Dec 31, 2020

Alexander Hoyle, Ana Marasović, Noah Smith

Abstract: Generating text from structured inputs, such as meaning representations or RDF triples, has often involved the use of specialized graph-encoding neural networks. However, recent applications of pretrained transformers to linearizations of graph inputs have yielded state-of-the-art generation results on graph-to-text tasks. Here, we explore the ability of these linearized models to encode local graph structures, in particular their invariance to the graph linearization strategy and their ability to reconstruct corrupted inputs. Our findings motivate solutions to enrich the quality of models' implicit graph encodings via scaffolding. Namely, we use graph-denoising objectives implemented in a multi-task text-to-text framework. We find that these denoising scaffolds lead to substantial improvements in downstream generation in low-resource settings.

Parameter Norm Growth During Training of Transformers

Nov 11, 2020

William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, Noah Smith

Abstract: The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically some variant of gradient descent (GD). To better understand this bias, we study the tendency of transformer parameters to grow in magnitude during training. We find, both theoretically and empirically, that, in certain contexts, GD increases the parameter $L_2$ norm up to a threshold that itself increases with training-set accuracy. This means increasing training accuracy over time enables the norm to increase. Empirically, we show that the norm grows continuously over pretraining for T5 (Raffel et al., 2019). We show that pretrained T5 approximates a semi-discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the original network family that can be described in automata-theoretic terms. This suggests saturation is a new characterization of an inductive bias implicit in GD that is of particular interest for NLP. While our experiments focus on transformers, our theoretical analysis extends to other architectures with similar formal properties, such as feedforward ReLU networks.
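The link between growing norms and saturation can be seen on a single softmax: multiplying its inputs by a larger and larger factor pushes the output toward a one-hot vector. The sketch below measures that gap. It is a toy illustration under the assumption that norm growth acts like scaling the pre-softmax scores; it is not the paper's experimental setup, and the names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def saturation_gap(logits, scale):
    """L1 distance between softmax(scale * logits) and the one-hot argmax vector.

    A gap near 0 means the softmax behaves like a hard (saturated) argmax.
    """
    probs = softmax([scale * x for x in logits])
    top = max(range(len(logits)), key=logits.__getitem__)
    return sum(abs(p - (1.0 if i == top else 0.0)) for i, p in enumerate(probs))


if __name__ == "__main__":
    scores = [1.0, 0.5, -0.2]  # arbitrary illustrative values
    for scale in (1, 10, 100):
        print(scale, saturation_gap(scores, scale))
```

The printed gap shrinks rapidly as the scale grows, which is the sense in which a network with large parameters approximates its saturated counterpart.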

* Preprint. 9 body pages with appendix

Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Feb 15, 2020

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, Noah Smith

Abstract: Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.
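One common way to quantify how the best-found model improves with the number of trials is the expected maximum of n draws from the empirical score distribution. The abstract does not say which estimator the authors use, so the sketch below is an assumption; `expected_best` is a hypothetical name, and draws are taken with replacement from the observed validation scores.

```python
def expected_best(scores, n_trials):
    """Expected best score among n_trials runs drawn with replacement from scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    ordered = sorted(scores)
    total = len(ordered)
    expected = 0.0
    for rank, value in enumerate(ordered, start=1):
        # Probability that the best of the n draws is the item at this rank:
        # all draws fall at or below it, minus all draws fall strictly below it.
        prob = (rank / total) ** n_trials - ((rank - 1) / total) ** n_trials
        expected += value * prob
    return expected
```

With `n_trials=1` this reduces to the mean score, and it approaches the single best observed score as `n_trials` grows, so plotting it against `n_trials` shows how much of a reported result may come from simply running more seeds.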

Robust Navigation with Language Pretraining and Stochastic Sampling

Sep 05, 2019

Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah Smith, Yejin Choi

Abstract: Core to the vision-and-language navigation (VLN) challenge is building robust instruction representations and action decoding schemes, which can generalize well to previously unseen instructions and environments. In this paper, we report two simple but highly effective methods to address these challenges and lead to a new state-of-the-art performance. First, we adapt large-scale pretrained language models to learn text representations that generalize better to previously unseen instructions. Second, we propose a stochastic sampling scheme to reduce the considerable gap between the expert actions in training and sampled actions in test, so that the agent can learn to correct its own mistakes during long sequential action decoding. Combining the two techniques, we achieve a new state of the art on the Room-to-Room benchmark with 6% absolute gain over the previous best result (47% -> 53%) on the Success Rate weighted by Path Length metric.
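The Success Rate weighted by Path Length metric quoted above is commonly defined as the average, over episodes, of success times shortest-path length divided by the larger of the agent's path length and the shortest-path length. The sketch below follows that common definition; the tuple layout and the function name `spl` are assumptions of this example, and path lengths are assumed to be positive.

```python
def spl(episodes):
    """Success weighted by Path Length over a collection of episodes.

    episodes: iterable of (success, shortest_path_length, agent_path_length).
    A failed episode contributes 0; a successful one contributes
    shortest / max(agent, shortest), so detours are penalized.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        if shortest <= 0:
            raise ValueError("shortest_path_length must be positive")
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)
```

An agent that always succeeds but walks twice the optimal distance scores 0.5, which is why the metric is stricter than plain success rate.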

* 8 pages, 4 figures, EMNLP 2019

Polyglot Semantic Role Labeling

May 29, 2018

Phoebe Mulcaire, Swabha Swayamdipta, Noah Smith

Abstract: Previous approaches to multilingual semantic dependency parsing treat languages independently, without exploiting the similarities between semantic structures across languages. We experiment with a new approach where we combine resources from a pair of languages in the CoNLL 2009 shared task to build a polyglot semantic role labeler. Notwithstanding the absence of parallel data, and the dissimilarity in annotations between languages, our approach results in an improvement in SRL performance on multiple languages over a monolingual baseline. Analysis of the polyglot model shows it to be advantageous in lower-resource settings.

* To appear at ACL 2018

Neural Discourse Structure for Text Categorization

May 06, 2017

Yangfeng Ji, Noah Smith

Abstract: We show that discourse structure, as defined by Rhetorical Structure Theory and provided by an existing discourse parser, benefits text categorization. Our approach uses a recursive neural network and a newly proposed attention mechanism to compute a representation of the text that focuses on salient content, from the perspective of both RST and the task. Experiments consider variants of the approach and illustrate its strengths and weaknesses.

* ACL 2017 camera ready version
