Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nicholas Lourie

What's In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models

Mar 12, 2025

Abhipsha Das, Nicholas Lourie, Siavash Golkar, Mariel Pettee

Abstract:The scientific literature's exponential growth makes it increasingly challenging to navigate and synthesize knowledge across disciplines. Large language models (LLMs) are powerful tools for understanding scientific text, but they fail to capture detailed relationships across large bodies of work. Unstructured approaches, like retrieval augmented generation, can sift through such corpora to recall relevant facts; however, when millions of facts influence the answer, unstructured approaches become cost prohibitive. Structured representations offer a natural complement -- enabling systematic analysis across the whole corpus. Recent work enhances LLMs with unstructured or semistructured representations of scientific concepts; to complement this, we try extracting structured representations using LLMs. By combining LLMs' semantic understanding with a schema of scientific concepts, we prototype a system that answers precise questions about the literature as a whole. Our schema applies across scientific fields and we extract concepts from it using only 20 manually annotated abstracts. To demonstrate the system, we extract concepts from 30,000 papers on arXiv spanning astrophysics, fluid dynamics, and evolutionary biology. The resulting database highlights emerging trends and, by visualizing the knowledge graph, offers new ways to explore the ever-growing landscape of scientific knowledge. Demo: abby101/surveyor-0 on HF Spaces. Code: https://github.com/chiral-carbon/kg-for-science.

* 9 pages, 5 pdf figures

Via

Access Paper or Ask Questions

Aioli: A Unified Optimization Framework for Language Model Data Mixing

Nov 08, 2024

Mayee F. Chen, Michael Y. Hu, Nicholas Lourie, Kyunghyun Cho, Christopher Ré

Abstract:Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity per group. In this paper, we study the cause of this inconsistency by unifying existing methods into a standard optimization framework. We show that all methods set proportions to minimize total loss, subject to a method-specific mixing law -- an assumption on how loss is a function of mixture proportions. We find that existing parameterizations of mixing laws can express the true loss-proportion relationship empirically, but the methods themselves often set the mixing law parameters inaccurately, resulting in poor and inconsistent performance. Finally, we leverage the insights from our framework to derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Empirically, Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.28 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.01 test perplexity points.

Via

Access Paper or Ask Questions

Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

May 30, 2024

Siavash Golkar, Alberto Bietti, Mariel Pettee, Michael Eickenberg, Miles Cranmer, Keiya Hirashima, Geraud Krawezik, Nicholas Lourie, Michael McCabe, Rudy Morel(+5 more)

Abstract:Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that no positional embeddings lead to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out of distribution performance is tightly linked to which tokens it uses as a bias term.

Via

Access Paper or Ask Questions

Show Your Work with Confidence: Confidence Bands for Tuning Curves

Nov 16, 2023

Nicholas Lourie, Kyunghyun Cho, He He

Abstract:The choice of hyperparameters greatly impacts performance in natural language processing. Often, it is hard to tell if a method is better than another or just better tuned. Tuning curves fix this ambiguity by accounting for tuning effort. Specifically, they plot validation performance as a function of the number of hyperparameter choices tried so far. While several estimators exist for these curves, it is common to use point estimates, which we show fail silently and give contradictory results when given too little data. Beyond point estimates, confidence bands are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, thus they provide a robust basis for comparing methods. Empirical analysis shows that while bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, ours achieve it exactly. We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release a library implementing the method at https://github.com/nalourie/opda .

* 15 pages, 15 figures

Via

Access Paper or Ask Questions

UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark

Mar 24, 2021

Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi

Figure 1 for UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark

Figure 2 for UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark

Figure 3 for UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark

Figure 4 for UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark

Abstract:Commonsense AI has long been seen as a near impossible goal -- until recently. Now, research interest has sharply increased with an influx of new benchmarks and models. We propose two new ways to evaluate commonsense models, emphasizing their generality on new tasks and building on diverse, recently introduced benchmarks. First, we propose a new multitask benchmark, RAINBOW, to promote research on commonsense models that generalize well over multiple tasks and datasets. Second, we propose a novel evaluation, the cost equivalent curve, that sheds new insight on how the choice of source datasets, pretrained language models, and transfer learning methods impacts performance and data efficiency. We perform extensive experiments -- over 200 experiments encompassing 4800 models -- and report multiple valuable and sometimes surprising findings, e.g., that transfer almost always leads to better or equivalent performance if following a particular recipe, that QA-based commonsense datasets transfer well with each other, while commonsense knowledge graphs do not, and that perhaps counter-intuitively, larger models benefit more from transfer than smaller ones. Last but not least, we introduce a new universal commonsense reasoning model, UNICORN, that establishes new state-of-the-art performance across 8 popular commonsense benchmarks, aNLI (87.3%), CosmosQA (91.8%), HellaSWAG (93.9%), PIQA (90.1%), SocialIQa (83.2%), WinoGrande (86.6%), CycIC (94.0%) and CommonsenseQA (79.3%).

* 27 pages, 19 figures, 34 tables. Accepted to AAAI 2021. For associated code and data see https://github.com/allenai/rainbow

Via

Access Paper or Ask Questions

GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

Jan 17, 2021

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, Daniel S. Weld

Figure 1 for GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

Figure 2 for GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

Figure 3 for GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

Figure 4 for GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

Abstract:Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks that can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms asking human annotators to evaluate them on various axes (e.g., correctness, conciseness, fluency) and compares their answers to various automatic metrics. We introduce several datasets in English to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation.

Via

Access Paper or Ask Questions

Learning from Task Descriptions

Nov 16, 2020

Orion Weller, Nicholas Lourie, Matt Gardner, Matthew E. Peters

Figure 1 for Learning from Task Descriptions

Figure 2 for Learning from Task Descriptions

Figure 3 for Learning from Task Descriptions

Figure 4 for Learning from Task Descriptions

Abstract:Typically, machine learning systems solve new tasks by training on thousands of examples. In contrast, humans can solve new tasks by reading some instructions, with perhaps an example or two. To take a step toward closing this gap, we introduce a framework for developing NLP systems that solve new tasks after reading their descriptions, synthesizing prior work in this area. We instantiate this framework with a new English language dataset, ZEST, structured for task-oriented evaluation on unseen tasks. Formulating task descriptions as questions, we ensure each is general enough to apply to many possible inputs, thus comprehensively evaluating a model's ability to solve each task. Moreover, the dataset's structure tests specific types of systematic generalization. We find that the state-of-the-art T5 model achieves a score of 12% on ZEST, leaving a significant challenge for NLP researchers.

* EMNLP 2020

Via

Access Paper or Ask Questions

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Oct 15, 2020

Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi

Figure 1 for Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Figure 2 for Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Figure 3 for Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Figure 4 for Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Abstract:Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps---a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example---the model's confidence in the true class, and the variability of this confidence across epochs---obtained in a single run of training. Experiments across four datasets show that these model-dependent measures reveal three distinct regions in the data map, each with pronounced characteristics. First, our data maps show the presence of "ambiguous" regions with respect to the model, which contribute the most towards out-of-distribution generalization. Second, the most populous regions in the data are "easy to learn" for the model, and play an important role in model optimization. Finally, data maps uncover a region with instances that the model finds "hard to learn"; these often correspond to labeling errors. Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.

* Proceedings of EMNLP 2020

Via

Access Paper or Ask Questions

Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes

Aug 20, 2020

Nicholas Lourie, Ronan Le Bras, Yejin Choi

Figure 1 for Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes

Figure 2 for Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes

Figure 3 for Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes

Figure 4 for Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes

Abstract:As AI systems become an increasing part of people's everyday lives, it becomes ever more important that they understand people's ethical norms. Motivated by descriptive ethics, a field of study that focuses on people's descriptive judgments rather than theoretical prescriptions on morality, we investigate a novel, data-driven approach to machine ethics. We introduce Scruples, the first large-scale dataset with 625,000 ethical judgments over 32,000 real-life anecdotes. Each anecdote recounts a complex ethical situation, often posing moral dilemmas, paired with a distribution of judgments contributed by the community members. Our dataset presents a major challenge to state-of-the-art neural language models, leaving significant room for improvement. However, when presented with simplified moral situations, the results are considerably more promising, suggesting that neural models can effectively learn simpler ethical building blocks. A key take-away of our empirical analysis is that norms are not always clean-cut; many situations are naturally divisive. We present a new method to estimate the best possible performance on such tasks with inherently diverse label distributions, and explore likelihood functions that separate intrinsic from model uncertainty.

* 16 pages, 15 tables, 8 figures. For associated code and data, see https://github.com/allenai/scruples

Via

Access Paper or Ask Questions

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Nov 02, 2018

Alon Talmor, Jonathan Herzig, Nicholas Lourie, Jonathan Berant

Figure 1 for CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Figure 2 for CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Figure 3 for CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Figure 4 for CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Abstract:When answering a question, people often draw upon their rich world knowledge in addition to some task-specific context. Recent work has focused primarily on answering questions based on some relevant document or content, and required very little general background. To investigate question answering with prior knowledge, we present CommonsenseQA: a difficult new dataset for commonsense question answering. To capture common sense beyond associations, each question discriminates between three target concepts that all share the same relationship to a single source drawn from ConceptNet (Speer et al., 2017). This constraint encourages crowd workers to author multiple-choice questions with complex semantics, in which all candidates relate to the subject in a similar way. We create 9,500 questions through this procedure and demonstrate the dataset's difficulty with a large number of strong baselines. Our best baseline, the OpenAI GPT (Radford et al., 2018), obtains 54.8% accuracy, well below human performance, which is 95.3%.

Via

Access Paper or Ask Questions