Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Matthew R. Gormley

Predicting the Past: Estimating Historical Appraisals with OCR and Machine Learning

May 30, 2025

Mihir Bhaskar, Jun Tao Luo, Zihan Geng, Asmita Hajra, Junia Howell, Matthew R. Gormley

Abstract:Despite well-documented consequences of the U.S. government's 1930s housing policies on racial wealth disparities, scholars have struggled to quantify its precise financial effects due to the inaccessibility of historical property appraisal records. Many counties still store these records in physical formats, making large-scale quantitative analysis difficult. We present an approach scholars can use to digitize historical housing assessment data, applying it to build and release a dataset for one county. Starting from publicly available scanned documents, we manually annotated property cards for over 12,000 properties to train and validate our methods. We use OCR to label data for an additional 50,000 properties, based on our two-stage approach combining classical computer vision techniques with deep learning-based OCR. For cases where OCR cannot be applied, such as when scanned documents are not available, we show how a regression model based on building feature data can estimate the historical values, and test the generalizability of this model to other counties. With these cost-effective tools, scholars, community activists, and policy makers can better analyze and understand the historical impacts of redlining.

* Accepted to COMPASS 2025

Via

Access Paper or Ask Questions

A Taxonomy for Data Contamination in Large Language Models

Jul 11, 2024

Medha Palavalli, Amanda Bertsch, Matthew R. Gormley

Abstract:Large language models pretrained on extensive web corpora demonstrate remarkable performance across a wide range of downstream tasks. However, a growing concern is data contamination, where evaluation datasets may be contained in the pretraining corpus, inflating model performance. Decontamination, the process of detecting and removing such data, is a potential solution; yet these contaminants may originate from altered versions of the test set, evading detection during decontamination. How different types of contamination impact the performance of language models on downstream tasks is not fully understood. We present a taxonomy that categorizes the various types of contamination encountered by LLMs during the pretraining phase and identify which types pose the highest risk. We analyze the impact of contamination on two key NLP tasks -- summarization and question answering -- revealing how different types of contamination influence task performance during evaluation.

* 19 pages, 8 figures, accepted to CONDA Workshop on Data Contamination @ ACL 2024

Via

Access Paper or Ask Questions

In-Context Learning with Long-Context Models: An In-Depth Exploration

Apr 30, 2024

Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R. Gormley, Graham Neubig

Abstract:As model context lengths continue to increase, the number of demonstrations that can be provided in-context approaches the size of entire training datasets. We study the behavior of in-context learning (ICL) at this extreme scale on multiple datasets and models. We show that, for many datasets with large label spaces, performance continues to increase with hundreds or thousands of demonstrations. We contrast this with example retrieval and finetuning: example retrieval shows excellent performance at low context lengths but has diminished gains with more demonstrations; finetuning is more data hungry than ICL but can sometimes exceed long-context ICL performance with additional data. We use this ICL setting as a testbed to study several properties of both in-context learning and long-context models. We show that long-context ICL is less sensitive to random input shuffling than short-context ICL, that grouping of same-label examples can negatively impact performance, and that the performance boosts we see do not arise from cumulative gain from encoding many examples together. We conclude that although long-context ICL can be surprisingly effective, most of this gain comes from attending back to similar examples rather than task learning.

* 27 pages; preprint

Via

Access Paper or Ask Questions

Learning Mutually Informed Representations for Characters and Subwords

Nov 14, 2023

Yilin Wang, Xinyi Hu, Matthew R. Gormley

Figure 1 for Learning Mutually Informed Representations for Characters and Subwords

Figure 2 for Learning Mutually Informed Representations for Characters and Subwords

Figure 3 for Learning Mutually Informed Representations for Characters and Subwords

Figure 4 for Learning Mutually Informed Representations for Characters and Subwords

Abstract:Most pretrained language models rely on subword tokenization, which processes text as a sequence of subword tokens. However, different granularities of text, such as characters, subwords, and words, can contain different kinds of information. Previous studies have shown that incorporating multiple input granularities improves model generalization, yet very few of them outputs useful representations for each granularity. In this paper, we introduce the entanglement model, aiming to combine character and subword language models. Inspired by vision-language models, our model treats characters and subwords as separate modalities, and it generates mutually informed representations for both granularities as output. We evaluate our model on text classification, named entity recognition, and POS-tagging tasks. Notably, the entanglement model outperforms its backbone language models, particularly in the presence of noisy texts and low-resource languages. Furthermore, the entanglement model even outperforms larger pre-trained models on all English sequence labeling tasks and classification tasks. Our anonymized code is available at https://anonymous.4open.science/r/noisy-IE-A673

Via

Access Paper or Ask Questions

It's MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk

Oct 02, 2023

Amanda Bertsch, Alex Xie, Graham Neubig, Matthew R. Gormley

Abstract:Minimum Bayes Risk (MBR) decoding is a method for choosing the outputs of a machine learning system based not on the output with the highest probability, but the output with the lowest risk (expected error) among multiple candidates. It is a simple but powerful method: for an additional cost at inference time, MBR provides reliable several-point improvements across metrics for a wide variety of tasks without any additional data or training. Despite this, MBR is not frequently applied in NLP works, and knowledge of the method itself is limited. We first provide an introduction to the method and the recent literature. We show that several recent methods that do not reference MBR can be written as special cases of MBR; this reformulation provides additional theoretical justification for the performance of these methods, explaining some results that were previously only empirical. We provide theoretical and empirical results about the effectiveness of various MBR variants and make concrete recommendations for the application of MBR in NLP models, including future directions in this area.

* Under submission

Via

Access Paper or Ask Questions

MDACE: MIMIC Documents Annotated with Code Evidence

Jul 07, 2023

Hua Cheng, Rana Jafari, April Russell, Russell Klopfer, Edmond Lu, Benjamin Striner, Matthew R. Gormley

Figure 1 for MDACE: MIMIC Documents Annotated with Code Evidence

Figure 2 for MDACE: MIMIC Documents Annotated with Code Evidence

Figure 3 for MDACE: MIMIC Documents Annotated with Code Evidence

Figure 4 for MDACE: MIMIC Documents Annotated with Code Evidence

Abstract:We introduce a dataset for evidence/rationale extraction on an extreme multi-label classification task over long medical documents. One such task is Computer-Assisted Coding (CAC) which has improved significantly in recent years, thanks to advances in machine learning technologies. Yet simply predicting a set of final codes for a patient encounter is insufficient as CAC systems are required to provide supporting textual evidence to justify the billing codes. A model able to produce accurate and reliable supporting evidence for each code would be a tremendous benefit. However, a human annotated code evidence corpus is extremely difficult to create because it requires specialized knowledge. In this paper, we introduce MDACE, the first publicly available code evidence dataset, which is built on a subset of the MIMIC-III clinical records. The dataset -- annotated by professional medical coders -- consists of 302 Inpatient charts with 3,934 evidence spans and 52 Profee charts with 5,563 evidence spans. We implemented several evidence extraction methods based on the EffectiveCAN model (Liu et al., 2021) to establish baseline performance on this dataset. MDACE can be used to evaluate code evidence extraction methods for CAC systems, as well as the accuracy and interpretability of deep learning models for multi-label classification. We believe that the release of MDACE will greatly improve the understanding and application of deep learning technologies for medical coding and document classification.

Via

Access Paper or Ask Questions

SummQA at MEDIQA-Chat 2023:In-Context Learning with GPT-4 for Medical Summarization

Jun 30, 2023

Yash Mathur, Sanketh Rangreji, Raghav Kapoor, Medha Palavalli, Amanda Bertsch, Matthew R. Gormley

Figure 1 for SummQA at MEDIQA-Chat 2023:In-Context Learning with GPT-4 for Medical Summarization

Figure 2 for SummQA at MEDIQA-Chat 2023:In-Context Learning with GPT-4 for Medical Summarization

Figure 3 for SummQA at MEDIQA-Chat 2023:In-Context Learning with GPT-4 for Medical Summarization

Figure 4 for SummQA at MEDIQA-Chat 2023:In-Context Learning with GPT-4 for Medical Summarization

Abstract:Medical dialogue summarization is challenging due to the unstructured nature of medical conversations, the use of medical terminology in gold summaries, and the need to identify key information across multiple symptom sets. We present a novel system for the Dialogue2Note Medical Summarization tasks in the MEDIQA 2023 Shared Task. Our approach for section-wise summarization (Task A) is a two-stage process of selecting semantically similar dialogues and using the top-k similar dialogues as in-context examples for GPT-4. For full-note summarization (Task B), we use a similar solution with k=1. We achieved 3rd place in Task A (2nd among all teams), 4th place in Task B Division Wise Summarization (2nd among all teams), 15th place in Task A Section Header Classification (9th among all teams), and 8th place among all teams in Task B. Our results highlight the effectiveness of few-shot prompting for this task, though we also identify several weaknesses of prompting-based approaches. We compare GPT-4 performance with several finetuned baselines. We find that GPT-4 summaries are more abstractive and shorter. We make our code publicly available.

* ClinicalNLP @ ACL 2023

Via

Access Paper or Ask Questions

Unlimiformer: Long-Range Transformers with Unlimited Length Input

May 02, 2023

Amanda Bertsch, Uri Alon, Graham Neubig, Matthew R. Gormley

Abstract:Transformer-based models typically have a predefined bound to their input length, because of their need to potentially attend to every token in the input. In this work, we propose Unlimiformer: a general approach that can wrap any existing pretrained encoder-decoder transformer, and offload the attention computation across all layers to a single $k$-nearest-neighbor index; this index can be kept on either the GPU or CPU memory and queried in sub-linear time. This way, we can index extremely long input sequences, while every attention head in every decoder layer retrieves its top-$k$ keys, instead of attending to every key. We demonstrate Unlimiformers's efficacy on several long-document and multi-document summarization benchmarks, showing that it can summarize even 350k token-long inputs from the BookSum dataset, without any input truncation at test time. Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code. We make our code and models publicly available at https://github.com/abertsch72/unlimiformer .

* Preprint

Via

Access Paper or Ask Questions

Revisiting text decomposition methods for NLI-based factuality scoring of summaries

Nov 30, 2022

John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley, Thomas Schaaf

Figure 1 for Revisiting text decomposition methods for NLI-based factuality scoring of summaries

Figure 2 for Revisiting text decomposition methods for NLI-based factuality scoring of summaries

Figure 3 for Revisiting text decomposition methods for NLI-based factuality scoring of summaries

Figure 4 for Revisiting text decomposition methods for NLI-based factuality scoring of summaries

Abstract:Scoring the factuality of a generated summary involves measuring the degree to which a target text contains factual information using the input document as support. Given the similarities in the problem formulation, previous work has shown that Natural Language Inference models can be effectively repurposed to perform this task. As these models are trained to score entailment at a sentence level, several recent studies have shown that decomposing either the input document or the summary into sentences helps with factuality scoring. But is fine-grained decomposition always a winning strategy? In this paper we systematically compare different granularities of decomposition -- from document to sub-sentence level, and we show that the answer is no. Our results show that incorporating additional context can yield improvement, but that this does not necessarily apply to all datasets. We also show that small changes to previously proposed entailment-based scoring methods can result in better performance, highlighting the need for caution in model and methodology selection for downstream tasks.

* Generation, Evaluation & Metrics (GEM) Workshop 2022

Via

Access Paper or Ask Questions

AdaFocal: Calibration-aware Adaptive Focal Loss

Nov 21, 2022

Arindam Ghosh, Thomas Schaaf, Matthew R. Gormley

Figure 1 for AdaFocal: Calibration-aware Adaptive Focal Loss

Figure 2 for AdaFocal: Calibration-aware Adaptive Focal Loss

Figure 3 for AdaFocal: Calibration-aware Adaptive Focal Loss

Figure 4 for AdaFocal: Calibration-aware Adaptive Focal Loss

Abstract:Much recent work has been devoted to the problem of ensuring that a neural network's confidence scores match the true probability of being correct, i.e. the calibration problem. Of note, it was found that training with focal loss leads to better calibration than cross-entropy while achieving similar level of accuracy \cite{mukhoti2020}. This success stems from focal loss regularizing the entropy of the model's prediction (controlled by the parameter $\gamma$), thereby reining in the model's overconfidence. Further improvement is expected if $\gamma$ is selected independently for each training sample (Sample-Dependent Focal Loss (FLSD-53) \cite{mukhoti2020}). However, FLSD-53 is based on heuristics and does not generalize well. In this paper, we propose a calibration-aware adaptive focal loss called AdaFocal that utilizes the calibration properties of focal (and inverse-focal) loss and adaptively modifies $\gamma_t$ for different groups of samples based on $\gamma_{t-1}$ from the previous step and the knowledge of model's under/over-confidence on the validation set. We evaluate AdaFocal on various image recognition and one NLP task, covering a wide variety of network architectures, to confirm the improvement in calibration while achieving similar levels of accuracy. Additionally, we show that models trained with AdaFocal achieve a significant boost in out-of-distribution detection.

* Accepted to NeurIPS 2022

Via

Access Paper or Ask Questions