Joel Tetreault

Dataminr Inc., New York, NY

Characterizing Mamba's Selective Memory using Auto-Encoders

Dec 17, 2025

Tamanna Hossain, Robert L. Logan, Ganesh Jagadeesan, Sameer Singh, Joel Tetreault, Alejandro Jaimes

Abstract: State space models (SSMs) are a promising alternative to transformers for language modeling because they use fixed memory during inference. However, this fixed memory usage requires some information loss in the hidden state when processing long sequences. While prior work has studied the sequence length at which this information loss occurs, it does not characterize the types of information SSM language models (LMs) tend to forget. In this paper, we address this knowledge gap by identifying the types of tokens (e.g., parts of speech, named entities) and sequences (e.g., code, math problems) that are more frequently forgotten by SSM LMs. We achieve this by training an auto-encoder to reconstruct sequences from the SSM's hidden state, and measure information loss by comparing inputs with their reconstructions. We perform experiments using the Mamba family of SSM LMs (130M to 1.4B) on sequences ranging from 4 to 256 tokens. Our results show significantly higher rates of information loss on math-related tokens (e.g., numbers, variables), mentions of organization entities, and alternative dialects to Standard American English. We then examine the frequency with which these tokens appear in Mamba's pretraining data and find that less prevalent tokens tend to be the ones Mamba is most likely to forget. By identifying these patterns, our work provides clear direction for future research to develop methods that better control Mamba's ability to retain important information.

* AACL 2025. Oral Presentation

CEHA: A Dataset of Conflict Events in the Horn of Africa

Dec 18, 2024

Rui Bai, Di Lu, Shihao Ran, Elizabeth Olson, Hemank Lamba, Aoife Cahill, Joel Tetreault, Alex Jaimes

Abstract: Natural Language Processing (NLP) of news articles can play an important role in understanding the dynamics and causes of violent conflict. Despite the availability of datasets categorizing various conflict events, the existing labels often do not cover all of the fine-grained violent conflict event types relevant to areas like the Horn of Africa. In this paper, we introduce a new benchmark dataset, Conflict Events in the Horn of Africa region (CEHA), and propose a new task for identifying violent conflict events using online resources with this dataset. The dataset consists of 500 English event descriptions regarding conflict events in the Horn of Africa region with fine-grained event-type definitions that emphasize the cause of the conflict. This dataset categorizes the key types of conflict risk according to specific areas required by stakeholders in the Humanitarian-Peace-Development Nexus. Additionally, we conduct extensive experiments on two tasks supported by this dataset: Event-relevance Classification and Event-type Classification. Our baseline models demonstrate the challenging nature of these tasks and the usefulness of our dataset for model evaluations in low-resource settings with a limited amount of training data.

* Accepted by COLING 2025

HumVI: A Multilingual Dataset for Detecting Violent Incidents Impacting Humanitarian Aid

Oct 08, 2024

Hemank Lamba, Anton Abilov, Ke Zhang, Elizabeth M. Olson, Henry K. Dambanemuya, João C. Bárcia, David S. Batista, Christina Wille, Aoife Cahill, Joel Tetreault (+1 more)

Abstract: Humanitarian organizations can enhance their effectiveness by analyzing data to discover trends, gather aggregated insights, manage their security risks, support decision-making, and inform advocacy and funding proposals. However, data about violent incidents with direct impact and relevance for humanitarian aid operations is not readily available. An automatic data collection and NLP-backed classification framework aligned with humanitarian perspectives can help bridge this gap. In this paper, we present HumVI, a dataset comprising news articles in three languages (English, French, Arabic) containing instances of different types of violent incidents categorized by the humanitarian sector they impact, e.g., aid security, education, food security, health, and protection. Reliable labels were obtained for the dataset by partnering with a data-backed humanitarian organization, Insecurity Insight. We provide multiple benchmarks for the dataset, employing various deep learning architectures and techniques, including data augmentation and mask loss, to address different task-related challenges, e.g., domain expansion. The dataset is publicly available at https://github.com/dataminr-ai/humvi-dataset.

Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from Large Language Models

Feb 19, 2024

Puxuan Yu, Daniel Cohen, Hemank Lamba, Joel Tetreault, Alex Jaimes

Abstract: The process of scale calibration in ranking systems involves adjusting the outputs of rankers to correspond with significant qualities like click-through rates or relevance, crucial for mirroring real-world value and thereby boosting the system's effectiveness and reliability. Although there has been research on calibrated ranking losses within learning-to-rank models, the particular issue of adjusting the scale for neural rankers, which excel in handling textual information, has not been thoroughly examined. Neural ranking models are adept at processing text data, yet the application of existing scale calibration techniques to these models poses significant challenges due to their complexity and the intensive training they require, often resulting in suboptimal outcomes. This study delves into the potential of large language models (LLMs) to provide uncertainty measurements for a query and document pair that correlate with the scale-calibrated scores. By employing Monte Carlo sampling to gauge relevance probabilities from LLMs and incorporating natural language explanations (NLEs) to articulate this uncertainty, we carry out comprehensive tests on two major document ranking datasets. Our findings reveal that the approach leveraging NLEs outperforms existing calibration methods under various training scenarios, leading to better calibrated neural rankers.

Dissecting users' needs for search result explanations

Jan 29, 2024

Prerna Juneja, Wenjuan Zhang, Alison Marie Smith-Renner, Hemank Lamba, Joel Tetreault, Alex Jaimes

Abstract: There is a growing demand for transparency in search engines to understand how search results are curated and to enhance users' trust. Prior research has introduced search result explanations with a focus on how to explain, assuming explanations are beneficial. Our study takes a step back to examine if search explanations are needed and when they are likely to provide benefits. Additionally, we summarize key characteristics of helpful explanations and share users' perspectives on explanation features provided by Google and Bing. Interviews with non-technical individuals reveal that users do not always seek or understand search explanations and mostly desire them for complex and critical tasks. They find Google's search explanations too obvious but appreciate the ability to contest search results. Based on our findings, we offer design recommendations for search engines and explanations to help users better evaluate search results and enhance their search experience.

Little Giants: Exploring the Potential of Small LLMs as Evaluation Metrics in Summarization in the Eval4NLP 2023 Shared Task

Nov 01, 2023

Neema Kotonya, Saran Krishnasamy, Joel Tetreault, Alejandro Jaimes

Abstract: This paper describes and analyzes our participation in the 2023 Eval4NLP shared task, which focuses on assessing the effectiveness of prompt-based techniques to empower Large Language Models to handle the task of quality estimation, particularly in the context of evaluating machine translations and summaries. We conducted systematic experiments with various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting. In addition, we integrated these approaches with zero-shot and one-shot learning methods to maximize the efficacy of our evaluation procedures. Our work reveals that combining these approaches using a "small", open source model (orca_mini_v3_7B) yields competitive results.

* Eval4NLP 2023 Shared Task

Event Extraction as Question Generation and Answering

Jul 10, 2023

Di Lu, Shihao Ran, Joel Tetreault, Alejandro Jaimes

Abstract: Recent work on Event Extraction has reframed the task as Question Answering (QA), with promising results. The advantage of this approach is that it addresses the error propagation issue found in traditional token-based classification approaches by directly predicting event arguments without extracting candidates first. However, the questions are typically based on fixed templates and they rarely leverage contextual information such as relevant arguments. In addition, prior QA-based approaches have difficulty handling cases where there are multiple arguments for the same role. In this paper, we propose QGA-EE, which enables a Question Generation (QG) model to generate questions that incorporate rich contextual information instead of using fixed templates. We also propose dynamic templates to assist the training of the QG model. Experiments show that QGA-EE outperforms all prior single-task-based models on the ACE05 English dataset.

* Accepted to ACL 2023

A New Task and Dataset on Detecting Attacks on Human Rights Defenders

Jun 30, 2023

Shihao Ran, Di Lu, Joel Tetreault, Aoife Cahill, Alejandro Jaimes

Abstract: The ability to conduct retrospective analyses of attacks on human rights defenders over time and by location is important for humanitarian organizations to better understand historical or ongoing human rights violations and thus better manage the global impact of such events. We hypothesize that NLP can support such efforts by quickly processing large collections of news articles to detect and summarize the characteristics of attacks on human rights defenders. To that end, we propose a new dataset for detecting Attacks on Human Rights Defenders (HRDsAttack) consisting of crowdsourced annotations on 500 online news articles. The annotations include fine-grained information about the type and location of the attacks, as well as information about the victim(s). We demonstrate the usefulness of the dataset by using it to train and evaluate baseline models on several sub-tasks to predict the annotated characteristics.

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

May 02, 2023

Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Jackie Cheung, Mark Cieliebak, Elizabeth Clark, Kees van Deemter (+29 more)

Abstract: We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction were discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

* 5 pages plus appendix, 4 tables, 1 figure. To appear at "Workshop on Insights from Negative Results in NLP" (co-located with EACL2023)

Counterfactual Editing for Search Result Explanation

Jan 25, 2023

Zhichao Xu, Hemank Lamba, Qingyao Ai, Joel Tetreault, Alex Jaimes

Abstract: Recent substantial improvements in neural retrieval methods also bring to light the inherent black-box nature of these methods, especially when viewed from an explainability perspective. Most existing work on Search Result Explanation (SeRE) is designed to provide factual explanations, i.e., to find or generate supporting evidence about documents' relevance to search queries. However, research in the cognitive sciences has shown that human explanations are contrastive, i.e., people explain an observed event using some counterfactual events; such explanations reduce cognitive load and provide actionable insights. Though already proven effective in the machine learning and NLP communities, the formulation and impact of counterfactual explanations have not been well studied for search systems. In this work, we aim to investigate the effectiveness of this perspective by proposing and evaluating counterfactual explanations for the task of SeRE. Specifically, we first conduct a user study where we investigate whether counterfactual explanations indeed improve search sessions' effectiveness. Taking this as motivation, we discuss the desiderata that an ideal counterfactual explanation method for SeRE should adhere to. Next, we propose a method, CFE² (CounterFactual Explanation with Editing), to provide pairwise explanations for search engine result pages. Finally, we show that the proposed method, when evaluated on four publicly available datasets, outperforms baselines on both metrics and human evaluation.

* work in progress
