Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Katerina Margatina

A Study on Leveraging Search and Self-Feedback for Agent Reasoning

Feb 17, 2025

Karthikeyan K, Michelle Yuan, Elman Mansimov, Katerina Margatina, Anurag Pratik, Daniele Bonadiman, Monica Sunkara, Yi Zhang, Yassine Benajiba

Figure 1 for A Study on Leveraging Search and Self-Feedback for Agent Reasoning

Figure 2 for A Study on Leveraging Search and Self-Feedback for Agent Reasoning

Figure 3 for A Study on Leveraging Search and Self-Feedback for Agent Reasoning

Figure 4 for A Study on Leveraging Search and Self-Feedback for Agent Reasoning

Abstract:Recent works have demonstrated that incorporating search during inference can significantly improve reasoning capabilities of language agents. Some approaches may make use of the ground truth or rely on model's own generated feedback. The search algorithm uses this feedback to then produce values that will update its criterion for exploring and exploiting various reasoning paths. In this study, we investigate how search and model's self-feedback can be leveraged for reasoning tasks. First, we explore differences in ground-truth feedback and self-feedback during search for math reasoning. Second, we observe limitations in applying search techniques to more complex tasks like tool-calling and design domain-specific approaches to address these gaps. Our experiments reveal challenges related to generalization when solely relying on self-feedback during search. For search to work effectively, either access to the ground-truth is needed or feedback mechanisms need to be carefully designed for the specific task.

* Under review

Via

Access Paper or Ask Questions

The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

Apr 24, 2024

Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He(+2 more)

Figure 1 for The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

Figure 2 for The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

Figure 3 for The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

Figure 4 for The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

Abstract:Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to their contextual preferences and fine-grained feedback in 8,011 live conversations with 21 LLMs. PRISM contributes (i) wide geographic and demographic participation in human feedback data; (ii) two census-representative samples for understanding collective welfare (UK and US); and (iii) individualised feedback where every rating is linked to a detailed participant profile, thus permitting exploration of personalisation and attribution of sample artefacts. We focus on collecting conversations that centre subjective and multicultural perspectives on value-laden and controversial topics, where we expect the most interpersonal and cross-cultural disagreement. We demonstrate the usefulness of PRISM via three case studies of dialogue diversity, preference diversity, and welfare outcomes, showing that it matters which humans set alignment norms. As well as offering a rich community resource, we advocate for broader participation in AI development and a more inclusive approach to technology design.

Via

Access Paper or Ask Questions

Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?

Oct 26, 2023

Ahmed Alajrami, Katerina Margatina, Nikolaos Aletras

Figure 1 for Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?

Figure 2 for Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?

Figure 3 for Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?

Figure 4 for Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?

Abstract:Understanding how and what pre-trained language models (PLMs) learn about language is an open challenge in natural language processing. Previous work has focused on identifying whether they capture semantic and syntactic information, and how the data or the pre-training objective affects their performance. However, to the best of our knowledge, no previous work has specifically examined how information loss in input token characters affects the performance of PLMs. In this study, we address this gap by pre-training language models using small subsets of characters from individual tokens. Surprisingly, we find that pre-training even under extreme settings, i.e. using only one character of each token, the performance retention in standard NLU benchmarks and probing tasks compared to full-token models is high. For instance, a model pre-trained only on single first characters from tokens achieves performance retention of approximately $90$\% and $77$\% of the full-token model in SuperGLUE and GLUE tasks, respectively.

* To appear at EMNLP 2023

Via

Access Paper or Ask Questions

Active Learning Principles for In-Context Learning with Large Language Models

May 23, 2023

Katerina Margatina, Timo Schick, Nikolaos Aletras, Jane Dwivedi-Yu

Abstract:The remarkable advancements in large language models (LLMs) have significantly enhanced the performance in few-shot learning settings. By using only a small number of labeled examples, referred to as demonstrations, LLMs can effectively grasp the task at hand through in-context learning. However, the process of selecting appropriate demonstrations has received limited attention in prior work. This paper addresses the issue of identifying the most informative demonstrations for few-shot learning by approaching it as a pool-based Active Learning (AL) problem over a single iteration. Our objective is to investigate how AL algorithms can serve as effective demonstration selection methods for in-context learning. We compare various standard AL algorithms based on uncertainty, diversity, and similarity, and consistently observe that the latter outperforms all other methods, including random sampling. Notably, uncertainty sampling, despite its success in conventional supervised learning scenarios, performs poorly in this context. Our extensive experimentation involving a diverse range of GPT and OPT models across $24$ classification and multi-choice tasks, coupled with thorough analysis, unambiguously demonstrates that in-context example selection through AL prioritizes high-quality examples that exhibit low uncertainty and bear similarity to the test examples.

Via

Access Paper or Ask Questions

On the Limitations of Simulating Active Learning

May 21, 2023

Katerina Margatina, Nikolaos Aletras

Figure 1 for On the Limitations of Simulating Active Learning

Figure 2 for On the Limitations of Simulating Active Learning

Abstract:Active learning (AL) is a human-and-model-in-the-loop paradigm that iteratively selects informative unlabeled data for human annotation, aiming to improve over random sampling. However, performing AL experiments with human annotations on-the-fly is a laborious and expensive process, thus unrealistic for academic research. An easy fix to this impediment is to simulate AL, by treating an already labeled and publicly available dataset as the pool of unlabeled data. In this position paper, we first survey recent literature and highlight the challenges across all different steps within the AL loop. We further unveil neglected caveats in the experimental setup that can significantly affect the quality of AL research. We continue with an exploration of how the simulation setting can govern empirical findings, arguing that it might be one of the answers behind the ever posed question ``why do active learning algorithms sometimes fail to outperform random sampling?''. We argue that evaluating AL algorithms on available labeled datasets might provide a lower bound as to their effectiveness in real data. We believe it is essential to collectively shape the best practices for AL research, particularly as engineering advancements in LLMs push the research focus towards data-driven approaches (e.g., data efficiency, alignment, fairness). In light of this, we have developed guidelines for future work. Our aim is to draw attention to these limitations within the community, in the hope of finding ways to address them.

* To appear at Findings of ACL 2023

Via

Access Paper or Ask Questions

Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views

Feb 23, 2023

Katerina Margatina, Shuai Wang, Yogarshi Vyas, Neha Anna John, Yassine Benajiba, Miguel Ballesteros

Figure 1 for Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views

Figure 2 for Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views

Figure 3 for Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views

Figure 4 for Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views

Abstract:Temporal concept drift refers to the problem of data changing over time. In NLP, that would entail that language (e.g. new expressions, meaning shifts) and factual knowledge (e.g. new concepts, updated facts) evolve over time. Focusing on the latter, we benchmark $11$ pretrained masked language models (MLMs) on a series of tests designed to evaluate the effect of temporal concept drift, as it is crucial that widely used language models remain up-to-date with the ever-evolving factual updates of the real world. Specifically, we provide a holistic framework that (1) dynamically creates temporal test sets of any time granularity (e.g. month, quarter, year) of factual data from Wikidata, (2) constructs fine-grained splits of tests (e.g. updated, new, unchanged facts) to ensure comprehensive analysis, and (3) evaluates MLMs in three distinct ways (single-token probing, multi-token generation, MLM scoring). In contrast to prior work, our framework aims to unveil how robust an MLM is over time and thus to provide a signal in case it has become outdated, by leveraging multiple views of evaluation.

* To appear at EACL 2023. Our code will be available at https://github.com/amazon-science/temporal-robustness

Via

Access Paper or Ask Questions

Investigating Multi-source Active Learning for Natural Language Inference

Feb 14, 2023

Ard Snijders, Douwe Kiela, Katerina Margatina

Figure 1 for Investigating Multi-source Active Learning for Natural Language Inference

Figure 2 for Investigating Multi-source Active Learning for Natural Language Inference

Figure 3 for Investigating Multi-source Active Learning for Natural Language Inference

Figure 4 for Investigating Multi-source Active Learning for Natural Language Inference

Abstract:In recent years, active learning has been successfully applied to an array of NLP tasks. However, prior work often assumes that training and test data are drawn from the same distribution. This is problematic, as in real-life settings data may stem from several sources of varying relevance and quality. We show that four popular active learning schemes fail to outperform random selection when applied to unlabelled pools comprised of multiple data sources on the task of natural language inference. We reveal that uncertainty-based strategies perform poorly due to the acquisition of collective outliers, i.e., hard-to-learn instances that hamper learning and generalization. When outliers are removed, strategies are found to recover and outperform random baselines. In further analysis, we find that collective outliers vary in form between sources, and show that hard-to-learn data is not always categorically harmful. Lastly, we leverage dataset cartography to introduce difficulty-stratified testing and find that different strategies are affected differently by example learnability and difficulty.

* 23 pages. Accepted for publication at the European Chapter of the Association of Computational Linguistics (EACL) 2023

Via

Access Paper or Ask Questions

Challenges and Strategies in Cross-Cultural NLP

Mar 18, 2022

Daniel Hershcovich, Stella Frank, Heather Lent, Miryam de Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui(+4 more)

Figure 1 for Challenges and Strategies in Cross-Cultural NLP

Abstract:Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers and the content they produce and require, vary not just by language, but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these differences in order to better serve users of NLP systems. We propose a principled framework to frame these efforts, and survey existing and potential strategies.

* ACL 2022 - Theme track

Via

Access Paper or Ask Questions

Active Learning by Acquiring Contrastive Examples

Sep 08, 2021

Katerina Margatina, Giorgos Vernikos, Loïc Barrault, Nikolaos Aletras

Figure 1 for Active Learning by Acquiring Contrastive Examples

Figure 2 for Active Learning by Acquiring Contrastive Examples

Figure 3 for Active Learning by Acquiring Contrastive Examples

Figure 4 for Active Learning by Acquiring Contrastive Examples

Abstract:Common acquisition functions for active learning use either uncertainty or diversity sampling, aiming to select difficult and diverse data points from the pool of unlabeled data, respectively. In this work, leveraging the best of both worlds, we propose an acquisition function that opts for selecting \textit{contrastive examples}, i.e. data points that are similar in the model feature space and yet the model outputs maximally different predictive likelihoods. We compare our approach, CAL (Contrastive Active Learning), with a diverse set of acquisition functions in four natural language understanding tasks and seven datasets. Our experiments show that CAL performs consistently better or equal than the best performing baseline across all tasks, on both in-domain and out-of-domain data. We also conduct an extensive ablation study of our method and we further analyze all actively acquired datasets showing that CAL achieves a better trade-off between uncertainty and diversity compared to other strategies.

* Accepted at EMNLP 2021

Via

Access Paper or Ask Questions

Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Sep 04, 2021

Atsuki Yamaguchi, George Chrysostomou, Katerina Margatina, Nikolaos Aletras

Figure 1 for Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Figure 2 for Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Figure 3 for Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Figure 4 for Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Abstract:Masked language modeling (MLM), a self-supervised pretraining objective, is widely used in natural language processing for learning text representations. MLM trains a model to predict a random sample of input tokens that have been replaced by a [MASK] placeholder in a multi-class setting over the entire vocabulary. When pretraining, it is common to use alongside MLM other auxiliary objectives on the token or sequence level to improve downstream performance (e.g. next sentence prediction). However, no previous work so far has attempted in examining whether other simpler linguistically intuitive or not objectives can be used standalone as main pretraining objectives. In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements of MLM. Empirical results on GLUE and SQuAD show that our proposed methods achieve comparable or better performance to MLM using a BERT-BASE architecture. We further validate our methods using smaller models, showing that pretraining a model with 41% of the BERT-BASE's parameters, BERT-MEDIUM results in only a 1% drop in GLUE scores with our best objective.

* Accepted at EMNLP 2021

Via

Access Paper or Ask Questions